\chapter{Processing}

% TODO: cool quote, if I can think of one

\clearpage

From a data science perspective, CMDS has several unique challenges:
\begin{ditemize}
  \item Dimensionality of datasets is typically greater than two, complicating
    \textbf{representation}.
  \item Shape and dimensionality change...
  \item Data can be large (over one million points).  % TODO: contextualize large (not BIG DATA)
\end{ditemize}
I have designed a software package that directly addresses these issues.  %

WrightTools is a software package at the heart of all work in the Wright Group.  %

% TODO: more intro

WrightTools is written in Python, and endeavors to have a ``pythonic'', explicit and ``natural''
application programming interface (API).  %
To use WrightTools, simply import:
\begin{codefragment}{python}
>>> import WrightTools as wt
>>> wt.__version__
3.0.0
\end{codefragment}
I discuss WrightTools packaging, distribution, and installation in more detail in
\autoref{sec:processing_distribution}.

We can use the builtin Python function \python{dir} to interrogate the contents of the
WrightTools package.  %
\begin{codefragment}{python}
>>> dir(wt)
['Collection',
 'Data',
 '__branch__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '__wt5_version__',
 '_dataset',
 '_group',
 '_open',
 '_sys',
 'artists',
 'collection',
 'data',
 'diagrams',
 'exceptions',
 'kit',
 'open',
 'units']
\end{codefragment}  % TODO: consider adding fit to this list
Many of these are dunder (double underscore) attributes---Python internals that are not normally
used directly.  %
The ten attributes that do not start with underscore are the public API that users of WrightTools
typically use.  %
Within the public API are the two main classes of the WrightTools object model:
\python{Data} and \python{Collection}.  %
\python{Data} stores spectra directly as multidimensional arrays, and
\python{Collection} stores \textit{groups} of data objects (and other collection
objects) hierarchically, for organizational purposes.  %
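As a minimal sketch of this hierarchy (the names here are hypothetical, and the creation methods
\python{create_collection} and \python{create_data} are assumed from the WrightTools API):
\begin{codefragment}{python}
>>> results = wt.Collection(name='results')
>>> neat = results.create_collection(name='neat')  # nested collection
>>> scan = neat.create_data(name='scan0')          # (empty) data object within it
>>> results.print_tree()                           # prints the hierarchy (output omitted)
\end{codefragment}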

\section{Data object model}  % ====================================================================

WrightTools uses a programming strategy called object-oriented programming (OOP).  %
% TODO: introduce HDF5
% TODO: elaborate on the concept of OOP and how it relates to WrightTools

It contains a central data ``container'' that is capable of storing all of the information about
each multidimensional (or one-dimensional) spectrum: the \python{Data} class.  %
It also defines a \python{Collection} class that contains data objects, collection
objects, and other pieces of metadata in a hierarchical structure.  %
Let's first discuss \python{Data}.  %

All spectra are stored within WrightTools as multidimensional arrays.  %
Arrays are containers that store many instances of the same data type, typically numerical
datatypes.  %
These arrays have some \python{shape}, \python{size}, and
\python{dtype}.  %
In the context of WrightTools, they can contain floats, integers, complex numbers and NaNs.  %

The \python{Data} class contains everything that is needed to define a single spectrum
from a single experiment (or simulation).  %
To do this, each data object contains several multidimensional arrays (typically 2 to 50 arrays,
depending on the kind of data).  %
There are two kinds of arrays: instances of \python{Variable} and \python{Channel}.  %
Variables are coordinate arrays that define the position of each pixel in the multidimensional
spectrum, and channels are each a particular kind of signal within that spectrum.  %
Typical variables might be \python{[w1, w2, w3, d1, d2]}, and typical channels
\python{[pmt, pyro1, pyro2, pyro3]}.  %
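For example, a data object from a hypothetical acquisition might report the following (illustrative
values; \python{channel_names} is assumed to parallel the \python{variable_names} attribute listed
below):
\begin{codefragment}{python}
>>> data.variable_names
('w1', 'w2', 'w3', 'd1', 'd2')
>>> data.channel_names
('pmt', 'pyro1', 'pyro2', 'pyro3')
\end{codefragment}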

As an overview, the following lexicographically lists the attributes and methods of
\python{Data}.  %
\begin{ditemize}
  \item method \python{collapse}: Collapse along one dimension in a well-defined way.
  \item method \python{convert}: Convert all axes of a certain kind.
  \item method \python{create_channel}: Create a new channel.
  \item method \python{create_variable}: Create a new variable.
  \item method \python{fullpath}
  \item method \python{get_nadir}
  \item method \python{get_zenith}
  \item method \python{heal}
  \item attribute \python{kind}
  \item method \python{level}
  \item method \python{map_variable}
  \item attribute \python{natural_name}
  \item attribute \python{ndim}
  \item method \python{offset}
  \item method \python{print_tree}
  \item method \python{remove_channel}
  \item method \python{remove_variable}
  \item method \python{rename_channels}
  \item method \python{rename_variables}
  \item attribute \python{shape}
  \item method \python{share_nans}
  \item attribute \python{size}
  \item method \python{smooth}
  \item attribute \python{source}
  \item method \python{split}
  \item method \python{transform}
  \item attribute \python{units}
  \item attribute \python{variable_names}
  \item attribute \python{variables}
  \item method \python{zoom}
\end{ditemize}

Each data object contains instances of \python{Channel} and \python{Variable}, which represent the
principal multidimensional arrays.  %
The following lexicographically lists the attributes of these instances.  %
Certain methods and attributes are unique to only one type of dataset, and are marked as such.  %
\begin{ditemize}
  \item method \python{argmax}
  \item method \python{argmin}
  \item method \python{chunkwise}
  \item method \python{clip}
  \item method \python{convert}
  \item attribute \python{full}
  \item attribute \python{fullpath}
  \item attribute \python{label} (variable only)
  \item method \python{log}
  \item method \python{log10}
  \item method \python{log2}
  \item method \python{mag}
  \item attribute \python{major_extent} (channel only)
  \item method \python{max}
  \item method \python{min}
  \item attribute \python{minor_extent} (channel only)
  \item attribute \python{natural_name}
  \item method \python{normalize} (channel only)
  \item attribute \python{null} (channel only)
  \item attribute \python{parent}
  \item attribute \python{points}
  \item attribute \python{signed} (channel only)
  \item method \python{slices}
  \item method \python{symmetric_root}
  \item method \python{trim} (channel only)
\end{ditemize}
Channels and variables also support direct indexing / slicing using \python{__getitem__}, as
discussed more in...  % TODO: where is it discussed more?
 
Axes are ways to organize data as functions of particular variables (and combinations thereof).  %
The \python{Axis} class does not directly contain the respective arrays---it merely refers to the
associated variables.  %
The flexibility of this association is one of the main new features in WrightTools 3.  %
It enables data transformation, discussed in section ...  % TODO: link to section
Axis expressions are simple human-friendly strings made up of numbers and variable
\python{natural_name}s.  %
Given 5 variables with names \python{['w1', 'w2', 'wm', 'd1', 'd2']}, example valid expressions
include \python{'w1'}, \python{'w1=wm'}, \python{'w1+w2'}, \python{'2*w1'}, \python{'d1-d2'}, and
\python{'wm-w1+w2'}.  %
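As a brief sketch of how such expressions are used (relying on the \python{transform} and
\python{convert} methods listed above; the particular expressions are hypothetical):
\begin{codefragment}{python}
>>> data.transform('wm-w1+w2', 'w1', 'd1')  # one expression per desired axis
>>> data.convert('eV')                      # convert all energy-like axes to eV
\end{codefragment}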
Axes can be directly indexed / sliced into using \python{__getitem__}, and they support many of the
``numpy-like'' attributes.  %
A lexicographical list of axis attributes and methods follows.
\begin{ditemize}
  \item method \python{convert}
  \item attribute \python{full}
  \item attribute \python{label}
  \item method \python{max}
  \item method \python{min}
  \item attribute \python{natural_name}
  \item attribute \python{ndim}
  \item attribute \python{points}
  \item attribute \python{shape}
  \item attribute \python{size}
  \item attribute \python{units_kind}
  \item attribute \python{variables}
\end{ditemize}

\subsection{Creating a data object}  % ------------------------------------------------------------

WrightTools data objects are capable of storing arbitrary multidimensional spectra, but how can we
actually get data into WrightTools?  %
If you start with a wt5 file, the answer is easy: \python{wt.open(<filepath>)}.  %
But what if you have data that was written using some other software?  %
WrightTools offers data conversion functions (``from'' functions) that do the hard work of creating
data objects from other files.  %
These from-functions are as parameter-free as possible, which means that they recognize details like
shape and units from each specific file format without manual user intervention.  %
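For example, a PyCMDS file can be imported with nothing more than a filepath (the path here is
hypothetical, and the verbose creation message is omitted):
\begin{codefragment}{python}
>>> import WrightTools as wt
>>> data = wt.data.from_PyCMDS('w1 w2 d2 000.data')  # shape, axes, and units read from the file
\end{codefragment}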

The most important thing about from-functions is that they are extensible: more
from-functions can easily be added as needed.  %
This modular approach to data creation means that individuals who want to use WrightTools for new
data sources can simply add one function to unlock the capabilities of the entire package as
applied to their data.  %

Following are the current from-functions, and the types of data that they support.
\begin{ditemize}
  \item Cary (collection creation)
  \item COLORS
  \item KENT
  \item PyCMDS
  \item Ocean Optics
  \item Shimadzu
  \item Tensor27
\end{ditemize}  % TODO: complete list, update wright.tools to be consistent
  
\subsubsection{Discover dimensions}

Certain older Wright Group file types (COLORS and KENT) are particularly difficult to import using
a parameter-free from-function.  %
There are two problems:
\begin{ditemize}
  \item Dimensionality limitations of individual files (1D for KENT, 2D for COLORS).
  \item Lack of self-describing metadata (headers).
\end{ditemize}
The way that WrightTools handles data creation for these file-types deserves special discussion.  %

First, WrightTools contains hardcoded column information for each filetype.  %
Data from Kent Meyer's ``picosecond control'' software had consistent columns over the lifetime of
the software, so only one dictionary is needed to store these correspondences.  %
Skye Kain's ``COLORS'' software used at least 7 different formats, and unfortunately these format
types were not fully documented.  %
WrightTools attempts to guess the COLORS data format by counting the number of columns.  %

Because these file-types are dimensionality-limited, there are many acquisitions that span
multiple files.  %
COLORS offered an explicit queue manager which allowed users to repeat the same 2D scan (often a
Wigner scan) many times at different coordinates in non-scanned dimensions.  %
ps\_control scans were done more manually.  %
To account for this problem of a single acquisition spanning multiple files, the functions
\python{from_COLORS} and \python{from_KENT} optionally accept \emph{lists} of filepaths.  %
Inside the function, WrightTools simply appends the arrays from all given files into one long array
with many more rows.  %

The final and most challenging problem of parameter-free importing for these filetypes is
\emph{dimensionality recognition}.  %
Because the files contain no metadata, the shape and coordinates of the original acquisition must
be guessed by simply inspecting the columnar arrays.  %
In general, this problem can become very hard.  %
Luckily, each of these previous instrumental software packages was only used on one instrument with
limited flexibility in acquisition type, so it is possible to make educated guesses for almost all
acquisitions.  %

The function \python{wt.kit.discover_dimensions} handles the work of dimensionality recognition for
both COLORS and ps\_control arrays.  %
This function may be used for more filetypes in the future.  %
Roughly, the function does the following:
\begin{denumerate}
  \item Remove dimensions containing nan(s).
  \item Find which dimensions are equal (within tolerance), condense into single dimensions.
  \item Find which dimensions are scanned (move beyond tolerance).
  \item For each scanned dimension, find how many unique (outside of tolerance) points were taken.
  \item Linearize each scanned dimension between smallest and largest unique point.
  \item Return scanned dimension names, column indices and points.
\end{denumerate}
The \python{from_COLORS} and \python{from_KENT} functions then linearly interpolate each row in the
channels onto the grid defined by \python{discover_dimensions}.  %
This interpolation uses \python{scipy.interpolate.griddata}, which in turn relies upon the Qhull
computational geometry library.  %
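The following standalone sketch illustrates this interpolation step with synthetic columnar data
(the column arrays and grid are hypothetical stand-ins for what \python{discover_dimensions}
returns):
\begin{codefragment}{python}
import numpy as np
from scipy.interpolate import griddata

# Synthetic columnar data: scattered (w1, w2) coordinates and one signal column.
rng = np.random.default_rng(0)
w1 = rng.uniform(6000, 8000, size=500)
w2 = rng.uniform(1200, 1800, size=500)
sig = np.exp(-((w1 - 7000) / 300) ** 2 - ((w2 - 1500) / 100) ** 2)

# Regular grid, analogous to the points returned by discover_dimensions.
w1_points = np.linspace(w1.min(), w1.max(), 51)
w2_points = np.linspace(w2.min(), w2.max(), 51)
xi = tuple(np.meshgrid(w1_points, w2_points, indexing='ij'))

# Linear interpolation of the channel onto the regular grid (Qhull performs the triangulation).
sig_grid = griddata((w1, w2), sig, xi, method='linear')
\end{codefragment}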

This strategy can be copied in the future if other non-self-describing data sources are added into
WrightTools.  %

\subsubsection{From directory}

The \python{wt.collection.from_directory} function can be used to automatically import all of the
data sources in an entire directory tree.  %
It returns a WrightTools collection with the same internal structure as the directory tree, but
with WrightTools data objects in place of the raw data source files.  %
Users can configure which files are routed to which from-function.  %
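A hedged sketch of such a call follows (the directory, the glob pattern, and the exact form of the
routing argument are assumptions):
\begin{codefragment}{python}
>>> col = wt.collection.from_directory('raw_data',
...                                    {'*.data': wt.data.from_PyCMDS})
>>> col.print_tree()  # mirrors the directory structure (output omitted)
\end{codefragment}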

% TODO (also document on wright.tools)

\subsection{Math}  % ------------------------------------------------------------------------------

Now that we know the basics of how the WrightTools \python{Data} class stores data, it's time to do
some data manipulation.  %
Let's start with some elementary algebra.  %

\subsubsection{In-place operators}

In Python, operators are symbols that carry out some computation.  %
Consider the following:
\begin{codefragment}{python, label=abcdefg}
>>> import numpy as np
>>> a = np.array([4, 5, 6])
>>> b = np.array([-1, -2, -3])
>>> c = a + b
>>> c
array([3, 3, 3])
\end{codefragment}
Here, \python{a} and \python{b} are operands and \python{+} is an operator.  %
When used in this simple way, operators typically create and return a \emph{new} object in the
computer's memory.  %
We can verify this by using Python's built-in \python{id} function on the objects created in
\autoref{abcdefg}.  %
\begin{codefragment}{python}
>>> id(a), id(b), id(c)
(139712529580400, 139712333712320, 139712333713040)
\end{codefragment}
This is usually fine, but sometimes the operands are unwieldy, large objects that take a lot of
memory to store.  %
In other cases an operation is repeated millions of times, so that, used as above, millions of new
arrays are created.  %
One way to avoid these problems is to use \emph{in-place} operators.  %
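Continuing with the arrays defined above, an in-place addition writes the result back into
\python{a}'s existing buffer rather than allocating a third array:
\begin{codefragment}{python}
>>> before = id(a)
>>> a += b           # in-place: no new array is created
>>> a
array([3, 3, 3])
>>> id(a) == before
True
\end{codefragment}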

Because the \python{Data} object is mostly stored outside of memory, it is better to do
in-place... % TODO

Broadcasting... % TODO

\subsubsection{Clip}

% TODO

\subsubsection{Symmetric root}

% TODO

\subsubsection{Log}

% TODO

\subsection{Dimensionality manipulation}  % -------------------------------------------------------

WrightTools offers several strategies for reducing the dimensionality of a data object.  %
Also consider using the fit sub-package.  % TODO: more info, link to section

\subsubsection{Chop}

Chop is one of the most important methods of the \python{Data} class, although it is typically not
called directly by users of WrightTools.  %
Chop takes n-dimensional data and ``chops'' it into all of its lower-dimensional components.  %
Consider a 3D dataset in \python{('wm', 'w2', 'w1')}.  %
This dataset can be chopped into its component 2D \python{('wm', 'w1')} spectra.  %
\begin{codefragment}{python, label=test_label}
>>> import WrightTools as wt; from WrightTools import datasets
>>> data = wt.data.from_PyCMDS(datasets.PyCMDS.wm_w2_w1_000)
data created at /tmp/lzyjg4au.wt5::/
  axes ('wm', 'w2', 'w1')
  shape (35, 11, 11)
>>> chopped = data.chop('wm', 'w1')  
chopped data into 11 piece(s) in ('wm', 'w1')
>>> chopped.chop000
<WrightTools.Data 'chop000' ('wm', 'w1') at /tmp/935c2v5a.wt5::/chop000>
\end{codefragment}
\python{chopped} is a collection containing 11 data objects: \python{chop000, chop001 ...
  chop010}.  %
Note that, by default, the collection is made at the root level of a new tempfile.  %
An optional keyword argument \python{parent} allows users to specify the destination for this new
collection.   %
These lower-dimensional data objects can then be used in plotting routines, fitting routines, etc.  %
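For example, to place the chopped pieces inside an existing collection rather than a new tempfile
(a sketch, continuing from \autoref{test_label}):
\begin{codefragment}{python}
>>> processed = wt.Collection(name='processed')
>>> chopped = data.chop('wm', 'w1', parent=processed)
chopped data into 11 piece(s) in ('wm', 'w1')
\end{codefragment}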

By default, chop returns \emph{all} of the lower dimensional slices.  %
Considering the same data object from \autoref{test_label}, we can choose to get all 1D wm
slices.  %
\begin{codefragment}{python}
>>> chopped = data.chop('wm')
chopped data into 121 piece(s) in ('wm',)
>>> chopped.chop000
<WrightTools.Data 'chop000' ('wm',) at /tmp/pqkbc0qr.wt5::/chop000>
\end{codefragment}

If desired, users may use the \python{at} keyword argument to specify a particular coordinate in
the un-retained dimensions.  %
For example, suppose that you want to plot the data from \autoref{test_label} as a wm, w1 plot at
w2 = 1580 wn.  %
\begin{codefragment}{python}
>>> chopped = data.chop('wm', 'w1', at={'w2': [1580, 'wn']})[0]
chopped data into 1 piece(s) in ('wm', 'w1')
>>> chopped
<WrightTools.Data 'chop000' ('wm', 'w1') at /tmp/_yhrdprp.wt5::/chop000>
>>> chopped.w2.points
array([1580.0])
\end{codefragment}
Note the \python{[0]}: chop always returns a collection, so we index into it to extract the single
data object within.  %
This same \python{at} syntax is also used by the artists sub-package.  %

\subsubsection{Collapse}

\subsubsection{Split}

\subsubsection{Join}

\subsection{The wt5 file format}  % ---------------------------------------------------------------

Since WrightTools is based on the hdf5 file format...  % TODO

\section{Artists}  % ==============================================================================

After importing and manipulating data, one typically wants to create a plot.  %
The artists sub-package contains everything users need to plot their data objects.  %
This includes ``quick'' artists, which generate simple plots as quickly as possible, and a full
figure-layout toolkit that allows users to generate publication-quality figures.  %
It also includes ``specialty'' artists, which perform certain popular plotting
operations, as I describe below.  %
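As a brief, hedged example (using the same dataset as \autoref{test_label}, and assuming
\python{quick2D} accepts the \python{at} syntax discussed above):
\begin{codefragment}{python}
>>> import WrightTools as wt
>>> from WrightTools import datasets
>>> data = wt.data.from_PyCMDS(datasets.PyCMDS.wm_w2_w1_000)
>>> wt.artists.quick2D(data, at={'w2': [1580, 'wn']})  # one 2D plot at the chosen w2
\end{codefragment}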

Currently the artists sub-package is built on top of the wonderful matplotlib library.  %
In the future, other libraries (e.g.\ mayavi) may be incorporated.  %

\subsection{Quick}  % -----------------------------------------------------------------------------

\subsubsection{1D}

\begin{figure}
  \includegraphics[width=0.5\textwidth]{"processing/quick1D 000"}
  \includepython{"processing/quick1D.py"}
  \caption[CAPTION TODO]
    {CAPTION TODO}
\end{figure}

\subsubsection{2D}

\begin{figure}
  \includegraphics[width=0.5\textwidth]{"processing/quick2D 000"}
  \includepython{"processing/quick2D.py"}
  \caption[CAPTION TODO]
    {CAPTION TODO}
\end{figure}

\subsection{Specialty}  % -------------------------------------------------------------------------

\subsection{Artists API}  % -----------------------------------------------------------------------

The artists sub-package offers a thin wrapper on the default matplotlib object-oriented figure
creation API.  %
The wrapper allows WrightTools to add the following capabilities on top of matplotlib:
\begin{ditemize}
  \item More consistent multi-axes figure layout.
  \item Ability to plot data objects directly.
\end{ditemize}
Each of these is meant to lower the barrier to plotting data.  %
Without going into every detail of matplotlib's figure-generation capabilities, this section
introduces the unique strategy that the WrightTools wrapper takes.  %
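To give a flavor of this strategy, the following sketch lays out a one-plot figure and plots a data
object directly; \python{create_figure}, the data-aware \python{pcolor}, and \python{plot_colorbar}
are assumptions based on the capabilities listed above, and exact signatures may differ.
\begin{codefragment}{python}
import matplotlib.pyplot as plt
import WrightTools as wt

# 'data' is a 2D data object, e.g. imported via a from-function (hypothetical here).
fig, gs = wt.artists.create_figure(width='single', cols=[1, 'cbar'])  # assumed signature
ax = plt.subplot(gs[0, 0])
ax.pcolor(data)                    # plot the data object directly (assumed capability)
cax = plt.subplot(gs[0, 1])
wt.artists.plot_colorbar(cax=cax)  # colorbar in the dedicated column (assumed)
\end{codefragment}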

% TODO: finish discussion

\subsection{Colormaps}  % -------------------------------------------------------------------------

\subsection{Interpolation}  % ---------------------------------------------------------------------

\section{Fitting}  % ==============================================================================

\section{Distribution and licensing} \label{sec:processing_distribution}  % =======================

WrightTools is MIT licensed.  %

WrightTools is distributed on PyPI and conda-forge.

\section{Future directions}  % ====================================================================