\chapter{Processing}

% TODO: cool quote, if I can think of one

\clearpage

From a data science perspective, CMDS has several unique challenges:
\begin{ditemize}
  \item Datasets typically have more than two dimensions, complicating \textbf{representation}.
  \item Shape and dimensionality change...
  \item Data can be large (over one million points).  % TODO: contextualize large (not BIG DATA)
\end{ditemize}
I have designed a software package, WrightTools, that directly addresses these issues.  %
WrightTools is at the heart of all work in the Wright Group.  %

% TODO: more intro

WrightTools is written in Python, and endeavors to have a ``pythonic'', explicit and ``natural'' application programming interface (API).  %
To use WrightTools, simply import:
\begin{codefragment}{python}
>>> import WrightTools as wt
>>> wt.__version__
'3.0.0'
\end{codefragment}
I'll discuss how WrightTools packaging, distribution, and installation work in \autoref{sec:processing_distribution}.

We can use the built-in Python function \python{dir} to interrogate the contents of the WrightTools package.  %
\begin{codefragment}{python}
>>> dir(wt)
['Collection',
 'Data',
 '__branch__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '__wt5_version__',
 '_dataset',
 '_group',
 '_open',
 '_sys',
 'artists',
 'collection',
 'data',
 'diagrams',
 'exceptions',
 'kit',
 'open',
 'units']
\end{codefragment}
% TODO: consider adding fit to this list
Many of these are dunder (double underscore) attributes: Python internals that are not normally used directly.  %
The ten attributes that do not start with an underscore make up the public API that users of WrightTools typically use.  %
Within the public API are two classes, \python{Collection} and \python{Data}, which are the two main classes in the WrightTools object model.
%
\python{Data} stores spectra directly as multidimensional arrays, and \python{Collection} stores \textit{groups} of data objects (and other collection objects) in a hierarchical way for internal organization purposes.  %

\section{Data object model}  % ====================================================================

WrightTools uses a programming strategy called object-oriented programming (OOP).  %
% TODO: introduce HDF5
% TODO: elaborate on the concept of OOP and how it relates to WrightTools

It contains a central data ``container'' that is capable of storing all of the information about each multidimensional (or one-dimensional) spectrum: the \python{Data} class.  %
It also defines a \python{Collection} class that contains data objects, collection objects, and other pieces of metadata in a hierarchical structure.  %
Let's first discuss \python{Data}.

All spectra are stored within WrightTools as multidimensional arrays.  %
Arrays are containers that store many instances of the same data type, typically numerical datatypes.  %
These arrays have some \python{shape}, \python{size}, and \python{dtype}.  %
In the context of WrightTools, they can contain floats, integers, complex numbers and NaNs.  %

The \python{Data} class contains everything that is needed to define a single spectrum from a single experiment (or simulation).  %
To do this, each data object contains several multidimensional arrays (typically 2 to 50 arrays, depending on the kind of data).  %
There are two kinds of arrays: instances of \python{Variable} and instances of \python{Channel}.  %
Variables are coordinate arrays that define the position of each pixel in the multidimensional spectrum, and each channel holds a particular kind of signal within that spectrum.  %
Typical variables might be \python{[w1, w2, w3, d1, d2]}, and typical channels \python{[pmt, pyro1, pyro2, pyro3]}.  %

As an overview, the following list gives the attributes and methods of \python{Data} in lexicographical order.
%
\begin{ditemize}
  \item method \python{collapse}: Collapse along one dimension in a well-defined way.
  \item method \python{convert}: Convert all axes of a certain kind.
  \item method \python{create_channel}: Create a new channel.
  \item method \python{create_variable}: Create a new variable.
  \item method \python{fullpath}
  \item method \python{get_nadir}
  \item method \python{get_zenith}
  \item method \python{heal}
  \item attribute \python{kind}
  \item method \python{level}
  \item method \python{map_variable}
  \item attribute \python{natural_name}
  \item attribute \python{ndim}
  \item method \python{offset}
  \item method \python{print_tree}
  \item method \python{remove_channel}
  \item method \python{remove_variable}
  \item method \python{rename_channels}
  \item method \python{rename_variables}
  \item attribute \python{shape}
  \item method \python{share_nans}
  \item attribute \python{size}
  \item method \python{smooth}
  \item attribute \python{source}
  \item method \python{split}
  \item method \python{transform}
  \item attribute \python{units}
  \item attribute \python{variable_names}
  \item attribute \python{variables}
  \item method \python{zoom}
\end{ditemize}

Each data object contains instances of \python{Channel} and \python{Variable}, which represent the principal multidimensional arrays.  %
The following list gives the attributes and methods of these instances in lexicographical order.  %
Certain methods and attributes exist on only one of the two types, and are marked as such.
%
\begin{ditemize}
  \item method \python{argmax}
  \item method \python{argmin}
  \item method \python{chunkwise}
  \item method \python{clip}
  \item method \python{convert}
  \item attribute \python{full}
  \item attribute \python{fullpath}
  \item attribute \python{label} (variable only)
  \item method \python{log}
  \item method \python{log10}
  \item method \python{log2}
  \item method \python{mag}
  \item attribute \python{major_extent} (channel only)
  \item method \python{max}
  \item method \python{min}
  \item attribute \python{minor_extent} (channel only)
  \item attribute \python{natural_name}
  \item method \python{normalize} (channel only)
  \item attribute \python{null} (channel only)
  \item attribute \python{parent}
  \item attribute \python{points}
  \item attribute \python{signed} (channel only)
  \item method \python{slices}
  \item method \python{symmetric_root}
  \item method \python{trim} (channel only)
\end{ditemize}
Channels and variables also support direct indexing / slicing using \python{__getitem__}, as discussed more in...  % TODO: where is it discussed more?

Axes are ways to organize data as a function of particular variables (and combinations thereof).  %
The \python{Axis} class does not directly contain the respective arrays; it merely refers to the associated variables.  %
The flexibility of this association is one of the main new features in WrightTools 3.  %
It enables data transformation, discussed in section ...  % TODO: link to section

Axis expressions are simple human-friendly strings made up of numbers and variable \python{natural_name}s.  %
Given five variables with names \python{['w1', 'w2', 'wm', 'd1', 'd2']}, example valid expressions include \python{'w1'}, \python{'w1=wm'}, \python{'w1+w2'}, \python{'2*w1'}, \python{'d1-d2'}, and \python{'wm-w1+w2'}.  %
Axes can be directly indexed / sliced into using \python{__getitem__}, and they support many of the ``numpy-like'' attributes.  %
A lexicographical list of axis attributes and methods follows.
\begin{ditemize}
  \item method \python{convert}
  \item attribute \python{full}
  \item attribute \python{label}
  \item method \python{max}
  \item method \python{min}
  \item attribute \python{natural_name}
  \item attribute \python{ndim}
  \item attribute \python{points}
  \item attribute \python{shape}
  \item attribute \python{size}
  \item attribute \python{units_kind}
  \item attribute \python{variables}
\end{ditemize}

\subsection{Creating a data object}  % ------------------------------------------------------------

WrightTools data objects are capable of storing arbitrary multidimensional spectra, but how can we actually get data into WrightTools?  %
If you start with a wt5 file, the answer is easy: \python{wt.open()}.  %
But what if you have data that was written using some other software?  %
WrightTools offers data conversion functions (``from'' functions) that do the hard work of creating data objects from other files.  %
These from-functions are as parameter-free as possible, which means they recognize details like shape and units from each specific file format without manual user intervention.  %

The most important thing about from-functions is that they are extensible: more from-functions can easily be added as needed.  %
This modular approach to data creation means that individuals who want to use WrightTools for new data sources can simply add one function to unlock the capabilities of the entire package as applied to their data.  %

Following are the current from-functions, and the types of data that they support.
\begin{ditemize}
  \item Cary (collection creation)
  \item COLORS
  \item KENT
  \item PyCMDS
  \item Ocean Optics
  \item Shimadzu
  \item Tensor27
\end{ditemize}
% TODO: complete list, update wright.tools to be consistent

\subsubsection{Discover dimensions}

Certain older Wright Group file types (COLORS and KENT) are particularly difficult to import using a parameter-free from-function.
%
There are two problems:
\begin{ditemize}
  \item Individual files are limited in dimensionality (1D for KENT, 2D for COLORS).
  \item The files lack self-describing metadata (headers).
\end{ditemize}
The way that WrightTools handles data creation for these file-types deserves special discussion.  %

Firstly, WrightTools contains hardcoded column information for each filetype.
Data from Kent Meyer's ``picosecond control'' software had consistent columns over the lifetime of the software, so only one dictionary is needed to store these correspondences.  %
Skye Kain's ``COLORS'' software used at least 7 different formats, and unfortunately these format types were not fully documented.  %
WrightTools attempts to guess the COLORS data format by counting the number of columns.  %

Because these file-types are dimensionality limited, there are many acquisitions that span multiple files.  %
COLORS offered an explicit queue manager which allowed users to repeat the same 2D scan (often a Wigner scan) many times at different coordinates in non-scanned dimensions.  %
ps\_control scans were done more manually.  %
To account for this problem of multiple files spanning a single acquisition, the functions \python{from_COLORS} and \python{from_KENT} optionally accept \emph{lists} of filepaths.  %
Inside the function, WrightTools simply appends the arrays from all given files into one long array with many more rows.  %

The final and most challenging problem of parameter-free importing for these filetypes is \emph{dimensionality recognition}.  %
Because the files contain no metadata, the shape and coordinates of the original acquisition must be guessed by simply inspecting the columnar arrays.  %
In general, this problem can become very hard.  %
Luckily, each of these previous instrumental software packages was only used on one instrument with limited flexibility in acquisition type, so it is possible to make educated guesses for almost all acquisitions.
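
To make the idea of guessing from columnar arrays concrete, the following is a minimal sketch in plain Python.  %
It is not the WrightTools implementation: the function name \python{guess_scanned_columns}, the single scalar \python{tolerance}, and the example column values are all assumptions made for illustration.  %
A column is treated as scanned if its values move by more than the tolerance, and the number of distinct points is counted by grouping sorted values that lie within tolerance of each other.
\begin{codefragment}{python}
import math


def guess_scanned_columns(columns, tolerance):
    """Guess which columns were scanned, and on what grid (illustrative only).

    columns: dict mapping a column name to a list of floats, one per row.
    tolerance: values closer together than this count as the same point.
    """
    scanned = {}
    for name, values in columns.items():
        # Columns containing NaN cannot be interpreted as coordinates.
        if any(math.isnan(v) for v in values):
            continue
        lo, hi = min(values), max(values)
        # A column that never moves beyond tolerance was not scanned.
        if hi - lo <= tolerance:
            continue
        # Count distinct points: walk the sorted values and start a new
        # group whenever a value lies beyond tolerance of the group start.
        ordered = sorted(values)
        count, start = 1, ordered[0]
        for v in ordered[1:]:
            if v - start > tolerance:
                count += 1
                start = v
        # Linearize between the smallest and largest value.
        step = (hi - lo) / (count - 1)
        scanned[name] = [lo + i * step for i in range(count)]
    return scanned


# Hypothetical rows, as if appended from several files.
columns = {
    "w1": [1400.0, 1400.2, 1500.0, 1500.1, 1600.0, 1599.9],
    "w2": [1580.0, 1580.1, 1579.9, 1580.0, 1580.1, 1580.0],
    "d1": [0.0, float("nan"), 0.0, 0.0, 0.0, 0.0],
}
grids = guess_scanned_columns(columns, tolerance=1.0)
print(grids)  # {'w1': [1400.0, 1500.0, 1600.0]}
\end{codefragment}
In this toy example only \python{w1} is reported as scanned: \python{w2} never moves beyond the tolerance, and \python{d1} contains a NaN.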
%
The function \python{wt.kit.discover_dimensions} handles the work of dimensionality recognition for both COLORS and ps\_control arrays.  %
This function may be used for more filetypes in the future.  %
Roughly, the function does the following:
\begin{denumerate}
  \item Remove dimensions containing NaN(s).
  \item Find which dimensions are equal (within tolerance), and condense them into single dimensions.
  \item Find which dimensions are scanned (move beyond tolerance).
  \item For each scanned dimension, find how many unique (outside of tolerance) points were taken.
  \item Linearize each scanned dimension between its smallest and largest unique point.
  \item Return scanned dimension names, column indices and points.
\end{denumerate}
The \python{from_COLORS} and \python{from_KENT} functions then linearly interpolate each row in the channels onto the grid defined by \python{discover_dimensions}.  %
This interpolation uses \python{scipy.interpolate.griddata}, which in turn relies upon the C library Qhull.  %
This strategy can be copied in the future if other non-self-describing data sources are added to WrightTools.  %

\subsubsection{From directory}

The \python{wt.collection.from_directory} function can be used to automatically import all of the data sources in an entire directory tree.  %
It returns a WrightTools collection with the same internal structure as the directory tree, but with WrightTools data objects in the place of raw data source files.  %
Users can configure which files are routed to which from-function.  %
% TODO (also document on wright.tools)

\subsection{Math}  % ------------------------------------------------------------------------------

Now that we know the basics of how the WrightTools \python{Data} class stores data, it's time to do some data manipulation.  %
Let's start with some elementary algebra.  %

\subsubsection{In-place operators}

In Python, operators are symbols that carry out some computation.
%
Consider the following:
\begin{codefragment}{python, label=abcdefg}
>>> import numpy as np
>>> a = np.array([4, 5, 6])
>>> b = np.array([-1, -2, -3])
>>> c = a + b
>>> c
array([3, 3, 3])
\end{codefragment}
Here, \python{a} and \python{b} are operands and \python{+} is an operator.  %
When used in this simple way, operators typically create and return a \emph{new} object in the computer's memory.  %
We can verify this by using Python's built-in \python{id} function on the objects created in \ref{abcdefg}.  %
\begin{codefragment}{python}
>>> id(a), id(b), id(c)
(139712529580400, 139712333712320, 139712333713040)
\end{codefragment}
This is usually fine, but sometimes the operands are unwieldy, large objects that take a lot of memory to store.  %
In other cases operators are used millions of times such that, used as above, millions of new arrays will be created.  %
One way to avoid these problems is to use \emph{in-place} operators.  %

Because the \python{Data} object is mostly stored outside of memory, it is better to do in-place...  % TODO

Broadcasting...  % TODO

\subsubsection{Clip}

% TODO

\subsubsection{Symmetric root}

% TODO

\subsubsection{Log}

% TODO

\subsection{Dimensionality manipulation}  % -------------------------------------------------------

WrightTools offers several strategies for reducing the dimensionality of a data object.  %
Also consider using the fit sub-package.  % TODO: more info, link to section

\subsubsection{Chop}

Chop is one of the most important methods of data, although it is typically not called directly by users of WrightTools.  %
Chop takes n-dimensional data and ``chops'' it into all of its lower dimensional components.  %
Consider a 3D dataset in \python{('wm', 'w2', 'w1')}.  %
This dataset can be chopped to its component 2D \python{('wm', 'w1')} spectra.
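
The bookkeeping behind such a chop can be pictured without WrightTools at all.  %
The following is a plain-Python sketch, not the WrightTools implementation: the function name \python{chop_indices} is hypothetical, and the shape \python{(35, 11, 11)} is assumed to match the example dataset used in this section.  %
Each piece corresponds to one combination of fixed indices along the axes that are not retained, while the retained axes are kept whole.
\begin{codefragment}{python}
from itertools import product


def chop_indices(axis_names, shape, keep):
    """Return one index tuple per lower-dimensional piece (illustrative only).

    Retained axes get slice(None), meaning "keep the whole axis";
    every other axis is fixed at each of its integer positions in turn.
    """
    choices = [
        [slice(None)] if name in keep else range(length)
        for name, length in zip(axis_names, shape)
    ]
    return list(product(*choices))


axes = ("wm", "w2", "w1")
shape = (35, 11, 11)
pieces_2d = chop_indices(axes, shape, keep=("wm", "w1"))
pieces_1d = chop_indices(axes, shape, keep=("wm",))
print(len(pieces_2d), len(pieces_1d))  # 11 121
\end{codefragment}
Retaining \python{('wm', 'w1')} leaves only \python{w2} free, giving 11 pieces; retaining only \python{wm} leaves both \python{w2} and \python{w1} free, giving $11 \times 11 = 121$ pieces.  %
The WrightTools call itself follows.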
%
\begin{codefragment}{python, label=test_label}
>>> import WrightTools as wt; from WrightTools import datasets
>>> data = wt.data.from_PyCMDS(datasets.PyCMDS.wm_w2_w1_000)
data created at /tmp/lzyjg4au.wt5::/
  axes ('wm', 'w2', 'w1')
  shape (35, 11, 11)
>>> chopped = data.chop('wm', 'w1')
chopped data into 11 piece(s) in ('wm', 'w1')
>>> chopped.chop000
\end{codefragment}
\python{chopped} is a collection containing 11 data objects: \python{chop000, chop001 ... chop010}.  %
Note that, by default, the collection is made at the root level of a new tempfile.  %
An optional keyword argument \python{parent} allows users to specify the destination for this new collection.  %
These lower dimensional data objects can then be used in plotting routines, fitting routines etc.  %

By default, chop returns \emph{all} of the lower dimensional slices.  %
Considering the same data object from \autoref{test_label}, we can choose to get all 1D wm slices.  %
\begin{codefragment}{python}
>>> chopped = data.chop('wm')
chopped data into 121 piece(s) in ('wm',)
>>> chopped.chop000
\end{codefragment}

If desired, users may use the \python{at} keyword argument to specify a particular coordinate in the un-retained dimensions.  %
For example, suppose that you want to plot the data from \ref{test_label} as a wm, w1 plot at w2 = 1580 wn.  %
\begin{codefragment}{python}
>>> chopped = data.chop('wm', 'w1', at={'w2': [1580, 'wn']})[0]
chopped data into 1 piece(s) in ('wm', 'w1')
>>> chopped
>>> chopped.w2.points
array([1580.])
\end{codefragment}
Note the [0]...  % TODO

This same syntax used in artists...  % TODO

\subsubsection{Collapse}

\subsubsection{Split}

\subsubsection{Join}

\subsection{The wt5 file format}  % ---------------------------------------------------------------

Since WrightTools is based on the HDF5 file format...  % TODO

\section{Artists}  % ==============================================================================

After importing and manipulating data, one typically wants to create a plot.
%
The artists sub-package contains everything users need to plot their data objects.  %
This includes both ``quick'' artists, which generate simple plots as quickly as possible, and a full figure layout toolkit that allows users to generate publication-quality figures.  %
It also includes ``specialty'' artists which are made to perform certain popular plotting operations, as I will describe below.  %

Currently the artists sub-package is built on top of the wonderful matplotlib library.  %
In the future, other libraries (e.g. mayavi) may be incorporated.  %

\subsection{Quick}  % -----------------------------------------------------------------------------

\subsubsection{1D}

\begin{figure}
  \includegraphics[width=0.5\textwidth]{"processing/quick1D 000"}
  \includepython{"processing/quick1D.py"}
  \caption[CAPTION TODO]
  {CAPTION TODO}
\end{figure}

\subsubsection{2D}

\begin{figure}
  \includegraphics[width=0.5\textwidth]{"processing/quick2D 000"}
  \includepython{"processing/quick2D.py"}
  \caption[CAPTION TODO]
  {CAPTION TODO}
\end{figure}

\subsection{Specialty}  % -------------------------------------------------------------------------

\subsection{Artists API}  % -----------------------------------------------------------------------

The artists sub-package offers a thin wrapper around the default matplotlib object-oriented figure creation API.  %
The wrapper allows WrightTools to add the following capabilities on top of matplotlib:
\begin{ditemize}
  \item More consistent multi-axes figure layout.
  \item Ability to plot data objects directly.
\end{ditemize}
Each of these is meant to lower the barrier to plotting data.  %
Without going into every detail of matplotlib's figure generation capabilities, this section introduces the unique strategy that the WrightTools wrapper takes.
%
% TODO: finish discussion

\subsection{Colormaps}  % -------------------------------------------------------------------------

\subsection{Interpolation}  % ---------------------------------------------------------------------

\section{Fitting}  % ==============================================================================

\section{Distribution and licensing} \label{sec:processing_distribution}  % =======================

WrightTools is MIT licensed.  %

WrightTools is distributed on PyPI and conda-forge.

\section{Future directions}  % ====================================================================