Channels and variables also support direct indexing / slicing using \python{__getitem__},
discussed more in...  % TODO: where is it discussed more?
Axes are ways to organize data as functions of particular variables (and combinations thereof).  %
The \python{Axis} class does not directly contain the respective arrays---it merely refers to the
  associated variables.  %
  The flexibility of this association is one of the main new features in WrightTools 3.  %
It enables data transformation, discussed in section ...  % TODO: link to section
  Axis expressions are simple human-friendly strings made up of numbers and variable
  \python{natural_name}s.  %
Given 5 variables with names \python{['w1', 'w2', 'wm', 'd1', 'd2']}, some example valid
expressions are shown below.  %
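These are illustrative choices, assuming the usual grammar of variable names, numeric constants,
and basic arithmetic operators:
\begin{codefragment}{python}
'w1'        # a single variable
'w1+w2'     # sum of two variables
'w1+w2-wm'  # three variables combined
'2*d1'      # a variable scaled by a constant
'd2-d1'     # difference of two delays
\end{codefragment}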
Legacy filetypes like KENT and COLORS challenge the goal of a parameter-free from-function.  %
There are two problems:
  \begin{ditemize}
    \item Dimensionality limitation to individual files (1D for KENT, 2D for COLORS).
  \item Lack of self-describing metadata (headers).
  \end{ditemize}
  The way that WrightTools handles data creation for these file-types deserves special discussion.  %
Firstly, WrightTools contains hardcoded column information for each filetype.  %
Data from Kent Meyer's ``picosecond control'' software had consistent columns over the lifetime of
the software, so only one dictionary is needed to store these correspondences.  %
Skye Kain's ``COLORS'' software used at least 7 different formats, and unfortunately these format
types were not fully documented.  %
WrightTools attempts to guess the COLORS data format by counting the number of columns.  %
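A minimal sketch of that guess, with placeholder column counts (the real count-to-format table is
hardcoded in WrightTools and reflects the true COLORS layouts):
\begin{codefragment}{python}
import numpy as np

# Placeholder table: column count -> format name.  These entries are
# illustrative; they are not the actual COLORS layouts.
COLORS_FORMATS = {16: 'v0', 19: 'v1', 28: 'v2'}

def guess_colors_format(filepath):
    first_row = np.genfromtxt(filepath, max_rows=1)
    try:
        return COLORS_FORMATS[first_row.size]
    except KeyError:
        raise ValueError(f'unrecognized COLORS file: {first_row.size} columns')
\end{codefragment}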

Because these file-types are dimensionality limited, many acquisitions span multiple files.  %
COLORS offered an explicit queue manager which allowed users to repeat the same 2D scan (often a
Wigner scan) many times at different coordinates in non-scanned dimensions.  %
ps\_control scans were done more manually.  %
To account for single acquisitions spanning multiple files, the functions \python{from_COLORS} and
\python{from_KENT} optionally accept \emph{lists} of filepaths.  %
Inside the function, WrightTools simply appends the arrays from all given files into one long
array with many more rows.  %
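As a sketch, the stacking step amounts to a row-wise concatenation (assuming plain-text files
readable by \python{numpy.genfromtxt}; the real readers know each format's columns):
\begin{codefragment}{python}
import numpy as np

def stack_files(filepaths):
    # each file yields a 2D block of rows; stack the blocks vertically
    blocks = [np.genfromtxt(path) for path in filepaths]
    return np.concatenate(blocks, axis=0)
\end{codefragment}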

The final and most challenging problem of parameter-free importing for these filetypes is
\emph{dimensionality recognition}.  %
Because the files contain no metadata, the shape and coordinates of the original acquisition must
be guessed by simply inspecting the columnar arrays.  %
In general, this problem can become very hard.  %
Luckily, each of these previous instrumental software packages was only used on one instrument with
limited flexibility in acquisition type, so it is possible to make educated guesses for almost all
acquisitions.  %

The function \python{wt.kit.discover_dimensions} handles the work of dimensionality recognition for
both COLORS and ps\_control arrays.  %
This function may be used for more filetypes in the future.  %
Roughly, the function does the following:
\begin{denumerate}
  \item Remove dimensions containing NaNs.
  \item Find which dimensions are equal (within tolerance) and condense them into single
        dimensions.
  \item Find which dimensions are scanned (move beyond tolerance).
  \item For each scanned dimension, find how many unique (outside of tolerance) points were taken.
  \item Linearize each scanned dimension between the smallest and largest unique point.
  \item Return scanned dimension names, column indices, and points.
\end{denumerate}
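A rough sketch of this recipe follows, with invented names and an invented tolerance; the real
routine handles many more edge cases:
\begin{codefragment}{python}
import numpy as np

def discover_dimensions_sketch(arr, names, tolerance=1e-3):
    # arr: columnar array (rows are acquisition points); names: columns
    out = {}
    for index, name in enumerate(names):
        column = arr[:, index]
        if np.isnan(column).any():
            continue  # 1. remove dimensions containing NaNs
        # (step 2, condensing equal dimensions, omitted for brevity)
        if np.ptp(column) <= tolerance:
            continue  # 3. never moves beyond tolerance: not scanned
        # 4. unique points, binned at the tolerance
        points = np.unique(np.round(column / tolerance) * tolerance)
        # 5. linearize between smallest and largest unique point
        linear = np.linspace(points.min(), points.max(), points.size)
        out[name] = (index, linear)  # 6. name -> (column index, points)
    return out
\end{codefragment}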
The \python{from_COLORS} and \python{from_KENT} functions then linearly interpolate each row in the
channels onto the grid defined by \python{discover_dimensions}.  %
This interpolation uses \python{scipy.interpolate.griddata}, which in turn relies upon the Qhull
computational geometry library.  %
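For illustration, here is the same pattern applied to synthetic scattered data (the variable names
are invented):
\begin{codefragment}{python}
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)
d1, d2 = rng.uniform(0, 1, (2, 500))      # scattered coordinates
signal = np.sin(6 * d1) * np.cos(6 * d2)  # a measured channel
# the target grid, as discover_dimensions would define it
g1, g2 = np.meshgrid(np.linspace(0, 1, 32), np.linspace(0, 1, 32),
                     indexing='ij')
gridded = griddata((d1, d2), signal, (g1, g2), method='linear')
\end{codefragment}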

This strategy can be copied in the future if other non-self-describing data sources are added to
WrightTools.  %
  \subsubsection{From directory}
The \python{wt.collection.from_directory} function can be used to automatically import all of the
data sources in an entire directory tree.  %
It returns a WrightTools collection with the same internal structure as the directory tree, but
with WrightTools data objects in place of the raw data source files.  %
Users can configure which files are routed to which from-function.  %
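A hypothetical configuration might look like the following; the exact routing interface is
described in the WrightTools documentation:
\begin{codefragment}{python}
import WrightTools as wt

# hypothetical routing: glob pattern -> from-function (None to skip)
from_dict = {'*.data': wt.data.from_COLORS,
             '*.csv': None}
collection = wt.collection.from_directory('path/to/tree', from_dict)
\end{codefragment}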
  % TODO (also document on wright.tools)
  \subsection{Math}  % ------------------------------------------------------------------------------
Now that we know the basics of how the WrightTools \python{Data} class stores data, we can discuss
some data manipulation.  %
  Let's start with some elementary algebra.  %
\subsubsection{In-place operators}

In Python, operators are symbols that carry out some computation.  %
Consider the following:
\begin{codefragment}{python, label=abcdefg}
>>> import numpy as np
>>> a = np.array([4, 5, 6])
>>> b = np.array([-1, -2, -3])
>>> c = a + b
>>> c
array([3, 3, 3])
\end{codefragment}
Here, \python{a} and \python{b} are operands and \python{+} is an operator.  %
When used in this simple way, operators typically create and return a \emph{new} object in the
computer's memory.  %
We can verify this by using Python's built-in \python{id} function on the objects created in
\ref{abcdefg}.  %
\begin{codefragment}{python}
>>> id(a), id(b), id(c)
(139712529580400, 139712333712320, 139712333713040)
\end{codefragment}
This is usually fine, but sometimes the operands are unwieldy objects that take a lot of memory to
store.  %
In other cases an operator is applied millions of times, such that, used as above, it creates
millions of new arrays.  %
One way to avoid these problems is to use \emph{in-place} operators.  %
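Repeating the addition from \ref{abcdefg} in place, \python{a}'s existing buffer is reused, as
\python{id} confirms:
\begin{codefragment}{python}
>>> before = id(a)
>>> a += b  # in-place addition: no new array is allocated
>>> a
array([3, 3, 3])
>>> id(a) == before
True
\end{codefragment}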
  Because the \python{Data} object is mostly stored outside of memory, it is better to do
  in-place... % TODO
