2018-03-24 13:48

author: Blaise Thompson <blaise@untzag.com> 2018-03-24 13:48:26 -0500
committer: Blaise Thompson <blaise@untzag.com> 2018-03-24 13:48:26 -0500
commit: 7d602505a8b84d6c3743dd3cc0c9ac0a421f07b2 (patch)
tree: 24814ac13250f8ef0837d75e9488c889ad386c82 /processing/chapter.tex
parent: c87c0a95649ed795ae15f343b5a7ce98645e4dc5 (diff)
1 files changed, 80 insertions, 12 deletions
diff --git a/processing/chapter.tex b/processing/chapter.tex
index 72d1d27..fa449fe 100644
--- a/processing/chapter.tex
+++ b/processing/chapter.tex
@@ -168,9 +168,10 @@ Channels and variables also support direct indexing / slicing using \python{__ge
 discussed more in...  % TODO: where is it discussed more?
  
 Axes are ways to organize data as functions of particular variables (and combinations thereof).  %
-The \python{Axis} class does not directly contain the respective arrays---it refers to the
+The \python{Axis} class does not directly contain the respective arrays---it merely refers to the
 associated variables.  %
 The flexibility of this association is one of the main new features in WrightTools 3.  %
+It enables data transformation, discussed in section ...  % TODO: link to section
 Axis expressions are simple human-friendly strings made up of numbers and variable
 \python{natural_name}s.  %
 Given 5 variables with names \python{['w1', 'w2', 'wm', 'd1', 'd2']}, example valid expressions
@@ -229,21 +230,64 @@ a parameter-free from-function.  %
 There are two problems:
 \begin{ditemize}
   \item Dimensionality limitation to individual files (1D for KENT, 2D for COLORS).
-  \item Lack of self-describing metadata.
+  \item Lack of self-describing metadata (headers).
 \end{ditemize}
 The way that WrightTools handles data creation for these file-types deserves special discussion.  %
 
-Firstly, WrightTools contains hardcoded column information for each filetype...
-For COLORS...  % TODO
-
-Secondly, WrightTools accepts a list of files which it stacks together to form a single large
-array.  %
-
-Finally, the \python{wt.kit.discover_dimensions} function is called.  %
-This function does its best to recognize the parameters of the original scan...  % TODO
+Firstly, WrightTools contains hardcoded column information for each filetype.
+Data from Kent Meyer's ``picosecond control'' software had consistent columns over the lifetime of
+the software, so only one dictionary is needed to store these correspondences.  %
+Skye Kain's ``COLORS'' software used at least 7 different formats, and unfortunately these format
+types were not fully documented.  %
+WrightTools attempts to guess the COLORS data format by counting the number of columns.  %
+
+Because these file-types are limited in dimensionality, many acquisitions span multiple
+files.  %
+COLORS offered an explicit queue manager which allowed users to repeat the same 2D scan (often a
+Wigner scan) many times at different coordinates in non-scanned dimensions.  %
+ps\_control scans were done more manually.  %
+To account for this problem of multiple files spanning a single acquisition, the functions
+\python{from_COLORS} and \python{from_KENT} optionally accept \emph{lists} of filepaths.  %
+Inside the function, WrightTools simply appends the arrays from all given files into one long array
+with many more rows.  %
+
+The final and most challenging problem of parameter-free importing for these filetypes is
+\emph{dimensionality recognition}.  %
+Because the files contain no metadata, the shape and coordinates of the original acquisition must
+be guessed by simply inspecting the columnar arrays.  %
+In general, this problem can become very hard.  %
+Luckily, each of these previous instrumental software packages was only used on one instrument with
+limited flexibility in acquisition type, so it is possible to make educated guesses for almost all
+acquisitions.  %
+
+The function \python{wt.kit.discover_dimensions} handles the work of dimensionality recognition for
+both COLORS and ps\_control arrays.  %
+This function may be used for more filetypes in the future.  %
+Roughly, the function does the following:
+\begin{denumerate}
+  \item Remove dimensions containing nan(s).
+  \item Find which dimensions are equal (within tolerance) and condense them into single dimensions.
+  \item Find which dimensions are scanned (move beyond tolerance).
+  \item For each scanned dimension, find how many unique (outside of tolerance) points were taken.
+  \item Linearize each scanned dimension between smallest and largest unique point.
+  \item Return scanned dimension names, column indices and points.
+\end{denumerate}
+The \python{from_COLORS} and \python{from_KENT} functions then linearly interpolate each row in the
+channels onto the grid defined by \python{discover_dimensions}.  %
+This interpolation uses \python{scipy.interpolate.griddata}, which in turn relies upon the C
+library Qhull.  %
+
+This strategy can be copied in the future if other non-self-describing data sources are added into
+WrightTools.  %
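The six steps listed for \python{discover_dimensions} can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the WrightTools implementation: the function name \python{discover_dimensions_sketch}, its signature, the tolerance value and the example array are all assumptions made for this sketch.

```python
import numpy as np


def discover_dimensions_sketch(arr, names, tol=1e-6):
    """Guess the scanned dimensions of a (rows, columns) array.

    Hypothetical helper, NOT the real wt.kit.discover_dimensions.
    Returns one (names, column_indices, points) tuple per scanned dimension.
    """
    arr = np.asarray(arr, dtype=float)
    # 1. ignore columns that contain any NaN
    candidates = [i for i in range(arr.shape[1]) if not np.isnan(arr[:, i]).any()]
    # 2. group columns that are equal within tolerance
    groups = []
    for i in candidates:
        for group in groups:
            if np.allclose(arr[:, i], arr[:, group[0]], rtol=0, atol=tol):
                group.append(i)
                break
        else:
            groups.append([i])
    out = []
    for group in groups:
        col = arr[:, group[0]]
        # 3. a dimension counts as scanned only if it moves beyond tolerance
        if col.max() - col.min() <= tol:
            continue
        # 4. count unique points, treating neighbors within tolerance as identical
        ordered = np.sort(col)
        n_unique = 1 + int(np.sum(np.diff(ordered) > tol))
        # 5. linearize between the smallest and largest point
        points = np.linspace(ordered[0], ordered[-1], n_unique)
        # 6. report names, column indices and points
        out.append(([names[i] for i in group], group, points))
    return out


# made-up example: w1 and w2 scanned together over 3 points, d1 over 2 points,
# d2 held constant, and one column that contains NaN
names = ["w1", "w2", "d1", "d2", "bad"]
rows = [[w, w + 1e-9, d, 5.0, np.nan] for d in (0.0, 10.0) for w in (1.0, 2.0, 3.0)]
result = discover_dimensions_sketch(rows, names)
for dim_names, indices, points in result:
    print(dim_names, indices, points)
```

The grid returned by such a function is what the channel columns would then be interpolated onto. Counting unique points from sorted differences is a simplification chosen for this sketch; noisy data may call for a more careful approach.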
 
 \subsubsection{From directory}
 
+The \python{wt.collection.from_directory} function can be used to automatically import all of the
+data sources in an entire directory tree.  %
+It returns a WrightTools collection with the same internal structure as the directory tree, but
+with WrightTools data objects in the place of raw data source files.  %
+Users can configure which files are routed to which from-function.  %
+
 % TODO (also document on wright.tools)
 
 \subsection{Math}  % ------------------------------------------------------------------------------
@@ -252,9 +296,33 @@ Now that we know the basics of how the WrightTools \python{Data} class stores da
 some data manipulation.  %
 Let's start with some elementary algebra.  %
 
-\subsubsection{In place operators}
+\subsubsection{In-place operators}
+
+In Python, operators are symbols that carry out some computation.  %
+Consider the following:
+\begin{codefragment}{python, label=abcdefg}
+>>> import numpy as np
+>>> a = np.array([4, 5, 6])
+>>> b = np.array([-1, -2, -3])
+>>> c = a + b
+>>> c
+array([3, 3, 3])
+\end{codefragment}
+Here, \python{a} and \python{b} are operands and \python{+} is an operator.  %
+When used in this simple way, operators typically create and return a \emph{new} object in the
+computer's memory.  %
+We can verify this by using Python's built-in \python{id} function on the objects created in
+\ref{abcdefg}.  %
+\begin{codefragment}{python}
+>>> id(a), id(b), id(c)
+(139712529580400, 139712333712320, 139712333713040)
+\end{codefragment}
+This is usually fine, but sometimes the operands are unwieldy large objects that take a lot of
+memory to store.  %
+In other cases operators are used millions of times such that, used as above, millions of new
+arrays will be created.  %
+One way to avoid these problems is to use \emph{in-place} operators.  %
 
-Operators are...  % TODO
 Because the \python{Data} object is mostly stored outside of memory, it is better to do
 in-place... % TODO
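As a minimal sketch of the in-place behavior this patch begins to describe, the example below reuses the arrays \python{a} and \python{b} from the fragment above and checks object identity before and after each operation. It uses plain NumPy arrays only; how the WrightTools \python{Data} object exposes in-place operations is not shown here, and the variable name \python{address_before} is introduced purely for illustration.

```python
import numpy as np

a = np.array([4, 5, 6])
b = np.array([-1, -2, -3])

# the ordinary operator allocates a new array for the result
c = a + b
assert c is not a and c is not b

# the in-place operator writes the result into the array that a already names
address_before = id(a)
a += b
assert id(a) == address_before
print(a)  # [3 3 3]

# NumPy ufuncs offer the same behavior through their out argument
np.multiply(a, 2, out=a)
assert id(a) == address_before
print(a)  # [6 6 6]
```

In both in-place cases no new array is created, which matters when the operand is large or when the operation is repeated many times.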