path: root/processing/chapter.tex
author Blaise Thompson <blaise@untzag.com> 2018-03-24 13:48:26 -0500
committer Blaise Thompson <blaise@untzag.com> 2018-03-24 13:48:26 -0500
commit 7d602505a8b84d6c3743dd3cc0c9ac0a421f07b2
tree 24814ac13250f8ef0837d75e9488c889ad386c82 /processing/chapter.tex
parent c87c0a95649ed795ae15f343b5a7ce98645e4dc5
2018-03-24 13:48
Diffstat (limited to 'processing/chapter.tex')
-rw-r--r-- processing/chapter.tex 92
1 files changed, 80 insertions, 12 deletions
diff --git a/processing/chapter.tex b/processing/chapter.tex
index 72d1d27..fa449fe 100644
--- a/processing/chapter.tex
+++ b/processing/chapter.tex
@@ -168,9 +168,10 @@ Channels and variables also support direct indexing / slicing using \python{__ge
discussed more in... % TODO: where is it discussed more?
Axes are ways to organize data as functions of particular variables (and combinations thereof). %
-The \python{Axis} class does not directly contain the respective arrays---it refers to the
+The \python{Axis} class does not directly contain the respective arrays---it merely refers to the
associated variables. %
The flexibility of this association is one of the main new features in WrightTools 3. %
+It enables data transformation, discussed in section ... % TODO: link to section
Axis expressions are simple human-friendly strings made up of numbers and variable
\python{natural_name}s. %
Given 5 variables with names \python{['w1', 'w2', 'wm', 'd1', 'd2']}, example valid expressions
@@ -229,21 +230,64 @@ a parameter-free from-function. %
There are two problems:
\begin{ditemize}
\item Dimensionality limitation to individual files (1D for KENT, 2D for COLORS).
- \item Lack of self-describing metadata.
+ \item Lack of self-describing metadata (headers).
\end{ditemize}
The way that WrightTools handles data creation for these file-types deserves special discussion. %
-Firstly, WrightTools contains hardcoded column information for each filetype...
-For COLORS... % TODO
-
-Secondly, WrightTools accepts a list of files which it stacks together to form a single large
-array. %
-
-Finally, the \python{wt.kit.discover_dimensions} function is called. %
-This function does its best to recognize the parameters of the original scan... % TODO
+Firstly, WrightTools contains hardcoded column information for each filetype.
+Data from Kent Meyer's ``picosecond control'' software had consistent columns over the lifetime of
+the software, so only one dictionary is needed to store these correspondences. %
+Skye Kain's ``COLORS'' software used at least 7 different formats, and unfortunately these format
+types were not fully documented. %
+WrightTools attempts to guess the COLORS data format by counting the number of columns. %
+
+Because these file-types are dimensionality-limited, many acquisitions span multiple files. %
+COLORS offered an explicit queue manager which allowed users to repeat the same 2D scan (often a
+Wigner scan) many times at different coordinates in non-scanned dimensions. %
+ps\_control scans were done more manually. %
+To account for this problem of multiple files spanning a single acquisition, the functions
+\python{from_COLORS} and \python{from_KENT} optionally accept \emph{lists} of filepaths. %
+Inside the function, WrightTools simply appends the arrays from all given files into one long array
+with many more rows. %
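The appending step described above can be sketched with NumPy. This is a minimal illustration and not the WrightTools implementation: the helper name `stack_files` and the array shapes are hypothetical, and each file is assumed to have already been loaded as a 2D array of rows by columns.

```python
import numpy as np


def stack_files(arrays):
    """Append several columnar arrays (rows x columns) into one long array.

    Hypothetical helper for illustration only.
    """
    # Every file must share the same column layout, otherwise stacking is meaningless.
    column_counts = {arr.shape[1] for arr in arrays}
    if len(column_counts) != 1:
        raise ValueError("all files must have the same number of columns")
    # vstack keeps the columns and concatenates the rows.
    return np.vstack(arrays)


# Two stand-in "files" with 4 columns each (3 rows and 2 rows).
file_a = np.zeros((3, 4))
file_b = np.ones((2, 4))
stacked = stack_files([file_a, file_b])
print(stacked.shape)  # (5, 4)
```

The stacked array carries no record of which file each row came from, which is why the dimensionality must be recovered afterwards by inspecting the columns themselves.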
+
+The final and most challenging problem of parameter-free importing for these filetypes is
+\emph{dimensionality recognition}. %
+Because the files contain no metadata, the shape and coordinates of the original acquisition must
+be guessed by simply inspecting the columnar arrays. %
+In general, this problem can become very hard. %
+Luckily, each of these previous instrumental software packages was only used on one instrument with
+limited flexibility in acquisition type, so it is possible to make educated guesses for almost all
+acquisitions. %
+
+The function \python{wt.kit.discover_dimensions} handles the work of dimensionality recognition for
+both COLORS and ps\_control arrays. %
+This function may be used for more filetypes in the future. %
+Roughly, the function does the following:
+\begin{denumerate}
+ \item Remove dimensions containing nan(s).
+ \item Find which dimensions are equal (within tolerance), condense into single dimensions.
+ \item Find which dimensions are scanned (move beyond tolerance).
+ \item For each scanned dimension, find how many unique (outside of tolerance) points were taken.
+ \item Linearize each scanned dimension between smallest and largest unique point.
+ \item Return scanned dimension names, column indices and points.
+\end{denumerate}
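The steps listed above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual `wt.kit.discover_dimensions` code: the function name `discover_dimensions_sketch`, the single shared `tolerance`, the `(columns, rows)` array layout, and the example data are all hypothetical.

```python
import numpy as np


def discover_dimensions_sketch(arr, names, tolerance=1.0):
    """Guess the scanned dimensions of a columnar array.

    arr is shaped (columns, rows); names holds one name per column.
    Returns a list of (names, column index, points) tuples.
    """
    # 1. Remove columns containing NaN.
    keep = [i for i in range(arr.shape[0]) if not np.isnan(arr[i]).any()]
    # 2. Group columns that track each other within tolerance.
    groups = []
    for i in keep:
        for group in groups:
            if np.all(np.abs(arr[i] - arr[group[0]]) <= tolerance):
                group.append(i)
                break
        else:
            groups.append([i])
    out = []
    for group in groups:
        col = arr[group[0]]
        # 3. Skip columns that never move beyond tolerance (not scanned).
        if col.max() - col.min() <= tolerance:
            continue
        # 4. Count unique points: gaps larger than tolerance in the sorted values.
        n_points = 1 + int(np.sum(np.diff(np.sort(col)) > tolerance))
        # 5. Linearize between the smallest and largest value.
        points = np.linspace(col.min(), col.max(), n_points)
        # 6. Report names, first column index, and points.
        out.append(([names[i] for i in group], group[0], points))
    return out


# Hypothetical acquisition: w1 and w2 scanned together, d1 scanned, d2 fixed.
w1 = np.repeat([1000.0, 1100.0, 1200.0], 2)
w2 = w1 + 0.1
d1 = np.tile([0.0, 50.0], 3)
d2 = np.full(6, 10.0)
bad = np.full(6, np.nan)
arr = np.array([w1, w2, d1, d2, bad])
dims = discover_dimensions_sketch(arr, ["w1", "w2", "d1", "d2", "bad"])
for dim_names, index, points in dims:
    print(dim_names, index, points)
```

A single tolerance for every column is a simplification made here for brevity; columns with different units would generally need their own tolerances.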
+The \python{from_COLORS} and \python{from_KENT} functions then linearly interpolate each row in the
+channels onto the grid defined by \python{discover_dimensions}. %
+This interpolation uses \python{scipy.interpolate.griddata}, which in turn relies upon the C
+library Qhull. %
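The interpolation step can be illustrated with a small, self-contained call to `scipy.interpolate.griddata`. The coordinates, the channel values, and the variable names below are made up for illustration; the sketch only shows how scattered rows are mapped onto a regular grid, not how `from_COLORS` or `from_KENT` organize their inputs.

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical measured coordinates that do not sit on the target grid.
meas_w1, meas_d1 = np.meshgrid(
    [950.0, 1050.0, 1150.0, 1250.0], [-10.0, 25.0, 60.0], indexing="ij"
)
meas_w1 = meas_w1.ravel()
meas_d1 = meas_d1.ravel()
# A planar "channel", chosen because linear interpolation reproduces a plane exactly.
signal = 2.0 * meas_w1 + 3.0 * meas_d1

# Regular grid, standing in for the points found by dimensionality recognition.
w1_points = np.linspace(1000.0, 1200.0, 3)
d1_points = np.linspace(0.0, 50.0, 2)
grid_w1, grid_d1 = np.meshgrid(w1_points, d1_points, indexing="ij")

# Linearly interpolate the scattered rows onto the grid.
gridded = griddata((meas_w1, meas_d1), signal, (grid_w1, grid_d1), method="linear")
print(gridded.shape)  # (3, 2)
```

Grid points that fall outside the convex hull of the measured coordinates are filled with NaN by default, so the measured points here deliberately extend past the grid.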
+
+This strategy can be copied in the future if other non-self-describing data sources are added into
+WrightTools. %
\subsubsection{From directory}
+The \python{wt.collection.from_directory} function can be used to automatically import all of the
+data sources in an entire directory tree. %
+It returns a WrightTools collection with the same internal structure as the directory tree, but
+with WrightTools data objects in the place of raw data source files. %
+Users can configure which files are routed to which from-function. %
+
% TODO (also document on wright.tools)
\subsection{Math} % ------------------------------------------------------------------------------
@@ -252,9 +296,33 @@ Now that we know the basics of how the WrightTools \python{Data} class stores da
some data manipulation. %
Let's start with some elementary algebra. %
-\subsubsection{In place operators}
+\subsubsection{In-place operators}
+
+In Python, operators are symbols that carry out some computation. %
+Consider the following:
+\begin{codefragment}{python, label=abcdefg}
+>>> import numpy as np
+>>> a = np.array([4, 5, 6])
+>>> b = np.array([-1, -2, -3])
+>>> c = a + b
+>>> c
+array([3, 3, 3])
+\end{codefragment}
+Here, \python{a} and \python{b} are operands and \python{+} is an operator. %
+When used in this simple way, operators typically create and return a \emph{new} object in the
+computer's memory. %
+We can verify this by using Python's built-in \python{id} function on the objects created in
+\ref{abcdefg}. %
+\begin{codefragment}{python}
+>>> id(a), id(b), id(c)
+(139712529580400, 139712333712320, 139712333713040)
+\end{codefragment}
+This is usually fine, but sometimes the operands are unwieldy, large objects that take a lot of
+memory to store. %
+In other cases operators are used millions of times such that, used as above, millions of new
+arrays will be created. %
+One way to avoid these problems is to use \emph{in-place} operators. %
-Operators are... % TODO
Because the \python{Data} object is mostly stored outside of memory, it is better to do
in-place... % TODO
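The difference between an ordinary operator and its in-place counterpart can be checked with plain NumPy arrays, reusing the `id` approach from the fragment above. This sketch uses NumPy only; it makes no claim about how the WrightTools `Data` class implements its own operators.

```python
import numpy as np

a = np.array([4, 5, 6])
b = np.array([-1, -2, -3])

id_before = id(a)
a += b  # in-place: the result is written into the existing array
same_object = id(a) == id_before

a = a + b  # ordinary operator: a new array is created and the name is rebound
new_object = id(a) != id_before

print(same_object, new_object, a)  # True True [2 1 0]
```

Only the second form allocates a new array, which is the cost that in-place operators avoid.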