From 7d602505a8b84d6c3743dd3cc0c9ac0a421f07b2 Mon Sep 17 00:00:00 2001
From: Blaise Thompson
Date: Sat, 24 Mar 2018 13:48:26 -0500
Subject: 2018-03-24 13:48

---
 processing/chapter.tex | 92 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 80 insertions(+), 12 deletions(-)

diff --git a/processing/chapter.tex b/processing/chapter.tex
index 72d1d27..fa449fe 100644
--- a/processing/chapter.tex
+++ b/processing/chapter.tex
@@ -168,9 +168,10 @@ Channels and variables also support direct indexing / slicing using \python{__ge
 discussed more in... % TODO: where is it discussed more?
 Axes are ways to organize data as functions of particular variables (and combinations thereof). %
-The \python{Axis} class does not directly contain the respective arrays---it refers to the
+The \python{Axis} class does not directly contain the respective arrays---it merely refers to the
 associated variables. %
 The flexibility of this association is one of the main new features in WrightTools 3. %
+It enables data transformation, discussed in section ... % TODO: link to section
 Axis expressions are simple human-friendly strings made up of numbers and variable
 \python{natural_name}s. %
 Given 5 variables with names \python{['w1', 'w2', 'wm', 'd1', 'd2']}, example valid expressions
@@ -229,21 +230,64 @@ a parameter-free from-function. %
 There are two problems:
 \begin{ditemize}
   \item Dimensionality limitation to individual files (1D for KENT, 2D for COLORS).
-  \item Lack of self-describing metadata.
+  \item Lack of self-describing metadata (headers).
 \end{ditemize}
 The way that WrightTools handles data creation for these filetypes deserves special discussion. %
-Firstly, WrightTools contains hardcoded column information for each filetype...
-For COLORS... % TODO
-
-Secondly, WrightTools accepts a list of files which it stacks together to form a single large
-array. %
-
-Finally, the \python{wt.kit.discover_dimensions} function is called. %
-This function does its best to recognize the parameters of the original scan... % TODO
+Firstly, WrightTools contains hardcoded column information for each filetype. %
+Data from Kent Meyer's ``picosecond control'' software had consistent columns over the lifetime
+of the software, so only one dictionary is needed to store these correspondences. %
+Skye Kain's ``COLORS'' software used at least 7 different column formats, and unfortunately these
+formats were not fully documented. %
+WrightTools attempts to guess the COLORS data format by counting the number of columns. %
+
+Because these filetypes are dimensionality-limited, many acquisitions span multiple files. %
+COLORS offered an explicit queue manager which allowed users to repeat the same 2D scan (often a
+Wigner scan) many times at different coordinates in the non-scanned dimensions. %
+ps\_control scans were done more manually. %
+To account for this problem of multiple files spanning a single acquisition, the functions
+\python{from_COLORS} and \python{from_KENT} optionally accept \emph{lists} of filepaths. %
+Inside the function, WrightTools simply appends the arrays from all given files into one long
+array with many more rows. %
+
+The final and most challenging problem of parameter-free importing for these filetypes is
+\emph{dimensionality recognition}. %
+Because the files contain no metadata, the shape and coordinates of the original acquisition must
+be guessed by inspecting the columnar arrays alone. %
+In general, this problem can become very hard. %
+Luckily, each of these previous instrumental software packages was only used on one instrument
+with limited flexibility in acquisition type, so it is possible to make educated guesses for
+almost all acquisitions. %
+
+The function \python{wt.kit.discover_dimensions} handles the work of dimensionality recognition
+for both COLORS and ps\_control arrays. %
+This function may be used for more filetypes in the future. %
+Roughly, the function does the following:
+\begin{denumerate}
+  \item Remove dimensions containing NaN(s).
+  \item Find which dimensions are equal (within tolerance) and condense them into single
+        dimensions.
+  \item Find which dimensions are scanned (move beyond tolerance).
+  \item For each scanned dimension, find how many unique (outside of tolerance) points were taken.
+  \item Linearize each scanned dimension between its smallest and largest unique points.
+  \item Return the scanned dimension names, column indices, and points.
+\end{denumerate}
+The \python{from_COLORS} and \python{from_KENT} functions then linearly interpolate each row in
+the channels onto the grid defined by \python{discover_dimensions}. %
+This interpolation uses \python{scipy.interpolate.griddata}, which in turn relies upon the Qhull
+computational-geometry library. %
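+The fragment below sketches the spirit of this recognize-then-interpolate strategy on a toy
+columnar array. %
+It is a deliberately simplified illustration, not the actual WrightTools implementation; the
+helper \python{guess_points} and its tolerance are invented for this example. %
+\begin{codefragment}{python}
+import numpy as np
+import scipy.interpolate
+
+# toy columnar data: a 3 x 4 grid recorded row-by-row,
+# with jitter in the recorded positions
+rng = np.random.RandomState(0)
+w1 = np.repeat([800., 850., 900.], 4) + rng.normal(0, 0.1, 12)
+d1 = np.tile([0., 50., 100., 150.], 3) + rng.normal(0, 0.1, 12)
+sig = rng.random_sample(12)
+
+def guess_points(col, tol=1.):
+    # cluster the sorted column values, breaking wherever
+    # consecutive values differ by more than tol
+    values = np.sort(col)
+    clusters = [[values[0]]]
+    for value in values[1:]:
+        if value - clusters[-1][-1] > tol:
+            clusters.append([])
+        clusters[-1].append(value)
+    centers = [np.mean(c) for c in clusters]
+    # linearize between the smallest and largest center
+    return np.linspace(centers[0], centers[-1], len(centers))
+
+w1_pts = guess_points(w1)  # approximately [800. 850. 900.]
+d1_pts = guess_points(d1)  # approximately [0. 50. 100. 150.]
+
+# interpolate the channel onto the recovered rectangular grid
+# (grid points falling outside the convex hull come back as nan)
+grid = tuple(np.meshgrid(w1_pts, d1_pts, indexing='ij'))
+arr = scipy.interpolate.griddata((w1, d1), sig, grid)
+print(arr.shape)  # (3, 4)
+\end{codefragment}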
+
+This strategy can be copied in the future if other non-self-describing data sources are added to
+WrightTools. %
 
 \subsubsection{From directory}
 
+The \python{wt.collection.from_directory} function can be used to automatically import all of the
+data sources in an entire directory tree. %
+It returns a WrightTools collection with the same internal structure as the directory tree, but
+with WrightTools data objects in place of the raw data source files. %
+Users can configure which files are routed to which from-function. %
+
 % TODO (also document on wright.tools)
 
 \subsection{Math}  % ------------------------------------------------------------------------------
 
@@ -252,9 +296,33 @@ Now that we know the basics of how the WrightTools \python{Data} class stores da
 some data manipulation. %
 Let's start with some elementary algebra. %
 
-\subsubsection{In place operators}
+\subsubsection{In-place operators}
+
+In Python, operators are symbols that carry out some computation. %
+Consider the following:
+\begin{codefragment}{python, label=abcdefg}
+>>> import numpy as np
+>>> a = np.array([4, 5, 6])
+>>> b = np.array([-1, -2, -3])
+>>> c = a + b
+>>> c
+array([3, 3, 3])
+\end{codefragment}
+Here, \python{a} and \python{b} are operands and \python{+} is an operator. %
+When used in this simple way, operators typically create and return a \emph{new} object in the
+computer's memory. %
+We can verify this by using Python's built-in \python{id} function on the objects created in
+\ref{abcdefg}. %
+\begin{codefragment}{python}
+>>> id(a), id(b), id(c)
+(139712529580400, 139712333712320, 139712333713040)
+\end{codefragment}
+This is usually fine, but sometimes the operands are large, unwieldy objects that take a lot of
+memory to store. %
+In other cases an operator is applied millions of times, so that millions of new arrays would be
+created. %
+One way to avoid these problems is to use \emph{in-place} operators. %
 
-Operators are... % TODO
 Because the \python{Data} object is mostly stored outside of memory, it is better to do
 in-place... % TODO
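+As a minimal illustration (plain \python{numpy} here, nothing WrightTools-specific), fragment
+\ref{abcdefg} can be continued with the in-place addition operator \python{+=}: %
+\begin{codefragment}{python}
+>>> a += b  # in-place: the existing array is modified, no new array is made
+>>> a
+array([3, 3, 3])
+>>> id(a)
+139712529580400
+\end{codefragment}
+Note that \python{a} keeps the identity it had in \ref{abcdefg} even though its values have
+changed. %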