| -rw-r--r-- | dissertation.tex | 24 | ||||
| -rw-r--r-- | software/chapter.tex | 234 | 
2 files changed, 170 insertions, 88 deletions
| diff --git a/dissertation.tex b/dissertation.tex index 9a7f374..4aa5b1f 100644 --- a/dissertation.tex +++ b/dissertation.tex @@ -74,19 +74,19 @@ This dissertation is approved by the following members of the Final Oral Committ  \include{software/chapter}
  \part{Development} \label{prt:development}
 -%\include{processing/chapter}
 -%\include{acquisition/chapter}
 -%\include{active_correction/chapter}
 -%\include{opa/chapter}
 +\include{processing/chapter}
 +\include{acquisition/chapter}
 +\include{active_correction/chapter}
 +\include{opa/chapter}
  %\include{mixed_domain/chapter}
  \part{Applications} \label{prt:applications}
  %\include{PbSe/chapter}
  %\include{MX2/chapter}
 -% TODO: perovskites
 +% ABANDONED perovskites
  %\include{PEDOT:PSS/chapter}
 -%\include{pyrite/chapter}
 -%\include{BiVO4/chapter}
 +% ABANDONED \include{pyrite/chapter}
 +% ABANDONED \include{BiVO4/chapter}
  % TODO: SPV
  % TODO: consider cobalamin chapter
 @@ -94,12 +94,12 @@ This dissertation is approved by the following members of the Final Oral Committ  \part{Appendix} \label{prt:appendix}
  \begin{appendix}
 -%\include{public/chapter}
 -%\include{procedures/chapter}
 -%\include{hardware/chapter}
 +\include{public/chapter}
 +\include{procedures/chapter}
 +\include{hardware/chapter}
  % TODO: consider inserting WrightTools documentation as PDF
 -%\include{errata/chapter}
 -%\include{colophon/chapter}
 +\include{errata/chapter}
 +\include{colophon/chapter}
  \end{appendix}
  % post --------------------------------------------------------------------------------------------
 diff --git a/software/chapter.tex b/software/chapter.tex index 2a6cce2..758ca8f 100644 --- a/software/chapter.tex +++ b/software/chapter.tex @@ -101,9 +101,10 @@ enabling infrastructure for science and engineering research'' [CITE https://www  \section{Challenges in scientific software development}  % ========================================
  Software development ``by-and-for'' scientists poses unique challenges.  %
 +In this section, I attempt to summarize the literature about these challenges, with a focus on
 +those challenges that I have found most relevant.  %
 -\subsection{Extensibility}  % ---------------------------------------------------------------------
 -
 +\textbf{Extensibility.}  % TODO: cite
  Many traditional software development paradigms demand an upfront articulation of goals and
  requirements.  %
  This allows the developers to carefully design their software, even before a single line of code is
 @@ -120,23 +121,13 @@ of researchers and a contracted team of software engineers.  %  \end{dquote}
 -\subsection{Testing}  % ---------------------------------------------------------------------------
 -
  PrabhuPrakash2011a---lots of good stuff under ``Scientists do not rigorously test their programs''
 -\subsection{Lifetime}  % --------------------------------------------------------------------------
 -
 +\textbf{Lifetime.}
  PrabhuPrakash2011a--- subsection ``long history of software development''
 -
  Challenges with portability, and updating to ``modern standards''.
 -\subsection{Optimization}  % ----------------------------------------------------------------------
 -
 -PrabhuPrakash2011a: ``scientists do not optimize for the common case'', ``scientists are unaware of
 -parallelization paradigms''
 -
 -\subsection{Maintenance}  % -----------------------------------------------------------------------
 -
 +\textbf{Maintenance.}
  Scientific software, especially software maintained by graduate students, tends to be very hard to
  maintain.  %
  This problem is compounded by the long lifetime of such software, and the poorly defined
 @@ -146,50 +137,115 @@ written by generation upon generation of student.  %  Worse, software is sometimes abandoned or left untouched to become a crucial but arcane component
  of a scientific research project.  %
 +\textbf{Optimization.}
 +PrabhuPrakash2011a: ``scientists do not optimize for the common case'', ``scientists are unaware of
 +parallelization paradigms''
 +
  \section{Good-enough practices}  % ================================================================
  In their [...] perspective, ``Good enough practices in scientific computing'', (from which this
  section gets its name) [WILSON ET AL] describe a set of techniques that, in their words, ``every
  researcher can and should consider adopting''.  %
 -
 -\subsection{Write clearly and document often}  % --------------------------------------------------
 -
 -Let the computer do the work...
 -
 -Write programs for people, not computers.  %
 -
 -\subsection{Do not reinvent}  % -------------------------------------------------------------------
 -
 -Don't repeat yourself, or others (we built on top of scipy, hdf5).
 -
 -\subsection{Avoid premature optimization}  % ------------------------------------------------------
 -
 -Write first, optimize later.
 -
 -\subsection{Data formats}  % ----------------------------------------------------------------------
 -
 -% HDF5
 -
 -% SELF-DESCRIBING DATA
 -
 -% OBJECT ORIENTED PROGRAMMING
 -
 -\subsection{Collaboration and version control}  % -------------------------------------------------
 -
 -Plan for mistakes / use testing.
 -
 -Document document docuement.
 -
 -Collaborate.
 -Code review...
 -Issues...
 -Make incremental changes...
 -
 -% SOURCE CONTROL AND VERSIONING
 -
 -\subsection{Licensing and distribution}  % --------------------------------------------------------
 -
 -% LICENSING AND DISTRIBUTION
 +In this section, I attempt to very quickly summarize my personal perspective on what makes good
 +software development good---with citations to literature that supports each idea.  %
 +These practices are not, generally, ``extra work''.  %
 +In fact, many of them save massive amounts of time and effort in the long \emph{and} short run,
 +when properly applied.  %
 +
 +\textbf{Do not reinvent.} \cite{WilsonGreg2017a}  %
 +Before you sit down and implement a piece of software, stop!  %
 +First you should try hard to find a library that already has what you need.  %
 +You'll often surprise yourself with what you can find.  %
 +Search the package repository for your language, such as PyPI [CITE], MATLAB File Exchange [CITE]
 +or CRAN [CITE].  %
 +Even if there is not a full solution to your problem out there, there is almost certainly a
 +solution to some part of it.  %
 +Much better to have a dependency than a custom implementation.  %
 +Make your dependencies explicit, in machine readable ways where possible.  %
 +
 +\textbf{Do not duplicate.} \cite{WilsonGreg2017a}  %
 +If you do need to write some software, make sure that you do not duplicate code within your own
 +work.  %
 +Instead of writing the same 10 lines of code again and again with small tweaks, write a function
 +that accepts a set of arguments.  %
 +If your software package grows to contain multiple files, make those files modular.  %
 +As a general rule, once you have two classes you need multiple files.  %
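
As a minimal sketch of the "write a function that accepts a set of arguments" advice above: the function name `normalize` and its `lower`/`upper` parameters are hypothetical, chosen only for illustration, and are not taken from the software this chapter describes.

```python
def normalize(values, lower=None, upper=None):
    """Rescale values linearly so that lower maps to 0 and upper maps to 1."""
    values = list(values)
    # Default to the observed extremes when no limits are given.
    lower = min(values) if lower is None else lower
    upper = max(values) if upper is None else upper
    span = upper - lower
    if span == 0:
        raise ValueError("cannot normalize: upper and lower limits are equal")
    return [(v - lower) / span for v in values]


# One function replaces several near-identical copies of the same loop.
signal = normalize([2.0, 4.0, 6.0])
reference = normalize([10.0, 15.0], lower=0.0, upper=20.0)
```

The small tweaks that would otherwise be copied and pasted become keyword arguments, so a later bugfix only has to be made in one place.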
 +
 +\textbf{Choose good data formats.} \cite{WilsonGreg2017a}  %
 +Choose a non-proprietary format if at all possible---remember: you yourself might not have access
 +to the proprietary software in 10 years.  %
 +Choose plain text if you can.  %
 +Consider conforming to specifications, such as Tidy Data. [CITE]  %
 +If you must, use open binary formats such as HDF5.  %
 +Put as much metadata as you can into the file.  %
 +Any piece of metadata that can automatically be added by the computer is essentially free---you
 +might as well do it.  %
 +Make sure that it is clear what each piece of data means.  %
 +For tabular data, use headers.  %
 +Don't forget units.  %
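
One possible way to follow the plain-text advice above is sketched here using only the Python standard library. The layout (a single `#`-prefixed JSON line of metadata followed by a CSV header row) is an assumption made for this example, not an established specification, and the helper names `write_table` and `read_table`, the column names, and the unit labels are all hypothetical.

```python
import csv
import datetime
import io
import json


def write_table(stream, columns, units, rows, **metadata):
    """Write rows as CSV, preceded by one '#'-prefixed JSON metadata line and a header row."""
    # Metadata the computer can supply on its own is essentially free.
    metadata.setdefault("created", datetime.datetime.now().isoformat())
    metadata["units"] = dict(zip(columns, units))
    stream.write("# " + json.dumps(metadata, sort_keys=True) + "\n")
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)


def read_table(stream):
    """Return (metadata, columns, rows) from a stream written by write_table."""
    metadata = json.loads(stream.readline().lstrip("# "))
    reader = csv.reader(stream)
    columns = next(reader)
    rows = [[float(x) for x in row] for row in reader]
    return metadata, columns, rows


# Round trip through an in-memory buffer; a real file object works the same way.
buffer = io.StringIO()
write_table(
    buffer,
    ["w1", "signal"],
    ["wn", "V"],
    [[1500, 0.12], [1510, 0.34]],
    sample="hypothetical",
)
buffer.seek(0)
metadata, columns, rows = read_table(buffer)
```

Because the units travel with the data, a reader of the file does not need to consult a separate notebook to interpret each column.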
 +
 +\textbf{Use version control.}  %
 +Version control systems allow programmers to save a software package such that they can always
 +return to that save point.  %
 +All of the files in the package are saved together.  %
 +Modern version control systems allow programmers to see exactly what has changed between each save
 +point, and since the last save point.  %
 +This is indispensable when trying to diagnose software problems.  %
 +In order to use version control as effectively as possible, try to save the package after every
 +change (feature addition, bugfix, etc).  %
 +Typically version control is coupled with uploading to a remote server, for example using git with
 +GitHub [CITE] or git.chem.wisc.edu [CITE], but version control need not be synonymous with
 +uploading and distribution.  %
 +Tools like git have a lot of fantastic features beyond simply saving [CITE], but those are beyond the
 +scope of these ``good enough'' recommendations.  %
 +Also consider defining a version for the software package as a whole.  %
 +Use semantic versioning [CITE], unless there is a strong reason not to.  %
 +If the language you are using has a convention for representing the version programmatically, such
 +as a \python{__version__} attribute in Python, comply with that convention.  %
 +
 +\textbf{Test.} \cite{WilsonGreg2017a}  %
 +As the old saying goes, ``if it's not tested, it's broken''.  %
 +If you rely on a piece of functionality in your software, consider writing a test that defines that
 +functionality.  %
 +In this way, as you make changes you can run your tests to ensure that those changes do not
 +accidentally break important functionality.  %
 +Testing sounds difficult, but it's really just about writing simple functions that use your
 +software to do something, and then raise an exception if the result is not correct.  %
 +If you add tests when you add features or fix bugs, you'll quickly find that you have a lot of
 +tests that do a good job of defining the expected behavior of your software.  %
 +Software engineers tend to be dogmatic about testing, but don't worry too much about test coverage
 +unless your project becomes very important.  %
 +Distribute test datasets, when appropriate.  %
 +Remember, your tests can serve double duty as simple minimal examples.  %
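
The description of testing above can be made concrete with a short sketch. The function under test, `fraction_above`, is hypothetical and exists only for this example; the tests are plain functions that raise an exception when the result is wrong.

```python
def fraction_above(values, threshold):
    """Return the fraction of values strictly greater than threshold."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if v > threshold) / len(values)


def test_fraction_above():
    # Two of the four values exceed the threshold.
    assert fraction_above([1, 2, 3, 4], 2) == 0.5


def test_fraction_above_empty():
    # Empty input should fail loudly rather than return a misleading number.
    try:
        fraction_above([], 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")


if __name__ == "__main__":
    test_fraction_above()
    test_fraction_above_empty()
```

Test runners such as pytest collect functions named `test_*` automatically, but the tests can also be run by calling them directly, as in the last lines.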
 +
 +\textbf{Collaborate and share.} \cite{WilsonGreg2017a}  %
 +If you are part of a team, consider sharing software and collaborating to create it.  %
 +Try using practices like code review and issue tracking, but don't feel obligated to use them if it
 +doesn't make sense for your project.  %
 +When working as part of a team, making incremental changes and using version control become even
 +more important.  %
 +Earlier we mentioned ``do not reinvent''.  %
 +The other side of that coin is ``if you make something, consider sharing it''.  %
 +Put your software on an open platform, like GitHub, and mint a DOI.  %
 +Cite your software, and ask other people who are using your software to do the same.  %
 +Choose a license early, and choose a permissive, commercially compatible license unless you (1)
 +know what you are doing and (2) plan to enforce it.  %
 +% TODO: cite 'publish your code it is good enough'
 +
 +\textbf{Write human readable code, and document it well.} \cite{WilsonGreg2017a}  %
 +Let the computer do the work, but write the program to be read by a human.  %
 +Give classes, functions, attributes and variables meaningful names.  %
 +Don't be afraid to be verbose: most programming environments have tab completion, so long names
 +are not all that hard to type.  %
 +Try to follow the recommended style for your language, but don't obsess about it.  %
 +
 +\textbf{Avoid premature optimization.} \cite{WilsonGreg2017a}
 +Don't get pulled into the trap of trying to make things perfect the first time.  %
 +Software design is typically a very iterative process, and for good reason.  %
 +Write first, and if it works, consider optimization.  %
 +If you do need to make your software faster, use profiling tools like cProfile [CITE] and SnakeVis
 +[CITE] to empirically determine what operations are taking the longest, rather than trying to guess
 +or use intuition.  %
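
A minimal sketch of the profiling workflow described above, using `cProfile` and `pstats` from the Python standard library. The functions `slow_sum` and `workload` are hypothetical stand-ins for real analysis code.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive: builds a full list before summing it."""
    return sum([i * i for i in range(n)])


def workload():
    return [slow_sum(20_000) for _ in range(20)]


# Record timing information only while the code of interest runs.
profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Sort by cumulative time and keep the five most expensive entries.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The printed table shows where the time is actually spent, which is the empirical starting point for any optimization.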
  \section{Object oriented programming}  % ----------------------------------------------------------
 @@ -223,9 +279,9 @@ class Person():          if food == self.favorite_food:
              return 'yum! my favorite'
          elif food == self.hated_food:
 -            return 'gross---no thank you'
 +            return 'gross---no thank you'
          else:
 -            return 'meh'
 +            return 'meh'
  \end{codefragment}
  Now I can make some instances of that class, and access their attributes and methods.  %
  \begin{codefragment}{python}
 @@ -264,8 +320,8 @@ I recommend The Quarks of Object-Oriented Development, by \textcite{ArmstrongDeb  \section{Hierarchical data format}  % -------------------------------------------------------------
 -One of the particularly important challenges in MR-CMDS is data storage.  %
 -MR-CMDS datasets are multi-dimensional, and the particular dimensions are different from experiment
 +One of the particularly important challenges in CMDS is data storage.  %
 +CMDS datasets are multi-dimensional, and the particular dimensions are different from experiment
  to experiment.  %
  Historically, the Wright Group has stored data as ``flattened'' arrays in plain text, where each
  column corresponds to one of the scannable hardwares or one of the sensors in the experiment.  %
 @@ -276,28 +332,54 @@ increasingly large and higher-dimensional data.  %  Hierarchical data files are an alternative strategy that scales much better with large and
  high-dimensional data.  %
 -
 -Originally, CDF \cite{TreinshLloydA1987a}.  %
 -Support ``random access to data, so that efficient access of small portions or large data files
 -would be possible''.  %
 -
 -Then, NetCDF \cite{RewRuss1990a}.
 -More portability.  %
 -Named dimensions.  %
 -Metadata.  %
 -``Hyperslab''
 -
 -FITS used by astronomy community, with a focus on backwards compatibility.  %
 -\cite{WellsDC1981a}
 +These are binary files that store the array directly, not in a flattened way.  %
 +They can contain multiple arrays, with different data types, in the same file under a well-defined
 +organizational system.  %
 +They support arbitrary metadata, integrated into the same hierarchy as the arrays, so making them
 +self-describing is trivial.  %
 +While in general plain text is preferred for its simplicity, these file-types are simply superior
 +for storing CMDS data.  %
 +
 +To this author's best knowledge, the Common Data Format (CDF) was the first general purpose
 +self-describing multidimensional array data format. \cite{TreinshLloydA1987a}  %
 +The engineers at the National Space Science Data Center (a division of NASA) created the CDF.  %
 +Using this construct, ``scientific softwares at NSSDC ... do not need specific knowledge of the
 +data with which they are working. This permits users of such systems to apply the same functions
 +to different sets of data.''
 +These are exactly the capabilities that CMDS requires.  %
 +
 +A second-order challenge in CMDS data storage is the size of the arrays.  %
 +While by no means ``big data'', CMDS data is often awkwardly large: large enough to fill up the
 +memory of an average modern laptop or desktop computer.  %
 +CDF also has a unique solution to this problem: use a block structure to allow access to parts of
 +the array without reading the entire data into memory.  %
 +
 +Slightly later, NetCDF was introduced \cite{RewRuss1990a}.  %
 +Very similar to CDF, NetCDF focused on enhancements to portability.  %
 +Certain metadata conventions were also introduced, including named dimensions.  %
 +NetCDF remains popular in the aerospace and 
 +
 +The Flexible Image Transport System (FITS) is a similar format with a focus on visualization and
 +backwards compatibility. \cite{WellsDC1981a}  %
  % CITE https://fits.gsfc.nasa.gov/
  % CONSIDER CITING https://fits.gsfc.nasa.gov/rfc4047.txt
 +FITS is still popular in the astronomy community.  %
 -
 -I have chosen to build off of HDF5.  %
 +Today, these hierarchical data formats have gathered under the umbrella of the HDF5 format, built
 +and maintained by the HDF Group. [CITE]  %
 +This format has all of the advantages of FITS, CDF, and NetCDF.  %
 +It can support arbitrary datatypes and is optimized to quickly process large and complex
 +datasets.  %
 +In Python, HDF5 is supported primarily through the h5py package. [CITE]  %
  \section{Scientific Python}  % --------------------------------------------------------------------
 -Numpy, SciPy
 -
 -% TODO: add MillmanKJarrod2011a (Python for Scientists and Engineers)
 -% TODO: add vanderWaltStefan2011a (The NumPy Array: A Structure for Efficient Numerical Computation)
\ No newline at end of file +SciPy is a collection of ``open-source software for mathematics, science, and engineering.''
 +\cite{MillmanKJarrod2011a}  %
 +SciPy was an absolutely essential component of this dissertation and the work it describes.  %
 +There are several packages under the SciPy umbrella.  %
 +NumPy is a very powerful and fast package for working with multidimensional arrays.
 +\cite{vanderWaltStefan2011a}  %
 +The SciPy library contains a vast number of scientific computing tools, including many mathematical
 +operations that this work depends on. [CITE]  %
 +Matplotlib is a beautiful visualization package for 1, 2, and 3D plotting. [CITE]  %
