2018-04-03 12:37

author: Blaise Thompson <blaise@untzag.com> 2018-04-03 21:37:32 -0500
committer: Blaise Thompson <blaise@untzag.com> 2018-04-03 21:37:32 -0500
commit: ea0a89fccc79836bf2bcd5a9992a05c4d4918157 (patch)
tree: 858b6f2674fc82c511a37f093479a3c68046c504 /software
parent: f17a23c402dce796a0e7483d8a822eb6c874d489 (diff)
1 files changed, 158 insertions, 76 deletions
diff --git a/software/chapter.tex b/software/chapter.tex
index 2a6cce2..758ca8f 100644
--- a/software/chapter.tex
+++ b/software/chapter.tex
@@ -101,9 +101,10 @@ enabling infrastructure for science and engineering research'' [CITE https://www
 \section{Challenges in scientific software development}  % ========================================
 
 Software development ``by-and-for'' scientists poses unique challenges.  %
+In this section, I attempt to summarize the literature about these challenges, with a focus on
+those challenges that I have found most relevant.  %
 
-\subsection{Extensibility}  % ---------------------------------------------------------------------
-
+\textbf{Extensibility.}  % TODO: cite
 Many traditional software development paradigms demand an upfront articulation of goals and
 requirements.  %
 This allows the developers to carefully design their software, even before a single line of code is
@@ -120,23 +121,13 @@ of researchers and a contracted team of software engineers.  %
 
 \end{dquote}
 
-\subsection{Testing}  % ---------------------------------------------------------------------------
-
 PrabhuPrakash2011a---lots of good stuff under ``Scientists do not rigorously test their programs''
 
-\subsection{Lifetime}  % --------------------------------------------------------------------------
-
+\textbf{Lifetime.}
 PrabhuPrakash2011a--- subsection ``long history of software development''
-
 Challenges with portability, and updating to ``modern standards''.
 
-\subsection{Optimization}  % ----------------------------------------------------------------------
-
-PrabhuPrakash2011a: ``scientists do not optimize for the common case'', ``scientists are unaware of
-parallelization paradigms''
-
-\subsection{Maintenance}  % -----------------------------------------------------------------------
-
+\textbf{Maintenance.}
 Scientific software, especially software maintained by graduate students, tends to be very hard to
 maintain.  %
 This problem is compounded by the long lifetime of such software, and the poorly defined
@@ -146,50 +137,115 @@ written by generation upon generation of student.  %
 Worse, software is sometimes abandoned or left untouched to become a crucial but arcane component
 of a scientific research project.  %
 
+\textbf{Optimization.}
+PrabhuPrakash2011a: ``scientists do not optimize for the common case'', ``scientists are unaware of
+parallelization paradigms''
+
 \section{Good-enough practices}  % ================================================================
 
 In their [...] perspective, ``Good enough practices in scientific computing'', (from which this
 section gets its name) [WILSON ET AL] describe a set of techniques that, in their words, ``every
 researcher can and should consider adopting''.  %
-
-\subsection{Write clearly and document often}  % --------------------------------------------------
-
-Let the computer do the work...
-
-Write programs for people, not computers.  %
-
-\subsection{Do not reinvent}  % -------------------------------------------------------------------
-
-Don't repeat yourself, or others (we built on top of scipy, hdf5).
-
-\subsection{Avoid premature optimization}  % ------------------------------------------------------
-
-Write first, optimize later.
-
-\subsection{Data formats}  % ----------------------------------------------------------------------
-
-% HDF5
-
-% SELF-DESCRIBING DATA
-
-% OBJECT ORIENTED PROGRAMMING
-
-\subsection{Collaboration and version control}  % -------------------------------------------------
-
-Plan for mistakes / use testing.
-
-Document document docuement.
-
-Collaborate.
-Code review...
-Issues...
-Make incremental changes...
-
-% SOURCE CONTROL AND VERSIONING
-
-\subsection{Licensing and distribution}  % --------------------------------------------------------
-
-% LICENSING AND DISTRIBUTION
+In this section, I attempt to very quickly summarize my personal perspective on what makes good
+software development good---with citations to literature that supports each idea.  %
+These practices are not, generally, ``extra work''.  %
+In fact, many of them save massive amounts of time and effort in the long \emph{and} short run,
+when properly applied.  %
+
+\textbf{Do not reinvent.} \cite{WilsonGreg2017a}  %
+Before you sit down and implement a piece of software, stop!  %
+First you should try hard to find a library that already has what you need.  %
+You'll often surprise yourself with what you can find.  %
+Search the package repository for your language, such as PyPI [CITE], MATLAB File Exchange [CITE]
+or CRAN [CITE].  %
+Even if there is not a full solution to your problem out there, there is almost certainly a
+solution to some part of it.  %
+It is much better to have a dependency than a custom implementation.  %
+Make your dependencies explicit, in machine-readable ways where possible.  %
+
+\textbf{Do not duplicate.} \cite{WilsonGreg2017a}  %
+If you do need to write some software, make sure that you do not duplicate code within your own
+work.  %
+Instead of writing the same 10 lines of code again and again with small tweaks, write a function
+that accepts a set of arguments.  %
+If your software package grows to contain multiple files, make those files modular.  %
+As a general rule, once you have two classes you need multiple files.  %
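As a minimal sketch of the "write a function that accepts a set of arguments" advice above: the function name `normalize` and its parameters are hypothetical, chosen only for illustration, and do not come from any particular package.

```python
def normalize(values, *, baseline=0.0, scale=1.0):
    """Subtract a baseline from each value, then divide by a scale factor."""
    if scale == 0:
        raise ValueError("scale must be nonzero")
    return [(v - baseline) / scale for v in values]


# The same few lines of arithmetic are reused with different arguments,
# instead of being copied and tweaked at each call site.
signal = normalize([2.0, 4.0, 6.0], baseline=2.0, scale=4.0)
reference = normalize([1.0, 3.0], scale=2.0)
```

Any fix to the arithmetic now happens in exactly one place.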
+
+\textbf{Choose good data formats.} \cite{WilsonGreg2017a}  %
+Choose a non-proprietary format if at all possible---remember: you yourself might not have access
+to the proprietary software in 10 years.  %
+Choose plain text if you can.  %
+Consider conforming to specifications, such as Tidy Data. [CITE]  %
+If you must, use open binary formats such as HDF5.  %
+Put as much metadata as you can into the file.  %
+Any piece of metadata that can automatically be added by the computer is essentially free---you
+might as well do it.  %
+Make sure that it is clear what each piece of data means.  %
+For tabular data, use headers.  %
+Don't forget units.  %
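The plain-text recommendations above (metadata, headers, units) might be sketched as follows, using only the Python standard library. The `write_table` helper, the `# key: value` metadata convention, and the column names are all assumptions made for illustration; they are not a standard or the format used in this work.

```python
import csv
import io
from datetime import datetime, timezone


def write_table(stream, columns, units, rows, metadata):
    """Write metadata as commented lines, then a header with units, then rows."""
    for key, value in metadata.items():
        stream.write(f"# {key}: {value}\n")
    writer = csv.writer(stream, lineterminator="\n")
    # Put the unit next to each column name so it travels with the data.
    writer.writerow([f"{name} ({unit})" for name, unit in zip(columns, units)])
    writer.writerows(rows)


buffer = io.StringIO()  # stands in for an open text file
write_table(
    buffer,
    columns=["wavelength", "signal"],
    units=["nm", "V"],
    rows=[(700.0, 0.12), (710.0, 0.15)],
    metadata={
        # Metadata the computer can generate on its own is essentially free.
        "created": datetime.now(timezone.utc).isoformat(),
        "format version": "1.0.0",
    },
)
text = buffer.getvalue()
```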
+
+\textbf{Use version control.}  %
+Version control systems allow programmers to save a software package such that they can always
+return to that save point.  %
+All of the files in the package are saved together.  %
+Modern version control systems allow programmers to see exactly what has changed between each save
+point, and since the last save point.  %
+This is indispensable when trying to diagnose software problems.  %
+In order to use version control as effectively as possible, try to save the package after every
+change (feature addition, bugfix, etc.).  %
+Typically version control is coupled with uploading to a remote server, for example using git with
+GitHub [CITE] or git.chem.wisc.edu [CITE], but version control need not be synonymous with
+uploading and distribution.  %
+Tools like git have a lot of fantastic features beyond simply saving [CITE], but those are beyond the
+scope of these ``good enough'' recommendations.  %
+Also consider defining a version for the software package as a whole.  %
+Use semantic versioning [CITE], unless there is a strong reason not to.  %
+If the language you are using has a convention for representing the version programmatically, such
+as a \python{__version__} attribute in Python, comply with that convention.  %
+
+\textbf{Test.} \cite{WilsonGreg2017a}  %
+As the old saying goes, ``if it's not tested, it's broken''.  %
+If you rely on a piece of functionality in your software, consider writing a test that defines that
+functionality.  %
+In this way, as you make changes you can run your tests to ensure that those changes do not
+accidentally break important functionality.  %
+Testing sounds difficult, but it's really just about writing simple functions that use your
+software to do something, and then raise an exception if the result is not correct.  %
+If you add tests when you add features or fix bugs, you'll quickly find that you have a lot of
+tests that do a good job of defining the expected behavior of your software.  %
+Software engineers tend to be dogmatic about testing, but don't worry too much about test coverage
+unless your project becomes very important.  %
+Distribute test datasets, when appropriate.  %
+Remember, your tests can serve double duty as simple minimal examples.  %
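A minimal sketch of what such tests can look like, assuming a hypothetical `mean` function as the functionality being relied upon. Each test is a plain function that uses the software and raises an exception if the result is wrong.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean of empty sequence")
    return sum(values) / len(values)


def test_mean_of_known_values():
    # assert raises AssertionError if the result is not what we expect
    assert mean([1.0, 2.0, 3.0]) == 2.0


def test_mean_rejects_empty_input():
    try:
        mean([])
    except ValueError:
        return  # the expected failure mode
    raise AssertionError("expected ValueError for empty input")


if __name__ == "__main__":
    test_mean_of_known_values()
    test_mean_rejects_empty_input()
```

Functions named `test_*` like these can also be collected by a test runner such as pytest, but running the file directly is enough to start.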
+
+\textbf{Collaborate and share.} \cite{WilsonGreg2017a}  %
+If you are part of a team, consider sharing software and collaborating to create it.  %
+Try using practices like code review and issue tracking, but don't feel obligated to use them if it
+doesn't make sense for your project.  %
+When working as part of a team, making incremental changes and using version control become even
+more important.  %
+Earlier I mentioned ``do not reinvent''.  %
+The other side of that coin is ``if you make something, consider sharing it''.  %
+Put your software on an open platform, like GitHub, and mint a DOI.  %
+Cite your software, and ask other people who are using your software to do the same.  %
+Choose a license early, and choose a permissive, commercially compatible license unless you
+(1) know what you are doing and (2) plan to enforce it.  %
+% TODO: cite 'publish your code it is good enough'
+
+\textbf{Write human readable code, and document it well.} \cite{WilsonGreg2017a}  %
+Let the computer do the work, but write the program to be read by a human.  %
+Give classes, functions, attributes and variables meaningful names.  %
+Don't be afraid to be verbose: most programming environments have tab completion, so long names
+are not all that hard to type.  %
+Try to follow the recommended style for your language, but don't obsess about it.  %
+
+\textbf{Avoid premature optimization.} \cite{WilsonGreg2017a}
+Don't get pulled into the trap of trying to make things perfect the first time.  %
+Software design is typically a very iterative process, and for good reason.  %
+Write first, and if it works, consider optimization.  %
+If you do need to make your software faster, use profiling tools like cProfile [CITE] and SnakeViz
+[CITE] to empirically determine what operations are taking the longest, rather than trying to guess
+or use intuition.  %
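A minimal sketch of profiling with cProfile from the Python standard library. The functions `slow_sum_of_squares` and `analysis` are hypothetical stand-ins for real work; only the `cProfile` and `pstats` calls are the actual standard-library interface.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop, standing in for an expensive operation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def analysis():
    return [slow_sum_of_squares(20_000) for _ in range(20)]


# Record timing information only while the code of interest runs.
profiler = cProfile.Profile()
profiler.enable()
analysis()
profiler.disable()

# Sort by cumulative time and show the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists call counts and time per function, so the slow step is measured rather than guessed.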
 
 \section{Object oriented programming}  % ----------------------------------------------------------
 
@@ -223,9 +279,9 @@ class Person():
         if food == self.favorite_food:
             return 'yum! my favorite'
         elif food == self.hated_food:
-            return 'gross---no thank you'
+            return 'gross---no thank you'
         else:
-            return 'meh'
+            return 'meh'
 \end{codefragment}
 Now I can make some instances of that class, and access their attributes and methods.  %
 \begin{codefragment}{python}
@@ -264,8 +320,8 @@ I recommend The Quarks of Object-Oriented Development, by \textcite{ArmstrongDeb
 
 \section{Hierarchical data format}  % -------------------------------------------------------------
 
-One of the particularly important challenges in MR-CMDS is data storage.  %
-MR-CMDS datasets are multi-dimensional, and the particular dimensions are different from experiment
+One of the particularly important challenges in CMDS is data storage.  %
+CMDS datasets are multi-dimensional, and the particular dimensions are different from experiment
 to experiment.  %
 Historically, the Wright Group has stored data as ``flattened'' arrays in plain text, where each
 column corresponds to one of the scannable hardwares or one of the sensors in the experiment.  %
@@ -276,28 +332,54 @@ increasingly large and higher-dimensional data.  %
 
 Hierarchical data files are an alternative strategy that scales much better with large and
 high-dimensional data.  %
-
-Originally, CDF \cite{TreinshLloydA1987a}.  %
-Support ``random access to data, so that efficient access of small portions or large data files
-would be possible''.  %
-
-Then, NetCDF \cite{RewRuss1990a}.
-More portability.  %
-Named dimensions.  %
-Metadata.  %
-``Hyperslab''
-
-FITS used by astronomy community, with a focus on backwards compatibility.  %
-\cite{WellsDC1981a}
+These are binary files that store the array directly, not in a flattened way.  %
+They can contain multiple arrays, with different data types, in the same file under a well-defined
+organizational system.  %
+They support arbitrary metadata, integrated into the same hierarchy as the arrays, so making them
+self-describing is trivial.  %
+While in general plain text is preferred for its simplicity, these file-types are simply superior
+for storing CMDS data.  %
+
+To the best of this author's knowledge, the Common Data Format (CDF) was the first general purpose
+self-describing multidimensional array data format. \cite{TreinshLloydA1987a}  %
+The engineers at the National Space Science Data Center (a division of NASA) created the CDF.  %
+Using this construct, ``scientific softwares at NSSDC ... do not need specific knowledge of the
+data with which they are working. This permits users of such systems to apply the same functions
+to different sets of data.''
+These are exactly the capabilities that CMDS requires.  %
+
+A second-order challenge in CMDS data storage is the size of the arrays.  %
+While by no means ``big data'', CMDS data is often awkwardly large: large enough to fill up the
+memory of an average modern laptop or desktop computer.  %
+CDF also has a unique solution to this problem: use a block structure to allow access to parts of
+the array without reading the entire data into memory.  %
+
+Slightly later, NetCDF was introduced \cite{RewRuss1990a}.  %
+Very similar to CDF, NetCDF focused on enhancements to portability.  %
+Certain metadata conventions were also introduced, including named dimensions.  %
+NetCDF remains popular in the aerospace and 
+
+The Flexible Image Transport System (FITS) is a similar format with a focus on visualization and
+backwards compatibility. \cite{WellsDC1981a}  %
 % CITE https://fits.gsfc.nasa.gov/
 % CONSIDER CITING https://fits.gsfc.nasa.gov/rfc4047.txt
+FITS is still popular in the astronomy community.  %
 
-
-I have chosen to build off of HDF5.  %
+Today, these hierarchical data formats have gathered under the umbrella of the HDF5 format, built
+and maintained by the HDF Group. [CITE]  %
+This format has all of the advantages of FITS, CDF, and NetCDF.  %
+It can support arbitrary datatypes and is optimized to quickly process large and complex
+datasets.  %
+In Python, HDF5 is supported primarily through the h5py package. [CITE]  %
 
 \section{Scientific Python}  % --------------------------------------------------------------------
 
-Numpy, SciPy
-
-% TODO: add MillmanKJarrod2011a (Python for Scientists and Engineers)
-% TODO: add vanderWaltStefan2011a (The NumPy Array: A Structure for Efficient Numerical Computation)
-\ No newline at end of file
+SciPy is a collection of ``open-source software for mathematics, science, and engineering.''
+\cite{MillmanKJarrod2011a}  %
+SciPy was an absolutely essential component of this dissertation and the work it describes.  %
+There are several packages under the SciPy umbrella.  %
+NumPy is a very powerful and fast package for working with multidimensional arrays.
+\cite{vanderWaltStefan2011a}  %
+The SciPy library contains a vast number of scientific computing tools, including many mathematical
+operations that this work depends on. [CITE]  %
+Matplotlib is a beautiful visualization package for 1, 2, and 3D plotting. [CITE]  %