author Blaise Thompson <blaise@untzag.com> 2018-04-03 21:37:32 -0500
committer Blaise Thompson <blaise@untzag.com> 2018-04-03 21:37:32 -0500
commit ea0a89fccc79836bf2bcd5a9992a05c4d4918157 (patch)
tree 858b6f2674fc82c511a37f093479a3c68046c504 /software
parent f17a23c402dce796a0e7483d8a822eb6c874d489 (diff)
2018-04-03 12:37
Diffstat (limited to 'software')
-rw-r--r-- software/chapter.tex 234
1 files changed, 158 insertions, 76 deletions
diff --git a/software/chapter.tex b/software/chapter.tex
index 2a6cce2..758ca8f 100644
--- a/software/chapter.tex
+++ b/software/chapter.tex
@@ -101,9 +101,10 @@ enabling infrastructure for science and engineering research'' [CITE https://www
\section{Challenges in scientific software development} % ========================================
Software development ``by-and-for'' scientists poses unique challenges. %
+In this section, I attempt to summarize the literature about these challenges, with a focus on
+those challenges that I have found most relevant. %
-\subsection{Extensibility} % ---------------------------------------------------------------------
-
+\textbf{Extensibility.} % TODO: cite
Many traditional software development paradigms demand an upfront articulation of goals and
requirements. %
This allows the developers to carefully design their software, even before a single line of code is
@@ -120,23 +121,13 @@ of researchers and a contracted team of software engineers. %
\end{dquote}
-\subsection{Testing} % ---------------------------------------------------------------------------
-
PrabhuPrakash2011a---lots of good stuff under ``Scientists do not rigorously test their programs''
-\subsection{Lifetime} % --------------------------------------------------------------------------
-
+\textbf{Lifetime.}
PrabhuPrakash2011a--- subsection ``long history of software development''
-
Challenges with portability, and updating to ``modern standards''.
-\subsection{Optimization} % ----------------------------------------------------------------------
-
-PrabhuPrakash2011a: ``scientists do not optimize for the common case'', ``scientists are unaware of
-parallelization paradigms''
-
-\subsection{Maintenance} % -----------------------------------------------------------------------
-
+\textbf{Maintenance.}
Scientific software, especially software maintained by graduate students, tends to be very hard to
maintain. %
This problem is compounded by the long lifetime of such software, and the poorly defined
@@ -146,50 +137,115 @@ written by generation upon generation of student. %
Worse, software is sometimes abandoned or left untouched to become a crucial but arcane component
of a scientific research project. %
+\textbf{Optimization.}
+PrabhuPrakash2011a: ``scientists do not optimize for the common case'', ``scientists are unaware of
+parallelization paradigms''
+
\section{Good-enough practices} % ================================================================
In their [...] perspective, ``Good enough practices in scientific computing'', (from which this
section gets its name) [WILSON ET AL] describe a set of techniques that, in their words, ``every
researcher can and should consider adopting''. %
-
-\subsection{Write clearly and document often} % --------------------------------------------------
-
-Let the computer do the work...
-
-Write programs for people, not computers. %
-
-\subsection{Do not reinvent} % -------------------------------------------------------------------
-
-Don't repeat yourself, or others (we built on top of scipy, hdf5).
-
-\subsection{Avoid premature optimization} % ------------------------------------------------------
-
-Write first, optimize later.
-
-\subsection{Data formats} % ----------------------------------------------------------------------
-
-% HDF5
-
-% SELF-DESCRIBING DATA
-
-% OBJECT ORIENTED PROGRAMMING
-
-\subsection{Collaboration and version control} % -------------------------------------------------
-
-Plan for mistakes / use testing.
-
-Document document docuement.
-
-Collaborate.
-Code review...
-Issues...
-Make incremental changes...
-
-% SOURCE CONTROL AND VERSIONING
-
-\subsection{Licensing and distribution} % --------------------------------------------------------
-
-% LICENSING AND DISTRIBUTION
+In this section, I attempt to very quickly summarize my personal perspective on what makes good
+software development good---with citations to literature that supports each idea. %
+These practices are not, generally, ``extra work''. %
+In fact, many of them save massive amounts of time and effort in the long \emph{and} short run,
+when properly applied. %
+
+\textbf{Do not reinvent.} \cite{WilsonGreg2017a} %
+Before you sit down and implement a piece of software, stop! %
+First you should try hard to find a library that already has what you need. %
+You'll often surprise yourself with what you can find. %
+Search the package repository for your language, such as PyPI [CITE], MATLAB File Exchange [CITE]
+or CRAN [CITE]. %
+Even if there is not a full solution to your problem out there, there is almost certainly a
+solution to some part of it. %
+It is much better to depend on an existing library than to maintain a custom implementation. %
+Make your dependencies explicit, in machine-readable ways where possible. %
+
+\textbf{Do not duplicate.} \cite{WilsonGreg2017a} %
+If you do need to write some software, make sure that you do not duplicate code within your own
+work. %
+Instead of writing the same 10 lines of code again and again with small tweaks, write a function
+that accepts a set of arguments. %
+If your software package grows to contain multiple files, make those files modular. %
+As a general rule, once you have two classes you need multiple files. %
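As a minimal sketch of the ``write a function'' advice, the block below replaces what might otherwise be several copied-and-tweaked snippets with one function. The name \python{normalize} and its \python{baseline} argument are hypothetical, chosen only for illustration.

```python
def normalize(values, baseline=0.0):
    """Subtract a baseline and scale so the largest magnitude is 1."""
    shifted = [v - baseline for v in values]
    peak = max(abs(v) for v in shifted)
    if peak == 0:
        return shifted  # every value equals the baseline; nothing to scale
    return [v / peak for v in shifted]


# The same function serves every dataset; only the arguments change.
signal = normalize([1.0, 3.0, 5.0], baseline=1.0)
reference = normalize([2.0, 4.0])
```

A fix to the normalization logic now happens in exactly one place, instead of in every copy.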
+
+\textbf{Choose good data formats.} \cite{WilsonGreg2017a} %
+Choose a non-proprietary format if at all possible---remember: you yourself might not have access
+to the proprietary software in 10 years. %
+Choose plain text if you can. %
+Consider conforming to specifications, such as Tidy Data. [CITE] %
+If you must, use open binary formats such as HDF5. %
+Put as much metadata as you can into the file. %
+Any piece of metadata that can automatically be added by the computer is essentially free---you
+might as well do it. %
+Make sure that it is clear what each piece of data means. %
+For tabular data, use headers. %
+Don't forget units. %
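The following sketch shows one way to follow these recommendations for plain-text tabular data using only the Python standard library. The \python{write_table} helper, the \python{#}-prefixed metadata lines, and the instrument name are all assumptions made for illustration; they are a common convention, not part of any formal CSV standard.

```python
import csv
import datetime
import io


def write_table(stream, columns, rows, metadata):
    """Write rows as CSV, preceded by '# key: value' metadata lines.

    columns is a list of (name, units) pairs; the header reads 'name (units)'.
    """
    for key, value in metadata.items():
        stream.write(f"# {key}: {value}\n")
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow([f"{name} ({units})" for name, units in columns])
    writer.writerows(rows)


buffer = io.StringIO()  # stands in for an open file
metadata = {
    "created": datetime.datetime.now().isoformat(),  # free: added by the computer
    "instrument": "hypothetical spectrometer",
}
write_table(
    buffer,
    columns=[("wavelength", "nm"), ("signal", "V")],
    rows=[(500.0, 0.12), (510.0, 0.15)],
    metadata=metadata,
)
text = buffer.getvalue()
```

The timestamp costs nothing to record, and the header row carries both the meaning and the units of each column.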
+
+\textbf{Use version control.} %
+Version control systems allow programmers to save a software package such that they can always
+return to that save point. %
+All of the files in the package are saved together. %
+Modern version control systems allow programmers to see exactly what has changed between each save
+point, and since the last save point. %
+This is indispensable when trying to diagnose software problems. %
+In order to use version control as effectively as possible, try to save the package after every
+change (feature addition, bugfix, etc.). %
+Typically version control is coupled with uploading to a remote server, for example using git with
+GitHub [CITE] or git.chem.wisc.edu [CITE], but version control need not be synonymous with
+uploading and distribution. %
+Tools like git have a lot of fantastic features beyond simply saving [CITE], but those are beyond the
+scope of these ``good enough'' recommendations. %
+Also consider defining a version for the software package as a whole. %
+Use semantic versioning [CITE], unless there is a strong reason not to. %
+If the language you are using has a convention for representing the version programmatically, such
+as a \python{__version__} attribute in Python, comply with that convention. %
+
+\textbf{Test.} \cite{WilsonGreg2017a} %
+As the old saying goes, ``if it's not tested, it's broken''. %
+If you rely on a piece of functionality in your software, consider writing a test that defines that
+functionality. %
+In this way, as you make changes you can run your tests to ensure that those changes do not
+accidentally break important functionality. %
+Testing sounds difficult, but it's really just about writing simple functions that use your
+software to do something, and then raise an exception if the result is not correct. %
+If you add tests when you add features or fix bugs, you'll quickly find that you have a lot of
+tests that do a good job of defining the expected behavior of your software. %
+Software engineers tend to be dogmatic about testing, but don't worry too much about test coverage
+unless your project becomes very important. %
+Distribute test datasets, when appropriate. %
+Remember, your tests can serve double duty as simple minimal examples. %
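To make the idea concrete, here is a minimal sketch of tests written as simple functions that raise an exception when the result is wrong. The \python{centroid} function is a hypothetical stand-in for whatever functionality you rely on; a test runner such as pytest could discover the \python{test_} functions, but calling them directly works too.

```python
def centroid(positions, weights):
    """Weighted mean position: the functionality we rely on."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(p * w for p, w in zip(positions, weights)) / total


def test_centroid_symmetric_peak():
    # A peak symmetric about 2.0 must have its centroid at 2.0.
    assert centroid([1.0, 2.0, 3.0], [1.0, 2.0, 1.0]) == 2.0


def test_centroid_rejects_zero_weights():
    # The function should fail loudly rather than divide by zero.
    try:
        centroid([1.0, 2.0], [0.0, 0.0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for all-zero weights")


test_centroid_symmetric_peak()
test_centroid_rejects_zero_weights()
```

Each test doubles as a small usage example for the function it exercises.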
+
+\textbf{Collaborate and share.} \cite{WilsonGreg2017a} %
+If you are part of a team, consider sharing software and collaborating to create it. %
+Try using practices like code review and issue tracking, but don't feel obligated to use them if it
+doesn't make sense for your project. %
+When working as part of a team, making incremental changes and using version control become even
+more important. %
+Earlier we mentioned ``do not reinvent''. %
+The other side of that coin is ``if you make something, consider sharing it''. %
+Put your software on an open platform, like GitHub, and mint a DOI. %
+Cite your software, and ask other people who are using your software to do the same. %
+Choose a license early, and choose a permissive, commercially compatible license unless you (1)
+know what you are doing and (2) plan to enforce its terms. %
+% TODO: cite 'publish your code it is good enough'
+
+\textbf{Write human-readable code, and document it well.} \cite{WilsonGreg2017a} %
+Let the computer do the work, but write the program to be read by a human. %
+Give classes, functions, attributes and variables meaningful names. %
+Don't be afraid to be verbose: most programming environments have tab completion, so long names are
+not all that hard to type. %
+Try to follow the recommended style for your language, but don't obsess about it. %
+
+\textbf{Avoid premature optimization.} \cite{WilsonGreg2017a}
+Don't get pulled into the trap of trying to make things perfect the first time. %
+Software design is typically a very iterative process, and for good reason. %
+Write first, and if it works, consider optimization. %
+If you do need to make your software faster, use profiling tools like cProfile [CITE] and SnakeViz
+[CITE] to empirically determine what operations are taking the longest, rather than trying to guess
+or use intuition. %
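A minimal sketch of empirical profiling with the standard-library \python{cProfile} and \python{pstats} modules follows. The function \python{slow_sum_of_squares} is a hypothetical stand-in for real analysis code.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop standing in for real analysis code."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum_of_squares(100_000)
profiler.disable()

# Sort by cumulative time and keep the ten most expensive entries.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

The printed table lists each function with its call count and time spent, which shows where optimization effort would actually pay off.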
\section{Object oriented programming} % ----------------------------------------------------------
@@ -223,9 +279,9 @@ class Person():
if food == self.favorite_food:
return 'yum! my favorite'
elif food == self.hated_food:
- return 'gross---no thank you'
+ return 'gross---no thank you'''''''''
else:
- return 'meh'
+ return 'meh''
\end{codefragment}
Now I can make some instances of that class, and access their attributes and methods. %
\begin{codefragment}{python}
@@ -264,8 +320,8 @@ I recommend The Quarks of Object-Oriented Development, by \textcite{ArmstrongDeb
\section{Hierarchical data format} % -------------------------------------------------------------
-One of the particularly important challenges in MR-CMDS is data storage. %
-MR-CMDS datasets are multi-dimensional, and the particular dimensions are different from experiment
+One of the particularly important challenges in CMDS is data storage. %
+CMDS datasets are multi-dimensional, and the particular dimensions are different from experiment
to experiment. %
Historically, the Wright Group has stored data as ``flattened'' arrays in plain text, where each
column corresponds to one of the scannable hardwares or one of the sensors in the experiment. %
@@ -276,28 +332,54 @@ increasingly large and higher-dimensional data. %
Hierarchical data files are an alternative strategy that scales much better with large and
high-dimensional data. %
-
-Originally, CDF \cite{TreinshLloydA1987a}. %
-Support ``random access to data, so that efficient access of small portions or large data files
-would be possible''. %
-
-Then, NetCDF \cite{RewRuss1990a}.
-More portability. %
-Named dimensions. %
-Metadata. %
-``Hyperslab''
-
-FITS used by astronomy community, with a focus on backwards compatibility. %
-\cite{WellsDC1981a}
+These are binary files that store the array directly, not in a flattened way. %
+They can contain multiple arrays, with different data types, in the same file under a well-defined
+organizational system. %
+They support arbitrary metadata, integrated into the same hierarchy as the arrays, so making them
+self-describing is trivial. %
+While in general plain text is preferred for its simplicity, these file-types are simply superior
+for storing CMDS data. %
+
+To this author's best knowledge, the Common Data Format (CDF) was the first general purpose
+self-describing multidimensional array data format. \cite{TreinshLloydA1987a} %
+The engineers at the National Space Science Data Center (a division of NASA) created the CDF. %
+Using this construct, ``scientific softwares at NSSDC ... do not need specific knowledge of the
+data with which they are working. This permits users of such systems to apply the same functions
+to different sets of data.''
+These are exactly the capabilities that CMDS requires. %
+
+A second-order challenge in CMDS data storage is the size of the arrays. %
+While by no means ``big data'', CMDS data is often awkwardly large: large enough to fill up the
+memory of an average modern laptop or desktop computer. %
+CDF also has a unique solution to this problem: use a block structure to allow access to parts of
+the array without reading the entire data into memory. %
+
+Slightly later, NetCDF was introduced \cite{RewRuss1990a}. %
+Very similar to CDF, NetCDF focused on enhancements to portability. %
+Certain metadata conventions were also introduced, including named dimensions. %
+NetCDF remains popular in the aerospace and
+
+The Flexible Image Transport System (FITS) is a similar format with a focus on visualization and
+backwards compatibility. \cite{WellsDC1981a} %
% CITE https://fits.gsfc.nasa.gov/
% CONSIDER CITING https://fits.gsfc.nasa.gov/rfc4047.txt
+FITS is still popular in the astronomy community. %
-
-I have chosen to build off of HDF5. %
+Today, these hierarchical data formats have gathered under the umbrella of the HDF5 format, built
+and maintained by the HDF Group. [CITE] %
+This format has all of the advantages of FITS, CDF, and NetCDF. %
+It can support arbitrary datatypes and is optimized to quickly process large and complex
+datasets. %
+In Python, HDF5 is supported primarily through the h5py package. [CITE] %
\section{Scientific Python} % --------------------------------------------------------------------
-Numpy, SciPy
-
-% TODO: add MillmanKJarrod2011a (Python for Scientists and Engineers)
-% TODO: add vanderWaltStefan2011a (The NumPy Array: A Structure for Efficient Numerical Computation) \ No newline at end of file
+SciPy is a collection of ``open-source software for mathematics, science, and engineering.''
+\cite{MillmanKJarrod2011a} %
+SciPy was an absolutely essential component of this dissertation and the work it describes. %
+There are several packages under the SciPy umbrella. %
+NumPy is a very powerful and fast package for working with multidimensional arrays.
+\cite{vanderWaltStefan2011a} %
+The SciPy library contains a vast number of scientific computing tools, including many mathematical
+operations that this work depends on. [CITE] %
+Matplotlib is a beautiful visualization package for 1, 2, and 3D plotting. [CITE] %