From ea0a89fccc79836bf2bcd5a9992a05c4d4918157 Mon Sep 17 00:00:00 2001 From: Blaise Thompson Date: Tue, 3 Apr 2018 21:37:32 -0500 Subject: 2018-04-03 12:37 --- software/chapter.tex | 234 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 158 insertions(+), 76 deletions(-) (limited to 'software/chapter.tex') diff --git a/software/chapter.tex b/software/chapter.tex index 2a6cce2..758ca8f 100644 --- a/software/chapter.tex +++ b/software/chapter.tex @@ -101,9 +101,10 @@ enabling infrastructure for science and engineering research'' [CITE https://www \section{Challenges in scientific software development} % ======================================== Software development ``by-and-for'' scientists poses unique challenges. % +In this section, I attempt to summarize the literature about these challenges, with a focus on +those challenges that I have found most relevant. % -\subsection{Extensibility} % --------------------------------------------------------------------- - +\textbf{Extensibility.} % TODO: cite Many traditional software development paradigms demand an upfront articulation of goals and requirements. % This allows the developers to carefully design their software, even before a single line of code is @@ -120,23 +121,13 @@ of researchers and a contracted team of software engineers. % \end{dquote} -\subsection{Testing} % --------------------------------------------------------------------------- - PrabhuPrakash2011a---lots of good stuff under ``Scientists do not rigorously test their programs'' -\subsection{Lifetime} % -------------------------------------------------------------------------- - +\textbf{Lifetime.} PrabhuPrakash2011a--- subsection ``long history of software development'' - Challenges with portability, and updating to ``modern standards''. 
-\subsection{Optimization} % ---------------------------------------------------------------------- - -PrabhuPrakash2011a: ``scientists do not optimize for the common case'', ``scientists are unaware of -parallelization paradigms'' - -\subsection{Maintenance} % ----------------------------------------------------------------------- - +\textbf{Maintenance} Scientific software, especially software maintained by graduate students, tends to be very hard to maintain. % This problem is compounded by the long lifetime of such software, and the poorly defined @@ -146,50 +137,115 @@ written by generation upon generation of student. % Worse, software is sometimes abandoned or left untouched to become a crucial but arcane component of a scientific research project. % +\textbf{Optimization} +PrabhuPrakash2011a: ``scientists do not optimize for the common case'', ``scientists are unaware of +parallelization paradigms'' + \section{Good-enough practices} % ================================================================ In their [...] perspective, ``Good enough practices in scientific computing'', (from which this section gets its name) [WILSON ET AL] describe a set of techniques that, in their words, ``every researcher can and should consider adopting''. % - -\subsection{Write clearly and document often} % -------------------------------------------------- - -Let the computer do the work... - -Write programs for people, not computers. % - -\subsection{Do not reinvent} % ------------------------------------------------------------------- - -Don't repeat yourself, or others (we built on top of scipy, hdf5). - -\subsection{Avoid premature optimization} % ------------------------------------------------------ - -Write first, optimize later. 
- -\subsection{Data formats} % ---------------------------------------------------------------------- - -% HDF5 - -% SELF-DESCRIBING DATA - -% OBJECT ORIENTED PROGRAMMING - -\subsection{Collaboration and version control} % ------------------------------------------------- - -Plan for mistakes / use testing. - -Document document docuement. - -Collaborate. -Code review... -Issues... -Make incremental changes... - -% SOURCE CONTROL AND VERSIONING - -\subsection{Licensing and distribution} % -------------------------------------------------------- - -% LICENSING AND DISTRIBUTION +In this section, I attempt to very quickly summarize my personal perspective on what makes good +software development good---with citations to literature that supports each idea. % +These practices are not, generally, ``extra work''. % +In fact, many of them save massive amounts of time and effort in the long \emph{and} short run, +when properly applied. % + +\textbf{Do not reinvent.} \cite{WilsonGreg2017a} % +Before you sit down and implement a piece of software, stop! % +First you should try hard to find a library that already has what you need. % +You'll often surprise yourself with what you can find. % +Search the package repository for your language, such as PyPI [CITE], MATLAB File Exchange [CITE] +or CRAN [CITE]. % +Even if there is not a full solution to your problem out there, there is almost certainly a +solution to some part of it. % +Much better to have a dependency than a custom implementation. % +Make your dependencies explicit, in machine readable ways where possible. % + +\textbf{Do not duplicate.} \cite{WilsonGreg2017a} % +If you do need to write some software, make sure that you do not duplicate code within your own +work. % +Instead of writing the same 10 lines of code again and again with small tweaks, write a function +that accepts a set of arguments. % +If your software package grows to contain multiple files, make those files modular. 
% +As a general rule, once you have two classes you need multiple files. % + +\textbf{Choose good data formats.} \cite{WilsonGreg2017a} % +Choose a non-proprietary format if at all possible---remember: you yourself might not have access +to the proprietary software in 10 years. % +Choose plain text if you can. % +Consider conforming to specifications, such as Tidy Data. [CITE] % +If you must, use open binary formats such as HDF5. % +Put as much metadata as you can into the file. % +Any piece of metadata that can automatically be added by the computer is essentially free---you +might as well do it. % +Make sure that it is clear what each piece of data means. % +For tabular data, use headers. % +Don't forget units. % + +\textbf{Use version control.} % +Version control systems allow programmers to save a software package such that they can always +return to that save point. % +All of the files in the package are saved together. % +Modern version control systems allow programmers to see exactly what has changed between each save +point, and since the last save point. % +This is indispensable when trying to diagnose software problems. % +In order to use version control as effectively as possible, try to save the package after every +change (feature addition, bugfix, etc). % +Typically version control is coupled with uploading to a remote server, for example using git with +GitHub [CITE] or git.chem.wisc.edu [CITE], but version control need not be synonymous with +uploading and distribution. % +Tools like git have a lot of fantastic features beyond simply saving [CITE], but those are beyond the +scope of these ``good enough'' recommendations. % +Also consider defining a version for the software package as a whole. % +Use semantic versioning [CITE], unless there is a strong reason not to. % +If the language you are using has a convention for representing the version programmatically, such +as a \python{__version__} attribute in Python, comply with that convention. 
% + +\textbf{Test.} \cite{WilsonGreg2017a} % +As the old saying goes, ``if it's not tested, it's broken''. % +If you rely on a piece of functionality in your software, consider writing a test that defines that +functionality. % +In this way, as you make changes you can run your tests to ensure that those changes do not +accidentally break important functionality. % +Testing sounds difficult, but it's really just about writing simple functions that use your +software to do something, and then raise an exception if the result is not correct. % +If you add tests when you add features or fix bugs, you'll quickly find that you have a lot of +tests that do a good job of defining the expected behavior of your software. % +Software engineers tend to be dogmatic about testing, but don't worry too much about test coverage +unless your project becomes very important. % +Distribute test datasets, when appropriate. % +Remember, your tests can serve double duty as simple minimal examples. % + +\textbf{Collaborate and share.} \cite{WilsonGreg2017a} % +If you are part of a team, consider sharing software and collaborating to create it. % +Try using practices like code review and issue tracking, but don't feel obligated to use them if it +doesn't make sense for your project. % +When working as part of a team, making incremental changes and using version control become even +more important. % +Earlier we mentioned ``do not reinvent''. % +The other side of that coin is ``if you make something, consider sharing it''. % +Put your software on an open platform, like GitHub, and mint a DOI. % +Cite your software, and ask other people who are using your software to do the same. % +Choose a license early, and choose permissive and commercially compatible unless you 1. know what +you are doing and 2. plan to enforce. 
% +% TODO: cite 'publish your code it is good enough' + +\textbf{Write human readable code, and document it well.} \cite{WilsonGreg2017a} % +Let the computer do the work, but write the program to be read by a human. % +Give classes, functions, attributes and variables meaningful names. % +Don't be afraid to be verbose; most programming environments have tab completion, so long names are +not all that hard to type. % +Try to follow the recommended style for your language, but don't obsess about it. % + +\textbf{Avoid premature optimization.} \cite{WilsonGreg2017a} +Don't get pulled into the trap of trying to make things perfect the first time. % +Software design is typically a very iterative process, and for good reason. % +Write first, and if it works, consider optimization. % +If you do need to make your software faster, use profiling tools like cProfile [CITE] and SnakeViz +[CITE] to empirically determine what operations are taking the longest, rather than trying to guess +or use intuition. % \section{Object oriented programming} % ---------------------------------------------------------- @@ -223,9 +279,9 @@ class Person(): if food == self.favorite_food: return 'yum! my favorite' elif food == self.hated_food: - return 'gross---no thank you' + return 'gross---no thank you' else: - return 'meh' + return 'meh' \end{codefragment} Now I can make some instances of that class, and access their attributes and methods. % \begin{codefragment}{python} @@ -264,8 +320,8 @@ I recommend The Quarks of Object-Oriented Development, by \textcite{ArmstrongDeb \section{Hierarchical data format} % ------------------------------------------------------------- -One of the particularly important challenges in MR-CMDS is data storage. % -MR-CMDS datasets are multi-dimensional, and the particular dimensions are different from experiment +One of the particularly important challenges in CMDS is data storage. 
% +CMDS datasets are multi-dimensional, and the particular dimensions are different from experiment to experiment. % Historically, the Wright Group has stored data as ``flattened'' arrays in plain text, where each column corresponds to one of the scannable hardwares or one of the sensors in the experiment. % @@ -276,28 +332,54 @@ increasingly large and higher-dimensional data. % Hierarchical data files are an alternative strategy that scales much better with large and high-dimensional data. % - -Originally, CDF \cite{TreinshLloydA1987a}. % -Support ``random access to data, so that efficient access of small portions or large data files -would be possible''. % - -Then, NetCDF \cite{RewRuss1990a}. -More portability. % -Named dimensions. % -Metadata. % -``Hyperslab'' - -FITS used by astronomy community, with a focus on backwards compatibility. % -\cite{WellsDC1981a} +These are binary files that store the array directly, not in a flattened way. % +They can contain multiple arrays, with different data types, in the same file under a well-defined +organizational system. % +They support arbitrary metadata, integrated into the same hierarchy as the arrays, so making them +self-describing is trivial. % +While in general plain text is preferred for its simplicity, these file types are simply superior +for storing CMDS data. % + +To the best of this author's knowledge, the Common Data Format (CDF) was the first general purpose +self-describing multidimensional array data format. \cite{TreinshLloydA1987a} % +The engineers at the National Space Science Data Center (a division of NASA) created the CDF. % +Using this construct, ``scientific softwares at NSSDC ... do not need specific knowledge of the +data with which they are working. This permits users of such systems to apply the same functions +to different sets of data.'' +These are exactly the capabilities that CMDS requires. % + +A second-order challenge in CMDS data storage is the size of the arrays. 
% +While by no means ``big data'', CMDS data is often awkwardly large: large enough to fill up the +memory of an average modern laptop or desktop computer. % +CDF also has a unique solution to this problem: use a block structure to allow access to parts of +the array without reading the entire array into memory. % + +Slightly later, NetCDF was introduced \cite{RewRuss1990a}. % +Very similar to CDF, NetCDF focused on enhancements to portability. % +Certain metadata conventions were also introduced, including named dimensions. % +NetCDF remains popular in the aerospace and + +The Flexible Image Transport System (FITS) is a similar format with a focus on visualization and +backwards compatibility. \cite{WellsDC1981a} % % CITE https://fits.gsfc.nasa.gov/ % CONSIDER CITING https://fits.gsfc.nasa.gov/rfc4047.txt +FITS is still popular in the astronomy community. % - -I have chosen to build off of HDF5. % +Today, these hierarchical data formats have gathered under the umbrella of the HDF5 format, built +and maintained by the HDF Group. [CITE] % +This format has all of the advantages of FITS, CDF, and NetCDF. % +It can support arbitrary datatypes and is optimized to quickly process large and complex +datasets. % +In Python, HDF5 is supported primarily through the h5py package. [CITE] % \section{Scientific Python} % -------------------------------------------------------------------- -Numpy, SciPy - -% TODO: add MillmanKJarrod2011a (Python for Scientists and Engineers) -% TODO: add vanderWaltStefan2011a (The NumPy Array: A Structure for Efficient Numerical Computation) \ No newline at end of file +SciPy is a collection of ``open-source software for mathematics, science, and engineering.'' +\cite{MillmanKJarrod2011a} % +SciPy was an absolutely essential component of this dissertation and the work it describes. % +There are several packages under the SciPy umbrella. % +NumPy is a very powerful and fast package for working with multidimensional arrays. 
+\cite{vanderWaltStefan2011a} % +The SciPy library contains a vast number of scientific computing tools, including many mathematical +operations that this work depends on. [CITE] % +Matplotlib is a beautiful visualization package for 1, 2, and 3D plotting. [CITE] % -- cgit v1.2.3