% TODO: add StoddenVictoria2016a (Enhancing reproducibility for computational methods)
% TODO: http://pubs.acs.org/doi/10.1021/cen-09535-scitech2

\chapter{Software}

\begin{dquote}
  The following guidelines are to be used in the documentation of all software developed in the Wright group for the IBM 9000 computer. %
  These rules have arisen as a necessary consequence of the group's programming philosophy of writing software in the form of units which can be readily shared among a number of programmers. %
  The approach outlined here should help to avoid some of the confusion otherwise produced by several persons simultaneously developing and modifying shared software. %

  % Roger Carlson, Appendix 2.3, Software Development Guidelines
  \dsignature{Roger Carlson, ``Software Development Guidelines'' (1988) \cite{CarlsonRogerJ1988a}}
\end{dquote}

\clearpage

\section{Science needs software}  % ===============================================================

Cutting-edge science increasingly relies on custom software. %
Software does more than just help scientists analyze data---scientific software enables scientists to collect, analyze, and model results in ways that would otherwise be wholly impossible. %
How does scientific software get made? %
Who makes it, and what is the quality of that product? %
Much has been written about these questions. %
To this author's knowledge, there are at least eight case studies and surveys dedicated to how scientists develop and use scientific software. \cite{CardDavidN1986a, SeamanCarolynB1997a, MullerMatthias2001a, SegalJudith2004a, SegalJudith2005a, CarverJeffreyC2007a, HannayJoErskine2009a, PrabuPrakash2011a} %
Although they focus on different disciplines and were published at different times, these articles present a remarkably consistent perspective on the challenges that tend to arise when developing software ``by and for'' scientists. %

Scientists do more than just use software: they develop it.
%
In their 2008 survey, \textcite{HannayJoErskine2009a} showed just how much of the work of science comes down to software development: %
\begin{ditemize}
  \item 84.3\% of surveyed scientists state that developing scientific software is important or very important for their own research.
  \item 91.2\% of surveyed scientists state that using scientific software is important or very important for their own research.
  \item On average, scientists spend approximately 40\% of their work time using scientific software.
  \item On average, scientists spend approximately 30\% of their work time developing scientific software.
\end{ditemize}

PrabhuPrakash2011a---35\% developing, breakdown by type of work...

Despite the importance of software to science and scientists, most scientists are not familiar with basic software engineering concepts. %
This is in part due to their general lack of formal training in programming and software development.
\textcite{HannayJoErskine2009a} found that over 90\% of scientists learn software development through `informal self study', while \textcite{SegalJudith2004a} mentions that ``[scientists] do not describe themselves as software developers and have little formal education or training in software development''.

HannayJoErskine2009a agrees.
JoppaLucasN2013a agrees.

This lack of training is not in-and-of-itself a problem. %
After all, academic scientists are required to be ``do-it-yourself''ers in many contexts for which they receive no formal training: everything from plumbing and electrical engineering to human resources and project management. %
So why pay particular attention to software development practices and skills? %

One reason to pay special attention to software is that software mistakes can have particularly dramatic consequences. %
As experimentalists in the physical sciences, we are often tempted by the intuition that small mistakes lead to small errors.
%
These intuitions do not typically apply to software: software is ``brittle'', and small bugs can have huge consequences. %
In his 2015 opinion article ``Rampant software errors may undermine scientific results'', David A. W. Soergel attempts to estimate how many errors there might be in scientific software, and how far-reaching the consequences might be. %
Quoting Soergel:
\begin{dquote}
  ...software is profoundly brittle: ``small'' bugs commonly have unbounded error propagation. %
  A sign error, a missing semicolon, an off-by-one error in matching up two columns of data, etc. will render the results complete noise. %
  It is rare that a software bug would alter a small proportion of the data by a small amount. %
  More likely, it systematically alters every data point, or occurs in some downstream aggregate step with effectively global consequences. %
  In general, software errors produce outcomes that are inaccurate, not merely imprecise. %
\end{dquote}

On a more positive note, better software development practices may be ``low-hanging fruit'' that can greatly improve researchers' lives without huge amounts of investment. %
Great software makes science easier, faster, and often of higher quality. %
And making great software isn't necessarily harder than the development practices that scientists are following today; indeed, sometimes it is easier to follow best practices. %

In the United States, funding agencies have recognized the crucial role that software plays in science. %
The National Science Foundation has a long-running ``Software Infrastructure for Sustained Innovation'' (SI$^2$) program, which endeavors to take a ``leadership role in providing software as enabling infrastructure for science and engineering research'' [CITE https://www.nsf.gov/pubs/2012/nsf12113/nsf12113.pdf].
% https://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503489

\section{Challenges in scientific software development}  % ========================================

Software development ``by-and-for'' scientists poses unique challenges. %

\subsection{Extensibility}  % ---------------------------------------------------------------------

Many traditional software development paradigms demand an upfront articulation of goals and requirements. %
This allows the developers to carefully design their software, even before a single line of code is written. %
In her seminal 2005 case study, \textcite{SegalJudith2005a} describes a collaboration between a team of researchers and a contracted team of software engineers. %

\begin{dquote}
  Unlike traditional commercial software developers, but very much like developers in open source projects or startups, scientific programmers usually don't get their requirements from customers, and their requirements are rarely frozen.
  In fact, scientists often can't know what their programs should do next until the current version has produced some results.
\end{dquote}

\subsection{Testing}  % ---------------------------------------------------------------------------

PrabhuPrakash2011a---lots of good stuff under ``Scientists do not rigorously test their programs''

\subsection{Lifetime}  % --------------------------------------------------------------------------

PrabhuPrakash2011a---subsection ``long history of software development''

Challenges with portability, and updating to ``modern standards''.

\subsection{Optimization}  % ----------------------------------------------------------------------

PrabhuPrakash2011a: ``scientists do not optimize for the common case'', ``scientists are unaware of parallelization paradigms''

\subsection{Maintenance}  % -----------------------------------------------------------------------

Scientific software, especially software maintained by graduate students, tends to be very hard to maintain.
%
This problem is compounded by the long lifetime of such software, by poorly defined requirements, and by a lack of documentation and testing. %
Oftentimes, scientific software ends up being a mess of layer upon layer of incongruent pieces written by generation upon generation of students. %
Worse, software is sometimes abandoned or left untouched to become a crucial but arcane component of a scientific research project. %

\section{Good-enough practices}  % ================================================================

In their [...] perspective ``Good enough practices in scientific computing'' (from which this section gets its name), [WILSON ET AL] describe a set of techniques that, in their words, ``every researcher can and should consider adopting''. %

\subsection{Write clearly and document often}  % --------------------------------------------------

Let the computer do the work...

Write programs for people, not computers. %

\subsection{Do not reinvent}  % -------------------------------------------------------------------

Don't repeat yourself, or others (we built on top of scipy, hdf5).

\subsection{Avoid premature optimization}  % ------------------------------------------------------

Write first, optimize later.

\subsection{Data formats}  % ----------------------------------------------------------------------

% HDF5

% SELF-DESCRIBING DATA

% OBJECT ORIENTED PROGRAMMING

\subsection{Collaboration and version control}  % -------------------------------------------------

Plan for mistakes / use testing.

Document, document, document.

Collaborate.

Code review...

Issues...

Make incremental changes...
% SOURCE CONTROL AND VERSIONING

\subsection{Licensing and distribution}  % --------------------------------------------------------

% LICENSING AND DISTRIBUTION

\section{Object oriented programming}  % ----------------------------------------------------------

The work in this dissertation makes heavy use of object oriented programming, so some very basic introduction to the concept seems warranted. %

Object oriented programming (OOP) is a \emph{programming paradigm}. %
Other popular paradigms are procedural programming and functional programming. %
Python is a popular programming language which allows for OOP. %
This section will discuss OOP in the context of a Python implementation. %

The basic idea of OOP is defining object types (classes) that are self-contained. %
These classes define pieces of associated data (attributes) and associated procedures (methods) within themselves. %
Once the class is defined, instances of that class are created. %
Instances, as the name implies, are specific ``concrete occurrences'' of a given class. %
The classic example: \python{Dog} is a class, \python{fido}, \python{spot}, and \python{duke} are three dogs---three instances of the dog class. %

OOP is easier to demonstrate than explain, so let's have some fun with some working Python examples. %
First, we will define a class. %
\begin{codefragment}{python}
class Person:
    """A person with (optional) food preferences."""

    def __init__(self, name, favorite_food=None, hated_food=None):
        # attributes: data stored on each instance
        self.name = name
        self.favorite_food = favorite_food
        self.hated_food = hated_food

    def react_to(self, food):
        # method: a procedure with access to the instance's own attributes
        if food == self.favorite_food:
            return 'yum! my favorite'
        elif food == self.hated_food:
            return 'gross---no thank you'
        else:
            return 'meh'
\end{codefragment}
Now we can make some instances of that class, and access their attributes and methods.
%
\begin{codefragment}{python}
>>> mary = Person(name='Mary', favorite_food='pizza', hated_food='falafel')
>>> jane = Person(name='Jane', favorite_food='salad')
>>> mary.react_to('falafel')
'gross---no thank you'
>>> jane.react_to('salad')
'yum! my favorite'
>>> mary.favorite_food
'pizza'
>>> jane.react_to(mary.favorite_food)
'meh'
\end{codefragment}

We can already begin to see how powerful this approach is. %
Instances of \python{Person} contain their own attributes and methods. %
Instances can be interacted with in complex or simple ways. %
The attributes \python{favorite_food} and \python{hated_food} are fully accessible, but need not be directly dealt with when using the \python{react_to} method. %
When using OOP, one can hide complexity while still being able to access everything. %

One of the most powerful patterns within OOP is \emph{inheritance}. %
Inheritance is a special relationship between classes. %
When a class (the child) is made to inherit from another class (the parent), the child automatically receives all of the attributes and methods of the parent. %
The child class, then, can benefit from all of the behaviors enabled by its parent while still maintaining its own identity where needed.
The inheritance pattern makes it very easy to cleanly define expectations and shared structure throughout a large piece of software without repeating functionality. %

% TODO: more exposition on inheritance, perhaps including an example

OOP is a deep subject with many patterns and concepts behind it. %
There are many places to read further [CITES].
I recommend ``The Quarks of Object-Oriented Development'', by \textcite{ArmstrongDeborahJ2006a}. %

\section{Hierarchical data format}  % -------------------------------------------------------------

One of the particularly important challenges in MR-CMDS is data storage. %
MR-CMDS datasets are multi-dimensional, and the particular dimensions are different from experiment to experiment.
%
Historically, the Wright Group has stored data as ``flattened'' arrays in plain text, where each column corresponds to one of the scannable pieces of hardware or one of the sensors in the experiment. %
The simplicity and portability of these formats are fantastic, but they do not scale well with increasingly large and higher-dimensional data. %

% TODO: justify further why flattened UTF8 files are a bad idea

Hierarchical data files are an alternative strategy that scales much better with large and high-dimensional data. %

The earliest of these formats was CDF \cite{TreinshLloydA1987a}, which was designed to support ``random access to data, so that efficient access of small portions or large data files would be possible''. %

NetCDF \cite{RewRuss1990a} followed, offering more portability, named dimensions, metadata, and ``hyperslab'' access. %

FITS is used by the astronomy community, with a focus on backwards compatibility. \cite{WellsDC1981a} %
% CITE https://fits.gsfc.nasa.gov/
% CONSIDER CITING https://fits.gsfc.nasa.gov/rfc4047.txt

I have chosen to build off of HDF5. %

\section{Scientific Python}  % --------------------------------------------------------------------

NumPy, SciPy

% TODO: add MillmanKJarrod2011a (Python for Scientists and Engineers)
% TODO: add vanderWaltStefan2011a (The NumPy Array: A Structure for Efficient Numerical Computation)
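
As a small illustration of the array handling that NumPy provides, the sketch below contrasts the ``flattened'' column layout described in the previous section with a multidimensional array. %
It is only a sketch under stated assumptions: the axis names \python{w1} and \python{d1}, their values, and the signal are all made up for illustration and do not correspond to any real experiment or to the storage format actually used in this work. %

```python
import numpy as np

# Hypothetical two-axis scan (names and values are invented for illustration).
w1 = np.linspace(1200.0, 1400.0, 3)  # first scanned axis, 3 points
d1 = np.linspace(-1.0, 2.0, 4)       # second scanned axis, 4 points

# A made-up signal defined on the full (3, 4) grid.
signal = np.outer(w1, np.ones_like(d1)) + d1[np.newaxis, :]

# "Flattened" layout: one row per acquired point, one column per axis or sensor.
w1_grid, d1_grid = np.meshgrid(w1, d1, indexing="ij")
flat = np.column_stack([w1_grid.ravel(), d1_grid.ravel(), signal.ravel()])

# Rebuilding the multidimensional array requires that the shape be known
# from somewhere other than the flattened columns themselves.
recovered = flat[:, 2].reshape(w1.size, d1.size)
```

In the flattened layout, every axis value is repeated for every point, and the shape of the scan must be tracked separately. %
A format that stores the array together with its shape avoids that bookkeeping, which is part of the appeal of the hierarchical formats discussed above. %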