\chapter{Software} \label{cha:sof}

\begin{dquote}
  The following guidelines are to be used in the documentation of all software developed in the
  Wright group for the IBM 9000 computer.  %
  These rules have arisen as a necessary consequence of the group's programming philosophy of
  writing software in the form of units which can be readily shared among a number of
  programmers.  %
  The approach outlined here should help to avoid some of the confusion otherwise produced by
  several persons simultaneously developing and modifying shared software.  %

  % Roger Carlson, Appendix 2.3, Software Development Guidelines
  \dsignature{Roger Carlson, ``Software Development Guidelines'' (1988)
    \cite{CarlsonRogerJohn1988a}}
\end{dquote}

\clearpage

\section{Science needs software}  % ===============================================================

Cutting-edge science increasingly relies on custom software.  %
Scientific software enables scientists to collect, analyze, and model results in ways that would
otherwise be wholly impossible.  %

How does scientific software get made?  %
Who makes it, and what is the quality of that product?  %
Much has been written about these questions.  %
To my knowledge, there are at least eight case studies and surveys dedicated to how scientists
develop and use scientific software. \cite{CardDavidN1986a, SeamanCarolynB1997a,
  MullerMatthiasM2001a, SegalJudith2004a, SegalJudith2005a, CarverJeffreyC2007a,
  HannayJoErskine2009a, PrabhuPrakash2011a}  %
Although they focus on different disciplines, and were published at different times, these articles
present a remarkably consistent perspective on what challenges tend to arise when developing
software ``by and for'' scientists.  %

Scientists do more than just use software: they develop it.  %
In their 2008 survey, \textcite{HannayJoErskine2009a} showed just how much of the work of science
comes down to software development:  %
\begin{ditemize}
	\item 84.3\% of surveyed scientists state that developing scientific software is important or
    very important for their own research.
	\item 91.2\% of surveyed scientists state that using scientific software is important or very
    important for their own research.
	\item On average, scientists spend approximately 40\% of their work time using scientific
    software.
	\item On average, scientists spend approximately 30\% of their work time developing scientific
    software.
\end{ditemize}
\textcite{PrabhuPrakash2011a} reported similar results in their 2011 survey, finding that 35\% of
research time is spent programming and developing software.  %
Most of that time (57\%) is spent \emph{``finding and fixing errors in their programs''}.  %
The amount of software work done for each scientific project is very heterogeneous, with projects
ranging between 5\% and 95\% software development time.  %
To me, the averages reported by \textcite{HannayJoErskine2009a} and \textcite{PrabhuPrakash2011a}
seem roughly correct for the average Wright Group member.  %

Despite the importance of software to science and scientists, most scientists are not familiar with
basic software engineering concepts.  %
This is in part due to their general lack of formal training in programming and software
development. \textcite{HannayJoErskine2009a} found that over 90\% of scientists learn software
development through `informal self study', while \textcite{SegalJudith2004a} mentions that
\emph{``[scientists] do not describe themselves as software developers and have little formal
  education or training in software development''}.

This lack of training is not in-and-of-itself a problem.  %
After all, academic scientists are required to be ``do-it-yourself''ers in many contexts for which
they receive no formal training: everything from plumbing and electrical engineering to human
resources and project management.  %
So why pay particular attention to software development practices and skills?  %

One reason to pay special attention to software is that software mistakes can have particularly
dramatic consequences.  %
As experimentalists in the physical sciences, we are often tempted by the intuition that small
mistakes lead to small errors.  %
These intuitions do not typically apply to software---software is ``brittle'' and small bugs have
huge consequences.  %
In his 2015 opinion article ``Rampant software errors may undermine scientific results'', David A.
W. Soergel attempts to estimate how many errors there might be in scientific software, and how far
reaching the consequences might be.  %
Quoting Soergel:

\begin{dquote}
  ...software is profoundly brittle: ``small'' bugs commonly have unbounded error propagation.  %
  A sign error, a missing semicolon, an off-by-one error in matching up two columns of data, etc.
  will render the results complete noise.  %
  It is rare that a software bug would alter a small proportion of the data by a small amount.  %
  More likely, it systematically alters every data point, or occurs in some downstream aggregate
  step with effectively global consequences.  %
  In general, software errors produce outcomes that are inaccurate, not merely imprecise.  %

\end{dquote}

On a more positive note, better software development practices may be ``low-hanging fruit'' that
can greatly improve researchers' lives without huge amounts of investment.  %
Great software makes science easier, faster, and often of higher quality.  %
And making great software isn't necessarily harder than the development practices that scientists
are following today---indeed sometimes it is easier to follow best practices.  % TODO: cite wilson

In the United States, funding agencies have recognized the crucial role that software plays in
science.  %
The National Science Foundation has a long-running ``Software Infrastructure for Sustained
Innovation'' (SI$^2$) program, which endeavors to take a \emph{``leadership role in providing software as enabling infrastructure for science and engineering research''}. \cite{SI2}  %
Other funding agencies have similar projects.  %

\section{Challenges in scientific software development}  % ========================================

Software development ``by-and-for'' scientists poses unique challenges.  %
In this section, I attempt to summarize the literature about these challenges, with a focus on
those that I have found most relevant.  %

\textbf{``End-user developers.''} \cite{SegalJudith2005a, HannayJoErskine2009a, JoppaLucasN2013a}
% TODO: see Joppa ref 17, 21 22
Typically the developers of scientific software are not trained software developers.  %
This is perfectly appropriate, because scientific software development typically requires a large
amount of domain-specific knowledge that only ``end-users'' possess.  %
Software development practices may not be valued in a scientific environment.  %
End-users may lack the skill and knowledge required to develop high quality, maintainable
software.  %
They may not be aware of best practices in software development.  %
They focus on feature additions and neglect documentation and maintenance.  %

\textbf{Shifting goals.} \cite{SegalJudith2005a, CarverJeffreyC2007a, HannayJoErskine2009a,
  PrabhuPrakash2011a}
Traditional software development paradigms typically demand an upfront articulation of goals and
requirements.  %
This allows the developers to carefully design their software, even before a single line of code is
written.  %
In her seminal 2005 case study \textcite{SegalJudith2005a} describes a collaboration between a team
of researchers and a contracted team of software engineers.  %

\begin{dquote}

  Unlike traditional commercial software developers, but very much like developers in open source
  projects or startups, scientific programmers usually don't get their requirements from customers,
  and their requirements are rarely frozen.
  In fact, scientists often can't know what their programs should do next until the current version
  has produced some results.

\end{dquote}

Scientific software is \emph{explorative}, and it needs to be flexible and extendable.  %
Scientific software developers cannot know what will be required before they set out to try.  %
This is probably the most fundamental challenge in such projects, and a big part of why science
cannot simply ``contract out'' a large part of its software development needs.  %
Sometimes, a scientific problem is worked out through the iterative process of developing software
to solve it.  %

\textbf{Maintenance.} \cite{CarverJeffreyC2007a, PrabhuPrakash2011a}
Scientific software is famously hard to maintain.  %
Graduate students graduate, and institutional knowledge about the internal workings of software
projects is diminished over time.  %
This problem is compounded by the long lifetime of some software, the poorly defined
requirements, and lack of documentation and testing.  %
Often, scientific software ends up being a mess of layer upon layer of incongruent pieces
written by generation upon generation of students.  %
Worse, software is sometimes abandoned or left untouched to become a crucial but arcane component
of a scientific research project.  %

\textbf{Lack of testing.} \cite{SandersRebecca2008a, PrabhuPrakash2011a, JoppaLucasN2013a}
Testing is a huge part of software development practices, but many researchers do not engage in
sufficient testing of their software.  %
Without testing, even small software projects can rapidly ``get out of hand''---they can become
unsustainable and unmaintainable.  %
Especially for domain-specific computational software, determining the ``correct outcome'' to test
against is often infeasible.  %
Software is not typically peer reviewed, so a lack of software testing is often a weak link in the
loop of critical self assessment that science depends upon.  %
On the positive side, testing can be an easy-to-add development practice with huge rewards.  %
Well-written tests can be a programmer's best friend: helping her to ensure that her code has met
all of the given requirements.  %
This allows programmers to optimize without worrying about breaking crucial components of their
software.  %

\textbf{Struggles with optimization.} \cite{PrabhuPrakash2011a}
Sometimes, a scientific application requires performant code.  %
Scientists typically struggle to write such code.  %
They may struggle with parallelization paradigms, or they may not understand what is limiting the
speed of their software.  %
They may not have good intuitions about how long certain operations should take, or what patterns
could be used to speed up execution.  %
Scientists typically do not use profiling tools which help them see which parts of their program
would benefit most from optimization.  %

\section{Good-enough practices}  % ================================================================

In their 2017 perspective, ``Good enough practices in scientific computing'', (from which this
section gets its name) \textcite{WilsonGreg2017a} describe a set of techniques that, in their
words, \emph{``every researcher can and should consider adopting''}.  %
In this section, I attempt to very quickly summarize my personal perspective on what makes good
software development good---with citations to literature that supports each idea.  %
These practices are not, generally, \emph{extra work}.  %
In fact, many of them save massive amounts of time and effort in the long \emph{and} short run,
when properly applied. \cite{WilsonGreg2006a}  %

\textbf{Do not reinvent.} \cite{WilsonGreg2017a}  %
Before you sit down and implement a piece of software, stop!  %
First you should try hard to find a library that already has what you need.  %
You'll often surprise yourself with what you can find.  %
Search the package repository for your language, such as PyPI \cite{PyPI}, MATLAB File Exchange
\cite{FileExchange} or CRAN \cite{CRAN}.  %
Even if there is not a full solution to your problem out there, there is almost certainly a
solution to some part of it.  %
Much better to have a dependency than a custom implementation.  %
Make your dependencies explicit, in machine readable ways where possible.  %

\textbf{Do not duplicate.} \cite{WilsonGreg2017a}  %
If you do need to write some software, make sure that you do not duplicate code within your own
work.  %
Instead of writing the same few lines of code again and again with small tweaks, write a function
that accepts a set of arguments.  %
If you are doing the same operation in many different contexts, consider defining a library for
that operation that can be imported and shared between your different projects.  %
If your software package grows to contain multiple files, make those files modular.  %
As a general rule, once you have two classes you need multiple files.  %
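As a minimal sketch of replacing repeated lines with a function, consider a unit conversion that
is needed in several places.  %
The function name \python{nm_to_wavenumber} and the values used here are hypothetical, chosen only
for illustration.  %
\begin{codefragment}{python}
# Hypothetical example: the function name and the values are illustrative.

def nm_to_wavenumber(wavelength_nm):
    """Convert a wavelength in nanometers to a wavenumber in cm^-1."""
    return 1e7 / wavelength_nm

# Instead of typing ``1e7 / value`` by hand for every variable,
# call the same function wherever the conversion is needed.
pump = nm_to_wavenumber(800.0)
probe = nm_to_wavenumber(1250.0)
print(pump, probe)
\end{codefragment}
If the conversion ever needs to change, it now changes in exactly one place.  %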

\textbf{Choose good data formats.} \cite{BaxterSusanM2006a, WilsonGreg2017a}  %
Choose a non-proprietary format if at all possible---remember: you yourself might not have access
to the proprietary software in 10 years.  %
Choose plain text if you can.  %
Consider conforming to specifications, such as Tidy Data \cite{WickhamHadley2014a}.  %
If you must, use open binary formats such as HDF5 \cite{FolkMike2011a}.  %
Put as much metadata as you can into the file.  %
Any piece of metadata that can automatically be added by the computer is essentially free---you
might as well do it.  %
Make sure that it is clear what each piece of data means.  %
For tabular data, use headers.  %
Don't forget units.  %
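The following sketch shows one way these recommendations might look for a small plain text table,
using only the Python standard library.  %
The column names, units, and values are hypothetical, and storing metadata in commented lines is
just one possible convention.  %
\begin{codefragment}{python}
import csv
import datetime
import io

# Hypothetical scan: the column names, units, and values are illustrative.
rows = [(1200.0, 0.013), (1210.0, 0.021), (1220.0, 0.034)]

buffer = io.StringIO()  # stands in for an open text file
# Metadata the computer can add for free, stored as commented lines.
buffer.write("# created: {}\n".format(datetime.datetime.now().isoformat()))
buffer.write("# columns: wavelength (nm), signal (V)\n")
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["wavelength_nm", "signal_V"])  # header names carry units
writer.writerows(rows)

text = buffer.getvalue()
print(text)
\end{codefragment}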

\textbf{Use version control.} \cite{BaxterSusanM2006a, WilsonGreg2006a}  %
Version control systems allow programmers to save a software package such that they can always
return to that save point.  %
All of the files in the package are saved together.  %
These systems also allow programmers to see exactly what has changed between each save
point, and since the last save point.  %
This is indispensable when trying to diagnose software problems.  %
In order to use version control as effectively as possible, try to save the package after every
change (feature addition, bugfix, etc).  %
Typically version control is coupled with uploading to a remote server, for example using git with
GitHub \cite{GitHub}, GitLab \cite{GitLab} or git.chem.wisc.edu \cite{git.chem.wisc.edu}, but
version control need not be synonymous with uploading and distribution.  %
Tools like git have a lot of fantastic features beyond simply saving, but those are beyond the
scope of these ``good enough'' recommendations.  %
Also consider defining a version for the software package as a whole.  %
Use semantic versioning (MAJOR.MINOR.PATCH) \cite{SemanticVersioning}, unless there is a strong
reason not to.  %
If the language you are using has a convention for representing the version programmatically, such
as a \python{__version__} attribute in Python, comply with that convention.  %
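As a rough sketch, a package might expose its version as shown here.  %
The version string and the helper \python{parse_version} are hypothetical, and the sketch ignores
the pre-release and build suffixes that semantic versioning also permits.  %
\begin{codefragment}{python}
# Hypothetical package metadata, following the Python convention.
__version__ = "1.4.2"

def parse_version(version):
    """Split a MAJOR.MINOR.PATCH string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

# Tuples compare element by element, so versions order correctly.
# Plain string comparison would not: '1.10.0' < '1.4.2' as strings.
is_newer = parse_version("1.10.0") > parse_version(__version__)
print(is_newer)
\end{codefragment}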

\textbf{Test.} \cite{BaxterSusanM2006a, WilsonGreg2006a, WilsonGreg2017a}  %
As the old saying goes, ``if it's not tested, it's broken''.  %
If you rely on a piece of functionality in your software, consider writing a test that defines that
functionality.  %
In this way, as you make changes you can run your tests to ensure that those changes do not
accidentally break important functionality.  %
Testing sounds difficult, but it's really just about writing simple functions that use your
software to do something, and then asking if the result is correct.  %
If you add tests when you add features or fix bugs, you'll quickly find that you have a lot of
tests that do a good job of defining the expected behavior of your software.  %
Software engineers tend to be dogmatic about testing, but don't worry too much about test coverage
unless your project becomes very important.  %
Distribute test datasets, when appropriate.  %
Remember, your tests can serve double duty as simple minimal examples.  %
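To make this concrete, here is a minimal sketch of two tests for a small function.  %
The function \python{normalize} is a hypothetical stand-in for a piece of your own software.  %
The \python{test_} naming follows the convention used by test runners such as pytest, which is
assumed here rather than required.  %
\begin{codefragment}{python}
# Hypothetical function under test; the name and behavior are illustrative.
def normalize(values):
    """Scale a list of numbers so that the largest value is 1."""
    largest = max(values)
    return [v / largest for v in values]

# A test is just a function that uses the software and checks the result.
def test_normalize_peak_is_one():
    assert max(normalize([2.0, 4.0, 8.0])) == 1.0

def test_normalize_preserves_ratios():
    assert normalize([1.0, 2.0, 4.0]) == [0.25, 0.5, 1.0]

# A test runner would collect these functions automatically;
# here they are simply called directly.
test_normalize_peak_is_one()
test_normalize_preserves_ratios()
\end{codefragment}
If a later change breaks \python{normalize}, one of these assertions fails immediately.  %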

\textbf{Collaborate and share.} \cite{BaxterSusanM2006a, WilsonGreg2017a, BarnesNick2010a}  %
If you are part of a team, consider sharing software and collaborating to create it.  %
Try using practices like code review and issue tracking, but don't feel obligated to use them if it
doesn't make sense for your project.  %
When working as part of a team, making incremental changes and using version control become even
more important.  %
Earlier we mentioned ``do not reinvent''.  %
The other side of that coin is ``if you make something, consider sharing it''.  %
Put your software on an open platform, like GitHub \cite{GitHub}, and mint a DOI.  %
Cite your software, and ask other people who are using your software to do the same.  %
Choose a license early, and choose permissive and commercially compatible unless you 1. know what
you are doing and 2. plan to enforce.  %
Afraid to share because your code needs more polish?  %
If your software is good enough to be used in active scientific research, it's worth sharing.  %
As Nick Barnes says, \emph{``Publish your computer code: it is good enough''}.
\cite{BarnesNick2010a}  %

\textbf{Write human readable code, and document it well.} \cite{WilsonGreg2017a}  %
Let the computer do the work, but write the source code to be read by a human.  %
Give classes, functions, attributes and variables meaningful names.  %
Don't be afraid to be verbose: most programming environments have tab completion, so long names
are not all that hard to type.  %
Try to follow the recommended style for your language, but don't obsess about it.  %

\textbf{Avoid premature optimization.} \cite{WilsonGreg2014a, WilsonGreg2017a}
Don't get pulled into the trap of trying to make things perfect the first time.  %
Software design is typically a very iterative process, and for good reason.  %
This is particularly true in a scientific context, where goals may evolve during the development
process.  %
Write for correctness first, and if it works, consider optimization.  %
If you do need to make your software faster, use profiling tools like cProfile
\cite{PythonProfilers} and SnakeViz \cite{SnakeViz} to empirically determine what operations are
taking the longest, rather than trying to guess or use intuition.  %
Only optimize speed-limiting operations, and stop optimizing once the code runs as quickly as
needed.  %
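A minimal profiling session using cProfile and the standard library \python{pstats} module might
look like the following sketch.  %
The functions being profiled are hypothetical placeholders for real work.  %
\begin{codefragment}{python}
import cProfile
import io
import pstats

# Hypothetical workload: two illustrative functions, one deliberately slow.
def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def fast_sum(n):
    return sum(range(n))

def main():
    slow_sum(200_000)
    fast_sum(200_000)

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Sort by cumulative time and print the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
\end{codefragment}
The report lists each function with its call count and timings, which shows where optimization
effort would actually pay off.  %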

\section{Object oriented programming} \label{sof:sec:oop}  % ======================================

The work in this dissertation makes heavy use of object oriented programming, so some very basic
introduction to the concept seems warranted.  %
Object oriented programming (OOP) is a \emph{programming paradigm}.  %
Other popular paradigms are procedural programming and functional programming.  %
Python is a popular programming language which allows for OOP.  %
This section will discuss OOP in the context of a Python implementation.  %

The basic idea of OOP is defining object types (classes) that are self-contained.  %
These classes define pieces of associated data (attributes) and associated procedures (functions)
within themselves.  %
Once the class is defined, instances of that class can be created.  %
Instances, as the name implies, are just specific ``concrete occurrences'' of a given class.  %
The classic example: \python{Dog} is a class, \python{fido}, \python{spot}, and \python{duke} are
three dogs---three instances of the dog class.  %

OOP is easier to demonstrate than explain, so let's have some fun with some working Python
examples.  %
First, we will define a class.  %
\begin{codefragment}{python, label=sof:lst:person}
class Person():

    def __init__(self, name, favorite_food=None, hated_food=None):
        self.name = name
        self.favorite_food = favorite_food
        self.hated_food = hated_food

    def react_to(self, food):
        if food == self.favorite_food:
            return 'yum! my favorite'
        elif food == self.hated_food:
            return 'gross---no thank you'
        else:
            return 'meh'
\end{codefragment}
Now I can make some instances of that class, and access their attributes and methods.  %
\begin{codefragment}{python}
>>> mary = Person(name='Mary', favorite_food='pizza', hated_food='falafel')
>>> jane = Person(name='Jane', favorite_food='salad')
>>> mary.react_to('falafel')
'gross---no thank you'
>>> jane.react_to('salad')
'yum! my favorite'
>>> mary.favorite_food
'pizza'
>>> jane.react_to(mary.favorite_food)
'meh'
\end{codefragment}
We can already begin to see how powerful this approach is.  %
Instances of \python{Person} contain their own attributes and methods.  %
Instances can be interacted with in complex or simple ways.  %
The attributes \python{favorite_food} and \python{hated_food} are fully accessible, but need not be
directly dealt with when using the \python{react_to} method.  %
When using OOP, one can hide complexity while still being able to access everything.  %

One of the most powerful patterns within OOP is \emph{inheritance}.  %
Inheritance is a special relationship between classes.  %
When a class (the child) is made to inherit from another class (the parent), all of the attributes
and methods of the parent come automatically.  %
The child class, then, can benefit from all of the behaviors enabled by its parent while still
maintaining its own identity where needed.
The inheritance pattern makes it very easy to cleanly define expectations and shared structure
throughout a large piece of software without repeating functionality.  %
As an example, let's create a child of our \python{Person} class, defined in
\autoref{sof:lst:person}:  %
\begin{codefragment}{python}
class GradStudent(Person):

    def react_to(self, food):
        if food == self.hated_food:
            return 'thanks!'
        else:
            return super().react_to(food)
\end{codefragment}
Again, let's make an instance and see how it behaves:
\begin{codefragment}{python}
>>> joe = GradStudent(name='Joe', favorite_food='pizza', hated_food='falafel')
>>> joe.react_to('falafel')
'thanks!'
>>> joe.react_to('pizza')
'yum! my favorite'
\end{codefragment}
\python{joe} has the same preferences as \python{mary}, but we were able to \emph{override} the
behavior of \python{Person} to give \python{joe} a different reaction when faced with his
\python{hated_food} (the joke being that graduate students will eat anything).  %
The wonderful thing is that all of the other behaviors---the \python{__init__} method, the reaction
to \python{favorite_food}---were inherited from \python{Person}.  %
We could even add new functionality to our \python{Person} class, and that functionality would
immediately be available to \python{GradStudent}.  %
In complex programs with trees of inheritance, being able to edit one class to change the behavior
of entire sections of the software is a very useful capability.  %
You can even have inheritance between different packages, allowing programmers to customize or
extend the behavior of existing tools for their specific needs.  %
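The following self-contained sketch illustrates how functionality added to the parent becomes
available to the child.  %
It repeats the two class definitions so that it can run on its own.  %
The \python{introduce} method is a hypothetical addition made only for this illustration.  %
\begin{codefragment}{python}
class Person():

    def __init__(self, name, favorite_food=None, hated_food=None):
        self.name = name
        self.favorite_food = favorite_food
        self.hated_food = hated_food

    def react_to(self, food):
        if food == self.favorite_food:
            return 'yum! my favorite'
        elif food == self.hated_food:
            return 'gross---no thank you'
        else:
            return 'meh'

    # new method (hypothetical), added only to the parent class
    def introduce(self):
        return 'hi, my name is ' + self.name


class GradStudent(Person):

    def react_to(self, food):
        if food == self.hated_food:
            return 'thanks!'
        else:
            return super().react_to(food)


joe = GradStudent(name='Joe', favorite_food='pizza', hated_food='falafel')
print(joe.introduce())  # the child inherits the new method
print(joe.react_to('falafel'))  # the child's own behavior still applies
\end{codefragment}
Nothing in \python{GradStudent} had to change for \python{joe} to gain the new method.  %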

OOP is a deep subject with many patterns and concepts behind it.  %
There are many places to read further.  % BJT: KFS can you give me some citations?
I recommend ``The Quarks of Object-Oriented Development'' by \textcite{ArmstrongDeborahJ2006a}.  %

% TODO: consider discussion of polymorphism

\section{Hierarchical data format} \label{sof:sec:hdf}  % =========================================

One of the particularly important challenges in CMDS is data storage.  %
CMDS datasets are multi-dimensional, and the particular dimensions are different from experiment
to experiment.  %
Historically, the Wright Group has stored data as ``flattened'' arrays in plain text, where each
column corresponds to one of the scannable hardwares or one of the sensors in the experiment.  %
The simplicity and portability of these formats is fantastic, but they do not scale well with
increasingly large and higher-dimensional data.  %

Hierarchical data files are an alternative strategy that scales much better with large and
high-dimensional data.  %
These are binary files that store the arrays directly, not in a flattened way.  %
They can contain multiple arrays, with different data types, in the same file under a well-defined
organizational system.  %
They support arbitrary metadata, integrated into the same hierarchy as the arrays, so making them
self-describing is trivial.  %
While in general plain text is preferred for its simplicity, these file-types are simply superior
for storing CMDS data.  %

To the best of this author's knowledge, the Common Data Format (CDF) was the first
general-purpose self-describing multidimensional array data format. \cite{TreinshLloydA1987a}  %
The engineers at the National Space Science Data Center (a division of NASA) created the CDF.  %
Using this construct, \emph{``scientific softwares at NSSDC ... do not need specific knowledge of the
data with which they are working. This permits users of such systems to apply the same functions
to different sets of data.''}  %
These are exactly the capabilities that CMDS requires.  %

A second-order challenge in CMDS data storage is the size of the arrays.  %
While by no means ``big data'', CMDS data is often awkwardly large: large enough to fill up the
memory of an average modern laptop or desktop computer.  %
CDF also has a unique solution to this problem: use a block structure to allow access to parts of
the array without reading the entire data into memory.  %

Slightly later, NetCDF was introduced \cite{RewRuss1990a}.  %
Very similar to CDF, NetCDF focused on enhancements to portability.  %
Certain metadata conventions were also introduced, including named dimensions.  %
NetCDF remains popular in the aerospace and geoscience communities.  %

The Flexible Image Transport System (FITS) is a similar format with a focus on visualization and
backwards compatibility. \cite{FITS, WellsDC1981a}  %
FITS is still popular in the astronomy community.  %

Today, these hierarchical data formats have gathered under the umbrella of the HDF5 format, built
and maintained by the HDF Group. \cite{FolkMike2011a}  %
This format has all of the advantages of FITS, CDF, and NetCDF.  %
It can support arbitrary datatypes and is optimized to quickly process large and complex
datasets.  %
In Python, HDF5 is supported primarily through the h5py package. \cite{h5py}  %
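As a minimal sketch of these ideas, the following example writes a small array with metadata to
an HDF5 file and then reads back only part of it.  %
It assumes that h5py and NumPy are installed.  %
The file name, group and dataset names, units, and array contents are all hypothetical.  %
\begin{codefragment}{python}
import os
import tempfile

import h5py
import numpy as np

# Hypothetical two-dimensional scan; names, shape, and units are illustrative.
signal = np.arange(12, dtype=float).reshape(3, 4)
path = os.path.join(tempfile.mkdtemp(), "scan.h5")

with h5py.File(path, "w") as f:
    group = f.create_group("scan")
    # Chunked storage lets parts of the array be read independently.
    dset = group.create_dataset("signal", data=signal, chunks=(1, 4))
    # Metadata lives in the same hierarchy as the array it describes.
    dset.attrs["units"] = "V"
    group.attrs["description"] = "illustrative scan"

with h5py.File(path, "r") as f:
    dset = f["scan/signal"]
    shape = dset.shape  # known without loading the data
    units = dset.attrs["units"]
    first_row = dset[0, :]  # only the requested slice is loaded

print(shape, units, first_row)
\end{codefragment}
The array keeps its shape on disk, and its units travel with it in the same file.  %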

Many scientific disciplines have built custom file formats on top of the HDF5 standard.  %
These include biological imaging \cite{DoughertyMatthewT2009a}, scanning transmission X-ray
microscopy \cite{WattsBenjamin2016a}, scalable nucleotide tallies \cite{PylPaulTheodor2014a}, and
even ultrasonic concrete data \cite{PrinceLuke2017a}.

\section{Scientific Python}  % --------------------------------------------------------------------

SciPy is a collection of \emph{``open-source software for mathematics, science, and engineering.''}
\cite{OliphantTravisE2007a, MillmanJarrodK2011a}  %
SciPy was an absolutely essential component of this dissertation and the work it describes.  %
There are many packages under the SciPy umbrella.  %
NumPy is a very powerful and fast package for working with multidimensional arrays.
\cite{OliphantTravisE2006a}  %
The SciPy library contains a vast number of scientific computing tools, including many mathematical
operations that this work depends on. \cite{SciPy}  %
Matplotlib is a beautiful visualization package for 1, 2, and 3D plotting.
\cite{HunterJohnD2007a}  %