Many software systems for scientific research make it difficult for others to reproduce the results, or to reproduce them with their own revisions. If a person can’t reproduce a published result, they simply have to take the scientist’s word that the computations were as described in, say, a journal or conference paper.
It’s not enough to have the code itself in a version control system such as Git. The original code, parameters, and data sources that were used must be available to reproduce the results precisely. In addition, descriptive notes about the computations must be available.
Git can take care of much of the above, but only if the code and documents are organized. I would say that it’s better to have a separate document for each major run than a single document that grows from version to version.
As of August 2012, my code is organized as a collection of Eclipse projects (Java 1.6.0_33), together with a rich supporting infrastructure, running on Mac OS X Mountain Lion.
I’ve written a lot of code since I retired from Northeastern University at the end of June, 2011. Much of it has been a clean re-imagining and from-scratch redevelopment of the NLP system(s) that were developed with the help of many students when I was at Northeastern. It is not strictly a clean-room version; some code was reused or adapted from older code.
Strategies for infrastructure that supports/documents runs
- Code organized into Suites that run subsidiary Tasks
e.g., a Suite could create a pipeline of tasks
- This suggests that Suites be written first, with Task stubs.
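The Suite/Task pattern above might be sketched as follows; the interface and class names here are hypothetical illustrations, not the actual classes:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a Suite assembles and runs a pipeline of
// subsidiary Tasks, written Suite-first with Task stubs.
interface Task {
    String name();
    void run();
}

class Suite {
    private final List<Task> pipeline;

    Suite(Task... tasks) {
        this.pipeline = Arrays.asList(tasks);
    }

    // Run each subsidiary Task in pipeline order.
    void runAll() {
        for (Task t : pipeline) {
            System.out.println("Running task: " + t.name());
            t.run();
        }
    }
}
```

Writing the Suite first fixes the pipeline’s structure early; each stub Task need only report its name until its real body is filled in.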
- A unique ID, UID, for each major production run
- Example: 000042.00_____RPF
- The Git tag for the code/files used in a run is the UID.
- Files produced have names that include the UID
Example: File 000042.00_____RPF_PlainToSnpCollec_V0.3.meta
- The leading zeroes accommodate sorting: alphabetical order matches numeric order.
- The number to the right of the decimal point allows multiple runs with the same basic code, e.g., with different parameters.
- RPF is the user ID (space for more characters reserved by underscores).
- PlainToSnpCollec is the human-readable portion of the title.
- V0.3 is the version number of the related Suite class.
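The naming scheme above can be sketched as a small formatting helper; the field widths (6-digit run number, 2-digit sub-run number, 8-character underscore-padded user ID) are inferred from the example, not confirmed:

```java
// Hypothetical sketch of building a UID and a UID-based file name.
class UidNames {
    // e.g., uid(42, 0, "RPF") -> "000042.00_____RPF"
    static String uid(int run, int subRun, String userId) {
        // Pad the user ID to 8 characters with leading underscores,
        // reserving space for longer IDs.
        String padded = String.format("%8s", userId).replace(' ', '_');
        return String.format("%06d.%02d%s", run, subRun, padded);
    }

    static String fileName(String uid, String title, String suiteVersion) {
        return uid + "_" + title + "_V" + suiteVersion + ".meta";
    }

    public static void main(String[] args) {
        String uid = uid(42, 0, "RPF");
        System.out.println(fileName(uid, "PlainToSnpCollec", "0.3"));
        // prints 000042.00_____RPF_PlainToSnpCollec_V0.3.meta
    }
}
```

The same UID string then serves directly as the Git tag for the run.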
- Some use of JUnit – sometimes it’s just too awkward to arrange – this needs more thought
- Data – My own simple and efficient approach to column stores, based on the DataOutput/DataInput interfaces; the corresponding streams can read and write sequences that contain a mix of Java primitive types.
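A minimal sketch of that read/write pattern, with illustrative values: one stream carries a mixed sequence of primitives, and the reads must mirror the writes in type and order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Round-trip a mixed sequence of Java primitives through the
// DataOutput/DataInput stream implementations.
class MixedPrimitivesDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(42);        // e.g., a row count
        out.writeDouble(0.95);   // e.g., a score
        out.writeUTF("snp");     // e.g., a short label
        out.close();

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(in.readInt());     // 42
        System.out.println(in.readDouble());  // 0.95
        System.out.println(in.readUTF());     // snp
        in.close();
    }
}
```

In practice the streams would wrap files rather than in-memory buffers, but the interface contract is the same.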
- Metadata – The code, by itself, is not enough. Therefore, I use
- A log file for each user with a few lines per Suite run
- Parameter files, usually Java Properties files, that may include parameters and data-source filepaths, where the files may result from earlier runs or be produced by the current run.
- However, my recent successful experiments with JSON (using Google’s Gson) mean that I can go beyond the flat structure of Properties files. Inner classes create nesting/depth. I’ve only needed toJson(obj) and fromJson(string).
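The flat Properties style can be sketched as follows; the key names and file path are hypothetical, for illustration only:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch: a flat parameter file loaded as java.util.Properties.
class ParamsDemo {
    static Properties load(String paramText) throws IOException {
        Properties params = new Properties();
        params.load(new StringReader(paramText));
        return params;
    }

    public static void main(String[] args) throws IOException {
        Properties params = load(
                "maxSentences=5000\n" +
                // A data-source filepath; it could point at a file
                // produced by an earlier run.
                "sourceFile=data/earlierRunOutput.dat\n");
        int maxSentences = Integer.parseInt(params.getProperty("maxSentences"));
        System.out.println(maxSentences + " " + params.getProperty("sourceFile"));
    }
}
```

With Gson, the same parameters instead become fields of a (possibly nested) parameter class, serialized with toJson(obj) and restored with fromJson(string).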
- Descriptive notes, high-level or UID-related.
- Extensive Javadoc, including package Javadocs
- Provenance – Important – Not handled explicitly, but the metadata allows the provenance of any result to be reconstructed. Ref: the Open Provenance Model and the subsequent W3C Provenance Working Group.
My theoretical stance and practical approach to NLP, in particular for the full-text Biology papers I analyze, is an entirely different topic. It will only come to light when I’ve published papers describing the approach and results. It’s a huge project for one person, so publications are a few years off.