NLP software: Versioning, metadata, provenance


Many software systems for scientific research make it difficult for others to reproduce the results, or to reproduce them with their own revisions.  If a person can’t reproduce a published result, they simply have to take the scientist’s word that the computations were as described in, say, a journal or conference paper.

It’s not enough to have the code itself in a version control system such as Git.  The original code used, the parameters, and the data sources must all be available to precisely reproduce the results. In addition, descriptive notes about the computations must be available.

Git can take care of much of the above, but only if the code and documents are organized.  I would say that it’s better to have separate documents for each major run that was done, rather than single documents that grow from version to version.

As of August 2012, my code is organized as a collection of Eclipse projects (Java 1.6.0_33) plus a rich supporting infrastructure, on Mac OS X Mountain Lion.

I’ve written a lot of code since I retired from Northeastern University at the end of June 2011.  Much of it has been a clean re-imagining and from-scratch redevelopment of the NLP system(s) that were developed with the help of many students when I was at Northeastern. It is not strictly a clean-room version; some code was reused or adapted from older code.

Strategies for infrastructure that supports/documents runs

  • Code organized into Suites that run subsidiary Tasks;
    e.g., a Suite could create a pipeline of Tasks
  • This suggests that Suites be written first, with Task stubs.
  • A unique ID, UID, for each major production run
  • Example:  000042.00_____RPF
  • The Git tag for the code/files used in a run is the UID.
  • Files produced have names that include the UID
    Example: File 000042.00_____RPF_PlainToSnpCollec_V0.3.meta

    • The leading zeroes accommodate sorting.
    • Number to the right of the decimal point allows multiple runs with same basic code, e.g., with different parameters.
    • RPF is the user ID (space for more characters reserved by underscores).
    • PlainToSnpCollec is the human readable portion of the title.
    • V0.3 is the version number of the related Suite class.
  • Some use of JUnit – sometimes just too awkward to arrange – needs thought
  • Data – My own simple and efficient approach to column stores, based on the DataOutput/DataInput interfaces; the corresponding streams can read and write sequences that contain a mix of Java primitive types.
  • Metadata – The code, by itself, is not enough. Therefore, I use
    • A log file for each user with a few lines per Suite run
    • Parameter files, usually Java Properties files, that hold parameters and the file paths of data sources, whether those files come from earlier runs or are produced by the current run.
    • However, my recent successful experiments with JSON (using Google’s Gson) mean that I can go beyond the flat structure of properties files.  Inner classes create nesting/depth.  I’ve only needed toJson(obj) and fromJson(string, class).
    • Descriptive notes, high level, or UID-related.
    • Extensive Javadoc, including package Javadocs
  • Provenance – Important – Not handled explicitly, but the metadata allows the provenance of any result to be reconstructed. Ref: The Open Provenance Model and the subsequent W3C Provenance Working Group.
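
The UID and file-name conventions in the list above can be sketched in a few lines of Java. This is a minimal sketch, not my actual code: the class RunUid, its method names, and the 8-character user field (inferred from the five underscores plus "RPF" in the example) are assumptions made for illustration.

```java
import java.util.Locale;

final class RunUid {
    // Assumed width of the user field: "_____RPF" in the example is 8 characters.
    static final int USER_FIELD_WIDTH = 8;

    /** Builds a UID such as 000042.00_____RPF from its three parts. */
    static String format(int run, int subRun, String userId) {
        if (run < 0 || run > 999999) {
            throw new IllegalArgumentException("run out of range: " + run);
        }
        if (subRun < 0 || subRun > 99) {
            throw new IllegalArgumentException("subRun out of range: " + subRun);
        }
        if (userId.isEmpty() || userId.length() > USER_FIELD_WIDTH || userId.indexOf('_') >= 0) {
            throw new IllegalArgumentException("bad user ID: " + userId);
        }
        StringBuilder sb = new StringBuilder();
        // Zero-padded numbers keep a plain name sort in run order.
        sb.append(String.format(Locale.ROOT, "%06d.%02d", run, subRun));
        // Underscores reserve room for longer user IDs.
        for (int i = userId.length(); i < USER_FIELD_WIDTH; i++) {
            sb.append('_');
        }
        sb.append(userId);
        return sb.toString();
    }

    /** Builds a file name: UID, human-readable title, Suite version, extension. */
    static String fileName(String uid, String title, String suiteVersion, String extension) {
        return uid + "_" + title + "_V" + suiteVersion + "." + extension;
    }

    public static void main(String[] args) {
        String uid = format(42, 0, "RPF");
        System.out.println(uid);
        System.out.println(fileName(uid, "PlainToSnpCollec", "0.3", "meta"));
    }
}
```

Run as written, this prints the UID and the .meta file name used as examples in the list. The same UID string is also usable as the Git tag name for the run.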

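For the Data item in the list above, here is a minimal sketch of writing and reading columns of mixed primitive types through the DataOutput/DataInput interfaces. It is not my actual column store: the class ColumnIo, the column names, and the layout (name, count, values) are hypothetical choices for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

final class ColumnIo {
    /** Writes one int column as: name, value count, then the values. */
    static void writeInts(DataOutput out, String name, int[] values) throws IOException {
        out.writeUTF(name);
        out.writeInt(values.length);
        for (int i = 0; i < values.length; i++) {
            out.writeInt(values[i]);
        }
    }

    /** Reads an int column written by writeInts, checking its name. */
    static int[] readInts(DataInput in, String expectedName) throws IOException {
        checkName(in.readUTF(), expectedName);
        int[] values = new int[in.readInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.readInt();
        }
        return values;
    }

    /** Writes one double column in the same layout as writeInts. */
    static void writeDoubles(DataOutput out, String name, double[] values) throws IOException {
        out.writeUTF(name);
        out.writeInt(values.length);
        for (int i = 0; i < values.length; i++) {
            out.writeDouble(values[i]);
        }
    }

    /** Reads a double column written by writeDoubles, checking its name. */
    static double[] readDoubles(DataInput in, String expectedName) throws IOException {
        checkName(in.readUTF(), expectedName);
        double[] values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.readDouble();
        }
        return values;
    }

    private static void checkName(String actual, String expected) throws IOException {
        if (!actual.equals(expected)) {
            throw new IOException("expected column " + expected + " but found " + actual);
        }
    }

    public static void main(String[] args) throws IOException {
        // Round trip through memory; a FileOutputStream would work the same way.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        writeInts(out, "tokenOffset", new int[] {0, 5, 11});
        writeDoubles(out, "score", new double[] {0.5, 1.25, 2.0});
        out.flush();

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(Arrays.toString(readInts(in, "tokenOffset")));
        System.out.println(Arrays.toString(readDoubles(in, "score")));
    }
}
```

Because the methods take the interfaces rather than concrete streams, the same code works with any DataOutput/DataInput implementation, RandomAccessFile included.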

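The parameter files mentioned in the list above can be handled with java.util.Properties alone. The following is a sketch under stated assumptions: the class RunParams, the property keys, and the file path are hypothetical, and the text is kept in a String here only so that the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Properties;

final class RunParams {
    /** Serializes the parameters in Java Properties format. */
    static String save(Properties params, String comment) throws IOException {
        StringWriter writer = new StringWriter();
        params.store(writer, comment);
        return writer.toString();
    }

    /** Parses parameters previously written by save. */
    static Properties load(String text) throws IOException {
        Properties params = new Properties();
        params.load(new StringReader(text));
        return params;
    }

    /** Reads an integer parameter, falling back to a default when it is absent. */
    static int getInt(Properties params, String key, int defaultValue) {
        String value = params.getProperty(key);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) throws IOException {
        Properties params = new Properties();
        params.setProperty("run.uid", "000042.00_____RPF");
        params.setProperty("input.path", "data/plain/corpus.txt"); // hypothetical path
        params.setProperty("min.token.count", "3");                // hypothetical parameter

        String text = save(params, "Parameters for run 000042.00_____RPF");
        Properties reloaded = load(text);
        System.out.println(reloaded.getProperty("run.uid"));
        System.out.println(getInt(reloaded, "min.token.count", 1));
    }
}
```

Storing the UID inside the parameter file ties the file to its run even if it is later renamed or copied.
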
My theoretical stance and practical approach to NLP, in particular for the full-text Biology papers I analyze, is an entirely different topic.  It will only come to light when I’ve published papers describing the approach and results.  It’s a huge project for one person so publications are a few years off.


About heartandmindandme

Husband and father. Wise and droll wife and three wonderful kids. Scientist all my long life - Physics, Biology, Computer Science, and more. Washington, DC and Leesburg, VA public schools. MIT bachelor's and PhD. Faculty member of U. of Illinois and Northeastern U. for 25 years until recent retirement. Member of Marine Biological Lab, Woods Hole, MA. Like: Classical music plus instrumental folk and old jazz, or just sit and watch the sky, trees, birds, rain, wind, and snow.
This entry was posted in Computer programming, natural language processing - computational linguistics.

One Response to NLP software: Versioning, metadata, provenance

  1. My replies keep vanishing. Point I tried to make about keeping track of my program development was that I use a goodly number of Google Docs. Each section is given a date and time and each section title is included in the TOC, sometimes amounting to 3 or 4 pages. Since they’re all searchable and saved in the cloud, they are an excellent strategy for keeping track of every detail. They include snips from code and results, even screen shots.
