The difficulties and importance of interdisciplinary research

I have done good interdisciplinary research.
But it has been difficult for others to measure its quality.
That’s because fewer people understand the two or three fields that overlap in some of my work.

Impact factors are too often used as a measure of scholarly achievement.
That’s OK if you work in a narrowly specified field and publish in the leading journals and conferences in that field.
Search Google for “criticism of journal impact factor” to see the recent spate of criticism.

A personal problem for my research is related to what I’ll call my cognitive style and predispositions: I develop new ideas and systems and publish them.
Next, I find some other area interesting and publish in it.
As a practical matter, only a fraction of the community reads any particular journal or goes to any particular conference series. So my work has been seen by too few people.

The lesson to learn: Publish everywhere and often on any topic I feel should be noticed.  I’m finally doing this with my current computational linguistics work.
I’ve been focused on it for some time.  It’s novel (that’s me!). I intend to stay focused on it until it’s in production and achieving real results.  I intend to publish many substantial results from it, not moving on to some new area.

My NLP research may not reach fruition until 2016. Stay tuned.

Posted in The Science of Science

Getting the first one out

A recurring event – getting the first kleenex out of a new full box.  Any reasonable attempt at doing this invariably yields a clump of two or three.

Many suggestions have been made over the years, so many that I won’t try to recount them here.  The “final solution strategy” is worth a mention.  Using a razor, slice into the box at enough places that the box can be opened flat, leaving the full stack of kleenexes totally exposed.

Finally, there is now an accepted strategy to avoid the first-out-of-the-box problem. There is a slot in the middle of the plastic covering the kleenex underneath. If it is lengthened on each end, the problem vanishes.

Since this technique was anointed by the IKC (the Independent Kleenex Council), many suggestions about how to lengthen the slot have been put forward.

These include:

  • Grabbing it with your teeth and ripping it open.
    Too crude; results are too variable.
  • Melting the two ends of the slot, using a hot match head.
    Tends to char the top layer of kleenex.  A waste.
  • Talk to your kleenex.
    The New Age solution. But scientists have done careful experiments with this approach. It doesn’t work.  The Enlightened Ones protest. They claim to have seen it working.
  • Pinking shears –
  • grass clippers –
  • steak knife –
  • kid’s blunt end scissors

Discovered the fallacy in this: Toward the end of the supply, they fall back in rather than holding themselves partly projecting out for grabbing.

Posted in Health, The Science of Science

Much of science is folk science

“Folk Science” is usually thought of as the knowledge of science held by laypeople.

But any time a person speaks or writes about something they don’t understand in detail, it blends into a folk science level of understanding.  This applies to scientists as well.  When a scientist writes a technical paper in his/her field, some of the content is well-understood by the scientist.  But there are always portions of the paper that refer to knowledge outside the scientist’s field.  This is often signalled by the nature of references to the literature. If the reference is to narrow and technical material within the scientist’s field, it is solid scientific knowledge.  Topics more distant from the author’s expertise are often signalled by references to review articles, or older ‘classic’ papers or books.

This phenomenon has a graded structure:  The further from the author’s domain of expertise, the more general and/or older the literature referred to.

If a historian is writing about early nineteenth century culture, references to biology might only go so far as referring to Darwin’s On the Origin of Species, or a book or paper that attempts to explain the nature of the understanding of biology that existed during that period. The historian would rarely, if ever, refer to some highly technical paper of the earlier era, or a contemporary research paper explaining abstruse details of DNA sequence comparisons of extant or ancient recovered DNA that adds to our scientific understanding of the complex genetic and molecular aspects of evolution.  So the historian’s understanding of biology is at the level of folk science.

The ideas above are important to me as I work on detailed analysis of the content of biology research papers.  The text components range from highly technical details to more general statements about ‘distant’ topics.

For example, in a paper about malarial parasites, we find the following,

“In both vertebrates and yeast, one intron harbours single snoRNA gene but in plants, there are reports of clustered snoRNA genes present in a single intron [35,40]. Plasmodium falciparum has a cluster of two snoRNAs, viz PFS15 and PFS16, which are present in the same intron of PF 14_0027 (Fig 3).”

The red-highlighted first sentence refers to material that is presumably outside the authors’ central expertise. It is signalled by references to two papers, one of which is broad in scope, “Plant snoRNA database”.

The aqua-highlighted second sentence doesn’t refer to secondary sources – it is at the core of the research done by the authors and explained in the technical portions of the paper. It refers to the detailed presentation of the authors’ own data in their figure 3.

Some years ago, a friend of mine had an interesting metaphor for this phenomenon:
Consider the light coming from distant stars.  If a star is 1,000 light years away, the light from it shows us what it looked like 1,000 years ago.  The more distant the scientific field is, or the more distant the star, the older the information about either.

Posted in natural language processing - computational linguistics, The Science of Science

NLP software: Versioning, metadata, provenance

Introduction

Many software systems for scientific research make it difficult for others to reproduce the results, or to do so with their own revisions.  If a person can’t reproduce a published result, they simply have to take the scientist’s word that the computations were as described in, say, a journal or conference paper.

It’s not enough to have the code itself in a version system such as Git.  The original code used, parameters, and data sources must be available to precisely reproduce the results. In addition, descriptive notes about the computations must be available.

Git can take care of much of the above, but only if the code and documents are organized.  I would say that it’s better to have separate documents for each major run that was done, rather than single documents that grow from version to version.

As of August 2012, my code is organized as a collection of Eclipse projects (Java 1.6.0_33), together with a rich infrastructure, on Mac OS X Mountain Lion.

I’ve written a lot of code since I retired from Northeastern University at the end of June 2011.  Much of it has been a clean re-imagining and from-scratch redevelopment of the NLP system(s) that were developed with the help of many students when I was at Northeastern. It is not strictly a clean-room version; some code was reused or adapted from older code.

Strategies for infrastructure that supports/documents runs

  • Code organized into Suites that run subsidiary Tasks
    e.g., a Suite could create a pipeline of tasks
  • This suggests that Suites be written first, with Task stubs.
  • A unique ID, UID, for each major production run
  • Example:  000042.00_____RPF
  • The Git tag for the code/files used in a run is the UID.
  • Files produced have names that include the UID
    Example: File 000042.00_____RPF_PlainToSnpCollec_V0.3.meta

    • The leading zeroes accommodate sorting.
    • Number to the right of the decimal point allows multiple runs with same basic code, e.g., with different parameters.
    • RPF is the user ID (space for more characters reserved by underscores).
    • PlainToSnpCollec is the human readable portion of the title.
    • V0.3 is the related Suite class version number
  • Some use of JUnit – sometimes just too awkward to arrange – needs thought
  • Data – My own simple and efficient approach to column stores, based on the DataOutput/DataInput interfaces; the corresponding streams can read and write sequences that contain a mix of Java primitive types.
  • Metadata – The code, by itself, is not enough. Therefore, I use
    • A log file for each user with a few lines per Suite run
    • Parameter files, usually Java Properties files, that may include parameters and data source filepaths, both for files that result from earlier runs and for files produced by the run.
    • However, my recent successful experiments with JSON (using Google’s Gson) mean that I can go beyond the flat structure of properties files.  Inner classes create nesting/depth.  I’ve only needed toJson(obj) and fromJson(string).
    • Descriptive notes, high level, or UID-related.
    • Extensive Javadoc, including package Javadocs
  • Provenance – Important – Not handled explicitly, but the metadata allows the provenance of any result to be reconstructed. Ref: The Open Provenance Model and the subsequent W3C Provenance Working Group.
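
The UID and file-naming bullets above can be sketched in a few lines of Java. This is only an illustration, not the code used in the system: the class name RunUid, its method names, the six-digit run number, the two-digit number after the decimal point, and the eight-character user field padded on the left with underscores are all assumptions inferred from the single example 000042.00_____RPF.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for the run-UID naming scheme (not the original code). */
final class RunUid {
    // Assumed width of the user-ID field, inferred from "_____RPF".
    static final int USER_WIDTH = 8;

    // six digits, '.', two digits, underscore padding, then the user ID
    private static final Pattern UID =
            Pattern.compile("(\\d{6})\\.(\\d{2})(_*)([A-Za-z0-9]+)");

    final int run;      // major production run number
    final int variant;  // e.g. same code, different parameters
    final String user;  // user ID such as "RPF"

    RunUid(int run, int variant, String user) {
        if (run < 0 || run > 999999) throw new IllegalArgumentException("run: " + run);
        if (variant < 0 || variant > 99) throw new IllegalArgumentException("variant: " + variant);
        if (!user.matches("[A-Za-z0-9]{1,8}")) throw new IllegalArgumentException("user: " + user);
        this.run = run;
        this.variant = variant;
        this.user = user;
    }

    /** Renders the UID; the zero padding keeps plain string sorting in run order. */
    String format() {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format(Locale.ROOT, "%06d.%02d", run, variant));
        for (int i = user.length(); i < USER_WIDTH; i++) sb.append('_');
        return sb.append(user).toString();
    }

    /** Parses a UID string back into its three parts. */
    static RunUid parse(String text) {
        Matcher m = UID.matcher(text);
        if (!m.matches() || m.group(3).length() + m.group(4).length() != USER_WIDTH) {
            throw new IllegalArgumentException("not a run UID: " + text);
        }
        return new RunUid(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)), m.group(4));
    }

    /** Builds a file name of the form UID_Title_Vversion.extension. */
    String fileName(String title, String suiteVersion, String extension) {
        return format() + "_" + title + "_V" + suiteVersion + "." + extension;
    }

    public static void main(String[] args) {
        RunUid uid = new RunUid(42, 0, "RPF");
        System.out.println(uid.fileName("PlainToSnpCollec", "0.3", "meta"));
    }
}
```

Running main prints 000042.00_____RPF_PlainToSnpCollec_V0.3.meta, the file name used as the example above. The same format() string could also serve as the Git tag for the run.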

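The Data bullet in the list above mentions column stores built on the DataOutput/DataInput interfaces. A minimal sketch of that idea follows, using only the standard java.io streams. The class name TinyColumnStore, the two columns (offsets and scores), and the layout (a row count, then the whole int column, then the whole double column) are assumptions made for illustration; the actual design is not described in this post.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

/** Hypothetical two-column store on DataOutput/DataInput (illustration only). */
final class TinyColumnStore {

    /** Assumed layout: row count, then every int, then every double. */
    static byte[] write(int[] offsets, double[] scores) {
        if (offsets.length != scores.length) {
            throw new IllegalArgumentException("columns must have equal length");
        }
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(offsets.length);
            for (int v : offsets) out.writeInt(v);      // int column, 4 bytes each
            for (double v : scores) out.writeDouble(v); // double column, 8 bytes each
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads only the int column. */
    static int[] readOffsets(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int[] column = new int[in.readInt()];
            for (int i = 0; i < column.length; i++) column[i] = in.readInt();
            return column;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads only the double column, skipping over the int column. */
    static double[] readScores(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int rows = in.readInt();
            in.skipBytes(4 * rows); // jump past the int column
            double[] column = new double[rows];
            for (int i = 0; i < rows; i++) column[i] = in.readDouble();
            return column;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = write(new int[] {3, 14, 15}, new double[] {0.5, 1.25, 2.0});
        System.out.println(Arrays.toString(readOffsets(data)));
        System.out.println(Arrays.toString(readScores(data)));
    }
}
```

Because each column is stored contiguously, a reader that needs one column can skip the others. Swapping the byte-array streams for file streams would give an on-disk version of the same layout.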
————-

My theoretical stance and practical approach to NLP, in particular for the full-text Biology papers I analyze, is an entirely different topic.  It will only come to light when I’ve published papers describing the approach and results.  It’s a huge project for one person so publications are a few years off.

Posted in Computer programming, natural language processing - computational linguistics

New Mayan art finds – The world will not end on December 21, 2012

Three pictures from a recently discovered buried room in Guatemala.
The original paintings were done in the ninth century.

The paintings below were created after the discovery to show what the original colors probably looked like. A Google image search for:  oldest mayan calendar
will show you more.

The calendar markings on another wall indicate dates 7000 or more years into the future.

Some claimed, based on earlier finds of Mayan calendars, that the Mayans had calculated that the world would end on December 21, 2012. No need to worry – our futures will continue to unfold hundreds of generations into the future, when the …. (to be continued)

Posted in Uncategorized

Looking deeper into the culture of organizations you might work for

My Golden Rule is: don’t use this blog to recycle material that anyone can find on their own. But this item is about a company that only came into existence in February 2012.
It’s thefit.com. They emphasize workplace culture, so a prospective employee can learn “what it’s like” to work at a place before accepting an offer. Their parent company, bullhorn.com, sells recruiting software.

There’s a lot on their site including a blog (not as punchy as it is wordy).

BOSTON, Mass. (March 27, 2012) – Women work longer days and report working more often on vacation than their male counterparts. Yet, women also report greater perceived satisfaction with their compensation, according to new data released today in theFIT’s first Report on Workplace Culture. Fifty-four percent of women report working nine or more hours a day, compared to 41 percent of men. The report includes survey data from over 5,000 U.S. employees. (The report is a short set of slides in PDF.)

Posted in workplace

My current (2012) R&D in computational linguistics – especially for biology text

Now that I’m retired, I spend a lot of time on what is essentially a re-development of the system described in our 1999 paper, 

NLP-NG – A New NLP System for Biomedical Text Analysis accessible at

http://www.ccs.neu.edu/home/futrelle/research37/naturalLanguage.html

I have moved beyond the above paper in some ways.  But I’m not throwing out the old, only building on it in a systematic and focused way. Otherwise, I could spend forever on still more variations – that would be my bad.  For those of you in the Java/NLP world, I’ve discovered useful capabilities in PostgreSQL, TreeSets, and TreeMaps.

I find that I now have enough time to design code that is tight and well-structured. Much of the code in my earlier work over the years had to be written in haste. Looking back at it, I can see its limitations.

I won’t go into detail, as the system will be under wraps for quite some time, since it’s complicated and I am the only developer.  No release until solid NLP results are produced.

My schedule allows me to spend as much time as necessary on any aspect of the project that I deem important.  For example, from late April 2012 into May and beyond, I am schooling myself in Java Swing.  I’ve used Swing for years, but the time has come to buckle down and become a Swing Master.  I’ll also be looking just as deeply into many NLP-related areas.

What is Java Swing?  Swing is a set of Java-based tools for constructing GUIs (Graphical User Interfaces).  It’s designed to produce applications (apps), rather than web-based systems (web-apps).  Apps are like Word or a browser that can be run and used without any connection to the Internet.  A web-app  runs in your browser, e.g., Gmail.

A GUI is the “face” aspect of an application, what you see and interact with.

Java runs (essentially) identically on Linux, Mac, and Windows, so Swing does too.

Deepening my approach to my work/research.  In the past, I could get results by choosing the right topics and using my ingenuity to produce good results (to the tune of 60+ papers).  But that has its limits.  It’s time to step up to mastering the hard math, theory, etc. that will raise the level of my systems and their results.

I’ve learned that without this, others can’t understand what I’ve accomplished, no matter how good it is, because they can’t inspect my intuition.

Stay tuned.

Posted in natural language processing - computational linguistics