End of NCEAS & the legacy of synthetic data

The upcoming termination of NSF funding for NCEAS has recently been receiving a bit of attention, including this nice piece in Science (Stokstad, 2011).  I was particularly intrigued by Oikos’s treatment of this, in which Oikos editor Jeremy Fox writes,

that’s probably NCEAS’ biggest legacy: the sense that the answers to all, or at least many, of our questions are already out there, in existing data that just needs to be pulled together.

and then argues that this may have gone too far:

As the saying goes, if all you have is a hammer, everything looks like a nail. Similarly, if all you have is the data you’ve currently got, every question looks like one that can be answered with those data.

I think this is an important point to raise, but it misses the mark.  I don’t think anyone meant we should never collect new data.  I think the issue is rather that we never look back at the data we collect, as Jeremy very aptly acknowledges:

The sort of datasets I (and everyone I know) work with can be handled easily on an ordinary laptop with ordinary software like Excel and R. […] we’re getting along just fine without any ecoinformatics or high performance computing resources.

As he says, if all you have is a hammer, everything looks like a nail. This is exactly why efforts such as NCEAS are so important. Darwin didn’t need a laptop with Excel and R; he got along just fine.  So why is everyone using fancy laptops?

I don’t think we’re at any risk of all new data collection ceasing and everyone poring over the same few tired old studies.  Rather, we’re in an era of being flooded with data, from GenBank to NCEAS to the thousands of individual studies becoming papers in our ever-expanding journals.  We will always need new data, and we’ll always collect it, even though it is hard work and expensive, as Jeremy says.

Do we check that our questions cannot be answered by combining existing data before we set off to get more?  Do we design our collection in a way that others can maximally reuse the data to answer other questions, or combine it with other existing data to answer the kinds of questions that a small, short-term, or system-specific dataset cannot? Do we recognize the contributions of excellent, careful data collection the way we do those brilliantly clever insights backed by the most tenuous available evidence?  Do we have the tools to do this?

Jeremy has argued that many centers have attempted to mimic NCEAS, and it is worth reflecting on exactly what has made it successful.  Another of NSF’s short-term initiative centers, NESCent (the National Evolutionary Synthesis Center), holds an equally synthetic charge, which was excellently enunciated by its own independent postdocs in an article called “Linking Big” (Sidlauskas et al., 2009), emerging in the aftermath of an NSF program review that critiqued the center early on for being insufficiently synthetic. NESCent will soon face challenges similar to those now facing NCEAS: figuring out how to continue its mission and value on a model that goes beyond 5-year program grants.

Their recent piece in Nature (Piwowar et al., 2011) emphasizes the importance of funding data-archiving infrastructure, which, at a cost of $400,000 annually, may generate much more research per dollar than the current NSF funding average of over $25,000 per paper (or the $3.7 million budget of NCEAS, though NCEAS was admittedly charged with a much greater scope).

As NCEAS authors Reichman, Jones & Schildhauer eloquently state (Reichman et al., 2011):

Ecology is a synthetic discipline benefiting from open access to data from the earth, life, and social sciences. Technological challenges exist, however, due to the dispersed and heterogeneous nature of these data. Standardization of methods and development of robust metadata can increase data access but are not sufficient. Reproducibility of analyses is also important, and executable workflows are addressing this issue by capturing data provenance. Sociological challenges, including inadequate rewards for sharing data, must also be resolved. The establishment of well-curated, federated data repositories will provide a means to preserve data while promoting attribution and acknowledgement of its use.

Reichman, Jones & Schildhauer are putting their money on federated repositories such as DataONE to address these challenges, which may prove to be both a key and an economical solution.  But will such programs be funded more continuously than the synthesis centers that have jump-started them?  What role will future centers play in helping us address those challenges?

References