20 May 2013

## EWS TE revisions

• Finalized manuscript(?) Checking references, reply letter, closing remaining issues.
• Very annoying to submit to a journal system that takes LaTeX but not any external style or class file dependencies. For instance, how does one add two footnotes to the same author without adding any such dependencies? (Oh, and writing macros that do not involve pairs of $, which pandoc mistakes for its own macros…) Reasonable author affiliations done otherwise using \and, \thanks, and \footnotemark; see simple.latex pandoc template.
• Final edits from Alan.

## prosecutor comment revisions

• added Figure for Allee model into reply document.
• added code for generating figures, Figure1.Rmd
• combined all four data files into a single tidy data.frame as csv.

## ropensci

Example API queries using Whitehouse Open Data CSV-API tool.

### rfishbase queries

• Continue to get questions about performing queries on fishbase that are just not possible with their current awful data model. This time: how can one extract the 400 natural mortality rate estimates mentioned on this page? No, the page entitled popgrowth doesn’t actually show you the table.

### pdg-control: policycosts

Over the weekend, re-ran the policy costs analysis with the simpler apples-to-apples comparison (after streamlining the code a bit more). General patterns remain the same. Also added a block for showing how well we are doing on our grid sampling at getting to a shared npv0 level.

## Misc

### Notebook

• obfuscated email via javascript; copy-paste and mailto: should still work. (Of course a scraper could use the very javascript code to access the email addresses, but I find this preferable to a solution that intentionally burdens the user.) Spam filters should handle the rest.
• Interesting reading about signing individual Git commits

#### Null Distribution Width Minor Puzzle

17 May 2013

A quick foray into trying to understand why I see a wider distribution (though still symmetric) in the null of the OU model than in the null from the Allee model in the Prosecutor’s fallacy.

Load the original run of the OU model and increase the nulldt data frame to use all points instead of a sample of length 5000

load("beer_run.rda")
ou_dat <- dat
ou_null <- nulldat
#ou_null_ts <- nulldt
null <- timeseries #[1000:6010,]
null <- as.data.frame(cbind(time = 1:dim(null)[1], null))
ndf <- melt(null, id="time")
names(ndf) = c("time", "reps", "value")
levels(ndf$reps) <- 1:length(levels(ndf$reps)) # use numbers for reps instead of V1, V2, etc
nulldt <- data.table(ndf)
ou_null_ts <- nulldt

Plot the final distribution of indicator statistics, showing the width of the null,

ggplot(dat) + geom_histogram(aes(value, y=..density..), binwidth=0.3, alpha=.5) +
facet_wrap(~variable) + xlim(c(-1, 1)) +
geom_density(data=nulldat, aes(value), adjust=2) +
xlab("Kendall's tau") + theme_bw()

Load the Allee model and rename variables appropriately,

load("comment_run.rda")
allee_dat <- dat
allee_null <- nulldat
allee_null_ts <- nulldt

For the Allee model, also plot the final distribution of indicator statistics, showing the width of the null,

ggplot(dat) + geom_histogram(aes(value, y=..density..), binwidth=0.3, alpha=.5) +
facet_wrap(~variable) + xlim(c(-1, 1)) +
geom_density(data=nulldat, aes(value), adjust=2) +
xlab("Kendall's tau") + theme_bw()

Tidy the data and plot a single replicate from each. Note the OU process has the correspondingly much wider null distribution due to

allee_x <- subset(allee_null_ts, reps==1)
ou_x <- subset(ou_null_ts, reps==1)
allee_x <- data.frame(time = allee_x$time, value = allee_x$value)
ou_x <- data.frame(time = ou_x$time, value = ou_x$value)
ggplot(allee_x, aes(time, value)) + geom_point()
ggplot(ou_x, aes(time, value)) + geom_point()

Note that subsampling the data at a coarser interval doesn’t matter

warningtrend(ou_x[seq(1, length(ou_x$value),by=50),], window_var)
    tau
-0.5527 

Nor does scaling matter (recall Kendall’s $$\tau$$ is a rank-correlation test)

warningtrend(data.frame(time=ou_x$time, value=(ou_x$value+1)*500), window_var)
    tau
-0.5516 
warningtrend(data.frame(time=ou_x$time, value=ou_x$value), window_var)
    tau
-0.5516 
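A minimal sketch of why the rescaling cannot matter. The series, the `rolling_var` helper, and the window width of 50 are all hypothetical stand-ins (this is not the notebook's `warningtrend` or `window_var`). The point is only that Kendall's tau depends on ranks, so any increasing linear rescaling of the series leaves it unchanged.

```r
# Hypothetical stand-in series; not the notebook's OU replicate.
set.seed(42)
value <- cumsum(rnorm(200))

# Variance in a sliding window of fixed width (assumed stand-in for window_var).
rolling_var <- function(x, width = 50) {
  starts <- seq_len(length(x) - width + 1)
  sapply(starts, function(i) var(x[i:(i + width - 1)]))
}

# Kendall's tau between window index (time) and the windowed variance.
kendall_trend <- function(x, width = 50) {
  v <- rolling_var(x, width)
  cor(seq_along(v), v, method = "kendall")
}

tau_raw    <- kendall_trend(value)
tau_scaled <- kendall_trend((value + 1) * 500)  # same rescaling as in the entry

# Rescaling multiplies every windowed variance by 500^2, which preserves ranks.
all.equal(tau_raw, tau_scaled)
```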

Lengthening the sample suggests the sampling is not long enough to have converged in distribution of this statistic. Computing on this fine sampling resolution over adequate length of time quickly becomes prohibitive.

sapply(seq(1000, 20000, by=1000), function(i) warningtrend(ou_x[1:i,], window_var))
       tau        tau        tau        tau        tau        tau
 0.1802475 -0.0832527  0.6365570  0.1787726 -0.1747774 -0.5295226
       tau        tau        tau        tau        tau        tau
-0.5403782 -0.6790710 -0.6114706 -0.2950858 -0.0444861  0.0006919
       tau        tau        tau        tau        tau        tau
-0.2579410 -0.5220490 -0.6439519 -0.6221482 -0.5545048 -0.5134236
       tau        tau
-0.5265975 -0.5515807
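The same experiment can be sketched on a simulated series. Everything here is an assumption for illustration: the AR(1) recursion stands in for the OU process, the parameters 0.9 and 0.1 are arbitrary, and the window is taken as half the series length. The sketch recomputes the trend statistic on progressively longer prefixes of one replicate, which is the check made above.

```r
# Discrete-time stand-in for an OU process (AR(1)); parameters are arbitrary.
set.seed(1)
n <- 2000
x <- numeric(n)
for (t in 1:(n - 1)) x[t + 1] <- 0.9 * x[t] + rnorm(1, sd = 0.1)

# Variance in a sliding window of the given width.
rolling_var <- function(x, width) {
  starts <- seq_len(length(x) - width + 1)
  sapply(starts, function(i) var(x[i:(i + width - 1)]))
}

# Kendall's tau of windowed variance against time; window = half the series
# (an assumption, not necessarily what warningtrend uses).
kendall_trend <- function(x) {
  v <- rolling_var(x, floor(length(x) / 2))
  cor(seq_along(v), v, method = "kendall")
}

# Statistic on prefixes of increasing length from the same replicate.
sample_lengths <- seq(200, n, by = 200)
taus <- sapply(sample_lengths, function(i) kendall_trend(x[1:i]))
data.frame(length = sample_lengths, tau = round(taus, 3))
```

If the values keep wandering as the prefix grows, the series is too short for the statistic to have settled.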

16 May 2013

## Prosecutor

• Add figures for both examples with and without alternative stable states (ASS)
• text to accommodate both examples
• ROC curves?

## ews-review

• looking over revisions, touch-ups
• See issues log for details
• Impressive set of tools provided by the Whitehouse open data project. In particular, see the tool for generating a RESTful API from CSV files and the “common core” metadata definitions. Notebook complies with many of these (through the Dublin Core RDFa), but looks like I could benefit from adding some more terms from the Data Catalog Vocabulary. Awesome that this site is built with Twitter-Bootstrap and Jekyll, hosted on Github, and licensed as CC-BY (content) and MIT (code). See, for example, the issues log.

## Notebook

• added black-white theme for readers that prefer higher contrast / more traditional appearance. Otherwise matches the original feel reasonably well. (#70)

• Dropped the local javascript based search using the stemming engine (#7). The stemming search was reasonably fast, but only matches words instead of phrases (and adds considerable overhead to generating the site).

14 May 2013

Exploring potential plots that would allow for some visualization of the uncertainty in each of the models. Example follows up on analysis of e97043d/allen.md, as also in the script from b5d78b9/step_ahead_plots.md

## One-step ahead predictor plots and forecasting plots

Set our usual plotting options for the notebook

opts_chunk$set(tidy = FALSE, warning = FALSE, message = FALSE, cache = FALSE, comment = NA, fig.width = 7, fig.height = 4)
library(ggplot2)
opts_knit$set(upload.fun = socialR::flickr.url)
theme_set(theme_bw(base_size = 12))
theme_update(panel.background = element_rect(fill = "transparent", colour = NA),
plot.background = element_rect(fill = "transparent", colour = NA))
cbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2",
"#D55E00", "#CC79A7")

Now we show how each of the fitted models performs on the training data (e.g. plot of the step-ahead predictors). For the GP, we need to predict on the training data first:

gp_f_at_obs <- gp_predict(gp, x, burnin=1e4, thin=300)

For the parametric models, prediction is a matter of feeding in each of the observed data points to the fitted parameters, like so

step_ahead <- function(x, f, p){
h = 0
x_predict <- sapply(x, f, h, p)
n <- length(x_predict) - 1
y <- c(x[1], x_predict[1:n])
y
}
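A small self-contained check of what `step_ahead` returns. The `ricker` map and its parameter values are hypothetical, with `p = c(K, r)` and not the notebook's fitted `f` or `p`. The output is the series of one-step predictions shifted forward by one time step, with the first observation standing in for the unpredictable first point.

```r
step_ahead <- function(x, f, p) {
  h <- 0
  x_predict <- sapply(x, f, h, p)   # f(x_t, h, p) for each observed x_t
  n <- length(x_predict) - 1
  c(x[1], x_predict[1:n])           # prediction for t+1 aligned with time t+1
}

# Hypothetical Ricker map with harvest h and parameters p = c(K, r).
ricker <- function(x, h, p) {
  s <- max(0, x - h)
  s * exp(p[2] * (1 - s / p[1]))
}

x_obs <- c(2, 4, 6, 8, 10)
pred  <- step_ahead(x_obs, ricker, p = c(10, 1))

# Observed value next to the value expected from the previous observation.
data.frame(time = seq_along(x_obs), stock = x_obs, expected = pred)
```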

We apply this function over each of the fitted models, including the GP, organizing the “expected” transition points (given the previous point) into a data frame.

df <- melt(data.frame(time = 1:length(x), stock = x,
GP = gp_f_at_obs$E_Ef, True = step_ahead(x,f,p), MLE = step_ahead(x,f,est$p),
Ricker = step_ahead(x,alt$f, ricker_bayes_pars), Myers = step_ahead(x, Myer_harvest, myers_bayes_pars)
), id=c("time", "stock"))

ggplot(df) + geom_point(aes(time, stock)) +
geom_line(aes(time, value, col=variable)) +
scale_colour_manual(values=colorkey)

## Posterior predictive curves

This shows only the mean predictions. For the Bayesian cases, we can instead loop over the posteriors of the parameters (or samples from the GP posterior) to get the distribution of such curves in each case.

We will need a vectorized version of this function (pmax in place of max) that can operate on the posteriors; the others are vectorized already.

ricker_f <- function(x,h,p){
sapply(x, function(x){
x <- pmax(0, x-h)
pmax(0, x * exp(p[2] * (1 - x / p[1] )) )
})
}

Then we proceed as before, now looping over 100 random samples from the posterior for each Bayesian estimate. We write this as a function for easy reuse.

require(MASS)
step_ahead_posteriors <- function(x){
gp_f_at_obs <- gp_predict(gp, x, burnin=1e4, thin=300)
df_post <- melt(lapply(sample(100), function(i){
data.frame(time = 1:length(x), stock = x,
GP = mvrnorm(1, gp_f_at_obs$Ef_posterior[,i], gp_f_at_obs$Cf_posterior[[i]]),
True = step_ahead(x,f,p),
MLE = step_ahead(x,f,est$p),
}), id=c("time", "stock"))

}

ggplot(df_post) + geom_point(aes(time, stock)) +
geom_line(aes(time, value, col=variable, group=interaction(L1,variable)), alpha=.1) +
facet_wrap(~variable) +
scale_colour_manual(values=colorkey, guide = guide_legend(override.aes = list(alpha = 1))) 

Alternatively, try the plot without facets:

ggplot(df_post) + geom_point(aes(time, stock)) +
geom_line(aes(time, value, col=variable, group=interaction(L1,variable)), alpha=.1) +
scale_colour_manual(values=colorkey, guide = guide_legend(override.aes = list(alpha = 1))) 
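A stripped-down sketch of the posterior loop for a single parametric model, in base R only. The posterior draws here are simulated placeholders (not output from any of the fitted models), and `ricker` with `p = c(K, r)` is a hypothetical stand-in. Each draw yields one step-ahead curve, and the curves are stacked in long form so that they can be drawn as faint overlapping lines or summarised as a band.

```r
set.seed(7)

step_ahead <- function(x, f, p) {
  x_predict <- sapply(x, f, 0, p)
  c(x[1], x_predict[seq_len(length(x_predict) - 1)])
}

# Hypothetical Ricker map, p = c(K, r).
ricker <- function(x, h, p) {
  s <- max(0, x - h)
  s * exp(p[2] * (1 - s / p[1]))
}

x_obs <- c(2, 4, 6, 8, 10, 9, 10)

# Placeholder "posterior": 100 simulated (K, r) pairs, not real MCMC output.
posterior <- data.frame(K = rnorm(100, 10, 1), r = rnorm(100, 1, 0.1))

# One step-ahead curve per posterior draw, stacked in long format.
curves <- do.call(rbind, lapply(seq_len(nrow(posterior)), function(i) {
  data.frame(draw  = i,
             time  = seq_along(x_obs),
             stock = x_obs,
             value = step_ahead(x_obs, ricker,
                                c(posterior$K[i], posterior$r[i])))
}))

# Pointwise 5% and 95% quantiles across draws, as a simple uncertainty band.
band <- data.frame(
  time  = seq_along(x_obs),
  lower = as.numeric(tapply(curves$value, curves$time, quantile, probs = 0.05)),
  upper = as.numeric(tapply(curves$value, curves$time, quantile, probs = 0.95)))
band
```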

## Performance on data outside of observations

Of course it is hardly surprising that all models do reasonably well on the data on which they were trained. A crux of the problem is the model performance on data outside the observed range. (Though we might also wish to repeat the above plot on data in the observed range but a different sequence from the observed data).

First we generate some data from the underlying model coming from below the tipping point:

Tobs <- 8
y <- numeric(Tobs)
y[1] = 4.5
for(t in 1:(Tobs-1))
y[t+1] = z_g() * f(y[t], h=0, p=p)

Proceed as before on this data:

crash_data <- step_ahead_posteriors(y)

ggplot(crash_data) + geom_point(aes(time, stock)) +
geom_line(aes(time, value, col=variable, group=interaction(L1,variable)), alpha=.1) +
facet_wrap(~variable) +
scale_colour_manual(values=colorkey, guide = guide_legend(override.aes = list(alpha = 1))) 

Note that the GP is doing remarkably well even outside the observed range, though with greater uncertainty as well. Other models tend to be overly optimistic, often predicting an increase instead of a decline, hence the many trajectories that continually float above the data.

## Forecast Distributions

Another way to visualize this is to look directly at the distribution predicted under each model, one step and several steps into the future (say, at a fixed harvest level). Here we have a simple function that will look one step ahead of the given x (given as index i), then Tobs steps ahead, then 2*Tobs steps ahead, each at a fixed harvest. This lets us compare both the expected outcome over short and long term under a given harvest policy, as well as seeing how the distribution of possible outcomes evolves.

library(expm)
get_forecasts <- function(i, Tobs, h_i){

df <- data.frame(
x = x_grid,
GP = matrices_gp[[h_i]][i,],
True = matrices_true[[h_i]][i,],
MLE = matrices_estimated[[h_i]][i,],
Parametric.Bayes = matrices_par_bayes[[h_i]][i,],
Ricker = matrices_alt[[h_i]][i,],
Myers = matrices_myers[[h_i]][i,])

df2 <- data.frame(
x = x_grid,
GP = (matrices_gp[[h_i]] %^% Tobs)[i,],
True = (matrices_true[[h_i]] %^% Tobs)[i,],
MLE = (matrices_estimated[[h_i]] %^% Tobs)[i,],
Parametric.Bayes = (matrices_par_bayes[[h_i]] %^% Tobs)[i,],
Ricker = (matrices_alt[[h_i]] %^% Tobs)[i,],
Myers = (matrices_myers[[h_i]] %^% Tobs)[i,])

T2 <- 2 * Tobs

df4 <- data.frame(
x = x_grid,
GP = (matrices_gp[[h_i]] %^% T2)[i,],
True = (matrices_true[[h_i]] %^% T2)[i,],
MLE = (matrices_estimated[[h_i]] %^% T2)[i,],
Parametric.Bayes = (matrices_par_bayes[[h_i]] %^% T2)[i,],
Ricker = (matrices_alt[[h_i]] %^% T2)[i,],
Myers = (matrices_myers[[h_i]] %^% T2)[i,])

forecasts <- melt(list(T_start = df, T_mid = df2, T_end = df4), id="x")

}

i = 15

This takes i, an index into x_grid (e.g. i=15 corresponds to a starting position x = 3.4286)

forecasts <- get_forecasts(i = 15, Tobs = 5, h_i = 1)

ggplot(forecasts) +
geom_line(aes(x, value, group=interaction(variable, L1), col=variable, lty=L1)) +
facet_wrap(~ variable, scale="free_y", ncol=2) +
scale_colour_manual(values=colorkey) 

We can compare to a better starting stock,

i<-30

where i=30 corresponds to a starting position x = 7.102

forecasts <- get_forecasts(i = i, Tobs = 5, h_i = 1)

ggplot(forecasts) +
geom_line(aes(x, value, group=interaction(variable, L1), col=variable, lty=L1)) +
facet_wrap(~ variable, scale="free_y", ncol=2) +
scale_colour_manual(values=colorkey) 

Note the greater uncertainty in both the positive and negative outcomes under the parametric Bayesian models (of correct and incorrect structure).
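The forecasting step in `get_forecasts` relies on matrix powers of a transition matrix. A toy version under stated assumptions: the three-state matrix `P` below is made up, and `mat_power` is a hypothetical base-R helper doing by repeated multiplication what `%^%` from the expm package does in the code above. Row `i` of the n-th power is the distribution over states n steps after starting in state `i`.

```r
# Made-up row-stochastic transition matrix over 3 stock levels (rows sum to 1).
P <- matrix(c(0.8, 0.2, 0.0,
              0.1, 0.7, 0.2,
              0.0, 0.3, 0.7), nrow = 3, byrow = TRUE)

# n-th matrix power by repeated multiplication (hypothetical helper).
mat_power <- function(M, n) {
  out <- diag(nrow(M))
  for (k in seq_len(n)) out <- out %*% M
  out
}

i <- 1                                # index of the starting state
forecasts <- rbind(
  T_start = P[i, ],                   # one step ahead
  T_mid   = mat_power(P, 5)[i, ],     # Tobs = 5 steps ahead
  T_end   = mat_power(P, 10)[i, ])    # 2 * Tobs steps ahead
forecasts
```

Each row of `forecasts` is a probability distribution, so comparing rows shows how the spread of outcomes evolves with the forecast horizon.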

Apparently I am catching up on my C. Titus Brown reading… Archiving comments for my records & to keep them in my local search index.

Titus Brown on if he really practices Open Science?

Methinks:

I think you undersell your Github practices regarding Open Science. While I agree entirely that anyone reading through your source-code to scoop you is incredibly unlikely, I imagine many researchers would still fear the practice of using a public, Google-indexed repository, particularly if they also practice reasonable literate programming documentation, provide test cases, and use issue tracking that could allow some close rival to scoop them.

It sounds like you are talking more about not marketing your work before it is “finished”, rather than not being open. (When you say “to really open up about what we’re doing” I assume you mean advertising the results, rather than something like simply putting your notebook or publication drafts on github). Though “Open Science” is used in both contexts, marketing your pre-publication science takes additional time that exposing your workflow on Github does not. The latter practice provides transparency, provenance, machine-discoverability (and cool graphs of research contributions like https://github.com/ctb/khmer/c…), etc. Perhaps you are too hard on yourself here.

Meanwhile, I think suggesting that being open prematurely would “waste other people’s time, energy, and attention” is misleading and damaging here. Sure, I understand you mean that marketing unfinished results (blog, tweet, present at conferences) would do these things, but when the same reason is frequently given for not sharing data, code, etc this becomes very damaging. Some people may find it useful to blog/tweet/present unfinished work (with appropriate disclaimer) to get feedback, build audience, or for all the same reasons one would do so with published work, but to me that is really just a question of timing & marketing, not a question of “openness”.

On Research Software Reuse

On the Costs of Open Science, which I refute:

Great post and important question. I certainly agree that it is all about incentives, but I think an important missing part of that discussion is in the timescales. Certainly there is a cost to not writing up the low-hanging fruit following up on your work, but surely such exercises would take non-trivial time away from doing whatever (presumably more interesting) stuff you did instead. Perhaps it is the benefits, not the costs, that are hidden:

It sounds like you may have forgone a short-term cost in not publishing these easy follow-ups while gaining a benefit both of time to work on other interesting things and of getting other researchers invested in your work, who might not have taken it up at all if there was no low-hanging fruit to entice them in. Once invested in it, no doubt they can continue to be a multiplier of impact. Meanwhile you break new ground rather than appearing to continue to wring every ounce from one good idea, right? (If I read this correctly, both you and George Gey appear to regard these other publications as not particularly exciting work; the regret comes only because they are intrinsically valued publications)

It sounds like the difficulty with these benefits is that they pay off only on a longer timescale than the gristmill publication strategy. On the longest timescales, e.g. career lifetime, it seems clear that at least in a statistical sense, researchers who keep breaking new ground while allowing others to pick the low-hanging follow-up work will be much more likely to end up as the most distinguished researchers, etc. When the relevant timescale is a tenure clock rather than a career, perhaps the calculus is different?

Without open science, moving on to other exciting stuff is far less effective, since it leaves both unfinished work and less impact. Perhaps this is a somewhat idealized view, and perhaps the timescales for tenure etc. work against this strategy. You and others could no doubt better speak to whether the candidate who says “look at all the papers I wrote about X” or the one who says “look at all the papers others have built upon my method X while I break ground in hard problem Y” has the stronger case.

13 May 2013

### Prosecutor’s Fallacy comment

• Fallacy comment revisions
• Add example of system with an alternative stable state?

### Policy costs (pdg-control)

• touched up tex document following Paul’s edits (mostly reflects comments and decisions from Meeting 3). Reminder of why collaborating on TeX documents can be annoying even when co-authors are tex-literate. Attempted quick conversion to markdown but mapping is troublesome.

## Misc

• Some software providing rather impressive/high fidelity pdf to html conversion… Not just rendering as images – the text is searchable, though in the html source it just looks like a bunch of binary data URIs.
• Scathingly honest review of the classical ecological statistics text by Sokal and Rohlf (pdf)
• PNAS hates mathematics. Not really, but a depressing example of poor reviewing.

09 May 2013

## Talk prep

• Finished preparing slides for Environmental Resources Economics talk

Toying around with animations for final plots. Building up a plot by subsetting progressively more of the data each time is a bit of a nuisance (both in coding and efficiency). Can convert replicates to characters and assign them as a data.table key for fast join subsetting, but straightforward subsetting seems best. E.g. when we want reps 1:5 from both “True” and “Ricker” models, dt[J(c("True", "Ricker"), as.character(1:5))] doesn’t quite do this. (Can you guess what it gives? It actually alternates “True”+rep1, “Ricker”+rep2, “True”+rep3, …)

require(animation)
ani.options(loop=TRUE)
saveMovie({
for (i in seq(1, OptTime, by=2)) {
print(
ggplot(subset(dt, method %in% c("True", "Ricker") & reps < 5 & time <= i),
aes(time, fishstock, group=interaction(reps,method), color = method), alpha=.9) +
geom_line() + xlim(0, OptTime)
)
}
}, interval = 0.1, movie.name = "wrong-model.gif", ani.width = 600, ani.height = 600)
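The subset wanted above is the full cross product of methods and replicates, not a pairwise match. A base-R sketch of that cross product follows, using a made-up `dt` with the same column names as the plotting code. In data.table, `CJ()` builds the same kind of cross-joined key for use inside `dt[...]`, though that is not shown or tested here.

```r
# Made-up stand-in for dt: 3 time points x 10 reps x 3 methods.
dt <- expand.grid(time   = 1:3,
                  reps   = as.character(1:10),
                  method = c("True", "Ricker", "Myers"),
                  stringsAsFactors = FALSE)
dt$fishstock <- seq_len(nrow(dt))

# Every (method, rep) combination we want: 2 methods x 5 reps = 10 pairs.
wanted <- expand.grid(method = c("True", "Ricker"),
                      reps   = as.character(1:5),
                      stringsAsFactors = FALSE)

# Inner join keeps only rows whose (method, reps) pair is in the cross product.
subset_dt <- merge(dt, wanted, by = c("method", "reps"))

table(subset_dt$method, subset_dt$reps)
```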

• All plots for slides also archived in flickr/ere2013

• Which were quickly converted from vector pdfs into decent resolution pngs with a few commands:

for f in *.pdf ; do convert -density 300 "$f" "${f/.pdf}".png ; done
flickr_upload --tag="nonparametric-bayes ere2013 talk" *.png

• (Sirota et al. 2013) provide a cute example of a system that can be manipulated through a tipping point of eutrophication in the tiny pools forming inside pitcher plants by adding ground-up dead insects. A nice natural system that provides a more realistic setting than lab manipulations of single-species micro-organisms while also being more accessible to replication than the whole-lake experiments in Wisconsin. I do note they critique retrospective analyses on the basis of length, but don’t mention the prosecutor’s fallacy. Five stars for archiving the raw data very nicely (see below) along with the mathematica notebook file used for the analysis. Also notable that the first author is an undergrad at North Dakota State.
• Wow, Harvard Forest provides a data repository with EML files for each data entry (example). EML file serves more as a metadata description, raw data provided as an 8.8 MB .csv file in tidy (long) format. Very nice.
• From the Whitehouse, executive order: the default state of new and modernized Government information resources shall be open and machine readable. I can’t really express just how incredible that is. May it impact government funded science appropriately. Whitehouse Chief Technology Officer and Chief Information Officer explain the policy in one minute

## References

• J. Sirota, B. Baiser, N. J. Gotelli, A. M. Ellison, (2013) Organic-Matter Loading Determines Regime Shifts And Alternative States in an Aquatic Ecosystem. Proceedings of The National Academy of Sciences 110 7742-7747 10.1073/pnas.1221037110

07 May 2013

## Monday

• MacKenzie meeting
• paperwork to Marc
• paperwork to Karthik
• Writing

## Misc

• latex templates for UCSC AMS dept Gosh sometimes it’s nice to be officially in a department that understands tex.

• Best teachers know students’ misconceptions. Hmm, our misconceptions are probably a bigger barrier than lack of knowledge for most scientists too…

• #openoxford’s debate: Elsevier rep says that if they disclose prices, prices would be driven so low that they’d go out of business. Yikes, what would the economists say!

• Finally, you can search within a git repository (without cloning first and using grep, obviously). Before, one had global search options only, albeit filtered by language, etc.

• Callaway (2013). Ironic that they quote @mfenner saying that PeerJ preprints are too like Nature Precedings while also pointing out the role of archiving things like slides and posters (as these were categories of objects in Precedings, but not in arXiv or PeerJ). Of course I agree with Martin about diverse outputs, but there appears to be something about the concept of a preprint server that carries more meaning for people. Clearly figshare is a viable preprint server that is both the most discipline-agnostic (arXiv isn’t interested in expanding its scope, and PeerJ’s stated scope does not cover the earth science / climate science / physical oceanography side of things) and content-type agnostic (paper, poster, data, code, or anything else), and has a rich API that is absent or minimal in the other platforms.

## References

• Ewen Callaway, (2013) Biomedical Journal And Publisher Hope to Bring Preprints to Life. Nature Medicine 19 512-512 10.1038/nm0513-512

03 May 2013

## Parametric Bayesian comparisons

• Added parametric Bayesian version of Ricker to the set of comparisons, see allen.Rmd
• Adjusted thinning of posterior sample before determining optimum of GP

Need a nice set of basic fisheries examples involving potential tipping points for ERE talk. hmm.

## References

• Richard B. Aronson, William F. Precht, (2000) Herbivory And Algal Dynamics on The Coral Reef at Discovery Bay, Jamaica. Limnology And Oceanography 45 251-255 10.4319/lo.2000.45.1.0251
• M Lindegren, R Diekmann, C Möllmann, (2010) Regime Shifts, Resilience And Recovery of A Cod Stock. Marine Ecology Progress Series 402 239-253 10.3354/meps08454
• T Oguz, V Velikova, (2010) Abrupt Transition of The Northwestern Black Sea Shelf Ecosystem From A Eutrophic to an Alternative Pristine State. Marine Ecology Progress Series 405 231-242 10.3354/meps08538
• RW Osman, P Munguia, RN Zajac, (2010) Ecological Thresholds in Marine Communities: Theory, Experiments And Management. Marine Ecology Progress Series 413 185-187 10.3354/meps08765
• Michelle J. Paddack, John D. Reynolds, Consuelo Aguilar, Richard S. Appeldoorn, Jim Beets, Edward W. Burkett, Paul M. Chittaro, Kristen Clarke, Rene Esteves, Ana C. Fonseca, Graham E. Forrester, Alan M. Friedlander, Jorge García-Sais, Gaspar González-Sansón, Lance K.B. Jordan, David B. McClellan, Margaret W. Miller, Philip P. Molloy, Peter J. Mumby, Ivan Nagelkerken, Michael Nemeth, Raúl Navas-Camacho, Joanna Pitt, Nicholas V.C. Polunin, Maria Catalina Reyes-Nivia, D. Ross Robertson, Alberto Rodríguez-Ramírez, Eva Salas, Struan R. Smith, Richard E. Spieler, Mark A. Steele, Ivor D. Williams, Clare L. Wormald, Andrew R. Watkinson, Isabelle M. Côté, (2009) Recent Region-Wide Declines in Caribbean Reef Fish Abundance. Current Biology 19 590-595 10.1016/j.cub.2009.02.041
• Peter S. Petraitis, Steve R. Dudgeon, (2004) Detection of Alternative Stable States in Marine Communities. Journal of Experimental Marine Biology And Ecology 300 343-371 10.1016/j.jembe.2003.12.026
• E. K. Pikitch, (2012) The Risks of Overfishing. Science 338 474-475 10.1126/science.1229965
• B. Worm, R. Hilborn, J. K. Baum, T. A. Branch, J. S. Collie, C. Costello, M. J. Fogarty, E. A. Fulton, J. A. Hutchings, S. Jennings, O. P. Jensen, H. K. Lotze, P. M. Mace, T. R. McClanahan, C. Minto, S. R. Palumbi, A. M. Parma, D. Ricard, A. A. Rosenberg, R. Watson, D. Zeller, (2009) Rebuilding Global Fisheries. Science 325 578-585 10.1126/science.1173146
• Brad deYoung, Manuel Barange, Gregory Beaugrand, Roger Harris, R. Ian Perry, Marten Scheffer, Francisco Werner, (2008) Regime Shifts in Marine Ecosystems: Detection, Prediction And Management. Trends in Ecology & Evolution 23 402-409 10.1016/j.tree.2008.03.008

#### Notebook features: SHA Hashes

Providing an immutable and verifiable research record

03 May 2013

Note: this entry is part of a series of posts which explain some of the technical features of my lab notebook.

I version manage all changes to my entries using git. Each page is linked to its source history on Github, which will display a list of all previous edits to the post with an easy-to-read commit log and highlighted diffs. A version history is often considered an essential part of an open lab notebook, where changes to the notebook are documented and preserved. While wikis, google docs, dropbox, wordpress plugins, or just regular backups can provide version history of pages, none come close to comparison with a full version management system such as git. This is because git’s underlying architecture is based on
The magic of cryptographic SHA hashes.

Hashes provide an immutable and verifiable record of any and all changes I make. Because the hash is generated by the cryptographic SHA1 algorithm from the contents of the site, it is impossible to make changes without causing the hash to update. By referencing the content’s hash value we can be sure to link to a constant version of the entry. These can be verified by re-generating the hash (requiring the previous state of the repository in this case, see note below). Unlike publication timestamps, versions, or DOIs, this provides a way not only to reference particular versions of a file, but a cryptographically secure way to verify that the version is what it claims to be. Tobias Kuhn has observed that this is a valuable feature we should want to see for all scientific publishing. Each of my posts now displays its SHA hash on the sidebar along with other metadata. While the history button already provides a convenient way to browse all previous versions of a post, I chose to display the SHA hash directly so that the hash value would be part of the document metadata, while also highlighting this feature.
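A minimal sketch of the verify-by-regenerating idea, assuming the digest package is installed. It hashes a piece of text the way git hashes a single file's contents as a "blob" object (SHA1 over the header `blob <size>\0` followed by the bytes). Note that this is a per-file content hash, not the commit hash shown in the sidebar, and the two example strings are made up.

```r
library(digest)

# SHA1 of content using git's blob convention: "blob <n bytes>\0" + content.
git_blob_sha1 <- function(text) {
  content <- charToRaw(text)
  header  <- c(charToRaw(paste0("blob ", length(content))), as.raw(0))
  digest(c(header, content), algo = "sha1", serialize = FALSE)
}

# Made-up entry text, and the same text with a single digit changed.
original <- "Kendall's tau = -0.5516\n"
edited   <- "Kendall's tau = -0.5517\n"

h_original <- git_blob_sha1(original)
h_edited   <- git_blob_sha1(edited)

# Verification: regenerating the hash from the same content reproduces it,
# while any edit, however small, produces a different hash.
identical(h_original, git_blob_sha1(original))
identical(h_original, h_edited)
```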

In addition to the GitHub repository for my lab notebook, my research code, analyses, and manuscripts are collected into Github repositories by project. This allows my analysis and paper writing to benefit from this same immutable and verifiable record. Because GitHub uses the SHA hashes in its link structure, this also provides a convenient way to link to a particular version of code in a given entry. This way, I can be sure the contents of the file displayed at that link never change, even as I continue to update that file. Even if the file or containing directory is later deleted or moved, the link will still resolve. Only if the entire project repo were deleted or if Github itself dissolved would the link be lost. Even then, using the SHA hash given in the link we could determine the contents of the file from some other copy of the repository (such as a local or figshare archive).

Tobias is actually working on his own SHA hash approach which is somewhat superior to the simpler method of using git. The Github hash corresponds to the state of the entire repository/notebook at the time of the commit, rather than the contents of an individual file. Consequently, one would need a snapshot of the entire repository, available on Github, to perform the verification. Tobias is looking into generating hashes based on the contents of the file directly (so far, only RDF data) that could provide a unique and verifiable reference for any scholarly data or publication.

Version managing the notebook and code has many more practical day-to-day benefits, such as recovering from a mistakenly deleted or corrupted file, merging changes made on different machines or by collaborators, creating branches to test new features without disrupting the current version, and comparing differences as a file evolves.