Notebook features: SHA Hashes

Providing an immutable and verifiable research record

Note: this entry is part of a series of posts which explain some of the technical features of my lab notebook.

I version manage all changes to my entry using git. Each page is linked to its source history on Github, which will display a list of all previous edits to the post with an easy-to-read commit log and highlighted diffs. A version history is often considered an essential part of an open lab notebook, where changes to the notebook are documented and preserved. While wikis, Google docs, Dropbox, Wordpress plugins, or just regular backups can provide version history of pages, none come close to comparison with a full version management system such as git. This is because git’s underlying architecture is based on
The magic of cryptographic SHA hashes.

Hashes provide an immutable and verifiable record of any all changes I make. Because the hash is generated by the cryptographic SHA1 algorithm from the contents of the site, it is impossible to make changes without causing the hash to update. By referencing the content’s hash value we can be sure to link to a constant version of the entry. These can be verified by re-generating the hash (requiring the previous state of the repository in this case, see note below). Unlike publication timestamps, versions, or DOIs, this provides a way not only to reference particular versions of a file, but a cryptographically secure way to verify that the version is what it claims to be. Tobias Kuhn has observed that this is a valuable feature we should want to see for all scientific publishing. Each of my posts now displays its SHA hash on the sidebar along with other metadata. While the history button already provides a convenient way to browse all previous versions of a post, I chose to display the SHA hash directly so that the hash value would be part of the document metadata, while also highlighting this feature.

In addition to the GitHub repository for my lab notebook, My research code, analyses, and manuscripts are collected into Github repositories by project. This allows my analysis and paper writing to benefit from this same immutable and verifiable record. Because GitHub uses the SHA hashes in its link structure, this also provides a convenient way to link to a particular version of code in a given entry. This way, I can be sure the contents of the file displayed at that link never change, even as I continue to update that file. Even if the file or containing directory is later deleted or moved, the link will still resolve. Only if the entire project repo were deleted or if Github itself dissolved would the link be lost. Even then, using the SHA hash given in the link we could determine the contents of the file from some other copy of the repository (such as a local or figshare archive).

Tobias is actually working on his own SHA hash approach which is somewhat superior to the simpler method of using git. The Github hash corresponds to the state of the entire repository/notebook at the time of the commit, rather than the contents of an individual file. Consequently, one would need a snapshot of the entire repository, available on Github, to perform the verification. Tobias is looking into generating hashes based on the contents of the file directly – so far, only RDF data – that could provide a unique and verifiable reference for any scholarly data or publication.

Version managing the notebook and code has many more practical day-to-day benefits, such as recovering from a mistaken deleted or corrupted file, merging changes made on different machines or by collaborators, or creating branches to test new features without disrupting current version, and comparing differences as a file evolves.