packrat and rmarkdown

I’m pretty happy with the way rmarkdown looks like it can pretty much replace my Makefile approach with a simple R command to rmarkdown::render(). Notably, a lot of the pandoc configuration can already go into the document’s yaml header (bibliography, csl, template, documentclass, etc), avoiding any messing around with the Makefile, etc.

Even more exciting is the pending RStudio integration with pandoc. This exposes the features of the rmarkdown package to the RStudio IDE buttons, but more importantly, seems like it will simplify the pandoc/latex dependency issues cross-platform.

In light of these developments, I wonder if I should separate my manuscripts from their corresponding R packages entirely (and/or treat them as vignettes?) I think it would be ideal to point people to a single .Rmd file and say “load this in RStudio” rather than passing along a whole working directory.

The rmarkdown::render workflow doesn’t cover installing the dependencies, or downloading the a pre-built cache. I’ve been relying on the R package mechanism itself to handle dependencies, though I list all packages loaded by the manuscript but not needed by the package functions themselves as SUGGESTS, as one would do with a vignette. Consequently, I’ve had to add an install.R script to my template, to make sure these packages are installed before a user attempts to run the document. The install script feels like a bit of a hack, and makes me think that RStudio’s packrat may be what I actually want for this. So I finally got around to playing with packrat.

packrat

Packrat isn’t yet on CRAN, and for an RStudio package I admit that it feels a bit clunky still. Having a single packrat.lock file (think Gemfile.lock I suppose) seems like a great idea. Carting around the hidden files .Rprofile, .Renviron, and the tar.gz sources for all the dependences (in packrat.sources) seems heavy and clunky, and logging in and out all the time feels like a hack.

The first discussion led to an interesting question about just how big are CRAN packages these days anyhow? Thanks to this clever rsync trick from Duncan, I could quickly explore this:

txt = system("rsync --list-only cran.r-project.org::CRAN/src/contrib/ | grep .tar.gz", intern = TRUE)
setAs("character", "num.with.commas", function(from) as.numeric(gsub(",", "", from) ) )
ans = read.table(textConnection(txt), colClasses=c("character", "num.with.commas", "Date", "character"))
ggplot(ans, aes(V3, V2)) + geom_point()
sum(ans$V2>1e6)
sum(ans$V2/1e6)
cran
cran

Note that there are 711 packages over 1 MB, for a total weight of over 2.8 GB. Not huge but more than you might want in a Git repo all the same.

Nevertheless, packrat works pretty well. Using a bit of a hack we can just version manage/ship the packrat.lock file and let packrat try and restore the rest.

packrat::packify()
source(".Rprofile"); readRenviron(".Renviron")
packrat::restore()
source(".Rprofile"); readRenviron(".Renviron")

The source/readRenviron calls should really be restarts to R. Tried replacing this with calls to Rscript -e "packrat::packify() etc. but that fails to find packrat on the second call. (Attempting to reinstall it doesn’t work either).

Provided the sources haven’t disappeared from their locations on Github, CRAN, etc., I think this strategy should work just fine. More long-term, we would want to archive a tarball with the packrat.sources, perhaps downloading it from a script as I currently do with the cache archive.

knitcitations

Debugging check reveals some pretty tricky behavior on R’s part: it wants to check the R code in my vignette even though it’s not building the vignette. It does this by tangling out the code chunks, which ignores in-line code. Not sure if this should be a bug in knitr or R, but it can’t be my fault ;-). See knitr/issues/784

With checks passing, have sent v0.6 to CRAN. Fingers crossed…

Milestones for version 0.7 should be able to address the print formatting issues, hopefully as a new citation_format option and without breaking backwards compatibility.