# Lab Notebook

### (Introduction)

#### Coding

• cboettig pushed to master at cboettig/nonparametric-bayes: supress printing of mcmc diagnostics until appendix supress printing of mcmc diagnostics 08:24 2013/06/18
• cboettig pushed to master at cboettig/labnotebook: drafting drafting draft 03:50 2013/06/18
• cboettig pushed to master at cboettig/nonparametric-bayes: include color palette 04:32 2013/06/18
• cboettig pushed to master at cboettig/nonparametric-bayes: try the p=1,10,5 settings deviances block got deleted accidentally. restored 11:12 2013/06/17
• cboettig pushed to master at cboettig/cboettig.github.com: update site 08:44 2013/06/17

#### Discussing

• My treebase paper (re)appears along with @phylorich's diversitree and @dwbapst's paleotree in the @MethodsEcolEvol #evol2013 virtual issue

12:35 2013/06/18
• RT @phdpqc: @cboettig @tpoi @recology_ "Training" is the genetic solution to everything. Committees need to value software as much as publi…

11:30 2013/06/18
• We @rOpenSci get $180K from the Sloan Foundation! Here's what we hope to accomplish with it this year: http://t.co/4UXfMzvkxn 11:21 2013/06/18

• @tpoi @recology_ all for training; but I fear people learn what they have incentive to learn (my rant on same paper https://t.co/PV5J6lcPnf 10:52 2013/06/18

• @phdpqc @yodacomplex I agree. That's why my review focuses on things that should help user feedback 10:47 2013/06/18

#### Reading

• Review of Sokal and Rolf 2012 Biometry 4th Ed.: Limnology and Oceanography Bulletin (2013). Volume: 115, Issue: 2. Pages: 62-65. Stuart H. Hurlbert et al. 05:21 2013/06/03
• Signature of ocean warming in global fisheries catch: Nature (2013). Volume: 497, Issue: 7449. Pages: 365-368. William W. L. Cheung, Reg Watson, Daniel Pauly et al. 05:21 2013/06/03
• Innovations in capture fisheries are an imperative for nutrition security in the developing world: Proceedings of the National Academy of Sciences (2013). Pages: 1-6. S. J. Hall, R. Hilborn, N. L. Andrew, E. H. Allison et al. 05:21 2013/06/03
• What early warning systems are there for environmental shocks?: Environmental Science & Policy (2013). Pages: S60-S75. Timothy M. Lenton et al. 05:21 2013/06/03

#### Entries

#### What I look for in 'Software Papers'

pet peeves and faux pas

13 Jun 2013

pageviews: 516

I am more and more frequently reviewing 'software papers', which I define as publications whose primary purpose is to publicize a piece of scientific software and to provide a traditional research product that can receive citations and recognition from other researchers in grant and job reviews. To me this feels very much like hacking the publication recognition system rather than the ideal way to recognize and track the role of software in research communities, but it is a very practical hack in the current climate. I have written two such papers myself, so I have been on both ends of this issue. In this post, I share what I look for in such papers and the omissions I see most frequently.

## Reviewing software

If we are going to employ this hack of the publication system for software, we should at least use it to maximal advantage. As a reviewer, I feel that means reviewing not just the submitted manuscript but the software itself. If we can agree on nothing else, we as a community should at least be able to say: my review of a software paper is a review of the software.

I assume most other authors, reviewers, and editors of this content share this implicit assumption, but I'd love to hear first hand from anyone else. For instance: as an editor, what would you do if your reviewers commented only on the paper and not on the software distributed with it?

We are not really taught to review software, any more than we are taught to write it in the first place. Most journals offer little guidance on this (though see the Journal of Open Research Software guidelines for peer review of software, although they are still rather minimal). In the absence of a culture of software reviewing, I thought I would lay out my own perspective in the hope of hearing feedback and push-back from others. Perhaps through such dialogs we can develop clearer expectations for this rapidly expanding genre.

## Reviewing the software paper

I don't include "to document" the software as a purpose, since no software paper does so very comprehensively, and besides, documentation belongs in the software, not in a journal.
"Publicize" usually includes some motivating examples that could convince many readers that the software does something useful for them without too much effort. As such, I expect the paper to:

• provide the journal's audience with a clear motivation for why the package is useful, and
• have at least one functioning "wow" example that I can run (by copy-paste) and understand without difficulty (e.g. without referring to code comments or the package manual to understand the function calls and their arguments).

This is an intentionally low bar that I hope helps promote these kinds of contributions. Despite this, papers frequently fail the copy-paste test or omit plain-text explanations of the function calls. Come on, people. Meanwhile, I try to focus the rest of my review on the software itself.

## My partial list of criteria

As I am almost always reviewing R packages, the software already meets some very basic standards required by submission to CRAN: dependencies and license stated, built-in documentation, passing some minimal automatic checks, etc. (see the CRAN Policies and the Writing R Extensions manual for details). This is great, as it clears the first few hurdles of installation without much fuss, but it still provides a bar that is by itself unacceptably low for published scientific software.

Here is a list of the things I see that most often frustrate me. This isn't intended as a style guide or a comprehensive list of best practices, just my own pet peeves. I have somewhat tongue-in-cheek labeled them by the severity of the review I might give, which, like any other use of these terms, is more a measure of how annoyed I am than anything else. Critiques and suggestions welcome.

### Edit:

The comments from other reviewers, authors, and editors have been fantastic; thank you all. I particularly appreciate the opportunity to have reviewing styles critiqued, something that does not happen in normal peer review.

Just a note on my headings here. I do not see any of these things as "gatekeeping requirements" and have intentionally omitted the option of "Reject". I would reject such a paper for methodological flaws, etc., but not for any of the reasons on my list below. The list is intended only to improve, not prevent, software publication. I believe any of the decisions below typically results in a revision to the same journal, that authors judiciously choose how to respond to reviewer comments guided by the editor's own feedback, and that it is ultimately the editor's decision whether any of this is relevant. </edit>

## "Reject and resubmit"

### Automatic tests

A scientific R package must, must, must have some automated tests that are run by R CMD check. Even if further development of the package doesn't break anything (most likely only if further development doesn't happen), changes to the package's dependencies could still break things, so it is crucial to have a mechanism in place to detect these problems when they arise. For code that runs quickly, the simplest way to do this is through the examples in the documentation. I don't expect all scientific software to have a complete test suite with 100% coverage, covering all the weird things that can happen if a user passes in a matrix when the function expects a data frame or has some unanticipated missing values, etc. Just some tests to make sure the basic examples execute and I'll be happy.
Longer-running functions, or those that need calls to external web resources, shouldn't be run in the examples (too much of a burden for CRAN's automatic testing), so they should be marked \dontrun and put in a separate test suite or vignette, as it says in the manual.

### Passing optional arguments

I see authors write functions like this all the time:

myfunction <- function(f, p){
  # stuff
  o <- optim(f, p)
  # stuff
}

calling an existing library function like optim that has a whole host of very useful optional arguments with a significant impact on how the algorithm behaves. Whenever you wrap a rich function like optim, please have the courtesy to make its arguments available to future users and developers through your function call. Yes, most users will just want the default arguments (or your default arguments, if different), and that can be handled just fine by providing default values as optional arguments. R has a fantastic mechanism for this exact issue: the ... argument. The above code could be fixed simply by using:

myfunction <- function(f, p, ...){
  # stuff
  o <- optim(f, p, ...)
  # stuff
}

which works just the way you think it would. If you have more than one such function (ask yourself if you can write shorter functions first, and then) pass optional arguments as lists:

myfunction <- function(f, p, optim_options, fn2_options){
  # stuff
  o <- do.call(optim, as.list(c(f, p, optim_options)))
  # stuff
  b <- do.call(fn2, fn2_options)
  # stuff
}

Arguments can also be extracted with list(...)$arg1, etc.

A converse of this issue is not providing default arguments where it might be natural to do so. This does not bother me so much, as it is probably useful to force the user to think about how many iterations n are appropriate for their problem rather than just assuming that 100 is good because it is the default. The only time this case is annoying is when the argument will not be changing – such as a user's authentication token to access a web resource. Don't make me manually pass the token to every function in the library, please.
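For instance, a pattern I find convenient (a minimal sketch with hypothetical names, not from any particular package) is to let the user register the token once and have every function default to it:

# Store the token once per session (hypothetical package "mypkg")
set_token <- function(token) options(mypkg.token = token)

# Each API call defaults to the stored token but still accepts an override
fetch_records <- function(query, token = getOption("mypkg.token")) {
  if (is.null(token)) stop("No token set; see ?set_token")
  # ... call the web resource using 'token' ...
}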

### Development site and bug tracker

I would really like to see a link to the software development page, such as r-forge or Github. The primary asset in this context is pointing reviewers to an address with a bug-tracking system, where issues can be assigned ticket numbers and readers can transparently see whether a package is being actively maintained. A reader who comes across the paper years later with only an email address that may or may not still work has little way to determine what the latest version of the code is, whether it is actively maintained, or whether earlier versions that may have been used in previous publications suffered from any significant bugs.

### Citing the package

We write software papers with the sometimes-vain hope that they will be cited by users, so authors of such papers should at least follow citation best practices themselves. R includes a native mechanism for providing citations to packages, citation(packagename), including the information for any software paper published along with the package. Be sure to add your own software papers to the CITATION file. More information can be found in my post on Citing R packages.
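A minimal sketch of what such an inst/CITATION file might contain (the title, author, journal, and DOI below are placeholders, not a real reference):

citHeader("To cite mypkg in publications, please use:")
bibentry(
  bibtype = "Article",
  title   = "mypkg: an R package for doing something useful",
  author  = person("Jane", "Doe"),
  journal = "Journal of Open Research Software",
  year    = "2013",
  doi     = "10.xxxx/placeholder"
)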

## “Major Revisions”

These are other things that commonly frustrate me, but fall on a bit more of a continuum of style rather than gross oversights. As such I’m not sure that any one of these things would justify rejection.

### Functionalize the code

Style guides will tell you to keep functions short, not more than a screen or about 20 lines. Breaking large functions into a series of smaller functions, and documenting those smaller functions – even if they are only used internally – is a great help to a reviewer trying to understand what a function is supposed to do and to test that it does what it says. Anyone building on the code base later (most often you yourself) will appreciate the reusable modules.
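As a minimal sketch of the idea (hypothetical function names, not from any particular package), one long fitting routine can be factored into small pieces that can each be documented and tested on their own:

# Instead of one 100-line fit_model(), factor out the steps:
clean_data <- function(df) df[complete.cases(df), ]   # drop rows with missing values

# sd on the log scale so the optimizer keeps it positive
neg_loglik <- function(pars, df) -sum(dnorm(df$y, mean = pars[1], sd = exp(pars[2]), log = TRUE))

fit_model <- function(df, start = c(mean = 0, logsd = 0)) {
  df <- clean_data(df)
  optim(start, neg_loglik, df = df)
}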

### Stable, clean, and complete return objects

An extension of providing optional arguments to functions is to also provide access to all of their return information. To extend the example of wrapping optim, this would involve returning the convergence information. Using object classes and helper functions for return objects helps keep code stable and lets users leverage existing code for similar objects, such as fitting or plotting routines. There is more discussion of this topic, based on my own experiences, in the post "we need more object oriented design in comparative methods".
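As a minimal sketch of what this might look like (my illustration, not code from the post), a wrapper can pass back everything optim reports and attach a simple class so that print or plot methods can be added later:

my_fit <- function(fn, p, ...) {
  o <- optim(p, fn, ...)
  structure(
    list(par = o$par,
         value = o$value,
         convergence = o$convergence,   # don't silently discard this
         message = o$message,
         optim = o),                    # keep the full optim output too
    class = "my_fit")
}

print.my_fit <- function(x, ...) {
  cat("Converged:", x$convergence == 0, "\n")
  print(x$par)
  invisible(x)
}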

### A license

Because CRAN requires a license to be declared in the DESCRIPTION file, R package authors rarely neglect this entirely. A common misconception is that because R itself is primarily dual-licensed under GPL-2 and GPL-3, R packages must use a GPL license due to the "viral" clause of the GPL. This clause only applies if you are modifying existing GPL functions directly and is not a requirement for R packages, which recognize a large array of licenses. My own recommendation for authors seeking to maximize the impact of their work is to use an MIT, BSD (2-clause), or CC0 license for the package. CC0 has the advantage of also being suitable for any data or documentation included, but authors should do their homework and decide what is best for them.
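For example, the choice appears as a single field in DESCRIPTION; a sketch for the MIT case (on CRAN the MIT and BSD templates also expect a small accompanying LICENSE file naming the year and copyright holder):

License: MIT + file LICENSE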

## “Minor Revisions”

Consistent use of coding style, good documentation, clear examples, intelligent reuse of code, and other best practices are all areas in which any work could improve. While we can all become better developers by highlighting these issues in our reviews, such habits are probably best developed over time and in dialog with the user community. I also put anything extending the scope of functionality into this category: I do not have any concept of a minimal contribution as long as the code meets the criteria above. Meanwhile, there are always a few pet peeves I just cannot help mentioning. Here's one which is particular to R packages and commonly overlooked.

### IMPORTS not DEPENDS

Many developers overlook that package dependencies providing functions your functions will use only internally should be listed under IMPORTS rather than DEPENDS. This keeps the user's namespace cleaner and avoids collisions between functions having the same name. Use DEPENDS only for those packages whose functions will be used directly by the end user as well.
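As a sketch, if end users are expected to call ggplot2 functions directly on your return objects but httr is only used inside your own functions, the DESCRIPTION fields might read (package names here are just illustrative):

Depends: R (>= 3.0.0), ggplot2
Imports: httr

The imported functions are then made visible to your own code through the NAMESPACE file (import() or importFrom() directives) without being attached to the user's search path.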

If you are an author, editor, or reviewer of R software packages, what are your pet peeves?

#### manuscript reviews on github?

examples and questions

10 Jun 2013

pageviews: 253

I was recently impressed to learn of Trevor Bedford's strategy of seeking pre-approval for posting his reviewers' comments as Github issues. Beyond providing links to the data and source code, I generally don't advertise the open-science nature of papers I submit – I guess I assume that if the reader or reviewers care, it should be easy enough for them to discover it. Consequently, upon receiving my reviews I am usually immediately frustrated to realize that I have to create a second, private repository for the review material, our replies to reviewers, etc., as I don't have permission to disclose that information. 1 I have recently stumbled across several examples of authors publishing to the web anonymous reviews they have received. Though the reviews are anonymous, I feel the practice is potentially murky without explicit permission, so I would appreciate any insight others have on this.

Trevor’s approach suggests I should consider broaching this question when first submitting my review, so I am puzzling over the best way to do so. One option would be to include such a request in the cover letter. For example,

The authors of this manuscript would like to request that the editor and reviewers indicate in their replies if they consent or decline to have their comments posted anonymously in the public Issues Tracker of this paper.

Does that need more explanation? A link to examples like Trevor’s that have done this before? Do I need to explain the value of having this kind of transparent provenance for the paper? Should I mention how this could give the reviewer more transparent credit? Encourage them to comment directly on Github from their own account?

Does this need the blessing of the journal? How would you feel about such a clause as a reviewer or editor? For a recent review I had done of a paper that was similarly written on Github, I obtained the Journal’s permission to post my review as an issue in their repository. I would love to see more examples of this kind of thing, if anyone has come across them.

## Cover letters for open science manuscripts?

While I lean towards a minimal statement such as the one above, perhaps a cover letter would be a good place to document some of the other open and reproducible features of the manuscript? Or perhaps such statements should be added to the manuscript itself? Among the options, I might point out:

1. The manuscript has been written on Github. Consequently the full drafting and revision history is available, along with graphs of author contributions (which omit authors without Github accounts and may be distorted by trivial line changes)
2. The manuscript has been written with all the code necessary to reproduce the results embedded as a knitr dynamic document. This helps ensure the analysis is always in sync with the results presented in the manuscript and that the research is reproducible. The analysis, figures, and manuscript can be reassembled from scratch by typing make pdf in the repository directory.
3. Code to replicate the analysis and produce each of the figures shown can be found at: (Version-stable link to the appropriate Github pages? Deposit in Figshare/Dryad first?)
4. Data to replicate the analysis and data shown in each of the figures can be found at: (Easiest to link to Github, since the code and data already reside there.
Alternatively I could deposit these in Figshare or Dryad first…)
5. The manuscript, code, data, and documentation are available as an R package in the Github repository.
6. The issues tracker associated with the manuscript’s repository provides a record of this research, including lines of investigation that were resolved into the results presented here, lines that were closed as dead-ends or null results, and outstanding issues for further investigation.
7. The daily lab notebook entries accompanying this research can be found under the project-tag between dates of XX and XX.

Listing all of these would make for a somewhat lengthy cover letter, which might be a bit overwhelming to be useful (or seem more promotional than valuable). Are any of the seven things above worth highlighting in particular?

Perhaps these details could be deferred to a README file in the project’s Github repo, and the cover letter could simply provide a link to the project repository? What, if anything, would appear most useful and accessible to a reviewer unfamiliar with this approach or its potential value? Though elements of this approach have been discussed in the published literature, e.g. Gentleman & Temple Lang (2007) ‘compendium’ concept, Stodden (2009) RRS concept, or Peng (2011) reproducible papers in the Journal of Biostatistics, I’m unsure if pointing a reviewer to these references would be more valuable or more confusing. Let me know what you think.

## References

1. Strictly speaking, the edits to the manuscript in the open repository could also be considered confidential, though at that stage I haven't yet signed the copyright agreements that come with publication, which tend to be quite reasonable even for the traditional subscription-based journals I work with.

#### Semi Analytic Posteriors

05 Jun 2013

pageviews: 23

The difficulty in comparing the nonparametric Bayesian inference against parametric Bayesian inference is ensuring that the poorer performance of the latter is not due to numerical limitations of the MCMC (no one is quite so worried about the cases where the MCMC solution appears to work well…). Convergence is almost impossible to truly establish, and lots of pathologies (correlations between variables, particularly without simultaneous updating) can frustrate it considerably. While multiple chains and long run times are the reasonable default, for simple enough models we can take a more direct approach.

### Generating Model and parameters

Ricker model, parameterized as

$X_{t+1} = X_t r e^{-\beta X_t + \sigma Z_t}$

for random unit normal $$Z_t$$

f <- function(x,h,p)  x * p[1] * exp(-x * p[2])
p <- c(exp(1), 1/10)
K <- 10  # approx, a li'l' less
Xo <- 1 # approx, a li'l' less
sigma_g <- 0.1
z_g <- function() rlnorm(1,0, sigma_g)
x_grid <- seq(0, 1.5 * K, length=50)
N <- 40
set.seed(123)

### Sample Data

x <- numeric(N)
x[1] <- Xo
for(t in 1:(N-1))
x[t+1] = z_g() * f(x[t], h=0, p=p)
qplot(1:N, x)

## Compute the posterior after marginalizing over $$r$$ and $$\sigma$$ parameters:

$P(\beta | X)$

Mt <- function(t, beta) log(x[t+1]) - log(x[t]) + beta * x[t]
beta_grid = seq(1e-5, 2, by=1e-3)

P_B.X <- sapply(beta_grid, function(beta){
Mt_vec = sapply(1:(N-1), Mt, beta)
sum_of_squares <- sum(Mt_vec^2)
square_of_sums <- sum(Mt_vec)^2
0.5 ^ (N/2-1) * (sum_of_squares - square_of_sums/(N-1)) ^ (N/2-1) / gamma(N/2-1)
})

qplot(beta_grid, -log(P_B.X))

Posterior mode is at:

beta_grid[which.min(P_B.X)]
[1] 0.09801

Estimating the Myers model on this data:

$X_{t+1} = Z_t \frac{r X_t^{\theta}}{1 + X_t^{\theta} / K}$

With $$Z_t$$ lognormal, unit mean, std $$\sigma$$.

Marginal distribution over the remaining parameters is a 2D grid:

Mt <- function(t, theta, K) log(x[t+1]) - theta * log(x[t]) + log(1 + x[t] ^ theta / K)
theta_grid = seq(1e-5, 5, length=100)
K_grid = seq(1e-5, 30, length=100)

prob <- function(theta, K){
Mt_vec = sapply(1:(N-1), Mt, theta, K)
sum_of_squares <- sum(Mt_vec^2)
square_of_sums <- sum(Mt_vec)^2
0.5 ^ (N/2-1) * (sum_of_squares - square_of_sums/(N-1)) ^ (N/2-1) / gamma(N/2-1)
}

P_theta_K.X <- sapply(theta_grid, function(theta)
sapply(K_grid, function(k) prob(theta, k)))

require(reshape2)
df = melt(P_theta_K.X)
names(df) = c("theta", "K", "lik")
ggplot(df, aes(theta_grid[theta], K_grid[K], z=-log(lik))) + stat_contour(aes(color=..level..), binwidth=3)

#### Comparison of Nonparametric Bayesian Gaussian Process estimates to the standard Parametric Bayesian approach

04 Jun 2013

pageviews: 3

Plotting and knitr options, (can generally be ignored)

require(modeest)
posterior.mode <- function(x) {
  mlv(x, method="shorth")$M
}

### Model and parameters

Uses the model derived in citet("10.1080/10236190412331335373"), of a Ricker-like growth curve with an allee effect, defined in the pdgControl package,

f <- RickerAllee
p <- c(2, 8, 5)
K <- 10     # approx, a li'l' less
allee <- 5  # approx, a li'l' less

Various parameters defining noise dynamics, grid, and policy costs.

sigma_g <- 0.05
sigma_m <- 0.0
z_g <- function() rlnorm(1, 0, sigma_g)
z_m <- function() 1+(2*runif(1, 0, 1)-1) * sigma_m
x_grid <- seq(0, 1.5 * K, length=50)
h_grid <- x_grid
profit <- function(x,h) pmin(x, h)
delta <- 0.01
OptTime <- 50 # stationarity with unstable models is a tricky thing
reward <- 0
xT <- 0
Xo <- allee + .5 # observations start from
x0 <- K          # simulation under policy starts from
Tobs <- 40

### Sample Data

set.seed(1234)
#harvest <- sort(rep(seq(0, .5, length=7), 5))
x <- numeric(Tobs)
x[1] <- Xo
nz <- 1
for(t in 1:(Tobs-1))
  x[t+1] = z_g() * f(x[t], h=0, p=p)
obs <- data.frame(x = c(rep(0,nz), pmax(rep(0,Tobs-1), x[1:(Tobs-1)])),
                  y = c(rep(0,nz), x[2:Tobs]))
raw_plot <- ggplot(data.frame(time = 1:Tobs, x=x), aes(time,x)) + geom_line()
raw_plot

## Maximum Likelihood

set.seed(12345)
estf <- function(p){
  mu <- f(obs$x, 0, p)
  -sum(dlnorm(obs$y, log(mu), p[4], log=TRUE))
}
par <- c(p[1]*rlnorm(1,0,.4),
         p[2]*rlnorm(1,0,.3),
         p[3]*rlnorm(1,0, .3),
         sigma_g * rlnorm(1,0,.3))
o <- optim(par, estf, method="L", lower=c(1e-5,1e-5,1e-5,1e-5))
f_alt <- f
p_alt <- c(as.numeric(o$par[1]), as.numeric(o$par[2]), as.numeric(o$par[3]))
sigma_g_alt <- as.numeric(o$par[4])
est <- list(f = f_alt, p = p_alt, sigma_g = sigma_g_alt, mloglik=o$value)

Mean predictions

true_means <- sapply(x_grid, f, 0, p)
est_means <- sapply(x_grid, est$f, 0, est$p)

## Non-parametric Bayes

#inv gamma has mean b / (a - 1) (assuming a>1) and variance b ^ 2 / ((a - 2) * (a - 1) ^ 2) (assuming a>2)
s2.p <- c(5,5)
d.p = c(10, 1/0.1)

Estimate the Gaussian Process (nonparametric Bayesian fit)

gp <- gp_mcmc(obs$x, y=obs$y, n=1e5, s2.p = s2.p, d.p = d.p)
gp_dat <- gp_predict(gp, x_grid, burnin=1e4, thin=300)

Show traces and posteriors against priors

plots <- summary_gp_mcmc(gp, burnin=1e4, thin=300)

# Summarize the GP model
tgp_dat <-
  data.frame(x = x_grid,
             y = gp_dat$E_Ef,
             ymin = gp_dat$E_Ef - 2 * sqrt(gp_dat$E_Vf),
             ymax = gp_dat$E_Ef + 2 * sqrt(gp_dat$E_Vf))

## Parametric Bayesian Models

We use the JAGS Gibbs sampler, a recent open source BUGS implementation with an R interface that works on most platforms. We initialize the usual MCMC parameters; see ?jags for details.

All parametric Bayesian estimates use the following basic parameters for the JAGS MCMC:

y <- x
N <- length(x)
jags.data <- list("N", "y")
n.chains <- 6
n.iter <- 1e6
n.burnin <- floor(10000)
n.thin <- max(1, floor(n.chains * (n.iter - n.burnin)/1000))
n.update <- 10

We will use the same priors for process and observation noise in each model,

stdQ_prior_p <- c(1e-6, 100)
stdR_prior_p <- c(1e-6, .1)
stdQ_prior <- function(x) dunif(x, stdQ_prior_p[1], stdQ_prior_p[2])
stdR_prior <- function(x) dunif(x, stdR_prior_p[1], stdR_prior_p[2])

### Parametric Bayes of correct (Allen) model

We initiate the MCMC chain (init_p) using the true values of the parameters p from the simulation. While impossible in real data, this gives the parametric Bayesian approach the best chance at succeeding. y is the timeseries (recall obs has the $$x_t$$, $$x_{t+1}$$ pairs).

The actual model is defined in a model.file that contains an R function that is automatically translated into BUGS code by R2WinBUGS. The file defines the priors and the model. We write the file from R as follows:

K_prior_p <- c(0.01, 40.0)
logr0_prior_p <- c(-6.0, 6.0)
logtheta_prior_p <- c(-6.0, 6.0)

bugs.model <-
paste(sprintf(
"model{
K       ~ dunif(%s, %s)
logr0    ~ dunif(%s, %s)
logtheta    ~ dunif(%s, %s)
stdQ ~ dunif(%s, %s)",
K_prior_p[1], K_prior_p[2],
logr0_prior_p[1], logr0_prior_p[2],
logtheta_prior_p[1], logtheta_prior_p[2],
stdQ_prior_p[1], stdQ_prior_p[2]),

"
iQ <- 1 / (stdQ * stdQ);
r0 <- exp(logr0)
theta <- exp(logtheta)

y[1] ~ dunif(0, 10)
for(t in 1:(N-1)){
mu[t] <- y[t] * exp(r0 * (1 - y[t]/K)* (y[t] - theta) / K )
y[t+1] ~ dnorm(mu[t], iQ)
}
}")
writeLines(bugs.model, "allen_process.bugs")

Write the priors into a list for later reference:

K_prior <- function(x) dunif(x, K_prior_p[1], K_prior_p[2])
logr0_prior <- function(x) dunif(x, logr0_prior_p[1], logr0_prior_p[2])
logtheta_prior <- function(x) dunif(x, logtheta_prior_p[1], logtheta_prior_p[2])
par_priors <- list(K = K_prior, deviance = function(x) 0 * x,
                   logr0 = logr0_prior, logtheta = logtheta_prior,
                   stdQ = stdQ_prior)

We define which parameters to keep track of, and set the initial values of parameters in the transformed space used by the MCMC. We use logarithms to maintain strictly positive values of parameters where appropriate.

jags.params=c("K","logr0","logtheta","stdQ") # be sensible about the order here
jags.inits <- function(){
  list("K"= 8 * rlnorm(1,0, 0.1),
       "logr0"=log(2 * rlnorm(1,0, 0.1) ),
       "logtheta"=log( 5 * rlnorm(1,0, 0.1) ),
       "stdQ"= abs( 0.1 * rlnorm(1,0, 0.1)),
       .RNG.name="base::Wichmann-Hill", .RNG.seed=123)
}

set.seed(1234)
# parallel refuses to take variables as arguments (e.g. n.iter = 1e5 works, but n.iter = n doesn't)
allen_jags <- do.call(jags.parallel,
                      list(data=jags.data, inits=jags.inits, jags.params,
                           n.chains=n.chains, n.iter=n.iter, n.thin=n.thin,
                           n.burnin=n.burnin, model.file="allen_process.bugs"))

# Run again iteratively if we haven't met the Gelman-Rubin convergence criterion
recompile(allen_jags) # required for parallel
allen_jags <- do.call(autojags,
                      list(object=allen_jags, n.update=n.update,
                           n.iter=n.iter, n.thin = n.thin))

#### Convergence diagnostics for Allen model

R notes: this strips classes from the mcmc.list object (so that we have a list of matrices; objects that reshape2::melt can handle intelligently), and then combines chains into one array. In this array each parameter is given its value at each sample from the posterior (index) for each chain.

tmp <- lapply(as.mcmc(allen_jags), as.matrix) # strip classes the hard way...
allen_posteriors <- melt(tmp, id = colnames(tmp[[1]]))
names(allen_posteriors) = c("index", "variable", "value", "chain")

ggplot(allen_posteriors) + geom_line(aes(index, value)) +
  facet_wrap(~ variable, scale="free", ncol=1)

allen_priors <- ddply(allen_posteriors, "variable", function(dd){
  grid <- seq(min(dd$value), max(dd$value), length = 100)
  data.frame(value = grid, density = par_priors[[dd$variable[1]]](grid))
})

ggplot(allen_posteriors, aes(value)) +
stat_density(geom="path", position="identity", alpha=0.7) +
geom_line(data=allen_priors, aes(x=value, y=density), col="red") +
facet_wrap(~ variable, scale="free", ncol=3)

Reshape the posterior parameter distribution data, transform back into original space, and calculate the mean parameters and mean function

A <- allen_posteriors
A$index <- A$index + A$chain * max(A$index) # Combine samples across chains by renumbering index
pardist <- acast(A, index ~ variable)
# pardist <- acast(allen_posteriors[2:3], 1:table(allen_posteriors$variable)[1] ~ variable) # NOT SURE WHY THIS FAILS

# transform model parameters back first
pardist[,"logr0"] = exp(pardist[,"logr0"])
pardist[,"logtheta"] = exp(pardist[,"logtheta"])
colnames(pardist)[colnames(pardist)=="logtheta"] = "theta"
colnames(pardist)[colnames(pardist)=="logr0"] = "r0"

bayes_coef <- apply(pardist,2, posterior.mode)
bayes_pars <- unname(c(bayes_coef["r0"], bayes_coef["K"], bayes_coef["theta"])) # parameters formatted for f
allen_f <- function(x,h,p) unname(RickerAllee(x,h, unname(p[c("r0", "K", "theta")])))
allen_means <- sapply(x_grid, f, 0, bayes_pars)

bayes_pars
[1] 0.01929 7.72010 0.06654

head(pardist)
         K deviance       r0    theta   stdQ
170 21.327    45.03 0.010484 3.826031 0.3737
171 14.261    45.02 0.029587 0.019707 0.4387
172  7.828    40.50 0.129248 2.769482 0.3795
173 29.450    45.66 0.044378 0.006615 0.4461
174 37.580    44.89 0.006334 1.252886 0.3766
175 20.168    45.05 0.033854 0.144171 0.4294

## Parametric Bayes based on the structurally wrong model (Ricker)

K_prior_p <- c(0.01, 40.0)
logr0_prior_p <- c(-6.0, 6.0)

bugs.model <-
paste(sprintf(
"model{
K       ~ dunif(%s, %s)
logr0    ~ dunif(%s, %s)
stdQ ~ dunif(%s, %s)",
K_prior_p[1], K_prior_p[2],
logr0_prior_p[1], logr0_prior_p[2],
stdQ_prior_p[1], stdQ_prior_p[2]),

"
iQ <- 1 / (stdQ * stdQ);
r0 <- exp(logr0)

y[1] ~ dunif(0, 10)
for(t in 1:(N-1)){
mu[t] <- y[t] * exp(r0 * (1 - y[t]/K) )
y[t+1] ~ dnorm(mu[t], iQ)
}
}")
writeLines(bugs.model, "ricker_process.bugs")

Compute prior curves:

K_prior <- function(x) dunif(x, K_prior_p[1], K_prior_p[2])
logr0_prior <- function(x) dunif(x, logr0_prior_p[1], logr0_prior_p[2])
par_priors <- list(K = K_prior, deviance = function(x) 0 * x,
                   logr0 = logr0_prior, stdQ = stdQ_prior)

We define which parameters to keep track of, and set the initial values of parameters in the transformed space used by the MCMC. We use logarithms to maintain strictly positive values of parameters where appropriate.

# Uniform priors on standard deviation terms
jags.params=c("K","logr0", "stdQ")
jags.inits <- function(){
  list("K"=10 * rlnorm(1,0,.5),
       "logr0"=log(1) * rlnorm(1,0,.5),
       "stdQ"=sqrt(0.05) * rlnorm(1,0,.5),
       .RNG.name="base::Wichmann-Hill", .RNG.seed=123)
}

set.seed(12345)
ricker_jags <- do.call(jags.parallel,
                       list(data=jags.data, inits=jags.inits, jags.params,
                            n.chains=n.chains, n.iter=n.iter, n.thin=n.thin,
                            n.burnin=n.burnin, model.file="ricker_process.bugs"))
recompile(ricker_jags)

Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph Size: 251

Initializing model

(output repeated once for each of the 6 chains)

ricker_jags <- do.call(autojags,
                       list(object=ricker_jags, n.update=n.update,
                            n.iter=n.iter, n.thin = n.thin, progress.bar="none"))

#### Convergence diagnostics for parametric bayes Ricker model

tmp <- lapply(as.mcmc(ricker_jags), as.matrix) # strip classes the hard way...
ricker_posteriors <- melt(tmp, id = colnames(tmp[[1]]))
names(ricker_posteriors) = c("index", "variable", "value", "chain")

ggplot(ricker_posteriors) + geom_line(aes(index, value)) +
  facet_wrap(~ variable, scale="free", ncol=1)

ricker_priors <- ddply(ricker_posteriors, "variable", function(dd){
  grid <- seq(min(dd$value), max(dd$value), length = 100)
  data.frame(value = grid, density = par_priors[[dd$variable[1]]](grid))
})
# plot posterior distributions
ggplot(ricker_posteriors, aes(value)) +
stat_density(geom="path", position="identity", alpha=0.7) +
geom_line(data=ricker_priors, aes(x=value, y=density), col="red") +
facet_wrap(~ variable, scale="free", ncol=2)

Reshape posteriors data, transform back, calculate mode and corresponding function.

A <- ricker_posteriors
A$index <- A$index + A$chain * max(A$index) # Combine samples across chains by renumbering index
ricker_pardist <- acast(A, index ~ variable)
ricker_pardist[,"logr0"] = exp(ricker_pardist[,"logr0"]) # transform model parameters back first
colnames(ricker_pardist)[colnames(ricker_pardist)=="logr0"] = "r0"
bayes_coef <- apply(ricker_pardist,2, posterior.mode) # much better estimates from mode than mean
ricker_bayes_pars <- unname(c(bayes_coef["r0"], bayes_coef["K"]))

ricker_f <- function(x,h,p){
sapply(x, function(x){
x <- pmax(0, x-h)
pmax(0, x * exp(p["r0"] * (1 - x / p["K"] )) )
})
}
ricker_means <- sapply(x_grid, Ricker, 0, ricker_bayes_pars[c(1,2)])

head(ricker_pardist)
         K deviance       r0   stdQ
170 30.871    44.59 0.011138 0.4164
171 15.920    44.30 0.003492 0.4062
172  8.630    46.04 0.040291 0.4962
173  7.935    37.59 0.172355 0.3977
174  8.622    42.76 0.096497 0.3325
175  8.279    40.48 0.143866 0.4316
ricker_bayes_pars
[1] 0.008073 8.011228

## Myers Parametric Bayes

logr0_prior_p <- c(-6.0, 6.0)
logtheta_prior_p <- c(-6.0, 6.0)
logK_prior_p <- c(-6.0, 6.0)

bugs.model <-
paste(sprintf(
"model{
logr0    ~ dunif(%s, %s)
logtheta    ~ dunif(%s, %s)
logK    ~ dunif(%s, %s)
stdQ ~ dunif(%s, %s)",
logr0_prior_p[1], logr0_prior_p[2],
logtheta_prior_p[1], logtheta_prior_p[2],
logK_prior_p[1], logK_prior_p[2],
stdQ_prior_p[1], stdQ_prior_p[2]),

"
iQ <- 1 / (stdQ * stdQ);
r0 <- exp(logr0)
theta <- exp(logtheta)
K <- exp(logK)

y[1] ~ dunif(0, 10)
for(t in 1:(N-1)){
mu[t] <- r0 * pow(abs(y[t]), theta) / (1 + pow(abs(y[t]), theta) / K)
y[t+1] ~ dnorm(mu[t], iQ)
}
}")
writeLines(bugs.model, "myers_process.bugs")
logK_prior     <- function(x) dunif(x, logK_prior_p[1], logK_prior_p[2])
logr_prior     <- function(x) dunif(x, logr0_prior_p[1], logr0_prior_p[2])
logtheta_prior <- function(x) dunif(x, logtheta_prior_p[1], logtheta_prior_p[2])
par_priors <- list( deviance = function(x) 0 * x, logK = logK_prior,
logr0 = logr_prior, logtheta = logtheta_prior,
stdQ = stdQ_prior)
jags.params=c("logr0", "logtheta", "logK", "stdQ")
jags.inits <- function(){
list("logr0"=log(rlnorm(1,0,.1)),
"logK"=log(10 * rlnorm(1,0,.1)),
"logtheta" = log(2 * rlnorm(1,0,.1)),
"stdQ"=sqrt(0.5) * rlnorm(1,0,.1),
.RNG.name="base::Wichmann-Hill", .RNG.seed=123)
}
set.seed(12345)
myers_jags <- do.call(jags.parallel,
list(data=jags.data, inits=jags.inits, jags.params,
n.chains=n.chains, n.iter=n.iter, n.thin=n.thin,
n.burnin=n.burnin, model.file="myers_process.bugs"))
recompile(myers_jags)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph Size: 291

Initializing model

(output repeated once for each of the 6 chains)
myers_jags <- do.call(autojags,
list(myers_jags, n.update=n.update, n.iter=n.iter,
n.thin = n.thin, progress.bar="none"))

#### Convergence diagnostics for parametric Bayes Myers model

tmp <- lapply(as.mcmc(myers_jags), as.matrix) # strip classes the hard way...
myers_posteriors <- melt(tmp, id = colnames(tmp[[1]]))
names(myers_posteriors) = c("index", "variable", "value", "chain")

ggplot(myers_posteriors) + geom_line(aes(index, value)) +
facet_wrap(~ variable, scale="free", ncol=1)
par_prior_curves <- ddply(myers_posteriors, "variable", function(dd){
grid <- seq(min(dd$value), max(dd$value), length = 100)
  data.frame(value = grid, density = par_priors[[dd$variable[1]]](grid))
})

ggplot(myers_posteriors, aes(value)) +
  stat_density(geom="path", position="identity", alpha=0.7) +
  geom_line(data=par_prior_curves, aes(x=value, y=density), col="red") +
  facet_wrap(~ variable, scale="free", ncol=3)

A <- myers_posteriors
A$index <- A$index + A$chain * max(A$index) # Combine samples across chains by renumbering index
myers_pardist <- acast(A, index ~ variable)
### myers_pardist <- acast(myers_posteriors[2:3], 1:table(myers_posteriors$variable) ~ variable)
myers_pardist[,"logK"] = exp(myers_pardist[,"logK"]) # transform model parameters back first
myers_pardist[,"logr0"] = exp(myers_pardist[,"logr0"]) # transform model parameters back first
myers_pardist[,"logtheta"] = exp(myers_pardist[,"logtheta"]) # transform model parameters back first
colnames(myers_pardist)[colnames(myers_pardist)=="logK"] = "K"
colnames(myers_pardist)[colnames(myers_pardist)=="logr0"] = "r0"
colnames(myers_pardist)[colnames(myers_pardist)=="logtheta"] = "theta"

bayes_coef <- apply(myers_pardist,2, posterior.mode) # much better estimates
myers_bayes_pars <- unname(c(bayes_coef[2], bayes_coef[3], bayes_coef[1]))
myers_means <- sapply(x_grid, Myer_harvest, 0, myers_bayes_pars)
myers_f <- function(x,h,p) Myer_harvest(x, h, p[c("r0", "theta", "K")])
head(myers_pardist)
    deviance      K      r0 theta   stdQ
170    37.69  32.42 0.32913 2.156 0.2818
171    32.99  51.72 0.20896 2.373 0.3612
172    40.26  52.83 0.22398 2.260 0.4654
173    32.80  76.54 0.12968 2.749 0.3711
174    32.81 216.04 0.04136 3.512 0.3907
175    32.55 330.07 0.02654 3.813 0.3118
myers_bayes_pars
[1] 31.8192  0.1065 34.4760

### Phase-space diagram of the expected dynamics

models <- data.frame(x=x_grid,
                     GP=tgp_dat$y,
                     True=true_means,
                     MLE=est_means,
                     Ricker=ricker_means,
                     Allen = allen_means,
                     Myers = myers_means)
models <- melt(models, id="x")
# some labels
names(models) <- c("x", "method", "value")
# labels for the colorkey too
model_names = c("GP", "True", "MLE", "Ricker", "Allen", "Myers")
colorkey=cbPalette
names(colorkey) = model_names

plot_gp <- ggplot(tgp_dat) +
  geom_ribbon(aes(x,y,ymin=ymin,ymax=ymax), fill="gray80") +
  geom_line(data=models, aes(x, value, col=method), lwd=1, alpha=0.8) +
  geom_point(data=obs, aes(x,y), alpha=0.8) +
  xlab(expression(X[t])) + ylab(expression(X[t+1])) +
  scale_colour_manual(values=cbPalette)
print(plot_gp)

## Goodness of fit

This shows only the mean predictions. For the Bayesian cases, we can instead loop over the posteriors of the parameters (or samples from the GP posterior) to get the distribution of such curves in each case.

require(MASS)
step_ahead <- function(x, f, p){
  h = 0
  x_predict <- sapply(x, f, h, p)
  n <- length(x_predict) - 1
  y <- c(x[1], x_predict[1:n])
  y
}

step_ahead_posteriors <- function(x){
  gp_f_at_obs <- gp_predict(gp, x, burnin=1e4, thin=300)
  df_post <- melt(lapply(sample(100), function(i){
    data.frame(time = 1:length(x), stock = x,
               GP = mvrnorm(1, gp_f_at_obs$Ef_posterior[,i], gp_f_at_obs$Cf_posterior[[i]]),
               True = step_ahead(x,f,p),
               MLE = step_ahead(x,f,est$p),
}), id=c("time", "stock"))
}

ggplot(df_post) + geom_point(aes(time, stock)) +
geom_line(aes(time, value, col=variable, group=interaction(L1,variable)), alpha=.1) +
scale_colour_manual(values=colorkey, guide = guide_legend(override.aes = list(alpha = 1))) 

## Optimal policies by value iteration

Compute the optimal policy under each model using stochastic dynamic programming. We begin with the policy based on the GP model,

MaxT = 1000
# uses expected values from GP, instead of integrating over posterior
#matrices_gp <- gp_transition_matrix(gp_dat$E_Ef, gp_dat$E_Vf, x_grid, h_grid)

# Integrate over posteriors
matrices_gp <- gp_transition_matrix(gp_dat$Ef_posterior, gp_dat$Vf_posterior, x_grid, h_grid)

# Solve the SDP using the GP-derived transition matrix
opt_gp <- value_iteration(matrices_gp, x_grid, h_grid, MaxT, xT, profit, delta, reward)

Determine the optimal policy based on the allen and MLE models

matrices_true <- f_transition_matrix(f, p, x_grid, h_grid, sigma_g)
opt_true <- value_iteration(matrices_true, x_grid, h_grid, OptTime=MaxT, xT, profit, delta=delta)

matrices_estimated <- f_transition_matrix(est$f, est$p, x_grid, h_grid, est$sigma_g)
opt_estimated <- value_iteration(matrices_estimated, x_grid, h_grid, OptTime=MaxT, xT, profit, delta=delta)

Determine the optimal policy based on the Bayesian Allen model

matrices_allen <- parameter_uncertainty_SDP(allen_f, x_grid, h_grid, pardist, 4)
opt_allen <- value_iteration(matrices_allen, x_grid, h_grid, OptTime=MaxT, xT, profit, delta=delta)

Bayesian Ricker

matrices_ricker <- parameter_uncertainty_SDP(ricker_f, x_grid, h_grid, as.matrix(ricker_pardist), 3)
opt_ricker <- value_iteration(matrices_ricker, x_grid, h_grid, OptTime=MaxT, xT, profit, delta=delta)

Bayesian Myers model

matrices_myers <- parameter_uncertainty_SDP(myers_f, x_grid, h_grid, as.matrix(myers_pardist), 4)
myers_alt <- value_iteration(matrices_myers, x_grid, h_grid, OptTime=MaxT, xT, profit, delta=delta)

Assemble the data

OPT = data.frame(GP = opt_gp$D, True = opt_true$D, MLE = opt_estimated$D,
                 Ricker = opt_ricker$D, Allen = opt_allen$D, Myers = myers_alt$D)
colorkey=cbPalette
names(colorkey) = names(OPT)

## Graph of the optimal policies

policies <- melt(data.frame(stock=x_grid, sapply(OPT, function(x) x_grid[x])), id="stock")
names(policies) <- c("stock", "method", "value")

ggplot(policies, aes(stock, stock - value, color=method)) +
  geom_line(lwd=1.2, alpha=0.8) +
  xlab("stock size") + ylab("escapement") +
  scale_colour_manual(values=colorkey)

## Simulate 100 realizations managed under each of the policies

sims <- lapply(OPT, function(D){
  set.seed(1)
  lapply(1:100, function(i)
    ForwardSimulate(f, p, x_grid, h_grid, x0, D, z_g, profit=profit, OptTime=OptTime)
  )
})

dat <- melt(sims, id=names(sims[[1]][[1]]))
dt <- data.table(dat)
setnames(dt, c("L1", "L2"), c("method", "reps"))
# Legend in original ordering please, not alphabetical:
dt$method = factor(dt$method, ordered=TRUE, levels=names(OPT))

ggplot(dt) +
  geom_line(aes(time, fishstock, group=interaction(reps,method), color=method), alpha=.1) +
  scale_colour_manual(values=colorkey, guide = guide_legend(override.aes = list(alpha = 1)))

Profit <- dt[, sum(profit), by=c("reps", "method")]
Profit[, mean(V1), by="method"]

   method     V1
1:     GP 24.908
2:   True 26.532
3:    MLE  4.420
4: Ricker  8.363
5:  Allen  7.041
6:  Myers  7.347

ggplot(Profit, aes(V1)) + geom_histogram() +
  facet_wrap(~method, scales = "free_y") +
  guides(legend.position = "none") +
  xlab("Total profit by replicate")

allen_deviance  <- posterior.mode(pardist[,'deviance'])
ricker_deviance <- posterior.mode(ricker_pardist[,'deviance'])
myers_deviance  <- posterior.mode(myers_pardist[,'deviance'])
true_deviance   <- 2*estf(c(p, sigma_g))
mle_deviance    <- 2*estf(c(est$p, est$sigma_g))
c(allen = allen_deviance, ricker = ricker_deviance, myers = myers_deviance,
  true = true_deviance, mle = mle_deviance)

  allen  ricker   myers    true     mle
  45.26   44.78   34.48  -61.08 -287.60

#### Analytic Marginalizing For Posteriors

04 Jun 2013

pageviews: 12

Consider the model

$X_{t+1} = X_t r e^{-\beta X_t + \sigma Z_t }$

with $$Z_t$$ a unit normal random variable. The likelihood of the sequence of $$T$$ observations of $$X$$ under this model is thus

$P(X | r, \beta, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}^{T-1}} \exp\left(-\frac{\sum_t^{T-1} \left(\log X_{t+1} - \log X_t - \log r + \beta X_t\right)^2 }{2 \sigma^2}\right)$

To integrate out $$r$$, $$P(X | \beta, \sigma) = \int P(X | r, \beta, \sigma ) P(r) dr$$, we'll make this look like a Gaussian in $$\log r$$ by completing the square, getting the square on the outside of the sum.
First we collect all the other terms into the factor $$M_t$$:

$M_t := \log X_{t+1} - \log X_t + \beta X_t$

Also define $$a = \log r$$. Then, after expanding the square inside the sum, we have

$\sum_t^{T-1} \left(\log r - M_t\right)^2 = \sum_t^{T-1} a^2 - 2 \sum_t^{T-1} a M_t + \sum_t^{T-1} M_t^2$

(using the linearity of the summation operator). Use the trick of adding and subtracting $$\left( \sum M_t \right)^2/(T-1)$$ to get:

$= \sum_t^{T-1} M_t^2 -\frac{\left(\sum_t^{T-1} M_t\right)^2}{T-1} + \frac{\left(\sum_t^{T-1} M_t\right)^2}{T-1} - 2 a \sum M_t + (T-1)a^2$

$= \sum_t^{T-1} M_t^2 -\frac{\left(\sum_t^{T-1} M_t\right)^2}{T-1} +(T-1)\left( \left(\frac{\sum_t^{T-1} M_t}{T-1}\right)^2 - \frac{2 a \sum M_t}{T-1} + a^2\right)$

$= \sum_t^{T-1} M_t^2 -\frac{\left(\sum_t^{T-1} M_t\right)^2}{T-1} +(T-1)\left( \frac{\sum_t^{T-1} M_t}{T-1} - a\right)^2$

Returning this expression to our exponential in place of the sum of squares, we have

$P(r, \beta, \sigma | X) = \frac{1}{\sqrt{2 \pi \sigma^2}^{T-1}} \exp\left(-\frac{(T-1)\left( \frac{\sum_t^{T-1} M_t}{T-1} - a\right)^2}{2 \sigma^2}\right) \exp\left(-\frac{\sum_t^{T-1} M_t^2 -\frac{\left(\sum_t^{T-1} M_t\right)^2}{T-1} }{2 \sigma^2}\right)$

Note that the second $$\exp$$ term does not depend on $$a$$. The remaining argument has Gaussian form in $$a$$, so after pulling out the constant terms we can easily integrate it over $$da$$. (Note that we have an implicit uniform prior on $$a$$ here.)

$\int \exp\left(-\frac{\left(\frac{\sum_t^{T-1} M_t}{T-1} - a \right)^2}{2 \sigma^2 (T-1)^{-1} } \right) d a = \sqrt{\frac{2\pi\sigma^2}{T-1}}$

which we can combine with the remaining terms to recover

$\frac{1}{\sqrt{T-1}\left(2 \pi \sigma^2 \right)^{(T-2)/2}} \exp\left(\frac{-\sum M_t^2 + \frac{\left(\sum M_t\right)^2}{T-1}}{2\sigma^2} \right)$

## marginalizing over $$\sigma$$

Now that we have effectively eliminated the parameter $$r$$ from our posterior calculation, we wish to also integrate out the second parameter, $$\sigma$$. Once again we can "integrate by analogy": the expression above, viewed as a function of $$\sigma^2$$, looks like a Gamma distribution,

$\int x^{\alpha - 1} e^{-\beta x} dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)}$

where we take $$\alpha = T/2$$ and $$\beta = \frac{1}{2} \left( \sum M_t^2 - \frac{ \left( \sum M_t \right)^2}{T - 1}\right)$$, leaving us with

$\frac{1}{(T-1)\sqrt{2 \pi}^{T-2} } \frac{\tfrac{1}{2}^{T/2} \left( \sum M_t^2 - \frac{ \left( \sum M_t \right)^2}{T - 1}\right)^{T/2}}{\Gamma(T/2)}$

## Additional recruitment functions

The above derivation can be followed identically for the three-parameter recruitment functions I refer to as the Allen and Myers models, after an appropriate choice of $$M_t$$. In both the Ricker and Allen models we must first reparameterize the model to isolate the additive constant term correctly.

#### Ricker

The original parameterization

$X_{t+1} = X_t e^{r \left( 1 - \frac{X_t}{K}\right)}$

does not partition into the form above.
Taking $$\beta = \tfrac{r}{K}$$ and $$a = r$$, we can write $$M_t$$ as above,

$M_t := \log X_{t+1} - \log X_t + \beta X_t$

#### Myers

$X_{t+1} = \frac{ r X_t^{\theta} }{1 + \frac{X_t^{\theta}}{K} } Z_t$

For $$Z_t$$ lognormal with log-mean zero and log-standard-deviation $$\sigma$$, the log-likelihood takes the same form as above, and thus we can write $$M_t$$ as

$M_t := \log X_{t+1} - \theta \log X_t + \log\left(1 + \frac{X_t^{\theta}}{K} \right)$

#### Allen

The original parameterization

$X_{t+1} = Z_t X_t e^{r \left( 1 - \frac{X_t}{K} \right) \frac{\left(X_t - \theta\right)}{K}}$

does not let us isolate an additive constant (log-mean term) as we did in the example above. Writing the argument of the exponent in standard quadratic form,

$X_{t+1} = Z_t X_t e^{c + b X_t - a X_t^2}$

where

$c = \tfrac{-r\theta}{K}$

$b = \tfrac{r}{K}\left(\tfrac{\theta}{K} + 1\right)$

$a = \tfrac{r}{K^2}$

then

$M_t := \log X_{t+1} - \log X_t - b X_t + a X_t^2$

#### DOI != citable

03 Jun 2013

pageviews: 489

I feel I see this kind of comment almost daily: again and again, researchers suggest that a DOI makes something "citable". And this frustrates me.

Don't get me wrong. I love DOIs, and I love CrossRef. And I bang on the table when I have some old journal article that doesn't yet have a DOI. I use DOIs every day in many ways. I use CrossRef's APIs all the time to draw in metadata for citations in my notebook (through my knitcitations package), and to import metadata into my reference manager, Mendeley. I've written my own implementations in R and ruby, and keep an eye on their exciting new tools on the Crossref Github page. I wrote to bibsonomy when I realized they were not using the CrossRef API to look up metadata by DOIs, and they have now implemented this feature. I use DOIs to look up papers I've come across, and to share content I am reading. (Crossref's DOI shortener is great for this.) I even use DOI-based links to embed semantic information into links and citations of articles. But I still have no idea what researchers mean when they suggest that this makes something citable.

### Some background on DOIs

At its heart, a DOI is a very simple concept. It is a "permanent identifier". All this means is that it is really just a URL redirect. Type http://dx.doi.org/mnn into any browser and get redirected to where the article actually lives. Why does that make it permanent? Because if the journal decides to change their URL structure, the DOI's redirect can just be mapped to the new address and voila, it still works. That is, a DOI is simply a tool to fight link-rot.

So you might ask, why does the ability to remap the address have anything to do with being "permanent?" It doesn't, really. The permanence comes not so much from the technology as from the social contract that goes with it. As CrossRef's Geoffrey Bilder eloquently explains, a publisher can only receive DOIs if they promise to keep these redirects up to date. A publisher who fails to maintain this responsibility would presumably lose their right to receive DOIs. A brilliant, simple, social incentive.

This still does not guarantee permanence – e.g. what would happen to the content if the publisher disappears? That problem is not addressed by the DOI technology itself, but by a robust backup archiving solution, such as CLOCKSS, which provides a geo-politically distributed network of backup copies for many journals.
Again the social contract comes into play – presumably CrossRef would not provide a publisher with DOIs if they did not have such a robust archival solution in place.

So far we have seen two crucial functions of the DOI: as a permanent identifier that can be used to reach the content despite link rot, and as an incentive to maintain good archival backups of the content and the links to it.

### What do we mean by citations, anyway?

So what does this have to do with being citable? Obviously these are nice properties to have for things we cite – but they are by no means a requirement. (As Noam Ross observes, try finding a permanent identifier for "Personal Communication".) Books, reports, and other grey literature frequently appear in citations, as do links to websites. MLA even has guidelines on the proper format to cite a tweet (which, incidentally, comes closer to having a permanent identifier and an archival strategy than most other things in this list). So what do we mean by citable anyway?

But what about the reference list? While a publisher may be just fine including some link to your software, is it really cited if it isn't in the reference list? Journals restrict what appears in the reference list because these references are indexed by the infamous citation counters like Thomson Reuters. (A frequent complaint is that many journals do not similarly index citations appearing in the reference list of the supplementary materials, making it difficult or impossible to give appropriate attribution to large numbers of data providers, for instance.) Does having a DOI address this problem?

#### Citation counts in DOIs

Counting citations depends on who is counting them. The most well-known counter is Thomson Reuters, which has its own process for deciding what gets counted (based on publisher), so no guarantee there. Meanwhile, Google Scholar counts anything meeting its indexing requirements & arbitrary selection. I have recently learned that CrossRef just launched its own internal citation counting, which is available from the CrossRef metadata (totals only for the public; publishers can resolve which articles did the citing…). However, most proposals to make some alternative research product "citable" by giving it a DOI use DataCite DOIs (e.g. figshare, PeerJ Preprints), which lag behind in this feature. Moving the control of citation data beyond the grasp of particular publishing companies like TR is undoubtedly an important step forward. The Open Citation Project is a more comprehensive, if very young, move in this direction. (Hat tip to Martin Fenner for explaining CrossRef citations to me.)

### Additional Metadata

In addition to resolving links, DOI providers also serve a rich collection of metadata about the publication that can be queried by DOI or by other elements like author and title. Rich semantic formats and disambiguation of author names by connections to ORCID IDs are among the many advantages of this. Because many of these tools are publicly accessible through their APIs, it is easy for other developers to build services upon them.

## Conclusions

While DOI providers have done an excellent job in ensuring persistent URLs, archived content, and valuable metadata, these things are largely the product of the social contract between the publisher and the DOI provider. It is not possible for an author or organization to simply "get DOIs" for all their content. But it is not the only way to provide these features, either.
While I understand the value in providing a simple and reliable way to encapsulate each of these concepts as "has a DOI," it also appears to put these features beyond the reach of individual researchers. If issues of persistent URLs, archived content, and rich metadata tools are always reduced to "has a DOI," publishers become the only path to achieve these ends. On the contrary, a rich collection of tools is available to researchers.

So what do we mean when we say a DOI makes something 'citable'? If this is shorthand for the properties we would want in something citable (persistent identifier, archival content, machine-readable metadata), then we should start to recognize other things that share these features. Further innovation requires valuing the features the DOI provides, not simply a "brand name" researchers recognize.

## Alternative tools

In a recent post in a series on technical features of my open notebook, I discuss some of the tools available to address these challenges. In particular:

• The use of PURLs for persistent identifiers
• Git for archival redundancy
• Greycite for metadata extraction

Of course, if you ever need a DOI for a research product, there is always figshare.

#### Admb Basic Example

03 Jun 2013

pageviews: 12

After a few iterations I have a working minimal example (below). I am hoping that ADMB is a bit more robust than vanilla optim out of R as I loop over data sets for the sensitivity analysis (#32). That does not seem to hold in the simple example here; not sure why.

# Learning ADMB

Plotting and knitr options, (can generally be ignored)

### Model and parameters

f <- function(x,h,p) x * exp(p[1] * (1 - x / p[2]) * (x - p[3]) / p[2] )
p <- c(1, 10, 5)
K <- 10  # approx, a li'l' less
Xo <- 6  # approx, a li'l' less

Various parameters defining noise dynamics, grid, and policy costs.

sigma_g <- 0.1
z_g <- function() rlnorm(1,0, sigma_g)
x_grid <- seq(0, 1.5 * K, length=50)
Tobs <- 40
set.seed(123)

### Sample Data

x <- numeric(Tobs)
x[1] <- Xo
for(t in 1:(Tobs-1))
  x[t+1] = z_g() * f(x[t], h=0, p=p)
qplot(1:Tobs, x)

## Maximum Likelihood "by hand"

STABLIZE = 1e-10
n = length(x)
mloglik <- function(pars){
  r = pars[1]; k = pars[2]; c = pars[3]; s = pars[4];
  mu = (x+STABLIZE) * exp( r * (1 - x / (k+STABLIZE)) * (x - c) / (k + STABLIZE));
  mu = pmin(1e100, mu) # avoid infinite values
  f = 0.5 * n * log(2 * pi) + n * log(s + STABLIZE) + 0.5 * sum(x - mu + STABLIZE)^2/ (s + STABLIZE)^2;
  f
}

Starting from the true values we mostly just shrink the noise parameter:

init <- c(p, sigma_g)
mloglik(init) # true minus loglik
[1] -35.72
o <- optim(init, mloglik, method="L", lower=1e-5, upper=1e5)
o$value
[1] -247.6
o$par

[1] 0.9967297 9.9813554 5.1742699 0.0008183

While starting from arbitrary values we still find the optimum.

init <- c(1, 1, 1, 1)
init <- c(p, sigma_g)
mloglik(init)  # true minus loglik

[1] -35.72

o <- optim(init, mloglik, method = "L", lower = 1e-5, upper = 1e5)
o$value
[1] -247.6
o\$par
[1] 0.9967297 9.9813554 5.1742699 0.0008183

Okay, now let's try ADMB. We use R2admb, which is just a convenient way to write our data and parameters into an ADMB file.

# install_github("R2admb", "bbolker", subdir="R2admb") # dev version
library(R2admb)

We still need to define the model using ADMB notation in the procedure section. This is mostly like R or C++, with the exception of special functions such as square in place of ^2, norm2 for the sum of squares, and elem_prod instead of * for the element-wise product of two arrays. The constant pi is given as M_PI, as is typical of C/C++ libraries. Where these other functions are defined I'm not sure, but there are some useful guides to ADMB vector/matrix operations and an (undefined) list of keywords.

The equivalent model

model <-
paste("
PARAMETER_SECTION
vector mu(1,n) // per capita mort prob

PROCEDURE_SECTION
mu = log(x) + r * elem_prod((1 - x / k), (x - c) / k);
f = 0.5 * n * log(2 * M_PI) + n * log(s) + 0.5 * norm2(x - exp(mu)) / square(s);
")
writeLines(model, "model.tpl")

Without explicit handling of overflow errors, ADMB does not give us reliable estimates from arbitrary starting conditions.

setup_admb("/var/admb")
[1] "/var/admb"

df <- data.frame(x=x)
params <- list(r = 1, k = 1, c = 1, s = 1) ## starting parameters
bounds <- list(r = c(1e-10, 1e3), k=c(1e-10, 1e3), c=c(1e-10, 1e3), s = c(1e-5,1e3)) ## bounds
dat <- c(list(n = nrow(df)), df)
m1 <- do_admb("model",
              data = dat,
              params = params,
              bounds = bounds,
              run.opts = run.control(checkparam = "write",
                                     checkdata = "write", clean = FALSE))
m1
Model file: model_gen
Negative log-likelihood: 146.5
Coefficients:
r      k      c      s
0.9992 1.0023 1.0004 9.4369 

But it does fine with good starting values. Hmm… I thought that was supposed to be the other way around…

params <- list(r = 1, k = 10, c = 5, s = .1) ## starting parameters

m1 <- do_admb("model",
              data = dat,
              params = params,
              bounds = bounds,
              run.opts = run.control(checkparam = "write",
                                     checkdata = "write"))
Warning: running command './model_gen > model_gen.out' had status 1
m1
Model file: model_gen
Negative log-likelihood: -423.8
Coefficients:
r         k         c         s
2.025e-10 9.498e+00 1.051e+01 1.000e-05 

Which finds a better optimum (though it is substantially overfit in reality).

Hans suggests adding an error term in the function definitions rather than limiting the bounds or log-transforming the variables:

The most common place where this goes wrong is 1/0, log(0), sqrt(0), pow(0,1), etc. Your suggestion is OK, but usually I prefer to put in log(1e-10+my_expression), sqrt(1e-10+my_expression), pow(1e-10+my_expression,1)

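Applied to the procedure section above, that suggestion would look something like the sketch below. This is my own illustration of Hans's advice, not code from the original post, and I have not re-run the fits with it:

```
PROCEDURE_SECTION
  // add a small constant inside every log and division to avoid 1/0 and log(0)
  mu = log(1e-10 + x) + r * elem_prod((1 - x / (1e-10 + k)), (x - c) / (1e-10 + k));
  f = 0.5 * n * log(2 * M_PI) + n * log(1e-10 + s)
      + 0.5 * norm2(x - exp(mu)) / square(1e-10 + s);
```
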

## Misc

• Finished off posts regarding DOIs and digital archiving. A shorter version appears in my answer to opendata.stackexchange on blog archiving.

• Interesting discussion on using PURL redirects with a SHA hash to link to repositories. For instance, the commit matching our arXiv submission of the ews-review paper is found at cboettig/ews-review/33dfb58e24ceb5861dad7cf756cffe2c5d66a655. If the hash shown in the PURL doesn't match the one at the repository, then we know something has gone wrong, since it is impossible to change the contents without changing the hash. This gives us a version-stable identifier that can always be remapped to this commit, even if Github should disappear. Of course nothing guarantees that the PURL maintainer / package author does this updating, but in principle the system is more robust than simply linking to Github (a minimal verification sketch follows this list).

• In other news, I should add some automatic link checking, #94 to the notebook.
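
Here is the verification sketch mentioned above (my own illustration, assuming the working directory is a local clone of the repository and that git is available on the PATH):

```r
# Sketch: check that the SHA named in the PURL exists as a commit in a local
# clone of cboettig/ews-review (run from inside the cloned repository).
sha <- "33dfb58e24ceb5861dad7cf756cffe2c5d66a655"
ok  <- system(paste0("git cat-file -e ", sha, "^{commit}")) == 0
if (ok) message("Commit found: the PURL still points at verifiable content")
```
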

#### Notebook features: digital archiving

31 May 2013

pageviews: 28

Note: this entry is part of a series of posts which explain some of the technical features of my lab notebook.

Archival preservation of digital scholarly content is an important challenge throughout the research community. As the notebook is the permanent record of my regular scholarly endeavors, it provides the opportunity to experiment with tools and techniques for digital archiving while also better preserving the notebook. In the experimental spirit that drives all my exploration here, I am testing several ways to accomplish this. In so doing, I learn which approaches are easiest to implement, use, and gather feedback from, while also hedging my bets in the event that any given strategy should fail.

Archiving digital content involves two fundamental challenges that can be difficult to satisfy simultaneously: providing a robust backup copy of the content, and providing a consistent location (such as a URL) where the content can be retrieved.

## A custom domain

The simplest archival measure employed in the notebook comes from hosting through my own domain, carlboettiger.info, rather than on an external server. By controlling the domain structure myself, I am not tied to a University server that can be altered by an IT department without my knowledge, thereby breaking my links. When I chose to move platforms, as I did in migrating from Wordpress to Jekyll, I could ensure that links would be appropriately mapped. This was not the case when I started my open notebook on the OpenWetWare platform: links are all mapped to the openwetware.org domain, which I obviously cannot control, even though I could at least migrate my content. HTML redirects make sure links still resolve when I change structure (e.g. carlboettiger.info/archives/211). I don't have to worry about moving my domain when I change institutions, and can seamlessly migrate to a different server or DNS provider to get better rates or uptime performance.

Of course these advantages are also the greatest weaknesses of this approach – they all depend entirely on me. I could make, or forget to make, any number of changes that could cause this all to break. Time has shown that even the best-intentioned researchers are not the best curators of their own data, and no doubt I am no exception. How can the content and its identifying addresses outlive me or my interest in it?

## PURLs: preserving identifiers

PURLs, or Persistent Uniform Resource Locators, provide a DOI-like mechanism for addressing the challenge of link rot. As Geoffrey Bilder eloquently argues, the technological solution is quite simple; the real challenge lies on the social side of the implementation – a social contract under which content providers promise to maintain their identifiers if they want to continue to receive identifiers. Though users must register to be able to create PURLs, PURL does not provide such enforcement.

The PURL approach is a bit more web-native than DOIs: it is more democratic, uses a URL structure, and is built upon widely distributed servers and open-source web technology. Not surprisingly, other web-native systems such as most of the major semantic web ontology providers rely on PURLs; Dublin Core, for example, uses purl.org/dc/terms/. The PURL FAQ provides a great overview.

Implementing PURLs for the notebook was very straightforward. After registering as a user at purl.org, I applied for my own top-level domain, cboettig, which I then mapped to my current domain, carlboettiger.info. By enabling partial redirects, each page on my site can also be resolved using this top-level domain followed by my existing page structure. Following my existing structure is not necessary – I could map each page to an arbitrary path in my domain, but I would have to enter these somewhat manually. While the partial redirect is simpler to implement, it does require that I maintain the rest of the link structure.

In this way, purl.org/cboettig/lab-notebook.html now resolves to carlboettiger.info/lab-notebook.html. Likewise, each page in the notebook can be resolved from the purl.org domain instead of my personal carlboettiger.info domain. Should I ever somehow lose control of carlboettiger.info, I could re-assign my PURL to redirect to the new domain. This provides DOI-like permanent identifiers for every page in the notebook.

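A quick way to confirm that a PURL is still mapped correctly is to follow its redirect and compare the final location with the expected target. A minimal sketch, assuming the httr package (which follows redirects by default):

```r
# Sketch: resolve a PURL and check where it ends up.
library(httr)

resp <- GET("http://purl.org/cboettig/lab-notebook.html")
resp$url                                            # final resolved location
grepl("carlboettiger.info/lab-notebook", resp$url)  # TRUE if the mapping still holds
```
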

## GitHub: preserving content and versions

Committing content to an external repository is the recommended way to avoid link-rot from the user errors and website changes that so frequently plague self-archiving of scholarly content. Keeping multiple copies of content in geographically distinct locations is the time-honored approach of digital archiving. Git and GitHub make this easy. Not only does this mean that a backup copy is publicly available and forkable online, but it is also easy to clone copies on each of the machines I work on and rely on git to keep them in sync. Should Github disappear, a little git remote add and everything will be effortlessly deployed with complete history elsewhere.

The notebook has two Github repositories: the “source-code” consisting of plain-text (markdown) content and Jekyll-based templates on labnotebook, and a second for the rendered HTML cboettig.github.com (which also now hosts the website).

While a custom domain and PURLs provide persistent locators for the content, distributed copies on Git help archive the content itself. Should my domain vanish or Github disappear, copies of the content, complete with version history, would remain distributed across various machines with a copy of the repository. Links to Github would break in that process, unless we had remapped all links from the notebook to Github using PURLs.

## Greycite: Programmatic access and indexing of metadata

I think of good metadata as the third leg of proper digital archiving, in addition to permanent identifiers and backup of the content. We want to be able to point a tool at the permanent identifier / URL of an entry and extract reliable information on the author, title, times published and last modified, keywords, etc., that might be useful in citing or categorizing the content. Providing this information is really the subject of adding semantic metadata to the site, and is covered in another entry in this series. Meanwhile, the Greycite tool and its API are an excellent way to extract this metadata into a variety of useful formats, working much the same way that CrossRef's tool does using DOIs. Here is an example query.

## Robust archiving with figshare

Depositing a copy of the notebook on figshare is one of the most robust archival solutions of which I am currently aware. Not so much because it has the coveted DOI solution to the permanent identifier problem but because it has the promise of CLOCKSS archiving, should anything ever happen to figshare.

Nevertheless, it raises several challenges. The native home for the content is as rendered HTML at my domain; on figshare it becomes raw HTML in an archive completely unassociated with that domain, difficult to view, and divorced from my usual workflow of publishing source code to Github and the rendered website to my domain. It also raises questions of just what to archive and when. I discuss some of these strengths and challenges in a separate post, archiving the lab notebook on figshare: advantages and challenges.

## Conclusions

Digital archiving is a challenging problem that is not completely addressed by any one of these approaches. In the end, robust archiving will be best left in the hands of its experts. Unfortunately, the best examples currently available (such as CLOCKSS, national libraries, etc.) will not archive a researcher’s web page directly. The solutions explored here are not perfect, but they are free and simple to implement. I’d love to hear what others think.

31 May 2013

pageviews: 16

## Robust archiving through CLOCKSS

One of the most comprehensive approaches I have come across so far uses figshare. This offers the most promising avenue for content preservation, but is weakest in managing the URIs and associating them with the original content. All figshare content is archived by CLOCKSS, an international library cooperative providing redundant, geopolitically distributed backups of its archives around the world (and used by many academic journals, both subscription-based and open access). Should figshare vanish from the face of the planet, this would trigger the release of all of its content, which would then resolve through the CLOCKSS servers with the same appearance and at the same URLs as the original figshare content. Presumably the DOIs provided to figshare content would also continue to resolve there.

## Challenges

It would certainly be preferable to have the notebook archived by CLOCKSS directly, since the association with the original online content at carlboettiger.info is lost when the entries are archived on figshare. More problematically, the content as archived on figshare is not recognized by search engines, etc., as separate HTML pages to index, but merely as a bundle of attached text files. On the upside, the content becomes part of the global collection of scientific datasets preserved and indexed by figshare with appropriate metadata, etc., increasing the chances for discovery through that venue. Also, figshare provides a convenient API that can help automate deposition.

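For instance, deposition could in principle be scripted along the lines below. This is a rough sketch only: I am recalling the rfigshare function names and arguments from memory, so treat them as assumptions and check the package documentation before relying on them.

```r
# Rough sketch of automated deposition with the rfigshare package.
# Function names/arguments here are assumptions, not verified against the docs.
library(rfigshare)

fs_auth()                                          # authenticate with figshare
id <- fs_create(title = "Lab notebook, 2013 entries",
                description = "HTML snapshots of notebook entries",
                type = "fileset")
fs_upload(id, "notebook-2013.zip")                 # attach the archived content
fs_make_public(id)                                 # publish (assigns the DOI)
```
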

### What content? What format?

Deciding just what to archive in the figshare database is also less straightforward than it may seem. I have gone through a few iterations:

1. Archiving the markdown.
2. Archiving external images with Data URIs.
3. Archiving the HTML versions of pages alone
4. Archiving the whole git repository, _site HTML included (?)

I began by archiving the markdown files that I write to create each entry. These are plain-text files that can be easily read in any text editor, even if the conventions for rendering them as HTML are lost. As in HTML, figures are linked as external files, and are thus not captured by this solution. To work around this, I adopted the trick of using Data URIs to embed images. This places a base64 encoding of the image in the text itself, which can be rendered by almost any browser as the appropriate image. While this keeps the content together, the long Data URI strings are rather out of place inside a plain-text document. Further, the markdown loses all of the valuable semantics that are automatically added to each page when Jekyll renders them to HTML. As Phil Lord argues, if there's any format that future archivists can read successfully, it will be HTML. Consequently I've settled on archiving the HTML versions of each notebook entry, with images embedded as Data URIs. Each HTML file contains rich metadata in the header, sidebar, and footer that gives more information about the content and its context in the notebook (relative path, categories and tags, timestamps and SHA hashes, etc.). I have archived these entries in annual chunks following the year/month/day directory structure already employed on the site.

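The embedding itself is easy to script. A minimal sketch, assuming the base64enc package (my choice purely for illustration) and a local file figure.png:

```r
# Sketch: embed a PNG figure in HTML as a Data URI.
library(base64enc)

uri <- paste0("data:image/png;base64,", base64encode("figure.png"))
img <- sprintf('<img src="%s" alt="embedded figure" />', uri)
writeLines(img, "figure-embedded.html")
```
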

Lastly there is the concern of preserving the version history of entries. Though figshare provides versioning of its content, this doesn't capture the finer resolution of individual page changes available through the Github repository. At the expense of creating an ever more cumbersome archival object, one could include the .git history, either for the rendered HTML version (which lives at cboettig.github.com) or for the source files used to create it (labnotebook).

Of course this fails to address the preservation of externally linked content. The most frequent outbound links point to other publications, usually through their DOIs, which we hope will take care of themselves. The most important externally linked content in the notebook entries is the links to scripts, functions, and manuscripts in the various project repositories on Github. The simplest solution is to embed the most important scripts in the notebook entries themselves. Archiving the project repositories is an additional challenge, but if a user can recover a copy of the project repository (along with its .git history), then it would be possible to identify the linked file using the SHA hash from these links (by matching it against the SHAs in the log). See my entry on SHA hashes for more on this topic.