# Lab Notebook

### (Introduction)

#### Coding

• cboettig pushed to master at cboettig/labnotebook: update site draft update layout tag 05:05 2013/12/06
• cboettig pushed to master at ropensci/reml: eml to rdf 11:46 2013/12/05
• cboettig pushed to master at ropensci/reml: example of how hf205.xml might be built in reml 11:45 2013/12/05
• cboettig commented on issue ropensci/reml#62: @emhart that would be awesome. I've just figured out how to get the text: library(RWordXML) f <- wordDoc("inst/examples/methods.docx") doc <- methods… 09:37 2013/12/05
• cboettig opened issue ropensci/reml#62: Parse a .docx file to get methods and other text 09:23 2013/12/05

#### Discussing

• RT @_inundata: Excited to announce that we’re (@ropensci) writing a book on open science. ETA July 2014. http://t.co/TyIY1WN5aE

10:28 2013/12/02
• RT @researchremix: Great read on CC0 vs CC-BY by @dancohen: http://t.co/7EeR8XdO7Q #openaccess #opendata

10:27 2013/12/02
• RT @pebourne: Who is willing to measure the reproducibility of research in their own lab? We did in this PLOS ONE paper http://t.co/hKZIfhO…

10:24 2013/12/02
• @davidjayharris I'm so glad someone actually looked at the paper links, they are far better than my babble. That one is delightful.

07:03 2013/11/26
• RT @davidjayharris: An actual paper title, via @cboettig: Are exercises like this a good use of anybody's time?

07:02 2013/11/26

• James Wilson White, Louis W Botsford, Alan Hastings et al. 2013. Stochastic models reveal conditions for cyclic dominance in sockeye salmon populations Ecological Monographs 10.1890/12-1796.1
• David Lindenmayer, Gene E. Likens. 2013. Benchmarking Open Access Science Against Good Science Bulletin of the Ecological Society of America 94 4 10.1890/0012-9623-94.4.338
• Elizabeth Eli Holmes, John L Sabo, Steven Vincent Viscido et al. 2007. A statistical approach to quasi-extinction forecasting. Ecology letters 10 12 10.1111/j.1461-0248.2007.01105.x
• Jan Esper, Ulf Büntgen, David C Frank et al. 2007. 1200 Years of Regular Outbreaks in Alpine Insects. Proceedings. Biological sciences / The Royal Society 274 1610 10.1098/rspb.2006.0191
• Santiago Salinas, Simon C. Brown, Marc Mangel et al. 2013. Non-genetic inheritance and changing environments Non-Genetic Inheritance 1 10.2478/ngi-2013-0005

19 Nov 2013


### RNeXML

• Feedback from Rutger: need to add “about” attributes so that the RDFa abstraction references the right level of the DOM (issue #35).
• Looking for strategy for distilling RDF from RDFa in R, see my question on SO. Hopefully don’t have to wrap some C library myself…

### nonparametric-bayes

Writing writing.

• Update pandoc templates to use yaml metadata for author, affiliation, abstract, etc. Avoids having to manually edit the elsarticle.latex template with this metadata. Added fork for my templates, e.g. see my elsarticle.latex. Example metadata in manuscript.

• fixing xtable caption (as argument)

• Extended discussion. Adjustments to figures. See commit log /diffs for details.

Mace (2013), e.g.:

a new kind of ecology is needed that is predicated on scaling up efforts, data sharing and collaboration

hear hear.

• PNAS with a somewhat confused take on error rates, suggesting a revised threshold p-value…

• AmNat Asilomar schedule (pdf) is up.

17 Nov 2013


(From issue #20)

A question of how the user queries that metadata. Currently we have a metadata function that simply extracts all the metadata at the specified level (nexml, otus, trees, tree, etc.) and returns a named character vector in which each name corresponds to the rel or property and each value corresponds to the content or href, e.g.:

birds <- read.nexml("birdOrders.xml")
meta <- get_metadata(birds) 

prints the named vector with the top-level (default-level) metadata elements like so:

> meta
##                                             dc:date
##                                        "2013-11-17"
## "http://creativecommons.org/publicdomain/zero/1.0/"

We can subset this by name, e.g. meta["dc:date"]. This is probably simplest for most R users, though exactly what the namespace prefix means may be unclear if they haven’t worked with namespaces before. (The user can always print a summary of the namespaces and prefixes in the nexml file using birds@namespaces.)

This approach is simple, albeit a bit limited.
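A minimal base R sketch of working with such a named vector. The vector here is a hand-built stand-in for the get_metadata() result, and the "cc:license" name is an assumption for illustration (the printed output above does not show it):

```r
# Stand-in for the value returned by get_metadata(); the "cc:license"
# name is assumed for illustration only.
meta <- c("dc:date"    = "2013-11-17",
          "cc:license" = "http://creativecommons.org/publicdomain/zero/1.0/")

# Subset by the full prefixed name
meta["dc:date"]

# List which namespace prefixes appear among the names
prefixes <- unique(sub(":.*$", "", names(meta)))

# Select every element that belongs to a given prefix
dc_terms <- meta[grepl("^dc:", names(meta))]
```

Anything beyond matching on the literal prefix string (e.g. two prefixes bound to the same URI) is outside what this simple structure can express.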

### XPath queries

The R user has a much more natural and powerful way to handle these issues of prefixes and namespaces using either the XML or rrdf libraries. For instance, if we extract the meta nodes into RDF-XML, we could handle the queries like so:

xpathSApply(meta, "//dc:title", xmlValue)

which uses the namespace prefix defined in the nexml; or

xpathSApply(meta, "//x:title", xmlValue, namespaces=c(x = "http://purl.org/dc/elements/1.1/"))

which binds the custom prefix x to the URI explicitly.

### SPARQL queries

Pretty exciting that we can make arbitrary SPARQL queries of the metadata as well.

library(rrdf)
sparql.rdf(ex, "SELECT ?title WHERE { ?x <http://purl.org/dc/elements/1.1/title> ?title }")

Obviously the XPath or SPARQL queries are more expressive / powerful than drawing out the metadata from the S4 structure directly. On the other hand, because both of these approaches use just the distilled metadata, the original connection between metadata elements and the structure of the XML tree is lost unless stated explicitly. An in-between solution is to use XPath on the nexml XML instead, though I think we cannot make use of the namespaces in that case, since they appear in attribute values rather than structure.

Anyway, it’s nice to have these options in R, particularly for more complex queries where we might want to make some use of the ontology as well. On the other hand, simple presentation of basic metadata is probably necessary for most users.

Would be nice to illustrate with a query that required some logical deduction from the ontology.

#### MBI Day Five Notes

08 Nov 2013


Panel discussion

• Hugh’s question on the usefulness of dynamic vs static models: do we have dynamical systems envy?
• Chris: are temporal dynamics historical artefact, and space the new frontier?
• Hugh: though decision theory is fundamentally temporal. really question of sequential decision vs single decision
• Hugh, on what would be his priority if he had time for new question: Solve the 2 player, 2 step SDP competition closed form.
• Paul: the narrow definitions of “math biology” with 1980s flavor.
• @mathbiopaul: Formulating the hard problems arising in application in an appropriate abstraction that mathematicians will attack.
• Leah raises issue of publishing software and reproducibility
• Julia mentions the journal Environmental Modelling & Software

### pdg-control

Trying to understand pattern of increasing ENPV with increasing stochasticity. Despite having the same optimal policy inferred under increasing stochasticity (i.e. still in Reed’s self-sustaining criterion, below $$\sigma_g$$ of 0.2 or so) the average over simulated replicates is higher. We don’t seem to obtain the theoretical ENPV, but something less, in either case. See code noise_effects.md.

### ropensci

Schema.org defines a vocabulary for datasets (microdata/rdfa)

Rutger gives a one-liner solution for tolweb to nexml using bio-phylo perl library:

perl -MBio::Phylo::IO=parse -e 'print parse->to_xml' format tolweb as_project 1 url 'http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=52643'

Hmm, there’s a journal of Ecological Informatics.

## References

• Rebecca S. Epanchin-Niell, Robert G. Haight, Ludek Berec, John M. Kean, Andrew M. Liebhold, Helen Regan, (2012) Optimal Surveillance And Eradication of Invasive Species in Heterogeneous Landscapes. Ecology Letters 15 803-812 10.1111/j.1461-0248.2012.01800.x
• Rebecca S. Epanchin-Niell, James E. Wilen, (2012) Optimal Spatial Control of Biological Invasions. Journal of Environmental Economics And Management 63 260-270 10.1016/j.jeem.2011.10.003
• J. Esper, U. Buntgen, D. C Frank, D. Nievergelt, A. Liebhold, (2007) 1200 Years of Regular Outbreaks in Alpine Insects. Proceedings of The Royal Society B: Biological Sciences 274 671-679 10.1098/rspb.2006.0191
• Fagan, Meir, Prendergast, Folarin, Karieva, (2001) Characterizing Population Vulnerability For 758 Species. Ecology Letters 4 132-138 10.1046/j.1461-0248.2001.00206.x
• A. R. Hall, A. D. Miller, H. C. Leggett, S. H. Roxburgh, A. Buckling, K. Shea, (2012) Diversity-Disturbance Relationships: Frequency And Intensity Interact. Biology Letters 8 768-771 10.1098/rsbl.2012.0282
• Elizabeth Eli Holmes, John L. Sabo, Steven Vincent Viscido, William Fredric Fagan, (2007) A Statistical Approach to Quasi-Extinction Forecasting. Ecology Letters 10 1182-1198 10.1111/j.1461-0248.2007.01105.x
• Brian Leung, Nuria Roura-Pascual, Sven Bacher, Jaakko Heikkilä, Lluis Brotons, Mark A. Burgman, Katharina Dehnen-Schmutz, Franz Essl, Philip E. Hulme, David M. Richardson, Daniel Sol, Montserrat Vilà, Marcel Rejmanek, (2012) Teasing Apart Alien Species Risk Assessments: A Framework For Best Practices. Ecology Letters 15 1475-1493 10.1111/ele.12003
• A. D. Miller, S. H. Roxburgh, K. Shea, (2011) How Frequency And Intensity Shape Diversity-Disturbance Relationships. Proceedings of The National Academy of Sciences 108 5643-5648 10.1073/pnas.1018594108
• Adam David Miller, Stephen H. Roxburgh, Katriona Shea, (2011) Timing of Disturbance Alters Competitive Outcomes And Mechanisms of Coexistence in an Annual Plant Model. Theoretical Ecology 5 419-432 10.1007/s12080-011-0133-1

07 Nov 2013


## Morning session

• Chadès, I., Carwardine, J., Martin, T.G., Nicol, S., Sabbadin, R. & Buffet, O. (2012) MOMDPs: a solution for modelling adaptive management problems. The Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI-12), pp. 267-273. Toronto, Canada.

• 10.1098/rspb.2013.0325 Migratory connectivity magnifies the consequences of habitat loss from sea-level rise for shorebird populations

#### Jake LaRiviere

presents the challenges of the uncertainty table. Additional challenges in making an apples-to-apples comparison of the benefit of decreasing noise of different systems (e.g. in pricing information?)

#### Me

Some good questions following talk, primarily on BNP part.

• Where does the risk-averse vs risk-prone behavior come from? Adjusting curvature of the uncertainty appropriately.
• Any lessons after the stock has collapsed, e.g. rebuilding a stock rather than maintaining it? (Perhaps, but may face hysteresis in a way the initial collapse does not.)
• Brute-force passive learning?

## Afternoon discussion

1. Is an active learning approach more or less valuable in a changing environment?
2. Embracing surprise: how do we actually do this mathematically?
3. Limitations due to constraints on frequency of updating. (e.g. we don’t get to change harvest, we get to set a TAC once every ten years).
4. Uncertainty affecting net present value vs affecting model behavior

06 Nov 2013


## MBI Workshop

#### Becky Epanchin-Niell

• Motivating example of Rabies spatial control in Switzerland

• 2012 Eco Let: should New Zealand survey for bark beetle? Cost of surveillance, control, and damage. Epanchin-Niell et al. (2012)

• 2012 JEEM spatial spread of star thistle: control spread and eradication as an integer programming problem in deterministic context. (with Wilen) Epanchin-Niell & Wilen (2012)

• Breaking landscape into individual landowners makes it less valuable to control early.

#### Brian Leung

“Data, uncertainty and risk in biological invasions”

• Alien risk assessments, Leung et al. (2012), show a dominance of rank scoring over truly quantitative approaches. Limitations to each.
• It’s not the model complexity, but the implementation interface that poses the real barrier. Also need better integration.

#### Scott Barrett

### PDG Control

• back-and-forth with Paul: is second column redundant? Seems not: NPV when paying penalty that doesn’t exist is: profit under penalty ($$\Pi_0(x,h)$$) minus zero (adjustment cost), while scaling is set such that profit under penalty minus adjustment cost, $$\Pi_0(x,h) - \Pi_1(h, c_2)$$. Also, better to normalize everything against the adjustment-free ENPV ($$\Pi_0$$) than to normalize by the truth/simulation model (which differs in different cases).

See commit log for updated versions.

## References

• Rebecca S. Epanchin-Niell, Robert G. Haight, Ludek Berec, John M. Kean, Andrew M. Liebhold, Helen Regan, (2012) Optimal Surveillance And Eradication of Invasive Species in Heterogeneous Landscapes. Ecology Letters 15 803-812 10.1111/j.1461-0248.2012.01800.x
• Rebecca S. Epanchin-Niell, James E. Wilen, (2012) Optimal Spatial Control of Biological Invasions. Journal of Environmental Economics And Management 63 260-270 10.1016/j.jeem.2011.10.003
• A. R. Hall, A. D. Miller, H. C. Leggett, S. H. Roxburgh, A. Buckling, K. Shea, (2012) Diversity-Disturbance Relationships: Frequency And Intensity Interact. Biology Letters 8 768-771 10.1098/rsbl.2012.0282
• Brian Leung, Nuria Roura-Pascual, Sven Bacher, Jaakko Heikkilä, Lluis Brotons, Mark A. Burgman, Katharina Dehnen-Schmutz, Franz Essl, Philip E. Hulme, David M. Richardson, Daniel Sol, Montserrat Vilà, Marcel Rejmanek, (2012) Teasing Apart Alien Species Risk Assessments: A Framework For Best Practices. Ecology Letters 15 1475-1493 10.1111/ele.12003
• A. D. Miller, S. H. Roxburgh, K. Shea, (2011) How Frequency And Intensity Shape Diversity-Disturbance Relationships. Proceedings of The National Academy of Sciences 108 5643-5648 10.1073/pnas.1018594108
• Adam David Miller, Stephen H. Roxburgh, Katriona Shea, (2011) Timing of Disturbance Alters Competitive Outcomes And Mechanisms of Coexistence in an Annual Plant Model. Theoretical Ecology 5 419-432 10.1007/s12080-011-0133-1

05 Nov 2013


## MBI Workshop

#### Paul Armsworth

Very nice example of control in creating a market for ecosystem services for landowners. Key feature is that multiple landowners respond by adjusting their prices, and so payments can be divided into a fraction going into subsidy and a fraction going into conservation. When one landowner controls land that offers a particularly good cost per unit of conservation accomplished, they also stand to gain the largest value.

Also looked at auction mechanism and impact of cooperation among landowners to create hold outs.

#### Hugh Possingham

Hot spot assignment. Marxan and success while ignoring dynamics. {to what extent is data substitute for dynamics}

Dynamics (but no heterogeneity)

#### Bill Fagan

Linking individual movements to population dynamics. Home range vs migration vs nomadic motion.

#### Afternoon breakout

Looking at role of institutions and tractability of implementation problems. Interesting observation from Lou in the long-tail effect of certain individuals in explaining heterogeneity in implementation success.

### PDG Control

Working on table following issue #41

Based on errors_table.Rmd, plot from plot_table.Rmd

shows the effect of greater noise actually reducing the impact of being wrong (either by ignoring adjustment costs that exist or assuming adjustment costs that don’t exist). Bigger induced reduction in NPV (higher cost) naturally decreases value.

Functional form doesn’t matter when assuming costs that don’t exist, since these are calibrated to be equivalent by selecting the coefficients in order to have equal reduction in NPV. Obviously functional form does matter when ignoring penalties that do exist, and it seems that ignoring L2 penalties is most damaging, ignoring L1 penalties the least damaging?

| | penalty_fn | ignore_cost | ignore_fraction | assume_cost | assume_fraction | sigma_g | reduction |
|---|---|---|---|---|---|---|---|
| 1 | L1 | 14536.09 | 1.00 | 16857.61 | 1.00 | 0.05 | 0.10 |
| 2 | L2 | 11020.10 | 0.76 | 15538.44 | 0.92 | 0.05 | 0.10 |
| 3 | fixed | 13584.97 | 0.92 | 17582.78 | 1.05 | 0.05 | 0.10 |
| 4 | L1 | 9273.66 | 1.09 | 17561.81 | 1.04 | 0.05 | 0.20 |
| 5 | L2 | 11020.10 | 0.76 | 15538.44 | 0.92 | 0.05 | 0.20 |
| 6 | fixed | 8332.45 | 0.84 | 17401.82 | 1.03 | 0.05 | 0.20 |
| 7 | L1 | -641.78 | 0.21 | 15451.25 | 0.92 | 0.05 | 0.30 |
| 8 | L2 | -272586.11 | -24.90 | 11567.12 | 0.69 | 0.05 | 0.30 |
| 9 | fixed | -1213.01 | -0.46 | 15519.95 | 0.92 | 0.05 | 0.30 |
| 10 | L1 | 18281.97 | 1.01 | 20973.56 | 0.99 | 0.20 | 0.10 |
| 11 | L2 | 12462.51 | 0.72 | 19392.85 | 0.91 | 0.20 | 0.10 |
| 12 | fixed | 16691.94 | 1.03 | 20143.80 | 0.95 | 0.20 | 0.10 |
| 13 | L1 | 12788.95 | 0.96 | 20663.21 | 0.97 | 0.20 | 0.20 |
| 14 | L2 | 12462.51 | 0.72 | 19392.85 | 0.91 | 0.20 | 0.20 |
| 15 | fixed | 12399.01 | 0.88 | 21289.21 | 1.00 | 0.20 | 0.20 |
| 16 | L1 | 9099.48 | 0.73 | 19999.07 | 0.94 | 0.20 | 0.30 |
| 17 | L2 | -6337.76 | -0.39 | 17765.33 | 0.83 | 0.20 | 0.30 |
| 18 | fixed | 7399.01 | 0.62 | 21969.84 | 1.03 | 0.20 | 0.30 |
| 19 | L1 | 30227.43 | 1.07 | 31377.07 | 0.94 | 0.50 | 0.10 |
| 20 | L2 | 33472.18 | 1.00 | 33472.18 | 1.00 | 0.50 | 0.10 |
| 21 | fixed | 28926.73 | 1.09 | 30500.85 | 0.91 | 0.50 | 0.10 |
| 22 | L1 | 24157.93 | 1.10 | 31767.92 | 0.95 | 0.50 | 0.20 |
| 23 | L2 | 19091.31 | 0.93 | 29225.46 | 0.87 | 0.50 | 0.20 |
| 24 | fixed | 23573.19 | 1.02 | 30069.80 | 0.90 | 0.50 | 0.20 |
| 25 | L1 | 21086.12 | 1.12 | 32035.93 | 0.96 | 0.50 | 0.30 |
| 26 | L2 | 19091.31 | 0.93 | 29225.46 | 0.87 | 0.50 | 0.30 |
| 27 | fixed | 17209.56 | 0.85 | 32946.24 | 0.98 | 0.50 | 0.30 |
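
One rough way to check the tentative ranking of penalty functions is to summarize ignore_fraction by penalty_fn. The sketch below re-enters that one column by hand from the table (rows 1 to 27, in order) and uses the median rather than the mean, since row 8 is an extreme outlier; the choice of summary statistic is mine, not something from errors_table.Rmd.

```r
# ignore_fraction re-entered from the table; penalty_fn cycles L1, L2, fixed
errors <- data.frame(
  penalty_fn = rep(c("L1", "L2", "fixed"), times = 9),
  ignore_fraction = c(1.00, 0.76, 0.92, 1.09, 0.76, 0.84, 0.21, -24.90, -0.46,
                      1.01, 0.72, 1.03, 0.96, 0.72, 0.88, 0.73, -0.39, 0.62,
                      1.07, 1.00, 1.09, 1.10, 0.93, 1.02, 1.12, 0.93, 0.85)
)

# Median fraction of value retained when an existing penalty is ignored,
# per penalty function, sorted from most to least damaging
med <- tapply(errors$ignore_fraction, errors$penalty_fn, median)
sort(med)
```

On these values the median is lowest for L2 (0.76) and highest for L1 (1.01), which is at least consistent with the ranking guessed above, though it pools across sigma_g and reduction levels.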

## rOpenSci / ecoinformatics

• Exploring strategies for addressing certificate authentication workflow.
• Plans to merge dataone and rdataone, shedding the current rJava dependency and dealing with existing vs new namespacing #1

04 Nov 2013


## Lou Gross

• Big picture: big data, computational challenges.
• “Convergence” as the new interdisciplinary (National Academies)
• Rise of synthesis centers
• “Enabling architecture for next gen life science research” – National Academies report Lou Gross (2013)
• Comp Science for Natural Resource Management – Fuller, Wang, Gross (2007)

• language barriers: What’s a model? Mouse, drosophila? logistic, ricker? GIS map layers? …

Constraints frequently dominate, not the control or the state equation.

Let stakeholders make their own rankings. Scenario analysis vs optimal control. Uncertainties! “Relative assessment protocol” Fuller, Gross, Duke-Sylvester, Palmer. “Testing the robustness of management decisions under uncertainty” (ATLSS modeling)

### Breakouts

notes from my breakout session:

#### Questions

• Sensitivity analysis of Scenario Rankings
• What does a Resilience approach add
• Generalities

#### Tools for decision under uncertainty

• optimization
• threshold planning
• scenario planning
• resilience thinking

#### optimization

• Info gap “theory”
• Satisfiability / mini-max
• model approximation methods
• dynamic programming

• To what extent are these approaches different sides of the same coin?
• Are there truly non-optimization based approaches?
• Almost-optimal approaches
• Including constraints

#### what we do well

• Optimize easy problems
• open loop

#### State-of-the-art

• Starting to: simulate optimal solutions to simple problems under more realistic circumstances
• Starting to find multiple “optima”
• feedback control (SDP)
• large state space

#### Open challenges / what we do poorly

• Dual control under parameter uncertainty (without restrictive assumptions on parameters)
• high-dimensional problems
• multiple stake-holders / game-theory solutions (outside fisheries)
• mapping between control and implementation (partial controllability)
• large action space
• (multiple) delayed effect actions

#### Open challenges: Multiple stake-holder games

• beyond 2-player differential games (with feedback)
• simultaneous player actions

#### Open challenges: adaptive management timescales

• frequency of revisiting decisions
• biological timescales
• political timescales

#### Known nuisances

• curse of dimensionality
• data collection methods
• numerical methods
• local vs global

#### Missed things

• spatial data, using rich data under the curse of dimensionality

## ropensci

Writing out proof-of-principle interface to the dataone REST API, see rdataone and Introduction to the package

Key things:

• We can accomplish handling of certificates from httr, just add config = list(sslcert = ); see ?httr::config, e.g. archive a file with:
# pid is a placeholder for the identifier of the object to archive
httr::PUT(paste0("https://knb.ecoinformatics.org/knb/d1/mn/v1/archive/", pid),
          config = httr::config(sslcert = "/tmp/x509up_u1000"))

Posting new data requires writing a system metadata XML file. Currently have a crude minimal version of this, write_sysmeta.R, should see how dataone package is handling this.

#### RNeXML Semantic Considerations

17 Oct 2013


Some very good feedback from Hilmar as I tackle some of the semantic capabilities of NeXML in RNeXML. As the complete discussion is already archived in the GitHub issue tracker (see in particular #26 and #24), I only paraphrase here.

One of the central advantages we can offer in programmatic generation of NeXML in the R environment is the ability to validate names and enhance the metadata included in the resulting nexml file using programmatic queries to taxonomic name resolution services such as ITIS, EOL, NCBI, as provided in the taxize package.

A subtlety of this approach is discussed in issue #24. Whenever we provide new data, we should also provide future users with the appropriate metadata describing where it came from, and whether we found an exact match or only a close match (perhaps an alternative spelling of the species name was used). Specifying the provenance exactly can become quite verbose; for each taxonomic unit we have:

which corresponds to adding RDFa to the NeXML looking something like:

<otus id="tax1">
<otu label="Struthioniformes" id="t1">
<meta xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:obo="http://purl.obolibrary.org/obo/"
xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:tnrs="http://phylotastic.org/terms/tnrs.rdf#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
class="rdf2rdfa">
typeof="obo:CDAO_0000138">
<meta property="rdfs:label" content="Panthera tigris HQ263408"/>
<meta rel="tnrs:resolvesAs">
<meta class="description" typeof="tnrs:NameResolution">
<meta property="tnrs:matchCount" content="2"/>
<meta rel="tnrs:matches">
<meta class="description" typeof="tnrs:Match">
<meta property="tnrs:acceptedName" content="Panthera tigris"/>
<meta property="tnrs:matchedName" content="Panthera tigris"/>
<meta property="tnrs:score" content="1.0"/>
<meta rel="tc:toTaxon" resource="http://www.ncbi.nlm.nih.gov/taxonomy/9694"/>
<meta rel="tnrs:usedSource">
typeof="tnrs:ResolutionSource">
<meta property="dc:description" content="NCBI Taxonomy"/>
<meta property="tnrs:hasRank" content="3"/>
<meta property="tnrs:sourceStatus" content="200: OK"/>
<meta property="dc:title" content="NCBI"/>
</meta>
</meta>
</meta>
</meta>
<meta rel="tnrs:matches">
<meta class="description" typeof="tnrs:Match">
<meta property="tnrs:acceptedName" content="Megalachne"/>
<meta property="tnrs:matchedName" content="Pantathera"/>
<meta property="tnrs:score" content="0.47790686999749"/>
<meta rel="tc:toTaxon" resource="http://www.tropicos.org/Name/40015658"/>
<meta rel="tnrs:usedSource">
typeof="tnrs:ResolutionSource">
<meta property="dc:description"
content="The iPlant Collaborative TNRS provides parsing and fuzzy matching for plant taxa."/>
<meta property="tnrs:hasRank" content="2"/>
<meta property="tnrs:sourceStatus" content="200: OK"/>
<meta property="dc:title" content="iPlant Collaborative TNRS v3.0"/>
</meta>
</meta>
</meta>
</meta>
<meta rel="dcterms:source">
<meta class="description"
typeof="tnrs:ResolutionRequest">
<meta property="tnrs:submitDate" content="Mon Jun 11 20:25:16 2012"/>
<meta rel="tnrs:usedSource" resource="http://tnrs.iplantcollaborative.org/"/>
<meta rel="tnrs:usedSource" resource="http://www.ncbi.nlm.nih.gov/taxonomy"/>
</meta>
</meta>
<meta property="tnrs:submittedName" content="Panthera tigris"/>
</meta>
</meta>
</meta>
</meta>
</otu>

Why provenance? This can be particularly important in tracking down errors or inconsistencies. For instance, imagine the taxonomic barcode id number we provide for a taxon is later divided into multiple species. Ideally this would be reflected in the updated entries of the barcode service, establishing new id numbers for the split members and identifying that the old id was split; after all, a barcode system is supposed to facilitate addressing these kinds of issues.

Still, this appears much more verbose than say, what TreeBase does when automatically adding identifiers as annotations to the NeXML otu nodes.

Meanwhile, in more concrete terms, we seem to have some consensus on using NCBI taxonomic ids, since its API queries are pretty fast:

<otus id="tax1">
<otu label="Struthioniformes" id="t1">
<meta xsi:type="ResourceMeta" href="http://ncbi.nlm.nih.gov/taxonomy/8798" rel="tc:toTaxon"/>
</otu>

Note this example uses the tc: http://rs.tdwg.org/ontology/voc/TaxonConcept# toTaxon concept to provide an ontological definition of the link as a taxon identifier.

NCBI does not do partial matching, so we simply warn when a user’s taxonomic names do not match an NCBI id, giving them a chance to correct them if in error (either manually or automatically using the partial name matching functions in taxize).

#### Is it time to retire Pagel's lambda?

11 Oct 2013


Pagel’s $$\lambda$$ (lambda) was introduced in Pagel 1999 as a potential measure of “phylogenetic signal,” the extent to which correlations in traits reflect their shared evolutionary history (as approximated by Brownian motion).

Numerous critiques and ready alternatives have not appeared to decrease its popularity. There are many issues with the statistic, some of which I attempt to summarise below.

The $$\lambda$$ statistic is defined by the Brownian motion model together with a transformation of the branch lengths: multiply all internal branches by $$\lambda$$. The motivation for the definition is obvious: for $$\lambda = 1$$ the tree is unchanged and the model is equivalent to Brownian motion, while for $$\lambda = 0$$ the tree becomes a star phylogeny and the model is equivalent to completely independent random walks. $$0 < \lambda < 1$$ provides an intermediate range where the correlations are weaker than expected.
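
In terms of the covariance matrix implied by Brownian motion, this transformation is commonly implemented by multiplying the off-diagonal entries (shared path lengths) by $$\lambda$$ while leaving the diagonal (root-to-tip height) alone, which amounts to shrinking internal branches and lengthening the tips to compensate. A minimal base R sketch, using a hand-built matrix for the four-tip tree (((A_sp:10,B_sp:10):1,C_sp:11):1,D_sp:12); the function name lambda_transform is my own, not from any package:

```r
# Brownian-motion covariance for a four-tip tree of height 12:
# diagonal = root-to-tip distance, off-diagonal = shared path length
tips <- c("A_sp", "B_sp", "C_sp", "D_sp")
V <- matrix(0, 4, 4, dimnames = list(tips, tips))
diag(V) <- 12
V["A_sp", "B_sp"] <- V["B_sp", "A_sp"] <- 2   # A and B share two internal branches
V["A_sp", "C_sp"] <- V["C_sp", "A_sp"] <- 1   # C shares only the deepest one
V["B_sp", "C_sp"] <- V["C_sp", "B_sp"] <- 1

# Scale shared history by lambda, keep each tip's total variance fixed
lambda_transform <- function(V, lambda) {
  out <- V * lambda
  diag(out) <- diag(V)
  out
}

V_star <- lambda_transform(V, 0)    # star phylogeny: no covariance left
V_bm   <- lambda_transform(V, 1)    # unchanged Brownian motion
```

Note how little there is to rescale here: most of the height of this tree sits in the tips, so the off-diagonal entries are small relative to the diagonal even at $$\lambda = 1$$.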

Problem 1: It is biological nonsense to treat tips differently from other edges.

All other problems arise from this. While it is okay that a statistic does not have a corresponding evolutionary model, being part of an explicit model might have helped avoid this silliness. Technically $$\lambda$$ is a model, but one that treats evolution along “tips” as special, as if evolution should follow completely different rules for a species alive today relative to its former evolutionary history. Sounds almost creationist.

Problem 2: The statistic doesn’t measure what it says it measures.

To demonstrate this, we can consider two cases in which phylogeny has the identical effect of explaining trait correlations, and yet have very different lambdas. Consider that Researcher 1 examines the phylogeny in Figure 1 and estimates very little phylogenetic signal, $$\lambda = 0.1$$.

library(ape)
# Write the Newick string to a file, then read it back as a phylo object
cat("(((A_sp:10,B_sp:10):1,C_sp:11):1,D_sp:12);", file = "ex.tre", sep = "\n")
ex <- read.tree("ex.tre")
plot(ex)

Now Researcher 2 discovers closely related sister species of some of the taxa originally studied, as in Figure 2.

cat("((((A_sp:1, A2_sp:1):9,(B_sp:1, B2_sp:1):9):1,(C_sp:1, C2_sp:1):10):1,(D_sp:1, D2_sp:1):11);",
    file = "ex2.tre", sep = "\n")
ex2 <- read.tree("ex2.tre")
plot(ex2)

The traits of sister taxa are very similar (indeed, let us assume the sister species are hard to distinguish morphologically, which is perhaps why they were overlooked by Researcher 1). The OU or BM model estimates made by Researcher 2 will closely agree with those of Researcher 1, since the sister taxa have quite similar traits. Yet the $$\lambda$$ estimate differs greatly: all of a sudden the phylogenetic signal must be quite high!

And yet the underlying evolutionary process by which we have simulated the data has been unchanged! The difference arises because what formerly appeared as long tips have become short tips. How do we interpret a metric that depends so heavily on whether or not all sister species are present in the data? As noted, this problem does not impact other phylogenetic comparative methods to nearly the same extent.
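
A quick way to see how much the sampling changes the picture: under Brownian motion the expected correlation between two tips is the fraction of the root-to-tip path they share. The sketch below hard-codes shared path lengths read off the two Newick strings above (both trees have height 12). It does not estimate $$\lambda$$; it only shows how different the two trees look in terms of shared history.

```r
height <- 12   # root-to-tip distance in both example trees

# Tree 1: the closest pair (A_sp, B_sp) shares only the two internal branches
shared_tree1 <- 1 + 1
# Tree 2: sisters such as (A_sp, A2_sp) share everything except their tips
shared_tree2 <- 9 + 1 + 1

cor_tree1 <- shared_tree1 / height
cor_tree2 <- shared_tree2 / height
round(c(tree1 = cor_tree1, tree2 = cor_tree2), 2)
```

The most closely related pair goes from sharing about a sixth of its history in the first tree to more than nine tenths in the second, purely because the sister species were added.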

Problem 3: The statistic has no notion of timescale or depth in the phylogeny.

In $$\lambda$$ (and other definitions such as Blomberg’s K), phylogenetic signal is an all-or-nothing proposition. If we find that very recently diverged species resemble each other, while species that have diverged for longer than, say, a couple million years show no correlation, is this phylogenetic signal or not? This ‘extinction’ of phylogenetic signal as we go far enough back in time seems like a biologically reasonable concept that is perfectly well expressed in the $$\alpha$$ parameter of the OU model, but is lost in the consideration of $$\lambda$$. If folks really want to estimate a continuous quantity to measure phylogenetic signal, I suggest $$\alpha$$ is a far more meaningful number (note that it has units: 1/time or 1/branch length).

Consider the returning force alpha in the OU model (i.e. stabilizing selection). When alpha is near zero, the model is essentially Brownian (i.e. ‘strong phylogenetic signal,’ where more recently diverged species are more similar on average than distantly related ones). When alpha is very large, traits reflect the selective constraint of the environment rather than their history, and so recently diverged species are no more or less likely to be similar than distant ones (provided all species in question are under the same OU model / same selection strength for the trait in question). The size of alpha gives the timescale over which ‘phylogenetic signal’ is lost (in units of the branch length). Two very recently diverged sister taxa may thus show some phylogenetic correlation because their divergence time is of order 1/alpha, while those with longer divergence times act as if phylogenetically independent, such as in our Figure 2 above. I find this an imperfect but reasonable meaning of phylogenetic signal.
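
To make the timescale argument concrete: assuming a single-optimum OU process that has reached stationarity (an assumption added here for the sketch), the expected correlation between two tips separated by patristic distance $$d$$ is $$e^{-\alpha d}$$, so the correlation halves for every $$\ln(2)/\alpha$$ of additional distance. The value of alpha below is arbitrary, chosen only for illustration:

```r
# Expected trait correlation between two tips under a stationary,
# single-optimum OU model; d is the patristic distance (twice the
# divergence time on an ultrametric tree).
ou_correlation <- function(alpha, d) exp(-alpha * d)

alpha <- 0.5                     # arbitrary illustrative value
half_life <- log(2) / alpha      # distance over which the correlation halves

sisters <- ou_correlation(alpha, d = 2)    # sister tips of length 1
deepest <- ou_correlation(alpha, d = 24)   # split at the root of a height-12 tree
c(half_life = half_life, sisters = sisters, deepest = deepest)
```

With this alpha the sister taxa retain an appreciable correlation while the deepest split is effectively independent, which is the depth-dependent notion of signal that a single $$\lambda$$ cannot express.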

If we restrict $$\lambda$$ to be strictly 1 or 0 these problems are alleviated, though then it is unnecessary to define the statistic as such, since we may instead consider a star tree (sometimes called the “white noise” model of evolution).

#### other such statistics

Pagel’s $$\delta$$ is a transformation on node depth, which is again problematic as there is no meaningfully consistent way to describe what is a node (think about deep speciation events with no present-day ancestor). I believe $$\kappa$$ would also be problematic to interpret as it is a nonlinear transform of branch length (it raises branch length to a power) and thus would have a rather different effect depending on the units in which branch lengths were measured. (For instance, consider the case where the tree is scaled to length unity, so all branch lengths are less than one and thus become shorter with large exponents, vs one in which lengths are all larger than one.) Fortunately these statistics are far less popular than $$\lambda$$.