Units in EML
Overview of how units are determined in the EML package:
library(EML)
dat <- data.set(river = factor(c("SAC", "SAC", "AM")),
                spp = c("Oncorhynchus tshawytscha", "Oncorhynchus tshawytscha", "Oncorhynchus kisutch"),
                stg = ordered(c("smolt", "parr", "smolt"), levels = c("parr", "smolt")), # levels are listed in increasing order, e.g. parr < smolt
                ct = c(293L, 410L, 210L),
                day = as.Date(c("2013-09-01", "2013-09-01", "2013-09-02")),
                stringsAsFactors = FALSE,
                col.defs = c("River site used for collection",
                             "Species scientific name",
                             "Life Stage",
                             "count of live fish in traps",
                             "day traps were sampled (usually in morning thereof)"),
                unit.defs = list(c(SAC = "The Sacramento River", # Factor, levels defined explicitly
                                   AM = "The American River"),
                                 "Scientific name", # Character string (levels not defined)
                                 c(parr = "third life stage", # Ordered factor
                                   smolt = "fourth life stage"),
                                 c(unit = "number", precision = 1, bounds = c(0, Inf)), # Integer
                                 c(format = "YYYY-MM-DD", precision = 1))) # Date
The EML package provides a variety of interfaces to transform this into EML format, depending on the level of granularity desired. At the highest level, a user can directly call eml_write (aliased as write.eml to mimic other write file conventions):
eml_write(dat, "example.xml", contact = "Carl Boettiger <cboettig@ropensci.org>")
This call works simply by calling the lower-level functions. If dat is already an S4 eml or dataset object, it is handled directly. Otherwise, the function passes its arguments to the eml constructor function. The main difference between the eml constructor and eml_write is that eml returns an eml S4 object, while eml_write takes the additional step of transforming the S4 eml structure into XML and writing it to the desired file.
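The layering described above can be sketched in base R. This is only an illustration of the pattern, not the EML package's code: the class toy_eml and the functions toy_eml() and toy_write() are hypothetical stand-ins, and the "serialisation" here is a single line of plain text rather than XML.

```r
library(methods)

# Hypothetical stand-in for the S4 eml class
setClass("toy_eml", representation(title = "character", nrows = "integer"))

# Constructor: returns an S4 object and writes nothing
toy_eml <- function(dat, title = "metadata") {
  new("toy_eml", title = title, nrows = nrow(dat))
}

# Writer: reuses an existing object, otherwise calls the constructor,
# then takes the extra step of serialising the object to a file
toy_write <- function(dat, file, ...) {
  obj <- if (is(dat, "toy_eml")) dat else toy_eml(dat, ...)
  writeLines(sprintf("title=%s; rows=%d", obj@title, obj@nrows), file)
  invisible(obj)
}

out <- tempfile(fileext = ".txt")
obj <- toy_write(data.frame(ct = c(293L, 410L, 210L)), out, title = "fish counts")
```

Because the writer returns the object invisibly, the same call can be used for its side effect alone or to keep the constructed object for further inspection.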
The eml function, also available to the user, has more optional named arguments than eml_write, though these can all be given to the higher-level eml_write as well, since they are passed through the ... mechanism. (Its default arguments illustrate calls to some of the lower-level constructors, such as eml_coverage, and will automatically try to read creator and contact from the configuration environment if they are not provided.)
> eml
function (dat = NULL, title = "metadata",
creator = get("defaultCreator", envir = EMLConfig),
contact = get("defaultContact", envir = EMLConfig),
coverage = eml_coverage(scientific_names = NULL,
dates = NULL,
geographic_description = NULL,
NSEWbox = NULL),
methods = new("methods"),
...,
additionalMetadata = c(new("additionalMetadata")),
citation = NULL,
software = NULL,
protocol = NULL)
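The creator and contact defaults in the signature above rely on R evaluating default arguments lazily, so the configuration environment is only consulted when the caller does not supply a value. A minimal sketch of that behaviour, using a hypothetical environment ToyConfig and a hypothetical function describe() in place of EMLConfig and eml:

```r
# Hypothetical configuration environment (the package uses its own EMLConfig)
ToyConfig <- new.env()
assign("defaultCreator", "Jane Doe <jane@example.org>", envir = ToyConfig)

# The default is an unevaluated expression; R evaluates it only when
# `creator` is used and the caller did not supply a value
describe <- function(title = "metadata",
                     creator = get("defaultCreator", envir = ToyConfig)) {
  paste(title, "by", creator)
}

from_config <- describe()                        # falls back to ToyConfig
explicit <- describe(creator = "Someone Else")   # ToyConfig is never read

# Changing the configuration affects later calls that rely on the default
assign("defaultCreator", "New Default", envir = ToyConfig)
after_change <- describe()
```

One consequence of this design is that a missing configuration entry only raises an error at call time, and only for calls that actually fall back on the default.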
Because all other arguments are optional, it is sufficient to call eml with the dat argument alone. The dat object is allowed to be NULL if at least one of the other top-level alternatives to a dataset is provided: citation, software, or protocol.
my_eml <- eml(dat)
The eml function follows the same logic as before. It first constructs a unique identifier for the EML packageId. If dat is already an S4 dataset or dataTable object, it is added immediately to a new("eml") object; otherwise a dataTable object is constructed with the next helper function, eml_dataTable. (The constructors prefixed with eml_ are always just thin wrappers around the direct construction of these S4 objects with new("classname", ...), and exist only to simplify steps which are usually automated, such as creating unique id elements.)
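To make the "thin wrapper" idea concrete, here is a small sketch. Everything in it is hypothetical: the class toy_table, the wrapper toy_table(), and the counter-based make_id() are assumptions for illustration, and the package's actual id scheme is not shown in this write-up.

```r
library(methods)

# Hypothetical class with an id slot, standing in for an EML element
setClass("toy_table", representation(id = "character", name = "character"))

# Hypothetical id generator: a simple counter kept in a closure
make_id <- local({
  n <- 0L
  function() {
    n <<- n + 1L
    sprintf("toy-%04d", n)
  }
})

# Thin wrapper: fills in a generated id, then defers entirely to new()
toy_table <- function(name, id = make_id(), ...) {
  new("toy_table", name = name, id = id, ...)
}

a <- toy_table("river.csv")
b <- toy_table("counts.csv")

# The same object can always be built directly, with the id given by hand
direct <- new("toy_table", name = "river.csv", id = "toy-0001")
```

The wrapper adds no behaviour of its own beyond the default id, so anything it produces can also be produced with new() directly.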
So far our data.set object dat has just been passed unchanged from eml_write to eml to eml_dataTable.
my_dataTable <- eml_dataTable(dat)
This constructor is the first to peer inside the dat object, extracting metadata for the attributeList elements. (The dat object is also passed to the eml_physical constructor, which will use the actual data.frame to write the CSV file.)
The metadata extraction is performed in two steps. First, the helper function detect_class extracts a list of the necessary metadata from the data.set. Then this list is coerced into an EML attributeList. (Note: in this review it becomes clear that coercion is not a flexible or robust way to handle this, so this task is now performed by eml_attributeList and, in turn, eml_attribute, following the same logic as above.) Currently, detect_class takes the legacy format of a data.frame and a list of meta objects, structured as column names, column definitions, and unit definitions (each as character vectors, as in a data.set). Here is where things get tricky: detect_class uses the declared class of the column to decide how to interpret the column and unit metadata, using the following mapping:
numeric or integer: ratio
ordered factor: ordinal/enumeratedDomain
factor: nominal/enumeratedDomain
POSIXlt, POSIXct, Date: dateTime
character: nominal/textDomain
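The mapping above can be written as a short base R function. This is a sketch of the rule as stated here, not the body of detect_class; the name guess_scale is hypothetical, and a plain data.frame stands in for the data.set.

```r
# Hypothetical helper: map a column's declared class to a measurement scale
guess_scale <- function(x) {
  if (is.ordered(x)) {
    # tested before is.factor(), since ordered factors are factors too
    "ordinal/enumeratedDomain"
  } else if (is.factor(x)) {
    "nominal/enumeratedDomain"
  } else if (inherits(x, c("POSIXlt", "POSIXct", "Date"))) {
    "dateTime"
  } else if (is.numeric(x)) {
    # covers both double and integer columns
    "ratio"
  } else if (is.character(x)) {
    "nominal/textDomain"
  } else {
    stop("no mapping for class: ", paste(class(x), collapse = "/"))
  }
}

cols <- data.frame(
  river = factor(c("SAC", "SAC", "AM")),
  spp   = c("Oncorhynchus tshawytscha", "Oncorhynchus tshawytscha", "Oncorhynchus kisutch"),
  stg   = ordered(c("smolt", "parr", "smolt"), levels = c("parr", "smolt")),
  ct    = c(293L, 410L, 210L),
  day   = as.Date(c("2013-09-01", "2013-09-01", "2013-09-02")),
  stringsAsFactors = FALSE
)

# One scale per column, named by column
scales <- vapply(cols, guess_scale, character(1))
```

The order of the tests matters: an ordered factor also satisfies is.factor(), so the more specific check has to come first.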
It looks like it would be best for eml_attribute to handle this mapping itself, particularly since the different conventions bifurcate at different spots (e.g. we must know whether a nominal is enumerated or text). This could also allow for finer handling of optional unit information, such as the bounds or precision.
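One practical detail behind the bounds and precision handling: a unit definition written as c(unit = "number", precision = 1, bounds = c(0, Inf)), as in the example data at the top, is flattened by c() into a named character vector. A sketch of how such a vector could be unpacked again, where parse_unit is a hypothetical helper and not part of the package:

```r
unit_def <- c(unit = "number", precision = 1, bounds = c(0, Inf))

# c() coerces every element to character and splits the two bounds
# into separately named elements, bounds1 and bounds2
flattened_names <- names(unit_def)

# Hypothetical helper: recover typed fields from the flattened vector
parse_unit <- function(u) {
  list(
    unit      = unname(u["unit"]),
    precision = as.numeric(u["precision"]),
    bounds    = as.numeric(u[grep("^bounds", names(u))])
  )
}

parsed <- parse_unit(unit_def)
```

Storing the unit definition as a list rather than an atomic vector would avoid the coercion altogether, at the cost of a slightly more verbose data.set call.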