Notes

Units in EML

Overview of how units are determined in the EML package:

library(EML)
dat <- data.set(river = factor(c("SAC",  "SAC",   "AM")),
               spp   = c("Oncorhynchus tshawytscha",  "Oncorhynchus tshawytscha", "Oncorhynchus kisutch"),
               stg   = ordered(c("smolt", "parr", "smolt"), levels=c("parr", "smolt")), # levels indicates increasing level, eg. parr < smolt
               ct    = c(293L,    410L,    210L),
               day   = as.Date(c("2013-09-01", "2013-09-1", "2013-09-02")),

               stringsAsFactors = FALSE,

               col.defs = c("River site used for collection",
                            "Species scientific name",
                            "Life Stage",
                            "count of live fish in traps",
                            "day traps were sampled (usually in morning thereof)"),

               unit.defs = list(c(SAC = "The Sacramento River",                         # Factor, levels defined explicitly
                                  AM = "The American River"),
                                "Scientific name",                                      # Character string (levels not defined)
                                c(parr = "third life stage",                            # Ordered factor
                                  smolt = "fourth life stage"),
                                c(unit = "number", precision = 1, bounds = c(0, Inf)),  # Integer
                                c(format = "YYYY-MM-DD", precision = 1)))               # Date

The EML package provides a variety of interfaces to transform this into EML format, depending on the level of granularity desired. At the highest level, as user can directly call eml_write (aliased as write.eml to mimic other write file conventions),

eml_write(dat, "example.xml", contact = "Carl Boettiger <cboettig@ropensci.org")

This call works simply by calling the lower level functions. If dat is already the S4 object representation for eml or dataset object, they are dealt with directly. Otherwise, the function simply passes its arguments to the eml constructor function. The main difference between the eml constructor and write_eml is that the eml function returns an eml S4 object, while the write_eml function takes the additional step of transforming the S4 eml structure into XML and writing it to the desired file.

The eml function, also available to the user, has more optional named arguments than write_eml, though these can all be given to the higher level write_eml as well since they are passed through the ... mechanism. (Its default arguments illustrate calls to some of the lower-level constructors, such as eml_coverage, and will automatically try and read creator and contact from the configuration environment if they are not provided)

> eml
function (dat = NULL, title = "metadata",
          creator = get("defaultCreator", envir = EMLConfig),
          contact = get("defaultContact", envir = EMLConfig),
          coverage = eml_coverage(scientific_names = NULL,
                                  dates = NULL,
                                  geographic_description = NULL,
                                  NSEWbox = NULL),
          methods = new("methods"),
          ...,
          additionalMetadata = c(new("additionalMetadata")),
          citation = NULL,
          software = NULL,
          protocol = NULL)

Because all other arguments are optional, it is sufficient to call this function with the dat argument alone. The data object is allowed to be NULL if at least one of the other top-level alternatives to a dataset is provided: citation, software or protocol.

my_eml <- eml(dat)

This function follows the same logic as before: The function first constructs a unique identifier for the EML packageId. If dat is already an S4 dataset or dataTable object it is added immediately to a new("eml" object; otherwise a dataTable object is constructed with the next helper function, eml_dataTable. (These constructors prefaced with (eml_) are always just thin wrappers around the direct construction of these S4 objects with new("classname", ...), and exist just to simplify certain things wich are usually and frequently automated, such as creating unique ide elements.)

So far our data.set object dat is just passed unchanged from write_eml to eml to eml_dataTable.

my_dataTable <- eml_dataTable(dat)

This constructor is the first to peer inside the dat object, extracting metadata for the attributeList elements. (The dat object is also passed to the eml_physical constructor, which will use the actual data.frame to write the csv file.)

The metadata extraction is performed in two steps. First, the helper function detect_class extracts a list of the necessary metadata from the data.set. Then this list is coerced into an EML attributeList (Note in this review it becomes clear that the coercion is not a flexible and robust way to handle this, so this task is now performed by eml_attributeList and in turn, eml_attribute, following the same logic as above.) Currently, detect_class takes the legacy format of having a data.frame and a list of meta objects, structured as column name, col definitions, and unit defintions (each as character vectors, like in a data.set. Here is where things get tricky. detect_class uses the declared class of the column to decide how to interpret the column and unit metadata, using the following mapping:

numeric or integer : ratio
ordered factor: ordinal/enumeratedDomain
factor : nominal/enumeratedDomain
POSIXlt, POSIXct, Date : dateTime
character : nominal/textDomain

Looks like to would be best for eml_attribute to handle this mapping itself, particulary since the different conventions bifurcate at different spots (e.g. we must know if a nominal is enumerated or text). This could also allow for finer handling of optional unit information, such as the bounds or precision.