Writing out some advice for our GSOC student on XML parsing. (Now filed as RNeXML/#11
Here is some quick background on different ways we might go about extracting NeXML into an R object we want to work with. We can use S4 classes, R’s native data.frame
and list
types, or extract specific terms of interest with xpath. I illustrate each of these below using the example “trees.xml” in your repository.
It would be good to decide on which strategy you want. As you seem to be leaning to the S4 already, note how I have made the slot labels correspond exactly to the attribute and node labels of the NeXML. This lets us use the xmlToS4
method:
Using S4 classes
We can define S4 class structures to correspond to the elements we want to extract. Elements omitted from our definition will be ignored.
setClass("node",
representation(id = "character",
label = "character",
otu = "character",
about = "character",
meta = "meta"))
Note that class “meta” is not yet defined, we will define it below. Note also that we could omit any slots we don’t care about, and they would simply be ignored. (Try deleting the meta
line, or the about
line.)
setClass("edge",
representation(source = "character",
target = "character",
id = "character",
length = "numeric"))
setClass("meta",
representation(id ="character",
property="character",
content="logical",
'xsi:type'="character",
datatype="character"))
setAs("XMLInternalElementNode", "meta", function(from) xmlToS4(from))
In order for xmlToS4
to work on nodes, I need the Coercion method for meta
too, defined above. We are now ready to rock and roll:
require(XML)
doc <- xmlParse("tests/examples/trees.xml")
edges <- getNodeSet(doc, "//x:tree[@id = 'tree1']/x:edge", namespaces="x")
edges <- lapply(edges, xmlToS4)
nodes <- getNodeSet(doc, "//x:tree[@id = 'tree1']/x:node", namespaces="x")
nodes <- lapply(nodes, xmlToS4)
My opinion is that we should not bother writing validator methods etc for these classes we have just defined at this time. We can already validate the XML produced against the schema, and as the class definitions follow those from the schema, we should be reasonably robust. Happy to hear counter-arguments.
Extracting Attributes or values directly by xpath query
A different route does not define class types at all, but just extracts the attributes we want. This is a bit more fast and loose.
edges = xpathSApply(doc, "//x:tree[@id = 'tree1']/x:edge", namespaces = "x", xmlAttrs)
nodes = xpathSApply(doc, "//x:tree[@id = 'tree1']/x:node", namespaces = "x",
function(x)
c(xmlGetAttr(x, "id", NA, as.character),
xmlGetAttr(x, "otu", NA, as.character)))
Coercing into standard R forms
This approach can be very quick and powerful if the data conforms to this structure.
XML:::xmlAttrsToDataFrame(getNodeSet(doc, "//x:tree[@id = 'tree1']/x:node", namespaces="x"))
XML:::xmlAttrsToDataFrame(getNodeSet(doc, "//x:tree[@id = 'tree1']/x:edge", namespaces="x"))
without XPath
We can skip over XPath based expressions by using xmlToList immediately:
nex <- xmlToList(doc)
Unfortunately, this will do some possibly unexpected things: for instance, attributes are converted to list elements (node$id
, node$otu
, etc) on simple nodes (e.g. <nodes>
without a <meta>
node), while if they contain another node, attributes are one sub-list and the containing nodes another (e.g. node$meta
and node$.attr
, etc)
Also note that instead of xpath queries as in the xpathSApply
and getNodeSet
examples above, we can index nodes the way would lists:
e.g. xmlRoot(doc)[["trees"]][["tree"]]
instead of getNodeSet(doc, "//trees/tree")[[1]]
(note that the former returns the first <tree>
node in the <trees>
node, while the latter returns all <tree>
nodes unless we just ask for the [[1]]
So this gives us a table of all node and edge elements:
XML:::xmlAttrsToDataFrame(xmlRoot(doc)[["trees"]][["tree"]])
Which gives us:
id label root otu about source target length
1 n1 n1 true <NA> <NA> <NA> <NA> <NA>
2 n2 n2 <NA> t1 <NA> <NA> <NA> <NA>
3 n3 n3 <NA> <NA> <NA> <NA> <NA> <NA>
4 n4 n4 <NA> <NA> #n4 <NA> <NA> <NA>
5 n5 n5 <NA> t3 <NA> <NA> <NA> <NA>
6 n6 n6 <NA> t2 <NA> <NA> <NA> <NA>
7 n7 n7 <NA> <NA> <NA> <NA> <NA> <NA>
8 n8 n8 <NA> t5 <NA> <NA> <NA> <NA>
9 n9 n9 <NA> t4 <NA> <NA> <NA> <NA>
10 e1 <NA> <NA> <NA> <NA> n1 n3 0.34534
11 e2 <NA> <NA> <NA> <NA> n1 n2 0.4353
12 e3 <NA> <NA> <NA> <NA> n3 n4 0.324
13 e4 <NA> <NA> <NA> <NA> n3 n7 0.3247
14 e5 <NA> <NA> <NA> <NA> n4 n5 0.234
15 e6 <NA> <NA> <NA> <NA> n4 n6 0.3243
16 e7 <NA> <NA> <NA> <NA> n7 n8 0.32443
17 e8 <NA> <NA> <NA> <NA> n7 n9 0.2342
Sometimes this notation is cleaner than the xpath, sort of up to you.