Using dplyr calls on the back-end of the rfishbase re-write means working around the non-standard evaluation (NSE), as described in the dplyr vignette.
Grab the data I was using for this:
library("dplyr")
downloader::download("https://github.com/cboettig/2015/raw/fc0d9185659e7976927d0ec91981912537ac6018/assets/data/2015-02-06-taxa.csv", "taxa.csv")
all_taxa <- read.csv("taxa.csv")
Consider a simple NSE dplyr call:
x <- filter(all_taxa, Family == 'Scaridae')
The best SE version of this just needs the formula expression, ~, the _ SE version of the function (here filter_), and its .dots argument:
.dots <- list(~Family == 'Scaridae')
x1 <- filter_(all_taxa, .dots=.dots)
identical(x, x1)
[1] TRUE
This lets us treat the arguments (e.g. values of the factor on which we filter) as variables:
family <- 'Scaridae'
.dots <- list(~Family == family)
x2 <- filter_(all_taxa, .dots=.dots)
identical(x, x2)
[1] TRUE
If we want both the key and value to vary, we need to get pretty fancy to subvert the non-standard evaluation:
library(lazyeval)
family <- 'Scaridae'
field <- 'Family'
.dots <- list(interp(~y == x,
                     .values = list(y = as.name(field), x = family)))
x3 <- filter_(all_taxa, .dots=.dots)
identical(x, x3)
[1] TRUE
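To see why this works, it can help to inspect the formula that interp() builds before handing it to filter_(). The following is a minimal sketch that only assumes the lazyeval package is installed; the names f and g are mine, for illustration.

```r
library(lazyeval)

field <- "Family"
family <- "Scaridae"

# as.name() turns the string into a symbol, so it is spliced in as a
# column reference; the plain string is spliced in as a literal value.
f <- interp(~y == x, .values = list(y = as.name(field), x = family))
print(f)

# Without as.name(), both sides of the comparison are string literals,
# so the expression no longer refers to a column at all.
g <- interp(~y == x, .values = list(y = field, x = family))
print(g)
```

Printing both formulas side by side shows that the as.name() call is what makes the key behave like a column name rather than a string.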
A bit more fun is to wrap this into a function that takes an arbitrary number of arguments as name-value pairs:
query <- list(Family = 'Scaridae', SpecCode = 5537)
dots <- lapply(names(query), function(level){
  value <- query[[level]]
  interp(~y == x,
         .values = list(y = as.name(level), x = value))
})
x3 <- filter_(all_taxa, .dots = dots)
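The snippet above builds the list of conditions inline; wrapping it in an actual function might look like the sketch below. The name filter_by_query and the toy_taxa data frame are hypothetical (the rows are made up so the example runs without the download), and it assumes a dplyr version that still exports filter_().

```r
library(dplyr)
library(lazyeval)

# Hypothetical helper: keep the rows of `data` that match every
# name-value pair in `query`.
filter_by_query <- function(data, query) {
  dots <- lapply(names(query), function(level) {
    value <- query[[level]]
    interp(~y == x, .values = list(y = as.name(level), x = value))
  })
  filter_(data, .dots = dots)
}

# Toy stand-in for all_taxa (made-up rows, for illustration only)
toy_taxa <- data.frame(
  Family = c("Scaridae", "Scaridae", "Labridae"),
  SpecCode = c(5537, 1, 2),
  stringsAsFactors = FALSE
)

res <- filter_by_query(toy_taxa, list(Family = "Scaridae", SpecCode = 5537))
print(res)
```

Multiple conditions passed through .dots are combined with a logical AND, so each extra name-value pair narrows the result.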
More fun standardizing NSE
The previous examples show only applications to filter_(). While the general idea is the same, this pattern doesn’t translate directly to other functions, such as mutate_(). Here are some common patterns I’ve adopted when using mutate_(). First consider the familiar NSE usage:
df <- mutate(mtcars, displ_l = disp / 61.0237)
head(df)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb  displ_l
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 2.621932
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 2.621932
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1.769804
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 4.227866
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 5.899347
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 3.687092
Again we use the list(interp(...)) pattern, but note that we specify the name for our new column using setNames (naming the elements of the list).
dots <- setNames(list(lazyeval::interp(~x / y, x = quote(disp), y=61.0237)), "displ_l")
df2 <- mutate_(mtcars, .dots = dots)
identical(df, df2)
[1] TRUE
Of course, the use of y could be skipped in favor of writing the value directly in the formula if it does not need to be a variable.
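Going one step further, both the source column and the new column name can be held in variables. This is a minimal sketch along the same lines; the variable names col, new_col, and df3 are mine, and it again assumes a dplyr version that still exports mutate_().

```r
library(dplyr)
library(lazyeval)

col <- "disp"          # column to transform, given as a string
new_col <- "displ_l"   # name of the column to create
cubic_in_per_l <- 61.0237

# as.name(col) splices in the column symbol; setNames() labels the
# list element, which mutate_() uses as the new column name.
dots <- setNames(
  list(interp(~x / y, x = as.name(col), y = cubic_in_per_l)),
  new_col
)
df3 <- mutate_(mtcars, .dots = dots)
head(df3[[new_col]])
```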
More dplyr patterns
Also thought I would scribble down some other common dplyr patterns I find myself re-using.
- applying a function that returns a data.frame to each element of a list and coercing the combined output to a data.frame:
mylist %>% lapply(myfun) %>% dplyr::bind_rows()
To place this deeper in the hadleyverse, purrr::map could be dropped in for lapply in the above example.
- Another common pattern for me is expand.grid() %>% group_by() %>% do(). Here’s a recent example of mine.
It also includes an example of how to define group_by_all(), since that is usually the grouping I need from an expand.grid() call (that is, I want to apply over all combinations of some parameter settings, etc.).
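For concreteness, here is a small self-contained sketch of both patterns. Everything in it (mylist, myfun, and the parameters r and K) is a made-up placeholder rather than code from the linked example, and the grouping columns are named explicitly instead of using group_by_all().

```r
library(dplyr)

# Pattern 1: apply a data.frame-returning function to each list element,
# then stack the pieces into a single data.frame.
mylist <- list(a = 1:3, b = 4:6)                                  # hypothetical input
myfun <- function(v) data.frame(mean = mean(v), n = length(v))    # hypothetical function
combined <- mylist %>% lapply(myfun) %>% bind_rows()

# Pattern 2: evaluate something over every combination of parameters.
# Grouping by all columns of the grid makes each row its own group,
# so do() runs once per parameter combination.
grid <- expand.grid(r = c(1, 2), K = c(10, 20))
sims <- grid %>%
  group_by(r, K) %>%
  do(data.frame(value = .$r * .$K))

print(combined)
print(sims)
```

The do() step returns the grouping columns alongside whatever data.frame the expression produces, which keeps each result labelled with the parameters that generated it.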
Something I hope is not a common pattern but one I struggled with for a bit: making recursive calls of the above pattern for nested lists. This code in RNeXML illustrates my solution, which required both function recursion and function closure.