Accessing the database
Connect to the database (connection info is public), works fine:
library("dplyr")
library("RPostgreSQL")
library("tidyr")
library("ggplot2")
mydb <- src_postgres(dbname = "srdb",
host="nautilus-vm.mathstat.dal.ca",
user = "srdbuser",
password = "srd6us3r!",
port = 5432)
mydb
src: postgres 8.4.10 [srdbuser@nautilus-vm.mathstat.dal.ca:5432/srdb]
tbls: area, assessment, assessmethod, assessor, biometrics,
bioparams, brptots, fishbasesaupcodes, geometry_columns, lmerefs,
lmes, lmetostocks, management, mostrecent, recorder, referencedoc,
reference_point_units_view, reference_point_values_view, risfields,
risfieldvalues, spatial_ref_sys, stock, taxonomy, timeseries,
timeseries_units_view, timeseries_values_view, tsmetrics,
tsrelative_explicit_view
However, the expected mechanism for accessing a complete table seems to fail:
tbl(mydb, "stock")
Error in postgresqlExecStatement(conn, statement, ...): RS-DBI driver: (could not Retrieve the result : ERROR: relation "stock" does not exist
LINE 1: SELECT * FROM "stock" WHERE 0=1
^
)
Filed as a bug report in RPostgres/#32.
Meanwhile, direct sql queries work (note we need the full table address, e.g. dbname.tablename
.)
tbl(mydb, sql("SELECT * FROM srdb.stock"))
Source: postgres 8.4.10 [srdbuser@nautilus-vm.mathstat.dal.ca:5432/srdb]
From: <derived table> [?? x 8]
stockid tsn scientificname commonname
1 COD1 164712 Gadus morhua Atlantic cod
2 COD2J3KL 164712 Gadus morhua Atlantic cod
3 COD2J3KLIS 164712 Gadus morhua Atlantic cod
4 COD3M 164712 Gadus morhua Atlantic cod
5 COD3NO 164712 Gadus morhua Atlantic cod
6 COD3Ps 164712 Gadus morhua Atlantic cod
7 COD3Pn4RS 164712 Gadus morhua Atlantic cod
8 COD4T 164712 Gadus morhua Atlantic cod
9 COD4VsW 164712 Gadus morhua Atlantic cod
10 COD4TVn 164712 Gadus morhua Atlantic cod
.. ... ... ... ...
Variables not shown: areaid (chr), stocklong (chr), inmyersdb (int),
myersstockid (chr)
Since these tables easily fit into memory, it is generally faster to just import them into R rather than leaving dplyr
to just work with them remotely. The dplyr::collect()
function does this. So we create local copies of each table of interest like so:
timeseries <- collect(tbl(mydb, sql("SELECT * FROM srdb.timeseries")))
values <- collect(tbl(mydb, sql("SELECT * FROM srdb.timeseries_values_view")))
units <- collect(tbl(mydb, sql("SELECT * FROM srdb.timeseries_units_view")))
assessment <- collect(tbl(mydb, sql("SELECT * FROM srdb.assessment")))
area <- collect(tbl(mydb, sql("SELECT * FROM srdb.area")))
stock <- collect(tbl(mydb, sql("SELECT * FROM srdb.stock")))
method <- collect(tbl(mydb, sql("SELECT * FROM srdb.assessmethod")))
assessor <- collect(tbl(mydb, sql("SELECT * FROM srdb.assessor")))
management <- collect(tbl(mydb, sql("SELECT * FROM srdb.management")))
taxonomy <- collect(tbl(mydb, sql("SELECT * FROM srdb.taxonomy")))
lmerefs <- collect(tbl(mydb, sql("SELECT * FROM srdb.lmerefs")))
lmestock <- collect(tbl(mydb, sql("SELECT * FROM srdb.lmetostocks")))
biometrics <- collect(tbl(mydb, sql("SELECT * FROM srdb.biometrics")))
bioparams <- collect(tbl(mydb, sql("SELECT * FROM srdb.bioparams")))
tsmetrics <- collect(tbl(mydb, sql("SELECT * FROM srdb.tsmetrics")))
Many of the tables contain observations of variables that describe each given assessment (assessid
), including the species stock
assessed, area
assessed, the method used, and so forth. Since these all follow the same schema of a row being a unique stock assessment and a column being an attribute of that assessment, it makes sense to combine this into a single metadata table. (Especially as these datasets fit so easily into memory anyway.)
meta <- assessment %>%
rename(methodshort = assessmethod) %>%
left_join(method) %>%
left_join(stock) %>%
left_join(area) %>%
left_join(units) %>%
left_join(assessor) %>%
left_join(management) %>%
left_join(taxonomy)
all_areas <- stock %>%
select(stockid, areaid) %>%
left_join(area) %>%
right_join(lmestock) %>%
left_join(lmerefs)
sapply(all_areas, function(x) length(levels(factor(x))))
stockid areaid country
391 174 10
areatype areacode areaname
24 167 167
alternateareaname lme_number stocktolmerelation
5 34 5
lme_name
34
bioparams
: Fixed parameter values of a study.biometrics
: definitions of said parameters.
The column is called biounique
in biometrics
table but bioid
in bioparams
table, so we fix that:
parameters <- biometrics %>%
rename(bioid = biounique) %>%
left_join(bioparams)
tsmetrics
defines the factor levels and the units used intimeseries
tsid
column.
tsmetrics <- tsmetrics %>% rename(tsid = tsunique)
For example, we can see what measurements are available
ids <- timeseries %>%
filter(assessid == "NWFSC-COWCODSCAL-1900-2007-BRANCH") %>%
distinct(tsid) %>%
select(tsid)
ids
Source: local data frame [7 x 1]
tsid
1 SSB-MT
2 R-E03
3 F-1/yr
4 TB-MT
5 TC-MT
6 TL-MT
7 STB1+-MT
Looking up these ids in the tsmetrics
table tells us what these seven time series are:
inner_join(ids, tsmetrics) %>% select(tsshort, tslong, tsunitsshort, tsunitslong)
Source: local data frame [7 x 4]
tsshort
1 SSB
2 R
3 F
4 TB
5 TC
6 TL
7 STB1+
Variables not shown: tslong (chr), tsunitsshort (chr), tsunitslong
(chr)
cowcod <- timeseries %>%
filter(assessid == "NWFSC-COWCODSCAL-1900-2007-BRANCH") %>%
left_join(tsmetrics) %>%
ggplot(aes(tsyear, tsvalue, col=tsid)) + geom_line()
cowcod
cowcod + scale_y_log10()
(Note TC (total catch) and TL (total landings) are equivalent in this context, implying neglible discards.)
Unfortunately, there is a lot of heterogeneity in the metrics measured by each assessment: tsmetrics
defines 151 units, (though only 93 appear in timeseries)
length(unique(tsmetrics$tsid))
[1] 151
length(table(timeseries$tsid))
[1] 93
Most are variations differing only by units, as we see from the most commonly used metrics:
unit_occurs <-
timeseries %>%
group_by(tsid) %>%
distinct(assessid) %>%
summarize(occurs = n()) %>%
left_join(tsmetrics) %>%
arrange(desc(occurs)) %>%
select(tsid, tslong, tsunitsshort, occurs)
unit_occurs
Source: local data frame [93 x 4]
tsid
1 TB-MT
2 SSB-MT
3 R-E03
4 TC-MT
5 TL-MT
6 F-1/T
7 F-1/yr
8 ER-ratio
9 Yield-SSB-dimensionless
10 YEAR-yr
.. ...
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)
These are all variations of the same several variables, but measured in different units. For instance, we see many series use a catch to biomass ratio (ER) instead of a fishing mortality.
unit_occurs %>% filter(grepl("^SSB", tsid))
Source: local data frame [12 x 4]
tsid
1 SSB-MT
2 SSB-E03eggs
3 SSB-STDDEV-MT
4 SSB-1-MT
5 SSB-2-MT
6 SSB-relative
7 SSB-E03
8 SSB-E06larvae
9 SSB-3-MT
10 SSB-4-MT
11 SSB-E03pertow
12 SSB-FemaleGonadMT
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)
unit_occurs %>% filter(grepl("^R", tsid))
Source: local data frame [7 x 4]
tsid
1 R-E03
2 R-1-E03
3 R-2-E03
4 R-MT
5 R-relative
6 R-3-E03
7 R-4-E03
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)
unit_occurs %>% filter(grepl("^TC", tsid))
Source: local data frame [7 x 4]
tsid
1 TC-MT
2 TC-E03
3 TC-E00
4 TC-1-E03
5 TC-1-MT
6 TC-2-MT
7 TC-2-E03
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)
unit_occurs %>% filter(grepl("^TL", tsid))
Source: local data frame [5 x 4]
tsid
1 TL-MT
2 TL-1-MT
3 TL-2-MT
4 TL-3-MT
5 TL-E00
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)
unit_occurs %>% filter(grepl("^CPUE", tsid))
Source: local data frame [10 x 4]
tsid
1 CPUE-kgpertow
2 CPUEstand-1-C/E
3 CPUEstand-2-C/E
4 CPUEstand-C/E
5 CPUEraw-C/E
6 CPUEstand-3-C/E
7 CPUEstand-4-C/E
8 CPUEsmooth-E00pertow
9 CPUEstand-5-C/E
10 CPUEstand-6-C/E
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)
The values
table appears to be derived from the timeseries
table, presumably standardizing on consistent metrics(?)
values
Source: local data frame [16,308 x 9]
assessid tsyear pt_avail ssb r
1 ADFG-HERRPWS-1980-2006-COLLIE 1980 5 48270.35 414070
2 ADFG-HERRPWS-1980-2006-COLLIE 1981 5 51090.88 335200
3 ADFG-HERRPWS-1980-2006-COLLIE 1982 5 47402.54 112300
4 ADFG-HERRPWS-1980-2006-COLLIE 1983 5 56449.01 103740
5 ADFG-HERRPWS-1980-2006-COLLIE 1984 5 64461.28 1062360
6 ADFG-HERRPWS-1980-2006-COLLIE 1985 5 79124.61 99630
7 ADFG-HERRPWS-1980-2006-COLLIE 1986 5 65601.06 74880
8 ADFG-HERRPWS-1980-2006-COLLIE 1987 5 70646.42 84820
9 ADFG-HERRPWS-1980-2006-COLLIE 1988 5 93508.20 1006440
10 ADFG-HERRPWS-1980-2006-COLLIE 1989 5 105135.41 119970
.. ... ... ... ... ...
Variables not shown: total (dbl), f (dbl), cpue (dbl), catch_landings
(dbl)
We can see this by transforming our example:
x <- timeseries %>%
filter(assessid == "NWFSC-COWCODSCAL-1900-2007-BRANCH") %>%
spread(tsid, tsvalue) %>%
rename(ssb=`SSB-MT`, r = `R-E03`, total = `TB-MT`, f = `F-1/yr`, catch_landings = `TL-MT`) %>%
select(assessid, tsyear, ssb, r, total, f, catch_landings)
which is indeed identical to corresponding assessment in the values
table
y <- values %>% filter(assessid == "NWFSC-COWCODSCAL-1900-2007-BRANCH") %>% select(-pt_avail, -cpue)
identical(x,y)
[1] TRUE
So what happens when the units differ?
timeseries %>% filter(tsid=="SSB-E03eggs") %>% distinct("assessid")
Source: local data frame [1 x 5]
assessid tsid tsyear tsvalue
1 TAFI-TASGIANTCRABTAS-1990-2007-JENSEN SSB-E03eggs 1998 188.9256
Variables not shown: "assessid" (chr)
x <- timeseries %>%
filter(assessid == "TAFI-TASGIANTCRABTAS-1990-2007-JENSEN") %>%
spread(tsid, tsvalue) %>%
select(ssb=`SSB-E03eggs`)
y <- values %>% filter(assessid == "TAFI-TASGIANTCRABTAS-1990-2007-JENSEN") %>% select(ssb)
identical(x,y)
[1] TRUE
No transformation has been done, hence the units of the values
columns vary depending on the assessment id. Nonetheless it is quite useful to have the metrics split into their corresponding 5 types rather than as 93 unique subtypes. As long as we are not comparing magnitudes across different assessments directly though, this should not be an issue.
It would probably be useful to reconstruct the code to generate the values
table from the timeseries table directly. One might hope that the mappings between tsid
values and the five column headings in the values
table would be defined in the database, e.g. in perhaps the tscategory
column of tsmetrics
:
unique(tsmetrics$tscategory)
[1] "FISHING MORTALITY"
[2] "TOTAL BIOMASS"
[3] "TIME UNITS"
[4] "SPAWNING STOCK BIOMASS or CPUE"
[5] "RECRUITS (NOTE: RECRUITS ARE OFFSET IN TIME SERIES BY THE AGE OF RECRUITMENT)"
[6] "CATCH or LANDINGS"
[7] "OTHER TIME SERIES DATA"
but alas this does not quite appear to be the case (e.g. CPUE and SSB are a single category.) Some combination of this information and splitting on the tsid
strings would probably suffice.
ts <- meta %>% select(assessid, commonname) %>% right_join(values)