Ram Legacy Database Explore

Accessing the database

Connect to the database (connection info is public), works fine:

library("dplyr")
library("RPostgreSQL")
library("tidyr")
library("ggplot2")

mydb <- src_postgres(dbname = "srdb", 
                     host="nautilus-vm.mathstat.dal.ca", 
                     user = "srdbuser", 
                     password =  "srd6us3r!", 
                     port = 5432)
mydb

src:  postgres 8.4.10 [srdbuser@nautilus-vm.mathstat.dal.ca:5432/srdb]
tbls: area, assessment, assessmethod, assessor, biometrics,
  bioparams, brptots, fishbasesaupcodes, geometry_columns, lmerefs,
  lmes, lmetostocks, management, mostrecent, recorder, referencedoc,
  reference_point_units_view, reference_point_values_view, risfields,
  risfieldvalues, spatial_ref_sys, stock, taxonomy, timeseries,
  timeseries_units_view, timeseries_values_view, tsmetrics,
  tsrelative_explicit_view

However, the expected mechanism for accessing a complete table seems to fail:

tbl(mydb, "stock")

Error in postgresqlExecStatement(conn, statement, ...): RS-DBI driver: (could not Retrieve the result : ERROR:  relation "stock" does not exist
LINE 1: SELECT * FROM "stock" WHERE 0=1
                      ^
)

Filed as a bug report in RPostgres/#32.

Meanwhile, direct sql queries work (note we need the full table address, e.g. dbname.tablename.)

tbl(mydb, sql("SELECT * FROM srdb.stock"))

Source: postgres 8.4.10 [srdbuser@nautilus-vm.mathstat.dal.ca:5432/srdb]
From: <derived table> [?? x 8]

      stockid    tsn scientificname   commonname
1        COD1 164712   Gadus morhua Atlantic cod
2    COD2J3KL 164712   Gadus morhua Atlantic cod
3  COD2J3KLIS 164712   Gadus morhua Atlantic cod
4       COD3M 164712   Gadus morhua Atlantic cod
5      COD3NO 164712   Gadus morhua Atlantic cod
6      COD3Ps 164712   Gadus morhua Atlantic cod
7   COD3Pn4RS 164712   Gadus morhua Atlantic cod
8       COD4T 164712   Gadus morhua Atlantic cod
9     COD4VsW 164712   Gadus morhua Atlantic cod
10    COD4TVn 164712   Gadus morhua Atlantic cod
..        ...    ...            ...          ...
Variables not shown: areaid (chr), stocklong (chr), inmyersdb (int),
  myersstockid (chr)

Since these tables easily fit into memory, it is generally faster to just import them into R rather than leaving dplyr to just work with them remotely. The dplyr::collect() function does this. So we create local copies of each table of interest like so:

timeseries <- collect(tbl(mydb, sql("SELECT * FROM srdb.timeseries")))
values <- collect(tbl(mydb, sql("SELECT * FROM srdb.timeseries_values_view")))

units <- collect(tbl(mydb, sql("SELECT * FROM srdb.timeseries_units_view")))
assessment <- collect(tbl(mydb, sql("SELECT * FROM srdb.assessment")))
area <- collect(tbl(mydb, sql("SELECT * FROM srdb.area")))
stock <- collect(tbl(mydb, sql("SELECT * FROM srdb.stock")))
method <- collect(tbl(mydb, sql("SELECT * FROM srdb.assessmethod")))
assessor <- collect(tbl(mydb, sql("SELECT * FROM srdb.assessor")))
management  <- collect(tbl(mydb, sql("SELECT * FROM srdb.management")))
taxonomy <- collect(tbl(mydb, sql("SELECT * FROM srdb.taxonomy")))

lmerefs <- collect(tbl(mydb, sql("SELECT * FROM srdb.lmerefs")))
lmestock <- collect(tbl(mydb, sql("SELECT * FROM srdb.lmetostocks")))


biometrics  <- collect(tbl(mydb, sql("SELECT * FROM srdb.biometrics")))
bioparams <- collect(tbl(mydb, sql("SELECT * FROM srdb.bioparams")))

tsmetrics <- collect(tbl(mydb, sql("SELECT * FROM srdb.tsmetrics")))

Many of the tables contain observations of variables that describe each given assessment (assessid), including the species stock assessed, area assessed, the method used, and so forth. Since these all follow the same schema of a row being a unique stock assessment and a column being an attribute of that assessment, it makes sense to combine this into a single metadata table. (Especially as these datasets fit so easily into memory anyway.)

meta <- assessment %>% 
  rename(methodshort = assessmethod) %>% 
  left_join(method) %>% 
  left_join(stock) %>% 
  left_join(area) %>%
  left_join(units) %>%
  left_join(assessor) %>%
  left_join(management) %>%
  left_join(taxonomy)

all_areas <- stock %>% 
  select(stockid, areaid) %>% 
  left_join(area) %>% 
  right_join(lmestock) %>% 
  left_join(lmerefs)

sapply(all_areas, function(x) length(levels(factor(x))))

           stockid             areaid            country 
               391                174                 10 
          areatype           areacode           areaname 
                24                167                167 
 alternateareaname         lme_number stocktolmerelation 
                 5                 34                  5 
          lme_name 
                34

bioparams: Fixed parameter values of a study.
biometrics: definitions of said parameters.

The column is called biounique in biometrics table but bioid in bioparams table, so we fix that:

parameters <- biometrics %>% 
  rename(bioid = biounique) %>% 
  left_join(bioparams)

tsmetrics defines the factor levels and the units used in timeseries tsid column.

tsmetrics <- tsmetrics %>% rename(tsid = tsunique)

For example, we can see what measurements are available

ids <- timeseries %>% 
  filter(assessid == "NWFSC-COWCODSCAL-1900-2007-BRANCH") %>%
  distinct(tsid) %>% 
  select(tsid)
ids

Source: local data frame [7 x 1]

      tsid
1   SSB-MT
2    R-E03
3   F-1/yr
4    TB-MT
5    TC-MT
6    TL-MT
7 STB1+-MT

Looking up these ids in the tsmetrics table tells us what these seven time series are:

inner_join(ids, tsmetrics) %>% select(tsshort, tslong, tsunitsshort, tsunitslong)

Source: local data frame [7 x 4]

  tsshort
1     SSB
2       R
3       F
4      TB
5      TC
6      TL
7   STB1+
Variables not shown: tslong (chr), tsunitsshort (chr), tsunitslong
  (chr)

cowcod <- timeseries %>% 
  filter(assessid == "NWFSC-COWCODSCAL-1900-2007-BRANCH") %>%
  left_join(tsmetrics) %>%
  ggplot(aes(tsyear, tsvalue, col=tsid)) + geom_line() 
cowcod

cowcod + scale_y_log10()

(Note TC (total catch) and TL (total landings) are equivalent in this context, implying neglible discards.)

Unfortunately, there is a lot of heterogeneity in the metrics measured by each assessment: tsmetrics defines 151 units, (though only 93 appear in timeseries)

length(unique(tsmetrics$tsid))

[1] 151

length(table(timeseries$tsid))

[1] 93

Most are variations differing only by units, as we see from the most commonly used metrics:

unit_occurs <- 
timeseries %>% 
  group_by(tsid) %>% 
  distinct(assessid) %>%
  summarize(occurs = n()) %>% 
  left_join(tsmetrics) %>% 
  arrange(desc(occurs)) %>% 
  select(tsid, tslong, tsunitsshort, occurs)
unit_occurs

Source: local data frame [93 x 4]

                      tsid
1                    TB-MT
2                   SSB-MT
3                    R-E03
4                    TC-MT
5                    TL-MT
6                    F-1/T
7                   F-1/yr
8                 ER-ratio
9  Yield-SSB-dimensionless
10                 YEAR-yr
..                     ...
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)

These are all variations of the same several variables, but measured in different units. For instance, we see many series use a catch to biomass ratio (ER) instead of a fishing mortality.

unit_occurs %>% filter(grepl("^SSB", tsid))

Source: local data frame [12 x 4]

                tsid
1             SSB-MT
2        SSB-E03eggs
3      SSB-STDDEV-MT
4           SSB-1-MT
5           SSB-2-MT
6       SSB-relative
7            SSB-E03
8      SSB-E06larvae
9           SSB-3-MT
10          SSB-4-MT
11     SSB-E03pertow
12 SSB-FemaleGonadMT
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)

unit_occurs %>% filter(grepl("^R", tsid))

Source: local data frame [7 x 4]

        tsid
1      R-E03
2    R-1-E03
3    R-2-E03
4       R-MT
5 R-relative
6    R-3-E03
7    R-4-E03
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)

unit_occurs %>% filter(grepl("^TC", tsid))

Source: local data frame [7 x 4]

      tsid
1    TC-MT
2   TC-E03
3   TC-E00
4 TC-1-E03
5  TC-1-MT
6  TC-2-MT
7 TC-2-E03
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)

unit_occurs %>% filter(grepl("^TL", tsid))

Source: local data frame [5 x 4]

     tsid
1   TL-MT
2 TL-1-MT
3 TL-2-MT
4 TL-3-MT
5  TL-E00
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)

unit_occurs %>% filter(grepl("^CPUE", tsid))

Source: local data frame [10 x 4]

                   tsid
1         CPUE-kgpertow
2       CPUEstand-1-C/E
3       CPUEstand-2-C/E
4         CPUEstand-C/E
5           CPUEraw-C/E
6       CPUEstand-3-C/E
7       CPUEstand-4-C/E
8  CPUEsmooth-E00pertow
9       CPUEstand-5-C/E
10      CPUEstand-6-C/E
Variables not shown: tslong (chr), tsunitsshort (chr), occurs (int)

The values table appears to be derived from the timeseries table, presumably standardizing on consistent metrics(?)

values

Source: local data frame [16,308 x 9]

                        assessid tsyear pt_avail       ssb       r
1  ADFG-HERRPWS-1980-2006-COLLIE   1980        5  48270.35  414070
2  ADFG-HERRPWS-1980-2006-COLLIE   1981        5  51090.88  335200
3  ADFG-HERRPWS-1980-2006-COLLIE   1982        5  47402.54  112300
4  ADFG-HERRPWS-1980-2006-COLLIE   1983        5  56449.01  103740
5  ADFG-HERRPWS-1980-2006-COLLIE   1984        5  64461.28 1062360
6  ADFG-HERRPWS-1980-2006-COLLIE   1985        5  79124.61   99630
7  ADFG-HERRPWS-1980-2006-COLLIE   1986        5  65601.06   74880
8  ADFG-HERRPWS-1980-2006-COLLIE   1987        5  70646.42   84820
9  ADFG-HERRPWS-1980-2006-COLLIE   1988        5  93508.20 1006440
10 ADFG-HERRPWS-1980-2006-COLLIE   1989        5 105135.41  119970
..                           ...    ...      ...       ...     ...
Variables not shown: total (dbl), f (dbl), cpue (dbl), catch_landings
  (dbl)

We can see this by transforming our example:

x <- timeseries %>% 
  filter(assessid == "NWFSC-COWCODSCAL-1900-2007-BRANCH") %>%
  spread(tsid, tsvalue) %>% 
  rename(ssb=`SSB-MT`, r = `R-E03`, total = `TB-MT`, f = `F-1/yr`, catch_landings = `TL-MT`) %>%
  select(assessid, tsyear, ssb, r, total, f, catch_landings)

which is indeed identical to corresponding assessment in the values table

y <- values %>% filter(assessid == "NWFSC-COWCODSCAL-1900-2007-BRANCH") %>% select(-pt_avail, -cpue)
identical(x,y)

[1] TRUE

So what happens when the units differ?

timeseries %>% filter(tsid=="SSB-E03eggs") %>% distinct("assessid")

Source: local data frame [1 x 5]

                               assessid        tsid tsyear  tsvalue
1 TAFI-TASGIANTCRABTAS-1990-2007-JENSEN SSB-E03eggs   1998 188.9256
Variables not shown: "assessid" (chr)

x <- timeseries %>% 
  filter(assessid == "TAFI-TASGIANTCRABTAS-1990-2007-JENSEN") %>%
  spread(tsid, tsvalue) %>% 
  select(ssb=`SSB-E03eggs`)

y <- values %>% filter(assessid == "TAFI-TASGIANTCRABTAS-1990-2007-JENSEN") %>% select(ssb)
identical(x,y)

[1] TRUE

No transformation has been done, hence the units of the values columns vary depending on the assessment id. Nonetheless it is quite useful to have the metrics split into their corresponding 5 types rather than as 93 unique subtypes. As long as we are not comparing magnitudes across different assessments directly though, this should not be an issue.

It would probably be useful to reconstruct the code to generate the values table from the timeseries table directly. One might hope that the mappings between tsid values and the five column headings in the values table would be defined in the database, e.g. in perhaps the tscategory column of tsmetrics:

unique(tsmetrics$tscategory)

[1] "FISHING MORTALITY"                                                            
[2] "TOTAL BIOMASS"                                                                
[3] "TIME UNITS"                                                                   
[4] "SPAWNING STOCK BIOMASS or CPUE"                                               
[5] "RECRUITS (NOTE: RECRUITS ARE OFFSET IN TIME SERIES BY THE AGE OF RECRUITMENT)"
[6] "CATCH or LANDINGS"                                                            
[7] "OTHER TIME SERIES DATA"

but alas this does not quite appear to be the case (e.g. CPUE and SSB are a single category.) Some combination of this information and splitting on the tsid strings would probably suffice.

ts <- meta %>% select(assessid, commonname) %>% right_join(values)