vignettes/mudata_create.Rmd
As demonstrated in vignette("mudata2", package = "mudata2"), mudata objects are easy to use and have a quick data-to-analysis time. In contrast, getting data into the format takes a little more time and requires some familiarity with dplyr and tidyr. This process is essentially the data cleaning step, except that instead of discarding all the information that you don’t need (or that won’t fit in the output data structure), you can keep almost everything, possibly adding some documentation that didn’t previously exist. This is an up-front investment of time that will make subsequent users of the data better informed about how and why the data were collected in the first place.
(Mostly) universal data (mudata) objects are created using the mudata() function, which at minimum takes a data frame/tibble with one row per measurement. As an example, I’ll use the data table from the ns_climate dataset:
## # A tibble: 115,541 x 7
## dataset location param date value flag flag_text
## <chr> <chr> <chr> <date> <dbl> <chr> <chr>
## 1 ecclimate_month… SABLE ISLAND 6… mean_max_t… 1897-01-01 NA M Missing
## 2 ecclimate_month… SABLE ISLAND 6… mean_max_t… 1897-02-01 NA M Missing
## 3 ecclimate_month… SABLE ISLAND 6… mean_max_t… 1897-03-01 NA M Missing
## 4 ecclimate_month… SABLE ISLAND 6… mean_max_t… 1897-04-01 NA M Missing
## 5 ecclimate_month… SABLE ISLAND 6… mean_max_t… 1897-05-01 NA M Missing
## 6 ecclimate_month… SABLE ISLAND 6… mean_max_t… 1897-06-01 NA M Missing
## 7 ecclimate_month… SABLE ISLAND 6… mean_max_t… 1897-07-01 NA M Missing
## 8 ecclimate_month… SABLE ISLAND 6… mean_max_t… 1897-08-01 NA M Missing
## 9 ecclimate_month… SABLE ISLAND 6… mean_max_t… 1897-09-01 NA M Missing
## 10 ecclimate_month… SABLE ISLAND 6… mean_max_t… 1897-10-01 12.2 <NA> <NA>
## # … with 115,531 more rows
At minimum, the data table must contain the columns param and value. The param column contains the identifier of the measured parameter (a character vector), and the value column contains the value of the measurement (there is no restriction on what type this is except that it has to be the same type for all parameters; see below for ways around this). To represent measurements at more than one location, you can include a location column with location identifiers (a character vector). To represent measurements at more than one point in time, you can include a column between param and value specifying at what time the measurement was taken. To the right of the value column, you can include any columns needed to add context to value (I typically use this for uncertainty, detection limits, and comments on a particular measurement).
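As a rough illustration of the column layout described above, here is a minimal sketch of a data table passed to mudata(). The station name, parameter names, values, and flag are invented for illustration and do not come from ns_climate.

```r
library(tibble)
library(mudata2)

# Hypothetical measurements: two parameters at one location on two dates
measurements <- tibble(
  location = "Station A",
  param = c("temp", "temp", "precip", "precip"),
  date = as.Date(c("2020-01-01", "2020-02-01", "2020-01-01", "2020-02-01")),
  value = c(-4.2, -3.1, 110, 95),
  # extra context for each value goes to the right of the value column
  flag = c(NA, NA, "E", NA)
)

# The date column sits between param and value, so it should be
# picked up as the axis along which the data are aligned
md_min <- mudata(measurements)
md_min
```

Only param and value are strictly required; the location, date, and flag columns are the optional additions described in the paragraph above.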
In the context of ns_climate, the location column contains station names like “SABLE ISLAND”, the param column contains measurement names like “mean_max_temp”, and the point in time the measurement was taken is included in the date column. To the right of the value column, there are two columns that add extra “flag” information provided by Environment Canada. These data are distributed with Environment Canada climate downloads, but are often discarded because the 12 paired columns in the standard wide data format in which they are distributed are a bit unwieldy.
In general, the steps to create a mudata object are:

1. Create the object from a parameter-long data table using mudata().
2. Add location, parameter, and dataset metadata using update_locations(), update_params(), and update_datasets().
3. Call update_columns_table() to include the metadata columns you just added in the columns table.
4. Document individual columns using update_columns().
5. Write the object to disk using write_mudata().

As an example, I’m going to use a small subset of the sediment chemistry data that I work with on a regular basis. Instead of being aligned along the “time” or “date” axis, these data are aligned along the “depth” axis; in other words, the columns that identify each measurement are location (the sediment sample ID), param (the chemical that was measured), and depth (the position in the sediment sample). This dataset is included in the package as pocmaj and pocmajsum.
I’ll use the tidyverse for data wrangling, and the pocmaj and pocmajsum datasets to illustrate how to get from common data formats to the parameter-long, one-row-per-measurement data needed by the mudata() function.
Parameter-wide, summarised data is probably the most common form of data. If you’ve gotten this far, there is a good chance that you have data like this hanging around somewhere:
core | depth | Ca | V | Ti |
---|---|---|---|---|
MAJ-1 | 0 | 1885 | 78 | 2370 |
MAJ-1 | 1 | 1418 | 70 | 2409 |
MAJ-1 | 2 | 1550 | 70 | 2376 |
MAJ-1 | 3 | 1448 | 64 | 2485 |
MAJ-1 | 4 | 1247 | 57 | 2414 |
MAJ-1 | 5 | 1412 | 81 | 1897 |
POC-2 | 0 | 1622 | 33 | 2038 |
POC-2 | 1 | 1488 | 36 | 2016 |
POC-2 | 2 | 2416 | 79 | 3270 |
POC-2 | 3 | 2253 | 79 | 3197 |
POC-2 | 4 | 2372 | 87 | 3536 |
POC-2 | 5 | 2635 | 87 | 3890 |
This is a small subset of paleolimnological data for two sediment cores near Halifax, Nova Scotia. The data is a multi-parameter spatiotemporal dataset because it contains multiple parameters (calcium, titanium, and vanadium concentrations) measured along a common axis (depth in the sediment core) at discrete locations (cores named MAJ-1 and POC-2). Currently, our columns are not named properly: the mudata format uses the term ‘location’, not ‘core’. The rename() function is the easiest way to fix this.
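The renaming step might look like the sketch below. To keep it self-contained, the wide table is rebuilt by hand from the first two rows of each core in the table above rather than loaded from the package; the object name pocmajwide follows the one used later in this vignette.

```r
library(dplyr)
library(tibble)

# First two rows of each core, copied from the table above
pocmajwide <- tribble(
  ~core,   ~depth, ~Ca,  ~V, ~Ti,
  "MAJ-1", 0,      1885, 78, 2370,
  "MAJ-1", 1,      1418, 70, 2409,
  "POC-2", 0,      1622, 33, 2038,
  "POC-2", 1,      1488, 36, 2016
)

# rename() takes pairs of the form new_name = old_name
pocmajwide <- pocmajwide %>%
  rename(location = core)

names(pocmajwide)
```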
Finally, we need to get the data into a parameter-long format, with a column named param and our actual values in a single column called value. This can be done using the gather() function.
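A sketch of the gathering step, again using a hand-built four-row excerpt of the table above (already renamed to use location) so that it runs on its own:

```r
library(tidyr)
library(tibble)

# Hand-built excerpt of the renamed wide table
pocmajwide <- tribble(
  ~location, ~depth, ~Ca,  ~V, ~Ti,
  "MAJ-1",   0,      1885, 78, 2370,
  "MAJ-1",   1,      1418, 70, 2409,
  "POC-2",   0,      1622, 33, 2038,
  "POC-2",   1,      1488, 36, 2016
)

# Stack the Ca, Ti, and V columns: the old column names go into param,
# and the measurements go into value
pocmajlong <- gather(pocmajwide, Ca, Ti, V, key = "param", value = "value")

pocmajlong
```

In recent versions of tidyr, pivot_longer(pocmajwide, c(Ca, Ti, V), names_to = "param", values_to = "value") should produce the same rows in a different order; gather() is used here to match the rest of this vignette.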
The (first six rows of the) data now look like this:
location | depth | param | value |
---|---|---|---|
MAJ-1 | 0 | Ca | 1885 |
MAJ-1 | 1 | Ca | 1418 |
MAJ-1 | 2 | Ca | 1550 |
MAJ-1 | 3 | Ca | 1448 |
MAJ-1 | 4 | Ca | 1247 |
MAJ-1 | 5 | Ca | 1412 |
The last important thing to consider is the axis on which the data are aligned. This sounds complicated but isn’t: these axes are the same axes you might use to plot the data, in this case depth. The mudata() constructor needs to know which column this is, either by explicitly passing x_columns = "depth" or by placing the column between “param” and “value”. In most cases (like this one) it can be guessed (you’ll see a message telling you which columns were assigned this value).
Now the data is ready to be put into the mudata() constructor. If it isn’t, the constructor will throw an error telling you how to fix the data.
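The constructor call itself could be sketched as follows. The long table is rebuilt from a hand-copied excerpt of the data above so that the block stands alone; with the full dataset the call is the same.

```r
library(tibble)
library(tidyr)
library(mudata2)

# Hand-built excerpt of the renamed wide table, gathered to long form
pocmajwide <- tribble(
  ~location, ~depth, ~Ca,  ~V, ~Ti,
  "MAJ-1",   0,      1885, 78, 2370,
  "MAJ-1",   1,      1418, 70, 2409,
  "POC-2",   0,      1622, 33, 2038,
  "POC-2",   1,      1488, 36, 2016
)
pocmajlong <- gather(pocmajwide, Ca, Ti, V, key = "param", value = "value")

# Let the constructor guess the x column (a message reports the guess)
md <- mudata(pocmajlong)

# Or name the alignment axis explicitly
md_explicit <- mudata(pocmajlong, x_columns = "depth")

md
```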
## Guessing x columns: depth
## A mudata object aligned along "depth"
## distinct_datasets(): "default"
## distinct_locations(): "MAJ-1", "POC-2"
## distinct_params(): "Ca", "Ti", "V"
## src_tbls(): "data", "locations" ... and 3 more
##
## tbl_data() %>% head():
## # A tibble: 6 x 5
## dataset location param depth value
## <chr> <chr> <chr> <int> <dbl>
## 1 default MAJ-1 Ca 0 1885.
## 2 default MAJ-1 Ca 1 1418
## 3 default MAJ-1 Ca 2 1550
## 4 default MAJ-1 Ca 3 1448
## 5 default MAJ-1 Ca 4 1247
## 6 default MAJ-1 Ca 5 1412.
Data is often output in a format similar to the format above, but with uncertainty information in paired columns. Data from an ICP-MS, for example, is often in this format, with the concentration and a +/- column next to it. One of the advantages of a long format is the ability to include this information in a way that makes plotting with error bars easier. The pocmajsum dataset is a version of the dataset described above, but with standard deviation values in columns paired with the values themselves.
core | depth | Ca | Ca_sd | Ti | Ti_sd | V | V_sd |
---|---|---|---|---|---|---|---|
MAJ-1 | 0 | 1885 | 452 | 2370 | 401 | 78 | 9 |
MAJ-1 | 1 | 1418 | NA | 2409 | NA | 70 | NA |
MAJ-1 | 2 | 1550 | NA | 2376 | NA | 70 | NA |
MAJ-1 | 3 | 1448 | NA | 2485 | NA | 64 | NA |
MAJ-1 | 4 | 1247 | NA | 2414 | NA | 57 | NA |
MAJ-1 | 5 | 1412 | 126 | 1897 | 81 | 81 | 12 |
POC-2 | 0 | 1622 | 509 | 2038 | 608 | 33 | 5 |
POC-2 | 1 | 1488 | NA | 2016 | NA | 36 | NA |
POC-2 | 2 | 2416 | NA | 3270 | NA | 79 | NA |
POC-2 | 3 | 2253 | NA | 3197 | NA | 79 | NA |
POC-2 | 4 | 2372 | NA | 3536 | NA | 87 | NA |
POC-2 | 5 | 2635 | 143 | 3890 | 45 | 87 | 8 |
As above, we need to rename the core column to location using the rename() function.
Then (also as above), we need to gather() the data to get it into long form. Because we have paired columns, this is handled by a different function (from the mudata2 package) called parallel_gather().
pocmajlong <- parallel_gather(
  pocmajwide,
  key = "param",
  value = c(Ca, Ti, V),
  sd = c(Ca_sd, Ti_sd, V_sd)
)
location | depth | param | value | sd |
---|---|---|---|---|
MAJ-1 | 0 | Ca | 1885 | 452 |
MAJ-1 | 1 | Ca | 1418 | NA |
MAJ-1 | 2 | Ca | 1550 | NA |
MAJ-1 | 3 | Ca | 1448 | NA |
MAJ-1 | 4 | Ca | 1247 | NA |
MAJ-1 | 5 | Ca | 1412 | 126 |
The data is now ready to be fed to the mudata() constructor:
## Guessing x columns: depth
## A mudata object aligned along "depth"
## distinct_datasets(): "default"
## distinct_locations(): "MAJ-1", "POC-2"
## distinct_params(): "Ca", "Ti", "V"
## src_tbls(): "data", "locations" ... and 3 more
##
## tbl_data() %>% head():
## # A tibble: 6 x 6
## dataset location param depth value sd
## <chr> <chr> <chr> <int> <dbl> <dbl>
## 1 default MAJ-1 Ca 0 1885. 452.
## 2 default MAJ-1 Ca 1 1418 NA
## 3 default MAJ-1 Ca 2 1550 NA
## 4 default MAJ-1 Ca 3 1448 NA
## 5 default MAJ-1 Ca 4 1247 NA
## 6 default MAJ-1 Ca 5 1412. 126.
When mudata objects are created using only the data table, the package creates the necessary tables for parameter, location, and dataset metadata (if you have these tables prepared already, you can pass them as the arguments locations, params, and datasets). These tables provide a place to put metadata, but the package doesn’t create any metadata by default. This information is usually needed later, and including it in the object at the point of creation saves others (or future you) from scratching their heads over the question “where did core POC-2 come from anyway…”. To add it, you can update the tables using update_params(), update_locations(), and update_datasets(). The first argument of these functions is a vector of identifiers to update (or all of them if not specified), followed by key/value pairs.
## # A tibble: 3 x 2
## dataset param
## <chr> <chr>
## 1 default Ca
## 2 default Ti
## 3 default V
# parameter table with metadata
md %>%
  update_params(method = "Portable XRF Spectrometer (Olympus X-50)") %>%
  tbl_params()
## # A tibble: 3 x 3
## dataset param method
## <chr> <chr> <chr>
## 1 default Ca Portable XRF Spectrometer (Olympus X-50)
## 2 default Ti Portable XRF Spectrometer (Olympus X-50)
## 3 default V Portable XRF Spectrometer (Olympus X-50)
## # A tibble: 2 x 2
## dataset location
## <chr> <chr>
## 1 default MAJ-1
## 2 default POC-2
# location table with metadata
md %>%
  update_locations(
    "MAJ-1",
    latitude = 44.732, longitude = -63.486, lake = "Lake Major"
  ) %>%
  update_locations(
    "POC-2",
    latitude = 44.794, longitude = -63.839, lake = "Pockwock Lake"
  ) %>%
  tbl_locations()
## # A tibble: 2 x 5
## dataset location latitude longitude lake
## <chr> <chr> <dbl> <dbl> <chr>
## 1 default MAJ-1 44.7 -63.5 Lake Major
## 2 default POC-2 44.8 -63.8 Pockwock Lake
A “dataset” is intended to identify the source of the data, but it could be anything that applies to the data, params, and locations labelled with that dataset. In this case it would make sense to record that the source of the data is the mudata2 package. The default dataset name is “default”, which you can change by passing dataset_id to the mudata() function or by using rename_datasets().
## # A tibble: 1 x 1
## dataset
## <chr>
## 1 default
# datasets table with metadata
md %>%
  update_datasets(source = "R package mudata2") %>%
  tbl_datasets()
## # A tibble: 1 x 2
## dataset source
## <chr> <chr>
## 1 default R package mudata2
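The default dataset name mentioned above can be replaced at construction time. The sketch below passes dataset_id to mudata(); the name "pocmaj" is an arbitrary choice for illustration, and the data are a hand-built excerpt of the long table used earlier.

```r
library(tibble)
library(mudata2)

# Hand-built excerpt of the long-form data
pocmajlong <- tribble(
  ~location, ~depth, ~param, ~value,
  "MAJ-1",   0,      "Ca",   1885,
  "MAJ-1",   1,      "Ca",   1418,
  "POC-2",   0,      "Ca",   1622,
  "POC-2",   1,      "Ca",   1488
)

# "pocmaj" is a hypothetical dataset name chosen for this example
md_named <- mudata(pocmajlong, dataset_id = "pocmaj", x_columns = "depth")

distinct_datasets(md_named)
```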
All together, the param/location/dataset documentation looks like this:
md_doc <- md %>%
  update_params(method = "Portable XRF Spectrometer (Olympus X-50)") %>%
  update_locations(
    "MAJ-1",
    latitude = 44.732, longitude = -63.486, lake = "Lake Major"
  ) %>%
  update_locations(
    "POC-2",
    latitude = 44.794, longitude = -63.839, lake = "Pockwock Lake"
  ) %>%
  update_datasets(source = "R package mudata2")
The mudata() constructor automatically generates a barebones columns table (tbl_columns()), but since the creation of the object we have created new columns that need documentation. Thus, before documenting columns using update_columns(), it is necessary to call update_columns_table() to synchronize the columns table with the object.
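A small sketch of the synchronization step, using a trimmed-down object built from four hand-copied rows of the paired-column table above (so the resulting columns table is shorter than the one for the full, fully documented object):

```r
library(dplyr)
library(tibble)
library(mudata2)

# Four rows copied from the paired-column table, already renamed
pocmajwide <- tribble(
  ~location, ~depth, ~Ca,  ~Ca_sd, ~Ti,  ~Ti_sd, ~V, ~V_sd,
  "MAJ-1",   0,      1885, 452,    2370, 401,    78, 9,
  "MAJ-1",   5,      1412, 126,    1897, 81,     81, 12,
  "POC-2",   0,      1622, 509,    2038, 608,    33, 5,
  "POC-2",   5,      2635, 143,    3890, 45,     87, 8
)

pocmajlong <- parallel_gather(
  pocmajwide,
  key = "param",
  value = c(Ca, Ti, V),
  sd = c(Ca_sd, Ti_sd, V_sd)
)

# Add one metadata column to the params table
md_small <- mudata(pocmajlong) %>%
  update_params(method = "Portable XRF Spectrometer (Olympus X-50)")

# Synchronize the columns table with the tables in the object
columns_synced <- md_small %>%
  update_columns_table() %>%
  tbl_columns()

# The method column of the params table should now be listed
columns_synced %>% filter(table == "params")
```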
Then, you can use update_columns() to add information about various columns to the object.
## # A tibble: 16 x 4
## dataset table column type
## <chr> <chr> <chr> <chr>
## 1 default data dataset character
## 2 default data location character
## 3 default data param character
## 4 default data depth integer
## 5 default data value double
## 6 default data sd double
## 7 default locations dataset character
## 8 default locations location character
## 9 default locations latitude double
## 10 default locations longitude double
## 11 default locations lake character
## 12 default params dataset character
## 13 default params param character
## 14 default params method character
## 15 default datasets dataset character
## 16 default datasets source character
# columns with metadata
md_doc %>%
  update_columns("depth", description = "Depth in sediment core (cm)") %>%
  update_columns("sd", description = "Standard deviation uncertainty of n=3 values") %>%
  tbl_columns() %>%
  select(dataset, table, column, description, type)
## # A tibble: 16 x 5
## dataset table column description type
## <chr> <chr> <chr> <chr> <chr>
## 1 default data dataset <NA> charact…
## 2 default data location <NA> charact…
## 3 default data param <NA> charact…
## 4 default data depth Depth in sediment core (cm) integer
## 5 default data value <NA> double
## 6 default data sd Standard deviation uncertainty of n=3 v… double
## 7 default locations dataset <NA> charact…
## 8 default locations location <NA> charact…
## 9 default locations latitude <NA> double
## 10 default locations longitude <NA> double
## 11 default locations lake <NA> charact…
## 12 default params dataset <NA> charact…
## 13 default params param <NA> charact…
## 14 default params method <NA> charact…
## 15 default datasets dataset <NA> charact…
## 16 default datasets source <NA> charact…
You’ll notice there’s a type column that is also automatically generated, which I suggest that you don’t modify by hand (it will get overwritten by default before you write the object to disk). If something is the wrong type, you should use the mutate_*() family of functions to fix the column type, then run update_columns_table() again. From the top, the documentation looks like this:
md_doc <- md %>%
  update_params(method = "Portable XRF Spectrometer (Olympus X-50)") %>%
  update_locations(
    "MAJ-1",
    latitude = 44.732, longitude = -63.486, lake = "Lake Major"
  ) %>%
  update_locations(
    "POC-2",
    latitude = 44.794, longitude = -63.839, lake = "Pockwock Lake"
  ) %>%
  update_datasets(source = "R package mudata2") %>%
  update_columns_table() %>%
  update_columns("depth", description = "Depth in sediment core (cm)") %>%
  update_columns("sd", description = "Standard deviation uncertainty of n=3 values")
There are three possible formats to which mudata objects can be written: a directory of CSV files (one per table), a ZIP archive of the directory format, and a JSON encoding of the tables. You can write all of them using write_mudata() with a filename of the appropriate extension:
# write to directory
write_mudata(poc_maj, "poc_maj.mudata")
# write to ZIP
write_mudata(poc_maj, "poc_maj.mudata.zip")
# write to JSON
write_mudata(poc_maj, "poc_maj.mudata.json")
Then, you can read the file/directory using read_mudata():
# read from directory
read_mudata("poc_maj.mudata")
# read from ZIP
read_mudata("poc_maj.mudata.zip")
# read from JSON
read_mudata("poc_maj.mudata.json")
Using the ".mudata.*" naming convention isn’t required, but it seems like a good way to point potential data users in the direction of this package.
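To check that nothing is lost on the way to disk, a round trip through a temporary location can be sketched as below. The directory format is used here; the object is built from a hand-copied excerpt of the data, and the temporary path is created only for this example.

```r
library(tibble)
library(mudata2)

# Hand-built excerpt of the long-form data
pocmajlong <- tribble(
  ~location, ~depth, ~param, ~value,
  "MAJ-1",   0,      "Ca",   1885,
  "MAJ-1",   1,      "Ca",   1418,
  "POC-2",   0,      "Ca",   1622,
  "POC-2",   1,      "Ca",   1488
)
poc_maj <- mudata(pocmajlong, x_columns = "depth")

# A fresh path in the session's temporary directory
out_dir <- tempfile(fileext = ".mudata")

# Write the directory format, then read it back
write_mudata(poc_maj, out_dir)
poc_maj_read <- read_mudata(out_dir)

poc_maj_read
```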
That is most of what there is to creating mudata objects. For more reading, I suggest looking at the documentation for mudata(), update_locations(), mudata_prepare_column(), and read_mudata().