Tutorial 2 Working with Tables using the Tidyverse

In this tutorial we will introduce the tibble (also called a data frame), the object type that R uses to store tables. Most of the data you will work with in R can be represented by a table (think of an Excel spreadsheet), and one of the main advantages of using R is that the data frame is a powerful and intuitive interface for tabular data. In this tutorial we will use the tidyverse to manipulate and summarise tabular data. The tutorial is a companion to the Data transformation chapter in R for Data Science.

2.1 Prerequisites

The prerequisite for this tutorial is the tidyverse package. If this package isn’t installed, you’ll have to install it using install.packages().

Load the packages when you’re done! If there are errors, you may not have installed the above packages correctly!
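The installation and loading steps look something like this (installation only needs to happen once per computer, but the package must be loaded in every new session):

```r
# install once (skip this if the tidyverse is already installed)
install.packages("tidyverse")

# load the package at the start of every session
library(tidyverse)
```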

Finally, you will need to load the example data. For now, copy and paste the following code to load the Halifax geochemistry dataset (we will learn how to read various types of files into R in the preparing and loading data tutorial).
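As a sketch, loading an Excel file generally looks like the following; the file path here is hypothetical, so substitute the path or URL supplied with the tutorial:

```r
library(readxl)

# hypothetical file path: use the location supplied with this tutorial
halifax_geochem <- read_excel("halifax_geochem.xlsx")
```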

It’s worth mentioning a little bit about what this data frame contains, since we’ll be working with it for the rest of this tutorial. The data contains several bulk geochemical parameters from a recent study of Halifax drinking water reservoirs (Dunnington et al. 2018), including Pockwock Lake, Lake Major, Bennery Lake, Lake Fletcher, Lake Lemont, First Chain Lake, First Lake, and Second Lake. (Later, we will take a look at the core locations as well as the geochemical data).

2.2 Viewing a Data Frame

The variable we have just created (halifax_geochem) is a tibble, which is a table of values much like you would find in a spreadsheet (you will notice that we loaded it directly from an Excel spreadsheet). Each column in the tibble represents a variable (in this case, the core identifier, depth of the sample, age represented by that sample, and several geochemical parameters), and each row in the table represents an observation (in this case, a sample from a sediment core).

In RStudio’s “Environment” tab (usually at the top right of the screen), you should see a variable called halifax_geochem in the list. You can inspect it by clicking on the variable name, after which a tab will appear displaying the contents of the variable you just loaded. Clicking the little arrow to the left of the name will display the structure of the data frame, including the column names and some sample values. You can also do both of these things using the R commands View() and glimpse(), respectively. Also useful is the head() function, which will display the first few rows of a data frame.
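In code, these three ways of previewing a data frame look like this (the glimpse() and head() output is shown below):

```r
View(halifax_geochem)    # opens the interactive viewer (like clicking the name)
glimpse(halifax_geochem) # structure: column names, types, and sample values
head(halifax_geochem)    # the first six rows
```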

## Observations: 326
## Variables: 9
## $ core_id       <chr> "BEN15-2", "BEN15-2", "BEN15-2", "BEN15-2", "BEN...
## $ depth_cm      <dbl> 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5...
## $ age_ad        <dbl> 2015.903, 2015.188, 2014.474, 2012.950, 2011.425...
## $ C_percent     <dbl> 14.75718, 14.65701, 14.94983, 14.54558, 14.40408...
## $ `C/N`         <dbl> 12.15765, 12.17829, 11.92338, 11.67900, 11.61200...
## $ d13C_permille <dbl> -30.24752, -30.31042, -30.35799, -30.33835, -30....
## $ d15N_permille <dbl> 2.461962, 2.447662, 2.336219, 2.528572, 2.662515...
## $ K_percent     <dbl> 1.0026000, 1.0857000, 0.9782000, 0.9423000, 1.07...
## $ Ti_percent    <dbl> 0.1693000, 0.1823000, 0.1678000, 0.1664000, 0.18...
## # A tibble: 6 x 9
##   core_id depth_cm age_ad C_percent `C/N` d13C_permille d15N_permille
##   <chr>      <dbl>  <dbl>     <dbl> <dbl>         <dbl>         <dbl>
## 1 BEN15-2      0    2016.      14.8  12.2         -30.2          2.46
## 2 BEN15-2      0.5  2015.      14.7  12.2         -30.3          2.45
## 3 BEN15-2      1    2014.      14.9  11.9         -30.4          2.34
## 4 BEN15-2      1.5  2013.      14.5  11.7         -30.3          2.53
## 5 BEN15-2      2    2011.      14.4  11.6         -30.4          2.66
## 6 BEN15-2      2.5  2010.      14.4  11.9         -30.3          2.48
## # ... with 2 more variables: K_percent <dbl>, Ti_percent <dbl>

2.3 Selecting Columns

One way to subset halifax_geochem is to subset by column, for which we will use the select() function. For example, we may only be interested in the stable isotope information, represented by the columns d13C_permille and d15N_permille.
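A select() call along the following lines keeps the identifying columns and the stable isotope columns (the exact set of identifying columns kept here is an assumption):

```r
stable_isotope_data <- select(
  halifax_geochem,
  core_id, depth_cm, age_ad, d13C_permille, d15N_permille
)
```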

The first argument to the select() function is the original data frame (in this case, halifax_geochem), and the remaining arguments are the names of the columns to be selected. To select the core_id, age_ad, Ti, and K columns, you would use the following R command:
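That command would look like this (Ti and K appear in this dataset as the Ti_percent and K_percent columns):

```r
select(halifax_geochem, core_id, age_ad, Ti_percent, K_percent)
```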

Some column names in halifax_geochem contain characters that could be interpreted as an operation (e.g., C/N, which is the name of the column and not C divided by N). To select these columns, you will need to surround the column name in backticks:
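For example, to select the C/N column (here along with the core identifier, for context):

```r
select(halifax_geochem, core_id, `C/N`)
```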

2.3.1 Exercises

  • Use View(), glimpse(), and head() to preview the two data frames we just created. Do they have the columns you would expect?
  • Use select() to select core_id, depth_cm, `C/N`, d13C_permille, and d15N_permille, and assign the result to the variable cn_data.

2.4 Filtering Rows

Another way to subset halifax_geochem is by filtering rows using column values, similar to the filter feature in Microsoft Excel. This is done using the filter() function. For example, we may only be interested in the core from Pockwock Lake.
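The Pockwock Lake core is identified as POC15-2 in this dataset, so the call looks like this:

```r
filter(halifax_geochem, core_id == "POC15-2")
```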

Just like select(), the first argument to filter() is the original data frame, and the subsequent arguments are the conditions that each row must satisfy in order to be included in the output. Column values are referred to by the column name (in the above example, core_id), so to include all rows where the value in the core_id column is POC15-2, we use core_id == "POC15-2". Passing multiple conditions means each row must satisfy all of the conditions, such that to obtain the data from core POC15-2 where the depth in the core was 0 cm, we can use the following call to filter():
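That is:

```r
filter(halifax_geochem, core_id == "POC15-2", depth_cm == 0)
```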

It is very important that there are two equals signs within filter()! The == operator tests for equality (e.g. (2 + 2) == 4), whereas the = operator assigns a value or passes a named argument to a function, which is not what you’re trying to do within filter(). Other common operators that are useful within filter are != (not equal to), > (greater than), < (less than), >= (greater than or equal to), <= (less than or equal to), and %in% (tests if the value is one of several values). Using these, we could find out which observations are representative of the era 1950 to present:
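The 1950-to-present subset looks like this:

```r
filter(halifax_geochem, age_ad >= 1950)
```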

We could also find observations from multiple cores:
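For example, using %in% (the particular pair of cores chosen here is illustrative):

```r
filter(halifax_geochem, core_id %in% c("POC15-2", "MAJ15-1"))
```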

2.4.1 Exercises

  • Use View(), glimpse(), and head() to preview the data frames we just created. Do they have the rows you would expect?
  • Use filter() to find observations from the core FCL16-1 with an age between 1900 and present, and assign it to a name of your choosing.
  • Are there any observations with a C/N value greater than 20? (hint: you will have to surround C/N in backticks)

2.5 Selecting and Filtering

Often we need to use both select() and filter() to obtain the desired subset of a data frame. To do this, we need to pass the result of select() to filter(), or the result of filter() to select(). For example, we could create a data frame of recent (age_ad of 1950 or later) stable isotope measurements (you’ll recall that we selected the stable isotope columns into the data frame stable_isotope_data):
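A sketch of the two approaches, which produce identical results (the column list in the second version is assumed to match the earlier stable isotope selection):

```r
# select() first (done above), then filter():
recent_stable_isotopes <- filter(stable_isotope_data, age_ad >= 1950)

# or filter() first, then select():
recent_data <- filter(halifax_geochem, age_ad >= 1950)
recent_stable_isotopes_2 <- select(recent_data, core_id, depth_cm, age_ad,
                                   d13C_permille, d15N_permille)
```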

2.5.1 Exercises

  • Use View(), glimpse(), and/or head() to verify that recent_stable_isotopes and recent_stable_isotopes_2 are identical.

2.6 The Pipe (%>%)

There is an easier way! Instead of creating intermediary variables every time we want to subset a data frame using select() and filter(), we can use the pipe operator (%>%) to pass the result of one function call to another. Thus, creating our recent_stable_isotopes data frame from above becomes one line with one variable assignment instead of two.

What %>% does is pass the left side into the first argument of the function call on the right side. Thus, filter(halifax_geochem, age_ad >= 1950) becomes halifax_geochem %>% filter(age_ad >= 1950). When using the tidyverse family of packages, you should use the pipe as often as possible! It usually makes for more readable, less error-prone code, and reduces the number of temporary variables you create that clutter up your workspace. When using filter() and select() with other tidyverse manipulations like arrange(), group_by(), summarise(), and mutate(), the pipe becomes indispensable.
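Using the pipe, the subset from the previous section can be written in one chain (the column list here is assumed to match the earlier stable isotope selection):

```r
recent_stable_isotopes_pipe <- halifax_geochem %>%
  filter(age_ad >= 1950) %>%
  select(core_id, depth_cm, age_ad, d13C_permille, d15N_permille)
```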

2.6.1 Exercises

  • Inspect recent_stable_isotopes_pipe to ensure it is identical to recent_stable_isotopes.
  • Create a data frame of stable isotope data from surface samples (depth_cm == 0) using halifax_geochem, filter(), select(), and %>% and assign it to a variable of a suitable name.

2.7 Arranging (sorting) A Data Frame

Sometimes it is desirable to view rows in a particular order, which can be used to quickly determine min and max values of various parameters. You can do this in the interactive editor using View(), but sometimes rows need to be in particular order for plotting or other analysis. This is done using the arrange() function. For example, it may make sense to view halifax_geochem in ascending core_id and depth_cm order (most recent first):
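That call looks like this:

```r
arrange(halifax_geochem, core_id, depth_cm)
```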

## # A tibble: 326 x 9
##    core_id depth_cm age_ad C_percent `C/N` d13C_permille d15N_permille
##    <chr>      <dbl>  <dbl>     <dbl> <dbl>         <dbl>         <dbl>
##  1 BEN15-2      0    2016.      14.8  12.2         -30.2          2.46
##  2 BEN15-2      0.5  2015.      14.7  12.2         -30.3          2.45
##  3 BEN15-2      1    2014.      14.9  11.9         -30.4          2.34
##  4 BEN15-2      1.5  2013.      14.5  11.7         -30.3          2.53
##  5 BEN15-2      2    2011.      14.4  11.6         -30.4          2.66
##  6 BEN15-2      2.5  2010.      14.4  11.9         -30.3          2.48
##  7 BEN15-2      3    2008.      14.4  11.9         -30.3          2.53
##  8 BEN15-2      3.5  2005.      14.3  12.0         -30.2          2.60
##  9 BEN15-2      4    2002.      14.0  12.0         -30.2          2.60
## 10 BEN15-2      4.5  1999.      13.7  12.1         -30.2          2.48
## # ... with 316 more rows, and 2 more variables: K_percent <dbl>,
## #   Ti_percent <dbl>

Or descending depth order (most recent last):
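That is:

```r
arrange(halifax_geochem, core_id, desc(depth_cm))
```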

## # A tibble: 326 x 9
##    core_id depth_cm age_ad C_percent `C/N` d13C_permille d15N_permille
##    <chr>      <dbl>  <dbl>     <dbl> <dbl>         <dbl>         <dbl>
##  1 BEN15-2       29  1742.      14.5  13.4         -29.3          3.54
##  2 BEN15-2       28  1751.      14.5  13.5         -29.3          3.60
##  3 BEN15-2       27  1759.      15.1  13.4         -29.4          3.60
##  4 BEN15-2       26  1768.      15.9  13.5         -29.5          3.57
##  5 BEN15-2       25  1776.      16.7  13.4         -29.6          3.42
##  6 BEN15-2       24  1784.      16.8  13.4         -29.5          3.42
##  7 BEN15-2       23  1793.      16.5  13.5         -29.4          3.39
##  8 BEN15-2       22  1801.      17.2  13.4         -29.4          3.41
##  9 BEN15-2       21  1810.      17.3  13.6         -29.4          3.22
## 10 BEN15-2       20  1818.      17.6  13.5         -29.4          3.18
## # ... with 316 more rows, and 2 more variables: K_percent <dbl>,
## #   Ti_percent <dbl>

The arrange() function takes columns as arguments, surrounded by desc() if that column should be sorted in descending order.

2.8 Distinct Values

It is often useful to know which values exist in a data frame. For example, I’ve told you that the cores come from various lakes in the Halifax area, but what are they actually called in the dataset? To find out, we can use the distinct() function.
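For the core identifiers, this looks like:

```r
distinct(halifax_geochem, core_id)
```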

## # A tibble: 8 x 1
##   core_id
##   <chr>  
## 1 BEN15-2
## 2 FCL16-1
## 3 FLE16-1
## 4 FLK12-1
## 5 LEM16-1
## 6 MAJ15-1
## 7 POC15-2
## 8 SLK13-1

The distinct() function can take any number of column names as arguments, although in this particular dataset there isn’t a good example for this.

2.9 Calculating columns using mutate()

To create a brand-new column, we can use the mutate() function, which lets us calculate a new column using the values of existing columns. For example, we could convert the age_ad column to years before 1950:
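Ages "before present" are conventionally measured relative to 1950, so the conversion is 1950 minus the calendar age (the trailing select() keeps just the relevant columns, matching the output below):

```r
halifax_geochem %>%
  mutate(age_bp = 1950 - age_ad) %>%
  select(core_id, age_ad, age_bp)
```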

## # A tibble: 326 x 3
##    core_id age_ad age_bp
##    <chr>    <dbl>  <dbl>
##  1 BEN15-2  2016.  -65.9
##  2 BEN15-2  2015.  -65.2
##  3 BEN15-2  2014.  -64.5
##  4 BEN15-2  2013.  -62.9
##  5 BEN15-2  2011.  -61.4
##  6 BEN15-2  2010.  -59.6
##  7 BEN15-2  2008.  -57.8
##  8 BEN15-2  2005.  -54.9
##  9 BEN15-2  2002.  -52.1
## 10 BEN15-2  1999.  -49.3
## # ... with 316 more rows

Or, we could convert the K_percent and Ti_percent columns to parts per million:
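One percent is 10,000 parts per million, so:

```r
halifax_geochem %>%
  mutate(K_ppm = K_percent * 10000, Ti_ppm = Ti_percent * 10000) %>%
  select(core_id, K_percent, K_ppm, Ti_percent, Ti_ppm)
```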

## # A tibble: 326 x 5
##    core_id K_percent  K_ppm Ti_percent Ti_ppm
##    <chr>       <dbl>  <dbl>      <dbl>  <dbl>
##  1 BEN15-2     1.00  10026       0.169  1693 
##  2 BEN15-2     1.09  10857.      0.182  1823 
##  3 BEN15-2     0.978  9782       0.168  1678 
##  4 BEN15-2     0.942  9423       0.166  1664 
##  5 BEN15-2     1.08  10784.      0.183  1832.
##  6 BEN15-2     1.09  10863       0.183  1830 
##  7 BEN15-2     1.04  10374.      0.176  1762 
##  8 BEN15-2     0.97   9700       0.167  1670 
##  9 BEN15-2     1.12  11175       0.179  1791 
## 10 BEN15-2     1.01  10064.      0.17   1700.
## # ... with 316 more rows

2.10 Summarising A Data Frame

So far we have looked at subsets of halifax_geochem, but what if we want per-core averages instead of raw data values? Using the tidyverse, we can group_by() the core_id column, and summarise():
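For example, the mean C/N value for each core (producing the output below):

```r
halifax_geochem %>%
  group_by(core_id) %>%
  summarise(mean_CN = mean(`C/N`))
```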

## # A tibble: 8 x 2
##   core_id mean_CN
##   <chr>     <dbl>
## 1 BEN15-2    12.8
## 2 FCL16-1    14.2
## 3 FLE16-1    12.4
## 4 FLK12-1    12.8
## 5 LEM16-1    12.6
## 6 MAJ15-1    NA  
## 7 POC15-2    NA  
## 8 SLK13-1    NA

Here group_by() gets a list of columns, for which each unique combination of values will get one row in the output. summarise() gets a list of expressions that are evaluated for every unique combination of values defined by group_by() (e.g., mean_CN is the mean() of the C/N column for each core). Often, we want to include a number of summary columns in the output, which we can do by passing more expressions to summarise():
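For example, the mean, minimum, maximum, and standard deviation of C/N for each core:

```r
halifax_geochem %>%
  group_by(core_id) %>%
  summarise(
    mean_CN = mean(`C/N`),
    min_CN = min(`C/N`),
    max_CN = max(`C/N`),
    sd_CN = sd(`C/N`)
  )
```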

## # A tibble: 8 x 5
##   core_id mean_CN min_CN max_CN  sd_CN
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>
## 1 BEN15-2    12.8   11.6   13.6  0.648
## 2 FCL16-1    14.2   12.1   16.5  1.05 
## 3 FLE16-1    12.4   10.5   13.3  0.830
## 4 FLK12-1    12.8   10.6   14.9  1.02 
## 5 LEM16-1    12.6   11.8   13.1  0.307
## 6 MAJ15-1    NA     NA     NA   NA    
## 7 POC15-2    NA     NA     NA   NA    
## 8 SLK13-1    NA     NA     NA   NA

You will notice that for several cores the summary values are NA, or missing. This is because R propagates missing values unless you explicitly tell it not to. To fix this, you could replace mean(`C/N`) with mean(`C/N`, na.rm = TRUE). Other useful functions to use inside summarise() include mean(), median(), sd(), sum(), min(), and max(). These all take a vector of values and produce a single aggregate value suitable for use in summarise(). One special function, n(), can be used (with no arguments) inside summarise() to tell you how many observations were aggregated to produce the values in that row.
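Putting na.rm = TRUE and n() together produces the complete summary below:

```r
halifax_geochem %>%
  group_by(core_id) %>%
  summarise(
    mean_CN = mean(`C/N`, na.rm = TRUE),
    min_CN = min(`C/N`, na.rm = TRUE),
    max_CN = max(`C/N`, na.rm = TRUE),
    sd_CN = sd(`C/N`, na.rm = TRUE),
    n = n()
  )
```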

## # A tibble: 8 x 6
##   core_id mean_CN min_CN max_CN sd_CN     n
##   <chr>     <dbl>  <dbl>  <dbl> <dbl> <int>
## 1 BEN15-2    12.8   11.6   13.6 0.648    35
## 2 FCL16-1    14.2   12.1   16.5 1.05     49
## 3 FLE16-1    12.4   10.5   13.3 0.830    37
## 4 FLK12-1    12.8   10.6   14.9 1.02     33
## 5 LEM16-1    12.6   11.8   13.1 0.307    35
## 6 MAJ15-1    15.7   14.3   18.4 1.09     51
## 7 POC15-2    15.2   13.6   17.4 1.26     52
## 8 SLK13-1    11.4   10.3   11.9 0.443    34

It’s always a good idea to include n() inside summarise(), if nothing else as a check to make sure you’ve used group_by() with the correct columns.

2.10.1 Exercises

  • Assign the data frame we just created to a variable, and inspect it using View() and str(). Which cores have the most terrestrial C/N signature? Which cores have the most aquatic signature?
  • Create a similar data frame to the one we just created, but using C_percent. Which cores had the highest peak organic value?
  • Which cores had the oldest estimated basal date?

2.11 Extracting Columns

When we use select() or distinct(), we get back a data frame; however, occasionally we need one or a few of the vectors that make up the data frame (recall from the last tutorial that data frames are a collection of column vectors). If we needed just the carbon (C_percent) values, we could use the $ operator or the pull() function to extract a column vector.
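Both of the following return the same numeric vector (printed below):

```r
halifax_geochem$C_percent
pull(halifax_geochem, C_percent)
```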

##   [1] 14.757176 14.657012 14.949832 14.545579 14.404084 14.403361 14.417744
##   [8] 14.279836 14.013717 13.703397 13.403708 12.918529 12.905447 13.359170
##  [15] 13.852618 14.386124 14.533912 14.593866 15.667289 14.765232 15.028548
##  [22] 15.290797 16.107091 16.828205 17.336513 17.591619 17.311306 17.183589
##  [29] 16.492232 16.808112 16.701580 15.903060 15.069473 14.535988 14.529308
##  [36] 16.865924 15.026170 15.495396 14.920478 14.986076 15.500813 15.359202
##  [43] 15.174549 15.809170 15.493308 13.107648 14.348712 13.259158 13.556510
##  [50] 13.119736 12.610362 12.626289 13.510481 13.516503 13.673926 13.739241
##  [57] 13.319716 12.752857 12.449856 13.769313 13.594019 14.074802 15.168339
##  [64] 16.618788 16.359283 16.105118 16.514953 18.517095 17.978341 17.371192
##  [71] 16.311194 17.566880 17.477789 17.054982 17.685215 17.953537 19.507351
##  [78] 20.374763 20.849903 18.816935 19.815698 20.777122 20.852904 21.074918
##  [85]  6.263526  6.173985  6.054831  5.702408  5.113311  4.834282  4.572133
##  [92]  4.370639  4.433741  4.321790  4.511906  4.997175  5.026973  5.507729
##  [99]  5.613024  6.005021  6.368445  6.736808  7.413337  7.826077  8.387502
## [106]  8.641869  9.651663 10.471288 11.024577 11.867441  9.647337 11.459685
## [113] 11.809470 12.079710 11.911566 12.071773 12.279458 12.478523 12.579303
## [120] 12.571033 12.838081  4.533660  4.587350  4.313170  4.358330  4.424750
## [127]  4.803090  7.139540  9.071560  9.337890  9.629480  9.318650 10.515000
## [134]  9.964780 11.282600 10.048000 10.303900 11.186500 10.948100  9.480070
## [141]  8.809490  8.579200  8.531470  9.635170  9.138650  9.447680  8.802510
## [148]  9.611150  9.787400  9.432920  8.768380  8.822690  9.401530  9.579060
## [155] 14.321988 13.170802 12.511922 10.561964  9.652961  8.523713  7.645318
## [162]  7.462763  7.544005  7.347211  7.537033  7.310795  7.341749  7.476305
## [169]  7.421047  7.381802  7.398633  7.934676  8.815771  9.335244 10.760830
## [176] 11.400979 11.250975 12.138173 12.768363 12.735964 12.427341 12.086503
## [183] 13.644940 14.713456 15.641560 16.972044 17.678566 17.798798 18.278828
## [190]        NA 16.269317 15.259672 14.197484 14.270182 14.714715 13.955317
## [197] 13.719997 13.917566 16.450386 20.130244 21.488213 21.169039 20.821857
## [204] 18.802373 16.951176 15.031508 13.885433 13.419708 13.391310 13.438673
## [211] 12.997013 12.731380 13.096725 12.781592 12.687495 11.733670 11.623405
## [218] 12.029799 12.674604 13.561877 15.560322 16.826027 17.238919 16.435055
## [225] 16.156588 16.386247 15.994739 15.808491 15.335339 15.501906 15.750462
## [232] 15.813097 15.818069 15.763004 15.763415 15.848448 15.543236 15.343314
## [239] 15.133554 14.878146        NA 20.122789        NA 21.543915 20.494810
## [246] 20.208873 20.149571 19.043943 17.930155 16.835701 17.243781 17.922027
## [253] 17.457315 15.637815 15.053572 15.151438 14.823894 14.666839 14.135732
## [260] 13.915358 14.326561 14.778377 15.763145 16.940269 17.786186 18.217911
## [267] 17.992235 18.166406 19.032422 19.143826 18.636600 18.702145 18.130120
## [274] 18.365359 18.074177 18.062602 18.280654 18.331464 17.700223 17.002926
## [281] 17.245358 17.465165 17.076685 16.854514 16.893483 16.955922 16.795793
## [288] 16.874876 16.822462 16.450231 16.734635 16.486585  5.704334  5.306090
## [295]        NA  5.350935  5.207219  5.303574  5.609634  6.246534  6.315260
## [302]  6.913474  7.164375  7.701242  7.572774  6.123106  6.953925  6.500808
## [309]  9.940868  7.538145 11.425805  9.094246 12.094420 12.084451 12.244675
## [316] 12.116512 12.069895 12.102366 11.977507 11.889317 11.935524 10.761811
## [323] 11.463138  7.890573 11.445759 11.556214

The problem with doing this is that the extracted values no longer have any context! They come from multiple cores, but this is not reflected in the vector without the other columns. Nevertheless, many R functions outside of the tidyverse require input as vectors (including many you’ve used so far, such as mean(), max(), and min()), and you will often see the $ used in code written elsewhere to refer to columns. Functions in the tidyverse allow you to refer to columns by name (without the $) when used within specific functions (summarise() is a good example), so you should do this whenever you can!

2.12 Base R Subsetting vs. select() and filter()

In the wild, there are many ways to select columns and filter rows. I highly recommend using filter() and select() to do this when writing new code, but you may see R code that subsets a data frame using square brackets in the form my_data_frame[c("column_name_1", "column_name_2")] or my_data_frame[my_data_frame$column_name_1 > some_number, c("column_name_1", "column_name_2")]. The latter is equivalent to my_data_frame %>% select(column_name_1, column_name_2) %>% filter(column_name_1 > some_number). I find the tidyverse method of subsetting much clearer and far less error-prone, but it’s worth knowing the other form so you can read R code written by others!

2.13 Summary

In this tutorial we introduced the use of select(), filter(), arrange(), distinct(), and the pipe (%>%). We also used group_by() and summarise() to provide summary statistics from a data frame. These functions are the building blocks of other powerful tools in the tidyverse. For more information, see the Data transformation chapter in R for Data Science. Another good resource is the tidyverse, visualization, and manipulation basics tutorial from Garrett Grolemund.

References

Dunnington, Dewey W., I. S. Spooner, Wendy H. Krkošek, Graham A. Gagnon, R. Jack Cornett, Chris E. White, Benjamin Misiuk, and Drake Tymstra. 2018. “Anthropogenic Activity in the Halifax Region, Nova Scotia, Canada, as Recorded by Bulk Geochemistry of Lake Sediments.” https://doi.org/10.1080/10402381.2018.1461715.