Tutorial 5 Creating Stratigraphic Diagrams

Creating beautiful stratigraphic diagrams is difficult in any program, and while it is not necessarily easier to create beautiful digrams in R the first time, R becomes useful when the data used to create the graph get updated, requiring you to re-make the plot. In the course of a thesis, one can re-make a figure many, many times with various modifications. Learning to create publication-ready stratigraphic plots in R is a front-end investment of time that becomes useful after the first few rounds of revisions. This tutorial is designed to give you the tools to make such diagrams with your own data.

5.1 Prerequisites

The prerequisites for this tutorial are the tidyverse package, the gridExtra package, and the analogue package. If you haven’t installed the analogue or gridExtra packages yet, you’ll have to install them using install.packages().

Load the tidyverse when you’re done! We will load the other packages when we use functions that require them below.

Finally, you will need to obtain the example data. In this tutorial, we will use the Lake Arnold diatom counts tidy CSV version of the data (Whitehead et al. 1989), and the Halifax geochemistry data. If you have these files downloaded you can load them yourself (see Tutorial 4), or you can copy/paste the following code to load the two datasets.

5.2 Workflow

It’s worth mentioning a bit about how one might go about integrating R into one’s figure-creating workflow. I suggest creating a file (something like create_figures.R) that loads the data, creates the figures, and saves the figures, within an RStudio project that also contains the data files. Keeping the data needed to create the figures and the script used to create the figures close means that you can reproduce the figure if you need to change the script (or change the data!), and using an RStudio project means you can move the folder around your computer (or to somebody else’s computer) and your scripts won’t change. I usually have an RStudio project with a subdirectory for data-raw (the data from the instrument/tech/emailed from other lab), data (the user-modified version of the files in data-raw that are R-friendly), and figures (the R-generated figures).

5.3 General stratigraphic diagrams

Stratigraphic diagrams are at heart, a set of plots that share a Y axis. The Y-axis represents time, which can be expressed as depth or as some unit of actual time (e.g., AD 2018, or 1200 years BP), and the X-axis is the value of each parameter. We will use the ggplot2 package to create these graphics for non-species data, and the analogue package to create the graphics for species composition data.

5.3.1 Data preparation

Getting your data in a form that is usable by the plotting function is most of the battle to creating a figure. In the case of ggplot(), we need the data to be in a “parameter-long” form, with one row for each measurement (each point on the graph). In Tutorial 4 we learned how to use gather() to turn a “parameter-wide” data frame into a “parameter-long” data frame. We are mostly going to focus on plotting the Pockwock Lake core, so we will also filter the data to include only one core for now.

## # A tibble: 312 x 5
##    core_id depth_cm age_ad param     value
##    <chr>      <dbl>  <dbl> <chr>     <dbl>
##  1 POC15-2      0    2016. C_percent  NA  
##  2 POC15-2      0.5  2015. C_percent  20.1
##  3 POC15-2      1    2014. C_percent  NA  
##  4 POC15-2      1.5  2013. C_percent  21.5
##  5 POC15-2      2    2011. C_percent  20.5
##  6 POC15-2      2.5  2009. C_percent  20.2
##  7 POC15-2      3    2006. C_percent  20.1
##  8 POC15-2      3.5  2003. C_percent  19.0
##  9 POC15-2      4    2000. C_percent  17.9
## 10 POC15-2      4.5  1997. C_percent  16.8
## # ... with 302 more rows

5.3.2 Plotting a single core

Plotting a single core with ggplot() involves several steps:

  • The initial ggplot() call, where we set which columns will get mapped to the x and y values for each layer.
  • Two layers: geom_path() and geom_point() (we don’t use geom_line() because it was written by non-paleolimnologists and connects the line in odd ways)
  • Specify how data are divided between facets using facet_wrap(). We need the scales to be free on the X-axis because each parameter has a different range of values, however the Y-axis should be the same for all facets.
  • Reverse the Y-axis, so that a depth of 0 is at the top.
  • Remove the X label, and set the Y label to “Depth (cm)”.

5.3.3 Changing the colours

The default colour scheme in ggplot2 is great for the web, but not so much for publication. I usually use theme_bw(), which is the black-and-white version of the default theme. There are several others, including theme_minimal(), theme_classic(), theme_linedraw(), and many more. You can add them to the end of your plot like so:

Alternatively, you can use the following line to set your theme for the rest of your R session! I usually set the theme to theme_bw() right under my call to library(tidyverse) in a script.

5.3.4 Facet configuration

The facet_wrap() function has nrow and ncol arguments that allow customization of how facets are laid out. Sometimes changing the figure size is also helpful, which can be done when the figure is exported using ggsave(). In general, you should only set one of nrow or ncol.

5.3.5 Reordering facets

To control the orderering of the facets, we need to turn the param column into a factor(), which is kind of like a “multiple choice” vector that stores the order of its choices. We first need to create a character vector that defines the order, use mutate() to create a new column containing the facet label as a factor, then make facet_wrap() use that column to create the facets instead of param.

5.3.6 Relabeling facets

When using facet_wrap(), ggplot2 uses the value of each unique value in a column to create the label. This means that we need to rename each item to a suitable label prior to feeding it in to ggplot(). I think the easiest way to do this is fct_recode(), because it also keeps the order of the input (if you’ve already set the input to be a factor()), and you only have to rename the values that need renaming. (Note: I’ve use the unicode superscript 1-9 characters to get the superscript effect, because the other way is much harder. You can copy and paste these, or google superscript 1 unicode).

If the unicode superscript effect doesn’t work on your computer (it doesn’t work for \(\delta^{15}\text{N}\)) on my computer), you will have to resort to plotmath. This is a custom way R invented to render complex formatting, and you can use it with ggplot2 in various ways. The best way to implement this to start is to copy and paste, and check out ?plotmath if you’re curious what the syntax is. The key is to relabel the facets to very specific looking text (if in doubt, surround everything in single quotes like this: 'C/N'), and then use labeller = label_parsed within facet_wrap() (also works in facet_grid()).

5.3.7 Adding annotations

In general, adding text to the plot is difficult and not advisable. Adding horizontal lines and rectangles to highlight a particular region of the plot is much easier, and I have rarely had to resort to adding actual text to a plot using ggplot2. If this is critical to your application, you can use ggsave() to export a .svg file or .pdf file, and import it into your favourite vector drawing program (I happen to like Inkscape).

Horizontal lines can be added using geom_hline(). When using the yintercept argument, a horizontal line will be drawn on all panels.

To only draw on some panels, you will need to create a tibble with a column that specifies the facet. A small example:

Rectangles can also be useful to draw in the background to highlight a range of depths. Rectangles also require tibble that specifies their appearance, which can contain the facet variable if the rectangle should only be drawn on some facets. Using xmin = -Inf and xmax = Inf ensures that the rectangles touch the edge of each facet.

5.3.8 A second axis for ages

Depending on the application, ages can be used directly on the Y-axis, or you can include depth on the Y-axis and use a second axis on the right for ages. This can be done using scale_y_reverse() with the sec.axis argument. This axis is created using a one-to-one transformation, which requires the age-depth information as a tibble. Here I use a transformation function based on approx(), which, if you make sure your age-depth data frame is the same as the pockwock one, you should be able to safely copy and paste to add to your plot.

5.3.9 Multiple cores

Plotting multiple cores can happen in a few ways. First, you can map the core_id column to the colour aesthetic (you could use the linetype aesthetic if you need black and white output). You can specify the legend label using the labs() function.

Second, you can use facet_grid() instead of facet_wrap(). Here you can use age or depth on the y axis, although you can’t use the second axis trick from above to have dates as a second axis.

For both of these, you can use the same tricks we used to change the order of the facets to change the order and labelling of the end result.

5.3.10 Overlapping x-axis labels

The default placement of labels in ggplot2 can result in labels that overlap and look ugly in the final product. There are a few solutions, including (1) Rotate the labels by 90 degrees (this works in most circumstances):

You can also (2) make the font size for the labels smaller (the value is in points):

Finally, you can (3) suggest a number of breaks to use in scale_x_continuous():

5.3.11 Saving a plot

You can use the same strategy we used in Tutorial 3 to save the plots we created in this section. The easiest way to do this is to save the plot to an object, then use ggsave() to write the object to disk.

## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 10 rows containing missing values (geom_point).

You can also save files as .pdf and .svg, but .png is probably the most reliable for importing into Word documents. If you need to edit the file in a vector drawing program afterward (or you need to give the final result to a journal), using .pdf or .svg is probably best. Note that exporting to .svg requires the svglite package.

5.4 Exercises

  • Make a simple stratigraphic diagram of the “BEN15-2” core. Use age_ad on the Y-axis. Your plot should look like this:

  • Make a stratigraphic diagram of the “FCL16-1” and “LEM16-1” cores. For bonus points, use proper sub/superscripts for the delta parameters and keep C/N, C, and stable isotopes together, include units in parentheses for all parameters, and use age_ad on the Y-axis. In the end, your plot (with bonus marks) should look like this:

5.5 Species composition diagrams

Species composition diagrams are like other types of strat diagrams, except they have long variable names and generally require scaling of the space such that 10% on one facet is the same distance as 10% on another facet. There are two approaches for this: using ggplot2 is more verbose but more flexible (you can use most of the tips/tricks above to add secondary axes, horizontal lines, etc.), and using the Stratiplot() function in the analogue package is less verbose but requires that you have your data in a very specific form. We will go over both approaches here, but if you can make ggplot2 work for you, the plot looks much nicer.

5.5.1 Using ggplot2

5.5.1.1 Data setup

Similar to non-speices data, ggplot2 needs the data to be in “parameter-long” form. Usually the value plotted on the X-axis of species composition plots is a relative abundance (in percent), which we can calculate using a grouped mutate().

## # A tibble: 2,128 x 5
##    depth_cm age_bp taxon                  valve_count relative_abundance_…
##       <int>  <int> <chr>                        <int>                <dbl>
##  1        0    -39 Achnanthes marginulata           8                 2.07
##  2        5     12 Achnanthes marginulata          11                 2.25
##  3       45    445 Achnanthes marginulata           7                 1.76
##  4       75    795 Achnanthes marginulata          12                 3.26
##  5      115   1373 Achnanthes marginulata           6                 1.24
##  6      175   2314 Achnanthes marginulata          13                 3.39
##  7      235   3469 Achnanthes marginulata          21                 6.07
##  8      275   4243 Achnanthes marginulata          25                 6.56
##  9      325   5218 Achnanthes marginulata          20                 5.90
## 10      375   6196 Achnanthes marginulata           4                 1.02
## # ... with 2,118 more rows

Because there are many taxa, we will restrict our plot to those with a maximum relative abundance of 10% or more. We need this as a vector so that we can filter() the arnold_counts data above and change the taxon column to a factor() so that it is plotted in a specific order. Generally this order is some preference gradient, but that isn’t a part of this particular dataset. Instead, we will order from least abundant to most abundant.

##  [1] "Surirella delicatissima"             
##  [2] "Anomoeoneis serians var. brachysira" 
##  [3] "Navicula subtilissima"               
##  [4] "Fragilaria pinnata"                  
##  [5] "Fragilaria brevistriata"             
##  [6] "Pinnularia microstauron"             
##  [7] "Pinnularia biceps"                   
##  [8] "Melosira distans"                    
##  [9] "Cymbella hebridica"                  
## [10] "Fragilaria virescens var. birostrata"
## [11] "Fragilaria construens"
## # A tibble: 176 x 5
##    depth_cm age_bp taxon                   valve_count relative_abundance…
##       <int>  <int> <fct>                         <int>               <dbl>
##  1        0    -39 Surirella delicatissima          49              12.7  
##  2        5     12 Surirella delicatissima          37               7.57 
##  3       45    445 Surirella delicatissima          16               4.02 
##  4       75    795 Surirella delicatissima          23               6.25 
##  5      115   1373 Surirella delicatissima          22               4.55 
##  6      175   2314 Surirella delicatissima          28               7.29 
##  7      235   3469 Surirella delicatissima          11               3.18 
##  8      275   4243 Surirella delicatissima          24               6.30 
##  9      325   5218 Surirella delicatissima           5               1.47 
## 10      375   6196 Surirella delicatissima           2               0.508
## # ... with 166 more rows

5.5.1.2 Plotting one core

Using ggplot2 to plot species composition strat diagrams is similar to plotting geochemical data, with three important differences: the geometry is usually a horizontal bar plot, the facet labels are so long that they need to be rotated, and the horizontal distance in each facet must be equal between facets (so that 10% relative abundance is the same on the leftmost facet as the rightmost facet). We solve the first by replacing geom_path() and geom_point() with geom_segment(), and the third by replacing facet_wrap() with facet_grid() using space = "free_x" to keep the distance on each facet comparable in the x direction. Setting the breaks on the x-axis helps keep the axes less cluttered and maintains the idea of equal space visually.

The problem of rotated and partially overlapping facet labels is one that takes some code to deal with. The good news is, the code is the same regardless of which plot you are trying to create, so you can copy/paste it for any plot that needs rotated facet labels. We do this in two steps: rotate and align the facet labels within theme(), and then use what can only be described as voodoo to eliminate the default clipping used by ggplot2.

5.5.1.3 Saving a plot

Because we have modified the plot after it is has been built, we need to use the base-R way of saving things. This is done using the pdf(), png() and svg() functions (and calling dev.off() afterward). Call grid::grid.draw() in the middle like this:

5.5.1.4 Adjusting spacing

For more fine-tuned control of the appearance, you can use various arguments within theme(). The two that I’ve used in this situation are plot.margin to add some padding to the right (this helps with the occasional cutoff rightmost facet label), and panel.spacing.x to adjust the spacing between panels.

5.5.1.5 Non-species data

Often strat diagrams include some other kind of data in addition to the species data, like a PC1 summary or diatom-inferred pH. I’ll calculate a few diversity indicies for this dataset as a demonstration.

When using ggplot2, there is no easy way to combine this with a species plot (you will see a way to do this in the analogue package below). The only way I know of to do this is to with ggplot2 is to (1) modify the non-species data such that it has no y-axis label, no y-axis ticks, and no y-axis break labels, (2) remove the plot background from the second plot (to avoid clipping), (3) fudge the plot margins on the second plot so that the plots align properly (below I set the top and bottom margins to 1.74 and 0.30 inches, respectively, but you will have to fudge this yourself), and (4) use the grid.arrange() function from the gridExtra package to combine the two plots (you can set the relative widths, but you should have the number of rows be 1). Make sure to use the “grob” version of the species plot, which kept the species names from being clipped. Make sure that your depth values actually align before removing the axis labels completely!

You can save this plot by enclosing the grid.arrange() call in png() (or pdf()) and dev.off().

5.5.2 Using analogue::Stratiplot

The analogue package by Gavin Simpson has a number of useful functions for paleolimnologists, including a Stratiplot() function designed to produce stratigraphic plots. I find the ggplot2 version to be more flexible, but the default look and feel works for you, Stratiplot() is definitely easier. Because we are using functions from another package, we have to load the package using library().

## Loading required package: vegan
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-1
## analogue version 0.17-0

5.5.2.1 Data preparation

The data required by Stratiplot() is a data frame of relative abundances (as columns). To calculate this from our filtered data, we can use the spread() function to get the data in “parameter wide” form.

## # A tibble: 16 x 13
##    depth_cm age_bp `Surirella delica… `Anomoeoneis seria… `Navicula subti…
##       <int>  <int>              <dbl>               <dbl>            <dbl>
##  1        0    -39             12.7                 0.775           11.4  
##  2        5     12              7.57                1.84            14.9  
##  3       45    445              4.02                4.02            12.6  
##  4       75    795              6.25                3.53             8.97 
##  5      115   1373              4.55                1.45             6.40 
##  6      175   2314              7.29                0.521            7.81 
##  7      235   3469              3.18                0.867            6.07 
##  8      275   4243              6.30                2.62             4.99 
##  9      325   5218              1.47                3.24             6.49 
## 10      375   6196              0.508               3.05             4.57 
## 11      425   7274              4.49               11.4              4.49 
## 12      445   7975              2.88               10.7              3.40 
## 13      475   9018              0.777              10.1              6.48 
## 14      495   9714              0.521              13.0              0.260
## 15      525  10645              1.07                9.33             0.533
## 16      555  11240              0                   0                0    
## # ... with 8 more variables: `Fragilaria pinnata` <dbl>, `Fragilaria
## #   brevistriata` <dbl>, `Pinnularia microstauron` <dbl>, `Pinnularia
## #   biceps` <dbl>, `Melosira distans` <dbl>, `Cymbella hebridica` <dbl>,
## #   `Fragilaria virescens var. birostrata` <dbl>, `Fragilaria
## #   construens` <dbl>

5.5.2.2 Plotting a single core

The Stratiplot() function takes two arguments: the first is a version of the data frame we just created without any of the depth information. The second is the depth information as a vector.

There are many arguments to Stratiplot(), but I found for this example that the following values worked for this particular dataset. In particular, increasing the topPad parameter was needed to keep the species labels from being cutoff.

5.6 Exercises

  • Use either ggplot2 or analogue to plot the species c("Surirella delicatissima", "Anomoeoneis serians var. brachysira", "Navicula subtilissima", "Fragilaria pinnata", "Fragilaria brevistriata"). For bonus points, use ggplot2 and rotate the axis labels.

5.7 Summary

In this tutorial, we used ggplot2 and the analogue package to create (and save) stratigraphic diagrams for geochemical and species composition data. Unforunately, I don’t know of any good resources currently available for creating stratigraphic plots in R, although there is an R session at IPA-IAL this year.

References

Whitehead, Donald R., Donald F. Charles, Stephen T. Jackson, John P. Smol, and Daniel R. Engstrom. 1989. “The Developmental History of Adirondack (N.Y.) Lakes.” Journal of Paleolimnology 2 (3): 185–206. https://doi.org/10.1007/BF00202046.