Skip to contents

One of the motivators to writing geoarrow was to leverage the features of the arrow package and larger Apache Arrow ecosystem for bigger-than-memory data sets. In the arrow package, these are exposed as Dataset objects and can be created from one or more files using arrow::open_dataset(). There is an excellent introduction to using Datasets with dplyr in the arrow package documentation that goes into detail, but the gist of it is that you can use arrow::open_dataset() to open a bigger-than-memory collection of files and query it using the familiar dplyr::filter(), dplyr::mutate(), dplyr::summarise(), and dplyr::arrange().

Opening a dataset

The geoarrow package provides a few functions simplify using write_dataset() and open_dataset() with datasets that contain geospatial data encoded as one or more columns. We’ll use the arrow package, the dplyr package, and geoarrow. The geoarrow package includes a small test dataset of places in Denmark derived from OpenStreetMap and processed by GeoFabrik.

library(arrow)
library(dplyr)
library(geoarrow)

places_folder <- system.file("example_dataset/osm_places", package = "geoarrow")
list.files(places_folder, recursive = TRUE)
#> [1] "fclass=city/part-0.parquet"            
#> [2] "fclass=farm/part-0.parquet"            
#> [3] "fclass=hamlet/part-0.parquet"          
#> [4] "fclass=island/part-0.parquet"          
#> [5] "fclass=locality/part-0.parquet"        
#> [6] "fclass=national_capital/part-0.parquet"
#> [7] "fclass=suburb/part-0.parquet"          
#> [8] "fclass=town/part-0.parquet"            
#> [9] "fclass=village/part-0.parquet"

These files were written using arrow::write_dataset() and have hive-style partitioning (i.e., the folder name provides some information about the value of a certain field for files within that folder).

You can preview a dataset using head() and geoarrow_collect_sf():

places <- open_dataset(places_folder)
places %>% 
  head() %>% 
  geoarrow_collect_sf()
#> Simple feature collection with 6 features and 5 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 8.452075 ymin: 55.46649 xmax: 12.08192 ymax: 56.46175
#> Geodetic CRS:  WGS 84
#> # A tibble: 6 × 6
#>   osm_id    code population name     fclass            geometry
#>   <chr>    <int>      <dbl> <chr>    <chr>          <POINT [°]>
#> 1 21040334  1001      50781 Roskilde city   (12.08192 55.64335)
#> 2 21040360  1001      72398 Esbjerg  city   (8.452075 55.46649)
#> 3 26559154  1001      62687 Randers  city   (10.03715 56.46175)
#> 4 26559170  1001      60508 Kolding  city     (9.47905 55.4895)
#> 5 26559198  1001      56567 Vejle    city   (9.533324 55.70001)
#> 6 26559213  1001     273077 Aarhus   city    (10.2134 56.14963)

Querying a data set

Just like a local data frame, you can use dplyr verbs like fiilter(), mutate() and summarise() to subset and manipulate the data before it is pulled into the R session. For example, to find all the places in Denmark with a population greater than 100,000 people, we can do this:

places %>% 
  filter(population > 100000) %>% 
  select(name, population, fclass, geometry) %>% 
  geoarrow_collect_sf()
#> Simple feature collection with 5 features and 3 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 9.921526 ymin: 55.39972 xmax: 12.57007 ymax: 57.04626
#> Geodetic CRS:  WGS 84
#> # A tibble: 5 × 4
#>   name          population fclass                      geometry
#>   <chr>              <dbl> <chr>                    <POINT [°]>
#> 1 Aarhus            273077 city              (10.2134 56.14963)
#> 2 Odense            178210 city             (10.38521 55.39972)
#> 3 Aalborg           114194 city             (9.921526 57.04626)
#> 4 København         613288 national_capital (12.57007 55.68672)
#> 5 Frederiksberg     102029 suburb           (12.53262 55.67802)

Geometry operators aren’t yet supported within the Arrow compute engine, so any filtering or transformation on geometry columns must be done at the beginning (i.e., when selecting files to pass to open_dataset()) or at the end (i.e., after calling geoarrow_collect_sf()).

capital <- places %>% 
  filter(name == "København") %>% 
  geoarrow_collect_sf()

# cities within 200 km of the capital
places %>% 
  filter(fclass == "city") %>% 
  geoarrow_collect_sf() %>% 
  filter(
    s2::s2_dwithin(geometry, capital, 200000)
  )
#> Simple feature collection with 7 features and 5 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 9.47905 ymin: 55.39972 xmax: 12.08192 ymax: 56.46175
#> Geodetic CRS:  WGS 84
#> # A tibble: 7 × 6
#>   osm_id      code population name     fclass            geometry
#> * <chr>      <int>      <dbl> <chr>    <chr>          <POINT [°]>
#> 1 21040334    1001      50781 Roskilde city   (12.08192 55.64335)
#> 2 26559154    1001      62687 Randers  city   (10.03715 56.46175)
#> 3 26559170    1001      60508 Kolding  city     (9.47905 55.4895)
#> 4 26559198    1001      56567 Vejle    city   (9.533324 55.70001)
#> 5 26559213    1001     273077 Aarhus   city    (10.2134 56.14963)
#> 6 26559274    1001     178210 Odense   city   (10.38521 55.39972)
#> 7 1368129781  1001      58646 Horsens  city   (9.844477 55.86117)

Using geoarrow_collect()

Until this point, we have used geoarrow_collect_sf(). This is probably what you want, since the fantastic sf package is a feature-complete GIS designed to work well with GIS data in data.frame form. If you want to customize how geometry columns are converted into R objects, you can use geoarrow_collect() with a handler. For example, if you want to convert geometry to a wk::wkb() directly without pivoting through sf, you can use handler = wk::wkb_writer.

places %>% 
  head() %>% 
  geoarrow_collect(handler = wk::wkb_writer)
#> # A tibble: 6 × 6
#>   osm_id    code population name     fclass geometry                   
#>   <chr>    <int>      <dbl> <chr>    <chr>  <wk_wkb>                   
#> 1 21040334  1001      50781 Roskilde city   <POINT (12.08192 55.64335)>
#> 2 21040360  1001      72398 Esbjerg  city   <POINT (8.452075 55.46649)>
#> 3 26559154  1001      62687 Randers  city   <POINT (10.03715 56.46175)>
#> 4 26559170  1001      60508 Kolding  city   <POINT (9.47905 55.4895)>  
#> 5 26559198  1001      56567 Vejle    city   <POINT (9.533324 55.70001)>
#> 6 26559213  1001     273077 Aarhus   city   <POINT (10.2134 56.14963)>

For large point data sets, it can be useful to represent geometry as a wk::xy(), which is a thin wrapper around data.frame(x = c(), y = c()). You can create these objects as geometry columns using handler = wk::xy_writer:

(places_xy <- places %>% 
  select(name, geometry) %>% 
  geoarrow_collect(handler = wk::xy_writer))
#> # A tibble: 7,255 × 2
#>    name     geometry            
#>    <chr>    <wk_xy>             
#>  1 Roskilde (12.081925 55.64335)
#>  2 Esbjerg  ( 8.452075 55.46649)
#>  3 Randers  (10.037148 56.46175)
#>  4 Kolding  ( 9.479050 55.48950)
#>  5 Vejle    ( 9.533324 55.70001)
#>  6 Aarhus   (10.213405 56.14963)
#>  7 Odense   (10.385210 55.39972)
#>  8 Horsens  ( 9.844477 55.86117)
#>  9 Aalborg  ( 9.921526 57.04626)
#> 10 Stige Ø  (10.426084 55.44770)
#> # … with 7,245 more rows

xy <- as.data.frame(places_xy$geometry)
head(xy$x)
#> [1] 12.081925  8.452075 10.037148  9.479050  9.533324 10.213405