Getting started with geoarrow
geoarrow.Rmd
One of the motivators to writing geoarrow was to leverage the features of the arrow package and larger Apache Arrow ecosystem for bigger-than-memory data sets. In the arrow package, these are exposed as Dataset
objects and can be created from one or more files using arrow::open_dataset()
. There is an excellent introduction to using Datasets with dplyr in the arrow package documentation that goes into detail, but the gist of it is that you can use arrow::open_dataset()
to open a bigger-than-memory collection of files and query it using the familiar dplyr::filter()
, dplyr::mutate()
, dplyr::summarise()
, and dplyr::arrange()
.
Opening a dataset
The geoarrow package provides a few functions simplify using write_dataset()
and open_dataset()
with datasets that contain geospatial data encoded as one or more columns. We’ll use the arrow package, the dplyr package, and geoarrow. The geoarrow package includes a small test dataset of places in Denmark derived from OpenStreetMap and processed by GeoFabrik.
library(arrow)
library(dplyr)
library(geoarrow)
places_folder <- system.file("example_dataset/osm_places", package = "geoarrow")
list.files(places_folder, recursive = TRUE)
#> [1] "fclass=city/part-0.parquet"
#> [2] "fclass=farm/part-0.parquet"
#> [3] "fclass=hamlet/part-0.parquet"
#> [4] "fclass=island/part-0.parquet"
#> [5] "fclass=locality/part-0.parquet"
#> [6] "fclass=national_capital/part-0.parquet"
#> [7] "fclass=suburb/part-0.parquet"
#> [8] "fclass=town/part-0.parquet"
#> [9] "fclass=village/part-0.parquet"
These files were written using arrow::write_dataset()
and have hive-style partitioning (i.e., the folder name provides some information about the value of a certain field for files within that folder).
You can preview a dataset using head()
and geoarrow_collect_sf()
:
places <- open_dataset(places_folder)
places %>%
head() %>%
geoarrow_collect_sf()
#> Simple feature collection with 6 features and 5 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 8.452075 ymin: 55.46649 xmax: 12.08192 ymax: 56.46175
#> Geodetic CRS: WGS 84
#> # A tibble: 6 × 6
#> osm_id code population name fclass geometry
#> <chr> <int> <dbl> <chr> <chr> <POINT [°]>
#> 1 21040334 1001 50781 Roskilde city (12.08192 55.64335)
#> 2 21040360 1001 72398 Esbjerg city (8.452075 55.46649)
#> 3 26559154 1001 62687 Randers city (10.03715 56.46175)
#> 4 26559170 1001 60508 Kolding city (9.47905 55.4895)
#> 5 26559198 1001 56567 Vejle city (9.533324 55.70001)
#> 6 26559213 1001 273077 Aarhus city (10.2134 56.14963)
Querying a data set
Just like a local data frame, you can use dplyr verbs like filter()
, mutate()
and summarise()
to subset and manipulate the data before it is pulled into the R session. For example, to find all the places in Denmark with a population greater than 100,000 people, we can do this:
places %>%
filter(population > 100000) %>%
select(name, population, fclass, geometry) %>%
geoarrow_collect_sf()
#> Simple feature collection with 5 features and 3 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 9.921526 ymin: 55.39972 xmax: 12.57007 ymax: 57.04626
#> Geodetic CRS: WGS 84
#> # A tibble: 5 × 4
#> name population fclass geometry
#> <chr> <dbl> <chr> <POINT [°]>
#> 1 København 613288 national_capital (12.57007 55.68672)
#> 2 Frederiksberg 102029 suburb (12.53262 55.67802)
#> 3 Aarhus 273077 city (10.2134 56.14963)
#> 4 Odense 178210 city (10.38521 55.39972)
#> 5 Aalborg 114194 city (9.921526 57.04626)
Geometry operators aren’t yet supported within the Arrow compute engine, so any filtering or transformation on geometry columns must be done at the beginning (i.e., when selecting files to pass to open_dataset()
) or at the end (i.e., after calling geoarrow_collect_sf()
).
capital <- places %>%
filter(name == "København") %>%
geoarrow_collect_sf()
# cities within 200 km of the capital
places %>%
filter(fclass == "city") %>%
geoarrow_collect_sf() %>%
filter(
s2::s2_dwithin(geometry, capital, 200000)
)
#> Simple feature collection with 7 features and 5 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 9.47905 ymin: 55.39972 xmax: 12.08192 ymax: 56.46175
#> Geodetic CRS: WGS 84
#> # A tibble: 7 × 6
#> osm_id code population name fclass geometry
#> * <chr> <int> <dbl> <chr> <chr> <POINT [°]>
#> 1 21040334 1001 50781 Roskilde city (12.08192 55.64335)
#> 2 26559154 1001 62687 Randers city (10.03715 56.46175)
#> 3 26559170 1001 60508 Kolding city (9.47905 55.4895)
#> 4 26559198 1001 56567 Vejle city (9.533324 55.70001)
#> 5 26559213 1001 273077 Aarhus city (10.2134 56.14963)
#> 6 26559274 1001 178210 Odense city (10.38521 55.39972)
#> 7 1368129781 1001 58646 Horsens city (9.844477 55.86117)
Using geoarrow_collect()
Until this point, we have used geoarrow_collect_sf()
. This is probably what you want, since the fantastic sf package is a feature-complete GIS designed to work well with GIS data in data.frame form. If you want to customize how geometry columns are converted into R objects, you can use geoarrow_collect()
with a handler
. For example, if you want to convert geometry to a wk::wkb()
directly without pivoting through sf, you can use handler = wk::wkb_writer
.
places %>%
head() %>%
geoarrow_collect(handler = wk::wkb_writer)
#> # A tibble: 6 × 6
#> osm_id code population name fclass geometry
#> <chr> <int> <dbl> <chr> <chr> <wk_wkb>
#> 1 21040334 1001 50781 Roskilde city <POINT (12.08192 55.64335)>
#> 2 21040360 1001 72398 Esbjerg city <POINT (8.452075 55.46649)>
#> 3 26559154 1001 62687 Randers city <POINT (10.03715 56.46175)>
#> 4 26559170 1001 60508 Kolding city <POINT (9.47905 55.4895)>
#> 5 26559198 1001 56567 Vejle city <POINT (9.533324 55.70001)>
#> 6 26559213 1001 273077 Aarhus city <POINT (10.2134 56.14963)>
For large point data sets, it can be useful to represent geometry as a wk::xy()
, which is a thin wrapper around data.frame(x = c(), y = c())
. You can create these objects as geometry columns using handler = wk::xy_writer
:
(places_xy <- places %>%
select(name, geometry) %>%
geoarrow_collect(handler = wk::xy_writer))
#> # A tibble: 7,255 × 2
#> name geometry
#> <chr> <wk_xy>
#> 1 Roskilde (12.081925 55.64335)
#> 2 Esbjerg ( 8.452075 55.46649)
#> 3 Randers (10.037148 56.46175)
#> 4 Kolding ( 9.479050 55.48950)
#> 5 Vejle ( 9.533324 55.70001)
#> 6 Aarhus (10.213405 56.14963)
#> 7 Odense (10.385210 55.39972)
#> 8 Horsens ( 9.844477 55.86117)
#> 9 Aalborg ( 9.921526 57.04626)
#> 10 Englebjerggård (11.777372 55.20040)
#> # … with 7,245 more rows
xy <- as.data.frame(places_xy$geometry)
head(xy$x)
#> [1] 12.081925 8.452075 10.037148 9.479050 9.533324 10.213405