The geoarrow package exposes a set of high-level helpers to integrate geospatial data with Arrow Datasets; however, the format used can be easily represented using the Apache Arrow C Data interface, exposing vectors of geometries as ABI-stable C structures that can be used without any header other than the two struct definitions in arrow/c/abi.h.

We’ll use the narrow and geoarrow packages to demonstrate the structure of geoarrow arrays and how they can be iterated over in C and C++.

library(narrow)
library(geoarrow)

We’ll also need some test arrays with each of the six types supported in the Arrow-native format.

points <- geoarrow_create_narrow(
wk::wkt(c("POINT (30 10)", "POINT (40 30)"))
)

linestrings <- geoarrow_create_narrow(
wk::wkt(
c(
"LINESTRING (30 10, 10 30, 40 40)",
"LINESTRING (0 0, 10 5)"
)
)
)

polygons <- geoarrow_create_narrow(
wk::wkt(
c(
"POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))",
"POLYGON (
(35 10, 45 45, 15 40, 10 20, 35 10),
(20 30, 35 35, 30 20, 20 30))"
)
)
)

multipoints <- geoarrow_create_narrow(
wk::wkt(c("MULTIPOINT (0 1, 2 3)", "MULTIPOINT (0 0, 3 8)"))
)

multilinestrings <- geoarrow_create_narrow(
wk::wkt(
c(
"MULTILINESTRING ((30 10, 40 40, 20 40, 10 20, 30 10))",
"MULTILINESTRING (
(35 10, 45 45, 15 40, 10 20, 35 10),
(20 30, 35 35, 30 20, 20 30))"
)
)
)

multipolygons <- geoarrow_create_narrow(
wk::wkt(
c(
"MULTIPOLYGON (
((30 20, 45 40, 10 40, 30 20)),
((15 5, 40 10, 10 20, 5 10, 15 5)))",
"MULTIPOLYGON (
((40 40, 20 45, 45 30, 40 40)),
((20 35, 10 30, 10 10, 30 5, 45 20, 20 35),
(30 20, 20 15, 20 25, 30 20)))"
)
)
)

The other piece of setup you may need to follow along is a way to pass the example arrays into C and/or C++ code. If you are writing an R package, you can include the narrow package in your LinkingTo: field in your DESCRIPTION; if you are writing an RMarkdown document (like this one), you can set your PKG_CPPFLAGS environment variable to include the “include” directory in the narrow package directory:

Sys.setenv(
PKG_CPPFLAGS = paste0("-I", system.file("include", package = "narrow"))
)

This will let you #include "narrow.h", which includes a copy of the Arrow C Data structure definitions and functions to extract them safely from R objects.

#include "narrow.h"

## Reading geoarrow arrays in R

The easiest way to inspect geoarrow array data is from R using the wk package. The general structure of an array is available from wk::wk_vector_meta(), which is fast (it doesn’t iterate over all the points).

wk::wk_vector_meta(points)
#>   geometry_type size has_z has_m
#> 1             1    2 FALSE FALSE

If you need the geodesic flag or the CRS, you can use wk::wk_is_geoesic() or wk::wk_crs():

wk::wk_is_geodesic(points)
#> [1] FALSE
wk::wk_crs(points)
#> NULL

You can access coordinate values as flat arrays using wk_coords():

wk::wk_coords(linestrings)
#>   feature_id part_id ring_id  x  y
#> 1          1       1       0 30 10
#> 2          1       1       0 10 30
#> 3          1       1       0 40 40
#> 4          2       2       0  0  0
#> 5          2       2       0 10  5

The geoarrow package implements wk::wk_handle(), so you can use any handler or filter defined in any package. For example, you can apply an affine transform and create a wk::wkb() vector in one step (with zero intermediary copies):

wk::wk_handle(
linestrings,
wk::wk_transform_filter(
trans = wk::wk_affine_rotate(90),
handler = wk::wkb_writer()
)
)
#> <wk_wkb[2]>
#> [1] <LINESTRING (-10 30, -30 10, -40 40)> <LINESTRING (0 0, -5 10)>

You can use the wk::sfc_writer() handler to generate sf objects. Note that when using the wk::wk_handle() interface, you will have to propagate the CRS and/or geodesic flags yourself. For a guide on how to write filters and handlers in C and C++, see the programming vignette from the wk package.

If you need geoarrow/arrow-specific metadata information, you can inspect the schemas directly using the list-like interface of schem aobjects. The most important thing you will have to check is the extension name, which will tell you which bit of compiled code you will need to safely access the array data.

points$schema$metadata[["ARROW:extension:name"]]
#> [1] "geoarrow.point"

For points, you will also have to check dimensions and the storage type, since depending on the dimensions and/or encoding the underlying arrays could have buffer numbers and/or types.

points$schema$format
#> [1] "+w:2"
points$schema$children[[1]]$format #> [1] "g" points$schema$children[[1]]$name
#> [1] "xy"

The above output indicates that the point array is a fixed-size list (size 2) whose child is a float64. The dimension output indicates that the dimension type is xy.

You may also need to access the extension metadata which contains information like the CRS. This metadata is serialized in the raw schema, so you’ll need geoarrow_metadata() to extract it.

geoarrow_metadata(points$schema) #> named list() ## Reading geoarrow arrays using geoarrow.hpp Built-in operations in the geoarrow package use the geoarrow.hpp helper file, which defines a few classes and functions to ease dealing with multiple extension types, dimensions, and point storage types. This is currently an evolving library; however, you can copy the file and include it in another R package or project (it requires no dependencies). It is currently located here. ### Iterating over coordinates (TODO: Haven’t added this to the GeoArrowArrayView yet) ### Using the Handler The Handler is a class with methods corresponding to events that occur when looping over coordinates from top to bottom. This style of iteration was inspired by GeoRust’s geozero and SAX-style XML parsers. Writing operations as handlers is a good fit for some operations but can be challenging for others. In R, it powers the wk::wk_handle() operation which in turn powers many other operations via wk handlers elsewhere. It is particularly useful for operations that iterate forward through all coordinates. For example, the following subclass of Handler calculates the bounding box of any (arrow array of) geometry it is handed. #include <cpp11.hpp> #include "narrow.h" #define ARROW_HPP_IMPL #include "../../src/internal/geoarrow-cpp/factory.hpp" using namespace cpp11; using namespace geoarrow; class BboxHandler: public Handler { public: BboxHandler() : xmin(R_PosInf), xmax(R_NegInf), ymin(R_PosInf), ymax(R_NegInf) {} Result coords(const double* coord, int64_t n, int32_t coord_size) { xmin = std::min<double>(coord[0], xmin); xmax = std::max<double>(coord[0], xmax); ymin = std::min<double>(coord[1], ymin); ymax = std::max<double>(coord[1], ymax); return Result::CONTINUE; } double xmin; double xmax; double ymin; double ymax; }; [[cpp11::register]] doubles geoarrow_bbox(sexp schema_xptr, sexp array_data_xptr) { struct ArrowSchema* schema = safe[schema_from_xptr](schema_xptr, "schema"); struct ArrowArray* array_data = safe[array_data_from_xptr](array_data_xptr, "array_data"); std::unique_ptr<ArrayView> view(create_view(schema)); view->set_array(array_data); BboxHandler handler; view->read_meta(&handler); view->read_features(&handler); writable::doubles out = {handler.xmin, handler.xmax, handler.ymin, handler.ymax}; return out; } geoarrow_bbox(polygons$schema, polygons$array_data) #> [1] 20 35 10 30 ### Metadata The geoarrow package uses the Meta class to validate and parse both schema and array objects. It is a wrapper around a struct ArrowSchema that walks its input and extracts pieces in a reasonable format to build on. In most cases you shouldn’t need to extract metadata in compiled code; however, the Meta class is the easiest way to do so if you are in this position. #include <cpp11.hpp> #include "narrow.h" #include "../../src/internal/geoarrow-cpp/meta.hpp" using namespace cpp11; using namespace geoarrow; [[cpp11::register]] void geoarrow_metadata_dump(sexp schema_xptr) { struct ArrowSchema* schema = safe[schema_from_xptr](schema_xptr, "schema"); Meta meta; if (!meta.set_schema(schema)) { stop("%s", meta.error_); } switch(meta.extension_) { case util::Extension::Point: Rprintf("Extension: 'geoarrow.point'\n"); break; case util::Extension::Linestring: Rprintf("Extension: 'geoarrow.linestring'\n"); break; case util::Extension::Polygon: Rprintf("Extension: 'geoarrow.polygon'\n"); break; case util::Extension::MultiPoint: Rprintf("Extension: 'geoarrow.multipoint'\n"); break; case util::Extension::MultiLinestring: Rprintf("Extension: 'geoarrow.multilinestring'\n"); break; case util::Extension::MultiPolygon: Rprintf("Extension: 'geoarrow.multipolygon'\n"); break; case util::Extension::GeometryCollection: Rprintf("Extension: 'geoarrow.geometrycollection'\n"); break; default: Rprintf("Extension: other\n"); break; } switch(meta.storage_type_) { case util::StorageType::FixedSizeList: Rprintf("Storage type: fixed size list\n"); break; case util::StorageType::Struct: Rprintf("Storage type: struct\n"); break; case util::StorageType::List: Rprintf("Storage type: list\n"); break; default: Rprintf("Storage type: other\n"); } Rprintf("Dimensions: '%s'\n", meta.dim_); if (meta.crs_size_ > 0) { std::string crs(meta.crs_, meta.crs_size_); Rprintf("CRS: '%s'\n", crs.c_str()); } if (meta.edges_ == util::Edges::Spherical) { Rprintf("Spherical edges!\n"); } } geoarrow_metadata_dump(points$schema)
#> Extension: 'geoarrow.point'
#> Storage type: fixed size list
#> Dimensions: 'xy'
geoarrow_metadata_dump(linestrings$schema) #> Extension: 'geoarrow.linestring' #> Storage type: list #> Dimensions: 'xy' geoarrow_metadata_dump( geoarrow_create_narrow( wk::wkt(crs = "EPSG:1234", geodesic = TRUE), geoarrow_schema_linestring() )$schema
)
#> Extension: other
#> Storage type: other
#> Dimensions: 'xy'
#> CRS: 'EPSG:1234'
#> Spherical edges!

This also will give a reasonable error for invalid objects:

points_invalid <- points
points_invalid$schema$children[[1]]$name <- "xyzm" geoarrow_metadata_dump(points_invalid$schema)
#> Error: Expected geoarrow.point with dimensions 'xyzm' to have width 4 but found width 2

## Reading geoarrow arrays in C

If you need to deal with arbitrary vectors of that may have different types or different point encodings, you should #include "geoarrow.hpp" and use the helper functions provided there; however, the geoarrow format was designed to be accessible from C code without helpers. If you can make some assumptions about the geometry type and storage type of your input but still want the speed of compiled code, using the C structures directly may be a good fit. As an example, I’m going to print coordinates with information about the containing structures.

### Points

For points, you need three pieces of information: the length, the initial offset into the array, and the buffer of doubles containing the coordinates.

#include <R.h>
#include "narrow.h"

SEXP c_point_print(SEXP array_data_xptr) {
struct ArrowArray* array_data = array_data_from_xptr(array_data_xptr, "array_data");

int coord_size = 2;
double* coords = (double*) array_data->children[0]->buffers[1];
coords = coords + (array_data->offset) * coord_size;

double* coord;
for (int64_t point_id = 0; point_id < array_data->length; point_id++) {
coord = coords + (point_id * coord_size);
Rprintf("point[%d] (%g %g)\n", point_id, coord[0], coord[1]);
}

return R_NilValue;
}

Depending on whether or not you have access to a higher-level runtime, you probably want to validate a few things about the schema before accessing elements of the array.

point_print <- function(x) {
x <- as_narrow_array(x)

stopifnot(
x$schema$format == "+w:2",
x$schema$children[[1]]$format == "g" ) invisible(.Call("c_point_print", x$array_data))
}

point_print(points)
#> point[0] (30 10)
#> point[1] (40 30)

### Linestrings

In addition to all the information about a point, you also need the offsets into the point array from the outer linestring.

#include <R.h>
#include "narrow.h"

SEXP c_linestring_print(SEXP array_data_xptr) {
struct ArrowArray* array_data = array_data_from_xptr(array_data_xptr, "array_data");

int coord_size = 2;
double* coords = (double*) array_data->children[0]->children[0]->buffers[1];
int32_t* coord_offsets = (int32_t*) array_data->buffers[1];

coord_offsets = coord_offsets + array_data->offset;
coords = coords + coord_offsets[0];

double* coord;
for (int64_t line_id = 0; line_id < array_data->length; line_id++) {
int32_t n_coords = coord_offsets[line_id + 1] - coord_offsets[line_id];

for (int32_t point_id = 0; point_id < n_coords; point_id++) {
coord = coords + coord_size * (coord_offsets[line_id] + point_id);
Rprintf("linestring[%d]->point[%d] (%g %g)\n", line_id, point_id, coord[0], coord[1]);
}
}

return R_NilValue;
}
linestring_print <- function(x) {
x <- as_narrow_array(x)

stopifnot(
x$schema$format == "+l",
x$schema$children[[1]]$format == "+w:2", x$schema$children[[1]]$children[[1]]$format == "g" ) invisible(.Call("c_linestring_print", x$array_data))
}

linestring_print(linestrings)
#> linestring[0]->point[0] (30 10)
#> linestring[0]->point[1] (10 30)
#> linestring[0]->point[2] (40 40)
#> linestring[1]->point[0] (0 0)
#> linestring[1]->point[1] (10 5)

### Polygons

In addition to all the information about a linestring, you also need the offsets into the rings array from the outer polygon.

#include <R.h>
#include "narrow.h"

SEXP c_polygon_print(SEXP array_data_xptr) {
struct ArrowArray* array_data = array_data_from_xptr(array_data_xptr, "array_data");

int coord_size = 2;
double* coords = (double*) array_data->children[0]->children[0]->children[0]->buffers[1];
int32_t* coord_offsets = (int32_t*) array_data->children[0]->buffers[1];
int32_t* ring_offsets = (int32_t*) array_data->buffers[1];

ring_offsets = ring_offsets + array_data->offset;
coord_offsets = coord_offsets + ring_offsets[0];
coords = coords + coord_offsets[0];

double* coord;
int32_t* coord_offset;
for (int64_t poly_id = 0; poly_id < array_data->length; poly_id++) {
int32_t n_rings = ring_offsets[poly_id + 1] - ring_offsets[poly_id];
coord_offset = coord_offsets + ring_offsets[poly_id];

for (int32_t ring_id = 0; ring_id < n_rings; ring_id++) {
int32_t n_coords = coord_offset[ring_id + 1] - coord_offset[ring_id];

for (int32_t point_id = 0; point_id < n_coords; point_id++) {
coord = coords + (coord_size * (coord_offset[ring_id] + point_id));
Rprintf(
"polygon[%d]->ring[%d]->point[%d] (%g %g)\n",
poly_id, ring_id, point_id, coord[0], coord[1]);
}
}
}

return R_NilValue;
}
polygon_print <- function(x) {
x <- as_narrow_array(x)

stopifnot(
x$schema$format == "+l",
x$schema$children[[1]]$format == "+l", x$schema$children[[1]]$children[[1]]$format == "+w:2", x$schema$children[[1]]$children[[1]]$children[[1]]$format == "g"
)

invisible(.Call("c_polygon_print", x$array_data)) } polygon_print(polygons) #> polygon[0]->ring[0]->point[0] (30 10) #> polygon[0]->ring[0]->point[1] (40 40) #> polygon[0]->ring[0]->point[2] (20 40) #> polygon[0]->ring[0]->point[3] (10 20) #> polygon[0]->ring[0]->point[4] (30 10) #> polygon[1]->ring[0]->point[0] (35 10) #> polygon[1]->ring[0]->point[1] (45 45) #> polygon[1]->ring[0]->point[2] (15 40) #> polygon[1]->ring[0]->point[3] (10 20) #> polygon[1]->ring[0]->point[4] (35 10) #> polygon[1]->ring[1]->point[0] (20 30) #> polygon[1]->ring[1]->point[1] (35 35) #> polygon[1]->ring[1]->point[2] (30 20) #> polygon[1]->ring[1]->point[3] (20 30) ### Multipoints Iterating over multipoints requires the same code and assertions as for linestrings: linestring_print(multipoints) #> linestring[0]->point[0] (0 1) #> linestring[0]->point[1] (2 3) #> linestring[1]->point[0] (0 0) #> linestring[1]->point[1] (3 8) ### Multilinestrings Iterating over multilinestrings requires the same code and assertions as for polygons: polygon_print(multilinestrings) #> polygon[0]->ring[0]->point[0] (30 10) #> polygon[0]->ring[0]->point[1] (40 40) #> polygon[0]->ring[0]->point[2] (20 40) #> polygon[0]->ring[0]->point[3] (10 20) #> polygon[0]->ring[0]->point[4] (30 10) #> polygon[1]->ring[0]->point[0] (35 10) #> polygon[1]->ring[0]->point[1] (45 45) #> polygon[1]->ring[0]->point[2] (15 40) #> polygon[1]->ring[0]->point[3] (10 20) #> polygon[1]->ring[0]->point[4] (35 10) #> polygon[1]->ring[1]->point[0] (20 30) #> polygon[1]->ring[1]->point[1] (35 35) #> polygon[1]->ring[1]->point[2] (30 20) #> polygon[1]->ring[1]->point[3] (20 30) ### Multipolygons In addition to all the information about a polygon, you also need the offsets into the geometries array from the outer collection. #include <R.h> #include "narrow.h" SEXP c_multipolygon_print(SEXP array_data_xptr) { struct ArrowArray* array_data = array_data_from_xptr(array_data_xptr, "array_data"); int coord_size = 2; double* coords = (double*) array_data->children[0]->children[0]->children[0]->children[0]->buffers[1]; int32_t* coord_offsets = (int32_t*) array_data->children[0]->children[0]->buffers[1]; int32_t* ring_offsets = (int32_t*) array_data->children[0]->buffers[1]; int32_t* geom_offsets = (int32_t*) array_data->buffers[1]; geom_offsets = geom_offsets + array_data->offset; ring_offsets = ring_offsets + geom_offsets[0]; coord_offsets = coord_offsets + ring_offsets[0]; coords = coords + coord_offsets[0]; double* coord; int32_t* coord_offset; int32_t* ring_offset; for (int64_t multi_id = 0; multi_id < array_data->length; multi_id++) { int32_t n_geoms = geom_offsets[multi_id + 1] - geom_offsets[multi_id]; ring_offset = ring_offsets + geom_offsets[multi_id]; for (int32_t poly_id = 0; poly_id < n_geoms; poly_id++) { int32_t n_rings = ring_offset[poly_id + 1] - ring_offset[poly_id]; coord_offset = coord_offsets + ring_offsets[poly_id]; for (int32_t ring_id = 0; ring_id < n_rings; ring_id++) { int32_t n_coords = coord_offset[ring_id + 1] - coord_offset[ring_id]; for (int32_t point_id = 0; point_id < n_coords; point_id++) { coord = coords + (coord_size * (coord_offset[ring_id] + point_id)); Rprintf( "multipolygon[%d]->polygon[%d]->ring[%d]->point[%d] (%g %g)\n", multi_id, poly_id, ring_id, point_id, coord[0], coord[1]); } } } } return R_NilValue; } multipolygon_print <- function(x) { x <- as_narrow_array(x) stopifnot( x$schema$format == "+l", x$children[[1]]$schema$format == "+l",
x$children[[1]]$schema$children[[1]]$format == "+l",
x$children[[1]]$schema$children[[1]]$children[[1]]$format == "+w:2", x$children[[1]]$schema$children[[1]]$children[[1]]$children[[1]]$format == "g" ) invisible(.Call("c_multipolygon_print", x$array_data))
}

multipolygon_print(multipolygons)
#> multipolygon[0]->polygon[0]->ring[0]->point[0] (30 20)
#> multipolygon[0]->polygon[0]->ring[0]->point[1] (45 40)
#> multipolygon[0]->polygon[0]->ring[0]->point[2] (10 40)
#> multipolygon[0]->polygon[0]->ring[0]->point[3] (30 20)
#> multipolygon[0]->polygon[1]->ring[0]->point[0] (15 5)
#> multipolygon[0]->polygon[1]->ring[0]->point[1] (40 10)
#> multipolygon[0]->polygon[1]->ring[0]->point[2] (10 20)
#> multipolygon[0]->polygon[1]->ring[0]->point[3] (5 10)
#> multipolygon[0]->polygon[1]->ring[0]->point[4] (15 5)
#> multipolygon[1]->polygon[0]->ring[0]->point[0] (30 20)
#> multipolygon[1]->polygon[0]->ring[0]->point[1] (45 40)
#> multipolygon[1]->polygon[0]->ring[0]->point[2] (10 40)
#> multipolygon[1]->polygon[0]->ring[0]->point[3] (30 20)
#> multipolygon[1]->polygon[1]->ring[0]->point[0] (15 5)
#> multipolygon[1]->polygon[1]->ring[0]->point[1] (40 10)
#> multipolygon[1]->polygon[1]->ring[0]->point[2] (10 20)
#> multipolygon[1]->polygon[1]->ring[0]->point[3] (5 10)
#> multipolygon[1]->polygon[1]->ring[0]->point[4] (15 5)
#> multipolygon[1]->polygon[1]->ring[1]->point[0] (40 40)
#> multipolygon[1]->polygon[1]->ring[1]->point[1] (20 45)
#> multipolygon[1]->polygon[1]->ring[1]->point[2] (45 30)
#> multipolygon[1]->polygon[1]->ring[1]->point[3] (40 40)

The easiest way to inspect the schema, which encodes the storage type and extra geo-specific extension metadata, is to do it before you drop into C. In particular, wk::wk_vector_meta() has been implemented for geoarrow arrays in such a way that it will (1) validate the schema to let you make stronger assumptions about the C structures you are given and (2) deserialize and extract the metadata you might need to reduce the number of problems you have to solve in C.

If you really do need to do this in compiled code, you can access the struct ArrowSchema directly. For example, to extract the dimensions from C, you can do this:

#include <R.h>
#include "narrow.h"

SEXP c_point_dimensions(SEXP schema_xptr) {
struct ArrowSchema* schema = schema_from_xptr(schema_xptr, "schema");
return Rf_mkString(schema->children[0]->name);
}
.Call("c_point_dimensions", points$schema) #> [1] "xy" .Call("c_point_dimensions", linestrings$schema$children[[1]]) #> [1] "xy" If you need extension metadata from C you will have to iterate over the metadata field AND the extension metadata field. These are both encoded in the same way (by design). This gets verbose quickly, but if you really need to do it you can do so in about 50 lines. The important thing to note is that none of the names or values are null-terminated in the metadata fields. #include <R.h> #include "narrow.h" SEXP c_dump_metadata(SEXP schema_xptr) { struct ArrowSchema* schema = schema_from_xptr(schema_xptr, "schema"); const char* metadata = schema->metadata; int64_t pos = 0; int32_t n, m, name_len, value_len; memcpy(&n, metadata + pos, sizeof(int32_t)); pos += sizeof(int32_t); for (int i = 0; i < n; i++) { memcpy(&name_len, metadata + pos, sizeof(int32_t)); pos += sizeof(int32_t); const char* name = metadata + pos; pos += name_len; memcpy(&value_len, metadata + pos, sizeof(int32_t)); pos += sizeof(int32_t); if (name_len >= 20 && strncmp(name, "ARROW:extension:name", 20) == 0) { const char* value = metadata + pos; pos += value_len; Rprintf("ARROW:extension:name: '%.*s'\n", value_len, value); } else if (name_len >= 24 && strncmp(name, "ARROW:extension:metadata", 24) == 0) { memcpy(&m, metadata + pos, sizeof(int32_t)); pos += sizeof(int32_t); for (int j = 0; j < m; j++) { memcpy(&name_len, metadata + pos, sizeof(int32_t)); pos += sizeof(int32_t); const char* ext_name = metadata + pos; pos += name_len; memcpy(&value_len, metadata + pos, sizeof(int32_t)); pos += sizeof(int32_t); const char* ext_value = metadata + pos; pos += value_len; if (name_len == 0 || value_len == 0) { continue; } Rprintf( "ARROW:extension:metadata/%.*s: '%.*s'\n", name_len, ext_name, value_len, ext_value); } } else { pos += value_len; continue; } } return R_NilValue; } invisible(.Call("c_dump_metadata", points$schema))
#> ARROW:extension:name: 'geoarrow.point'

with_crs <- geoarrow_create_narrow(
wk::wkt(crs = "EPSG:1234"),
schema = geoarrow_schema_point()
)
#> ARROW:extension:metadata/crs: 'EPSG:1234'