Skip to contents

The ability to query large datasets using the Arrow compute engine was the primary motivator for geoarrow; however, this capability depends on efficient storage of geospatial data in file formats that work well with Arrow like Parquet and Feather. Several previously defined formats to store geometries in text- and/or binary-based file exist; however, each of these is difficult to incorporate into Parquet and/or Feather files without compromising the advantages of the Arrow columnar format that we wanted to leverage. In particular, we wanted O(1) access to coordinate values and to be able to pass geometry vectors around using the C data interface in a way that didn’t require readers to implement their own WKB parser (or any other parser).

We’ll use the geoarrow and arrow packages to demonstrate the structure and metadata of these types:

Metadata

All geoarrow arrays carry an extension type with a “geoarrow.” prefix (via the field-level ARROW:extension:name metadata key) and extension metadata (via the field-level ARROW:extension:metadata key). The extension metadata contains key/value pairs encoded in the same format as specified for metadata in the C data interface. This format was chosen to allow readers to access this information without having to vendor a base64 decoder or JSON parser. Currently supported keys are:

  • crs: Contains a serialized version of the coordinate reference system as WKT2 (previously known as WKT2:2019). The string is interpreted using UTF-8 encoding.
  • edges: A value of "spherical" instructs readers that edges should be interpolated along a spherical path rather than a Cartesian one (i.e., for lossless conversion to and from S2 and/or BigQuery geography; otherwise, edges will be interpreted as planar. The edges key must be "spherical" or the key should be omitted. A future value of “ellipsoidal” may be permitted if libraries to support such edges become available.

The keys should appear in the order listed above. Empty metadata should be encoded as four zero bytes (i.e., the 32-bit integer 0x00 0x00 0x00 0x00, indicating that there are zero metadata keys) rather than omitted. These constraints are in place to ensure that type equality can be checked without deserializing the ARROW:extension:metadata field.

The crs key is only used for geoarrow.point arrays; the edges key is only used for geoarrow.linestring and geoarrow.polygon arrays. Practically this was chosen so that child arrays can be passed to functions and validated independently (i.e., without having to pass the crs/edges values down the call stack as extra arguments). Conceptually this was chosen to keep metadata confined to the array for which it is relevant.

In geoarrow, you can view the decoded extension metadata using geoarrow_metadata():

geoarrow_metadata(geoarrow_schema_point(crs = "OGC:CRS84"))
#> $crs
#> [1] "OGC:CRS84"
geoarrow_metadata(geoarrow_schema_linestring(edges = "spherical"))
#> $edges
#> [1] "spherical"

The serialized metadata looks like this:

geoarrow_schema_point(crs = "OGC:CRS84")$metadata
#> $`ARROW:extension:name`
#> [1] "geoarrow.point"
#> 
#> $`ARROW:extension:metadata`
#>  [1] 01 00 00 00 03 00 00 00 63 72 73 09 00 00 00 4f 47 43 3a 43 52 53 38 34
geoarrow_schema_linestring(edges = "spherical")$metadata
#> $`ARROW:extension:name`
#> [1] "geoarrow.linestring"
#> 
#> $`ARROW:extension:metadata`
#>  [1] 01 00 00 00 05 00 00 00 65 64 67 65 73 09 00 00 00 73 70 68 65 72 69 63 61
#> [26] 6c

Points

Metadata

The field-level metadata for points in geoarrow must contain an extension type of “geoarrow.point” and extension metadata specifying an optional coordinate reference system.

carray <- geoarrow_create_narrow(
  wk::wkt(crs = "EPSG:4326"),
  schema = geoarrow_schema_point()
)

carray$schema$metadata
#> $`ARROW:extension:name`
#> [1] "geoarrow.point"
#> 
#> $`ARROW:extension:metadata`
#>  [1] 01 00 00 00 03 00 00 00 63 72 73 09 00 00 00 45 50 53 47 3a 34 33 32 36
geoarrow_metadata(carray$schema)
#> $crs
#> [1] "EPSG:4326"

The coordinate reference system in geoarrow is always stored with the point array, which is used as a child array for all other types.

Storage type

Points are represented in geoarrow as a fixed-size list of float64 (i.e., double) values. Conceptually this is much like storing coordinates as a (row-major) matrix with one row per feature and one column per dimension.

carray <- geoarrow_create_narrow(
  wk::wkt(c("POINT (0 1)", "POINT (2 3)")),
  schema = geoarrow_schema_point()
)
narrow::from_narrow_array(carray, arrow::Array)
#> ExtensionArray
#> <geoarrow.point <crs: unspecified>>
#> [
#>   [
#>     0,
#>     1
#>   ],
#>   [
#>     2,
#>     3
#>   ]
#> ]

Points stored as a fixed-size list have exactly one child named xy, xyz, xym, or xyzm. The width of the fixed-size list must be 2, 3, or 4, and agree with the child name (e.g., if the child name is xyzm, it must be a fixed-size list of size 4). The child storage type must be a float64 for now (although in the future other child types like float32 or decimal128 may be supported).

# interleaved xy values in one buffer
carray$array_data$children[[1]]$buffers[[2]]
#> <pointer: 0x55752a804de0>

Other storage types of points that may be supported in a future reference implementation are:

  • Struct-encoded points (i.e., x, y, and/or z and/or m stored in their own arrays)
  • Dictionary-encoded point representation (may allow for compact representation and efficient querying of polygon coverages with shared vertices)
  • S2 or H3 identifiers (compact and fast to test for containment)
  • float or decimal storage of coordinate values (float when lower precision is adequate; decimal when double precision is inadequate).

Linestrings

Metadata

The field-level metadata for linestrings in geoarrow must contain an extension type of “geoarrow.linestring” and extension metadata specifying an optional “edges” flag (see parent ‘Metadata’ section above).

carray <- geoarrow_create_narrow(
  wk::wkt(geodesic = TRUE),
  schema = geoarrow_schema_linestring()
)

carray$schema$metadata
#> $`ARROW:extension:name`
#> [1] "geoarrow.linestring"
#> 
#> $`ARROW:extension:metadata`
#>  [1] 01 00 00 00 05 00 00 00 65 64 67 65 73 09 00 00 00 73 70 68 65 72 69 63 61
#> [26] 6c
geoarrow_metadata(carray$schema)
#> $edges
#> [1] "spherical"

The coordinate reference system in geoarrow is always stored with the point array (i.e., the child array of a geoarrow.linestring).

Storage type

Linestrings are stored as a list<vertices: <geoarrow.point>>. The exact storage type of the geoarrow.point can vary as described above. Conceptually this is attaching buffer of (int32_t) offsets to an exiting array of points, where each offset points to the first vertex in a linestring.

carray <- geoarrow_create_narrow(
  wk::wkt("LINESTRING (1 2, 3 4)"),
  schema = geoarrow_schema_linestring()
)
narrow::from_narrow_array(carray, arrow::Array)
#> ExtensionArray
#> <geoarrow.linestring <crs: unspecified>>
#> [
#>   [
#>     [
#>       1,
#>       2
#>     ],
#>     [
#>       3,
#>       4
#>     ]
#>   ]
#> ]
# offsets for each linestring into the vertices array
carray$array_data$buffers[[2]]
#> <pointer: 0x55752afe2aa0>

# coordinates
carray$array_data$children[[1]]$children[[1]]$buffers[[2]]
#> <pointer: 0x55752b0c16e0>

Polygons

Metadata

The field-level metadata for polygons in geoarrow must contain an extension type of “geoarrow.polygon” and extension metadata specifying an optional “edges” flag (see parent ‘Metadata’ section above).

carray <- geoarrow_create_narrow(
  wk::wkt(geodesic = TRUE),
  schema = geoarrow_schema_polygon()
)

carray$schema$metadata
#> $`ARROW:extension:name`
#> [1] "geoarrow.polygon"
#> 
#> $`ARROW:extension:metadata`
#>  [1] 01 00 00 00 05 00 00 00 65 64 67 65 73 09 00 00 00 73 70 68 65 72 69 63 61
#> [26] 6c
geoarrow_metadata(carray$schema)
#> $edges
#> [1] "spherical"

Storage type

Linestrings are stored as a list<rings: <list<vertices: <geoarrow.point>>>. The exact storage type of the geoarrow.point can vary as described above. Conceptually this is attaching buffer of (int32_t) offsets to an exiting array of points, where each offset points to the first vertex in a linear ring. The outer list then contains offsets to the start of each polygon in the rings array. Just like WKB, rings must be closed (i.e., the first coordinate must equal the last coordinate).

carray <- geoarrow_create_narrow(
  wk::wkt("POLYGON ((0 0, 1 0, 0 1, 0 0))"),
  schema = geoarrow_schema_polygon()
)
narrow::from_narrow_array(carray, arrow::Array)
#> ExtensionArray
#> <geoarrow.polygon <crs: unspecified>>
#> [
#>   [
#>     [
#>       [
#>         0,
#>         0
#>       ],
#>       [
#>         1,
#>         0
#>       ],
#>       [
#>         0,
#>         1
#>       ],
#>       [
#>         0,
#>         0
#>       ]
#>     ]
#>   ]
#> ]
# offsets for each polygon into the ring array
carray$array_data$buffers[[2]]
#> <pointer: 0x55752b31c6b0>

# offsets for each ring into the vertices array
carray$array_data$children[[1]]$buffers[[2]]
#> <pointer: 0x55752abfc500>

# coordinates
carray$array_data$children[[1]]$children[[1]]$children[[1]]$buffers[[2]]
#> <pointer: 0x55752afdf1e0>

Collections

Metadata

Just like WKB, multipoints, multilinestrings, multipolygons, and geometrycollections share a common encoding but have different identifiers.

  • Multipoint geometries have an ARROW:extension:name of “geoarrow.multipoint” and must contain a child named “points” with the “geoarrow.point” extension type.
  • Multilinestring geometries have an ARROW:extension:name of “geoarrow.multilinestring” and must contain a child named “linestrings” with the “geoarrow.linestring” extension type.
  • Multipolygon geometries have an ARROW:extension:name of “geoarrow.multipolygon” and must contain a child named “polygons” with the “geoarrow.polygon” extension type.
  • Geometry collections (i.e., mixed arrays of points, lines, polygons, multipoints, multipolygons, and/or geometry collections) are not currently supported. For those who need to communicate these objects, use the “geoarrow.wkb” extension type. In the future, support for these will be added as unions (i.e., the child array will be a sparse or dense union of points, lines, polygons, multipoints, multilinestrings, and/or multipolygons).

Collections do not carry extension metadata of their own (i.e., the CRS and edges flags stay with the array for which they are relevant). The metadata string should not be omitted and must be empty (i.e., 0 as a 32-bit integer).

carray <- geoarrow_create_narrow(
  wk::wkt(geodesic = TRUE),
  schema = geoarrow_schema_multipoint()
)

carray$schema$metadata
#> $`ARROW:extension:name`
#> [1] "geoarrow.multipoint"
#> 
#> $`ARROW:extension:metadata`
#> [1] 00 00 00 00

(TODO: I didn’t actually implement the different extension names for different types of collections yet!!)

Storage type

  • Multipoints are stored as a list<points: <geoarrow.point>>
  • Multilinestrings are stored as a list<linestrings: <geoarrow.linestring>>
  • Multipolygons are stored as a list<polygons: <geoarrow.polygon>>

Conceptually this is attaching a buffer of (int32_t) offsets to an existing array of points, lines, or polygons.

carray <- geoarrow_create_narrow(
  wk::wkt("MULTIPOINT (1 2, 3 4)"),
  schema = geoarrow_schema_multipoint()
)
narrow::from_narrow_array(carray, arrow::Array)
#> ExtensionArray
#> <geoarrow.multipoint <crs: unspecified>>
#> [
#>   [
#>     [
#>       1,
#>       2
#>     ],
#>     [
#>       3,
#>       4
#>     ]
#>   ]
#> ]
# offsets for each multipoint into the points array
carray$array_data$buffers[[2]]
#> <pointer: 0x5575268d6a20>

# coordinates
carray$array_data$children[[1]]$children[[1]]$buffers[[2]]
#> <pointer: 0x55752a517fa0>

Relationship to well-known binary

The physical layout and logical types specified in this document are designed to align with well-known binary (WKB), as this is currently the most popular binary encoding used to store and shuffle geometries between libraries. For example, a linestring in WKB is encoded as:

  • One byte describing endian (0x01 or 0x00)
  • A uint32_t describing the geometry type and its dimensions. For a linestring this will be 2 (for XY), 1002 (for XYZ), 2002 (for XYM), or 3002 (for XYZM).
  • A uint32_t of how many vertices are contained in the linestring
  • An buffer of double containing coordinates with coordinate values kept together. For example, the points (1 2, 3 4, 5 6) would be encoded as [1, 2, 3, 4, 5, 6].

In this specification, we store the same information as WKB but organized differently:

  • A struct ArrowSchema that contains the storage type and metadata. The default representation of a linestring is stored as a list_of<vertices: fixed_list_of<xy: float64, 2>>, where the child name of the fixed list stores the dimensions (xy) of the coordinates.
  • A struct ArrowArray() that contains the coordinate values and lengths of each linestring. For example, a linestring containing the points (1 2, 3 4, 5 6) is encoded by default using:
    • An int32_t buffer of offsets to the start/end of each linestring in the points array. Because our example only includes one linestring, this would be two numbers [0, 3]. The length of each linestring can be calculated by subtracting (i.e., offset_array[i + 1] - offset_array[i]).
    • A double buffer containing coordinates with coordinate values kept together (i.e., [1, 2, 3, 4, 5, 6]).

You can learn more about these buffers and the C structures that geoarrow uses to represent them in memory in the Arrow Columnar Format specification and the C Data interface specification. For a detailed guide to iterating over geometries in C and C++, see the C and C++ development guide.