Technical Overview of Virtual Stores

Core concepts

  • Zarr is a specification for chunked, compressed, multi-dimensional arrays. It is designed for cloud-native access, in which chunks are network-addressable objects in object storage, and it is the proposed specification for virtual stores at NASA. Cloud-Optimized GeoTIFF (COG) is optimized for 2D raster imagery but does not generalize to multi-dimensional scientific arrays. Cloud-optimized HDF5 improves on legacy HDF5 but still relies on HDF5 library internals and byte-range seeking within a single file. Zarr, by contrast, stores each chunk as a separate object in cloud storage. This design enables highly parallel reads, straightforward scaling, and native compatibility with object storage APIs without specialized client libraries.
  • Chunk manifests are lightweight metadata structures that map each chunk in the logical data space to the location where its data is stored. A chunk manifest is also called an “indirection layer”. Icechunk, Kerchunk, and DMR++ are common implementations of chunk manifests.
  • Icechunk is a transactional storage engine for Zarr arrays. Icechunk stores chunk manifests, which it calls virtual datasets.
  • Kerchunk is an early approach to creating chunk manifests (which it calls reference files). It maps virtual Zarr array coordinates to byte ranges in existing files and persists that mapping in the JSON or Parquet file format.
  • DMR++ (also written DMRPP) is a chunk manifest format from OPeNDAP. It is semantically equivalent to a Kerchunk or Zarr chunk manifest and is mostly used internally on datasets supported by OPeNDAP. Unlike Kerchunk or Icechunk, however, this format is not stackable (that is, many chunk manifests cannot be consolidated into one). When a DMR++ file is available, it can speed up the creation of virtual stores.
  • VirtualiZarr is a library for generating chunk manifests. It uses format-specific parsers to read chunk information and generate in-memory chunk manifests, which it can then write to Icechunk or Kerchunk.
  • GeoZarr extends Zarr with geospatial conventions, including coordinate reference systems and spatial metadata.
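
The chunk-manifest idea in the list above (a mapping from logical chunk keys to byte ranges in existing files, plus the "stacking" of several manifests into one) can be illustrated with a minimal sketch. This is not the API of VirtualiZarr, Kerchunk, Icechunk, or DMR++. The names ChunkRef, read_chunk, and stack_manifests are hypothetical, the files are small local stand-ins for archival granules, and the "i.j" key style only loosely imitates Zarr v2 chunk keys.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical manifest entry: where one chunk's bytes live."""
    path: str    # file (or object) that already holds the data
    offset: int  # byte offset of the chunk inside that file
    length: int  # number of bytes in the chunk


def read_chunk(manifest, chunk_key):
    """Resolve a logical chunk key to bytes via a byte-range read."""
    ref = manifest[chunk_key]
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


def stack_manifests(manifests):
    """Consolidate per-file manifests along a new leading axis.

    The position of each manifest in the list becomes the first
    index of the combined chunk key, e.g. "1" -> "0.1".
    """
    stacked = {}
    for i, manifest in enumerate(manifests):
        for key, ref in manifest.items():
            stacked[f"{i}.{key}"] = ref
    return stacked


with tempfile.TemporaryDirectory() as tmp:
    # Two stand-in "legacy" files: a 4-byte header, then two 4-byte chunks.
    paths = []
    for name, payload in [("a.bin", b"HEADaaaabbbb"), ("b.bin", b"HEADccccdddd")]:
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)

    # One manifest per file: chunk "0" at bytes 4..7, chunk "1" at bytes 8..11.
    per_file = [{"0": ChunkRef(p, 4, 4), "1": ChunkRef(p, 8, 4)} for p in paths]

    # A single combined manifest now addresses chunks from both files.
    combined = stack_manifests(per_file)
    first = read_chunk(combined, "0.0")
    last = read_chunk(combined, "1.1")
```

The original files are never rewritten or copied in this sketch. Only the small mapping is created, which is the property that makes a virtual store cheap to build on top of existing archives.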