Recommendations

Design file chunk structure around typical access patterns

While this document focuses on virtual data stores, data product design decisions are worth mentioning because they directly affect VDS performance. As noted in the Limitations section, virtual data stores depend on the chunk structure of the underlying files. Files should therefore be designed with target access patterns in mind (chunk for access, not storage).
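The effect of chunk shape on access cost can be estimated with simple arithmetic. The sketch below is illustrative only: the grid dimensions, chunk shapes, and the helper name chunks_touched are assumptions made for this example, not values taken from any NASA product. It counts how many chunks a read must fetch, which roughly tracks the number of range requests a virtual store issues.

```python
import math


def chunks_touched(selection, chunk_shape):
    """Count the chunks overlapped by a contiguous selection that starts at the array origin."""
    return math.prod(math.ceil(s / c) for s, c in zip(selection, chunk_shape))


# Hypothetical daily global grid with dimensions (time, lat, lon) = (3650, 1800, 3600).
time_series = (3650, 1, 1)    # one pixel across every time step
single_map = (1, 1800, 3600)  # one full map for a single time step

map_chunks = (1, 1800, 3600)  # one chunk per time step, as files are often written
cube_chunks = (365, 90, 90)   # compromise shape for mixed access

for name, chunks in [("map-shaped", map_chunks), ("cube-shaped", cube_chunks)]:
    print(
        name,
        "time series:", chunks_touched(time_series, chunks),
        "single map:", chunks_touched(single_map, chunks),
    )
```

With these assumed shapes, the map-shaped layout needs 3650 chunks for a pixel time series but only 1 for a single map, while the cube-shaped layout needs 10 and 800 respectively. Neither is universally better; the choice depends on which access pattern dominates.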

Adopt Icechunk

Icechunk should be adopted, with risk mitigation measures in place. Icechunk is a transactional storage engine for Zarr: it lets Zarr stores be managed with the transactional guarantees familiar from traditional databases. It supports the following operational needs of many NASA datasets:

  • Incremental updating: Icechunk is the only technology that supports safely appending new data to an existing virtual store — critical for active missions that continuously produce new granules. Without it, the alternatives are rebuilding the entire store on each update or accepting the risk of metadata falling out of sync with the data it describes.
  • Safety: Changes to a store are made through ACID transactions, which ensure that all dependent updates (data and metadata) are committed together or rolled back together. A store is therefore never left in a partially updated state, and if bad data is committed, the store can be rolled back to a previous snapshot.
  • Reproducibility: An Icechunk store can be pinned to a specific snapshot, so science workflows that depend on a particular version of the data are not broken by subsequent updates. Snapshots can be tagged for long-term reference.

Reference: https://icechunk.io/en/stable/overview/.
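The three properties above can be pictured with a toy model. The sketch below is deliberately not the Icechunk API: ToyVersionedStore and its methods are hypothetical names invented for this illustration, and the store is a plain in-memory dictionary. It only shows the semantics: an update is staged on a copy and published as a whole, a failed update leaves the latest snapshot untouched, and a tag pins readers to a fixed version.

```python
import copy


class ToyVersionedStore:
    """Toy in-memory model of snapshot-based commits. Not the Icechunk API."""

    def __init__(self):
        self._snapshots = [{}]  # snapshot 0 is the empty store
        self._tags = {}

    def commit(self, changes):
        """Stage all changes on a copy, then publish the copy as a new snapshot.

        If any change is rejected, the staged copy is discarded and nothing is published.
        """
        staged = copy.deepcopy(self._snapshots[-1])
        for key, value in changes.items():
            if value is None:
                raise ValueError(f"refusing to commit empty value for {key!r}")
            staged[key] = value
        self._snapshots.append(staged)
        return len(self._snapshots) - 1  # id of the new snapshot

    def tag(self, name, snapshot_id):
        """Give a snapshot a stable, human-readable name."""
        self._tags[name] = snapshot_id

    def read(self, version=None):
        """Read the latest snapshot, a snapshot id, or a tag name."""
        if version is None:
            snapshot_id = len(self._snapshots) - 1
        elif isinstance(version, str):
            snapshot_id = self._tags[version]
        else:
            snapshot_id = version
        return copy.deepcopy(self._snapshots[snapshot_id])


store = ToyVersionedStore()

# First granule: chunk reference and metadata are committed together.
v1 = store.commit({"manifest/2024-01": "granule_a.nc", "metadata/time_len": 1})
store.tag("release-1", v1)

# A broken update (metadata missing) is rejected as a whole.
try:
    store.commit({"manifest/2024-02": "granule_b.nc", "metadata/time_len": None})
except ValueError:
    pass  # the latest snapshot is still v1

# A valid append creates a new snapshot; the tagged snapshot is unaffected.
v2 = store.commit({"manifest/2024-02": "granule_b.nc", "metadata/time_len": 2})
```

After this sequence, reading the latest version returns both granules, while reading "release-1" still returns the single-granule state a pinned workflow would expect.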

While Icechunk is open source, it is maintained by a small, external development team, which makes NASA dependent on that team. NASA should mitigate this risk by funding Icechunk maintenance and development.

More specifically, NASA should work on:

  • exporting chunk manifests from Icechunk, so that manifests can be read back out of Icechunk stores and stored in another format;
  • Icechunk maintenance; and
  • Icechunk readers in other languages (C/C++, Julia, R, etc.).

Adopt GeoZarr standards

Adoption of GeoZarr is recommended to ensure interoperability with the developing GeoZarr ecosystem of tooling. GeoZarr is to Zarr what GeoTIFF is to TIFF: Zarr has no built-in concept of coordinate reference systems or multi-resolution data for use in tiling and overviews. GeoZarr defines conventions for these so that geospatial tools can work with Zarr stores without each tool implementing its own custom metadata interpretation. Adopting these standards in VDSs is straightforward because virtual data stores already use Zarr metadata.

Without standards, tooling must be customized per dataset. With a shared standard, generic tools become possible: GeoZarr enabled development of the deck.gl-zarr library, which renders any GeoZarr-compliant dataset in the browser.

The main alternatives are CF conventions (used by NetCDF/HDF5) and STAC (for discovery-level metadata). GeoZarr draws from CF conventions but is designed specifically for Zarr, making it the natural choice for Zarr-based virtual stores.

Leverage existing tools, services and available chunk metadata

To build virtual data stores efficiently, existing open-source tooling (Icechunk, Kerchunk, etc.) should be leveraged rather than implementing solutions from scratch. When a collection is consistent and OPeNDAP-supported, DMRPP metadata can be used instead of reading metadata from the source files. Reading DMRPP is faster than parsing the source files, and DMRPP can represent various archival formats.

For collections lacking DMRPP, fall back to a native metadata parser. DMRPP also has shortcomings that OPeNDAP should address by standardizing DMRPP across all collections, adding checksum validation at generation time, and possibly adopting a lighter schema or Parquet serialization.
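The selection rule described in this section can be written down as a small decision function. This is a minimal sketch under stated assumptions: the Collection fields and the function name choose_metadata_source are hypothetical, and in practice the availability and consistency flags would come from collection-level checks that this document does not specify.

```python
from dataclasses import dataclass


@dataclass
class Collection:
    """Hypothetical summary of a collection; the fields are illustrative only."""

    name: str
    has_dmrpp: bool      # DMRPP sidecar metadata is available for the granules
    is_consistent: bool  # granules share the same variables and layout


def choose_metadata_source(collection: Collection) -> str:
    """Prefer DMRPP when it is available and the collection is consistent.

    Otherwise fall back to parsing metadata from the source files.
    """
    if collection.has_dmrpp and collection.is_consistent:
        return "dmrpp"
    return "native"


# Example inputs (hypothetical collection names).
examples = [
    Collection("COLLECTION_A", has_dmrpp=True, is_consistent=True),
    Collection("COLLECTION_B", has_dmrpp=True, is_consistent=False),
    Collection("COLLECTION_C", has_dmrpp=False, is_consistent=True),
]
for c in examples:
    print(c.name, "->", choose_metadata_source(c))
```

Keeping the rule in one place makes the fallback explicit and gives a single point to extend if further checks, such as checksum validation, become available.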

Prioritize EGIS integration planning

Integration with the Earthdata Geographic Information System (EGIS) has been identified as a future priority. Scoping work should begin so that this integration proceeds smoothly.

Address governance gaps

The governance decisions identified in the Governance section (metadata placement standards, versioning policies, and stewardship ownership) should be addressed as virtual store technology is deployed more broadly across DAACs.

Streamline end-user experience

The authentication and credential complexity currently required to open a virtual store is a significant barrier to adoption. For virtual stores to see broad use, the path from Earthdata Login to an open xarray Dataset or DataTree should be reduced to a store identifier and authentication, comparable to the experience earthaccess already provides for direct file access.

Documentation and onboarding

Virtual store documentation is already underway — PO.DAAC’s cookbook chapter, ASDC’s demo notebook and Resources all represent ongoing efforts. The next step is consolidating and improving these materials to serve two distinct audiences: data providers virtualizing datasets, and data users accessing virtual stores.

  • Provider-facing documentation should develop the existing worked examples into reusable templates covering format-specific considerations, chunking decisions, and validation.
  • User-facing documentation should lower the barrier to working with virtual stores, particularly around authentication setup, available access patterns, and how virtual store access differs from traditional file-based workflows.

Coordinating these efforts across DAACs will reduce duplication and help establish consistent guidance as adoption grows.