# Virtual Stores at NASA
This page collects worked examples of virtual store implementations, serving as a reference for virtual layer producers. Each example documents the data type, chosen approach, design decisions, and lessons learned.
## Virtual Stores Inventory
| Project | Dataset | Dataset location | Notebooks and Repos |
|---|---|---|---|
| PO.DAAC | 10+ datasets | See Using Virtual Datasets | Virtual Data Set Starter Notebook |
| NSIDC | SMAP L4 | 10 years of SPL4SMGP — more coming soon | SPL4SMGP Notebook |
| VEDA | RASI | s3://nasa-waterinsight/virtual-zarr-store/icechunk/RASI/ | Github Repo |
| VEDA | NLDAS | s3://nasa-waterinsight/virtual-zarr-store/NLDAS-3-icechunk/ | Github Repo |
| VEDA | GEOS-CF | | |
| ASDC | TEMPO | s3://asdc-prod-public/virtual-reference-docs/TEMPO_L3_V04_icechunk_for_202601/ | Demo Notebook |
| ASDC | PREFIRE | s3://asdc-prod-public/virtual-reference-docs/PREFIRE_SAT2_3-SFC-SORTED-ALLSKY_R01/prefire-monthly-kerchunk_created20250819.json | Demo Notebook |
| ASF | OPERA_L3_DISP-S1 | See RelatedUrls with "Format": "Zarr" in C3294057315-ASF | Powers the OPERA Displacement Portal |
## Mature implementations
### Example 1: Consistently gridded data — PO.DAAC
- Data type: Gridded ocean/atmosphere fields
- Virtualization approach: Collection-level aggregated chunk manifests (Kerchunk)
- Status: Production
- Owners: Ed Armstrong and Dean Henze
Collection-level aggregated chunk manifests have been successfully implemented at PO.DAAC for consistently gridded Level 3 and Level 4 datasets. To date, a total of 10 PO.DAAC datasets have been virtualized, as detailed in the “Using Virtual Datasets” chapter of the PO.DAAC online cookbook.
One of the primary advantages of virtual data stores (VDS) is that users can interact with an entire logical dataset without any preprocessing. Jupyter notebooks and Python scripts can bypass the task of downloading, wrangling, and merging data from many individual files.
Take, for example, one of PO.DAAC’s gridded sea surface temperature (SST) datasets, OSTIA from the GHRSST project. It contains more than 15,000 daily files spanning a 41-year time series and totals over 11 TB. Ideally, a single line of code would access all the variables in the dataset as neatly organized multidimensional arrays. In the OSTIA example, the SST, SST uncertainty, and mask variables are three-dimensional, with dimensions latitude, longitude, and time. In a nutshell, the PO.DAAC VDSs provide seamless subsetting of desired regions and time ranges across the entire collection of files. The VDSs improve access and computation without requiring users to download massive datasets and operate on them file by file, as traditional science workflows often do.
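The “single line of code” experience rests on a chunk manifest that maps every logical chunk of the dataset to a byte range inside an original granule file. A minimal sketch of that idea in plain Python (the bucket, file names, offsets, and chunk keys below are illustrative, not PO.DAAC’s actual manifest):

```python
# Illustrative sketch of a Kerchunk-style chunk manifest: each key names one
# logical chunk of a variable, and each value points to the bytes that back it
# in an original granule. All URLs/offsets here are made up for illustration.
manifest = {
    "analysed_sst/0.0.0": ["s3://bucket/ostia/20000101.nc", 20480, 1048576],
    "analysed_sst/1.0.0": ["s3://bucket/ostia/20000102.nc", 20480, 1048576],
    "analysed_sst/2.0.0": ["s3://bucket/ostia/20000103.nc", 20480, 1048576],
}

def resolve_chunk(manifest, key):
    """Return (url, offset, length) for a chunk key, as a Zarr reader would."""
    url, offset, length = manifest[key]
    return url, offset, length

# A reader asking for day 2 of the SST time series fetches a single byte range
# from one file, with no download or merge step on the user's side.
url, offset, length = resolve_chunk(manifest, "analysed_sst/1.0.0")
```

A real manifest holds one such entry per chunk across all >15,000 files, which is why a reader can subset any region and time range without touching files outside the request.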
The PO.DAAC VDS implementation model represents a mature approach and is recommended as a baseline implementation for similar data types across other DAACs.
### Example 2: OPERA displacement portal — ASF
- Data type: Framed L3 radar line-of-sight displacements
- Virtualization approach: Granule chunk manifests (Kerchunk) and frame-level chunk manifests (Kerchunk)
- Status: Production
- Owners: Kim Fairbanks (technical POC), Kathleen Kristenson (portal), Cassandra Wagner (Kerchunk ingest), Luca Cinquini (OPERA project)
The OPERA DISP-S1 products provide surface deformation time series over much of North and Central America from 2016 onward. These Level-3 products are derived from Sentinel-1 synthetic aperture radar (SAR) data and are a powerful tool for identifying areas of subsidence and uplift. Many use cases for these products are time-series-based: users want to understand how an area deforms through time. To meet this need, the OPERA project and ASF DAAC developed the OPERA Displacement Portal, which plots the short-wavelength displacement time series for any point location.
OPERA DISP-S1 granules are provided in netCDF4 format and “framed” such that they have a consistent (geostationary) spatial extent over time. New granules are created following the same 12-day revisit cycle of the Sentinel-1 mission, so a single OPERA frame corresponds to a time-series “stack” of granules. That is, searching the DISP-S1 collection by frame number or by a frame’s spatial extent yields the same stack of granules. These granules are combined into a kerchunk-based, frame-level virtual Zarr store, with just the short-wavelength displacement variable from each granule, to enable rapid access to the displacement time-series.
During ingest, ASF DAAC creates a Kerchunk-based virtual Zarr store for each granule. The frame-level virtual Zarr store is also recreated from the entire stack of granules, adding the new granule’s displacement data to the store. Links to both the per-granule and frame-level Zarr stores are included in the granule’s UMM metadata as related URLs, making them discoverable through standard NASA Earthdata discovery and access tools.
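The granule-to-frame aggregation described above can be modeled in plain Python: each per-granule reference set covers a 2-D displacement field, and the frame-level store re-keys those chunks with a leading time index. This is a sketch of the pattern, not ASF’s actual ingest code, and the variable name, files, and offsets are illustrative:

```python
# Illustrative sketch of combining per-granule chunk references into a
# frame-level manifest along a new time dimension. Each granule manifest maps
# 'variable/<row>.<col>' -> [url, offset, length]; granules are in time order.
def combine_granules(granule_manifests):
    frame_manifest = {}
    for t, refs in enumerate(granule_manifests):
        for key, ref in refs.items():
            var, chunk = key.split("/")
            # Prepend the granule's position in the stack as the time index.
            frame_manifest[f"{var}/{t}.{chunk}"] = ref
    return frame_manifest

# Two hypothetical granules from the same frame's time-series stack:
g0 = {"short_wavelength_displacement/0.0": ["s3://b/F11111_t0.nc", 0, 4096]}
g1 = {"short_wavelength_displacement/0.0": ["s3://b/F11111_t1.nc", 0, 4096]}
frame = combine_granules([g0, g1])
```

After combining, the frame-level manifest exposes a single 3-D (time, row, col) array whose chunks still live in the original netCDF4 granules.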
The ASF virtual Zarr store implementation model represents a mature approach and is recommended as a baseline implementation for similar dataset types across other DAACs. This model is easily adapted to any framed dataset where users would benefit from a time-series view of the data.
## Developing implementations
### Example 3: NISAR Geocoded Synthetic Aperture Radar products — ASF
- Data type: Geocoded NISAR synthetic aperture radar products
- Virtualization approach: Granule- and frame-level aggregated chunk manifests (Icechunk)
- Status: In development
- Owner: Joseph H. Kennedy (Prototype)
ASF is looking to develop virtual Zarr stores for time-series analyses of all L2+ geocoded NISAR products. Building on ASF’s experience with the OPERA Displacement Portal (described above) and Joseph’s experience on the ITS_LIVE project (which uses Zarr datacubes to power a time-series portal), ASF plans to provide both granule-level and frame-level virtual Zarr stores.
There are, however, some likely changes to how ASF approaches this compared to how the OPERA Displacement Portal was developed. Primarily, ASF plans to use Icechunk as the virtualization method because it supports appending. Kerchunk does not, so for OPERA DISP-S1 the frame-level virtual Zarr store must be recreated each time a new granule’s data is added to the store — a resource-intensive operation whose cost increases as the archive grows.
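The cost difference is easy to quantify with a back-of-envelope model. If ingesting the i-th granule under a rebuild strategy requires re-reading all i granules accumulated so far, the total work to ingest N granules is quadratic, while an appendable store only ever touches the new granule:

```python
# Back-of-envelope model (not a measured benchmark) of cumulative work, in
# granule-reads, to ingest n granules into a frame-level store.

def rebuild_cost(n):
    # Rebuild-on-every-append (Kerchunk): the i-th ingest re-reads all i
    # granules in the stack, so the total is 1 + 2 + ... + n.
    return sum(range(1, n + 1))

def append_cost(n):
    # Appendable store (Icechunk): each ingest reads only the new granule.
    return n

# For a 1000-granule archive: 500500 granule-reads vs 1000 — a 500x gap
# that keeps widening as the archive grows.
n = 1000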
Like OPERA DISP-S1, NISAR data products are framed such that they have a consistent (geostationary) spatial extent over time. The map of all NISAR frames is static; the frame geometries do not shift over time. Instead of providing links to the frame-level virtual stores in every NISAR granule’s CMR metadata, which would require substantial updates to the mission processing and DAAC ingest systems, ASF plans to provide a single, static NISAR frame collection in CMR. Each “granule” in this collection would describe a single NISAR frame: it would include spatial metadata describing the frame’s footprint, and its data links would point to the virtual Zarr stores created from time-series stacks of NISAR products. This would let users run a fast spatial search for an area of interest and easily load the entire time series for the product(s) they are interested in.
There are a few benefits to a secondary collection, compared with integrating links into the product granule metadata as was done for OPERA DISP-S1:
- For users, this collection can provide a single entrypoint into all L2+ NISAR data, reducing the number of things to search through by orders of magnitude. This improves access speed and reduces the users’ cognitive load.
- For an active mission like NISAR, this imposes no work on the mission processing system or the ingest pipeline to accommodate metadata changes.
- The CMR collection metadata provides a place to describe the virtualization approach in a standard way that’s consumable by downstream tools.
- Downstream services like Earthdata Search and ASF’s Vertex can better visually represent the spatial and temporal coverage of the data and provide a way to interact directly with the virtual Zarr stores.
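A frame “granule” in the proposed static collection might carry metadata along these lines. The field names are modeled loosely on UMM-G conventions, and the frame ID, footprint, and bucket path are all hypothetical — the only detail taken from the source is the `"Format": "Zarr"` RelatedUrl pattern already used for OPERA DISP-S1 in C3294057315-ASF:

```python
# Hypothetical sketch of a frame "granule" record in the proposed static NISAR
# frame collection. Every value below is invented for illustration.
frame_granule = {
    "GranuleUR": "NISAR_FRAME_F00042",
    "SpatialExtent": {
        # The frame footprint is static, so this never needs updating.
        "BoundingRectangle": {
            "WestBoundingCoordinate": -122.5,
            "EastBoundingCoordinate": -121.3,
            "NorthBoundingCoordinate": 38.2,
            "SouthBoundingCoordinate": 37.1,
        }
    },
    "RelatedUrls": [
        {
            # Data link points at the frame's time-series virtual store,
            # not at any individual product granule.
            "URL": "s3://example-bucket/nisar/frames/F00042/gslc.icechunk/",
            "Type": "GET DATA",
            "Format": "Zarr",  # same discovery pattern as OPERA DISP-S1
        }
    ],
}
```

A spatial search over this collection returns one record per intersecting frame, and each record hands the user the whole time series directly.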
### Example 4: Generating and Distributing Manifest Files at GES DISC
- Data type: Level 3/4 Products
- Virtualization approach: Kerchunk Manifest Files
- Status: In development
- Owners: Hailiang Zhang, Christine Smit, Chris Battisto
GES DISC is working to provide users with virtual Zarr services as part of its effort to deprecate its on-prem aggregation services. The current plan is to enable virtual Zarr support for the 10 most popular collections previously served through those services. GES DISC benchmarked manifest generation using a representative MERRA-2 dataset (M2T1NXAER), which contains approximately 4.3 TB of data; running the generation on a cloud-based Dask cluster took about 5 hours in total. GES DISC also evaluated data subsetting performance: Parquet-based Kerchunk manifests delivered roughly 10× better performance than JSON-based Kerchunk manifests. However, Parquet-based approaches have limitations, such as lack of support for incremental data appending, and the VirtualiZarr team appears to be moving toward Icechunk as a longer-term solution. As a short-term measure, GES DISC is generating Parquet Kerchunk manifests for historical, static datasets, with plans to transition to Icechunk in the future for improved flexibility.
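One intuition for the Parquet advantage can be shown with the standard library alone: JSON stores every offset and length as decimal text and repeats structural characters per entry, while a columnar binary layout (as Parquet uses) packs them into fixed-width integers. This toy comparison only models the numeric columns — it is not GES DISC’s benchmark, and a real Parquet file would also dictionary-encode the repetitive URLs:

```python
import json
import struct

# Synthetic reference set shaped like a Kerchunk manifest: 10,000 chunks,
# each with an (illustrative) URL, byte offset, and length.
refs = {
    f"var/{i}.0.0": [f"s3://bucket/file_{i:05d}.nc4", 8192 + i, 1048576]
    for i in range(10_000)
}

# JSON: everything serialized as text, keys and punctuation repeated per entry.
json_size = len(json.dumps(refs).encode())

# Columnar binary: pack just the (offset, length) columns as little-endian
# 64-bit integers — 16 bytes per chunk, regardless of magnitude.
packed = b"".join(struct.pack("<qq", off, ln) for _, off, ln in refs.values())
binary_size = len(packed)

# binary_size is a small fraction of json_size even before URL encoding,
# which is part of why Parquet-backed references load so much faster.
```

Faster parsing of a smaller, typed representation translates directly into the subsetting speedup GES DISC measured.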
### Example 5: Air quality satellite data (TEMPO) — ASDC
- Data type: Air quality satellite data (TEMPO L3 NO₂)
- Virtualization approach: Icechunk in S3
- Status: Proof of concept
- Owner: ASDC
A month of TEMPO L3 V04 NO₂ granules (January 2026, ~30 GB) was consolidated into a single Icechunk store (~6 MB reference layer). The store was built locally in four incremental append-and-commit steps, then manually uploaded to S3 (asdc-prod-public bucket). The demo notebook authenticates via earthaccess, opens the store read-only, and loads the full month as a single xarray Dataset — producing cartopy map visualizations of tropospheric NO₂ columns using both index slicing and coordinate-based selection.
Next steps:
- Direct-to-S3 appending. The store is currently built locally and uploaded as a static snapshot. The next step is appending directly to a store in S3 so new granules can be committed incrementally without a local intermediate — enabling a continuously growing, cloud-hosted datacube.
- Streamlined end-user experience. Opening the store currently requires manual S3 configuration, credential helper classes, and multiple Icechunk API calls. This needs to be reduced to an Earthdata Login and a store identifier.
- Preserve netCDF group hierarchy. The current process flattens the native netCDF group structure (/geolocation, /product, /support_data) into a single level. The next step is to use xarray DataTree to preserve the hierarchy in the Icechunk store.
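The flattening problem above has a simple shape: variables whose names carry a group prefix collapse into one namespace, and the goal is to keep (or restore) the nesting, which is what an xarray DataTree represents. This dict-based sketch illustrates the structure involved — the variable names are examples, and real code would operate on xarray objects rather than dicts:

```python
# Illustrative sketch of restoring a flattened TEMPO-like namespace into a
# nested, DataTree-style hierarchy. Variable names are examples only.
def unflatten(flat):
    """Nest 'group/variable' keys back into a dict-of-dicts hierarchy."""
    tree = {}
    for key, value in flat.items():
        group, _, name = key.rpartition("/")
        # Keys with no group prefix land in the root group "/".
        tree.setdefault(group or "/", {})[name] = value
    return tree

flat = {
    "geolocation/latitude": "...",
    "geolocation/longitude": "...",
    "product/vertical_column_troposphere": "...",
    "support_data/eff_cloud_fraction": "...",
}
tree = unflatten(flat)
# tree["geolocation"] holds the coordinates, tree["product"] the NO2 column,
# mirroring the granules' native netCDF group layout.
```

Preserving this structure in the Icechunk store means users see the same group paths they know from the source netCDF files.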
### Example 6: Surface emissivity (PREFIRE) — ASDC
- Data type: Surface emissivity
- Virtualization approach: Kerchunk in S3
- Status: Proof of concept
- Owner: ASDC
A Kerchunk proof of concept provides virtualized access to monthly-aggregated PREFIRE surface emissivity data. A demo notebook is available. Kerchunk was chosen for this proof of concept to explore and develop virtualization best practices, since Kerchunk’s JSON references are straightforward to inspect and debug. The next step is to build an Icechunk store in S3 for the same dataset and compare the user experience and performance of the two approaches.
### Example 7: SMAP L4 soil moisture — NSIDC
- Data type: Soil moisture
- Virtualization approach: Kerchunk Parquet in S3
- Status: Proof of concept
- Owner: NSIDC
We assembled 10 years of SPL4SMGP, a global Level-4 soil moisture product from the SMAP mission (3-hourly, 9 km resolution), totalling 40 TB. During this process, we corrected source-file metadata issues within the virtual store by properly assigning coordinates and values to the time dimension. By integrating earthaccess and leveraging DMR++ files, we accelerated the creation of this virtual store by an order of magnitude. This prototype is the first of potentially many virtual data cubes at NSIDC DAAC.
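“Properly assigning coordinates and values to the time dimension” amounts to generating one timestamp per 3-hourly step and attaching that sequence as the store’s time coordinate. A minimal sketch of the generation step, using the standard library (this models the idea, not NSIDC’s actual fix, which operates on the virtual store itself):

```python
from datetime import datetime, timedelta

# Illustrative sketch of building a 3-hourly time coordinate so the virtual
# store's time dimension matches the SPL4SMGP granule cadence.
def time_coordinate(start, n_steps, step_hours=3):
    """Return one timestamp per step, starting at `start`."""
    return [start + timedelta(hours=step_hours * i) for i in range(n_steps)]

# One 365-day year of 3-hourly steps: 365 * 8 = 2920 timestamps.
times = time_coordinate(datetime(2015, 1, 1), 365 * 8)
```

With correct values in place, time-based selection across the full 10-year, 40 TB stack works the same as on any well-formed dataset.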
### Example 8: L2 Collections at PO.DAAC
Virtualized L2 collections are currently under development at PO.DAAC.