# Virtual Stores at NASA
This page collects worked examples of virtual store implementations, serving as a reference for virtual layer producers. Each example documents the data type, chosen approach, design decisions, and lessons learned.
## Virtual Stores Inventory
| Project | Dataset | Dataset location | Notebooks and Repos |
|---|---|---|---|
| PO.DAAC | 10+ datasets | See Using Virtual Datasets | Virtual Data Set Starter Notebook |
| NSIDC | SMAP L4 | 10 years of SPL4SMGP — more coming soon | SPL4SMGP Notebook |
| VEDA | RASI | s3://nasa-waterinsight/virtual-zarr-store/icechunk/RASI/ | Github Repo |
| VEDA | NLDAS | s3://nasa-waterinsight/virtual-zarr-store/NLDAS-3-icechunk/ | Github Repo |
| VEDA | GEOS-CF | | |
| ASDC | TEMPO | s3://asdc-prod-public/virtual-reference-docs/TEMPO_L3_V04_icechunk_for_202601/ | Demo Notebook |
| ASDC | PREFIRE | s3://asdc-prod-public/virtual-reference-docs/PREFIRE_SAT2_3-SFC-SORTED-ALLSKY_R01/prefire-monthly-kerchunk_created20250819.json | Demo Notebook |
| ASF | OPERA_L3_DISP-S1 | See RelatedUrls with "Format": "Zarr" in C3294057315-ASF | Powers the OPERA Displacement Portal |
## Mature implementations
### Example 1: Consistently gridded data — PO.DAAC
- Data type: Gridded ocean/atmosphere fields
- Virtualization approach: Collection-level aggregated chunk manifests (Kerchunk)
- Status: Production
- Owners: Ed Armstrong and Dean Henze
Collection-level aggregated chunk manifests have been successfully implemented at PO.DAAC for consistently gridded Level 3 and Level 4 datasets. To date, a total of 10 PO.DAAC datasets have been virtualized, as detailed in the “Using Virtual Datasets” chapter of the PO.DAAC online cookbook.
One of the primary advantages of virtual data stores (VDS) is that users can interact with an entire logical dataset without any preprocessing. Jupyter notebooks and Python scripts can bypass the task of downloading, wrangling, and merging data from many individual files.
Take, for example, one of PO.DAAC’s gridded sea surface temperature (SST) datasets, OSTIA from the GHRSST project. It contains more than 15,000 daily files spanning a 41-year time series and totals over 11 TB. Ideally, a single line of code would access all the variables in the dataset as neatly organized multidimensional arrays. In the OSTIA example, the SST, SST uncertainty, and mask variables are three-dimensional, with dimensions latitude, longitude, and time. In a nutshell, the PO.DAAC VDSs provide seamless subsetting of desired regions and time ranges across the entire collection of files. The VDSs improve access and computation without requiring users to download massive datasets and operate on them file by file, as traditional science workflows often do.
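The “single line of code” experience rests on a chunk manifest that maps every logical chunk of the dataset to a byte range inside an original granule file. A minimal sketch of that idea in plain Python (the bucket, file names, offsets, and chunk keys below are illustrative, not PO.DAAC’s actual manifest):

```python
# Illustrative sketch of a Kerchunk-style chunk manifest: each key names one
# logical chunk of a variable, and each value points to the bytes that back it
# in an original granule. All URLs/offsets here are made up for illustration.
manifest = {
    "analysed_sst/0.0.0": ["s3://bucket/ostia/20000101.nc", 20480, 1048576],
    "analysed_sst/1.0.0": ["s3://bucket/ostia/20000102.nc", 20480, 1048576],
    "analysed_sst/2.0.0": ["s3://bucket/ostia/20000103.nc", 20480, 1048576],
}

def resolve_chunk(manifest, key):
    """Return (url, offset, length) for a chunk key, as a Zarr reader would."""
    url, offset, length = manifest[key]
    return url, offset, length

# A reader asking for day 2 of the SST time series fetches a single byte range
# from one file, with no download or merge step on the user's side.
url, offset, length = resolve_chunk(manifest, "analysed_sst/1.0.0")
```

A real manifest holds one such entry per chunk across all >15,000 files, which is why a reader can subset any region and time range without touching files outside the request.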
The PO.DAAC VDS implementation model represents a mature approach and is recommended as a baseline implementation for similar data types across other DAACs.
### Example 2: OPERA displacement portal — ASF
- Data type: Framed L3 radar line-of-sight displacements
- Virtualization approach: Granule chunk manifests (Kerchunk) and frame-level chunk manifests (Kerchunk)
- Status: Production
- Owners: Kim Fairbanks (technical POC), Kathleen Kristenson (portal), Cassandra Wagner (Kerchunk ingest), Luca Cinquini (OPERA project)
The OPERA DISP-S1 products provide surface deformation time series over much of North and Central America from 2016 onward. These Level-3 products are derived from Sentinel-1 synthetic aperture radar (SAR) data and are a powerful tool for identifying areas of subsidence and uplift. Many use cases for these products are time-series-based: users want to understand how an area deforms through time. To meet this need, the OPERA project and ASF DAAC developed the OPERA Displacement Portal, which plots the short-wavelength displacement time series for any point location.
OPERA DISP-S1 granules are provided in netCDF4 format and “framed” such that they have a consistent (geostationary) spatial extent over time. New granules are created following the same 12-day revisit cycle of the Sentinel-1 mission, so a single OPERA frame corresponds to a time-series “stack” of granules. That is, searching the DISP-S1 collection by frame number or by a frame’s spatial extent yields the same stack of granules. These granules are combined into a kerchunk-based, frame-level virtual Zarr store, with just the short-wavelength displacement variable from each granule, to enable rapid access to the displacement time-series.
During ingest, ASF DAAC creates a Kerchunk-based virtual Zarr store for each granule. The frame-level virtual Zarr store is also recreated from the entire stack of granules, adding the new granule’s displacement data to the store. Links to both the per-granule and frame-level Zarr stores are included in the granule’s UMM metadata as related URLs, making them discoverable through standard NASA Earthdata discovery and access tools.
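The granule-to-frame aggregation described above can be modeled in plain Python: each per-granule reference set covers a 2-D displacement field, and the frame-level store re-keys those chunks with a leading time index. This is a sketch of the pattern, not ASF’s actual ingest code, and the variable name, files, and offsets are illustrative:

```python
# Illustrative sketch of combining per-granule chunk references into a
# frame-level manifest along a new time dimension. Each granule manifest maps
# 'variable/<row>.<col>' -> [url, offset, length]; granules are in time order.
def combine_granules(granule_manifests):
    frame_manifest = {}
    for t, refs in enumerate(granule_manifests):
        for key, ref in refs.items():
            var, chunk = key.split("/")
            # Prepend the granule's position in the stack as the time index.
            frame_manifest[f"{var}/{t}.{chunk}"] = ref
    return frame_manifest

# Two hypothetical granules from the same frame's time-series stack:
g0 = {"short_wavelength_displacement/0.0": ["s3://b/F11111_t0.nc", 0, 4096]}
g1 = {"short_wavelength_displacement/0.0": ["s3://b/F11111_t1.nc", 0, 4096]}
frame = combine_granules([g0, g1])
```

After combining, the frame-level manifest exposes a single 3-D (time, row, col) array whose chunks still live in the original netCDF4 granules.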
The ASF virtual Zarr store implementation model represents a mature approach and is recommended as a baseline implementation for similar dataset types across other DAACs. This model is easily adapted to any framed dataset where users would benefit from a time-series view of the data.
## Developing implementations
### Example 3: NISAR Geocoded Synthetic Aperture Radar products — ASF
- Data type: Geocoded NISAR synthetic aperture radar products
- Virtualization approach: Granule- and frame-level aggregated chunk manifests (Icechunk)
- Status: In development
- Owner: Joseph H. Kennedy (Prototype)
ASF is looking to develop virtual Zarr stores for time-series analyses of all L2+ geocoded NISAR products. Building on ASF’s experience with the OPERA Displacement Portal (described above) and Joseph’s experience on the ITS_LIVE project (which uses Zarr datacubes to power a time-series portal), ASF plans to provide both granule-level and frame-level virtual Zarr stores.
There are, however, some likely changes to how ASF approaches this compared to how the OPERA Displacement Portal was developed. Primarily, ASF plans to use Icechunk as the virtualization method because it supports appending. Kerchunk does not, so for OPERA DISP-S1 the frame-level virtual Zarr store must be recreated each time a new granule’s data is added to the store — a resource-intensive operation whose cost increases as the archive grows.
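The cost difference is easy to quantify with a back-of-envelope model. If ingesting the i-th granule under a rebuild strategy requires re-reading all i granules accumulated so far, the total work to ingest N granules is quadratic, while an appendable store only ever touches the new granule:

```python
# Back-of-envelope model (not a measured benchmark) of cumulative work, in
# granule-reads, to ingest n granules into a frame-level store.

def rebuild_cost(n):
    # Rebuild-on-every-append (Kerchunk): the i-th ingest re-reads all i
    # granules in the stack, so the total is 1 + 2 + ... + n.
    return sum(range(1, n + 1))

def append_cost(n):
    # Appendable store (Icechunk): each ingest reads only the new granule.
    return n

# For a 1000-granule archive: 500500 granule-reads vs 1000 — a 500x gap
# that keeps widening as the archive grows.
n = 1000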
Like OPERA DISP-S1, NISAR data products are framed such that they have a consistent (geostationary) spatial extent over time. The map of all NISAR frames is static; the frame geometries do not shift over time. Instead of providing links to the frame-level virtual stores in every NISAR granule’s CMR metadata, which would require substantial updates to the mission processing and DAAC ingest systems, ASF plans to provide a single, static NISAR frame collection in CMR. Each “granule” in this collection would describe a single NISAR frame: it would include spatial metadata describing the frame’s footprint, and its data links would point to the virtual Zarr stores created from time-series stacks of NISAR products. This would let users run a fast spatial search for an area of interest and easily load the entire time series for the product(s) they are interested in.
There are a few benefits to a secondary collection, compared with integrating links into the product granule metadata as was done for OPERA DISP-S1:
- For users, this collection can provide a single entrypoint into all L2+ NISAR data, reducing the number of things to search through by orders of magnitude. This improves access speed and reduces the users’ cognitive load.
- For an active mission like NISAR, this imposes no work on the mission processing system or the ingest pipeline to accommodate metadata changes.
- The CMR collection metadata provides a place to describe the virtualization approach in a standard way that’s consumable by downstream tools.
- Downstream services like Earthdata Search and ASF’s Vertex can better visually represent the spatial and temporal coverage of the data and provide a way to interact directly with the virtual Zarr stores.
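A frame “granule” in the proposed static collection might carry metadata along these lines. The field names are modeled loosely on UMM-G conventions, and the frame ID, footprint, and bucket path are all hypothetical — the only detail taken from the source is the `"Format": "Zarr"` RelatedUrl pattern already used for OPERA DISP-S1 in C3294057315-ASF:

```python
# Hypothetical sketch of a frame "granule" record in the proposed static NISAR
# frame collection. Every value below is invented for illustration.
frame_granule = {
    "GranuleUR": "NISAR_FRAME_F00042",
    "SpatialExtent": {
        # The frame footprint is static, so this never needs updating.
        "BoundingRectangle": {
            "WestBoundingCoordinate": -122.5,
            "EastBoundingCoordinate": -121.3,
            "NorthBoundingCoordinate": 38.2,
            "SouthBoundingCoordinate": 37.1,
        }
    },
    "RelatedUrls": [
        {
            # Data link points at the frame's time-series virtual store,
            # not at any individual product granule.
            "URL": "s3://example-bucket/nisar/frames/F00042/gslc.icechunk/",
            "Type": "GET DATA",
            "Format": "Zarr",  # same discovery pattern as OPERA DISP-S1
        }
    ],
}
```

A spatial search over this collection returns one record per intersecting frame, and each record hands the user the whole time series directly.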
### Example 4: Generating and Distributing Manifest Files at GES DISC
- Data type: Level 3/4 Products
- Virtualization approach: Kerchunk Manifest Files
- Status: In development
- Owners: Hailiang Zhang, Christine Smit, Chris Battisto
GES DISC is working to provide users with virtual Zarr services as part of its effort to deprecate its on-prem aggregation services. The current plan is to enable virtual Zarr support for the 10 most popular collections previously served through those services. GES DISC benchmarked manifest generation using a representative MERRA-2 dataset (M2T1NXAER), which contains approximately 4.3 TB of data; running the generation on a cloud-based Dask cluster took about 5 hours in total. GES DISC also evaluated data subsetting performance: Parquet-based Kerchunk manifests delivered roughly 10× better performance than JSON-based Kerchunk manifests. However, Parquet-based approaches have limitations, such as lack of support for incremental data appending, and the VirtualiZarr team appears to be moving toward Icechunk as a longer-term solution. As a short-term measure, GES DISC is generating Parquet Kerchunk manifests for historical, static datasets, with plans to transition to Icechunk in the future for improved flexibility.
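One intuition for the Parquet advantage can be shown with the standard library alone: JSON stores every offset and length as decimal text and repeats structural characters per entry, while a columnar binary layout (as Parquet uses) packs them into fixed-width integers. This toy comparison only models the numeric columns — it is not GES DISC’s benchmark, and a real Parquet file would also dictionary-encode the repetitive URLs:

```python
import json
import struct

# Synthetic reference set shaped like a Kerchunk manifest: 10,000 chunks,
# each with an (illustrative) URL, byte offset, and length.
refs = {
    f"var/{i}.0.0": [f"s3://bucket/file_{i:05d}.nc4", 8192 + i, 1048576]
    for i in range(10_000)
}

# JSON: everything serialized as text, keys and punctuation repeated per entry.
json_size = len(json.dumps(refs).encode())

# Columnar binary: pack just the (offset, length) columns as little-endian
# 64-bit integers — 16 bytes per chunk, regardless of magnitude.
packed = b"".join(struct.pack("<qq", off, ln) for _, off, ln in refs.values())
binary_size = len(packed)

# binary_size is a small fraction of json_size even before URL encoding,
# which is part of why Parquet-backed references load so much faster.
```

Faster parsing of a smaller, typed representation translates directly into the subsetting speedup GES DISC measured.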
### Example 5: Air quality satellite data (TEMPO) — ASDC
- Data type: Air quality satellite data (TEMPO L3 NO₂)
- Virtualization approach: Icechunk in S3
- Status: Proof of concept
- Owner: ASDC
A month of TEMPO L3 V04 NO₂ granules (January 2026, ~30 GB) was consolidated into a single Icechunk store (~6 MB reference layer). The store was built locally in four incremental append-and-commit steps, then manually uploaded to S3 (asdc-prod-public bucket). The demo notebook authenticates via earthaccess, opens the store read-only, and loads the full month as a single xarray Dataset — producing cartopy map visualizations of tropospheric NO₂ columns using both index slicing and coordinate-based selection.
Next steps:
- Direct-to-S3 appending. The store is currently built locally and uploaded as a static snapshot. The next step is appending directly to a store in S3 so new granules can be committed incrementally without a local intermediate — enabling a continuously growing, cloud-hosted datacube.
- Streamlined end-user experience. Opening the store currently requires manual S3 configuration, credential helper classes, and multiple Icechunk API calls. This needs to be reduced to an Earthdata Login and a store identifier.
- Preserve netCDF group hierarchy. The current process flattens the native netCDF group structure (/geolocation, /product, /support_data) into a single level. The next step is to use xarray DataTree to preserve the hierarchy in the Icechunk store.
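The flattening problem above has a simple shape: variables whose names carry a group prefix collapse into one namespace, and the goal is to keep (or restore) the nesting, which is what an xarray DataTree represents. This dict-based sketch illustrates the structure involved — the variable names are examples, and real code would operate on xarray objects rather than dicts:

```python
# Illustrative sketch of restoring a flattened TEMPO-like namespace into a
# nested, DataTree-style hierarchy. Variable names are examples only.
def unflatten(flat):
    """Nest 'group/variable' keys back into a dict-of-dicts hierarchy."""
    tree = {}
    for key, value in flat.items():
        group, _, name = key.rpartition("/")
        # Keys with no group prefix land in the root group "/".
        tree.setdefault(group or "/", {})[name] = value
    return tree

flat = {
    "geolocation/latitude": "...",
    "geolocation/longitude": "...",
    "product/vertical_column_troposphere": "...",
    "support_data/eff_cloud_fraction": "...",
}
tree = unflatten(flat)
# tree["geolocation"] holds the coordinates, tree["product"] the NO2 column,
# mirroring the granules' native netCDF group layout.
```

Preserving this structure in the Icechunk store means users see the same group paths they know from the source netCDF files.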
### Example 6: Surface emissivity (PREFIRE) — ASDC
- Data type: Surface emissivity
- Virtualization approach: Kerchunk in S3
- Status: Proof of concept
- Owner: ASDC
A Kerchunk proof of concept provides virtualized access to monthly-aggregated PREFIRE surface emissivity data. A demo notebook is available. Kerchunk was chosen for this proof of concept to explore and develop virtualization best practices, since Kerchunk’s JSON references are straightforward to inspect and debug. The next step is to build an Icechunk store in S3 for the same dataset and compare the user experience and performance of the two approaches.
### Example 7: SMAP L4 soil moisture — NSIDC
- Data type: Soil moisture
- Virtualization approach: Kerchunk Parquet in S3
- Status: Proof of concept
- Owner: NSIDC
We assembled 10 years of SPL4SMGP, a global Level-4 soil moisture product from the SMAP mission (3-hourly, 9 km resolution), totalling 40 TB. During this process, we corrected source-file metadata issues within the virtual store by properly assigning coordinates and values to the time dimension. By integrating earthaccess and leveraging DMR++ files, we accelerated the creation of this virtual store by an order of magnitude. This prototype is the first of potentially many virtual data cubes at NSIDC DAAC.
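“Properly assigning coordinates and values to the time dimension” amounts to generating one timestamp per 3-hourly step and attaching that sequence as the store’s time coordinate. A minimal sketch of the generation step, using the standard library (this models the idea, not NSIDC’s actual fix, which operates on the virtual store itself):

```python
from datetime import datetime, timedelta

# Illustrative sketch of building a 3-hourly time coordinate so the virtual
# store's time dimension matches the SPL4SMGP granule cadence.
def time_coordinate(start, n_steps, step_hours=3):
    """Return one timestamp per step, starting at `start`."""
    return [start + timedelta(hours=step_hours * i) for i in range(n_steps)]

# One 365-day year of 3-hourly steps: 365 * 8 = 2920 timestamps.
times = time_coordinate(datetime(2015, 1, 1), 365 * 8)
```

With correct values in place, time-based selection across the full 10-year, 40 TB stack works the same as on any well-formed dataset.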
### Example 8: L2 Collections at PO.DAAC
Virtualized L2 collections are currently under development at PO.DAAC.