Known Limitations
Early structural decisions have lasting performance consequences
Virtual data stores (VDS) depend on the chunking of the underlying files. A file’s internal chunk structure, and consequently its chunk manifest, cannot be optimized for all use cases at once. Chunk shape and size directly determine which access patterns are fast and which are slow, so early structural decisions benefit some access patterns and disadvantage others. There is no universally optimal chunking, only a least-worst one for the most common use case.
Chunk size
Chunks that are too small require many HTTP requests and add per-chunk decompression overhead. Chunks that are too large transfer more data than a request actually needs.
For a more thorough explanation, see Datacube Guide: Tiny data chunks and Datacube Guide: Massive data chunks.
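The trade-off can be sketched with some back-of-the-envelope arithmetic. The helper below is a hypothetical illustration, not part of any library: it assumes a one-dimensional array of 4-byte values, a read that starts on a chunk boundary, and uncompressed chunks.

```python
import math


def read_cost(request_len, chunk_len, bytes_per_element=4):
    """Estimate the cost of reading `request_len` contiguous elements.

    Assumes a 1-D array stored in chunks of `chunk_len` elements, a read
    aligned to the start of a chunk, and no compression.
    Returns (number of requests, bytes transferred, over-fetch ratio).
    """
    if request_len <= 0 or chunk_len <= 0:
        raise ValueError("lengths must be positive")
    n_requests = math.ceil(request_len / chunk_len)   # one request per chunk
    bytes_transferred = n_requests * chunk_len * bytes_per_element
    bytes_needed = request_len * bytes_per_element
    return n_requests, bytes_transferred, bytes_transferred / bytes_needed


# The same 1,000-element read against two extreme chunk sizes.
tiny = read_cost(request_len=1_000, chunk_len=10)
huge = read_cost(request_len=1_000, chunk_len=1_000_000)
print(tiny)  # (100, 4000, 1.0): many requests, nothing wasted
print(huge)  # (1, 4000000, 1000.0): one request, 1000x more data than needed
```

Real costs also depend on compression, request latency, and caching, so these numbers only indicate the direction of the trade-off.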
Chunk shape
If a set of files is chunked to optimize spatial access, it cannot simultaneously be optimized for access along the time dimension (i.e., time series).
A useful analogy: pancakes vs. churros.
- A pancake chunk holds the full spatial extent at one timestep. Loading a global snapshot is fast because it’s all in one chunk. Time series are slow because each timestep is stored in a separate chunk.
- A churro chunk holds many timesteps for a small spatial location. Time series are fast for a spatial subset, but global views are slow.
This is a real problem in practice: many datasets store one file per timestep, which makes data collection straightforward but is not optimized for time series access.
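To make the pancake and churro comparison concrete, the sketch below counts how many chunks each access pattern touches. The array shape and both chunk shapes are made-up values chosen for illustration (one year of daily data on a 0.25-degree global grid), and selections are assumed to start at the array origin.

```python
import math


def chunks_touched(selection, chunk_shape):
    """Count chunks overlapping a selection that starts at the array origin.

    Both arguments are (time, lat, lon) lengths in elements.
    """
    return math.prod(math.ceil(s / c) for s, c in zip(selection, chunk_shape))


# Hypothetical array: 365 timesteps on a 720 x 1440 grid.
pancake = (1, 720, 1440)   # full spatial extent, one timestep per chunk
churro = (365, 10, 10)     # all timesteps, small spatial footprint per chunk

snapshot = (1, 720, 1440)  # one global map
time_series = (365, 1, 1)  # one pixel through time

print(chunks_touched(snapshot, pancake))     # 1
print(chunks_touched(time_series, pancake))  # 365
print(chunks_touched(snapshot, churro))      # 10368
print(chunks_touched(time_series, churro))   # 1
```

Each chunk shape reduces one access pattern to a single read and turns the other into hundreds or thousands of reads.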
VDSs are often built after data product decisions have already been made. What you can still control is the manifest, where you can choose which variables are represented and how they are composed. For examples, see Virtual Stores at NASA.
Chunk sizes must be consistent across files
Variable-length chunks are not yet supported, meaning that all files contributing to a virtual store must share the same chunking scheme, as described in this Zarr feature request. This presents challenges for any dataset where the grid shape differs, even slightly, across granules. The TEMPO collection, as described in this GitHub issue, is an example where even a small difference in grid shape can make the dataset incompatible with the current Zarr model.
Note that work to support variable-length chunks is underway; see https://github.com/zarr-developers/zarr-python/pull/3802.
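Before building a virtual store, it can help to check whether the contributing granules actually share a chunk shape. The following is a minimal sketch under stated assumptions: the chunk shapes have already been read from each file by whatever tool you use, and the granule names and shapes are invented for illustration (they are not real TEMPO values).

```python
from collections import Counter


def find_inconsistent_chunking(chunk_shapes):
    """Return (most common chunk shape, {granule: shape} for the outliers).

    `chunk_shapes` maps a granule name to the chunk shape (a tuple) of one
    variable in that granule. How the shapes are obtained is out of scope.
    """
    if not chunk_shapes:
        return None, {}
    reference, _ = Counter(chunk_shapes.values()).most_common(1)[0]
    mismatched = {
        name: shape for name, shape in chunk_shapes.items() if shape != reference
    }
    return reference, mismatched


# Hypothetical granules: the third has a slightly different grid shape.
shapes = {
    "granule_001": (1, 123, 2048),
    "granule_002": (1, 123, 2048),
    "granule_003": (1, 131, 2048),
}
reference, mismatched = find_inconsistent_chunking(shapes)
print(reference)   # (1, 123, 2048)
print(mismatched)  # {'granule_003': (1, 131, 2048)}
```

Any granule reported as mismatched would need to be excluded or handled separately until variable-length chunks are supported.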
Collections with multiple grid systems
There is significant variation across—and even within—datasets when it comes to grid systems. Data from Earth-observing satellites is commonly collected in swaths, which do not natively form a single, regular data cube. Most global, uniformly gridded datasets are higher-level products (Level 3 or Level 4) that have undergone additional processing to form a consistent grid.
However, many data products preserve their native spatial structure, resulting in different coordinate systems, as well as overlapping or irregular grids. These datasets are typically organized into tiles, scenes, or frames (e.g., Sentinel tiles; OPERA and NISAR frames). Virtualizing these datasets requires additional design considerations to account for their varying geometries and coordinate systems.
As highlighted in a GeoZarr specification thread on mapping multi-CRS datasets, array-based formats like Zarr typically require a uniform multi-dimensional structure. They struggle to natively represent heterogeneous, multi-CRS tiles (such as global Sentinel-2 or Landsat grids) without forcing a permanent reprojection to a common global grid, a process that introduces data loss, edge artifacts, and spatial distortion.
To preserve original data integrity, virtualization must instead rely on abstract, secondary indexing systems (like GDAL’s Geospatial Tile Index or external STAC catalogs) to act as a “virtual mosaic,” stitching together scenes and handling reprojection strictly on-the-fly. This “dynamic” approach shifts the bottleneck to the I/O layer. As contributors in the GeoZarr thread experienced, rapidly fetching and aligning thousands of localized, overlapping chunks can easily overwhelm cloud storage APIs with massive concurrent read requests, requiring manual tuning of client caching and concurrency.
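One common form of the concurrency tuning mentioned above is to cap the number of in-flight reads. The sketch below shows the general pattern with a standard-library semaphore. It is not taken from the GeoZarr thread or from any specific client. `fetch_all`, `fake_fetch`, and the limit of 8 are hypothetical, and the fetch function only simulates a request.

```python
import asyncio


async def fetch_all(keys, fetch, max_concurrency=8):
    """Fetch every key, allowing at most `max_concurrency` requests at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(key):
        async with semaphore:          # wait for a free slot
            return await fetch(key)

    # gather returns results in the same order as `keys`
    return await asyncio.gather(*(bounded(k) for k in keys))


async def _demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(key):
        # Stand-in for a chunk read; tracks how many reads overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.001)
        in_flight -= 1
        return key

    results = await fetch_all(range(100), fake_fetch, max_concurrency=8)
    return results, peak


results, peak = asyncio.run(_demo())
print(len(results), peak)  # 100 results; peak never exceeds the limit of 8
```

The right limit depends on the storage service and the client, which is why it tends to require manual tuning.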
If the GeoZarr community coalesces around a solution, it will also need to be implemented in client libraries like GDAL and xarray. Supporting this kind of virtual data cube would greatly benefit the scientific community, but the technical challenges are not trivial.
Potential solutions
- Some datasets, such as NISAR, may lend themselves to being cubed per frame.
- Compose a virtual data tree: this solves the issue of variable grid systems by not representing the data as a cube but still enabling a single entry point. The problem of mosaicking data still remains.
- Apply a query engine to Zarr metadata storing many 1-D arrays. This will streamline mosaicking as it avoids a secondary indexing system and potentially avoids fetching overlapping chunks. An early approach to this is Zarr Data Fusion Search.
- For future missions, investigate the use of discrete global grid systems so that lower-level data products are spatially and temporally aligned, which would facilitate data processing and the generation of virtual data cubes.
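The virtual data tree option above can be illustrated with plain dictionaries. This is a simplified stand-in for a real hierarchical container (xarray's DataTree is one candidate), not an implementation of one. The scene IDs are hypothetical, and the CRS codes and tile shape are illustrative values.

```python
from collections import defaultdict


def build_tree(scenes):
    """Group scenes into one node per (CRS, grid shape) combination.

    Scenes in the same node share a grid and could form a regular cube;
    scenes in different nodes stay separate but share one entry point.
    """
    groups = defaultdict(list)
    for scene in scenes:
        groups[(scene["crs"], scene["shape"])].append(scene["id"])
    return {
        f"{crs}/{shape[0]}x{shape[1]}": sorted(ids)
        for (crs, shape), ids in groups.items()
    }


# Hypothetical scenes in two UTM zones.
scenes = [
    {"id": "scene_a", "crs": "EPSG:32610", "shape": (10980, 10980)},
    {"id": "scene_b", "crs": "EPSG:32610", "shape": (10980, 10980)},
    {"id": "scene_c", "crs": "EPSG:32611", "shape": (10980, 10980)},
]
tree = build_tree(scenes)
print(tree)
# {'EPSG:32610/10980x10980': ['scene_a', 'scene_b'],
#  'EPSG:32611/10980x10980': ['scene_c']}
```

The tree gives a single entry point without reprojection, but a query that spans both nodes still has to mosaic across coordinate systems.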
Authentication and credential complexity
Opening a virtual store backed by NASA data currently requires steps beyond standard Earthdata Login, specifically S3 credential configuration and tool-specific API calls to open the store before any data is accessed. Friction exists because virtual stores sit at the intersection of several authentication boundaries: the store itself (which may be in a public or protected S3 bucket), the source data files the store references (which typically require Earthdata Login credentials), and services (which may have their own authentication interfaces).
Until this is simplified to something comparable to the experience earthaccess provides for direct file access, credential complexity will remain a practical barrier to adoption, particularly for researchers who are not cloud-infrastructure specialists.
Language and ecosystem constraints
Icechunk is written in Rust with a Python API. Users and data providers working in other languages (Julia, R, Java, etc.) may face limited or no support for reading and writing Icechunk stores. Rust presents an organizational risk similar to what NASA has experienced with niche languages in other systems: supporting and extending Icechunk long-term would require NASA staff or contractors with Rust expertise, which is not yet widely available in the Earth science community. Rust is seeing broader general adoption than some past niche languages, which reduces but does not eliminate this risk.