Onboarding Guide
Getting started with data virtualization
This guide is intended for data providers and engineers who are new to virtual store technology and want to begin virtualizing NASA datasets.
Prerequisites
(List required background knowledge, tools, and access credentials.)
Step 1: Understand your data
Before virtualizing, document:
- File format (NetCDF, HDF5, GeoTIFF, etc.) and any data hierarchy
- Grid structure (uniform, non-uniform, swath)
- Typical file sizes and granule counts
- Primary user access patterns
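One way to keep this checklist consistent across datasets is to record it as a small structured profile. The sketch below is only an illustration: the `DatasetProfile` class, its field names, and the example values are all hypothetical and are not part of any NASA or DAAC tooling.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DatasetProfile:
    """Hypothetical record of the Step 1 checklist for one dataset."""
    short_name: str
    file_format: str              # e.g. "NetCDF", "HDF5", "GeoTIFF"
    has_groups: bool              # True if the files use a group hierarchy
    grid_structure: str           # "uniform", "non-uniform", or "swath"
    typical_file_size_mb: float
    granule_count: int
    access_patterns: list[str] = field(default_factory=list)

    def total_size_gb(self) -> float:
        """Rough archive size, assuming every granule is of typical size."""
        return self.typical_file_size_mb * self.granule_count / 1024


profile = DatasetProfile(
    short_name="EXAMPLE_L3_DAILY",   # made-up dataset name
    file_format="NetCDF",
    has_groups=False,
    grid_structure="uniform",
    typical_file_size_mb=512.0,
    granule_count=2000,
    access_patterns=["time series at a point", "single-day global map"],
)

print(json.dumps(asdict(profile), indent=2))
print(f"Estimated archive size: {profile.total_size_gb():.1f} GB")
```

Writing the profile down before choosing an approach makes it easier to revisit the later steps if any of these properties turn out to differ from what was assumed.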
Step 2: Choose a virtualization approach
| Approach | Best for | Status |
|---|---|---|
| Kerchunk | Reference files for existing archives | Mature |
| Icechunk | Transactional stores, appending data | Active development |
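To make "reference files" more concrete, the sketch below builds a small reference set by hand in the style of the Kerchunk version 1 layout, where each chunk key maps to a URL, byte offset, and byte length in an existing file. The bucket, file name, variable name, offsets, and lengths are made up for illustration, and the `byte_ranges` helper is hypothetical.

```python
import json

# Hand-written illustration of a version 1 reference set. The URL, offsets,
# and lengths are made up; real values come from scanning the source file.
references = {
    "version": 1,
    "refs": {
        # Small metadata entries are stored inline as strings.
        ".zgroup": json.dumps({"zarr_format": 2}),
        # Chunk entries point at byte ranges: [url, offset, length].
        "sst/0.0.0": ["s3://example-bucket/granule_001.nc", 4096, 1048576],
        "sst/0.0.1": ["s3://example-bucket/granule_001.nc", 1052672, 1048576],
    },
}


def byte_ranges(refs):
    """Yield (key, url, start, end) for every chunk stored as a byte range."""
    for key, value in refs["refs"].items():
        if isinstance(value, list) and len(value) == 3:
            url, offset, length = value
            yield key, url, offset, offset + length


for key, url, start, end in byte_ranges(references):
    print(f"{key}: bytes {start}-{end - 1} of {url}")
```

The point of the sketch is that no array data is copied: the virtual store only records where each chunk already lives in the archive.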
Step 3: Design your chunk manifest
Refer to the Best Practices and Known Limitations sections before finalizing your chunk layout.
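A quick back-of-the-envelope calculation can show how a candidate chunk layout interacts with different access patterns. The sketch below is a simplified estimate under stated assumptions: a hypothetical (time, lat, lon) array of 4-byte values, uncompressed chunks, and requests aligned to chunk boundaries. The layout names and shapes are illustrative only. Because a virtual store generally reuses the chunking already present in the source files, this kind of estimate is mainly useful for judging whether that existing chunking suits your users.

```python
import math


def chunk_bytes(chunk_shape, itemsize):
    """Uncompressed size of one chunk in bytes."""
    return math.prod(chunk_shape) * itemsize


def chunks_touched(request_shape, chunk_shape):
    """Number of chunks a request overlaps, assuming it starts on a chunk boundary."""
    return math.prod(math.ceil(r / c) for r, c in zip(request_shape, chunk_shape))


itemsize = 4  # bytes per value, e.g. float32

# Hypothetical chunk shapes in (time, lat, lon) order.
layouts = {
    "map-friendly": (1, 1800, 3600),      # one full map per time step
    "series-friendly": (365, 100, 100),   # long time runs, small spatial tiles
}

# Hypothetical request shapes in the same dimension order.
requests = {
    "one-day global map": (1, 1800, 3600),
    "one-year point series": (365, 1, 1),
}

for name, chunk_shape in layouts.items():
    size_mb = chunk_bytes(chunk_shape, itemsize) / 1e6
    print(f"{name}: {size_mb:.1f} MB per chunk")
    for label, request_shape in requests.items():
        print(f"  {label}: {chunks_touched(request_shape, chunk_shape)} chunks")
```

Under these assumptions, each layout serves one request with a single chunk and the other with hundreds, which is why the dominant access pattern should drive the choice.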
Step 4: Generate and validate your virtual store
(Step-by-step instructions and code examples to be added.)
The toolchain offers multiple entry points at this step:
- earthaccess.virtualize() provides a higher-level interface that handles common cases with less configuration.
- Using the Icechunk API directly offers more control over commit behavior, store configuration, and append workflows.
Generation can run in series or in parallel (e.g., using Dask), depending on the number of granules and available compute.
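The serial and parallel options share the same shape: apply one per-granule function across a list of granules. The sketch below shows that pattern using the standard library's `ThreadPoolExecutor` in place of Dask so that it runs without extra dependencies. The `build_references` function is a hypothetical stand-in for real reference generation, and the granule URLs are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def build_references(granule_url):
    """Hypothetical stand-in for per-granule reference generation.

    A real implementation would open the granule and record its chunk
    locations; here we return a placeholder so the sketch runs anywhere.
    """
    return {"granule": granule_url, "status": "ok"}


def run_serial(granule_urls):
    """Process granules one at a time."""
    return [build_references(url) for url in granule_urls]


def run_parallel(granule_urls, max_workers=4):
    """Process granules concurrently with a thread pool."""
    # executor.map preserves input order, so results line up with the URLs.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(build_references, granule_urls))


# Made-up granule locations for illustration only.
granules = [f"s3://example-bucket/granule_{i:03d}.nc" for i in range(8)]

serial_results = run_serial(granules)
parallel_results = run_parallel(granules)
print(serial_results == parallel_results)
```

Keeping the per-granule step as a plain function makes it straightforward to start in series for a handful of granules and switch to a parallel scheduler later without changing the step itself.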
Step 5: Register and publish
(Instructions for registering your virtual store with the relevant DAAC and making it discoverable.)
Getting help
(Links to community resources, Slack channels, issue trackers, etc.)
Common gotchas
Before starting, review the Known Limitations section — several apply directly to onboarding decisions:
- Chunk size consistency — verify your files share a common chunking scheme before building the store, not after.
- Early structural decisions — understand your users’ dominant access patterns before choosing a chunk layout.
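The chunk size consistency check can be sketched as a comparison of per-variable chunk shapes across files. The example below works on a plain dictionary of made-up shapes so that it is self-contained; the `find_chunk_mismatches` helper, file names, and variable names are hypothetical. In practice, the shapes would first be collected from each file (for example, h5py exposes them through the `Dataset.chunks` attribute).

```python
from collections import defaultdict


def find_chunk_mismatches(chunk_shapes_by_file):
    """Return variables whose chunk shape differs between files.

    chunk_shapes_by_file maps file name -> {variable name -> chunk shape}.
    The result maps each inconsistent variable to {chunk shape -> [files]}.
    """
    files_by_shape = defaultdict(lambda: defaultdict(list))
    for file_name, variables in chunk_shapes_by_file.items():
        for var_name, shape in variables.items():
            files_by_shape[var_name][tuple(shape)].append(file_name)

    return {
        var_name: dict(shapes)
        for var_name, shapes in files_by_shape.items()
        if len(shapes) > 1
    }


# Made-up chunk shapes standing in for values read from real files.
observed = {
    "granule_001.nc": {"sst": (1, 900, 1800), "mask": (1, 900, 1800)},
    "granule_002.nc": {"sst": (1, 900, 1800), "mask": (1, 900, 1800)},
    "granule_003.nc": {"sst": (1, 450, 900), "mask": (1, 900, 1800)},
}

mismatches = find_chunk_mismatches(observed)
for var_name, shapes in mismatches.items():
    print(f"{var_name} has {len(shapes)} different chunk shapes:")
    for shape, files in shapes.items():
        print(f"  {shape}: {files}")
```

Running a check like this over a sample of granules from different time periods can surface reprocessing-era changes in chunking before any store is built.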
Additionally, netCDF group hierarchies may need to be flattened during virtualization. Support for netCDF groups in virtual stores is still a work in progress. If preserving the hierarchy matters for your dataset, test this early with xarray DataTree.
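To illustrate what flattening means, the sketch below collapses a nested group structure into a single level by joining group names into each variable name. Nested dictionaries stand in for groups and strings stand in for arrays. The `flatten_groups` helper, the `__` separator, and the example hierarchy are all assumptions for illustration and do not reflect how any particular virtualization tool names flattened variables.

```python
def flatten_groups(tree, prefix="", sep="__"):
    """Flatten a nested group/variable mapping into a single level.

    Nested dicts stand in for groups; any other value stands in for a variable.
    """
    flat = {}
    for name, node in tree.items():
        path = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(node, dict):
            # Recurse into the group, carrying the path so far as the prefix.
            flat.update(flatten_groups(node, prefix=path, sep=sep))
        else:
            flat[path] = node
    return flat


# Made-up hierarchy; strings stand in for the actual arrays.
hierarchy = {
    "time": "time array",
    "geolocation": {"lat": "lat array", "lon": "lon array"},
    "science": {"retrieval": {"sst": "sst array"}},
}

flat = flatten_groups(hierarchy)
print(sorted(flat))
```

Flattening loses the explicit parent-child relationships, so any convention for encoding group paths in variable names should be documented for users of the store.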