Onboarding Guide

Getting started with data virtualization

This guide is intended for data providers and engineers who are new to virtual store technology and want to begin virtualizing NASA datasets.

Prerequisites

(List required background knowledge, tools, and access credentials.)

Step 1: Understand your data

Before virtualizing, document:

  • File format (NetCDF, HDF5, GeoTIFF, etc.) and any data hierarchy
  • Grid structure (uniform, non-uniform, swath)
  • Typical file sizes and granule counts
  • Primary user access patterns
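Capturing these characteristics in a small, structured profile makes later decisions (chunk layout, serial vs. parallel generation) easier to reason about. The sketch below is illustrative only — the field names and numbers are hypothetical, not a required schema:

```python
# A minimal sketch of a pre-virtualization dataset profile.
# All field names and values here are hypothetical -- adapt to your dataset.

profile = {
    "file_format": "NetCDF4",
    "grid": "uniform",                # uniform | non-uniform | swath
    "granule_count": 8760,            # e.g., hourly granules for one year
    "typical_granule_mb": 45,
    "access_pattern": "time-series",  # how users most often slice the data
}

# A rough archive size helps decide whether parallel generation is needed.
archive_gb = profile["granule_count"] * profile["typical_granule_mb"] / 1024
print(f"Estimated archive size: {archive_gb:.1f} GB")  # prints 385.0 GB here
```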

Step 2: Choose a virtualization approach

Approach    Best for                                   Status
Kerchunk    Reference files for existing archives      Mature
Icechunk    Transactional stores, appending data       Active development

Step 3: Design your chunk manifest

Refer to Best Practices and Known Limitations before finalizing your chunk layout.
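Part of that design work is simple arithmetic: checking that candidate chunk shapes land in a sensible size range (roughly 1–100 MB uncompressed is a commonly cited target for cloud-optimized access). The shapes and dtype below are illustrative assumptions, not recommendations:

```python
# Sketch: estimate the uncompressed size of a candidate chunk shape.
# The grid dimensions and float32 itemsize below are illustrative only.
from math import prod

def chunk_size_mb(shape, itemsize=4):
    """Uncompressed chunk size in MB for a given shape and dtype itemsize."""
    return prod(shape) * itemsize / 1024**2

# One time step of a hypothetical 3600 x 1800 float32 grid:
per_timestep = chunk_size_mb((1, 1800, 3600))
print(f"{per_timestep:.1f} MB per chunk")  # prints 24.7 MB per chunk

# Batching 4 time steps into one chunk scales the size accordingly:
batched = chunk_size_mb((4, 1800, 3600))
```

If typical access is a long time series at a single point, smaller spatial chunks spanning more time steps may serve users better than one-map-per-chunk layouts.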

Step 4: Generate and validate your virtual store

(Step-by-step instructions and code examples to be added.)

The toolchain offers multiple entry points at this step:

  • earthaccess.virtualize() provides a higher-level interface that handles common cases with less configuration.
  • Icechunk API directly offers more control over commit behavior, store configuration, and append workflows.
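As a rough shape for the Icechunk-oriented path, here is a hedged sketch assuming VirtualiZarr and Icechunk are installed. The function names (`open_virtual_dataset`, `local_filesystem_storage`, `writable_session`, the `.virtualize.to_icechunk` accessor) reflect recent releases of those libraries and may differ in yours — check their documentation before running; the granule URL is a placeholder:

```python
# Sketch only: API names follow recent VirtualiZarr/Icechunk releases
# and may differ in your installed versions -- verify against their docs.
import icechunk
import xarray as xr
from virtualizarr import open_virtual_dataset

# 1. Build virtual (reference-only) datasets, one per granule.
granule_urls = ["s3://example-bucket/granule-0001.nc"]  # placeholder
virtual = [open_virtual_dataset(url) for url in granule_urls]
combined = xr.concat(virtual, dim="time")

# 2. Write the references into a transactional Icechunk store and commit.
storage = icechunk.local_filesystem_storage("./virtual-store")
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
combined.virtualize.to_icechunk(session.store)
session.commit("Initial virtualization of granule batch")
```

The commit step is what distinguishes this path: each batch of references lands as an atomic, named revision, which is what makes later appends safe.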

Generation can run in series or in parallel (e.g., using Dask), depending on the number of granules and available compute.
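The parallel pattern is straightforward because per-granule reference generation is independent. The pure-Python sketch below uses `concurrent.futures` as a stand-in for Dask to show the shape of it; `build_references` is a hypothetical placeholder for your actual per-granule generator:

```python
# Sketch: serial vs. parallel reference generation over independent
# granules. build_references is a hypothetical stand-in for a real
# per-granule call (e.g., opening a granule as a virtual dataset).
from concurrent.futures import ThreadPoolExecutor

def build_references(granule):
    """Placeholder per-granule step; returns a dict of fake references."""
    return {"granule": granule, "refs": f"refs-for-{granule}"}

granules = [f"granule-{i:04d}.nc" for i in range(8)]

# Serial: fine for small granule counts.
serial = [build_references(g) for g in granules]

# Parallel: independent work maps cleanly across workers
# (Dask distributes this same pattern across a cluster).
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(build_references, granules))

assert serial == parallel  # same results regardless of execution mode
```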

Step 5: Register and publish

(Instructions for registering your virtual store with the relevant DAAC and making it discoverable.)

Getting help

(Links to community resources, Slack channels, issue trackers, etc.)

Common gotchas

Before starting, review the Known Limitations section; several of those limitations apply directly to onboarding decisions.

One gotcha worth calling out here: netCDF group hierarchies may need to be flattened during virtualization, because support for netCDF groups in virtual stores is still a work in progress. If preserving the hierarchy matters for your dataset, test this early with xarray's DataTree.