Onboarding Guide

Getting started with data virtualization

This guide is intended for data providers and engineers who are new to virtual store technology and want to begin virtualizing NASA datasets.

Prerequisites

(List required background knowledge, tools, and access credentials.)

Step 1: Understand your data

Before virtualizing, document:

  • File format (NetCDF, HDF5, GeoTIFF, etc.) and any data hierarchy
  • Grid structure (uniform, non-uniform, swath)
  • Typical file sizes and granule counts
  • Primary user access patterns
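Capturing these characteristics in a small, structured profile makes later decisions (chunk layout, serial vs. parallel generation) easier to reason about. The sketch below is illustrative only — the field names and numbers are hypothetical, not a required schema:

```python
# A minimal sketch of a pre-virtualization dataset profile.
# All field names and values here are hypothetical -- adapt to your dataset.

profile = {
    "file_format": "NetCDF4",
    "grid": "uniform",                # uniform | non-uniform | swath
    "granule_count": 8760,            # e.g., hourly granules for one year
    "typical_granule_mb": 45,
    "access_pattern": "time-series",  # how users most often slice the data
}

# A rough archive size helps decide whether parallel generation is needed.
archive_gb = profile["granule_count"] * profile["typical_granule_mb"] / 1024
print(f"Estimated archive size: {archive_gb:.1f} GB")  # prints 385.0 GB here
```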

Step 2: Choose a virtualization approach

Approach    Best for                                   Status
Kerchunk    Reference files for existing archives      Mature
Icechunk    Transactional stores, appending data       Active development

Step 3: Design your chunk manifest

Refer to Best Practices and Known Limitations before finalizing your chunk layout.
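Part of that design work is simple arithmetic: checking that candidate chunk shapes land in a sensible size range (roughly 1–100 MB uncompressed is a commonly cited target for cloud-optimized access). The shapes and dtype below are illustrative assumptions, not recommendations:

```python
# Sketch: estimate the uncompressed size of a candidate chunk shape.
# The grid dimensions and float32 itemsize below are illustrative only.
from math import prod

def chunk_size_mb(shape, itemsize=4):
    """Uncompressed chunk size in MB for a given shape and dtype itemsize."""
    return prod(shape) * itemsize / 1024**2

# One time step of a hypothetical 3600 x 1800 float32 grid:
per_timestep = chunk_size_mb((1, 1800, 3600))
print(f"{per_timestep:.1f} MB per chunk")  # prints 24.7 MB per chunk

# Batching 4 time steps into one chunk scales the size accordingly:
batched = chunk_size_mb((4, 1800, 3600))
```

If typical access is a long time series at a single point, smaller spatial chunks spanning more time steps may serve users better than one-map-per-chunk layouts.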

Step 4: Generate and validate your virtual store

(Step-by-step instructions and code examples to be added.)

The toolchain offers multiple entry points at this step:

  • earthaccess.virtualize() provides a higher-level interface that handles common cases with less configuration.
  • Icechunk API directly offers more control over commit behavior, store configuration, and append workflows.
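As a rough shape for the Icechunk-oriented path, here is a hedged sketch assuming VirtualiZarr and Icechunk are installed. The function names (`open_virtual_dataset`, `local_filesystem_storage`, `writable_session`, the `.virtualize.to_icechunk` accessor) reflect recent releases of those libraries and may differ in yours — check their documentation before running; the granule URL is a placeholder:

```python
# Sketch only: API names follow recent VirtualiZarr/Icechunk releases
# and may differ in your installed versions -- verify against their docs.
import icechunk
import xarray as xr
from virtualizarr import open_virtual_dataset

# 1. Build virtual (reference-only) datasets, one per granule.
granule_urls = ["s3://example-bucket/granule-0001.nc"]  # placeholder
virtual = [open_virtual_dataset(url) for url in granule_urls]
combined = xr.concat(virtual, dim="time")

# 2. Write the references into a transactional Icechunk store and commit.
storage = icechunk.local_filesystem_storage("./virtual-store")
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")
combined.virtualize.to_icechunk(session.store)
session.commit("Initial virtualization of granule batch")
```

The commit step is what distinguishes this path: each batch of references lands as an atomic, named revision, which is what makes later appends safe.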

Generation can run in series or in parallel (e.g., using Dask), depending on the number of granules and available compute.
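The parallel pattern is straightforward because per-granule reference generation is independent. The pure-Python sketch below uses `concurrent.futures` as a stand-in for Dask to show the shape of it; `build_references` is a hypothetical placeholder for your actual per-granule generator:

```python
# Sketch: serial vs. parallel reference generation over independent
# granules. build_references is a hypothetical stand-in for a real
# per-granule call (e.g., opening a granule as a virtual dataset).
from concurrent.futures import ThreadPoolExecutor

def build_references(granule):
    """Placeholder per-granule step; returns a dict of fake references."""
    return {"granule": granule, "refs": f"refs-for-{granule}"}

granules = [f"granule-{i:04d}.nc" for i in range(8)]

# Serial: fine for small granule counts.
serial = [build_references(g) for g in granules]

# Parallel: independent work maps cleanly across workers
# (Dask distributes this same pattern across a cluster).
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(build_references, granules))

assert serial == parallel  # same results regardless of execution mode
```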

Step 5: Register and publish

(Instructions for registering your virtual store with the relevant DAAC and making it discoverable.)

Getting help

(Links to community resources, Slack channels, issue trackers, etc.)

Common gotchas

Before starting, review the Known Limitations section; several of those limitations apply directly to onboarding decisions.

One gotcha worth calling out here: netCDF group hierarchies may need to be flattened during virtualization, because support for netCDF groups in virtual stores is still a work in progress. If preserving the hierarchy matters for your dataset, test this early with xarray's DataTree.