Walk-through - Virtualizing NetCDFs from Amazon’s Open Data Program
Let’s start with the first example from the VirtualiZarr homepage.
This example uses the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6) from the Registry of Open Data on AWS. The virtualization process will be much faster if run proximal to the data in AWS’s us-west-2 region.
Creating the virtual dataset looks quite similar to how we normally open data with xarray, but there are a few notable differences that are shown through this example.
First, import the necessary functions and classes:
import icechunk
import obstore
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry
Zarr can emit a lot of warnings about Numcodecs not being included in the Zarr version 3 specification yet – let’s suppress those.
import warnings
warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)

We can use Obstore’s obstore.store.from_url convenience method to create an ObjectStore that can fetch data from the specified URLs.
bucket = "s3://nex-gddp-cmip6/"
path = "NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc"
store = obstore.store.from_url(bucket, region="us-west-2", skip_signature=True)

We also need to create an ObjectStoreRegistry that maps the URL structure to the ObjectStore.
registry = ObjectStoreRegistry({bucket: store})

Now, let’s create a parser instance and build a virtual dataset by passing the URL, parser, and registry to virtualizarr.open_virtual_dataset.
parser = HDFParser()
vds = open_virtual_dataset(
    url=f"{bucket}{path}",
    parser=parser,
    registry=registry,
    loadable_variables=[],
)
print(vds)

<xarray.Dataset> Size: 1GB
Dimensions: (time: 365, lat: 600, lon: 1440)
Coordinates:
time (time) float64 3kB ManifestArray<shape=(365,), dtype=float64, ch...
lat (lat) float64 5kB ManifestArray<shape=(600,), dtype=float64, chu...
lon (lon) float64 12kB ManifestArray<shape=(1440,), dtype=float64, c...
Data variables:
tasmax (time, lat, lon) float32 1GB ManifestArray<shape=(365, 600, 1440...
Attributes: (12/22)
cmip6_source_id: ACCESS-CM2
cmip6_institution_id: CSIRO-ARCCSS
cmip6_license: CC-BY-SA 4.0
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
... ...
doi: https://doi.org/10.7917/OFSG3345
external_variables: areacella
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 13:31:18 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: d4b2123b-abf9-4c3c-a780-58df6ce4e67f
Since we specified loadable_variables=[], no data has been loaded or copied in this process. We have merely created an in-memory lookup table that points to the locations of the chunks in the original netCDF, to be read when the data is needed later on. The default behavior (loadable_variables=None) loads the data associated with coordinates but not with data variables. The size shown in the repr is the size of the original dataset; you can see the size of the virtual dataset itself using the vz accessor:
print(f"Original dataset size: {vds.nbytes} bytes")
print(f"Virtual dataset size: {vds.vz.nbytes} bytes")Original dataset size: 1261459240 bytes
Virtual dataset size: 11776 bytes
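For comparison, here is a minimal sketch (not part of the original walk-through) of the default behavior: reopening the same file with loadable_variables left as None, so that the coordinate variables are loaded into memory while tasmax stays virtual.

# Sketch: reopen the same file with the default loadable_variables=None.
# The coordinate variables (time, lat, lon) are loaded into memory,
# while the tasmax data variable remains a virtual ManifestArray of chunk references.
vds_default = open_virtual_dataset(
    url=f"{bucket}{path}",
    parser=parser,
    registry=registry,
)
print(vds_default)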
VirtualiZarr’s other top-level function is virtualizarr.open_virtual_mfdataset, which can open and virtualize multiple data sources into a single virtual dataset, similar to how xarray.open_mfdataset opens multiple data files as a single dataset.
urls = [
    f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_{year}_v2.0.nc"
    for year in range(2015, 2017)
]
vds = open_virtual_mfdataset(urls, parser=parser, registry=registry)
print(vds)

<xarray.Dataset> Size: 3GB
Dimensions: (time: 731, lat: 600, lon: 1440)
Coordinates:
* time (time) datetime64[ns] 6kB 2015-01-01T12:00:00 ... 2016-12-31T12:...
* lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
* lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
tasmax (time, lat, lon) float32 3GB ManifestArray<shape=(731, 600, 1440...
Attributes: (12/22)
cmip6_source_id: ACCESS-CM2
cmip6_institution_id: CSIRO-ARCCSS
cmip6_license: CC-BY-SA 4.0
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
... ...
doi: https://doi.org/10.7917/OFSG3345
external_variables: areacella
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 13:31:18 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: d4b2123b-abf9-4c3c-a780-58df6ce4e67f
The magic of VirtualiZarr is that you can persist the virtual dataset to disk in a chunk reference format such as Icechunk, meaning that the work of constructing the single coherent dataset only needs to happen once. For subsequent data access, you can use xarray.open_zarr to open that Icechunk store, which on object storage is far faster than using xarray.open_mfdataset to open the original non-cloud-optimized files.
Let’s persist the virtual dataset using Icechunk. Here we store the dataset in a memory store, but in most cases you’ll store the virtual dataset in the cloud.
icechunk_store = icechunk.in_memory_storage()
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        url_prefix=bucket,
        store=icechunk.s3_store(region="us-west-2", anonymous=True),
    ),
)
virtual_credentials = icechunk.containers_credentials(
    {bucket: icechunk.s3_anonymous_credentials()}
)
repo = icechunk.Repository.create(
    icechunk_store,
    config,
    authorize_virtual_chunk_access=virtual_credentials,
)
session = repo.writable_session("main")
vds.vz.to_icechunk(session.store)
session.commit("Create virtual store")

'2ZRMJZYJ7MTB7XM6T0S0'
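In practice you will usually persist the repository to object storage rather than memory. As a rough sketch only (the bucket and prefix below are hypothetical placeholders, and icechunk.s3_storage is assumed to accept bucket, prefix, and region arguments), the storage setup might look like this, with the rest of the workflow unchanged:

# Hypothetical sketch: store the Icechunk repository itself on S3.
# "my-virtual-refs-bucket" and the prefix are placeholders for a bucket you can write to;
# credentials for it are assumed to come from your environment.
cloud_storage = icechunk.s3_storage(
    bucket="my-virtual-refs-bucket",
    prefix="nex-gddp-cmip6-virtual",
    region="us-west-2",
)
# Then create the repository exactly as above, using the same config and
# virtual chunk credentials, open a writable session, and commit.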
Always check your work when creating a new virtual dataset by comparing the virtual version to the native version.
import fsspec
import xarray as xr

Open the virtual version:
observed = xr.open_zarr(session.store, consolidated=False, zarr_format=3)
observed

<xarray.Dataset> Size: 3GB
Dimensions: (time: 731, lat: 600, lon: 1440)
Coordinates:
* lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
* lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
* time (time) datetime64[ns] 6kB 2015-01-01T12:00:00 ... 2016-12-31T12:...
Data variables:
tasmax (time, lat, lon) float32 3GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
Attributes: (12/22)
cmip6_source_id: ACCESS-CM2
cmip6_institution_id: CSIRO-ARCCSS
cmip6_license: CC-BY-SA 4.0
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
... ...
doi: https://doi.org/10.7917/OFSG3345
external_variables: areacella
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 13:31:18 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: d4b2123b-abf9-4c3c-a780-58df6ce4e67f

Let’s use a traditional method to open the NetCDF files directly:
fs = fsspec.filesystem("s3", anon=True)
fsspec_caching = {
    "cache_type": "blockcache",  # the block cache stores fixed-size blocks and evicts them with an LRU strategy
    "block_size": 8 * 1024 * 1024,  # size in bytes per block; adjust for the file size, but a few MB is recommended
}
expected = xr.open_mfdataset(
    [fs.open(url, **fsspec_caching) for url in urls], engine="h5netcdf"
)

Let’s use xarray’s testing utilities to compare our virtual dataset with the original dataset:
xr.testing.assert_allclose(expected, observed)
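As a final illustrative check (an extra step, not in the original walk-through), you can read a small slice through the virtual references to confirm that chunk access against the original files works end to end:

# Reading a single timestep triggers range requests against the original
# NetCDF chunks on S3 via the virtual references stored in Icechunk.
subset = observed["tasmax"].isel(time=0)
print(float(subset.mean()))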