Walk-through - Virtualizing NetCDFs from Amazon’s Open Data Program
Let’s start with the first example from the VirtualiZarr homepage.
This example uses the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6) from the Registry of Open Data on AWS. The virtualization process will be much faster if run proximal to the data in AWS’s us-west-2 region.
Creating the virtual dataset looks quite similar to how we normally open data with xarray, but there are a few notable differences that are shown through this example.
First, import the necessary functions and classes:
import icechunk
import obstore
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry
Zarr can emit a lot of warnings about Numcodecs not being included in the Zarr version 3 specification yet – let’s suppress those.
import warnings
warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)

We can use Obstore’s obstore.store.from_url convenience method to create an ObjectStore that can fetch data from the specified URLs.
bucket = "s3://nex-gddp-cmip6/"
path = "NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc"
store = obstore.store.from_url(bucket, region="us-west-2", skip_signature=True)

We also need to create an ObjectStoreRegistry that maps the URL structure to the ObjectStore.
registry = ObjectStoreRegistry({bucket: store})

Now, let’s create a parser instance and build a virtual dataset by passing the URL, parser, and registry to virtualizarr.open_virtual_dataset.
parser = HDFParser()
vds = open_virtual_dataset(
    url=f"{bucket}{path}",
    parser=parser,
    registry=registry,
    loadable_variables=[],
)
print(vds)

<xarray.Dataset> Size: 1GB
Dimensions: (time: 365, lat: 600, lon: 1440)
Coordinates:
time (time) float64 3kB ManifestArray<shape=(365,), dtype=float64, ch...
lat (lat) float64 5kB ManifestArray<shape=(600,), dtype=float64, chu...
lon (lon) float64 12kB ManifestArray<shape=(1440,), dtype=float64, c...
Data variables:
tasmax (time, lat, lon) float32 1GB ManifestArray<shape=(365, 600, 1440...
Attributes: (12/22)
cmip6_source_id: ACCESS-CM2
cmip6_institution_id: CSIRO-ARCCSS
cmip6_license: CC-BY-SA 4.0
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
... ...
doi: https://doi.org/10.7917/OFSG3345
external_variables: areacella
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 13:31:18 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: d4b2123b-abf9-4c3c-a780-58df6ce4e67f
Since we specified loadable_variables=[], no data has been loaded or copied in this process. We have merely created an in-memory lookup table that points to the locations of the chunks in the original netCDF, to be read when the data is needed later on. The default behavior (loadable_variables=None) loads the data associated with coordinates but not with data variables. The size shown in the repr is the size of the original dataset; you can see the size of the virtual dataset itself using the vz accessor:
print(f"Original dataset size: {vds.nbytes} bytes")
print(f"Virtual dataset size: {vds.vz.nbytes} bytes")Original dataset size: 1261459240 bytes
Virtual dataset size: 11776 bytes
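For comparison, here is a minimal sketch (not part of the original walk-through) of the default behavior: reopening the same file with loadable_variables left as None, so that the coordinate variables are loaded into memory while tasmax stays virtual.

# Sketch: reopen the same file with the default loadable_variables=None.
# The coordinate variables (time, lat, lon) are loaded into memory,
# while the tasmax data variable remains a virtual ManifestArray of chunk references.
vds_default = open_virtual_dataset(
    url=f"{bucket}{path}",
    parser=parser,
    registry=registry,
)
print(vds_default)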
VirtualiZarr’s other top-level function is virtualizarr.open_virtual_mfdataset, which can open and virtualize multiple data sources into a single virtual dataset, similar to how xarray.open_mfdataset opens multiple data files as a single dataset.
urls = [
    f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_{year}_v2.0.nc"
    for year in range(2015, 2017)
]
vds = open_virtual_mfdataset(urls, parser=parser, registry=registry)
print(vds)

<xarray.Dataset> Size: 3GB
Dimensions: (time: 731, lat: 600, lon: 1440)
Coordinates:
* time (time) datetime64[ns] 6kB 2015-01-01T12:00:00 ... 2016-12-31T12:...
* lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
* lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
tasmax (time, lat, lon) float32 3GB ManifestArray<shape=(731, 600, 1440...
Attributes: (12/22)
cmip6_source_id: ACCESS-CM2
cmip6_institution_id: CSIRO-ARCCSS
cmip6_license: CC-BY-SA 4.0
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
... ...
doi: https://doi.org/10.7917/OFSG3345
external_variables: areacella
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 13:31:18 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: d4b2123b-abf9-4c3c-a780-58df6ce4e67f
The magic of VirtualiZarr is that you can persist the virtual dataset to disk in a chunk reference format such as Icechunk, meaning that the work of constructing the single coherent dataset only needs to happen once. For subsequent data access, you can use xarray.open_zarr to open that Icechunk store, which on object storage is far faster than using xarray.open_mfdataset to open the original non-cloud-optimized files.
Let’s persist the virtual dataset using Icechunk. Here we store the dataset in a memory store, but in most cases you’ll store the virtual dataset in the cloud.
icechunk_store = icechunk.in_memory_storage()
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        url_prefix=bucket,
        store=icechunk.s3_store(region="us-west-2", anonymous=True),
    ),
)
virtual_credentials = icechunk.containers_credentials(
    {bucket: icechunk.s3_anonymous_credentials()}
)
repo = icechunk.Repository.create(
    icechunk_store,
    config,
    authorize_virtual_chunk_access=virtual_credentials,
)
session = repo.writable_session("main")
vds.vz.to_icechunk(session.store)
session.commit("Create virtual store")

'2ZRMJZYJ7MTB7XM6T0S0'
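In practice you will usually persist the repository to object storage rather than memory. As a rough sketch only (the bucket and prefix below are hypothetical placeholders, and icechunk.s3_storage is assumed to accept bucket, prefix, and region arguments), the storage setup might look like this, with the rest of the workflow unchanged:

# Hypothetical sketch: store the Icechunk repository itself on S3.
# "my-virtual-refs-bucket" and the prefix are placeholders for a bucket you can write to;
# credentials for it are assumed to come from your environment.
cloud_storage = icechunk.s3_storage(
    bucket="my-virtual-refs-bucket",
    prefix="nex-gddp-cmip6-virtual",
    region="us-west-2",
)
# Then create the repository exactly as above, using the same config and
# virtual chunk credentials, open a writable session, and commit.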
Always check your work when creating a new virtual dataset by comparing the virtual version to the native version.
import fsspec
import xarray as xr

Open the virtual version:
observed = xr.open_zarr(session.store, consolidated=False, zarr_format=3)
observed

<xarray.Dataset> Size: 3GB
Dimensions: (time: 731, lat: 600, lon: 1440)
Coordinates:
* lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
* lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
* time (time) datetime64[ns] 6kB 2015-01-01T12:00:00 ... 2016-12-31T12:...
Data variables:
tasmax (time, lat, lon) float32 3GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
Attributes: (12/22)
cmip6_source_id: ACCESS-CM2
cmip6_institution_id: CSIRO-ARCCSS
cmip6_license: CC-BY-SA 4.0
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
... ...
doi: https://doi.org/10.7917/OFSG3345
external_variables: areacella
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 13:31:18 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: d4b2123b-abf9-4c3c-a780-58df6ce4e67f

Let’s use a traditional method to open the NetCDF files directly:
fs = fsspec.filesystem("s3", anon=True)
fsspec_caching = {
    "cache_type": "blockcache",  # the block cache stores fixed-size blocks and evicts them with an LRU strategy
    "block_size": 8 * 1024 * 1024,  # size in bytes per block; adjust for the file size, but a few MB is recommended
}
expected = xr.open_mfdataset(
    [fs.open(url, **fsspec_caching) for url in urls], engine="h5netcdf"
)

Let’s use xarray’s testing utilities to compare our virtual dataset with the original dataset:
xr.testing.assert_allclose(expected, observed)
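As a final illustrative check (an extra step, not in the original walk-through), you can read a small slice through the virtual references to confirm that chunk access against the original files works end to end:

# Reading a single timestep triggers range requests against the original
# NetCDF chunks on S3 via the virtual references stored in Icechunk.
subset = observed["tasmax"].isel(time=0)
print(float(subset.mean()))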