import json
import datetime

import icechunk
import pystac
import rioxarray  # noqa: F401 - registers the .rio accessor used by ds.rio.bounds() below
import xstac
import zarr
import xarray as xr

Creating a STAC Collection for a Virtual Icechunk Store
There is a virtual icechunk store that is publicly available at: s3://nasa-waterinsight/virtual-zarr-store/NLDAS-3-icechunk/
This notebook goes through the current thinking for how you would set up a STAC collection that points to that virtual icechunk store and provides all the information a user needs to interact with the virtual zarr store programmatically or via a web UI.
Zarr can emit a lot of warnings about Numcodecs not yet being included in the Zarr version 3 specification – let’s suppress those.
import warnings

warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)

These are the PRs that need to land before you can open the virtual icechunk store with zarr directly:
- https://github.com/zarr-developers/zarr-python/pull/3369
- https://github.com/earth-mover/icechunk/pull/1161
Until then:
storage = icechunk.s3_storage(
bucket="nasa-waterinsight",
prefix="virtual-zarr-store/NLDAS-3-icechunk/",
region="us-west-2",
anonymous=True,
)The bucket and prefix are from the icechunk href. The anonymous=True needs to come from somewhere else.
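As an illustration of that first point, the bucket and prefix can be split out of the store’s s3:// href with nothing but the standard library (the split_s3_href helper below is ours, not an icechunk or pystac API):

from urllib.parse import urlparse

def split_s3_href(href: str) -> tuple[str, str]:
    # Split an s3:// href into (bucket, prefix); helper name is ours, not a library API.
    parsed = urlparse(href)
    assert parsed.scheme == "s3", f"expected an s3:// href, got {href!r}"
    return parsed.netloc, parsed.path.lstrip("/")

bucket, prefix = split_s3_href("s3://nasa-waterinsight/virtual-zarr-store/NLDAS-3-icechunk/")
# bucket == "nasa-waterinsight"; prefix == "virtual-zarr-store/NLDAS-3-icechunk/"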
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        "s3://nasa-waterinsight/NLDAS3/forcing/daily/",
        icechunk.s3_store(region="us-west-2"),
    )
)

virtual_credentials = icechunk.containers_credentials(
    {
        "s3://nasa-waterinsight/NLDAS3/forcing/daily/": icechunk.s3_anonymous_credentials()
    }
)

Here we need the href for the internal storage bucket(s) (composed of bucket and prefix) and we need the region of that bucket. Then we need some way of providing credentials.
repo = icechunk.Repository.open(
    storage=storage,
    config=config,
    authorize_virtual_chunk_access=virtual_credentials,
)

session = repo.readonly_session(snapshot_id='YTNGFY4WY9189GEH1FNG')

Since icechunk manages versions (like git) we need some way of knowing which branch, tag, or snapshot_id (similar to a commit) to use.
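For reference, a readonly session does not have to be pinned to a snapshot_id; icechunk can also resolve a branch or a tag. A minimal sketch (the branch and tag names here are hypothetical for this store):

# Follow the tip of a branch (moves as new snapshots are committed):
session_main = repo.readonly_session(branch="main")

# Or pin to an immutable named tag, much like a git tag (hypothetical tag name):
session_v1 = repo.readonly_session(tag="v1")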
ds = xr.open_zarr(session.store, consolidated=False, zarr_format=3)

Last of all we need a way of specifying that we are looking at icechunk here, as well as the standard fields consolidated and zarr_format that are already included in the STAC Zarr extension.
Note that it is possible that these last two are not actually required: xarray should know that for icechunk stores consolidated is always false, and it should similarly be able to infer the zarr format from the store itself.
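If that inference holds, the call would presumably reduce to this sketch (not verified against this store):

# If xarray can infer consolidated=False and the zarr format from the
# icechunk store itself, neither keyword should be needed:
ds = xr.open_zarr(session.store)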
ds

<xarray.Dataset> Size: 51TB
Dimensions: (time: 8399, lat: 6500, lon: 11700)
Coordinates:
* time (time) datetime64[ns] 67kB 2001-01-02 2001-01-03 ... 2024-01-01
* lat (lat) float64 52kB 7.005 7.015 7.025 7.035 ... 71.97 71.98 71.99
* lon (lon) float64 94kB -169.0 -169.0 -169.0 ... -52.03 -52.01 -52.0
Data variables:
PSurf (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Qair (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
SWdown (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Rainf (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
LWdown (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Tair (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Tair_min (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Wind_E (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Wind_N (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Tair_max (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Attributes: (12/17)
missing_value: -9999.0
time_definition: daily
shortname: NLDAS_FOR0010_D_3.0
title: NLDAS Forcing Data L4 Daily 0.01 x 0.01 degree V3...
version: 3.0 beta
institution: NASA GSFC
... ...
websites: https://ldas.gsfc.nasa.gov/nldas/v3/ ; https://li...
MAP_PROJECTION: EQUIDISTANT CYLINDRICAL
SOUTH_WEST_CORNER_LAT: 7.005000114440918
SOUTH_WEST_CORNER_LON: -168.9949951171875
DX: 0.009999999776482582
DY: 0.009999999776482582

Ok! We have the xarray dataset lazily opened by accessing only the virtual icechunk store! Now the goal is to create a STAC collection that describes that virtual icechunk store and contains all the information we need for accessing it (all the inputs to the functions above).
Extract metadata for STAC collection
Now let’s see how xstac can help us extract the variables and represent them using the Datacube STAC Extension.
This section takes inspiration from https://github.com/stac-utils/xstac/blob/main/examples/nasa-nex-gddp-cmip6/generate.py
We’ll start with some of the hard-coded values that will need to be provided for any given dataset. As much as possible these should be lifted directly from the xr.Dataset attrs.
collection_id = "nldas-3"

description = (
    "NLDAS-3 provides a fine-scale (1 km) meteorological forcing (precipitation) in "
    "both retrospective and near real-time over North and Central America, including "
    "Alaska, Hawaii, and Puerto Rico, by leveraging high-quality gauge, satellite, "
    "and model datasets through advanced data assimilation methods. Read more: "
    "https://ldas.gsfc.nasa.gov/nldas/v3"
)

providers = [
    pystac.Provider(
        name="NLDAS",
        roles=["producer", "processor", "licensor"],
        url="https://ldas.gsfc.nasa.gov/nldas",
    )
]

I want to draw special attention to how we can use the Storage STAC Extension to capture that this particular bucket can be accessed anonymously. We can also capture the region within this blob.
storage_schemes = {
    "aws-s3-nasa-waterinsight": {
        "type": "aws-s3",
        "platform": "https://{bucket}.s3.{region}.amazonaws.com",
        "bucket": "nasa-waterinsight",
        "region": "us-west-2",
        "anonymous": True,
    }
}

Now let’s configure some metadata that can be derived from the xr.Dataset itself.
title = ds.attrs["title"]

extents = pystac.Extent(
    spatial=pystac.SpatialExtent(bboxes=[list(ds.rio.bounds())]),
    temporal=pystac.TemporalExtent(
        # STAC temporal extents are a list of [start, end] intervals
        intervals=[
            [
                datetime.datetime.fromisoformat(str(ds.time.min().values)),
                datetime.datetime.fromisoformat(str(ds.time.max().values)),
            ]
        ]
    ),
)

Now that we have all those values set, create a pystac.Collection:
template = pystac.Collection(
    collection_id,
    description=description,
    extent=extents,
    extra_fields={"storage:schemes": storage_schemes, "item_assets": {}},
    providers=providers,
    title=title,
    stac_extensions=[
        "https://stac-extensions.github.io/storage/v2.0.0/schema.json",
    ],
)

Now that we have a preliminary version of the STAC Collection we can pass it off to xstac to pull out the variables and dims using the Datacube STAC Extension.
collection = xstac.xarray_to_stac(
    ds,
    template,
    temporal_dimension="time",
    x_dimension="lon",
    y_dimension="lat",
    reference_system=4326,
    validate=False,
)

With that collection in hand we can create an asset that points to the virtual icechunk store and add it as a collection-level asset.
Add collection-level assets
The main concern of the asset is how to access the data. So we need the asset to contain all the information we need to pass into icechunk functions when opening the virtual icechunk store.
snapshot_id = "YTNGFY4WY9189GEH1FNG"
virtual_href = "s3://nasa-waterinsight/virtual-zarr-store/NLDAS-3-icechunk/"
name = "nldas-3"
virtual_key = f"{name}@{snapshot_id}"
storage_ref = "aws-s3-nasa-waterinsight"

collection.add_asset(
    virtual_key,
    pystac.Asset(
        virtual_href,
        title="NLDAS-3 Virtual Zarr Store",
        media_type="application/vnd.zarr+icechunk",  # I made this up: discussion https://earthmover-community.slack.com/archives/C07NQCBSTB7/p1756918042834049
        roles=["data", "references", "virtual", "latest-version"],
        extra_fields={
            "zarr:consolidated": False,
            "zarr:zarr_format": 3,
            "icechunk:snapshot_id": snapshot_id,
            "storage:refs": [storage_ref],
        },
    ),
)

We also need to specify how to access the legacy files, which potentially sit in their own bucket. We can use the Virtual Assets STAC Extension to capture that.
legacy_href = "s3://nasa-waterinsight/NLDAS3/forcing/daily/"
legacy_key = "nldas-3-legacy-bucket"

collection.add_asset(
    legacy_key,
    pystac.Asset(
        legacy_href,
        title="NLDAS-3 Legacy Bucket",
        media_type="application/x-netcdf",
        roles=["data"],
        extra_fields={
            "storage:refs": [storage_ref],
        },
    ),
)

collection.assets[virtual_key].extra_fields["vrt:hrefs"] = [
    {
        "key": legacy_key,
        "href": f"https://raw.githubusercontent.com/NASA-IMPACT/dse-virtual-zarr-workshop/refs/heads/main/docs/examples/collection.json#/assets/{legacy_key}",
    },
]

We can also add information about how to render each variable. This uses the Render STAC Extension and specifies how applications (for instance titiler) should represent each variable visually.
renders = {}
for k in ds:
    if k.startswith("LW"):
        colormap_name = "inferno"
    elif k.startswith("PS"):
        colormap_name = "viridis"
    elif k.startswith("Q"):
        colormap_name = "plasma"
    elif k.startswith("Rain"):
        colormap_name = "cfastie"
    elif k.startswith("SW"):
        colormap_name = "magma"
    elif k.startswith("T"):
        colormap_name = "RdYlBu_r"
    elif k.startswith("Wind"):
        colormap_name = "PuOr"
    else:
        colormap_name = "viridis"  # fallback so colormap_name is always defined
    renders[k] = {
        "title": ds[k].attrs["long_name"],
        "assets": [virtual_key],  # must match the asset key in collection.assets
        "resampling": "average",
        "colormap_name": colormap_name,
        "rescale": [[ds[k].attrs["vmin"], ds[k].attrs["vmax"]]],
        "backend": "xarray",
    }

collection.extra_fields["renders"] = renders

Last of all we will add the Render, Virtual Assets, Zarr, and Version STAC Extensions to the list of stac_extensions. We need to do this after adding the assets because these extensions rely on there being assets in the collection.
collection.stac_extensions = list(set(
    collection.stac_extensions + [
        "https://stac-extensions.github.io/render/v2.0.0/schema.json",
        "https://stac-extensions.github.io/virtual-assets/v1.0.0/schema.json",
        "https://stac-extensions.github.io/zarr/v1.1.0/schema.json",
        "https://stac-extensions.github.io/version/v1.2.0/schema.json",
    ]
))

Validate the collection:
collection.validate()

['https://schemas.stacspec.org/v1.1.0/collection-spec/json-schema/collection.json',
'https://stac-extensions.github.io/zarr/v1.1.0/schema.json',
'https://stac-extensions.github.io/datacube/v2.2.0/schema.json',
'https://stac-extensions.github.io/version/v1.2.0/schema.json',
'https://stac-extensions.github.io/storage/v2.0.0/schema.json',
'https://stac-extensions.github.io/render/v2.0.0/schema.json',
'https://stac-extensions.github.io/virtual-assets/v1.0.0/schema.json']
Dump the collection to JSON
with open("collection.json", "w") as f:
    json.dump(collection.to_dict(), f, indent=2)

Read the virtual icechunk store using the collection-level asset
There is functionality in xpystac for opening the virtual icechunk store referenced by the virtual asset, so with a properly configured STAC collection you don’t need to interact with icechunk directly at all.
Now let’s read the collection back in from the JSON file, find the latest version of the icechunk store, and lazily open it as an xarray.Dataset:
collection = pystac.Collection.from_file("collection.json")
assets = collection.get_assets(media_type="application/vnd.zarr+icechunk", role="latest-version")
assets

{'nldas-3@YTNGFY4WY9189GEH1FNG': <Asset href=s3://nasa-waterinsight/virtual-zarr-store/NLDAS-3-icechunk/>}
ds = xr.open_dataset(list(assets.values())[0], chunks={}, engine="stac")
ds

<xarray.Dataset> Size: 51TB
Dimensions: (time: 8399, lat: 6500, lon: 11700)
Coordinates:
* lat (lat) float64 52kB 7.005 7.015 7.025 7.035 ... 71.97 71.98 71.99
* lon (lon) float64 94kB -169.0 -169.0 -169.0 ... -52.03 -52.01 -52.0
* time (time) datetime64[ns] 67kB 2001-01-02 2001-01-03 ... 2024-01-01
Data variables:
Rainf (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
PSurf (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
LWdown (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Qair (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
SWdown (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Wind_E (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Tair_max (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Wind_N (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Tair (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Tair_min (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Attributes: (12/17)
missing_value: -9999.0
time_definition: daily
shortname: NLDAS_FOR0010_D_3.0
title: NLDAS Forcing Data L4 Daily 0.01 x 0.01 degree V3...
version: 3.0 beta
institution: NASA GSFC
... ...
websites: https://ldas.gsfc.nasa.gov/nldas/v3/ ; https://li...
MAP_PROJECTION: EQUIDISTANT CYLINDRICAL
SOUTH_WEST_CORNER_LAT: 7.005000114440918
SOUTH_WEST_CORNER_LON: -168.9949951171875
DX: 0.009999999776482582
DY: 0.009999999776482582

Make sure you can actually access the underlying data:
%%time
cape_rain = ds.Rainf.sel(
    time="2023-07-16",
    lat=slice(41.48, 42.10),
    lon=slice(-70.84, -69.77)
).compute()

CPU times: user 3.05 s, sys: 596 ms, total: 3.65 s
Wall time: 5.89 s
cape_rain

<xarray.DataArray 'Rainf' (lat: 62, lon: 107)> Size: 53kB
array([[ 5.16139793, 4.73033953, 4.79954004, ..., 2.80910897,
2.8472836 , 2.64635181],
[ 4.856112 , 4.68789768, 4.71580458, ..., 2.80869484,
2.87434697, 2.80177665],
[ 4.72029495, 4.83154964, 4.46945238, ..., 2.90991831,
3.02153993, 3.10612035],
...,
[ 7.01298904, 7.19748259, 7.40456963, ..., 18.69691277,
18.66467857, 18.25347519],
[ 7.89184809, 8.40285873, 7.87267876, ..., 17.93091774,
19.15982437, 18.79140854],
[ 8.36418152, 8.91619682, 8.63145733, ..., 18.75998878,
20.46075821, 20.56634521]], shape=(62, 107))
Coordinates:
* lat (lat) float64 496B 41.49 41.49 41.51 41.51 ... 42.08 42.08 42.1
* lon (lon) float64 856B -70.83 -70.82 -70.81 ... -69.79 -69.78 -69.77
time datetime64[ns] 8B 2023-07-16
Attributes:
units: kg m-2
standard_name: Total precipitation
long_name: Total precipitation
cell_methods: time: sum
vmin: 0.0
vmax: 60.617774963378906