import json
import datetime
import icechunk
import pystac
import rioxarray  # noqa: F401 -- registers the ds.rio accessor used below
import xstac
import zarr
import xarray as xr
Creating a STAC Collection for a Virtual Icechunk Store
There is a virtual icechunk store that is publicly available at: s3://nasa-waterinsight/virtual-zarr-store/NLDAS-3-icechunk/
This notebook goes through the current thinking for how you would set up a STAC collection that points to that virtual icechunk store and provides all the information a user needs to interact with the virtual zarr store programmatically or via a web UI.
Zarr can emit a lot of warnings about Numcodecs not being included in the Zarr version 3 specification yet – let’s suppress those.
import warnings
warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)
These are the PRs that need to land before you can open the virtual icechunk store with zarr directly:
- https://github.com/zarr-developers/zarr-python/pull/3369
- https://github.com/earth-mover/icechunk/pull/1161
Until then:
storage = icechunk.s3_storage(
    bucket="nasa-waterinsight",
    prefix="virtual-zarr-store/NLDAS-3-icechunk/",
    region="us-west-2",
    anonymous=True,
)
The bucket and prefix are from the icechunk href. The anonymous=True needs to come from somewhere else.
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        "s3://nasa-waterinsight/NLDAS3/forcing/daily/",
        icechunk.s3_store(region="us-west-2"),
    )
)
virtual_credentials = icechunk.containers_credentials(
    {"s3://nasa-waterinsight/NLDAS3/forcing/daily/": icechunk.s3_anonymous_credentials()}
)
Here we need the href for the internal storage bucket(s) (composed of bucket and prefix) and we need the region of that bucket. Then we need some way of providing credentials.
repo = icechunk.Repository.open(
    storage=storage,
    config=config,
    authorize_virtual_chunk_access=virtual_credentials,
)
session = repo.readonly_session(snapshot_id='YTNGFY4WY9189GEH1FNG')
Since icechunk manages versions (like git) we need some way of knowing which branch, tag, or snapshot_id (similar to commit) to use.
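For reference, a readonly session can be pinned the same way to a branch tip or a tag instead of a snapshot. A minimal sketch – the branch and tag names below are hypothetical and may not exist on this store:

# Hypothetical refs; this public store may only advertise snapshot IDs
session_at_branch = repo.readonly_session(branch="main")
session_at_tag = repo.readonly_session(tag="v1.0")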
ds = xr.open_zarr(session.store, consolidated=False, zarr_format=3)
Last of all we need a way of specifying that we are looking at icechunk here, as well as the standard fields consolidated and zarr_format that are already included in the STAC Zarr extension.
Note that it is possible that these last two are not actually required: xarray should know that consolidated is always false for icechunk stores, and it should similarly be able to infer the zarr format from the store itself.
ds
<xarray.Dataset> Size: 51TB
Dimensions:   (time: 8399, lat: 6500, lon: 11700)
Coordinates:
  * lon       (lon) float64 94kB -169.0 -169.0 -169.0 ... -52.03 -52.01 -52.0
  * lat       (lat) float64 52kB 7.005 7.015 7.025 7.035 ... 71.97 71.98 71.99
  * time      (time) datetime64[ns] 67kB 2001-01-02 2001-01-03 ... 2024-01-01
Data variables:
    LWdown    (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    PSurf     (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Tair      (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Qair      (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Tair_max  (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Tair_min  (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Wind_N    (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    SWdown    (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Wind_E    (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Rainf     (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Attributes: (12/17)
    missing_value:          -9999.0
    time_definition:        daily
    shortname:              NLDAS_FOR0010_D_3.0
    title:                  NLDAS Forcing Data L4 Daily 0.01 x 0.01 degree V3...
    version:                3.0 beta
    institution:            NASA GSFC
    ...                     ...
    websites:               https://ldas.gsfc.nasa.gov/nldas/v3/ ; https://li...
    MAP_PROJECTION:         EQUIDISTANT CYLINDRICAL
    SOUTH_WEST_CORNER_LAT:  7.005000114440918
    SOUTH_WEST_CORNER_LON:  -168.9949951171875
    DX:                     0.009999999776482582
    DY:                     0.009999999776482582
Ok! We have the xarray dataset lazily opened by accessing only the virtual icechunk store! Now the goal is to create a STAC collection that describes that virtual icechunk store and contains all the information we need for accessing it (all the inputs to the functions above).
Extract metadata for STAC collection
Now let’s see how xstac can help us extract the variables and represent them using the Datacube STAC Extension.
This section takes inspiration from https://github.com/stac-utils/xstac/blob/main/examples/nasa-nex-gddp-cmip6/generate.py
We’ll start with some of the hard-coded values that will need to be provided for any given dataset. As much as possible these should be lifted directly from the xr.Dataset attrs.
= "nldas-3"
collection_id = (
description "NLDAS-3 provides a fine-scale (1 km) meteorological forcing (precipitation) in "
"both retrospective and near real-time over North and Central America, including "
"Alaska, Hawaii, and Puerto Rico, by leveraging high-quality gauge, satellite, "
"and model datasets through advanced data assimilation methods. Read more: "
"https://ldas.gsfc.nasa.gov/nldas/v3"
)= [
providers
pystac.Provider(="NLDAS",
name=["producer", "processor", "licensor"],
roles="https://ldas.gsfc.nasa.gov/nldas"
url
) ]
I want to draw special attention to how we can use the Storage STAC Extension to capture that this particular bucket can be accessed anonymously. We can also capture the region within this blob.
storage_schemes = {
    "aws-s3-nasa-waterinsight": {
        "type": "aws-s3",
        "platform": "https://{bucket}.s3.{region}.amazonaws.com",
        "bucket": "nasa-waterinsight",
        "region": "us-west-2",
        "anonymous": True,
    }
}
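This is exactly the information a client needs to reconstruct the icechunk.s3_storage call. Here is a minimal sketch of resolving a scheme back into connection keyword arguments (the full version of this logic appears in read_virtual_asset below):

# Resolve a storage scheme into icechunk.s3_storage kwargs
scheme = storage_schemes["aws-s3-nasa-waterinsight"]
connection_kwargs = dict(
    bucket=scheme["bucket"],
    region=scheme["region"],
    anonymous=scheme.get("anonymous", False),  # default to credentialed access
)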
Now let’s configure some metadata that can be pulled from the xr.Dataset itself.
= ds.attrs["title"]
title = pystac.Extent(
extents =pystac.SpatialExtent(bboxes=[list(ds.rio.bounds())]),
spatial=pystac.TemporalExtent(
temporal=[
intervalsstr(ds.time.min().values)),
datetime.datetime.fromisoformat(str(ds.time.max().values))
datetime.datetime.fromisoformat(
]
), )
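As a quick sanity check, you can serialize the extent to see exactly what will land in the collection JSON:

# The STAC representation: a bbox plus a temporal interval
extents.to_dict()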
Now that we have all those values set, create a pystac.Collection:
template = pystac.Collection(
    collection_id,
    description=description,
    extent=extents,
    extra_fields={"storage:schemes": storage_schemes, "item_assets": {}},
    providers=providers,
    title=title,
    stac_extensions=[
        "https://stac-extensions.github.io/storage/v2.0.0/schema.json",
    ],
)
Now that we have a preliminary version of the STAC Collection we can pass it off to xstac to pull out the variables and dims using the Datacube STAC Extension.
collection = xstac.xarray_to_stac(
    ds,
    template,
    temporal_dimension="time",
    x_dimension="lon",
    y_dimension="lat",
    reference_system=4326,
    validate=False,
)
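If you want to confirm what xstac pulled out, you can peek at the Datacube extension fields. A quick check, assuming xstac writes them into the collection’s extra_fields as in its examples:

# Dimension and variable names extracted by xstac (keys per the Datacube extension)
list(collection.extra_fields.get("cube:dimensions", {}))
list(collection.extra_fields.get("cube:variables", {}))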
With that collection in hand we can create an asset that points to the virtual icechunk store and add it as a collection-level asset.
Add collection-level assets
The main concern of the asset is how to access the data. So we need the asset to contain all the information we need to pass into icechunk functions when opening the virtual icechunk store.
= "YTNGFY4WY9189GEH1FNG"
snapshot_id = "s3://nasa-waterinsight/virtual-zarr-store/NLDAS-3-icechunk/"
virtual_href
= "nldas-3"
name = f"{name}@{snapshot_id}"
virtual_key = "aws-s3-nasa-waterinsight"
storage_ref
collection.add_asset(
virtual_key,
pystac.Asset(
virtual_href,="NLDAS-3 Virtual Zarr Store",
title="application/vnd.zarr+icechunk", # I made this up: discussion https://earthmover-community.slack.com/archives/C07NQCBSTB7/p1756918042834049
media_type=["data", "references", "virtual", "latest-version"],
roles={
extra_fields"zarr:consolidated": False,
"zarr:zarr_format": 3,
"icechunk:snapshot_id": snapshot_id,
"storage:refs": [storage_ref],
}
) )
We also need to specify how to access the legacy files which potentially sit in their own bucket. We can use the Virtual Assets STAC Extension to capture that.
= "s3://nasa-waterinsight/NLDAS3/forcing/daily/"
legacy_href = "nldas-3-legacy-bucket"
legacy_key
collection.add_asset(
legacy_key,
pystac.Asset(
legacy_href,="NLDAS-3 Legacy Bucket",
title="application/x-netcdf",
media_type=["data"],
roles={
extra_fields"storage:refs": [storage_ref],
}
)
)
"vrt:hrefs"] = [
collection.assets[virtual_key].extra_fields[
{"key": legacy_key,
"href": f"https://raw.githubusercontent.com/NASA-IMPACT/dse-virtual-zarr-workshop/refs/heads/main/docs/examples/collection.json#/assets/{legacy_key}"
}, ]
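At this point the collection carries both assets, with the virtual asset pointing at the legacy bucket via vrt:hrefs:

list(collection.assets)
# ['nldas-3@YTNGFY4WY9189GEH1FNG', 'nldas-3-legacy-bucket']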
We can also add information about how to render each variable. This uses the Render STAC Extension and specifies how applications (for instance titiler) should represent each variable visually.
renders = {}
for k in ds:
    if k.startswith("LW"):
        colormap_name = "inferno"
    elif k.startswith("PS"):
        colormap_name = "viridis"
    elif k.startswith("Q"):
        colormap_name = "plasma"
    elif k.startswith("Rain"):
        colormap_name = "cfastie"
    elif k.startswith("SW"):
        colormap_name = "magma"
    elif k.startswith("T"):
        colormap_name = "RdYlBu_r"
    elif k.startswith("Wind"):
        colormap_name = "PuOr"

    renders[k] = {
        "title": ds[k].attrs["long_name"],
        "assets": ["nldas-3"],
        "resampling": "average",
        "colormap_name": colormap_name,
        "rescale": [[ds[k].attrs["vmin"], ds[k].attrs["vmax"]]],
        "backend": "xarray",
    }
"renders"] = renders collection.extra_fields[
Last of all we will add the Render, Virtual Assets, Zarr, and Version STAC Extensions to the list of stac_extensions. We need to do this after adding the assets because some of these extensions rely on there being assets in the collection.
collection.stac_extensions = list(set(
    collection.stac_extensions + [
        "https://stac-extensions.github.io/render/v2.0.0/schema.json",
        "https://stac-extensions.github.io/virtual-assets/v1.0.0/schema.json",
        "https://stac-extensions.github.io/zarr/v1.1.0/schema.json",
        "https://stac-extensions.github.io/version/v1.2.0/schema.json",
    ]
))
Validate the collection
Validation just randomly started raising an error this morning :( Potentially related to https://github.com/stac-utils/pystac/issues/1409
collection.validate()
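Until that is resolved, one pragmatic option is to catch the validation error and inspect it rather than let it halt the notebook; a minimal sketch:

try:
    collection.validate()
except pystac.STACValidationError as e:
    # Surface the schema failure without stopping execution
    print(e)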
Dump the collection to json
with open("collection.json", "w") as f:
    json.dump(collection.to_dict(), f, indent=2)
Read the virtual icechunk using collection-level asset
Let’s create a function that opens the virtual icechunk store as a virtual asset using only the STAC metadata. Don’t worry too much about the function itself. Ideally this will end up living in xpystac (here is the issue for that: https://github.com/stac-utils/xpystac/issues/53) so you don’t need to interact with it at all. The key takeaway is that the only input to this function is the virtual asset.
def read_virtual_asset(virtual_asset: pystac.Asset):
    """Read a collection-level virtual icechunk asset"""
    collection = virtual_asset.owner

    # --- Create icechunk storage for virtual store
    storage_refs = virtual_asset.extra_fields["storage:refs"]
    if len(storage_refs) != 1:
        raise ValueError("Only supports one storage:ref per virtual asset")
    storage_scheme = collection.extra_fields["storage:schemes"].get(storage_refs[0])
    if storage_scheme["type"] != "aws-s3":
        raise ValueError("Only S3 buckets are currently supported")
    bucket = storage_scheme["bucket"]
    region = storage_scheme["region"]
    anonymous = storage_scheme.get("anonymous", False)
    prefix = virtual_asset.href.split(f"{bucket}/")[1]

    storage = icechunk.s3_storage(
        bucket=bucket,
        prefix=prefix,
        region=region,
        anonymous=anonymous,
    )

    # --- Create icechunk config object for chunk store
    legacy_buckets = virtual_asset.extra_fields["vrt:hrefs"]
    if len(legacy_buckets) != 1:
        raise ValueError("Only supports one vrt:href per virtual asset")
    legacy_asset = collection.assets[legacy_buckets[0]["key"]]
    href = legacy_asset.href

    storage_refs = legacy_asset.extra_fields["storage:refs"]
    if len(storage_refs) != 1:
        raise ValueError("Only supports one storage:ref per legacy asset")
    storage_scheme = collection.extra_fields["storage:schemes"].get(storage_refs[0])
    if storage_scheme["type"] != "aws-s3":
        raise ValueError("Only S3 buckets are currently supported")
    bucket = storage_scheme["bucket"]
    region = storage_scheme["region"]
    anonymous = storage_scheme.get("anonymous", False)

    config = icechunk.RepositoryConfig.default()
    config.set_virtual_chunk_container(
        icechunk.VirtualChunkContainer(
            href,
            icechunk.s3_store(region=region),
        )
    )
    if anonymous:
        virtual_credentials = icechunk.containers_credentials(
            {href: icechunk.s3_anonymous_credentials()}
        )
    else:
        raise ValueError("Only anonymous S3 buckets are currently supported")

    # --- Open icechunk session at a particular snapshot
    repo = icechunk.Repository.open(
        storage=storage,
        config=config,
        authorize_virtual_chunk_access=virtual_credentials,
    )
    # TODO: should look for exactly one of "snapshot_id", "branch", "tag"
    snapshot_id = virtual_asset.extra_fields["icechunk:snapshot_id"]
    session = repo.readonly_session(snapshot_id=snapshot_id)

    # --- Open store as an xarray object
    consolidated = virtual_asset.extra_fields["zarr:consolidated"]
    zarr_format = virtual_asset.extra_fields["zarr:zarr_format"]

    return xr.open_zarr(session.store, consolidated=consolidated, zarr_format=zarr_format)
Now let’s read the collection in from where we stored it in the json, find the latest version of the icechunk store, and lazily open it as an xarray.Dataset.
collection = pystac.Collection.from_file("collection.json")

assets = collection.get_assets(media_type="application/vnd.zarr+icechunk", role="latest-version")
assets
{'nldas-3@YTNGFY4WY9189GEH1FNG': <Asset href=s3://nasa-waterinsight/virtual-zarr-store/NLDAS-3-icechunk/>}
ds = read_virtual_asset(list(assets.values())[0])
ds
<xarray.Dataset> Size: 51TB
Dimensions:   (time: 8399, lat: 6500, lon: 11700)
Coordinates:
  * lon       (lon) float64 94kB -169.0 -169.0 -169.0 ... -52.03 -52.01 -52.0
  * lat       (lat) float64 52kB 7.005 7.015 7.025 7.035 ... 71.97 71.98 71.99
  * time      (time) datetime64[ns] 67kB 2001-01-02 2001-01-03 ... 2024-01-01
Data variables:
    Tair_max  (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Wind_N    (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Wind_E    (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    PSurf     (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Qair      (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    LWdown    (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Tair      (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Rainf     (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    Tair_min  (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
    SWdown    (time, lat, lon) float64 5TB dask.array<chunksize=(1, 500, 900), meta=np.ndarray>
Attributes: (12/17)
    missing_value:          -9999.0
    time_definition:        daily
    shortname:              NLDAS_FOR0010_D_3.0
    title:                  NLDAS Forcing Data L4 Daily 0.01 x 0.01 degree V3...
    version:                3.0 beta
    institution:            NASA GSFC
    ...                     ...
    websites:               https://ldas.gsfc.nasa.gov/nldas/v3/ ; https://li...
    MAP_PROJECTION:         EQUIDISTANT CYLINDRICAL
    SOUTH_WEST_CORNER_LAT:  7.005000114440918
    SOUTH_WEST_CORNER_LON:  -168.9949951171875
    DX:                     0.009999999776482582
    DY:                     0.009999999776482582
Make sure you can actually access the underlying data:
%%time
cape_rain = ds.Rainf.sel(
    time="2023-07-16",
    lat=slice(41.48, 42.10),
    lon=slice(-70.84, -69.77),
).compute()
CPU times: user 954 ms, sys: 289 ms, total: 1.24 s
Wall time: 4.13 s
cape_rain
<xarray.DataArray 'Rainf' (lat: 62, lon: 107)> Size: 53kB
array([[ 5.16139793,  4.73033953,  4.79954004, ...,  2.80910897,
         2.8472836 ,  2.64635181],
       [ 4.856112  ,  4.68789768,  4.71580458, ...,  2.80869484,
         2.87434697,  2.80177665],
       [ 4.72029495,  4.83154964,  4.46945238, ...,  2.90991831,
         3.02153993,  3.10612035],
       ...,
       [ 7.01298904,  7.19748259,  7.40456963, ..., 18.69691277,
        18.66467857, 18.25347519],
       [ 7.89184809,  8.40285873,  7.87267876, ..., 17.93091774,
        19.15982437, 18.79140854],
       [ 8.36418152,  8.91619682,  8.63145733, ..., 18.75998878,
        20.46075821, 20.56634521]], shape=(62, 107))
Coordinates:
  * lon      (lon) float64 856B -70.83 -70.82 -70.81 ... -69.79 -69.78 -69.77
  * lat      (lat) float64 496B 41.49 41.49 41.51 41.51 ... 42.08 42.08 42.1
    time     datetime64[ns] 8B 2023-07-16
Attributes:
    units:          kg m-2
    standard_name:  Total precipitation
    long_name:      Total precipitation
    cell_methods:   time: sum
    vmin:           0.0
    vmax:           60.617774963378906
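Optionally, plot the subset for a quick visual check (this assumes matplotlib is installed):

# A 2D DataArray plots as a pcolormesh via xarray's plotting wrapper
cape_rain.plot(cmap="Blues")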