STAC item conventions

Tooling and format to use when creating STAC items for VEDA

Copied from veda-backend#28

This document defines a set of conventions for generating STAC Items consistently for the VEDA Dashboard UI and future API users. Ultimately, these represent the minimum metadata API users can expect from the backend.

Rio-stac conventions for generating STAC Items

All of our current ingestion plans will use rio-stac to generate item metadata for COGs so the notes below are organized around the input parameters of the create_stac_item method.

example rio-stac python usage

item = rio_stac.stac.create_stac_item(
  id = item_id,
  source = f"s3://{obj.bucket_name}/{obj.key}", 
  collection = collection_id, 
  input_datetime = <datetime.datetime>,
  with_proj = True,
  with_raster = True,
  asset_name = "cog_default",
  asset_roles = ["data", "layer"],
  asset_media_type = "image/tiff; application=geotiff; profile=cloud-optimized",
)

Rio-stac create item parameter recommendations These recommendations are for generating STAC Item metadata for collections intended for the dasboard and may not be applicable to all ARCO collections.

Parameter	Recommendations
id	(1) When STAC Item metadata is generated from a COG file, strip the full file extension from the filename for the item id. (2) When ids are not unique across collections, append the collection id to the item id. For example the no2-monthly and no2-monthly-diff COGs are stored with unique bucket prefixes but within the prefix all the filenames are the same, so the collection id is appended: `OMI_trno2_0.10x0.10_201604_Col3_V4` → `OMI_trno2_0.10x0.10_201604_Col3_V4-no2-monthly`).
with_proj	`True`. Generate projection extension metadata for the item for future ARCO datastore users.
with_raster	`True`. This will generate gdal statistics for every band in the COG—we use these to get the range of values for the full collection.
asset_name	A meaningful asset name for the default cloud optimized asset to be displayed on a map. `cog_default` is a placeholder—we need to choose and commit to an asset name for all collections. If not set, will default to `asset`. * TODO Decision: For items with many assets we should ingest all with appropriate keys and duplicate one preferred display asset as the default cog. We should be considering metadata conventions in pgstac-titiler
asset_roles	`["data", "layer"]` data is an appropriate role, we may also choose to add something like layer to indicate that the asset is optimized to be used as a map layer (stac specification for asset roles).
asset_media_type	`"image/tiff; application=geotiff; profile=cloud-optimized` (stac best practices for asset media type).
properties	CMIP6: TODO, CMR: TODO if we don’t store links to the original data, downstream users are not going to be able to pair STAC records with the versioned parent data in CMR

Data provenance convention

When adding STAC items that were derived from previously published data (such as CMR records), there are multiple ways to preserve the linkage between the item and the more complete source metadata. We should provide at a minimum metadata assets for any items derived from previously published data. Here are three examples from HLS:

metadata are assets The CMR properties question in the table above (how to refer the STAC Item to it’s CMR source metadata) could instead be solved by adding a metadata asset. This does not require creating a new extension for CMR, it just involves creating an asset from the CMR granule metadata which should be in the event context for CMR search driven ingests. The example below is from documentation for using HLS cloud optimized data.

"assets": {
  "metadata": {
    "href": "https://cmr.earthdata.nasa.gov/search/concepts/G2099379244-LPCLOUD.xml",
    "type": "application/xml"
    },
    "thumbnail": { ...}
}

stac-spec scieintific extension

"properties": {
   "sci:doi": "10.5067/HLS/HLSS30.002",
   ...
}

Item links to metadata Use a cite-as Item link to the DOI for the source data.

"links": [
  {
    "rel": "cite-as",
    "href": "https://doi.org/10.5067/HLS/HLSS30.002"
  },
  ...
]

STAC Item validation convention

We are producing pystac.items with rio-stac’s create_stac_item method and we should validate them before publishing them to s3. Testing found that it is possible to produce structurally sound but invalid STAC Items with create_stac_item.

The built in pystac validator on the pystac.item returned by create_stac_item can be used to easily validate the metadata—item.validate() will raise an exception for invalid metadata. Pystac does need to be installed with the appropriate dependencies for validation.

Convention for default map layer assets for spectral data

Many of the collections for the dashboard have a clear default map layer asset that we can name cog_default. This convention does not map as well to spectral data with many assets (B01, B02,…). A preferred band asset could be duplicated to define a default map layer asset to be consistent but this needs to be decided.