Catalog Ingestion
How to load metadata with our STAC API
STEP III: Create dataset definitions
The next step is to divide the data into logical collections. A collection is a set of data files that share the same properties, such as the quantity being measured, the periodicity, and the spatial region. For example, the current VEDA datasets no2-mean and no2-diff should be two different collections, because one measures the mean levels of nitrogen dioxide and the other the differences in observed levels. Likewise, datasets like no2-monthly and no2-yearly should be different collections because their periodicity differs.
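As a rough illustration of this grouping, the sketch below sorts file names into collections. It assumes a hypothetical naming scheme in which the collection id is the text before the final underscore (for example no2-monthly_201506.tif); `group_by_collection` is an illustrative helper, not part of the VEDA tooling.

```python
from collections import defaultdict


def group_by_collection(filenames):
    """Group file names by the text before the final underscore."""
    groups = defaultdict(list)
    for name in filenames:
        # Assumed naming scheme: <collection-id>_<timestamp>.tif
        collection_id = name.rpartition("_")[0]
        groups[collection_id].append(name)
    return dict(groups)


files = ["no2-monthly_201506.tif", "no2-monthly_201507.tif", "no2-yearly_2015.tif"]
for collection_id, members in group_by_collection(files).items():
    print(collection_id, members)
```

Files with a different periodicity end up in different groups, which mirrors the rule described above.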
Once you have logically grouped the datasets into collections, create a dataset definition for each collection. A dataset definition is a JSON file that contains metadata about the dataset and information on how to discover its files in the S3 bucket. An example is shown below:
lis-global-da-evap.json
{
  "collection": "lis-global-da-evap",
  "title": "Evapotranspiration - LIS 10km Global DA",
  "description": "Gridded total evapotranspiration (in kg m-2 s-1) from 10km global LIS with assimilation",
  "license": "CC0-1.0",
  "is_periodic": true,
  "time_density": "day",
  "spatial_extent": {
    "xmin": -179.95,
    "ymin": -59.45,
    "xmax": 179.95,
    "ymax": 83.55
  },
  "temporal_extent": {
    "startdate": "2002-08-02T00:00:00Z",
    "enddate": "2021-12-01T00:00:00Z"
  },
  "sample_files": [
    "s3://veda-data-store-staging/EIS/COG/LIS_GLOBAL_DA/Evap/LIS_Evap_200208020000.d01.cog.tif"
  ],
  "discovery_items": [
    {
      "discovery": "s3",
      "cogify": false,
      "upload": false,
      "dry_run": false,
      "prefix": "EIS/COG/LIS_GLOBAL_DA/Evap/",
      "bucket": "veda-data-store-staging",
      "filename_regex": "(.*)LIS_Evap_(.*).tif$",
      "datetime_range": "day"
    }
  ]
}
The following table describes what each of these fields means:
field | description | allowed value | example |
---|---|---|---|
collection | the id of the collection | lowercase letters with optional “-” delimiters | no2-monthly-avg |
title | a short human-readable title for the collection | string with 5-6 words | “Average NO2 measurements (Monthly)” |
description | a detailed description of the dataset | should include what the data is, what sensor was used to measure it, where the data was pulled or derived from, etc. | |
license | license for data use; the default open license is CC0-1.0 | SPDX license id | CC0-1.0 |
is_periodic | whether the data is periodic, i.e. whether the data files repeat at a uniform time interval | true or false | true |
time_density | the time step used to navigate the dataset in the dashboard | year, month, day, hour, minute, or null | |
spatial_extent | the spatial extent of the collection; a bounding box that includes all the data files in the collection | {"xmin": -180, "ymin": -90, "xmax": 180, "ymax": 90} | |
spatial_extent["xmin"] | left x coordinate of the spatial extent bounding box | -180 <= xmin <= 180; xmin < xmax | 23 |
spatial_extent["ymin"] | bottom y coordinate of the spatial extent bounding box | -90 <= ymin <= 90; ymin < ymax | -40 |
spatial_extent["xmax"] | right x coordinate of the spatial extent bounding box | -180 <= xmax <= 180; xmax > xmin | 150 |
spatial_extent["ymax"] | top y coordinate of the spatial extent bounding box | -90 <= ymax <= 90; ymax > ymin | 40 |
temporal_extent | temporal extent that covers all the data files in the collection | {"startdate": "2002-08-02T00:00:00Z", "enddate": "2021-12-01T00:00:00Z"} | |
temporal_extent["startdate"] | the start date of the dataset | ISO datetime that ends in Z | 2002-08-02T00:00:00Z |
temporal_extent["enddate"] | the end date of the dataset | ISO datetime that ends in Z | 2021-12-01T00:00:00Z |
sample_files | a list of S3 URLs for the sample files that go into the collection | [ "s3://veda-data-store-staging/no2-diff/no2-diff_201506.tif", "s3://veda-data-store-staging/no2-diff/no2-diff_201507.tif"] | |
discovery_items["discovery"] | where to discover the data from; S3 buckets and CMR are currently supported | s3 or cmr | s3 |
discovery_items["cogify"] | whether the file needs to be converted to a cloud optimized GeoTIFF (COG); false if it is already a COG | true or false | false |
discovery_items["upload"] | whether the file needs to be uploaded to the VEDA S3 bucket; false if it already exists in veda-data-store-staging | true or false | false |
discovery_items["dry_run"] | if set to true, the items go through the pipeline but are not actually published to the STAC catalog; useful for testing | true or false | false |
discovery_items["bucket"] | the S3 bucket that the data is uploaded to | any bucket that the data pipelines have access to | veda-data-store-staging, climatedashboard-data, or {any-public-bucket} |
discovery_items["prefix"] | within the S3 bucket, the prefix or path to the “folder” where the data files exist | any valid path within the bucket | EIS/COG/LIS_GLOBAL_DA/Evap/ |
discovery_items["filename_regex"] | a common filename pattern that all the files in the collection follow | a valid regular expression | (.*)LIS_Evap_(.*).cog.tif$ |
discovery_items["datetime_range"] | based on the naming convention in STEP I, the datetime range to be extracted from the filename | year, month, or day | year |
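The filename_regex and datetime_range fields can be tried out locally against a sample file before the definition is submitted. The sketch below uses Python's `re` module on the sample file from the lis-global-da-evap.json example. How the ingestion pipeline applies the pattern is not described here, so matching against the base name, the 12-digit YYYYMMDDHHMM timestamp layout (inferred from 200208020000), and the one-day window as a reading of "day" are all assumptions; `matches_collection` and `day_range` are hypothetical helpers.

```python
import re
from datetime import datetime, timedelta, timezone

# Values copied from the lis-global-da-evap.json example.
FILENAME_REGEX = r"(.*)LIS_Evap_(.*).tif$"
SAMPLE_FILE = (
    "s3://veda-data-store-staging/EIS/COG/LIS_GLOBAL_DA/Evap/"
    "LIS_Evap_200208020000.d01.cog.tif"
)


def matches_collection(s3_url: str, pattern: str = FILENAME_REGEX) -> bool:
    """Assumption: the pattern is applied to the file's base name."""
    basename = s3_url.rsplit("/", 1)[-1]
    return re.match(pattern, basename) is not None


def day_range(s3_url: str) -> tuple[datetime, datetime]:
    """Hypothetical helper: the one-day window that contains the file's timestamp.

    Assumes the file name carries a 12-digit YYYYMMDDHHMM timestamp.
    """
    stamp = re.search(r"\d{12}", s3_url).group(0)
    moment = datetime.strptime(stamp, "%Y%m%d%H%M").replace(tzinfo=timezone.utc)
    start = moment.replace(hour=0, minute=0)
    return start, start + timedelta(days=1)


print(matches_collection(SAMPLE_FILE))
start, end = day_range(SAMPLE_FILE)
print(start.isoformat(), end.isoformat())
```

Note that the unescaped `.` before `tif` matches any character; writing `\.tif$` would make the pattern stricter.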
Note: The steps after this are technical, so at this point open a PR on the veda-data GitHub repository and a member of the VEDA team will handle the publication process.
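Before opening the PR, the constraints listed in the field table can be checked locally. The following is a minimal sketch, not the validation that the VEDA pipeline performs: `check_definition` is a hypothetical helper, and the collection id pattern is one reading of the "lowercase letters with optional - delimiters" rule (digits are accepted because ids such as no2-monthly-avg contain them).

```python
import re
from datetime import datetime


def check_definition(definition: dict) -> list[str]:
    """Hypothetical helper: list the problems found in a dataset definition."""
    problems = []

    # Assumed reading of the collection id rule: lowercase words joined by "-".
    collection = str(definition.get("collection", ""))
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", collection):
        problems.append("collection: use lowercase letters with optional '-' delimiters")

    # Bounding box rules from the table.
    box = definition.get("spatial_extent", {})
    try:
        if not -180 <= box["xmin"] < box["xmax"] <= 180:
            problems.append("spatial_extent: need -180 <= xmin < xmax <= 180")
        if not -90 <= box["ymin"] < box["ymax"] <= 90:
            problems.append("spatial_extent: need -90 <= ymin < ymax <= 90")
    except KeyError as missing:
        problems.append(f"spatial_extent: missing key {missing}")

    # Every temporal_extent value must be an ISO datetime that ends in Z.
    for key, value in definition.get("temporal_extent", {}).items():
        try:
            datetime.strptime(str(value), "%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            problems.append(f"temporal_extent[{key!r}]: expected e.g. 2002-08-02T00:00:00Z")

    return problems


example = {
    "collection": "lis-global-da-evap",
    "spatial_extent": {"xmin": -179.95, "ymin": -59.45, "xmax": 179.95, "ymax": 83.55},
    "temporal_extent": {"startdate": "2002-08-02T00:00:00Z", "enddate": "2021-12-01T00:00:00Z"},
}
print(check_definition(example))
```

An empty list means that none of these checks found a problem. In practice the dictionary would come from `json.load` on the definition file.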
STEP IV: Publication
The publication process involves 3 steps:
- [VEDA] Publishing to the development STAC catalog: https://dev.openveda.cloud/api/stac
- [EIS] Reviewing the collection/items published to the dev STAC catalog
- [VEDA] Publishing to the staging STAC catalog: https://staging-stac.delta-backend.com
To use the VEDA Ingestion API to schedule ingestion and publication of the data, follow these steps:
1. Obtain credentials from a VEDA team member
Ask a VEDA team member to create Cognito credentials (username and password) for VEDA authentication.
2. Export username and password
export username="johndoe"
export password="xxxx"
3. Get token
# Required imports
import os
import requests

# Pull username and password from environment variables
username = os.environ.get("username")
password = os.environ.get("password")

# base url for the workflows api
# experimental / subject to change in the future
# DISCLAIMER: coming soon, not yet available
base_url = "https://dev.openveda.cloud/api/workflows"

# endpoint to get the token from
token_url = f"{base_url}/token"

# authentication credentials to be passed to the token_url
body = {
    "username": username,
    "password": password,
}

# request token
response = requests.post(token_url, data=body)
if not response.ok:
    raise Exception("Couldn't obtain the token. Make sure the username and password are correct.")
else:
    # get token from response
    token = response.json().get("AccessToken")
    # prepare headers for requests
    headers = {
        "Authorization": f"Bearer {token}"
    }
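`os.environ.get` returns `None` when a variable is unset, in which case the token request above fails with a less obvious error. A minimal sketch of a fail-fast check follows; `read_credentials` is a hypothetical helper, not part of the VEDA tooling.

```python
import os


def read_credentials(env=os.environ) -> tuple[str, str]:
    """Hypothetical helper: return (username, password) or raise if either is unset."""
    missing = [name for name in ("username", "password") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variable(s): {', '.join(missing)}")
    return env["username"], env["password"]
```

Calling `username, password = read_credentials()` in place of the two `os.environ.get` lines reports a missing export before any request is sent.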
4. Ingest the dataset
Then, use the code snippet below to publish the dataset.
# Required import for reading the dataset definition
import json

# url for dataset validation / publication
validate_url = f"{base_url}/dataset/validate"
publish_url = f"{base_url}/dataset/publish"

# prepare the body of the request
with open("dataset-definition.json") as f:
    body = json.load(f)

# Validate the data definition using the /validate endpoint
validation_response = requests.post(
    validate_url,
    headers=headers,
    json=body
)

# look at the response; raises an HTTPError if validation failed
validation_response.raise_for_status()

# If the validation is successful, publish the dataset using /publish endpoint
publish_response = requests.post(
    publish_url,
    headers=headers,
    json=body
)

if publish_response.ok:
    print("Success")
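The validate-then-publish sequence can also be wrapped in a function so that publication is only attempted after validation succeeds. This is a sketch under assumptions: `validate_and_publish` is a hypothetical helper, the endpoint paths are the ones used above, and the HTTP client is passed in as `post` (for example `requests.post`) so the flow can be exercised without network access.

```python
def validate_and_publish(body, base_url, headers, post):
    """Validate a dataset definition, then publish it if validation passes.

    `post` is any callable that accepts the same arguments as `requests.post`
    and returns an object with a `raise_for_status()` method.
    """
    validation = post(f"{base_url}/dataset/validate", headers=headers, json=body)
    validation.raise_for_status()  # stops here if the definition is rejected

    publication = post(f"{base_url}/dataset/publish", headers=headers, json=body)
    publication.raise_for_status()
    return publication
```

With the variables defined in the snippets above, a call would look like `validate_and_publish(body, base_url, headers, requests.post)`.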
Check the status of the execution
# the id of the execution
# should be available in the response of workflow execution request
execution_id = "xxx"

# url for execution status
# NOTE: workflow_execution_url is not defined in the snippets above;
# set it to the execution endpoint of the workflows API before running this
execution_status_url = f"{workflow_execution_url}/{execution_id}"

# make the request
response = requests.get(
    execution_status_url,
    headers=headers,
)

if response.ok:
    print(response.json())
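An execution may take some time to finish, so the status request usually has to be repeated. A minimal polling sketch follows. The response schema is not documented in this guide, so the `status` key and the terminal values `SUCCEEDED` and `FAILED` are assumptions to be adjusted to the actual payload; `wait_for_execution` and `fetch_status` are hypothetical names.

```python
import time


def wait_for_execution(fetch_status, interval=10.0, max_attempts=30,
                       terminal=("SUCCEEDED", "FAILED")):
    """Call fetch_status() until it reports a terminal status or attempts run out.

    `fetch_status` is any zero-argument callable that returns the decoded
    JSON payload of the status endpoint as a dict.
    """
    for attempt in range(max_attempts):
        payload = fetch_status()
        # Assumed schema: the payload carries its state under the "status" key.
        if payload.get("status") in terminal:
            return payload
        if attempt < max_attempts - 1:
            time.sleep(interval)
    raise TimeoutError(f"No terminal status after {max_attempts} attempts")
```

With the variables from the snippet above, `fetch_status` could be `lambda: requests.get(execution_status_url, headers=headers).json()`.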