File preparation

How to create valid cloud-optimized data files and upload them to the VEDA data store

STEP I: Prepare the data

VEDA supports inclusion of cloud optimized GeoTIFFs (COGs) to its data store.

Create sane Cloud-Optimized GeoTIFFs (COGs)

We often encounter issues like missing or wrong nodata value, missing coordinate-reference system, missing or wrong overviews - polluted by fill values or not conserving class values in categorical data, empty files, or artifacts in the data.

Discovering these issues early on (ideally before upload to our buckets) can save us all a lot of time.

A command-line tool for creating and validating COGs is rio-cogeo. The rio-cogeo documentation is a great starting reference point and have a very helpful guide on preparing COGs, too.

To inspect and validate your raster dataset use the following
```
rio cogeo info /path/to/file.tif
```
This will allow you to explore the size in rows and lines, understand the data type (e.g., byte, float, etc.), and verify how NoData is defined. Once you have explored the COG’s definitions you can validate it by using the following commands:
```
rio cogeo validate /path/to/file.tif
```
COG validation can be used to identify if any errors are found. The following steps can be used to resolve any errors.
If your raster contains empty pixels, make sure the nodata value is set correctly (check with rio cogeo info). The nodata value needs to be set before cloud-optimizing the raster, so overviews are computed from real data pixels only. Pro-tip: For floating-point rasters, using NaN for flagging nodata helps avoid roundoff errors later on.

You can set the nodata flag on a GeoTIFF in-place with:
```
rio edit_info --nodata 255 /path/to/file.tif
```
or in Python with
```
import rasterio

with rasterio.open("/path/to/file.tif", "r+") as ds:
    ds.nodata = 255
```
Note that this only changes the flag. If you want to change the actual value you have in the data, you need to create a new copy of the file where you change the pixel values.
Make sure the coordinate reference system is embedded in the COG (check with rio cogeo info)
When creating the COG, use the most appropriate resampling method for overviews. For example, use average for continuous / floating point data and mode for categorical / integer.
```
rio cogeo create --overview-resampling "mode" /path/to/input.tif /path/to/output.tif
```

Name your files correctly

Make sure that the COG filename is meaningful and contains the datetime associated with the COG in the following format. All the datetime values in the file should be preceded by the _ underscore character. Some examples are shown below:

Single datetime

Year data: nightlights_2012.tif, nightlights_2012-yearly.tif
Month data: nightlights_201201.tif, nightlights_2012-01_monthly.tif
Day data: nightlights_20120101day.tif, nightlights_2012-01-01_day.tif

Datetime range

Year data: nightlights_2012_2014.tif, nightlights_2012_year_2015.tif
Month data: nightlights_201201_201205.tif, nightlights_2012-01_month_2012-06_data.tif
Day data: nightlights_20120101day_20121221.tif, nightlights_2012-01-01_to_2012-12-31_day.tif

Note that the date/datetime value is always preceded by an _ (underscore).

STEP II: Upload to the VEDA data store

Once you have the COGs, obtain permissions (Cognito credentials) from the VEDA team to upload them to the veda-data-store-staging bucket.

Upload the data to a sensible location inside the bucket. Example: s3://veda-data-store-staging/<collection-id>/