Benchmarks 1: Tiling COGs with and without GDAL Environment Variables

Explanation

This notebook does not report any results for tiling Zarr datasets. Instead, it shows how strongly the underlying environment and the configuration of low-level libraries affect the performance of a framework that we compare against when tiling imagery.

titiler-pgstac creates image tiles using rio-tiler, which uses rasterio. Rasterio in turn uses GDAL “under the hood”. Certain GDAL environment variables affect tiling performance when rasterio reads data from Cloud-Optimized GeoTIFFs (COGs).

As noted in Benchmarking Methodology, the time to tile includes the time to query a pgSTAC database and then use the query ID returned to read and create image tiles from COGs on S3. The libraries used were pgSTAC for reading STAC metadata and rasterio (via rio-tiler) for reading COGs on S3.

Dataset Generation

All dataset generation code is in the tile-benchmarking repo’s cmip6-pgstac directory. The STAC collection is defined in CMIP6_daily_GISS-E2-1-G_tas_collection.json. The STAC item records for the CMIP6 COGs are generated in the 01-generate-datasets/cmip6-pgstac/generate_cmip6_items.ipynb notebook. The items are then seeded into the database via seed-db.sh.

Tests

Tests were run via the tile-benchmarking/02-run-tests/01-cog-gdal-tests.ipynb notebook.

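import warnings

import holoviews as hv
import hvplot.pandas  # registers the .hvplot accessor on DataFrames
import pandas as pd

pd.options.plotting.backend = 'holoviews'
warnings.filterwarnings('ignore')

# Base URL of the results CSVs (no trailing slash, so the f-string below does not produce "//").
git_url_path = "https://raw.githubusercontent.com/developmentseed/tile-benchmarking/main/02-run-tests/results-csvs"
df = pd.read_csv(f"{git_url_path}/01-cog-gdal-results.csv")
# Cast the flag to a string so it is treated as a categorical grouping key.
df['set_gdal_vars'] = df['set_gdal_vars'].astype(str)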
zooms = range(6)
cmap = ["#E1BE6A", "#40B0A6"]
plt_opts = {"width": 300, "height": 250}

plts = []

for zoom_level in zooms:
    # Plot only the rows for this zoom level, not the whole frame.
    df_level = df[df["zoom"] == zoom_level]
    plts.append(
        df_level.hvplot.box(
            y="time",
            by=["set_gdal_vars"],
            c="set_gdal_vars",
            cmap=cmap,
            ylabel="Time to render (ms)",
            xlabel="GDAL Environment Variables Set/Unset",
            legend=False,
            title=f"Zoom level {zoom_level}",
        ).opts(**plt_opts)
    )
hv.Layout(plts).cols(2)

Interpretation of the Results

  1. Setting these GDAL environment variables has a large effect on performance: tiling is about 100x faster with them set.
  2. Although not shown above, variation across tiles is not significant.
  3. Variation across zoom levels is not significant.
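
The observations above can also be checked numerically rather than read off the box plots. The following is a minimal sketch, assuming a results frame with the same zoom, set_gdal_vars and time columns that the plotting code uses. The helper name median_time_by_zoom is hypothetical and is not part of the benchmark repository.

```python
import pandas as pd


def median_time_by_zoom(results: pd.DataFrame) -> pd.DataFrame:
    """Median tile time per zoom level, one column per set_gdal_vars value.

    Hypothetical helper for illustration; assumes the columns "zoom",
    "set_gdal_vars" and "time" exist, as in the plotting code.
    """
    return (
        results.groupby(["zoom", "set_gdal_vars"])["time"]
        .median()
        .unstack("set_gdal_vars")
    )
```

Calling median_time_by_zoom(df) on the results frame gives one row per zoom level. Dividing the column for the runs without the variables by the column for the runs with them gives a per-zoom speedup factor, which can be compared with the figure quoted in the first observation.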

These GDAL variables are documented here: https://developmentseed.org/titiler/advanced/performance_tuning/.

By setting the GDAL environment variables we limit the total number of requests to S3. Specifically, with these environment variables set:

  • All of the metadata can usually be read in 1 request. This is not guaranteed, but it becomes more likely because we increase the number of bytes GDAL ingests when it first opens a file.
  • There are no extra LIST requests which GDAL uses to discover sidecar files. COGs don’t have sidecar files.
  • Consecutive range requests are merged into 1 request.
  • Multiple range requests use the same TCP connection.
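
As an illustration of the four effects listed above, here is a minimal sketch of how such variables could be set for a block of code. The variable names are real GDAL configuration options described in the titiler performance tuning guide, but the specific values, the GDAL_COG_ENV dictionary and the gdal_env helper are assumptions for illustration and may differ from what the benchmark notebook actually used.

```python
import os
from contextlib import contextmanager

# Assumed values, modeled on the titiler performance tuning guide.
# The benchmark itself may have used different settings.
GDAL_COG_ENV = {
    # Read more bytes when a file is first opened, so the COG header
    # is more likely to arrive in a single request.
    "GDAL_INGESTED_BYTES_AT_OPEN": "32768",
    # Skip the directory listing GDAL would use to look for sidecar files.
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    # Merge consecutive range requests into one request.
    "GDAL_HTTP_MERGE_CONSECUTIVE_RANGES": "YES",
    # Send multiple range requests over one connection (requires HTTP/2).
    "GDAL_HTTP_MULTIPLEX": "YES",
    "GDAL_HTTP_VERSION": "2",
}


@contextmanager
def gdal_env(overrides):
    """Temporarily set environment variables, restoring the old values on exit.

    Hypothetical helper, not part of rasterio or titiler.
    """
    previous = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old_value in previous.items():
            if old_value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old_value
```

Code that opens COGs would then run inside `with gdal_env(GDAL_COG_ENV):`. When working with rasterio directly, the same options can instead be passed as keyword arguments to rasterio.Env, which scopes them to the GDAL session without touching the process environment.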