Benchmarks 2: Tile Generation Benchmarks across Data Formats

Explanation

This page shares results from benchmarking tile generation for CMIP6 data stored as COG, NetCDF and Zarr.

The intention is to understand the performance trade-offs between these data formats. These results should not be considered conclusive, as further library and caching improvements may be made in the future.

To tile the NetCDF files, we use a kerchunk reference file. The ZarrReader can read NetCDF files without a kerchunk reference file, but it cannot read more than one file at once, which makes that approach incomparable with the pgSTAC+COG and Zarr methods.
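For readers unfamiliar with kerchunk, the following is a minimal sketch of how a reference file can be opened as a single dataset with xarray and fsspec. It is not the code path used by the tile server in these benchmarks. The helper names and the reference file name are hypothetical, anonymous S3 access is an assumption, and the exact arguments may differ depending on the installed versions of xarray, zarr, fsspec and s3fs.

```python
def reference_open_kwargs(reference_path, anonymous=True):
    """Build keyword arguments for xarray.open_dataset and a kerchunk reference.

    `reference_path` is a hypothetical path or URL to a kerchunk JSON file.
    """
    return {
        "engine": "zarr",
        "backend_kwargs": {
            # Kerchunk references typically carry no consolidated metadata.
            "consolidated": False,
            "storage_options": {
                "fo": reference_path,       # the reference file itself
                "remote_protocol": "s3",    # where the referenced NetCDF bytes live
                "remote_options": {"anon": anonymous},  # assumption: public bucket
            },
        },
    }


def open_reference(reference_path):
    """Open every NetCDF file listed in the reference as one xarray dataset."""
    import xarray as xr  # imported lazily; also requires fsspec, s3fs and zarr

    return xr.open_dataset("reference://", **reference_open_kwargs(reference_path))
```

Calling `open_reference("combined-reference.json")` (a placeholder file name) would return one dataset spanning all referenced files, which is what allows the NetCDF data to be compared with the other two methods.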

Dataset Generation

The test datasets produced and benchmarked are:

  1. Cloud-Optimized GeoTiffs (COGs): Tiles are produced using the publicly available CMIP6 COGs in the s3://nex-gddp-cmip6-cog/ bucket. The metadata generation is documented in 01-cmip6-cog-tile-server-benchmarks.ipynb.
  2. kerchunk + netCDF: Tiles are produced using a kerchunk reference file generated for the NetCDF files stored in the s3://nex-gddp-cmip6 bucket. The code to produce the kerchunk reference is in the tile-benchmarking repo: 01-generate-datasets/generate-cmip6-kerchunk.ipynb.
  3. Zarr: Tiles are produced using a Zarr store with the same chunking configuration as the underlying NetCDFs. The code to produce the Zarr store is in the tile-benchmarking repo: 01-generate-datasets/generate-cmip6-zarr.ipynb.

Tests

Tests were run via the tile-benchmarking/02-run-tests/02-cog-kerchunk-zarr.ipynb notebook.

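import warnings

import holoviews as hv
import hvplot.pandas  # noqa: F401 (registers the .hvplot accessor on DataFrames)
import pandas as pd

pd.options.plotting.backend = "holoviews"
warnings.filterwarnings("ignore")  # silence library warnings in the rendered page

# Benchmark results are published as CSV files in the tile-benchmarking repo.
# No trailing slash here, so the f-string below does not produce a double slash.
git_url_path = "https://raw.githubusercontent.com/developmentseed/tile-benchmarking/main/02-run-tests/results-csvs"
df = pd.read_csv(f"{git_url_path}/02-cog-kerchunk-zarr-results.csv")

zooms = range(6)  # zoom levels 0 through 5
cmap = ["#E1BE6A", "#40B0A6", "#0C7BDC"]  # one color per data format
plt_opts = {"width": 300, "height": 250}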

# Build one box plot of render times per zoom level, grouped by data format.
plts = []

for zoom_level in zooms:
    df_level = df[df["zoom"] == zoom_level]
    plts.append(
        df_level.hvplot.box(
            y="time",
            by=["data_format"],
            c="data_format",
            cmap=cmap,
            ylabel="Time to render (ms)",
            xlabel="Data Format",
            legend=False,
            title=f"Zoom level {zoom_level}",
        ).opts(**plt_opts)
    )

# Arrange the per-zoom plots in a two-column grid.
hv.Layout(plts).cols(2)
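Box plots can be hard to compare across panels, so a numeric summary may help. The sketch below computes the median of the `time` column per zoom level and data format, using the same column names as the plotting code. The function name `summarize_times` is hypothetical, and the small example frame holds placeholder rows only, not benchmark results.

```python
import pandas as pd


def summarize_times(results: pd.DataFrame) -> pd.DataFrame:
    """Median of `time` with zoom levels as rows and data formats as columns."""
    return (
        results.groupby(["zoom", "data_format"])["time"]
        .median()
        .unstack("data_format")
    )


# Placeholder rows to show the shape of the input; NOT real measurements.
example = pd.DataFrame(
    {
        "zoom": [0, 0, 0, 0, 1, 1],
        "data_format": ["a", "a", "b", "b", "a", "b"],
        "time": [10.0, 30.0, 5.0, 7.0, 40.0, 8.0],
    }
)
summary = summarize_times(example)
print(summary)
```

Applied to the real data, `summarize_times(df)` would give one row per zoom level and one column per data format.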

Interpretation of the Results

  • Tiling COGs is faster than tiling the Zarr store or the kerchunk reference at all zoom levels.
  • The kerchunk reference performs better than the Zarr store. When interpreting this, keep in mind that the chunking is the same in both cases: although each NetCDF file stores 365 time steps (days), it is chunked by day.
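To make "chunked by day" concrete, here is a back-of-the-envelope sketch of the resulting chunk layout for one file. The grid shape (600 x 1440) and the 4-byte data type are assumptions about the NetCDF files, not values stated on this page, and `chunk_layout` is a hypothetical helper; substitute the shape and dtype reported by the actual dataset.

```python
def chunk_layout(n_time, n_lat, n_lon, time_chunk=1, itemsize=4):
    """Return (number of chunks, uncompressed bytes per chunk).

    Assumes chunks split the time axis only and span the full spatial grid.
    """
    n_chunks = -(-n_time // time_chunk)  # ceiling division
    bytes_per_chunk = time_chunk * n_lat * n_lon * itemsize
    return n_chunks, bytes_per_chunk


# 365 daily time steps per file (from the text); grid shape and dtype are assumed.
n_chunks, chunk_bytes = chunk_layout(n_time=365, n_lat=600, n_lon=1440)
print(f"{n_chunks} chunks of {chunk_bytes / 2**20:.2f} MiB each (uncompressed)")
```

Under these assumptions, a tile for a single day needs one chunk, whether that chunk is reached through the kerchunk reference or through the Zarr store.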