Benchmarking: AWS Region

import hvplot
import hvplot.pandas  # noqa
import pandas as pd
import statsmodels.formula.api as smf

pd.options.plotting.backend = "holoviews"

Read summary of all benchmarking results.

summary = pd.read_parquet("s3://carbonplan-benchmarks/benchmark-data/v0.2/summary.parq")

Subset the data to isolate the impact of AWS region and chunk size, holding the projection (3857), pixels per tile (128), and shard size (0) fixed.

df = summary[
    (summary["projection"] == 3857)
    & (summary["pixels_per_tile"] == 128)
    & (summary["shard_size"] == 0)
]
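Before plotting, it can help to confirm that the filter held the other variables constant and to see how many runs fall in each region and chunk-size group. The following is a minimal sketch on a made-up stand-in for `summary`: the column names come from the notebook, but the region label `region-a`, the chunk sizes, and the durations are placeholders, not benchmark results.

```python
import pandas as pd

# Hypothetical stand-in for the benchmark summary; all values are illustrative.
summary = pd.DataFrame(
    {
        "projection": [3857, 3857, 3857, 3857, 4326, 3857],
        "pixels_per_tile": [128, 128, 128, 128, 128, 256],
        "shard_size": [0, 0, 0, 0, 0, 0],
        "region": ["region-a", "us-west-2", "region-a", "us-west-2", "region-a", "us-west-2"],
        "actual_chunk_size": [1.0, 1.0, 5.0, 5.0, 1.0, 5.0],
        "duration": [1900.0, 1850.0, 2100.0, 2080.0, 2000.0, 2500.0],
    }
)

# Same filter as in the notebook.
df = summary[
    (summary["projection"] == 3857)
    & (summary["pixels_per_tile"] == 128)
    & (summary["shard_size"] == 0)
]

# Each held-constant column should contain exactly one distinct value.
fixed_columns = ["projection", "pixels_per_tile", "shard_size"]
n_unique = df[fixed_columns].nunique()

# Number of runs in each region / chunk-size group.
counts = df.groupby(["region", "actual_chunk_size"]).size()

print(n_unique)
print(counts)
```

Roughly equal counts across groups make the later comparison between regions easier to interpret.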

Set plot options.

cmap = ["#FFC20A", "#0C7BDC"]
plt_opts = {"width": 600, "height": 400}

Create a box plot showing how the rendering time depends on the AWS region and chunk size.

df.hvplot.box(
    y="duration",
    by=["actual_chunk_size", "region"],
    c="region",
    cmap=cmap,
    ylabel="Time to render (ms)",
    xlabel="Chunk size (MB); AWS region",
    legend=False,
).opts(**plt_opts)
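As a numerical companion to the box plot, the quartiles that each box summarizes can be tabulated per group. This sketch uses a small made-up DataFrame with the notebook's column names; the region label `region-a` and every value are placeholders rather than benchmark results.

```python
import pandas as pd

# Hypothetical data with the same columns as the filtered benchmark frame.
df = pd.DataFrame(
    {
        "region": ["region-a"] * 4 + ["us-west-2"] * 4,
        "actual_chunk_size": [1.0, 1.0, 5.0, 5.0] * 2,
        "duration": [1900.0, 2000.0, 2100.0, 2300.0, 1800.0, 1900.0, 2000.0, 2200.0],
    }
)

# Lower quartile, median, and upper quartile of duration for each group,
# grouped in the same order as the box plot's `by` argument.
quartiles = (
    df.groupby(["actual_chunk_size", "region"])["duration"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(quartiles)
```

Each row corresponds to one box, and the three columns give the lower quartile, median, and upper quartile.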

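Fit a multiple linear regression of rendering time on chunk size and AWS region. Chunk size strongly affects rendering time: each additional MB of chunk size adds about 52 ms (95% CI 46.8 to 56.9, p < 0.001), so datasets with larger chunks take longer to render. The AWS region has no detectable effect: the us-west-2 coefficient is about -54 ms relative to the baseline region, but it is not statistically significant (p = 0.234) and its 95% confidence interval spans zero.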

model = smf.ols("duration ~ actual_chunk_size + C(region)", data=df).fit()
model.summary()
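                                OLS Regression Results
=====================================================================================
Dep. Variable:                duration   R-squared:                      0.446
Model:                             OLS   Adj. R-squared:                 0.444
Method:                  Least Squares   F-statistic:                    205.1
Date:                 Tue, 29 Aug 2023   Prob (F-statistic):          4.58e-66
Time:                         20:28:30   Log-Likelihood:               -3916.0
No. Observations:                  512   AIC:                            7838.
Df Residuals:                      509   BIC:                            7851.
Df Model:                            2
Covariance Type:             nonrobust
=====================================================================================
                             coef   std err         t     P>|t|     [0.025     0.975]
-------------------------------------------------------------------------------------
Intercept               1859.2163    40.422    45.995     0.000   1779.801   1938.631
C(region)[T.us-west-2]   -53.6344    44.989    -1.192     0.234   -142.021     34.752
actual_chunk_size         51.8170     2.563    20.221     0.000     46.782     56.852
=====================================================================================
Omnibus:                        22.416   Durbin-Watson:                  1.979
Prob(Omnibus):                   0.000   Jarque-Bera (JB):              12.956
Skew:                            0.227   Prob(JB):                     0.00154
Kurtosis:                        2.367   Cond. No.                        31.2
=====================================================================================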


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
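
To make the fitted coefficients concrete, the regression equation can be evaluated by hand. The sketch below copies the three coefficients from the regression output above; the helper `predicted_duration` and the 10 MB chunk size are illustrative choices, not part of the original analysis, and the region offset is applied even though it is not statistically significant.

```python
# Coefficients copied from the regression output above.
INTERCEPT = 1859.2163         # ms, baseline region at chunk size 0
CHUNK_SIZE_SLOPE = 51.8170    # ms per MB of actual_chunk_size
US_WEST_2_OFFSET = -53.6344   # ms, relative to the baseline region (p = 0.234)


def predicted_duration(chunk_size_mb: float, in_us_west_2: bool) -> float:
    """Evaluate the fitted linear model (hypothetical helper for illustration)."""
    offset = US_WEST_2_OFFSET if in_us_west_2 else 0.0
    return INTERCEPT + CHUNK_SIZE_SLOPE * chunk_size_mb + offset


# Illustrative chunk size of 10 MB in each region.
baseline = predicted_duration(10.0, in_us_west_2=False)
west = predicted_duration(10.0, in_us_west_2=True)

print(f"baseline region: {baseline:.1f} ms")
print(f"us-west-2:       {west:.1f} ms")
```

The gap between the two predictions equals the region coefficient, which is small compared with the roughly 518 ms contributed by 10 MB of chunk size.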