Benchmarking: Shard Size

import hvplot
import hvplot.pandas  # noqa
import pandas as pd
import statsmodels.formula.api as smf

pd.options.plotting.backend = "holoviews"

Read the summary of all benchmarking results.

summary = pd.read_parquet("s3://carbonplan-benchmarks/benchmark-data/v0.2/summary.parq")
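
Before filtering, it can help to confirm that the columns used below are present and to see their ranges. A minimal sketch, assuming the summary contains the columns referenced later in this notebook:

# Inspect the columns used in this analysis (assumes they exist in the summary)
cols = ["projection", "pixels_per_tile", "region", "actual_chunk_size", "shard_size", "duration"]
summary[cols].describe(include="all")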

Subset the data to isolate the impact of chunk size and shard size, holding the projection, tile size, and region fixed and keeping only sharded datasets.

df = summary[
    (summary["projection"] == 4326)
    & (summary["pixels_per_tile"] == 128)
    & (summary["shard_size"] > 0)
    & (summary["region"] == "us-west-2")
]
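
As a sanity check, count how many benchmark runs fall into each chunk size and shard size combination before plotting. A short sketch using the columns from the subset above:

# Count benchmark runs per chunk size / shard size combination
df.groupby(["actual_chunk_size", "shard_size"]).size().rename("n_runs")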

Create a box plot showing how the rendering time depends on chunk and shard size.

df.hvplot.box(
    y="duration",
    by=["actual_chunk_size", "shard_size"],
    c="shard_size",
    cmap=["#FEFE62", "#D35FB7"],
    ylabel="Time to render (ms)",
    xlabel="Chunk size (MB); Target shard size (MB)",
    legend=False,
).opts(width=600, height=400)

Fit a multiple linear regression to the results. The fit shows that chunk size strongly affects rendering time: datasets with larger chunks take longer to render (roughly 71 ms per additional MB of chunk size, p < 0.001). Shard size has no statistically significant effect on rendering time (p ≈ 0.75).

model = smf.ols("duration ~ actual_chunk_size + shard_size", data=df).fit()
model.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:               duration   R-squared:                       0.398
Model:                            OLS   Adj. R-squared:                  0.393
Method:                 Least Squares   F-statistic:                     83.71
Date:                Tue, 29 Aug 2023   Prob (F-statistic):           1.25e-28
Time:                        20:30:46   Log-Likelihood:                -2063.2
No. Observations:                 256   AIC:                             4132.
Df Residuals:                     253   BIC:                             4143.
Df Model:                           2
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept          2446.8248    161.251     15.174      0.000    2129.259    2764.391
actual_chunk_size    70.9110      5.482     12.935      0.000      60.115      81.707
shard_size            0.6188      1.925      0.321      0.748      -3.172       4.409
==============================================================================
Omnibus:                       31.868   Durbin-Watson:                   2.302
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               39.758
Skew:                          -0.911   Prob(JB):                     2.33e-09
Kurtosis:                       3.637   Cond. No.                         267.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
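
The conclusion above can also be read directly from the fitted model without the full summary table. A minimal sketch using standard statsmodels results attributes; the p-value on shard_size (about 0.75) is consistent with shard size having no measurable effect on rendering time:

# Coefficients, p-values, and 95% confidence intervals from the fitted regression
print(model.params)
print(model.pvalues)
print(model.conf_int())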