import hvplot
import hvplot.pandas # noqa
import pandas as pd
import statsmodels.formula.api as smf
pd.options.plotting.backend = "holoviews"
Benchmarking: Shard Size
Read the summary of all benchmarking results.
summary = pd.read_parquet("s3://carbonplan-benchmarks/benchmark-data/v0.2/summary.parq")
Subset the data to isolate the impact of chunk and shard size.
df = summary[
    (summary["projection"] == 4326)
    & (summary["pixels_per_tile"] == 128)
    & (summary["shard_size"] > 0)
    & (summary["region"] == "us-west-2")
]
Create a box plot showing how the rendering time depends on chunk and shard size.
df.hvplot.box(
    y="duration",
    by=["actual_chunk_size", "shard_size"],
    c="shard_size",
    cmap=["#FEFE62", "#D35FB7"],
    ylabel="Time to render (ms)",
    xlabel="Chunk size (MB); Target shard size (MB)",
    legend=False,
).opts(width=600, height=400)
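As a numeric companion to the box plot, the same grouping can be summarized as a table of medians and quartiles. The sketch below runs on a small invented data frame (the `toy` values are made up for illustration and are not benchmark results); only the column names are taken from the notebook.

```python
import pandas as pd

# Toy data with the notebook's column names; the values are invented.
toy = pd.DataFrame(
    {
        "actual_chunk_size": [1, 1, 1, 1, 25, 25, 25, 25],
        "shard_size": [50, 50, 100, 100, 50, 50, 100, 100],
        "duration": [2400.0, 2600.0, 2500.0, 2700.0, 4100.0, 4300.0, 4200.0, 4400.0],
    }
)

# Group by the same keys the box plot uses on its x-axis.
grouped = toy.groupby(["actual_chunk_size", "shard_size"])["duration"]

# Median render time per (chunk size, shard size) combination.
medians = grouped.median()

# Quartiles per group: one row per combination, one column per quantile.
quartiles = grouped.quantile([0.25, 0.5, 0.75]).unstack()

print(medians)
print(quartiles)
```

Applying the same `groupby` to `df` instead of `toy` should give the values that the boxes in the plot are drawn from.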
Fit a multiple linear regression to the results. Chunk size strongly affects the time to render: datasets with larger chunks take longer, with each additional MB of chunk size adding roughly 71 ms (p < 0.001). Shard size has no detectable effect on rendering time (p = 0.748).
model = smf.ols("duration ~ actual_chunk_size + shard_size", data=df).fit()
model.summary()
Dep. Variable: | duration | R-squared: | 0.398 |
Model: | OLS | Adj. R-squared: | 0.393 |
Method: | Least Squares | F-statistic: | 83.71 |
Date: | Tue, 29 Aug 2023 | Prob (F-statistic): | 1.25e-28 |
Time: | 20:30:46 | Log-Likelihood: | -2063.2 |
No. Observations: | 256 | AIC: | 4132. |
Df Residuals: | 253 | BIC: | 4143. |
Df Model: | 2 | |
Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 2446.8248 | 161.251 | 15.174 | 0.000 | 2129.259 | 2764.391 |
actual_chunk_size | 70.9110 | 5.482 | 12.935 | 0.000 | 60.115 | 81.707 |
shard_size | 0.6188 | 1.925 | 0.321 | 0.748 | -3.172 | 4.409 |
Omnibus: | 31.868 | Durbin-Watson: | 2.302 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 39.758 |
Skew: | -0.911 | Prob(JB): | 2.33e-09 |
Kurtosis: | 3.637 | Cond. No. | 267. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
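To make the coefficients concrete, here is a minimal sketch that plugs the fitted values from the summary table into the regression equation. It assumes duration is in ms and sizes are in MB, as in the plot labels. The chunk and shard sizes passed in are hypothetical configurations chosen for illustration, and `predict_duration_ms` is a helper introduced here, not part of the notebook.

```python
# Coefficients copied from the OLS summary table above.
INTERCEPT = 2446.8248
COEF_CHUNK = 70.9110  # ms per MB of chunk size
COEF_SHARD = 0.6188   # ms per MB of shard size


def predict_duration_ms(chunk_mb: float, shard_mb: float) -> float:
    """Predicted render time from the fitted linear model."""
    return INTERCEPT + COEF_CHUNK * chunk_mb + COEF_SHARD * shard_mb


# Hypothetical configurations (not taken from the benchmark data).
baseline = predict_duration_ms(chunk_mb=5, shard_mb=50)
bigger_chunk = predict_duration_ms(chunk_mb=25, shard_mb=50)
bigger_shard = predict_duration_ms(chunk_mb=5, shard_mb=100)

# Adding 20 MB of chunk size adds about 1418 ms to the prediction.
chunk_effect = bigger_chunk - baseline
# Adding 50 MB of shard size adds only about 31 ms.
shard_effect = bigger_shard - baseline

# The reported t statistics are each coefficient divided by its standard error.
t_chunk = COEF_CHUNK / 5.482
t_shard = COEF_SHARD / 1.925

print(f"chunk effect: {chunk_effect:.2f} ms, shard effect: {shard_effect:.2f} ms")
print(f"t(chunk) = {t_chunk:.3f}, t(shard) = {t_shard:.3f}")
```

The shard-size confidence interval spans zero (-3.172 to 4.409), so even the small predicted shard effect is not distinguishable from no effect. With an R-squared of 0.398, the model also leaves most of the variation in render time unexplained.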