Downsample zarr

Demonstrates downsampling zarr using xarray and dask with data from SMAP

Author

Aimee Barciauskas

Published

January 3, 2023

Run this notebook

You can launch this notebook in VEDA JupyterHub by clicking the link below.

Launch in VEDA JupyterHub (requires access)

Learn more

Inside the Hub

This notebook was written on the VEDA JupyterHub and as such is designed to be run on a jupyterhub which is associated with an AWS IAM role which has been granted permissions to the VEDA data store via its bucket policy. The instance used provided 16GB of RAM.

See (VEDA Analytics JupyterHub Access)[https://nasa-impact.github.io/veda-docs/veda-jh-access.html] for information about how to gain access.

Outside the Hub

The data is in a protected bucket. Please request access by emailng aimee@developmentseed.org or alexandra@developmentseed.org and providing your affiliation, interest in or expected use of the dataset and an AWS IAM role or user Amazon Resource Name (ARN). The team will help you configure the cognito client.

You should then run:

%run -i 'cognito_login.py'

Approach

This notebook demonstrates 2 strategies for to subselect data from a Zarr dataset in order to visualize using the memory of a notebook.

Downsample the temporal resolution of the data using xarray.DataArray.resample
Coarsening the spatial aspect of the data using xarray.DataArray.coarsen

A strategy for visualizing any large amount of data is Datashader which bins data into a fixed 2-D array. The call to rasterize ensures the use of the datashader library to bin the data.

About the data

The SMAP mission is an orbiting observatory that measures the amount of water in the surface soil everywhere on Earth.

Load libraries

!pip install -q hvplot==0.8.4

import xarray as xr
import hvplot.xarray
import geoviews as gv
import datashader as dsh
from holoviews.operation.datashader import rasterize

gv.output(size=300)

Optional: Create and Scale a Dask Cluster

We create a separate Dask cluster to speed up reprojecting the data (and other potential computations which could be required and are parallelizable).

Note if you skip this cell you will still be using Dask, you’ll just be using the machine where you are running this notebook.

from dask_gateway import GatewayCluster, Gateway

gateway = Gateway()
clusters = gateway.list_clusters()

# connect to an existing cluster - this is useful when the kernel shutdown in the middle of an interactive session
if clusters:
    cluster = gateway.connect(clusters[0].name)
else:
    cluster = GatewayCluster(shutdown_on_close=True)

cluster.scale(16)
client = cluster.get_client()
client

Client

Client-2c42768f-d18f-11ee-89f9-02b3484b28f4

Connection method: Cluster object	Cluster type: dask_gateway.GatewayCluster
Dashboard: /services/dask-gateway/clusters/prod.ce053103b0e44d4d9c26bd8f50584072/status

Cluster Info

GatewayCluster

Name: prod.ce053103b0e44d4d9c26bd8f50584072
Dashboard: /services/dask-gateway/clusters/prod.ce053103b0e44d4d9c26bd8f50584072/status

Open the dataset from S3

ds = xr.open_zarr("s3:///veda-data-store-staging/EIS/zarr/SPL3SMP.zarr")
ds

<xarray.Dataset>
Dimensions:                        (northing_m: 406, easting_m: 964,
                                    datetime: 1679)
Coordinates:
  * datetime                       (datetime) datetime64[ns] 2018-01-01 ... 2...
  * easting_m                      (easting_m) float64 -1.735e+07 ... 1.735e+07
  * northing_m                     (northing_m) float64 7.297e+06 ... -7.297e+06
Data variables: (12/26)
    albedo                         (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    albedo_pm                      (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    bulk_density                   (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    bulk_density_pm                (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    clay_fraction                  (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    clay_fraction_pm               (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    ...                             ...
    static_water_body_fraction     (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    static_water_body_fraction_pm  (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    surface_flag                   (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    surface_flag_pm                (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    surface_temperature            (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
    surface_temperature_pm         (northing_m, easting_m, datetime) float32 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>

Indexes: (3)

datetime

PandasIndex

PandasIndex(DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10',
               ...
               '2022-08-31', '2022-09-01', '2022-09-02', '2022-09-03',
               '2022-09-04', '2022-09-05', '2022-09-06', '2022-09-07',
               '2022-09-08', '2022-09-09'],
              dtype='datetime64[ns]', name='datetime', length=1679, freq=None))

easting_m

PandasIndex

PandasIndex(Index([      -17349514.34,       -17313482.12,        -17277449.9,
             -17241417.68,       -17205385.46,       -17169353.24,
             -17133321.02,        -17097288.8,       -17061256.58,
             -17025224.36,
       ...
       17025223.540000003,        17061255.76,        17097287.98,
               17133320.2, 17169352.419999998, 17205384.640000004,
       17241416.860000003, 17277449.080000002,         17313481.3,
              17349513.52],
      dtype='float64', name='easting_m', length=964))

northing_m

PandasIndex

PandasIndex(Index([        7296524.72,          7260492.5,  7224460.279999999,
               7188428.06,         7152395.84,         7116363.62,
        7080331.399999999,         7044299.18,         7008266.96,
               6972234.74,
       ...
       -6972234.400000001,        -7008266.62, -7044298.840000001,
       -7080331.060000001,        -7116363.28, -7152395.500000001,
       -7188427.720000002,        -7224459.94, -7260492.160000001,
              -7296524.38],
      dtype='float64', name='northing_m', length=406))

Attributes: (0)

Downsample zarr

Run this notebook

Inside the Hub

Outside the Hub

Approach

About the data

Load libraries

Optional: Create and Scale a Dask Cluster

Client

Cluster Info

GatewayCluster

Open the dataset from S3

Strategy 1: Downsample the temporal resolution of the data

Quick plot

Reproject before plotting

Strategy 2: Coarsen spatial resolution of the data

Cleanup