Getting Started

This guide will help you get going with the ML pipeline. We use template notebooks containing code with explanatory markdown to demonstrate how pipeline modules can be implemented in geospatial computer vision workflows. The following diagram describes the pipeline architecture.

MIRO board source

Proficiency with Python and some prior knowledge of the key tools and data formats are expected. We recommend the following resources for coming up to speed:

  • Introduction to Geospatial Raster and Vector Data with Python

    • We will work substantially with raster data in our templates. For example, some of our templates use Synthetic Aperture Radar (SAR) satellite imagery. The Cloud Optimized GeoTIFF (COG) format is worth particular attention, as it allows for dynamic subsetting and reading of spatial raster data. Occasionally, we will work with vector data as well. This series of lessons teaches the key fundamentals of working with these data types in Python.

  • The SpatioTemporal Asset Catalog (STAC) specification

    • Our templates source data from archives cataloged in STAC format.

  • PyTorch Lightning

    • This is the ML framework that we use in our templates.

  • Xarray

    • We use Xarray data cubes to read and store multi-dimensional data.

  • Dask

    • We use Dask for efficient parallel compute. It pairs nicely with Xarray for parallelization across chunked arrays.

  • Dask and Xarray (by way of rioxarray) in practice

  • Weights and Biases

    • We use this tool to log and visualize graphs and plots of data and/or model artifacts.

  • Segmentation Models Pytorch (SMP)

    • SMP is a PyTorch model library that we import backbone segmentation models from.

  • Hydra

    • Hydra is our training experiment configuration tool.

Now, let’s jump into the content you will interact with here. We have provided a set of templates/lessons that we recommend going through in the following order:

  1. Data Pipelines with TorchData. This template will demonstrate the following:

    • Request a temporally and geographically constrained subset of Harmonized Landsat Sentinel-2 (HLS) images from a STAC API, and pair it with GeoJSON vector polygon labels for a semantic segmentation task.

    • Load the STAC items into Xarray data arrays using PySTAC and stackstac

    • Slice the multi-temporal Xarray data arrays along the temporal dimensions into tiles using Xbatcher

    • Create iterables for the image tiles and integer labels, and then zip those together and collate them as Torch tensors

    • Use the zipped image and label tensors in a Torch datapipe

    • Partition the dataset

    • Create dataloaders for each partition

    • Visualize the datapipe and plot samples from it

  2. Placeholder for dataset versioning using DVC guide. We will use Data Version Control to log any changes made to our ML dataset.

  3. Model Training with PyTorch Lightning. We will use Segmentation Models Pytorch to supply a backbone model architecture.

    • Modularize the data generation code from lesson 1 and create backbone and custom models abiding by PyTorch Lightning's LightningModule structure (a subclass of nn.Module).

  4. Experiment Configuration with Hydra.

    • Execute multiple training experiments leveraging different combinations of tunable hyperparameters defined in Hydra’s configuration structure.

    • Model metrics will be logged using Weights and Biases integration.

  5. PyTorch Lightning + Hydra based evaluation notebook

    • Generate confusion matrices, precision, recall and F1 scores.

  6. Placeholder for model versioning using HuggingFace. We will use HuggingFace to publish the final model weights.
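The evaluation notebook in the list above reports confusion matrices, precision, recall and F1 scores. As a rough illustration of how those quantities relate, here is a minimal plain-Python sketch. It is not the notebook's code: the function names and the tiny label lists are hypothetical, and a real workflow would more likely use a metrics library on flattened prediction masks.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Build a matrix where rows index the true class and columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_scores(matrix):
    """Return {class_index: (precision, recall, f1)} from a square confusion matrix."""
    n = len(matrix)
    scores = {}
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # predicted c, truly another class
        fn = sum(matrix[c]) - tp                       # truly c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


# Hypothetical flattened pixel labels for a two-class mask (0 = background, 1 = target).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred, num_classes=2)
scores = per_class_scores(matrix)

for cls, (precision, recall, f1) in scores.items():
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Per-class scores are shown because segmentation datasets are often imbalanced, and a single aggregate number can hide poor performance on the minority class.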

These templates are examples of suggested usage of the ML pipeline. Remember that we intentionally structured the pipeline in modules so that you can select components and plug them into your own geospatial ML workflows. Enjoy, and please reference the repo issues or open a new ticket there should you have any questions or suggestions.
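Because the modules are meant to be swapped independently, an experiment amounts to picking one value per configurable component. The Hydra lesson delegates this to Hydra's configuration system; the underlying idea of running every combination of tunable hyperparameters can be sketched in plain Python as below. This is a stand-in for illustration only, not Hydra's API, and the parameter names and values in `search_space` are hypothetical.

```python
from itertools import product

# Hypothetical search space; the keys and values are illustrative, not pipeline defaults.
search_space = {
    "backbone": ["resnet18", "resnet34"],
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [8, 16],
}


def expand_grid(space):
    """Return one dict per combination of the listed values (Cartesian product)."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


experiments = expand_grid(search_space)

for run_id, params in enumerate(experiments):
    # A real run would train a model and log metrics here; this sketch only lists the plan.
    print(run_id, params)
```

A full grid grows multiplicatively with each added parameter, so keeping the lists short is a practical way to bound the number of training runs.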