Documenting data via an Intake catalogue #6

@VeckoTheGecko

Description

Using an intake catalogue would be a really effective, clear way to document and ingest data when working on Lorenz with Parcels. Exploring the catalogue and looking at data descriptions would be as simple as:

import intake

# Load the catalogue (either local file or from a URL)
cat = intake.open_catalog('https://raw.githubusercontent.com/.../catalog.yaml')

# Explore what’s available
list(cat)
# ['gcm_ocean_data', ...]

# Display a single dataset entry (prints its YAML description)
cat.gcm_ocean_data
gcm_ocean_data:
  args:
    urlpath: '...' # A local path on Lorenz
  description: 'Ocean GCM simulation output including temperature, salinity, and velocity fields.'
  driver: zarr
  metadata:
    institution: 'IMAU'
    frequency: 'monthly'
    model: 'GCM-XYZ'

And reading in an xarray dataset would be as simple as:

# Load the dataset as an xarray object
ds = cat.gcm_ocean_data.to_dask()

Note that since Intake produces xarray objects, there is limited benefit in doing this work before Parcels v4.

This Intake catalogue can then be built into a website using a script such as the one used in the WCRP KM Scale Hackathon 2025 (website - build workflow). Though for our use cases, since this is internal, perhaps it's easier to have users look at the raw YAML file. The same information would be surfaced; it just looks slightly nicer as a website. Thoughts @erikvansebille?

Items:

  • Define an Intake catalogue in a GitHub repo, adding all datasets (transfer all info from the wiki into the catalogue)
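
For reference, a minimal sketch of what one entry in such a `catalog.yaml` could look like, following Intake's YAML catalogue schema (the `urlpath` and metadata values here are placeholders, not the actual Lorenz paths):

```yaml
sources:
  gcm_ocean_data:
    driver: zarr
    description: 'Ocean GCM simulation output including temperature, salinity, and velocity fields.'
    args:
      urlpath: '/path/on/lorenz/to/dataset.zarr'  # placeholder; replace with the actual local path
    metadata:
      institution: 'IMAU'
      frequency: 'monthly'
      model: 'GCM-XYZ'
```

Each wiki entry would become one `sources` item like the above, so the wiki's free-text descriptions map directly onto `description` and `metadata` fields.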

Tangential, but related to data ingestion. @erikvansebille, I just want to flag that Pangeo Forge provides recipes for fetching datasets from the original providers and bringing them onto disk in a unified format. This would be helpful for updating datasets in a standardized way and also for providing new datasets (so that we don't have to manually update them via scripts). Would this be useful — is it a burden updating Lorenz datasets, or not really, since it's only done occasionally? Also, I'm not 100% sure how "alive" the Pangeo Forge project is, or whether we can get a local runner in place to download these datasets: pangeo-forge/pangeo-forge-recipes#814 (comment)
