Documenting data via an Intake catalogue #6

@VeckoTheGecko

Description

Using an intake catalogue would be a really effective, clear way to document and ingest data when working on Lorenz with Parcels. Exploring the catalogue and looking at data descriptions would be as simple as:

import intake

# Load the catalogue (either local file or from a URL)
cat = intake.open_catalog('https://raw.githubusercontent.com/.../catalog.yaml')

# Explore what’s available
list(cat)
# ['gcm_ocean_data', ...]

# Display a single dataset entry (prints its YAML description)
cat.gcm_ocean_data
gcm_ocean_data:
  args:
    urlpath: '...' # A local path on Lorenz
  description: 'Ocean GCM simulation output including temperature, salinity, and velocity fields.'
  driver: zarr
  metadata:
    institution: 'IMAU'
    frequency: 'monthly'
    model: 'GCM-XYZ'

And reading in an xarray dataset would be as simple as:

# Load the dataset as an xarray object
ds = cat.gcm_ocean_data.to_dask()

Note that since Intake produces xarray objects, there is limited benefit in doing this work before Parcels v4.

This Intake catalogue can then be built into a website using a script such as the one used in the WCRP KM Scale Hackathon 2025 (website - build workflow). Though for our use cases, since this is internal, perhaps it's easier to have users look at the raw YAML file. The same information would be surfaced; it just looks slightly nicer as a website. Thoughts @erikvansebille?

Items:

  • Define an Intake catalogue in a GitHub repo, adding all datasets (transfer all info from the wiki into the catalogue)
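
For reference, a minimal sketch of what one entry in such a `catalog.yaml` could look like, following Intake's YAML catalogue schema (the `urlpath` and metadata values here are placeholders, not the actual Lorenz paths):

```yaml
sources:
  gcm_ocean_data:
    driver: zarr
    description: 'Ocean GCM simulation output including temperature, salinity, and velocity fields.'
    args:
      urlpath: '/path/on/lorenz/to/dataset.zarr'  # placeholder; replace with the actual local path
    metadata:
      institution: 'IMAU'
      frequency: 'monthly'
      model: 'GCM-XYZ'
```

Each wiki entry would become one `sources` item like the above, so the wiki's free-text descriptions map directly onto `description` and `metadata` fields.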

Tangential, but related to data ingestion. @erikvansebille, I just want to flag that Pangeo Forge provides recipes for fetching datasets from the original providers and bringing them onto disk in a unified format. This would be helpful for updating datasets in a standardized way and also for providing new datasets (so that we don't have to manually update them via scripts). Would this be useful — is it a burden updating Lorenz datasets, or not really, since it's only done occasionally? Also, I'm not 100% sure how "alive" the Pangeo Forge project is, or whether we can get a local runner in place to download these datasets: pangeo-forge/pangeo-forge-recipes#814 (comment)
