DM-55041: Add Zarr file I/O#45

Draft
timj wants to merge 38 commits into main from tickets/DM-55041

timj commented May 23, 2026

This uses xarray / CF / OME conventions where appropriate.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

timj added 30 commits May 22, 2026 05:49
Captures the agreed-on architecture for adding a zarr backend alongside
the existing FITS/JSON/NDF backends: cloud-first OME-Zarr v0.5 layout
with namespaced LSST extensions, IR-driven two-pass writes mirroring the
NDF approach, and recursive composition for image-shaped sub-archives.

Generated with AI

Co-Authored-By: SLAC AI
Generated with AI

Co-Authored-By: SLAC AI
Pivots the on-disk layout from the v1 OME-multiscale-image-with-lsst-
companions structure to an xarray/CF-shaped root with image, variance,
mask as siblings; OME multiscales metadata points at the same image
array (no byte duplication). Mask becomes a 2-D packed-integer array
with CF flag_masks/flag_meanings/flag_descriptions for native
geospatial-tool interop. ColorImage stops stacking and writes its
channels as recursive sub-archives. WCS is affine-only OME plus
authoritative AST string in v1; RFC-5 nonlinear transformations are a
follow-up blocked on writing an AST JSON channel. Adds an 11x11-grid
residual validator that drops the simplified affine when the max
pixel-equivalent error exceeds 1 pixel.

Generated with AI

Co-Authored-By: SLAC AI
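
The CF flag attributes mentioned in the commit above can be illustrated with a minimal sketch. It assumes one bit per mask plane, a space-separated `flag_meanings` string as in the CF conventions, and `flag_descriptions` stored as a list of strings. The helper names `cf_flag_attrs` and `decode_flags` are hypothetical and are not taken from the backend.

```python
def cf_flag_attrs(planes: dict[str, str]) -> dict:
    """Build CF-style flag attributes from {plane name: description}.

    Assumption: plane i occupies bit i of the packed integer.
    """
    names = list(planes)
    return {
        "flag_masks": [1 << i for i in range(len(names))],
        "flag_meanings": " ".join(names),
        # Not a core CF attribute; the list layout here is an assumption.
        "flag_descriptions": [planes[name] for name in names],
    }


def decode_flags(value: int, attrs: dict) -> list[str]:
    """Return the names of the planes set in one packed pixel value."""
    meanings = attrs["flag_meanings"].split()
    return [m for m, bit in zip(meanings, attrs["flag_masks"]) if value & bit]


attrs = cf_flag_attrs(
    {"BAD": "Bad pixel", "SAT": "Saturated pixel", "CR": "Cosmic ray"}
)
flags = decode_flags(0b101, attrs)  # bits 0 and 2 are set
```

A CF-aware reader can perform the same decoding from the attributes alone, which is the interoperability benefit the commit refers to.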
NCZarr is purely additive on top of the v1 layout (_NCZARR_GROUP /
_NCZARR_ARRAY markers + optional 1-D coordinate variables). Held out
of v1 because the zarr-v3 mapping is still evolving; recorded so we
don't lose the conclusion.

Generated with AI

Co-Authored-By: SLAC AI
Replaces the v1 plan (commit 51a1e3a) with a plan that matches the
revised spec (commits 83b9064 + a34369d): xarray/CF-shaped root with
OME-NGFF as a discoverability layer, 2-D packed-integer mask with CF
flag attrs, ColorImage as recursive sub-archives (no stacking),
CellCoadd with native 4-D PSF (no fixup pass), affine residual
validator with 1-pixel threshold, and the AST string as the WCS
round-trip authority. Six phases, ~37 bite-sized TDD tasks, every
critical invariant pinned by a failing-then-passing test.

Generated with AI

Co-Authored-By: SLAC AI
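
The affine residual validator described in the plan can be sketched roughly as follows. This is an illustration only: it assumes the full mapping and its affine approximation are plain callables from pixel to output coordinates, uses Euclidean distance rather than a great-circle separation, and converts to pixels with a single `pixel_scale` factor. The function names are hypothetical.

```python
import math
from typing import Callable

Mapping2D = Callable[[float, float], tuple[float, float]]


def max_affine_residual(
    mapping: Mapping2D,
    affine: Mapping2D,
    width: int,
    height: int,
    pixel_scale: float,
    n: int = 11,
) -> float:
    """Largest pixel-equivalent error of `affine` on an n x n grid."""
    worst = 0.0
    for i in range(n):
        for j in range(n):
            # Grid points span the image footprint, corners included.
            x = (width - 1) * i / (n - 1)
            y = (height - 1) * j / (n - 1)
            tx, ty = mapping(x, y)
            ax, ay = affine(x, y)
            worst = max(worst, math.hypot(tx - ax, ty - ay) / pixel_scale)
    return worst


def keep_affine(residual_pixels: float, threshold: float = 1.0) -> bool:
    """The simplified affine is dropped when the error exceeds the threshold."""
    return residual_pixels <= threshold


def linear(x: float, y: float) -> tuple[float, float]:
    return 2.0 * x, 2.0 * y


def distorted(x: float, y: float) -> tuple[float, float]:
    return 2.0 * x + 1e-3 * x * x, 2.0 * y


exact = max_affine_residual(linear, linear, 101, 101, pixel_scale=2.0)
bent = max_affine_residual(distorted, linear, 101, 101, pixel_scale=2.0)
```

In this toy case the quadratic term reaches about 5 pixels at the far edge of the footprint, so the affine would be dropped.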
Generated with AI

Co-Authored-By: SLAC AI
…ations

Also extends the AST bridge in _transforms/_ast.py with ZoomMap and
PolyMap so the validator's tests can construct synthetic linear and
distorted FrameSets.

Generated with AI

Co-Authored-By: SLAC AI
Following project convention, imports go at the top of the module rather
than inside function bodies. PolyMap's numpy use and affine_check's
numpy use are hoisted; tests likewise hoist their _ast imports.

Generated with AI

Co-Authored-By: SLAC AI
…ame_set

Generated with AI

Co-Authored-By: SLAC AI
…ag attrs

Generated with AI

Co-Authored-By: SLAC AI
Generated with AI

Co-Authored-By: SLAC AI
…age on-disk shape

Mask.serialize emits the byte axis first when the archive opts in via
_prefer_native_mask_arrays. The output archive undoes that swap so the
on-disk layout is the natural xarray (y, x) shape with a 2-D packed
wide-integer per pixel. Adds an end-to-end on-disk shape test for
MaskedImage that pins this contract (Tasks 2.8 + 3.0 from the plan).

Generated with AI

Co-Authored-By: SLAC AI
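
As a rough illustration of the byte-axis swap, the sketch below packs a byte-first `(nbytes, y, x)` `uint8` array into a 2-D `(y, x)` integer array. It assumes byte 0 is the least significant, at most 8 bytes per pixel, and a `uint64` result; the actual dtype choice and byte order in the backend may differ, and `pack_byte_planes` is a hypothetical name.

```python
import numpy as np


def pack_byte_planes(byte_first: np.ndarray) -> np.ndarray:
    """Pack a (nbytes, y, x) uint8 array into a (y, x) uint64 array.

    Assumption: byte index 0 holds the least significant 8 bits.
    """
    if byte_first.ndim != 3 or byte_first.dtype != np.uint8:
        raise ValueError("expected a 3-D uint8 array with the byte axis first")
    nbytes = byte_first.shape[0]
    if nbytes > 8:
        raise ValueError("more than 8 bytes per pixel does not fit in uint64")
    packed = np.zeros(byte_first.shape[1:], dtype=np.uint64)
    for b in range(nbytes):
        # Shift each byte plane into position and merge it in.
        packed |= byte_first[b].astype(np.uint64) << np.uint64(8 * b)
    return packed


byte_first = np.zeros((2, 2, 3), dtype=np.uint8)
byte_first[0, 0, 0] = 1  # lowest bit of pixel (0, 0)
byte_first[1, 1, 2] = 2  # bit 9 of pixel (1, 2)
packed = pack_byte_planes(byte_first)
```

A 40-plane mask needs 5 bytes per pixel, so it fits in a single 64-bit value under these assumptions.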
…unpack

Generated with AI

Co-Authored-By: SLAC AI
… public read()

Generated with AI

Co-Authored-By: SLAC AI
…mage

Round-trips Image, MaskedImage (3-plane), and MaskedImage (40-plane)
through the zarr backend via the new RoundtripZarr helper.
RoundtripZarr overrides _run_without_butler to use a TemporaryDirectory
since zarr archives are directories, not single files.

Generated with AI

Co-Authored-By: SLAC AI
…gle-cell chunks

- Add decorate_sub_archives that walks the IR and adds lsst.archive_class
  + ome.multiscales to any sub-group containing an image array (Phase 4.1).
- Pin ColorImage write/round-trip behaviour: each channel becomes a 2-D
  array at root with no nested sub-archive, since ColorImage.serialize
  produces flat per-channel arrays (Phase 4.2).
- Default a 4-D psf array to single-cell chunks (1, 1, Py, Px) for
  CellCoadd-style PSF storage (Phase 4.3).

CellCoadd-specific layout and round-trip tests (Phase 4.4, 4.5) are
deferred as the plan's _make_minimal_cell_coadd factories are
implementer-supplied placeholders.

Generated with AI

Co-Authored-By: SLAC AI
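
The single-cell chunk default for a 4-D PSF array amounts to a one-line rule. A minimal sketch, assuming the axis order is `(cell_y, cell_x, Py, Px)` and using a hypothetical helper name:

```python
def default_psf_chunks(shape: tuple[int, ...]) -> tuple[int, int, int, int]:
    """Return chunks that hold exactly one cell's PSF image each."""
    if len(shape) != 4:
        raise ValueError("expected a 4-D PSF array shape")
    _, _, py, px = shape
    # One chunk per cell: the two leading (cell) axes are chunked singly.
    return (1, 1, py, px)


chunks = default_psf_chunks((3, 4, 25, 25))
```

With this layout, reading the PSF for one cell touches a single chunk.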
… interop

- Persist FitsOpaqueMetadata at /lsst/opaque_metadata/fits/primary on
  zarr write; restore on read (Phase 5.1, 5.2).
- Add FITS -> Zarr -> FITS cross-format test confirming primary-HDU
  cards survive the round-trip (Phase 5.3).
- Add xarray.open_zarr interop test pinning Dataset shape and CF flag
  attrs on the mask variable (Phase 5.4, skipped when xarray absent).
- Add optional ome-zarr-py and ngff-validator compliance tests
  (Phase 5.5, 5.6, both skipped when absent).

Also moves all imports to top-of-file across the zarr backend per
project convention.

Generated with AI

Co-Authored-By: SLAC AI
…gment

Phase 6 finalization:
- Replace the placeholder __init__.py docstring with a full overview
  covering supported types, on-disk layout, WCS handling, cloud
  defaults, FITS round-trip, and optional install.
- Add doc/lsst.images/zarr.rst as the API reference page mirroring
  the other backends and wire it into the toctree.
- Add doc/changes/DM-55041.feature.md towncrier fragment.

Generated with AI

Co-Authored-By: SLAC AI
xarray's v3 backend reads dim names from the array's native
dimension_names metadata field, not from _ARRAY_DIMENSIONS in
attributes. Promote the CF attr to the v3 metadata field on write,
and mirror back on read so older v2-style consumers still see
_ARRAY_DIMENSIONS.

Generated with AI

Co-Authored-By: SLAC AI
xarray.open_zarr walks every array in a group and refuses to open the
parent if any array lacks dimension_names. Default to a list of None
sentinels for arrays that do not carry an explicit _ARRAY_DIMENSIONS
attr (e.g. tree, wcs_ast, opaque_metadata blobs).

Adds two xarray-backed read tests (skipped when xarray is missing)
and lists xarray + zarr as runtime requirements.

Generated with AI

Co-Authored-By: SLAC AI
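
The dimension-name handling from the two commits above can be sketched on plain dictionaries, without touching zarr itself. The helper names are hypothetical, and the choice to skip the mirror step when any name is `None` is an assumption made for this illustration.

```python
def resolve_dimension_names(attrs: dict, ndim: int) -> list:
    """Pick the dimension_names to write into the v3 array metadata."""
    names = attrs.get("_ARRAY_DIMENSIONS")
    if names is None:
        # No explicit names (e.g. opaque byte blobs): use None sentinels.
        return [None] * ndim
    if len(names) != ndim:
        raise ValueError("_ARRAY_DIMENSIONS length does not match ndim")
    return list(names)


def mirror_to_attrs(dimension_names: list, attrs: dict) -> dict:
    """On read, expose named dimensions as _ARRAY_DIMENSIONS again."""
    out = dict(attrs)
    if all(name is not None for name in dimension_names):
        out["_ARRAY_DIMENSIONS"] = list(dimension_names)
    return out


image_names = resolve_dimension_names({"_ARRAY_DIMENSIONS": ["y", "x"]}, 2)
blob_names = resolve_dimension_names({}, 1)
restored = mirror_to_attrs(image_names, {"units": "nJy"})
```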
After to_zarr() materialises the IR, call zarr.consolidate_metadata so
the root zarr.json carries every child array's metadata in one place.
This silences xarray's 'Failed to open Zarr store with consolidated
metadata' RuntimeWarning and turns opening a multi-array archive into
a single round-trip — significant on remote stores.

ZipStore does not support consolidation (raises TypeError); we swallow
that so zip writes continue to work without consolidation.

Generated with AI

Co-Authored-By: SLAC AI
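
The consolidate-then-tolerate pattern described above might look like the following sketch. To keep it self-contained, the consolidation function is passed in as a callable; in practice this would presumably be `zarr.consolidate_metadata`. The wrapper name is hypothetical.

```python
from typing import Any, Callable


def consolidate_if_supported(
    store: Any, consolidate: Callable[[Any], Any]
) -> bool:
    """Try to consolidate metadata; report whether it happened.

    A TypeError is treated as "this store cannot consolidate" (the
    ZipStore case in the commit message) and is deliberately swallowed.
    """
    try:
        consolidate(store)
    except TypeError:
        return False
    return True


calls: list = []


def unsupported(store: Any) -> None:
    raise TypeError("store does not support consolidation")


ok = consolidate_if_supported("local-store", calls.append)
skipped = consolidate_if_supported("zip-store", unsupported)
```

Catching only `TypeError` keeps other failures, such as I/O errors, visible to the caller.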
timj added 5 commits May 23, 2026 07:50
…und-trip

Switch FitsOpaqueMetadata storage from a JSON {keyword: value} blob to
the raw card stream Header.tostring() emits, reshaped as a 2-D
(card, char) uint8 array.

The JSON form was lossy: COMMENT/HISTORY (empty-keyword cards) were
filtered out, inline card comments were dropped, types were coerced
through json, CONTINUE/HIERARCH cards were not faithfully captured.
The (N, 80) form is what FITS users expect and round-trips byte-for-
byte through Header.fromstring (which transparently strips the END
card and trailing 2880-byte padding).

Dim names ('card', 'char') are written into the v3 dimension_names
metadata so xarray sees the structure.

Generated with AI

Co-Authored-By: SLAC AI
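
A minimal sketch of the `(card, char)` reshaping, assuming the header text is pure ASCII and a whole number of 80-character cards. The function names are hypothetical; in the backend the text comes from `Header.tostring()`, which is replaced here by a hand-built string to avoid the astropy dependency.

```python
import numpy as np

CARD_LENGTH = 80


def header_text_to_cards(text: str) -> np.ndarray:
    """Reshape a raw FITS card stream into an (N, 80) uint8 array."""
    raw = text.encode("ascii")
    if len(raw) % CARD_LENGTH:
        raise ValueError("header text is not a whole number of 80-char cards")
    return np.frombuffer(raw, dtype=np.uint8).reshape(-1, CARD_LENGTH)


def cards_to_header_text(cards: np.ndarray) -> str:
    """Invert header_text_to_cards byte-for-byte."""
    return cards.tobytes().decode("ascii")


text = (
    "SIMPLE  =                    T".ljust(CARD_LENGTH)
    + "COMMENT kept verbatim, unlike in a keyword/value JSON blob".ljust(CARD_LENGTH)
    + "END".ljust(CARD_LENGTH)
)
cards = header_text_to_cards(text)
roundtrip = cards_to_header_text(cards)
```

Because every byte of every card is stored, COMMENT and HISTORY cards and inline comments survive without special handling.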
These byte-stream meta arrays are always read whole — we never request
a slice of the JSON tree or AST WCS text. Force a single chunk so a
remote read is one fetch instead of ceil(size/1024). Affects:
  - root /tree (Pydantic archive tree)
  - sub-archive /<name>/tree (serialize_pointer JSON)
  - root /wcs_ast (AST FrameSet text)
  - /lsst/opaque_metadata/fits/primary (FITS card stream)

For a 50 KB tree this drops the read count from 49 chunks to 1.
Compression (Blosc/zstd) still applies and gives ~3-5x on JSON text.

Generated with AI

Co-Authored-By: SLAC AI
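
The read-count arithmetic in the commit above can be checked directly. The sketch assumes 1024-byte chunks (from the `ceil(size/1024)` expression) and takes 50 KB to mean 50,000 bytes; the helper name is hypothetical.

```python
import math


def chunk_reads(nbytes: int, chunk_bytes: int) -> int:
    """Number of chunk fetches needed to read a whole 1-D byte array."""
    return math.ceil(nbytes / chunk_bytes)


chunked = chunk_reads(50_000, 1024)     # default-style small chunks
single = chunk_reads(50_000, 50_000)    # one chunk covering the array
```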
Mirrors the naming convention used by the other backends — the FITS
backend stores the tree in a 'JSON' HDU and the NDF backend at
/MORE/LSST/JSON. The plain 'tree' name was an internal pick that did
not communicate what the array holds.

  /lsst_json                 ← root archive tree
  /<sub>/lsst_json           ← sub-archive trees written by serialize_pointer

Root attr renamed in parallel: lsst.tree='tree' -> lsst.json='lsst_json'.
Method names (add_tree, get_tree) stay — they are internal vocabulary.

Generated with AI

Co-Authored-By: SLAC AI
Zarr has no established convention for storing an AST FrameSet text
dump; the RFC-5 nonlinear coordinateTransformations and the OME
linear approximation are the only proposals on the table. NDF stores
the WCS both as JSON and as a separate FrameSet representation; the
zarr backend just relies on the JSON tree, which already round-trips
SIP polynomials and any other PolyMap-based mapping byte-for-byte.

The internal _stage_wcs_ast helper and add_tree's frame-set hook are
left in place. They are never reached, because no production
serialize() calls archive.serialize_frame_set, but the design hooks
are kept for future discussion.

Generated with AI

Co-Authored-By: SLAC AI
…led grid

Delegate the linear-fit / tolerance check to AST's astLinearApprox.
We give it the image footprint as bounds and the requested per-pixel
tolerance scaled to output (sky) units; AST returns coefficients when
a fit within tolerance exists and None otherwise. Saves us the
3-point linearization + 11x11 grid residual sample, plus the
hand-rolled great-circle separation helper.

The affine block we emit is unchanged in shape: a [scale, affine]
pair, where the affine matrix is built from AST's [c0, c1, J] layout
(constants first, then column-major Jacobian).

AffineCheckResult drops the diagnostic max_residual_pixels field and
the corresponding lsst.wcs_simplified_max_residual_pixels root attr,
since AST's pass/fail is binary — when dropped, we know the residual
exceeded tol but not by how much. lsst.wcs_simplified_dropped is
still recorded.

Also adds Mapping.linearApprox to the AST bridge.

Generated with AI

Co-Authored-By: SLAC AI
codecov Bot commented May 23, 2026

Codecov Report

❌ Patch coverage is 89.18605% with 186 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.50%. Comparing base (d80a30c) to head (2f1e643).
⚠️ Report is 22 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| python/lsst/images/zarr/_output_archive.py | 77.18% | 47 Missing ⚠️ |
| python/lsst/images/zarr/_input_archive.py | 79.67% | 25 Missing ⚠️ |
| python/lsst/images/zarr/_store.py | 68.18% | 21 Missing ⚠️ |
| python/lsst/images/zarr/_model.py | 90.37% | 18 Missing ⚠️ |
| tests/test_zarr_external_reader.py | 48.57% | 18 Missing ⚠️ |
| tests/test_zarr_ome_compliance.py | 52.94% | 16 Missing ⚠️ |
| python/lsst/images/zarr/_layout.py | 89.10% | 11 Missing ⚠️ |
| tests/test_zarr_input_archive.py | 98.23% | 4 Missing ⚠️ |
| tests/test_zarr_xarray_interop.py | 93.33% | 4 Missing ⚠️ |
| python/lsst/images/tests/_roundtrip.py | 86.95% | 3 Missing ⚠️ |

... and 10 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #45      +/-   ##
==========================================
+ Coverage   74.08%   75.50%   +1.41%     
==========================================
  Files          89      107      +18     
  Lines       10515    12437    +1922     
==========================================
+ Hits         7790     9390    +1600     
- Misses       2725     3047     +322     


timj added 3 commits May 23, 2026 10:32
- Replace zarr.storage.Store annotations with zarr.abc.store.Store
  (Store is exposed there, not from zarr.storage).
- In _store.py, hoist the Store annotation out of the branch-local
  store assignments so each concrete subtype (ZipStore, FsspecStore,
  LocalStore) is accepted; ZipStore stays in its own variable since
  its close() guard reads ZipStore-specific state.
- In _model.py, narrow the lazy-handle read return to np.ndarray via
  asarray, import BytesCodec/BloscCodec from zarr.codecs directly
  (the codecs submodule is not surfaced on the top-level zarr
  package's stub), and cast cname/shuffle to BloscCodec's enum-typed
  arguments — runtime accepts plain strings.
- In _output_archive.py, drop None placeholders from MaskSchema
  iteration before building CfFlagAttributes; assert the top-level
  image is 2-D before passing its shape to affine_check; cast
  FrameSet to AST Object for Channel.write.

Generated with AI

Co-Authored-By: SLAC AI
Eight Sphinx warnings (treated as errors with -W) for unresolvable
:py:obj: references in single-backtick form:

  - `MaskedImage` / `ColorImage` -> use full paths
    `~lsst.images.MaskedImage` / `~lsst.images.ColorImage` so the
    cross-reference resolves.
  - `fsspec`, `zarr`, `TemporaryDirectory` -> external / stdlib
    refs without intersphinx mappings; switch to literal double
    backticks.
  - `ZarrDocument` / `ZarrDocument.to_zarr` -> internal IR types
    not exported from lsst.images.zarr; switch to literal double
    backticks (matches NDF's _model.NdfDocument convention).

Generated with AI

Co-Authored-By: SLAC AI
Adds sections to doc/lsst.images/zarr.rst:
  - Standards alignment (Zarr v3, xarray/CF, OME-NGFF v0.5, geo-zarr, LSST archive tree)
  - Data model (lsst_json tree, root attrs, FITS opaque metadata, per-array)
  - Example layouts for VisitImage, CellCoadd, and ColorImage with the
    full directory tree and per-array shape / chunk notes
  - WCS handling (PolyMap chain in JSON, OME affine approximation via
    AST's astLinearApprox)
  - Tooling that can read these files: xarray, napari-ome-zarr,
    ome-zarr-py, GDAL/rasterio, zarr-python, napari, neuroglancer,
    ngff-validator
  - Round-trip with FITS

Generated with AI

Co-Authored-By: SLAC AI