DM-55041: Add Zarr file I/O #45
Draft
timj wants to merge 38 commits into
Captures the agreed-on architecture for adding a zarr backend alongside the existing FITS/JSON/NDF backends: cloud-first OME-Zarr v0.5 layout with namespaced LSST extensions, IR-driven two-pass writes mirroring the NDF approach, and recursive composition for image-shaped sub-archives. Generated with AI Co-Authored-By: SLAC AI
Pivots the on-disk layout from the v1 OME-multiscale-image-with-lsst-companions structure to an xarray/CF-shaped root with image, variance, mask as siblings; OME multiscales metadata points at the same image array (no byte duplication). Mask becomes a 2-D packed-integer array with CF flag_masks/flag_meanings/flag_descriptions for native geospatial-tool interop. ColorImage stops stacking and writes its channels as recursive sub-archives. WCS is affine-only OME plus authoritative AST string in v1; RFC-5 nonlinear transformations are a follow-up blocked on writing an AST JSON channel. Adds an 11x11-grid residual validator that drops the simplified affine when the max pixel-equivalent error exceeds 1 pixel. Generated with AI Co-Authored-By: SLAC AI
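As a rough illustration of the CF flag attributes mentioned above, the sketch below builds `flag_masks`, `flag_meanings`, and `flag_descriptions` for a packed-integer mask. The function name `cf_flag_attrs`, the plane names, and the assumption that plane `i` occupies bit `i` are all hypothetical and not taken from the PR; `flag_descriptions` is handled as a plain list only because the commit message names it.

```python
def cf_flag_attrs(planes):
    """Build CF-style flag attributes for a packed-integer mask.

    ``planes`` is a list of (name, description) pairs in bit order; the
    pair at index ``i`` is assumed (for this sketch) to occupy bit ``i``.
    """
    return {
        # One bit mask per plane: 1, 2, 4, ...
        "flag_masks": [1 << bit for bit in range(len(planes))],
        # CF stores the meanings as a single blank-separated string.
        "flag_meanings": " ".join(name for name, _ in planes),
        # Named in the commit message; stored here as a plain list (assumption).
        "flag_descriptions": [description for _, description in planes],
    }


# Hypothetical plane names, for illustration only.
attrs = cf_flag_attrs([
    ("BAD", "Bad pixel"),
    ("SAT", "Saturated pixel"),
    ("CR", "Cosmic ray"),
])
```

A reader that understands CF flags can then test a pixel against a plane with `value & flag_masks[i]`.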
NCZarr is purely additive on top of the v1 layout (_NCZARR_GROUP / _NCZARR_ARRAY markers + optional 1-D coordinate variables). Held out of v1 because the zarr-v3 mapping is still evolving; recorded so we don't lose the conclusion. Generated with AI Co-Authored-By: SLAC AI
Replaces the v1 plan (commit 51a1e3a) with a plan that matches the revised spec (commits 83b9064 + a34369d): xarray/CF-shaped root with OME-NGFF as a discoverability layer, 2-D packed-integer mask with CF flag attrs, ColorImage as recursive sub-archives (no stacking), CellCoadd with native 4-D PSF (no fixup pass), affine residual validator with 1-pixel threshold, and the AST string as the WCS round-trip authority. Six phases, ~37 bite-sized TDD tasks, every critical invariant pinned by a failing-then-passing test. Generated with AI Co-Authored-By: SLAC AI
…ations Also extends the AST bridge in _transforms/_ast.py with ZoomMap and PolyMap so the validator's tests can construct synthetic linear and distorted FrameSets. Generated with AI Co-Authored-By: SLAC AI
Following project convention, imports go at the top of the module rather than inside function bodies. PolyMap's numpy use and affine_check's numpy use are hoisted; tests likewise hoist their _ast imports. Generated with AI Co-Authored-By: SLAC AI
…ame_set Generated with AI Co-Authored-By: SLAC AI
…ag attrs Generated with AI Co-Authored-By: SLAC AI
…age on-disk shape Mask.serialize emits the byte axis first when the archive opts in via _prefer_native_mask_arrays. The output archive undoes that swap so the on-disk layout is the natural xarray (y, x) shape with a 2-D packed wide-integer per pixel. Adds an end-to-end on-disk shape test for MaskedImage that pins this contract (Tasks 2.8 + 3.0 from the plan). Generated with AI Co-Authored-By: SLAC AI
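The axis swap described above can be pictured with a small NumPy sketch. It takes a byte-axis-first `(nbytes, y, x)` array and produces a 2-D `(y, x)` array with one wide integer per pixel. The function name `pack_mask_bytes`, the little-endian byte order, and the padding to the next 1, 2, 4, or 8 byte width are assumptions made for illustration; the PR does not state how the real archive packs the bytes.

```python
import numpy as np


def pack_mask_bytes(byte_first):
    """Pack a (nbytes, y, x) uint8 mask into a 2-D (y, x) unsigned integer array.

    Assumes little-endian packing: byte 0 holds bits 0-7, byte 1 holds bits 8-15, ...
    """
    nbytes, ny, nx = byte_first.shape
    if nbytes > 8:
        raise ValueError("this sketch supports at most 64 mask bits")
    # Smallest native unsigned width that holds all the bytes.
    width = next(w for w in (1, 2, 4, 8) if w >= nbytes)
    padded = np.zeros((ny, nx, width), dtype=np.uint8)
    # Undo the byte-axis-first layout: (nbytes, y, x) -> (y, x, nbytes).
    padded[..., :nbytes] = np.moveaxis(byte_first, 0, -1)
    # Reinterpret each pixel's bytes as one little-endian integer.
    return padded.view(f"<u{width}")[..., 0]


# A 40-plane mask needs 5 bytes per pixel, which pads to a 64-bit integer.
mask_bytes = np.zeros((5, 2, 3), dtype=np.uint8)
mask_bytes[0, 0, 0] = 0x01   # bit 0 set at pixel (y=0, x=0)
mask_bytes[4, 1, 2] = 0x80   # bit 39 set at pixel (y=1, x=2)
packed = pack_mask_bytes(mask_bytes)
```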
…unpack Generated with AI Co-Authored-By: SLAC AI
… public read() Generated with AI Co-Authored-By: SLAC AI
…mage Round-trips Image, MaskedImage (3-plane), and MaskedImage (40-plane) through the zarr backend via the new RoundtripZarr helper. RoundtripZarr overrides _run_without_butler to use a TemporaryDirectory since zarr archives are directories, not single files. Generated with AI Co-Authored-By: SLAC AI
…gle-cell chunks
- Add decorate_sub_archives that walks the IR and adds
  lsst.archive_class + ome.multiscales to any sub-group containing an
  image array (Phase 4.1).
- Pin ColorImage write/round-trip behaviour: each channel becomes a
  2-D array at root with no nested sub-archive, since
  ColorImage.serialize produces flat per-channel arrays (Phase 4.2).
- Default a 4-D psf array to single-cell chunks (1, 1, Py, Px) for
  CellCoadd-style PSF storage (Phase 4.3).
CellCoadd-specific layout and round-trip tests (Phase 4.4, 4.5) are
deferred as the plan's _make_minimal_cell_coadd factories are
implementer-supplied placeholders.
Generated with AI
Co-Authored-By: SLAC AI
… interop
- Persist FitsOpaqueMetadata at /lsst/opaque_metadata/fits/primary on
  zarr write; restore on read (Phase 5.1, 5.2).
- Add FITS -> Zarr -> FITS cross-format test confirming primary-HDU
  cards survive the round-trip (Phase 5.3).
- Add xarray.open_zarr interop test pinning Dataset shape and CF flag
  attrs on the mask variable (Phase 5.4, skipped when xarray absent).
- Add optional ome-zarr-py and ngff-validator compliance tests
  (Phase 5.5, 5.6, both skipped when absent).
Also moves all imports to top-of-file across the zarr backend per
project convention.
Generated with AI
Co-Authored-By: SLAC AI
…gment
Phase 6 finalization:
- Replace the placeholder __init__.py docstring with a full overview
  covering supported types, on-disk layout, WCS handling, cloud
  defaults, FITS round-trip, and optional install.
- Add doc/lsst.images/zarr.rst as the API reference page mirroring the
  other backends and wire it into the toctree.
- Add doc/changes/DM-55041.feature.md towncrier fragment.
Generated with AI
Co-Authored-By: SLAC AI
xarray's v3 backend reads dim names from the array's native dimension_names metadata field, not from _ARRAY_DIMENSIONS in attributes. Promote the CF attr to the v3 metadata field on write, and mirror back on read so older v2-style consumers still see _ARRAY_DIMENSIONS. Generated with AI Co-Authored-By: SLAC AI
xarray.open_zarr walks every array in a group and refuses to open the parent if any array lacks dimension_names. Default to a list of None sentinels for arrays that do not carry an explicit _ARRAY_DIMENSIONS attr (e.g. tree, wcs_ast, opaque_metadata blobs). Adds two xarray-backed read tests (skipped when xarray is missing) and lists xarray + zarr as runtime requirements. Generated with AI Co-Authored-By: SLAC AI
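The two commits above describe a small mapping between the v2-style `_ARRAY_DIMENSIONS` attribute and the v3 `dimension_names` metadata field. The sketch below works on plain dictionaries rather than real zarr arrays; the function names are hypothetical, and whether the backend removes the attribute after promoting it is an assumption of this sketch.

```python
def to_v3_dimension_names(attributes, ndim):
    """Split v2-style attributes into (dimension_names, remaining attributes).

    Arrays without ``_ARRAY_DIMENSIONS`` get one ``None`` sentinel per axis,
    so every array carries a dimension_names entry of the right length.
    """
    attrs = dict(attributes)
    dims = attrs.pop("_ARRAY_DIMENSIONS", None)  # assumption: the attr is moved, not copied
    if dims is None:
        return [None] * ndim, attrs
    if len(dims) != ndim:
        raise ValueError(f"expected {ndim} dimension names, got {len(dims)}")
    return list(dims), attrs


def to_v2_attributes(dimension_names, attributes):
    """Mirror fully named dimensions back into ``_ARRAY_DIMENSIONS`` on read."""
    attrs = dict(attributes)
    if all(name is not None for name in dimension_names):
        attrs["_ARRAY_DIMENSIONS"] = list(dimension_names)
    return attrs


image_dims, image_attrs = to_v3_dimension_names(
    {"_ARRAY_DIMENSIONS": ["y", "x"], "units": "nJy"}, ndim=2
)
blob_dims, blob_attrs = to_v3_dimension_names({}, ndim=1)
```

Here `image_dims` carries the named axes for an image-like array, while `blob_dims` shows the sentinel default for a byte-stream array such as the opaque metadata blobs.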
After to_zarr() materialises the IR, call zarr.consolidate_metadata so the root zarr.json carries every child array's metadata in one place. This silences xarray's 'Failed to open Zarr store with consolidated metadata' RuntimeWarning and turns opening a multi-array archive into a single round-trip — significant on remote stores. ZipStore does not support consolidation (raises TypeError); we swallow that so zip writes continue to work without consolidation. Generated with AI Co-Authored-By: SLAC AI
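A minimal sketch of the "consolidate, but tolerate stores that cannot" pattern described above. To keep the example free of a zarr dependency, the consolidation call is passed in as a parameter; the helper name `consolidate_if_supported` is hypothetical, and the only behaviour taken from the commit message is that a `TypeError` is swallowed.

```python
def consolidate_if_supported(store, consolidate):
    """Run ``consolidate(store)``; return False if the store rejects it.

    Only TypeError is swallowed, matching the failure the commit message
    reports for ZipStore. Any other exception still propagates.
    """
    try:
        consolidate(store)
    except TypeError:
        return False
    return True
```

With zarr-python installed, `consolidate` would typically be `zarr.consolidate_metadata`, called once after the archive has been fully written.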
…und-trip
Switch FitsOpaqueMetadata storage from a JSON {keyword: value} blob to
the raw card stream Header.tostring() emits, reshaped as a 2-D
(card, char) uint8 array.
The JSON form was lossy: COMMENT/HISTORY (empty-keyword cards) were
filtered out, inline card comments were dropped, types were coerced
through json, CONTINUE/HIERARCH cards were not faithfully captured.
The (N, 80) form is what FITS users expect and round-trips byte-for-byte
through Header.fromstring (which transparently strips the END
card and trailing 2880-byte padding).
Dim names ('card', 'char') are written into the v3 dimension_names
metadata so xarray sees the structure.
Generated with AI
Co-Authored-By: SLAC AI
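To make the `(card, char)` layout concrete, the sketch below reshapes a FITS header text block into an `(N, 80)` uint8 array and back. It builds the card text by hand rather than calling astropy, and the helper names `cards_to_array` and `array_to_cards` are hypothetical; the real backend's code may differ.

```python
import numpy as np

CARD_LEN = 80  # every FITS header card is exactly 80 ASCII characters


def cards_to_array(header_text):
    """Reshape a header card stream into a 2-D (card, char) uint8 array."""
    raw = header_text.encode("ascii")
    if len(raw) % CARD_LEN:
        raise ValueError("header text is not a whole number of 80-char cards")
    return np.frombuffer(raw, dtype=np.uint8).reshape(-1, CARD_LEN)


def array_to_cards(array):
    """Invert cards_to_array, returning the original card stream."""
    return np.asarray(array, dtype=np.uint8).tobytes().decode("ascii")


# Three hand-written cards, padded out to one 2880-byte FITS block.
cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "COMMENT free-text cards keep their content",
    "END",
]
header_text = "".join(card.ljust(CARD_LEN) for card in cards).ljust(2880)
card_array = cards_to_array(header_text)
```

Because the array holds the raw characters, COMMENT cards and inline comments are carried through without any type coercion.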
These byte-stream meta arrays are always read whole: we never request
a slice of the JSON tree or AST WCS text. Force a single chunk so a
remote read is one fetch instead of ceil(size/1024).
Affects:
- root /tree (Pydantic archive tree)
- sub-archive /<name>/tree (serialize_pointer JSON)
- root /wcs_ast (AST FrameSet text)
- /lsst/opaque_metadata/fits/primary (FITS card stream)
For a 50 KB tree this drops the read count from 49 chunks to 1.
Compression (Blosc/zstd) still applies and gives ~3-5x on JSON text.
Generated with AI
Co-Authored-By: SLAC AI
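The chunk arithmetic above can be checked with a short calculation. The sketch assumes "50 KB" means 50,000 bytes and that the previous chunk size was 1024 bytes, as the ceil(size/1024) expression suggests; the function name `chunk_count` is hypothetical.

```python
def chunk_count(size_bytes, chunk_bytes):
    """Number of chunks needed to cover ``size_bytes`` (integer ceiling division)."""
    if size_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("size must be non-negative and chunk size positive")
    return -(-size_bytes // chunk_bytes)


tree_size = 50_000                                 # assumption: 50 KB read as 50,000 bytes
old_fetches = chunk_count(tree_size, 1024)         # 1024-byte chunks
new_fetches = chunk_count(tree_size, tree_size)    # one chunk spanning the whole array
```

Under these assumptions `old_fetches` is 49 and `new_fetches` is 1, matching the figures in the commit message. Reading 50 KB as 51,200 bytes would give 50 chunks instead.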
Mirrors the naming convention used by the other backends: the FITS
backend stores the tree in a 'JSON' HDU and the NDF backend at
/MORE/LSST/JSON. The plain 'tree' name was an internal pick that did
not communicate what the array holds.
/lsst_json ← root archive tree
/<sub>/lsst_json ← sub-archive trees written by serialize_pointer
Root attr renamed in parallel: lsst.tree='tree' -> lsst.json='lsst_json'.
Method names (add_tree, get_tree) stay; they are internal vocabulary.
Generated with AI
Co-Authored-By: SLAC AI
Zarr has no established convention for storing an AST FrameSet text dump; the RFC-5 nonlinear coordinateTransformations and the OME linear approximation are the only proposals on the table. NDF stores the WCS both as JSON and as a separate FrameSet representation; the zarr backend just relies on the JSON tree, which already round-trips SIP polynomials and any other PolyMap-based mapping byte-for-byte. The internal _stage_wcs_ast helper and add_tree's frame-set hook are left in place — they're never reached because no production serialize() calls archive.serialize_frame_set, but the design hooks are kept for future discussion. Generated with AI Co-Authored-By: SLAC AI
…led grid Delegate the linear-fit / tolerance check to AST's astLinearApprox. We give it the image footprint as bounds and the requested per-pixel tolerance scaled to output (sky) units; AST returns coefficients when a fit within tolerance exists and None otherwise. Saves us the 3-point linearization + 11x11 grid residual sample, plus the hand-rolled great-circle separation helper. The affine block we emit is unchanged in shape: a [scale, affine] pair, where the affine matrix is built from AST's [c0, c1, J] layout (constants first, then column-major Jacobian). AffineCheckResult drops the diagnostic max_residual_pixels field and the corresponding lsst.wcs_simplified_max_residual_pixels root attr, since AST's pass/fail is binary — when dropped, we know the residual exceeded tol but not by how much. lsst.wcs_simplified_dropped is still recorded. Also adds Mapping.linearApprox to the AST bridge. Generated with AI Co-Authored-By: SLAC AI
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #45      +/-   ##
==========================================
+ Coverage   74.08%   75.50%   +1.41%
==========================================
  Files          89      107      +18
  Lines       10515    12437    +1922
==========================================
+ Hits         7790     9390    +1600
- Misses       2725     3047     +322
- Replace zarr.storage.Store annotations with zarr.abc.store.Store
  (Store is exposed there, not from zarr.storage).
- In _store.py, hoist the Store annotation out of the branch-local
  store assignments so each concrete subtype (ZipStore, FsspecStore,
  LocalStore) is accepted; ZipStore stays in its own variable since
  its close() guard reads ZipStore-specific state.
- In _model.py, narrow the lazy-handle read return to np.ndarray via
  asarray, import BytesCodec/BloscCodec from zarr.codecs directly (the
  codecs submodule is not surfaced on the top-level zarr package's
  stub), and cast cname/shuffle to BloscCodec's enum-typed arguments;
  the runtime accepts plain strings.
- In _output_archive.py, drop None placeholders from MaskSchema
  iteration before building CfFlagAttributes; assert the top-level
  image is 2-D before passing its shape to affine_check; cast FrameSet
  to AST Object for Channel.write.
Generated with AI
Co-Authored-By: SLAC AI
Eight Sphinx warnings (treated as errors with -W) for unresolvable
:py:obj: references in single-backtick form:
- `MaskedImage` / `ColorImage` -> use full paths
`~lsst.images.MaskedImage` / `~lsst.images.ColorImage` so the
cross-reference resolves.
- `fsspec`, `zarr`, `TemporaryDirectory` -> external / stdlib
refs without intersphinx mappings; switch to literal double
backticks.
- `ZarrDocument` / `ZarrDocument.to_zarr` -> internal IR types
not exported from lsst.images.zarr; switch to literal double
backticks (matches NDF's _model.NdfDocument convention).
Generated with AI
Co-Authored-By: SLAC AI
Adds sections to doc/lsst.images/zarr.rst:
- Standards alignment (Zarr v3, xarray/CF, OME-NGFF v0.5, geo-zarr, LSST archive tree)
- Data model (lsst_json tree, root attrs, FITS opaque metadata, per-array)
- Example layouts for VisitImage, CellCoadd, and ColorImage with the
full directory tree and per-array shape / chunk notes
- WCS handling (PolyMap chain in JSON, OME affine approx via
AST linearApprox)
- Tooling that can read these files: xarray, napari-ome-zarr,
ome-zarr-py, GDAL/rasterio, zarr-python, napari, neuroglancer,
ngff-validator
- Round-trip with FITS
Generated with AI
Co-Authored-By: SLAC AI
This uses xarray / CF / OME conventions where appropriate.
Checklist
doc/changes