From a11db46756aa7cfeb42337b8a485980fbd7163ca Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Fri, 22 May 2026 05:49:22 -0700 Subject: [PATCH 01/60] Add design spec for zarr I/O backend Captures the agreed-on architecture for adding a zarr backend alongside the existing FITS/JSON/NDF backends: cloud-first OME-Zarr v0.5 layout with namespaced LSST extensions, IR-driven two-pass writes mirroring the NDF approach, and recursive composition for image-shaped sub-archives. Generated with AI Co-Authored-By: SLAC AI --- .../specs/2026-05-22-zarr-io-design.md | 426 ++++++++++++++++++ 1 file changed, 426 insertions(+) create mode 100644 docs/superpowers/specs/2026-05-22-zarr-io-design.md diff --git a/docs/superpowers/specs/2026-05-22-zarr-io-design.md b/docs/superpowers/specs/2026-05-22-zarr-io-design.md new file mode 100644 index 00000000..47c110a7 --- /dev/null +++ b/docs/superpowers/specs/2026-05-22-zarr-io-design.md @@ -0,0 +1,426 @@ +# Zarr I/O Backend for `lsst.images` — Design + +**Status:** Approved (design phase). Ready for implementation planning. +**Date:** 2026-05-22 +**Author:** Tim Jenness (with Claude collaborator) + +## 1. Goals, Scope, Non-Goals + +### Goals + +Add a `lsst.images.zarr` subpackage providing: + +- A `ZarrOutputArchive` and `ZarrInputArchive` implementing the existing + `lsst.images.serialization` `OutputArchive` / `InputArchive` ABCs. +- Top-level `read()` and `write()` helpers consistent with the FITS, JSON, + and NDF backends. +- A Python intermediate representation (IR) — `ZarrDocument`, `ZarrGroup`, + `ZarrArray`, etc. — that describes the on-disk layout independently of + `zarr-python`, mirroring the role `NdfDocument` plays for the NDF backend. 
+ +Because the backend builds on the abstract archive interface, every image +type that already serializes to FITS/JSON/NDF (`Image`, `Mask`, +`MaskedImage`, `VisitImage`, `ColorImage`, `CellCoadd`, plus any +`serialize()`-implementing object reachable through the archive) works with +no per-type code in the backend itself. Per-type adjustments are limited to +layout decisions (e.g. ColorImage's channel axis) made in `_layout.py` +against the populated IR. + +### Standards alignment + +Files written are valid OME-Zarr v0.5 NGFF images at every level where +image-shaped data lives, augmented with namespaced `lsst:` extensions where +no relevant standard exists (mask plane semantics, AST WCS round-trip, the +LSST archive tree, table layout, cell-grid hints). + +External tools that consume OME-Zarr (`napari`, `neuroglancer`, +`ngff-validator`, `ome-zarr-py`, etc.) can render the science arrays +without LSST-specific awareness. Recursive composition: any sub-archive +holding image-shaped data (including PSF model parameter images) is +itself a valid OME-Zarr group at its zarr path, with its own +`ome.multiscales` and `lsst.archive_class` attributes. + +### Cloud-first, local works too + +- Default chunk geometry is tile-aligned (~1024×1024 for plain images, + `cell_shape` for `CellCoadd`). +- Sharding (zarr v3 native) is enabled by default with a tunable shard size + to keep object counts manageable on S3/GCS. +- Subset reads via `slices=` exploit zarr's chunk index. +- Both `LocalStore` (directory) and `ZipStore` are supported; the choice is driven + by URI shape (`*.zarr.zip` → ZipStore, otherwise directory). Remote URIs + go through `lsst.resources.ResourcePath` and `fsspec`. + +### Scope + +Same image-type coverage as the FITS backend: `Image`, `Mask` (2D and 3D), +`MaskedImage`, `VisitImage`, `ColorImage`, `CellCoadd`, plus any +`serialize()`-implementing object reachable through the archive interface. + +ColorImage uses the OME `c` axis. 
CellCoadd's per-cell PSF is stored as a +single 4D OME-Zarr image with cell-aligned chunks. + +### Non-Goals (initial release) + +- No dask / lazy `read_lazy()` API; added later, tracked as follow-up. +- No multi-level OME multiscale pyramid (we only ever write one level at + `0/`). +- No NGFF nonlinear coordinate transformations (currently underspecified + and lacking widespread tooling support). Tracked as follow-up; see + Section 6. +- No automatic OME consolidated-metadata extension. Tracked as follow-up. + +### Dependency + +Optional `[zarr]` extra requiring `zarr >= 3.0` and any required codec +packages. The top-level `lsst/images/zarr/__init__.py` does a guarded +`import zarr` and raises `ImportError` with installation guidance if +missing, mirroring the NDF backend. + +## 2. Module Layout and Architecture + +``` +python/lsst/images/zarr/ +├── __init__.py guarded `import zarr`; re-exports +├── _common.py ZarrPointerModel (analog of NdfPointerModel), +│ attribute namespace constants ("lsst:", "ome:"), +│ ZarrCompressionOptions dataclass (codec, level), +│ path/JSON-pointer helpers +├── _model.py Python intermediate representation: +│ ZarrDocument, ZarrGroup, ZarrArray, ZarrAttributes, +│ OmeMultiscale, OmeOmero, LsstTableGroup, LsstMaskGroup, +│ from_zarr() / to_zarr() materialization methods +├── _layout.py Layout rules: where things go relative to a root +│ group, JSON-pointer ↔ zarr-path translation, +│ OME axis selection per archive class +│ (ColorImage → c,y,x; default → y,x; …) +├── _output_archive.py ZarrOutputArchive and write() +├── _input_archive.py ZarrInputArchive and read() +└── _store.py Wrapper that turns a ResourcePath / fsspec URI into + the right zarr.storage.Store + (LocalStore / ZipStore / FsspecStore) +``` + +### Fit with existing abstractions + +- `ZarrOutputArchive[ZarrPointerModel]` implements the abstract methods + (`serialize_direct`, `serialize_pointer`, `serialize_frame_set`, + `add_array`, `add_table`, 
`add_structured_array`, `iter_frame_sets`). +- `ZarrPointerModel` is a small Pydantic model holding a zarr path + (e.g. `"/lsst/mask"`); when a model field carries a `ZarrPointerModel`, + the consumer dereferences it through the input archive — same pattern + as `NdfPointerModel`. +- `update_header` callbacks (intended for FITS) are accepted and ignored, + identical to the JSON backend. +- The `serialization.ArchiveTree` JSON tree is stored verbatim as a UTF-8 + zarr array at `/lsst/tree`. Array references in the tree resolve to + zarr paths under the same root. + +### Two-pass write driven by the IR + +During `obj.serialize(archive)`, the archive populates an in-memory +`ZarrDocument`. Only when the context manager exits does the IR +materialize to zarr-python via the configured store. + +Benefits: + +- Per-class layout decisions are made once in `_layout.py` against the + populated IR rather than scattered across `add_array` calls. +- Tests can assert on the IR without writing files (mirrors the NDF + pattern). +- A future "validate-then-commit" step (e.g. ngff-validator integration) + can run against the IR. + +### Read mirrors write + +`ZarrInputArchive.open()` opens the store, builds a `ZarrDocument` view +backed by lazy zarr-python objects, validates the `lsst:archive_class` +attribute, locates `/lsst/tree`, and parses it into the appropriate +`ArchiveTree` Pydantic model. `get_array(model, slices=...)` translates +the model's path into a chunk-aligned zarr read. + +### Backend write helper signature + +```python +def write( + obj: Any, + path: ResourcePathExpression | None = None, + *, + chunks: Mapping[str, tuple[int, ...] | None] | None = None, + shards: Mapping[str, tuple[int, ...] | None] | None = None, + compression: Mapping[str, ZarrCompressionOptions | None] | None = None, + metadata: dict[str, MetadataValue] | None = None, + butler_info: ButlerInfo | None = None, +) -> ArchiveTree: ... 
+``` + +`chunks`, `shards`, and `compression` are per-array dicts keyed by the +JSON pointer of the attribute the array backs (or its zarr path), +mirroring the existing `compression_options` pattern from the FITS +backend. Different arrays have different ranks (2D image, 3D mask, 4D +per-cell PSF) so a single tuple value would not be meaningful. Missing +keys fall back to the per-class defaults from +[Section 3 — Chunking and sharding defaults](#chunking-and-sharding-defaults). +A value of `None` for a key means "use the default for this array"; +explicitly setting `shards` to `{}` (empty mapping) does *not* disable +sharding — to disable, pass `{"": None}` per array. (A future +follow-up may add a `shards=False` shorthand if it proves useful.) + +## 3. On-Disk Layout (the spec) + +### Top-level group attributes (`zarr.json` `attributes`) + +```jsonc +{ + // OME-Zarr v0.5 multiscales — populated whenever there is a top-level + // data array. + "ome": { + "version": "0.5", + "multiscales": [{ + "name": "", + "axes": [/* see per-class table below */], + "datasets": [{ + "path": "0", + "coordinateTransformations": [/* affine projection of WCS, if any */] + }] + }], + // Only present when channel axis is used (e.g. ColorImage). + "omero": { "channels": [...] } + }, + + // LSST extensions (always present). + "lsst": { + "version": 1, // schema version of LSST extension + "archive_class": "VisitImage", // dispatch for read-side construction + "tree": "/lsst/tree", // zarr path to JSON tree + "frame_set": "/lsst/frame_set", // optional, AST string array + "companions": { // heterogeneous companion sub-images + "mask": "/lsst/mask", + "variance": "/lsst/variance" + }, + "opaque_metadata_format": "fits", // optional, only when opaque metadata is present (e.g. 
round-tripping from a FITS read) + "cell_grid": { "bbox": ..., "cell_shape": [256, 256] } // CellCoadd only + } +} +``` + +### Axis choice per archive class + +| Archive class | Axes | Top-level array shape | Notes | +|---|---|---|---| +| `Image`, `Mask` (2D), `MaskedImage`, `VisitImage`, `CellCoadd` | `[y, x]` | `(Y, X)` | Standard 2D image | +| `ColorImage` | `[c, y, x]` | `(3, Y, X)` | Transposed from in-memory `(Y, X, 3)`; `ome:omero/channels=[R,G,B]` | +| `Mask` (3D) | `[plane, y, x]` | `(P, Y, X)` | When written standalone | +| `CellPointSpreadFunction` (per-cell PSF) | `[cell_y, cell_x, y, x]` | `(Cy, Cx, Py, Px)` | Always nested under `lsst/psf/per_cell` of a parent CellCoadd; cell-aligned chunks | + +### Mask groups + +Located under `lsst/mask/` of a parent (or top-level for a standalone +Mask): + +- `0/`: 3D zarr array with shape `(plane, y, x)` (or 2D for old-style flat + masks). +- `zarr.json` attrs: `ome.multiscales.axes = [plane, y, x]`, plus + `lsst.mask.planes = [{name, bit, description}, …]`. +- The mask schema is duplicated between the mask group's attributes and + the JSON tree. The JSON tree is authoritative; the duplication is for + OME-Zarr-style discoverability by external tools. + +### Tables + +Located under `lsst/tables//` (or wherever the JSON tree references): + +- A zarr group with one 1D zarr array per column. +- Group attrs: `lsst.table = {columns: [{name, dtype, unit, description}, …], meta: {...}, length: N}`. +- Structured arrays use the same group form; the deserialized type + differs. + +### Frame sets / WCS + +- AST `FrameSet` serialized via `Channel`/`StringStream` → UTF-8 bytes → + stored as a 1D `uint8` zarr array at `lsst/frame_set` (or + `lsst/frame_sets/` when multiple frame sets are referenced via + `serialize_frame_set`). 
+- When the FrameSet's pixel-to-sky portion is purely affine + (translation + scale + rotation), an equivalent OME + `coordinateTransformations` block is added to the multiscales metadata + for external-tool benefit. The AST string remains the authoritative + source for round-trip. +- For `CellCoadd` and other tangent-plane WCS cases, the affine + approximation is often exact for the linear part; the AST string is + still authoritative. + +### JSON tree (`lsst/tree`) + +- A 1D `uint8` zarr array containing UTF-8 JSON. Same content the JSON + backend would produce, but with `ArrayReferenceModel` references whose + source paths are zarr paths within the store (e.g. `"/0"` → top-level + data array, `"/lsst/mask/0"` → 3D mask array). These resolve into the + zarr store, not into the JSON document itself, so they do not use the + JSON-Pointer `#/` fragment prefix. +- Keeps round-trip with the existing `serialize` / `deserialize` + machinery untouched. + +### Recursive composition + +Any sub-archive that holds image-shaped data (e.g. PSF model parameter +images) creates a nested zarr group at its archive path that is itself a +valid OME-Zarr image, with its own `ome.multiscales` and +`lsst.archive_class` attributes. The top-level is not special; the same +rules apply at every level. + +### Chunking and sharding defaults + +- Default chunk for a 2D image: `min(1024, dim)` per axis. + For `CellCoadd`: `cell_shape`. +- Default shard: 4×4 chunks (i.e. 4096×4096 for plain images, 4×4 cells + for `CellCoadd`) if shard size would be ≥ 1 MiB; otherwise no + sharding. +- Default codec stack: `bytes -> blosc(zstd, clevel=5, shuffle=shuffle)` + for floats; `bytes -> blosc(zstd, clevel=5, shuffle=bitshuffle)` for + integers. +- Mask arrays use `bitshuffle + zstd` (compresses very well). +- All defaults are overridable via `ZarrCompressionOptions` per-array + (keyed by JSON pointer, mirroring the FITS backend). + +## 4. 
Error Handling, Edge Cases, Round-Trips + +### Round-trip rules + +- A zarr file written from an object read from FITS must round-trip its + `FitsOpaqueMetadata` (primary header) so that re-writing to FITS + preserves header cards. Stored at `lsst/opaque_metadata/fits/primary` + as JSON-encoded astropy `Header`. Same opaque-metadata pattern NDF + uses. +- For zarr itself: any `lsst.*` attributes the archive doesn't recognize + are preserved verbatim and re-emitted on write of an unchanged tree + (forward compatibility). +- ColorImage's transpose `(Y, X, 3) ↔ (3, Y, X)` is handled in the IR + materialization layer, not in the user-visible `add_array`. + +### Error taxonomy + +Extends existing `serialization.ArchiveReadError`: + +- `ArchiveReadError("File has no zarr.json")` for missing root metadata. +- `ArchiveReadError("File is not an LSST zarr archive")` when + `lsst.archive_class` is missing. +- `ArchiveReadError("Unsupported lsst:version ")` for forward-incompat + schema versions. +- `InvalidParameterError` for unknown `read()` kwargs. +- `InvalidComponentError` for `deserialize_component` on unknown + component names. +- Validation failures from `model_validate_json` propagate as + `ArchiveReadError`. + +### Mode and atomicity + +- Write opens the store in create-only mode (refuses to overwrite an + existing zarr root, mirroring FITS/NDF). +- For LocalStore (directory) writes, a partial failure leaves a partial + directory, the same risk profile as NDF write failures. Document this and recommend + writing to a temp path then renaming via ResourcePath. +- ZipStore writes are effectively all-or-nothing (the file isn't a valid + zip until the central directory is written), so a failed write cannot + be mistaken for a complete archive, although the truncated file may + still need to be removed. + +### Chunk-aligned subset reads + +- `get_array(model, slices=...)` passes slices straight to the backing + zarr-python array. Zarr handles chunk boundary alignment internally. 
+- For 3D mask reads with a `bbox=`-style argument, the slice on the + spatial axes only is what zarr sees; the plane axis is fully read. + +### Mask schema mismatches + +- If a Mask is read where the on-disk plane definitions differ from the + in-memory schema being requested, raise `ArchiveReadError` with both + schemas attached, identical to NDF. + +### Empty / minimal cases + +- `Image` with no projection: omit `lsst.frame_set`, omit OME + `coordinateTransformations`, the `lsst.tree` is just an + `ImageSerializationModel`. +- `Image` plus metadata only: same as above; `metadata` lives in the + JSON tree. + +### Forward compatibility + +- `lsst.version` integer; readers refuse versions newer than they + understand. +- Unknown `lsst.*` keys at any level are preserved through the opaque + metadata mechanism so a partial round-trip does not lose them. + +## 5. Testing Strategy and Rollout + +### Test layout + +Mirrors the NDF pattern (`tests/test_ndf_*.py`): + +- `tests/test_zarr_common.py` — `_common.py` constants, path helpers, + `ZarrCompressionOptions` dataclass. +- `tests/test_zarr_model.py` — IR types in isolation: `ZarrDocument` + round-trip via `from_zarr` / `to_zarr` against an in-memory store, + attribute schema validation, ColorImage axis transpose. +- `tests/test_zarr_layout.py` — `_layout.py` rules: which axes for + which archive class, which attributes get populated, JSON-pointer ↔ + zarr-path translation. +- `tests/test_zarr_output_archive.py` — write paths for every supported + archive class (`Image`, `Mask`, `MaskedImage`, `VisitImage`, + `ColorImage`, `CellCoadd`), verifying the on-disk layout matches the + spec by inspecting the IR. +- `tests/test_zarr_input_archive.py` — read paths and `slices=` subset + reads, error taxonomy tests, opaque-metadata round-trips. +- `tests/test_zarr_round_trip.py` — full write→read round-trips for + every type, plus FITS↔Zarr cross-format round-trips for the types + that already do FITS↔NDF round-trips. 
+- `tests/test_zarr_ome_compliance.py` — *if* `ngff-validator` (or + equivalent) can be installed in CI, run it against representative + outputs to catch OME-Zarr spec drift. Skipped if the tool is + unavailable. +- `tests/test_zarr_external_reader.py` — sanity-check that the + `ome-zarr` Python tooling can open our files and read the data array + (not LSST extensions). Skipped if `ome-zarr` is not installed. + +### CI / dev requirements + +Add `zarr >= 3.0` to the optional test dependency set so tests run +automatically. The package metadata adds `[zarr]` extra to the +user-facing extras. + +### Rollout plan + +Scoped into separate tickets/PRs to keep review tractable: + +1. Skeleton + `_common.py` + `_model.py` IR + tests for the IR alone. + No write/read yet. +2. `_store.py` + `_layout.py` + `ZarrOutputArchive` + write helper. + Cover `Image`, `MaskedImage`, `VisitImage` only. Output-side tests. +3. `ZarrInputArchive` + read helper + `slices=` subset reads + error + taxonomy. Input-side tests + round-trip for the types in step 2. +4. `ColorImage` (channel-axis specialization) + `CellCoadd` + (cell-aligned chunks + 4D PSF). Round-trip tests. +5. Cross-format round-trips (FITS ↔ Zarr opaque metadata round-trip). + Optional `ome-zarr` external-reader sanity test. +6. Documentation: module docstring (mirroring the FITS/NDF module + docstrings) describing the layout, plus a changelog entry. + +## 6. Follow-Up Work (Out of Scope) + +Captured here so they are not lost; each is to be tracked as its own +ticket once the initial backend lands. + +- **NGFF nonlinear coordinate transformations.** When NGFF gains + broadly-supported nonlinear transformations (RFC-3 follow-on or + successor) and tooling adoption follows, replace the current + affine-approximation path with a fully populated OME + `coordinateTransformations`. 
This is high-interest because tangent-plane + pixel-to-sky transformations (CellCoadd) and polynomial corrections + (VisitImage) currently round-trip only through the AST string; richer + OME support would expose them to external tools. +- **Lazy / dask-friendly read API** (`read_lazy()` returning open zarr + arrays/groups for downstream dask integration). +- **Multiscale pyramid generation** (level 1, 2, … coarsenings) for + visualization tools. +- **`zarr.consolidated_metadata` extension** to reduce object-list calls + on cloud stores. From 51a1e3a92a3429f379602b9106e9e0934f076a21 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Fri, 22 May 2026 08:05:23 -0700 Subject: [PATCH 02/60] Add implementation plan for zarr I/O backend Generated with AI Co-Authored-By: SLAC AI --- .../plans/2026-05-22-zarr-io-backend.md | 4642 +++++++++++++++++ 1 file changed, 4642 insertions(+) create mode 100644 docs/superpowers/plans/2026-05-22-zarr-io-backend.md diff --git a/docs/superpowers/plans/2026-05-22-zarr-io-backend.md b/docs/superpowers/plans/2026-05-22-zarr-io-backend.md new file mode 100644 index 00000000..38991d4c --- /dev/null +++ b/docs/superpowers/plans/2026-05-22-zarr-io-backend.md @@ -0,0 +1,4642 @@ +# Zarr I/O Backend Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Add a `lsst.images.zarr` subpackage that reads and writes OME-Zarr v0.5 files (with namespaced `lsst:` extensions) for every image type the existing FITS/JSON/NDF backends support, with cloud-friendly chunking/sharding and efficient subset reads that go straight to the underlying `zarr-python` lazy arrays. + +**Architecture:** Mirrors the NDF backend. A Python intermediate representation (`ZarrDocument`/`ZarrGroup`/`ZarrArray`) holds the on-disk layout independently of `zarr-python`. 
The IR holds **lazy `zarr.Array` handles, never materialized `numpy` arrays** — so `from_zarr()` opens groups without reading bytes, and `InputArchive.get_array(model, slices=...)` passes slices through to the lazy handle. Writes use a two-pass model: `obj.serialize(archive)` populates the IR, then `__exit__` materializes it via the configured `zarr.storage.Store`. Stores are selected from a `ResourcePath` URI: `*.zarr.zip` → `ZipStore`, remote URIs → `FsspecStore`, otherwise `LocalStore`. + +**Tech Stack:** `zarr >= 3.0`, `numcodecs` (already pulled by zarr), `fsspec` (already a dependency), `lsst.resources.ResourcePath` (already a dependency), `pydantic >= 2.12`, `numpy >= 2.0`. Reuses `lsst.images.serialization` ABCs and tree models. Optional install via `pip install lsst-images[zarr]`. + +**Critical invariant — lazy reads everywhere:** +- `ZarrArray.data` holds either `np.ndarray` (when staged for write) **or** a `zarr.Array` handle (when read). It never copies bytes during `from_zarr`. +- `ZarrInputArchive.get_array(model, slices=...)` resolves the model's zarr path, fetches the lazy `zarr.Array`, and applies `slices` *to the handle* (`arr[slices]`). For a remote VisitImage this means only the chunks intersecting `slices` are downloaded. +- The `_layout.py` ColorImage transpose `(c, y, x) → (y, x, c)` happens *after* slicing, on the (small) sliced result — never on the full array. +- Tests under `test_zarr_input_archive.py` assert this with a mock `zarr.storage.Store` that records every key access; subset reads must touch a strict subset of the chunks of the full read. 
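The chunk-subset property stated in the invariant above can be illustrated without any store. The sketch below is not part of the planned backend: `chunks_touched` is a hypothetical helper, introduced here only for illustration, that computes which cells of a regular chunk grid a unit-step slice selection intersects. It assumes unsharded chunks; with sharding enabled, the unit of a remote request may instead be a byte range within a shard object.

```python
from itertools import product


def chunks_touched(shape, chunks, slices):
    """Return the chunk-grid indices that a slice selection intersects.

    Hypothetical illustration helper (not part of zarr-python or of the
    planned backend). Assumes a regular chunk grid and unit-step slices.
    """
    per_axis = []
    for dim, chunk, sl in zip(shape, chunks, slices):
        # Normalize None / negative bounds against the axis length.
        start, stop, step = sl.indices(dim)
        if step != 1:
            raise ValueError("only unit-step slices are handled in this sketch")
        if stop <= start:
            per_axis.append(range(0))  # empty selection touches no chunks
        else:
            # First and last chunk index covered along this axis.
            per_axis.append(range(start // chunk, (stop - 1) // chunk + 1))
    return set(product(*per_axis))


shape, chunks = (4096, 4096), (1024, 1024)
full = chunks_touched(shape, chunks, (slice(None), slice(None)))
cutout = chunks_touched(shape, chunks, (slice(0, 512), slice(1000, 1100)))

print(len(full))       # 16 chunks in the full 4x4 grid
print(sorted(cutout))  # [(0, 0), (0, 1)]
assert cutout < full   # the cutout touches a strict subset of the chunks
```

This is the relationship the key-recording store test is meant to check: the keys requested for a cutout should correspond to a strict subset of those requested for a full read.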
+ +--- + +## File Structure + +``` +python/lsst/images/zarr/ +├── __init__.py guarded `import zarr`; re-exports public API +├── _common.py ZarrPointerModel, attribute namespace constants +│ ("lsst:" / "ome:" prefixes plus the LSST_VERSION +│ schema integer), ZarrCompressionOptions dataclass, +│ path/JSON-pointer helpers +├── _model.py IR: ZarrAttributes, ZarrArray, ZarrGroup, +│ ZarrDocument, OmeMultiscale, OmeOmero, +│ LsstMaskGroup, LsstTableGroup, with from_zarr / +│ to_zarr methods that DO NOT materialize array data +├── _layout.py Layout rules: archive-class → axes mapping, +│ JSON-pointer ↔ zarr-path translation, default +│ chunk/shard derivation, ColorImage axis transpose +│ applied to (already-sliced) arrays +├── _store.py URI → zarr.storage.Store wrapper: +│ *.zarr.zip → ZipStore, http(s)/s3/gs → FsspecStore, +│ local → LocalStore. Honors create-only mode. +├── _output_archive.py ZarrOutputArchive (populates IR) and write() helper +├── _input_archive.py ZarrInputArchive (reads IR lazily) and read() helper + +tests/ +├── test_zarr_common.py constants, helpers, ZarrCompressionOptions +├── test_zarr_model.py IR round-trip via in-memory MemoryStore +├── test_zarr_layout.py axes per archive class, pointer translation +├── test_zarr_store.py URI dispatch, create-only refusal +├── test_zarr_output_archive.py write paths inspected against IR +├── test_zarr_input_archive.py read paths + lazy subset assertions +├── test_zarr_round_trip.py full write→read for every type +├── test_zarr_ome_compliance.py ngff-validator (skipped if absent) +└── test_zarr_external_reader.py ome-zarr-py sanity (skipped if absent) +``` + +Files in this layout follow the NDF backend's split: `_model.py` is pure data; `_output_archive.py` and `_input_archive.py` only translate between the IR and the abstract archive interface. 
Layout decisions live in `_layout.py` so per-class tweaks (ColorImage's `c` axis, CellCoadd's cell-aligned chunks) are made once against the populated IR rather than scattered through `add_array` calls. + +--- + +## Phase 1 — Skeleton, `_common.py`, and IR (no I/O yet) + +This phase produces the IR and constants in isolation. It can be merged independently — there is no archive yet, but the IR round-trips through an in-memory zarr `MemoryStore` so we can prove the shape of what the later archives will produce. + +### Task 1.1: Create the package skeleton + +**Files:** +- Create: `python/lsst/images/zarr/__init__.py` +- Modify: `pyproject.toml:55` (add `zarr` extra after the existing `ndf` extra) + +- [ ] **Step 1: Add the optional dependency** + +Edit `pyproject.toml` after line 55 (the `ndf` extra): + +```toml +# Add feature for OME-Zarr v0.5 read/write support. +zarr = ["zarr >= 3.0"] +``` + +- [ ] **Step 2: Create the package `__init__.py` with a guarded import** + +Create `python/lsst/images/zarr/__init__.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +"""OME-Zarr v0.5 archive backend for `lsst.images`. + +Files written by this archive are valid OME-Zarr v0.5 NGFF images at every +level where image-shaped data lives, augmented with namespaced ``lsst:`` +extensions for mask plane semantics, AST WCS round-trip, the LSST archive +tree, table layout, and cell-grid hints. External tools that consume +OME-Zarr (`napari`, `neuroglancer`, `ome-zarr-py`) can render the science +arrays without LSST-specific awareness. 
+ +Default chunk geometry is tile-aligned (~1024×1024 for plain images, +``cell_shape`` for ``CellCoadd``). Sharding (zarr v3 native) is enabled by +default with a tunable shard size to keep object counts manageable on +S3/GCS. Both ``LocalStore`` and ``ZipStore`` are supported; the choice +is driven by URI shape (``*.zarr.zip`` → ``ZipStore``, otherwise +directory). Remote URIs go through `lsst.resources.ResourcePath` and +`fsspec`. +""" + +try: + import zarr # noqa: F401 +except ImportError as e: + raise ImportError( + "lsst.images.zarr requires the optional 'zarr' package (>=3.0). " + "Install it directly or via 'pip install lsst-images[zarr]'." + ) from e + +# Phase 1 has no public archive API yet. Re-exports are added in later phases. +``` + +- [ ] **Step 3: Verify the guarded import works** + +Run: `python -c "import lsst.images.zarr"` +Expected: no output (success), or a clear ImportError pointing at the `[zarr]` extra if `zarr` isn't installed. + +- [ ] **Step 4: Commit** + +```bash +git add python/lsst/images/zarr/__init__.py pyproject.toml +git commit -m "feat: add lsst.images.zarr package skeleton with guarded import" +``` + +### Task 1.2: `_common.py` — constants, `ZarrPointerModel`, `ZarrCompressionOptions` + +**Files:** +- Create: `python/lsst/images/zarr/_common.py` +- Test: `tests/test_zarr_common.py` + +- [ ] **Step 1: Write the failing test** + +Create `tests/test_zarr_common.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +import unittest + +try: + from lsst.images.zarr._common import ( + LSST_NS, + LSST_VERSION, + OME_NS, + OME_VERSION, + ZarrCompressionOptions, + ZarrPointerModel, + archive_path_to_zarr_path, + json_pointer_to_zarr_path, + ) + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class CommonTestCase(unittest.TestCase): + def test_pointer_round_trips(self) -> None: + original = ZarrPointerModel(path="/lsst/mask") + recovered = ZarrPointerModel.model_validate_json(original.model_dump_json()) + self.assertEqual(recovered, original) + + def test_constants(self) -> None: + self.assertEqual(LSST_NS, "lsst") + self.assertEqual(OME_NS, "ome") + self.assertEqual(OME_VERSION, "0.5") + self.assertGreaterEqual(LSST_VERSION, 1) + + def test_archive_path_to_zarr_path(self) -> None: + # Empty archive path → top-level main JSON tree under lsst/. + self.assertEqual(archive_path_to_zarr_path(""), "/lsst/tree") + # Archive sub-paths land under lsst/ verbatim (no uppercase). + self.assertEqual(archive_path_to_zarr_path("/psf"), "/lsst/psf") + self.assertEqual(archive_path_to_zarr_path("/psf/coefficients"), "/lsst/psf/coefficients") + + def test_json_pointer_to_zarr_path(self) -> None: + # JSON pointer to a top-level array attribute → top-level zarr "0". + self.assertEqual(json_pointer_to_zarr_path("/image"), "/0") + # JSON pointer to a mask attribute → /lsst/mask/0. + self.assertEqual(json_pointer_to_zarr_path("/mask"), "/lsst/mask/0") + # Unknown JSON pointers fall through to a literal /lsst/ mapping. + self.assertEqual(json_pointer_to_zarr_path("/companions/extra"), "/lsst/companions/extra") + + def test_compression_options_default(self) -> None: + defaults = ZarrCompressionOptions.default_for_dtype("float32") + self.assertEqual(defaults.codec, "blosc") + # bitshuffle for ints, byte-shuffle for floats. 
+ self.assertEqual(defaults.shuffle, "shuffle") + int_defaults = ZarrCompressionOptions.default_for_dtype("uint8") + self.assertEqual(int_defaults.shuffle, "bitshuffle") + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_common.py -v` +Expected: FAIL — `ImportError` on `lsst.images.zarr._common`. + +- [ ] **Step 3: Write `_common.py`** + +Create `python/lsst/images/zarr/_common.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +__all__ = ( + "LSST_NS", + "LSST_VERSION", + "OME_NS", + "OME_VERSION", + "ZarrCompressionOptions", + "ZarrPointerModel", + "archive_path_to_zarr_path", + "json_pointer_to_zarr_path", +) + +from dataclasses import dataclass +from typing import ClassVar, Self + +import pydantic + +LSST_NS = "lsst" +"""Top-level zarr-attributes namespace key for LSST extensions.""" + +OME_NS = "ome" +"""Top-level zarr-attributes namespace key for OME-NGFF metadata.""" + +OME_VERSION = "0.5" +"""OME-Zarr / NGFF version this backend writes.""" + +LSST_VERSION = 1 +"""Schema version of the ``lsst:`` extension this backend writes. + +Readers refuse versions newer than they understand. Bump on +backwards-incompatible changes to the on-disk layout. +""" + + +# Well-known archive-attribute → zarr-path mappings used by the layout rules. +# Keys are JSON-pointer suffixes (the ``name=`` argument passed to +# ``add_array``); values are absolute zarr paths inside the archive root. +# Anything not in this map falls through to a literal /lsst/ mapping. 
+_JSON_POINTER_TO_ZARR_PATH: dict[str, str] = { + "/image": "/0", + "/mask": "/lsst/mask/0", + "/variance": "/lsst/variance/0", + "/red": "/0", # ColorImage channels are stacked along the c axis at /0. + "/green": "/0", + "/blue": "/0", +} + + +class ZarrPointerModel(pydantic.BaseModel): + """Reference to a zarr archive sub-tree by absolute zarr path. + + Used by `ZarrOutputArchive`/`ZarrInputArchive` to point to sub-trees + that have been hoisted out of the main JSON tree into separate zarr + groups. The path is interpreted relative to the archive root, e.g. + ``"/lsst/psf"``. + """ + + path: str + """Absolute zarr path (e.g. ``/lsst/psf``).""" + + +@dataclass(frozen=True) +class ZarrCompressionOptions: + """Per-array zarr v3 codec configuration. + + The default codec stack is ``bytes -> blosc(zstd, clevel=5)`` with + byte-shuffle for floats and bit-shuffle for integers (and masks). + All defaults are overridable per-array via the ``compression`` + keyword to ``write()``, keyed by the JSON pointer of the attribute + the array backs. + """ + + codec: str = "blosc" + cname: str = "zstd" + clevel: int = 5 + shuffle: str = "shuffle" # 'shuffle' (byte) or 'bitshuffle' or 'noshuffle' + + DEFAULT_FLOAT: ClassVar[Self] + DEFAULT_INT: ClassVar[Self] + + @classmethod + def default_for_dtype(cls, dtype: str) -> Self: + """Return the default codec stack for a numpy dtype name.""" + if dtype.startswith(("u", "i")) or dtype == "bool": + return cls.DEFAULT_INT + return cls.DEFAULT_FLOAT + + +ZarrCompressionOptions.DEFAULT_FLOAT = ZarrCompressionOptions(shuffle="shuffle") +ZarrCompressionOptions.DEFAULT_INT = ZarrCompressionOptions(shuffle="bitshuffle") + + +def archive_path_to_zarr_path(archive_path: str) -> str: + """Translate a serialization archive path to its zarr path. + + The empty archive path maps to the main JSON tree at ``/lsst/tree``. + Non-empty archive paths are kept verbatim under ``/lsst/`` (zarr keys + have no length limit, unlike HDS). 
+ """ + if not archive_path: + return "/lsst/tree" + stripped = archive_path.strip("/") + return f"/{LSST_NS}/{stripped}" + + +def json_pointer_to_zarr_path(pointer: str) -> str: + """Translate a JSON pointer attribute name to a zarr path. + + Used by the output archive's ``add_array`` to figure out where in the + zarr store an array referenced by a Pydantic field should be + materialized. Falls through to ``archive_path_to_zarr_path`` for + unrecognised attribute names. + """ + if pointer in _JSON_POINTER_TO_ZARR_PATH: + return _JSON_POINTER_TO_ZARR_PATH[pointer] + return archive_path_to_zarr_path(pointer) +``` + +- [ ] **Step 4: Run the test to verify it passes** + +Run: `pytest tests/test_zarr_common.py -v` +Expected: PASS — 5 tests pass. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_common.py tests/test_zarr_common.py +git commit -m "feat: add ZarrPointerModel, ZarrCompressionOptions, and path helpers" +``` + +### Task 1.3: IR — `ZarrAttributes` and `ZarrArray` with lazy backing + +**Files:** +- Create: `python/lsst/images/zarr/_model.py` +- Test: `tests/test_zarr_model.py` + +This task introduces the IR types whose **lazy-array invariant** is the heart of the efficient subsetting story. `ZarrArray.data` is one of: + +- `numpy.ndarray` — staged for write (the user just called `add_array`) +- `zarr.Array` — read from a store, **never sliced eagerly** + +A read of a remote VisitImage opens its `zarr.Array` handle through `from_zarr`. Subsequent slicing (in `InputArchive.get_array(model, slices=...)`) goes straight to that handle, so only the chunks intersecting the slice are downloaded. + +- [ ] **Step 1: Write the failing test** + +Create `tests/test_zarr_model.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). 
+# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import unittest + +import numpy as np + +try: + import zarr + + from lsst.images.zarr._common import LSST_NS, LSST_VERSION, OME_NS, OME_VERSION + from lsst.images.zarr._model import ZarrArray, ZarrAttributes + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrAttributesTestCase(unittest.TestCase): + def test_dump_separates_namespaces(self) -> None: + attrs = ZarrAttributes() + attrs.lsst["archive_class"] = "Image" + attrs.ome["multiscales"] = [{"name": "image"}] + dumped = attrs.dump() + self.assertEqual(dumped[LSST_NS]["archive_class"], "Image") + self.assertEqual(dumped[LSST_NS]["version"], LSST_VERSION) + self.assertEqual(dumped[OME_NS]["multiscales"], [{"name": "image"}]) + self.assertEqual(dumped[OME_NS]["version"], OME_VERSION) + + def test_load_preserves_unknown_keys(self) -> None: + # Forward compatibility: unknown lsst.* keys must survive a + # load → dump round-trip so a partial-knowledge reader can + # still re-emit them on write. + raw = { + LSST_NS: {"version": LSST_VERSION, "archive_class": "Image", "future_thing": {"x": 1}}, + OME_NS: {"version": OME_VERSION, "multiscales": []}, + } + attrs = ZarrAttributes.load(raw) + dumped = attrs.dump() + self.assertEqual(dumped[LSST_NS]["future_thing"], {"x": 1}) + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrArrayTestCase(unittest.TestCase): + def test_lazy_data_after_from_zarr(self) -> None: + # Write an array to an in-memory store and load via from_zarr. + # The IR must hold the zarr.Array handle, NOT a numpy array. 
+ store = zarr.storage.MemoryStore() + root = zarr.create_group(store=store, zarr_format=3) + zarr_array = root.create_array(name="0", shape=(8, 8), chunks=(4, 4), dtype="float32") + zarr_array[:] = np.arange(64, dtype=np.float32).reshape(8, 8) + + ir_array = ZarrArray.from_zarr(zarr_array) + # Critical invariant: data is the lazy zarr.Array, not numpy. + self.assertIsInstance(ir_array.data, zarr.Array) + self.assertNotIsInstance(ir_array.data, np.ndarray) + # Shape/dtype come from the lazy handle without reading bytes. + self.assertEqual(ir_array.shape, (8, 8)) + self.assertEqual(str(ir_array.dtype), "float32") + + def test_subset_does_not_materialize_full_array(self) -> None: + # The IR must let callers slice through to the lazy handle so + # only the touched chunks are read. We simulate "remote bytes + # consumed" by counting key fetches against the underlying store. + store = _CountingStore() + root = zarr.create_group(store=store, zarr_format=3) + zarr_array = root.create_array(name="0", shape=(16, 16), chunks=(4, 4), dtype="int32") + zarr_array[:] = np.arange(256, dtype=np.int32).reshape(16, 16) + store.reads = 0 # reset after the write phase + + ir_array = ZarrArray.from_zarr(zarr_array) + # Reading ir_array.shape must not fetch any chunk data. + self.assertEqual(ir_array.shape, (16, 16)) + self.assertEqual(store.reads, 0) + + # A 4×4 subset spans exactly one chunk; reading it must touch at + # most that chunk's data key (plus zero or one metadata keys). + subset = ir_array.read(slices=(slice(0, 4), slice(0, 4))) + self.assertEqual(subset.shape, (4, 4)) + np.testing.assert_array_equal(subset, np.arange(256).reshape(16, 16)[:4, :4]) + # 16 chunks total in the array; we should have touched far fewer. + self.assertLess(store.reads, 16) + + def test_staged_numpy_array_is_eager(self) -> None: + # Pre-write IR (caller just called add_array): data is numpy. 
+ data = np.arange(12, dtype=np.float64).reshape(3, 4) + ir_array = ZarrArray(data=data) + self.assertIs(ir_array.data, data) + self.assertEqual(ir_array.shape, (3, 4)) + + +class _CountingStore(zarr.storage.MemoryStore if HAVE_ZARR else object): + """A MemoryStore that counts get() calls so we can prove subset reads + only touch the chunks they need. + """ + + def __init__(self) -> None: + super().__init__() + self.reads = 0 + + async def get(self, key, prototype, byte_range=None): # type: ignore[override] + self.reads += 1 + return await super().get(key, prototype, byte_range) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_model.py -v` +Expected: FAIL — `ImportError` on `lsst.images.zarr._model`. + +- [ ] **Step 3: Write `_model.py` (initial portion: `ZarrAttributes` and `ZarrArray`)** + +Create `python/lsst/images/zarr/_model.py` with this content (further IR types are appended in Task 1.4): + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +"""Python intermediate representation for OME-Zarr / lsst-extension content. + +The IR is the source of truth for what gets written. ``ZarrOutputArchive`` +populates a `ZarrDocument`; on context-manager exit, `to_zarr` materializes +it through a configured ``zarr.storage.Store``. + +Reads invert that flow: ``ZarrInputArchive`` opens the store and calls +`ZarrDocument.from_zarr`, which builds the IR around **lazy** ``zarr.Array`` +handles. No array bytes are read until a caller asks for them via +`ZarrArray.read`, which forwards slices straight to the underlying handle. 
+This keeps subset reads of remote files cheap: only the chunks intersecting +the requested slice are fetched. +""" + +from __future__ import annotations + +__all__ = ( + "ZarrArray", + "ZarrAttributes", +) + +from dataclasses import dataclass, field +from types import EllipsisType +from typing import Any, Self + +import numpy as np +import zarr + +from ._common import LSST_NS, LSST_VERSION, OME_NS, OME_VERSION, ZarrCompressionOptions + + +@dataclass +class ZarrAttributes: + """Namespaced attributes attached to a `ZarrGroup` or `ZarrArray`. + + The two top-level namespaces (``lsst:`` and ``ome:``) are kept + separate so the IR can serve both internal LSST round-trip needs + and external OME-Zarr discoverability without key collisions. + """ + + lsst: dict[str, Any] = field(default_factory=dict) + ome: dict[str, Any] = field(default_factory=dict) + + def dump(self) -> dict[str, Any]: + """Return the raw mapping zarr-python writes to ``zarr.json``.""" + out: dict[str, Any] = {} + # Always emit lsst namespace with a schema version; readers + # use this to gate forward-compat behavior. + out[LSST_NS] = {"version": LSST_VERSION, **self.lsst} + if self.ome: + out[OME_NS] = {"version": OME_VERSION, **self.ome} + return out + + @classmethod + def load(cls, raw: dict[str, Any]) -> Self: + """Construct from a raw attributes mapping read from zarr.""" + lsst = dict(raw.get(LSST_NS, {})) + lsst.pop("version", None) # version is implicit in the namespace + ome = dict(raw.get(OME_NS, {})) + ome.pop("version", None) + return cls(lsst=lsst, ome=ome) + + +@dataclass +class ZarrArray: + """An IR node holding either staged numpy data or a lazy zarr handle. + + Parameters + ---------- + data + Either a ``numpy.ndarray`` (when staged for write by the output + archive) or a ``zarr.Array`` (when read by the input archive). + The two forms never mix in a single instance. + chunks + Per-axis chunk shape. 
``None`` lets `to_zarr` derive a default + from the array shape (~1024 per axis for plain images). + shards + Per-axis shard shape (zarr v3 native). ``None`` lets `to_zarr` + derive a default of 4× the chunk shape per axis when the + resulting shard exceeds 1 MiB. + compression + Codec configuration. ``None`` falls back to + ``ZarrCompressionOptions.default_for_dtype(dtype)``. + attributes + Namespaced attributes for this array's ``zarr.json``. + """ + + data: np.ndarray | zarr.Array + chunks: tuple[int, ...] | None = None + shards: tuple[int, ...] | None = None + compression: ZarrCompressionOptions | None = None + attributes: ZarrAttributes = field(default_factory=ZarrAttributes) + + @property + def shape(self) -> tuple[int, ...]: + return tuple(self.data.shape) + + @property + def dtype(self) -> np.dtype: + return np.dtype(self.data.dtype) + + @classmethod + def from_zarr(cls, zarr_array: zarr.Array) -> Self: + """Wrap an open ``zarr.Array`` without reading its data.""" + attrs = ZarrAttributes.load(dict(zarr_array.attrs)) + # Reading shape/dtype off the open handle is metadata-only; chunks + # are read off the array's chunk_grid configuration. We do NOT + # call zarr_array[:] here. + return cls( + data=zarr_array, + chunks=tuple(zarr_array.chunks), + attributes=attrs, + ) + + def read(self, *, slices: tuple[slice, ...] | EllipsisType = ...) -> np.ndarray: + """Materialize this array (or a slice of it) into numpy. + + For a `ZarrArray` backed by a lazy handle, this is the only + place that touches array bytes. ``slices`` is forwarded straight + to the handle so only the chunks intersecting the slice are + fetched. + """ + if isinstance(self.data, np.ndarray): + return self.data if slices is ... else self.data[slices] + # zarr.Array supports lazy slicing via __getitem__. + return self.data[...] if slices is ... 
else self.data[slices] +``` + +- [ ] **Step 4: Run the test to verify it passes** + +Run: `pytest tests/test_zarr_model.py -v` +Expected: PASS — 5 tests pass (2 in `ZarrAttributesTestCase`, 3 in `ZarrArrayTestCase`); the `_CountingStore` test confirms a 4×4 subset of a 16×16 / chunks=(4,4) array triggers strictly fewer than 16 store reads. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_model.py tests/test_zarr_model.py +git commit -m "feat: add ZarrAttributes and ZarrArray IR with lazy zarr.Array backing" +``` + +### Task 1.4: IR — `ZarrGroup`, `ZarrDocument`, and store materialization + +**Files:** +- Modify: `python/lsst/images/zarr/_model.py` (append `ZarrGroup`, `ZarrDocument`, materialization helpers) +- Modify: `tests/test_zarr_model.py` (add round-trip test through `MemoryStore`) + +This task gives the IR a full tree shape and the bidirectional `to_zarr` / `from_zarr` materialization. The round-trip test pins the lazy invariant: `from_zarr` on a freshly-opened store does not read any chunk bytes; only `ZarrArray.read()` does. + +- [ ] **Step 1: Write the failing test (extend `test_zarr_model.py`)** + +Append this class to `tests/test_zarr_model.py` (before the `if __name__` guard): + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrDocumentTestCase(unittest.TestCase): + def test_round_trip_through_memory_store(self) -> None: + from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup + + # Build an IR: top-level array at /0, a sub-group at /lsst/mask + # with its own array at /lsst/mask/0. 
+ doc = ZarrDocument(root=ZarrGroup()) + doc.root.attributes.lsst["archive_class"] = "MaskedImage" + doc.root.arrays["0"] = ZarrArray(data=np.ones((4, 4), dtype="float32")) + mask = ZarrGroup() + mask.arrays["0"] = ZarrArray(data=np.zeros((1, 4, 4), dtype="uint8")) + mask.attributes.lsst["mask"] = {"planes": [{"name": "BAD", "bit": 0}]} + doc.root.groups["lsst"] = ZarrGroup(groups={"mask": mask}) + + store = zarr.storage.MemoryStore() + doc.to_zarr(store) + + # Reload and verify lazy invariant: ZarrArray.data is a zarr.Array, + # not a materialized ndarray. + recovered = ZarrDocument.from_zarr(store) + self.assertIsInstance(recovered.root.arrays["0"].data, zarr.Array) + self.assertEqual( + recovered.root.attributes.lsst["archive_class"], "MaskedImage" + ) + recovered_mask = recovered.root.groups["lsst"].groups["mask"] + np.testing.assert_array_equal( + recovered_mask.arrays["0"].read(), np.zeros((1, 4, 4), dtype="uint8") + ) +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_model.py::ZarrDocumentTestCase -v` +Expected: FAIL — `ImportError` for `ZarrGroup` / `ZarrDocument`. 
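The lazy invariant that these tests pin comes down to simple chunk arithmetic. The sketch below is standard library only; `chunks_touched` is a hypothetical helper for illustration, not part of the backend or of zarr-python. It counts how many chunks of a regular chunk grid a slice intersects, assuming unit-step slices with explicit, non-negative bounds.

```python
from math import prod


def chunks_touched(slices, chunks):
    """Count chunks of a regular grid intersected by unit-step slices.

    Assumes every slice has explicit start/stop with 0 <= start < stop.
    """
    per_axis = []
    for sl, size in zip(slices, chunks):
        first = sl.start // size       # index of the first chunk hit on this axis
        last = (sl.stop - 1) // size   # index of the last chunk hit on this axis
        per_axis.append(last - first + 1)
    return prod(per_axis)


# The 16x16 array with 4x4 chunks used in the tests above.
aligned = chunks_touched((slice(0, 4), slice(0, 4)), (4, 4))
straddling = chunks_touched((slice(2, 6), slice(2, 6)), (4, 4))
full = chunks_touched((slice(0, 16), slice(0, 16)), (4, 4))
```

An aligned 4×4 read intersects one chunk, a read straddling chunk boundaries intersects four, and a full read intersects all sixteen, which is why the counting-store test can assert on a read count well below the total.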
+ +- [ ] **Step 3: Append `ZarrGroup`, `ZarrDocument`, materialization helpers to `_model.py`** + +Update the `__all__` and append to `python/lsst/images/zarr/_model.py`: + +```python +__all__ = ( + "ZarrArray", + "ZarrAttributes", + "ZarrDocument", + "ZarrGroup", +) + + +@dataclass +class ZarrGroup: + """A zarr group: nested groups, arrays, and namespaced attributes.""" + + groups: dict[str, "ZarrGroup"] = field(default_factory=dict) + arrays: dict[str, ZarrArray] = field(default_factory=dict) + attributes: ZarrAttributes = field(default_factory=ZarrAttributes) + + def get(self, path: str) -> "ZarrGroup | ZarrArray": + """Return a child by absolute or relative zarr path.""" + if path in ("", "/"): + return self + parts = [p for p in path.strip("/").split("/") if p] + cursor: ZarrGroup | ZarrArray = self + for part in parts: + if not isinstance(cursor, ZarrGroup): + raise KeyError(path) + if part in cursor.arrays: + cursor = cursor.arrays[part] + elif part in cursor.groups: + cursor = cursor.groups[part] + else: + raise KeyError(path) + return cursor + + def ensure_group(self, path: str) -> "ZarrGroup": + """Return or create a sub-group at ``path``.""" + if path in ("", "/"): + return self + parts = [p for p in path.strip("/").split("/") if p] + cursor = self + for part in parts: + if part in cursor.arrays: + raise KeyError(f"{part!r} already exists as an array.") + if part not in cursor.groups: + cursor.groups[part] = ZarrGroup() + cursor = cursor.groups[part] + return cursor + + +@dataclass +class ZarrDocument: + """A complete OME-Zarr archive root.""" + + root: ZarrGroup = field(default_factory=ZarrGroup) + + @classmethod + def from_zarr(cls, store: zarr.storage.Store) -> Self: + """Open ``store`` and build a lazy IR view of its contents.""" + zarr_root = zarr.open_group(store=store, mode="r", zarr_format=3) + return cls(root=_group_from_zarr(zarr_root)) + + def to_zarr(self, store: zarr.storage.Store) -> None: + """Materialize this IR into ``store``. 
+ + The store is expected to be empty; callers (the output archive's + write helper) are responsible for create-only enforcement. + """ + zarr_root = zarr.create_group(store=store, zarr_format=3, overwrite=False) + _group_to_zarr(self.root, zarr_root) + + +def _group_from_zarr(zarr_group: zarr.Group) -> ZarrGroup: + """Build a lazy `ZarrGroup` IR from an open ``zarr.Group``.""" + ir = ZarrGroup(attributes=ZarrAttributes.load(dict(zarr_group.attrs))) + for name, child in zarr_group.members(): + if isinstance(child, zarr.Array): + ir.arrays[name] = ZarrArray.from_zarr(child) + else: + ir.groups[name] = _group_from_zarr(child) + return ir + + +def _group_to_zarr(ir: ZarrGroup, zarr_group: zarr.Group) -> None: + """Write a `ZarrGroup` IR into an open ``zarr.Group``.""" + if dumped := ir.attributes.dump(): + zarr_group.update_attributes(dumped) + for name, sub in ir.groups.items(): + sub_zarr = zarr_group.create_group(name) + _group_to_zarr(sub, sub_zarr) + for name, array in ir.arrays.items(): + if not isinstance(array.data, np.ndarray): + raise TypeError( + f"Cannot write ZarrArray at {name!r}: data is a lazy zarr.Array, " + "not numpy. Read it first or pass a fresh numpy array." + ) + chunks = array.chunks or _default_chunks(array.data.shape) + compression = array.compression or ZarrCompressionOptions.default_for_dtype( + str(array.dtype) + ) + # zarr-python 3 ``create_array`` takes ``compressors`` (not ``codecs``) + # alongside ``shards``; the bytes serializer is left at its default. + compressors = _build_codecs(compression) + zarr_array = zarr_group.create_array( + name=name, + shape=array.data.shape, + chunks=chunks, + dtype=array.data.dtype, + shards=array.shards, + compressors=compressors, + ) + zarr_array[:] = array.data + if dumped := array.attributes.dump(): + zarr_array.update_attributes(dumped) + + +def _default_chunks(shape: tuple[int, ...]) -> tuple[int, ...]: + """Default chunk shape: min(1024, dim) per axis.""" + return tuple(min(1024, dim) for dim in shape) + + +def _build_codecs(options: ZarrCompressionOptions) -> list[Any]: + """Build the zarr v3 compressor list from `ZarrCompressionOptions`. 
+ + Only the bytes-to-bytes compressors are returned; ``create_array`` + supplies the default bytes serializer itself. Blosc supports + byte-shuffle and bit-shuffle via its ``shuffle`` argument, which + zarr's native ``BloscCodec`` accepts as a string. + """ + if options.codec != "blosc": + raise NotImplementedError(f"Unsupported codec {options.codec!r}.") + return [ + zarr.codecs.BloscCodec(cname=options.cname, clevel=options.clevel, shuffle=options.shuffle), + ] +``` + +- [ ] **Step 4: Run all model tests** + +Run: `pytest tests/test_zarr_model.py -v` +Expected: PASS — all 6 tests pass; the round-trip test confirms `recovered.root.arrays["0"].data` is a `zarr.Array`, not a numpy array. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_model.py tests/test_zarr_model.py +git commit -m "feat: add ZarrGroup and ZarrDocument with lazy-on-read materialization" +``` + +### Task 1.5: IR — OME multiscales, omero channels, mask group, table group + +**Files:** +- Modify: `python/lsst/images/zarr/_model.py` (append OME / LSST helper dataclasses) +- Modify: `tests/test_zarr_model.py` (add helper-construction test) + +These small dataclasses centralize the OME and LSST attribute shapes so `_layout.py` can populate them without literal-dict-typo bugs. They round-trip through `ZarrAttributes.dump()` / `load()`. 
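As a rough illustration of that round-trip contract, the sketch below uses stand-in names (`PlaneEntry`, `MaskAttrs`, `wrap`, and `unwrap` are hypothetical, not the classes this task defines) to show a typed helper dumping to a plain dict, being nested under the `lsst` namespace key with a version stamp, and being recovered again. It uses only the standard library; the real helpers and `ZarrAttributes` may differ in detail.

```python
from dataclasses import dataclass, field
from typing import Any

LSST_NS = "lsst"   # namespace key, mirroring _common.py
LSST_VERSION = 1   # schema version, mirroring _common.py


@dataclass
class PlaneEntry:
    """Hypothetical stand-in for one mask-plane definition."""

    name: str
    bit: int
    description: str = ""


@dataclass
class MaskAttrs:
    """Hypothetical stand-in for a typed ``lsst.mask`` helper."""

    planes: list[PlaneEntry] = field(default_factory=list)

    def dump(self) -> dict[str, Any]:
        # Emit plain JSON-compatible dicts, one per plane.
        return {"planes": [vars(p).copy() for p in self.planes]}

    @classmethod
    def load(cls, raw: dict[str, Any]) -> "MaskAttrs":
        return cls(planes=[PlaneEntry(**p) for p in raw.get("planes", [])])


def wrap(lsst: dict[str, Any]) -> dict[str, Any]:
    """Nest attributes under the lsst namespace with a version stamp."""
    return {LSST_NS: {"version": LSST_VERSION, **lsst}}


def unwrap(raw: dict[str, Any]) -> dict[str, Any]:
    """Inverse of ``wrap``: drop the version, keep every other key."""
    lsst = dict(raw.get(LSST_NS, {}))
    lsst.pop("version", None)
    return lsst


original = MaskAttrs(planes=[PlaneEntry(name="BAD", bit=0, description="Bad pixel.")])
on_disk = wrap({"mask": original.dump()})
recovered = MaskAttrs.load(unwrap(on_disk)["mask"])
```

Keeping the typed helper separate from the namespacing step means the helper never needs to know about the version key, and unknown sibling keys under `lsst` pass through `unwrap` untouched.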
+ +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_model.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class OmeAndLsstHelpersTestCase(unittest.TestCase): + def test_multiscale_emits_expected_shape(self) -> None: + from lsst.images.zarr._model import OmeMultiscale + + m = OmeMultiscale(name="visitimage", axes=("y", "x")) + d = m.dump() + self.assertEqual(d["name"], "visitimage") + self.assertEqual( + d["axes"], [{"name": "y", "type": "space"}, {"name": "x", "type": "space"}] + ) + self.assertEqual(d["datasets"], [{"path": "0"}]) + + def test_lsst_mask_group_round_trip(self) -> None: + from lsst.images.zarr._model import LsstMaskGroup, MaskPlaneEntry + + group = LsstMaskGroup( + planes=[MaskPlaneEntry(name="BAD", bit=0, description="Bad pixel.")] + ) + dumped = group.dump() + recovered = LsstMaskGroup.load(dumped) + self.assertEqual(len(recovered.planes), 1) + self.assertEqual(recovered.planes[0].name, "BAD") + self.assertEqual(recovered.planes[0].bit, 0) +``` + +- [ ] **Step 2: Run to verify failure** + +Run: `pytest tests/test_zarr_model.py::OmeAndLsstHelpersTestCase -v` +Expected: FAIL — `ImportError`. + +- [ ] **Step 3: Append OME and LSST helpers to `_model.py`** + +Update the `__all__` and append to `python/lsst/images/zarr/_model.py`: + +```python +__all__ = ( + "LsstMaskGroup", + "LsstTableColumn", + "LsstTableGroup", + "MaskPlaneEntry", + "OmeMultiscale", + "OmeOmeroChannel", + "ZarrArray", + "ZarrAttributes", + "ZarrDocument", + "ZarrGroup", +) + + +@dataclass +class OmeMultiscale: + """OME-NGFF v0.5 multiscales metadata for a single-level image. + + The backend always writes one level (``path=0``); pyramid generation + is a follow-up tracked in the design spec. + """ + + name: str + axes: tuple[str, ...] 
+ coordinate_transformations: list[dict[str, Any]] = field(default_factory=list) + + @staticmethod + def _axis_type(name: str) -> str: + if name == "c": + return "channel" + if name == "t": + return "time" + return "space" + + def dump(self) -> dict[str, Any]: + dataset: dict[str, Any] = {"path": "0"} + if self.coordinate_transformations: + dataset["coordinateTransformations"] = list(self.coordinate_transformations) + return { + "name": self.name, + "axes": [{"name": a, "type": self._axis_type(a)} for a in self.axes], + "datasets": [dataset], + } + + +@dataclass +class OmeOmeroChannel: + """OME ``omero/channels`` entry for one channel of a multi-channel image.""" + + label: str + color: str | None = None + + def dump(self) -> dict[str, Any]: + out: dict[str, Any] = {"label": self.label} + if self.color is not None: + out["color"] = self.color + return out + + +@dataclass +class MaskPlaneEntry: + """One mask-plane definition under ``lsst.mask.planes``.""" + + name: str + bit: int + description: str = "" + + +@dataclass +class LsstMaskGroup: + """Helper for ``lsst.mask`` attributes on a mask sub-group.""" + + planes: list[MaskPlaneEntry] = field(default_factory=list) + + def dump(self) -> dict[str, Any]: + return { + "planes": [ + {"name": p.name, "bit": p.bit, "description": p.description} + for p in self.planes + ] + } + + @classmethod + def load(cls, raw: dict[str, Any]) -> Self: + return cls( + planes=[ + MaskPlaneEntry( + name=p["name"], + bit=int(p["bit"]), + description=p.get("description", ""), + ) + for p in raw.get("planes", []) + ] + ) + + +@dataclass +class LsstTableColumn: + """One column entry under ``lsst.table.columns``.""" + + name: str + dtype: str + unit: str | None = None + description: str = "" + + +@dataclass +class LsstTableGroup: + """Helper for ``lsst.table`` attributes on a table sub-group.""" + + columns: list[LsstTableColumn] = field(default_factory=list) + length: int = 0 + meta: dict[str, Any] = field(default_factory=dict) + + def 
dump(self) -> dict[str, Any]: + return { + "columns": [ + { + "name": c.name, + "dtype": c.dtype, + "unit": c.unit, + "description": c.description, + } + for c in self.columns + ], + "length": self.length, + "meta": self.meta, + } +``` + +- [ ] **Step 4: Run all model tests** + +Run: `pytest tests/test_zarr_model.py -v` +Expected: PASS — 8 tests. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_model.py tests/test_zarr_model.py +git commit -m "feat: add OmeMultiscale, OmeOmeroChannel, LsstMaskGroup, LsstTableGroup helpers" +``` + +--- + +## Phase 2 — Store dispatch, layout rules, and `ZarrOutputArchive` (Image / MaskedImage / VisitImage) + +This phase adds enough machinery to **write** a plain `Image`, a `MaskedImage`, and a `VisitImage` to a zarr archive, either as a directory on disk or as a `ZipStore`. No reading yet — that lands in Phase 3 — so tests inspect the on-disk shape via `ZarrDocument.from_zarr()` directly (the same lazy IR) and assert on attributes/paths/shapes. + +`ColorImage` and `CellCoadd` are deferred to Phase 4. `add_table` / `add_structured_array` produce native zarr tables in this phase (the JSON tree carries the references) so VisitImage's tabular components round-trip. + +### Task 2.1: `_store.py` — URI → `zarr.storage.Store` dispatch + +**Files:** +- Create: `python/lsst/images/zarr/_store.py` +- Test: `tests/test_zarr_store.py` + +The store layer is the only place that knows about `lsst.resources.ResourcePath`. The output archive and input archive both call `open_store_for_write(uri, ...)` / `open_store_for_read(uri)` and treat the result as an opaque `zarr.storage.Store`. URI dispatch: + +| URI shape | Store | +|---|---| +| `*.zarr.zip` (any scheme) | `zarr.storage.ZipStore` | +| `file://` or local path | `zarr.storage.LocalStore` | +| `http(s)://`, `s3://`, `gs://`, etc. 
| `zarr.storage.FsspecStore` (via `fsspec.url_to_fs`) | + +Create-only mode is enforced here, not in `to_zarr`: the write helpers refuse to open a non-empty existing store. + +- [ ] **Step 1: Write the failing test** + +Create `tests/test_zarr_store.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import os +import tempfile +import unittest + +try: + import zarr + + from lsst.images.zarr._store import open_store_for_read, open_store_for_write + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class StoreDispatchTestCase(unittest.TestCase): + def test_local_directory(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + with open_store_for_write(target) as store: + self.assertIsInstance(store, zarr.storage.LocalStore) + zarr.create_group(store=store, zarr_format=3) + # Re-opening for read returns a usable store. 
+ with open_store_for_read(target) as store: + self.assertIsInstance(store, zarr.storage.LocalStore) + root = zarr.open_group(store=store, mode="r") + self.assertEqual(list(root.keys()), []) + + def test_zip_store(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr.zip") + with open_store_for_write(target) as store: + self.assertIsInstance(store, zarr.storage.ZipStore) + zarr.create_group(store=store, zarr_format=3) + with open_store_for_read(target) as store: + self.assertIsInstance(store, zarr.storage.ZipStore) + + def test_create_only_refuses_existing(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + with open_store_for_write(target) as store: + zarr.create_group(store=store, zarr_format=3) + with self.assertRaisesRegex(OSError, "already exists"): + with open_store_for_write(target): + pass + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_store.py -v` +Expected: FAIL — `ImportError` on `lsst.images.zarr._store`. + +- [ ] **Step 3: Write `_store.py`** + +Create `python/lsst/images/zarr/_store.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +__all__ = ("open_store_for_read", "open_store_for_write") + +import os +from collections.abc import Iterator +from contextlib import contextmanager + +import zarr + +from lsst.resources import ResourcePath, ResourcePathExpression + + +def _is_zip(rp: ResourcePath) -> bool: + # Only ``*.zarr.zip`` selects ZipStore, matching the dispatch table. + return rp.path.endswith(".zarr.zip") + + +def _is_remote(rp: ResourcePath) -> bool: + return rp.scheme not in ("", "file") + + +@contextmanager +def open_store_for_write(path: ResourcePathExpression) -> Iterator[zarr.storage.Store]: + """Open a zarr store for writing. + + Refuses to overwrite an existing non-empty store. The returned + context manager closes the store on exit; for ``ZipStore`` this + finalizes the central directory (atomic write). + """ + rp = ResourcePath(path) + if _is_zip(rp): + local = rp.ospath if not _is_remote(rp) else None + if local is None: + raise NotImplementedError("Remote ZipStore writes are a follow-up.") + if os.path.exists(local) and os.path.getsize(local) > 0: + raise OSError(f"File {local!r} already exists.") + store = zarr.storage.ZipStore(local, mode="w") + try: + yield store + finally: + store.close() + return + if _is_remote(rp): + import fsspec + + fs, fs_path = fsspec.url_to_fs(str(rp)) + # Create-only check: refuse if the target already holds any keys. + if fs.exists(fs_path) and fs.ls(fs_path): + raise OSError(f"Store {rp!s} already exists.") + store = zarr.storage.FsspecStore(fs=fs, path=fs_path, read_only=False) + try: + yield store + finally: + pass # FsspecStore has no explicit close. + return + # Local directory. 
+ local = rp.ospath + if os.path.exists(local) and os.listdir(local): + raise OSError(f"Directory {local!r} already exists and is non-empty.") + os.makedirs(local, exist_ok=True) + store = zarr.storage.LocalStore(local, read_only=False) + try: + yield store + finally: + pass + + +@contextmanager +def open_store_for_read(path: ResourcePathExpression) -> Iterator[zarr.storage.Store]: + """Open a zarr store for reading.""" + rp = ResourcePath(path) + if _is_zip(rp): + if _is_remote(rp): + # Materialize remote zips locally first; remote ZipStore is a + # follow-up. + with rp.as_local() as local: + store = zarr.storage.ZipStore(local.ospath, mode="r") + try: + yield store + finally: + store.close() + return + store = zarr.storage.ZipStore(rp.ospath, mode="r") + try: + yield store + finally: + store.close() + return + if _is_remote(rp): + import fsspec + + fs, fs_path = fsspec.url_to_fs(str(rp)) + store = zarr.storage.FsspecStore(fs=fs, path=fs_path, read_only=True) + yield store + return + store = zarr.storage.LocalStore(rp.ospath, read_only=True) + yield store +``` + +- [ ] **Step 4: Run the tests to verify they pass** + +Run: `pytest tests/test_zarr_store.py -v` +Expected: PASS — 3 tests pass. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_store.py tests/test_zarr_store.py +git commit -m "feat: add zarr store dispatch (LocalStore / ZipStore / FsspecStore)" +``` + +### Task 2.2: `_layout.py` — archive-class axes and ColorImage transpose + +**Files:** +- Create: `python/lsst/images/zarr/_layout.py` +- Test: `tests/test_zarr_layout.py` + +`_layout.py` centralizes per-archive-class decisions so `_output_archive.py` and `_input_archive.py` stay generic. 
The functions it exposes: + +- `axes_for_archive_class(name)` — returns the OME axis tuple +- `chunks_for(archive_class, shape, override)` — derives default chunks +- `transpose_color_image_in(array)` / `transpose_color_image_out(array)` — `(Y, X, 3) ↔ (3, Y, X)` (used in Phase 4 by ColorImage; included now to lock the contract) + +`ColorImage` and `CellCoadd` rules ship in this task as data, but the output archive doesn't exercise them until Phase 4. + +- [ ] **Step 1: Write the failing test** + +Create `tests/test_zarr_layout.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import unittest + +import numpy as np + +try: + from lsst.images.zarr._layout import ( + axes_for_archive_class, + chunks_for, + transpose_color_image_in, + transpose_color_image_out, + ) + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class LayoutTestCase(unittest.TestCase): + def test_axes_for_archive_class(self) -> None: + self.assertEqual(axes_for_archive_class("Image"), ("y", "x")) + self.assertEqual(axes_for_archive_class("MaskedImage"), ("y", "x")) + self.assertEqual(axes_for_archive_class("VisitImage"), ("y", "x")) + self.assertEqual(axes_for_archive_class("ColorImage"), ("c", "y", "x")) + self.assertEqual(axes_for_archive_class("CellCoadd"), ("y", "x")) + + def test_chunks_for_default(self) -> None: + # 2D image: clamped to 1024 per axis. + self.assertEqual(chunks_for("Image", (4096, 4096), None), (1024, 1024)) + # Smaller than 1024 → use full dim. 
+ self.assertEqual(chunks_for("Image", (300, 600), None), (300, 600)) + + def test_chunks_for_override(self) -> None: + # User override takes precedence. + self.assertEqual(chunks_for("Image", (4096, 4096), (256, 256)), (256, 256)) + + def test_color_image_transpose(self) -> None: + # In-memory shape (Y, X, 3) → on-disk (3, Y, X). + in_memory = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3) + on_disk = transpose_color_image_in(in_memory) + self.assertEqual(on_disk.shape, (3, 2, 3)) + # Round-trip. + recovered = transpose_color_image_out(on_disk) + np.testing.assert_array_equal(recovered, in_memory) + + def test_color_image_transpose_after_slicing(self) -> None: + # Critical: when reading a sliced subset of a ColorImage, the + # transpose must run on the (small) sliced result, never the + # full array. Here we feed in a (3, sliced-Y, sliced-X) array + # and check the output is (sliced-Y, sliced-X, 3). + sliced_on_disk = np.arange(3 * 4 * 5, dtype=np.uint8).reshape(3, 4, 5) + out = transpose_color_image_out(sliced_on_disk) + self.assertEqual(out.shape, (4, 5, 3)) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_layout.py -v` +Expected: FAIL — `ImportError`. + +- [ ] **Step 3: Write `_layout.py`** + +Create `python/lsst/images/zarr/_layout.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +"""Per-archive-class layout rules for the zarr backend. 
+ +This module centralises the decisions that vary by image type: + +- which OME axes apply (``ColorImage`` adds ``c``) +- default chunk sizes (clamped to 1024 per axis for plain images) +- how the in-memory ``(Y, X, 3)`` `ColorImage` array maps to the + ``(c, y, x)`` on-disk shape + +Keeping these in one place lets the output archive populate the IR +generically and lets the input archive apply per-class fixups (the +ColorImage transpose) **after** slicing — never on the full array. +""" + +from __future__ import annotations + +__all__ = ( + "axes_for_archive_class", + "chunks_for", + "transpose_color_image_in", + "transpose_color_image_out", +) + +import numpy as np + +_DEFAULT_AXIS_LIMIT = 1024 + + +def axes_for_archive_class(name: str) -> tuple[str, ...]: + """Return the OME axis tuple for a given archive class name. + + The default is ``(y, x)``. Specific classes that need extra axes + are listed explicitly. + """ + if name == "ColorImage": + return ("c", "y", "x") + return ("y", "x") + + +def chunks_for( + archive_class: str, + shape: tuple[int, ...], + override: tuple[int, ...] | None, +) -> tuple[int, ...]: + """Return the chunk shape to use for an array. + + Parameters + ---------- + archive_class + The top-level archive class (used for class-specific defaults + such as ``CellCoadd``'s cell-aligned chunks; cell handling + lands in Phase 4). + shape + The full array shape, used to clamp the default per-axis. + override + User-supplied chunk shape; if not ``None`` it is returned + verbatim after a length check. + """ + if override is not None: + if len(override) != len(shape): + raise ValueError( + f"chunks override has rank {len(override)}, " + f"expected {len(shape)} for {archive_class!r}." + ) + return tuple(override) + return tuple(min(_DEFAULT_AXIS_LIMIT, dim) for dim in shape) + + +def transpose_color_image_in(array: np.ndarray) -> np.ndarray: + """Transpose a `ColorImage` array from in-memory to on-disk shape. + + In-memory: ``(Y, X, 3)``. 
On-disk (OME ``c, y, x``): ``(3, Y, X)``. + """ + if array.ndim != 3 or array.shape[2] != 3: + raise ValueError( + f"ColorImage in-memory shape must be (Y, X, 3); got {array.shape!r}." + ) + return np.ascontiguousarray(np.transpose(array, (2, 0, 1))) + + +def transpose_color_image_out(array: np.ndarray) -> np.ndarray: + """Transpose a `ColorImage` array from on-disk to in-memory shape. + + On-disk: ``(c, y, x)`` (already-sliced if a subset was requested). + In-memory: ``(y, x, c)``. + """ + if array.ndim != 3 or array.shape[0] != 3: + raise ValueError( + f"ColorImage on-disk shape must be (3, Y, X); got {array.shape!r}." + ) + return np.ascontiguousarray(np.transpose(array, (1, 2, 0))) +``` + +- [ ] **Step 4: Run the test to verify it passes** + +Run: `pytest tests/test_zarr_layout.py -v` +Expected: PASS — 5 tests pass. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_layout.py tests/test_zarr_layout.py +git commit -m "feat: add zarr layout rules and ColorImage transpose helpers" +``` + +### Task 2.3: `ZarrOutputArchive` — skeleton, `serialize_direct` / `serialize_pointer` + +**Files:** +- Create: `python/lsst/images/zarr/_output_archive.py` +- Test: `tests/test_zarr_output_archive.py` + +This task wires the abstract `OutputArchive` ABC to the IR. `add_array` / `add_table` / `add_structured_array` follow in Task 2.4. `add_tree` and the public `write()` helper land in Task 2.5. + +The constructor builds an empty `ZarrDocument` and stores the user-supplied chunks/shards/compression overrides. `serialize_direct` wraps the archive in a `NestedOutputArchive`, passes it to the serializer, and returns the serializer's result (matching the NDF/JSON pattern). `serialize_pointer` writes the sub-tree as a UTF-8 JSON byte array under `/lsst/<name>/tree` and returns a `ZarrPointerModel(path="/lsst/<name>/tree")`; pointers are cached by `key`, so a repeated call with the same key returns the existing pointer. + +- [ ] **Step 1: Write the failing test** + +Create `tests/test_zarr_output_archive.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. 
+# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import unittest + +try: + from lsst.images.zarr._common import ZarrPointerModel + from lsst.images.zarr._output_archive import ZarrOutputArchive + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + +import pydantic + + +class _Sub(pydantic.BaseModel): + label: str = "sub" + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrOutputArchiveSkeletonTestCase(unittest.TestCase): + def test_serialize_direct_returns_nested_result(self) -> None: + archive = ZarrOutputArchive() + + def serializer(arch): # noqa: ANN001 + return _Sub(label="ok") + + result = archive.serialize_direct("/sub", serializer) + self.assertEqual(result.label, "ok") + + def test_serialize_pointer_writes_json_subtree(self) -> None: + archive = ZarrOutputArchive() + + def serializer(arch): # noqa: ANN001 + return _Sub(label="psf") + + pointer = archive.serialize_pointer("/psf", serializer, key=12345) + self.assertIsInstance(pointer, ZarrPointerModel) + self.assertEqual(pointer.path, "/lsst/psf/tree") + # Calling again with the same key returns the cached pointer. + again = archive.serialize_pointer("/psf", serializer, key=12345) + self.assertEqual(again, pointer) + # The IR has the JSON sub-tree as an array of UTF-8 bytes. + node = archive.document.root.get("/lsst/psf/tree") + self.assertEqual(str(node.dtype), "uint8") + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_output_archive.py -v` +Expected: FAIL — `ImportError`. 
+ +- [ ] **Step 3: Write the skeleton of `_output_archive.py`** + +Create `python/lsst/images/zarr/_output_archive.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +__all__ = ("ZarrOutputArchive", "write") + +from collections.abc import Callable, Hashable, Iterator, Mapping +from typing import Any + +import numpy as np +import pydantic + +from .._transforms import FrameSet +from ..serialization import ( + ArchiveTree, + NestedOutputArchive, + OutputArchive, +) +from ._common import ( + ZarrCompressionOptions, + ZarrPointerModel, + archive_path_to_zarr_path, +) +from ._model import ZarrArray, ZarrDocument, ZarrGroup + + +class ZarrOutputArchive(OutputArchive[ZarrPointerModel]): + """Output archive that populates a `ZarrDocument` IR. + + Bytes are not written until the IR is materialized via + `ZarrDocument.to_zarr`, which the public `write` helper performs on + context-manager exit. + + Parameters + ---------- + chunks + Per-array chunk overrides keyed by JSON pointer (or zarr path). + ``None`` for a key means "use the layout default". + shards + Per-array shard overrides keyed by JSON pointer (or zarr path). + compression + Per-array codec overrides keyed by JSON pointer (or zarr path). + """ + + def __init__( + self, + *, + chunks: Mapping[str, tuple[int, ...] | None] | None = None, + shards: Mapping[str, tuple[int, ...] 
| None] | None = None, + compression: Mapping[str, ZarrCompressionOptions | None] | None = None, + ) -> None: + self.document = ZarrDocument(root=ZarrGroup()) + self._chunks = dict(chunks) if chunks else {} + self._shards = dict(shards) if shards else {} + self._compression = dict(compression) if compression else {} + self._pointers: dict[Hashable, ZarrPointerModel] = {} + self._frame_sets: list[tuple[FrameSet, ZarrPointerModel]] = [] + + def serialize_direct[T: pydantic.BaseModel]( + self, name: str, serializer: Callable[[OutputArchive[ZarrPointerModel]], T] + ) -> T: + nested = NestedOutputArchive[ZarrPointerModel](name, self) + return serializer(nested) + + def serialize_pointer[T: ArchiveTree]( + self, + name: str, + serializer: Callable[[OutputArchive[ZarrPointerModel]], T], + key: Hashable, + ) -> ZarrPointerModel: + if (cached := self._pointers.get(key)) is not None: + return cached + # Run the serializer first so any nested add_array calls land + # inside the IR before we dump this sub-tree to JSON. + archive_path = name if name.startswith("/") else f"/{name}" + sub_zarr_path = archive_path_to_zarr_path(archive_path) + model = self.serialize_direct(name, serializer) + json_bytes = model.model_dump_json().encode("utf-8") + # Stage the JSON tree as a 1D uint8 array under /tree. + # Group container is created lazily by ensure_group. 
+ parent = self.document.root.ensure_group(sub_zarr_path) + parent.arrays["tree"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) + pointer = ZarrPointerModel(path=f"{sub_zarr_path}/tree") + self._pointers[key] = pointer + return pointer + + def serialize_frame_set[T: ArchiveTree]( + self, + name: str, + frame_set: FrameSet, + serializer: Callable[[OutputArchive], T], + key: Hashable, + ) -> ZarrPointerModel: + pointer = self.serialize_pointer(name, serializer, key) + self._frame_sets.append((frame_set, pointer)) + return pointer + + def iter_frame_sets(self) -> Iterator[tuple[FrameSet, ZarrPointerModel]]: + return iter(self._frame_sets) + + # add_array / add_table / add_structured_array land in Task 2.4; + # raising NotImplementedError keeps mypy happy until then. + def add_array(self, *args: Any, **kwargs: Any) -> Any: # type: ignore[override] + raise NotImplementedError("add_array lands in Task 2.4") + + def add_table(self, *args: Any, **kwargs: Any) -> Any: # type: ignore[override] + raise NotImplementedError("add_table lands in Task 2.4") + + def add_structured_array(self, *args: Any, **kwargs: Any) -> Any: # type: ignore[override] + raise NotImplementedError("add_structured_array lands in Task 2.4") + + +def write(*args: Any, **kwargs: Any) -> Any: + """Public write helper. Implemented in Task 2.5.""" + raise NotImplementedError("write() lands in Task 2.5") +``` + +- [ ] **Step 4: Run the test to verify it passes** + +Run: `pytest tests/test_zarr_output_archive.py -v` +Expected: PASS — 2 tests pass; the IR holds a `uint8` array at `/lsst/psf/tree`. 
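The `uint8` staging used by `serialize_pointer` can be illustrated in isolation. The following is a minimal sketch using only `json` and NumPy; the helper names `stage_json` and `unstage_json` and the example payload are hypothetical and are not part of `lsst.images`.

```python
import json

import numpy as np


def stage_json(payload: dict) -> np.ndarray:
    """Encode a JSON-serializable payload as a 1-D uint8 array."""
    json_bytes = json.dumps(payload).encode("utf-8")
    # np.frombuffer gives a read-only view over the bytes object.
    return np.frombuffer(json_bytes, dtype=np.uint8)


def unstage_json(array: np.ndarray) -> dict:
    """Decode a 1-D uint8 array of UTF-8 JSON back into a payload."""
    return json.loads(array.tobytes().decode("utf-8"))


staged = stage_json({"label": "psf"})
assert staged.dtype == np.uint8 and staged.ndim == 1
assert unstage_json(staged) == {"label": "psf"}
```

Storing the tree as a byte array keeps the JSON alongside the pixel data in the same store, at the cost of needing a decode step on read.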
+ +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_output_archive.py tests/test_zarr_output_archive.py +git commit -m "feat: add ZarrOutputArchive skeleton with serialize_direct/serialize_pointer" +``` + +### Task 2.4: `ZarrOutputArchive.add_array`, `add_table`, `add_structured_array` + +**Files:** +- Modify: `python/lsst/images/zarr/_output_archive.py` +- Modify: `tests/test_zarr_output_archive.py` + +`add_array` stages a numpy array into the IR at the zarr path computed by `_layout` / `_common`, applies per-array chunk/shard/compression overrides, and returns an `ArrayReferenceModel` with `source=f"zarr:{zarr_path}"`. Mask arrays go to `/lsst/mask/0`; variance to `/lsst/variance/0`; main image to `/0`. Anonymous nested arrays land at `/lsst/<path>/0` (for example, `/psf/centroids` becomes `/lsst/psf/centroids/0`). + +`add_table` and `add_structured_array` stage one 1-D zarr array per column under `/lsst/tables/<name>/<column>` and attach an `LsstTableGroup` attributes block to the parent group. The returned `TableModel` has each column's `ArrayReferenceModel.source` set to `f"zarr:/lsst/tables/<name>/<column>"`. 
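The path rules in this task can be summarized in a small, self-contained sketch. The functions `sketch_zarr_path` and `sketch_table_column_path` are hypothetical: they reproduce only the mappings stated here and are not the real `json_pointer_to_zarr_path` helper in `_common`, whose behaviour may differ for other inputs.

```python
def sketch_zarr_path(name: str) -> str:
    """Map an archive array name to a zarr path (illustrative only)."""
    pointer = name if name.startswith("/") else f"/{name}"
    if pointer == "/image":
        # The main image is the top-level OME dataset.
        return "/0"
    # Everything else lives under the namespaced /lsst group.
    return f"/lsst{pointer}/0"


def sketch_table_column_path(table_name: str, column: str) -> str:
    """Map a table name and column name to the column's zarr path."""
    pointer = table_name if table_name.startswith("/") else f"/{table_name}"
    return f"/lsst/tables{pointer}/{column}"


assert sketch_zarr_path("image") == "/0"
assert sketch_zarr_path("mask") == "/lsst/mask/0"
assert sketch_zarr_path("/psf/centroids") == "/lsst/psf/centroids/0"
assert sketch_table_column_path("/cat", "x") == "/lsst/tables/cat/x"
```

Keeping only the science array at the root means a generic OME-Zarr viewer sees a plain image, while LSST-specific components stay out of its way under `/lsst`.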
+ +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_output_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrOutputArchiveArrayTestCase(unittest.TestCase): + def test_add_image_array(self) -> None: + import numpy as np + + archive = ZarrOutputArchive() + ref = archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") + self.assertEqual(ref.source, "zarr:/0") + self.assertEqual(list(ref.shape), [4, 5]) + ir_array = archive.document.root.get("/0") + self.assertEqual(ir_array.shape, (4, 5)) + + def test_add_mask_array(self) -> None: + import numpy as np + + archive = ZarrOutputArchive() + ref = archive.add_array(np.zeros((1, 4, 5), dtype=np.uint8), name="mask") + self.assertEqual(ref.source, "zarr:/lsst/mask/0") + ir_array = archive.document.root.get("/lsst/mask/0") + self.assertEqual(ir_array.shape, (1, 4, 5)) + + def test_add_variance_array(self) -> None: + import numpy as np + + archive = ZarrOutputArchive() + ref = archive.add_array(np.ones((4, 5), dtype=np.float64), name="variance") + self.assertEqual(ref.source, "zarr:/lsst/variance/0") + + def test_add_anonymous_nested_array(self) -> None: + import numpy as np + + archive = ZarrOutputArchive() + ref = archive.add_array(np.ones((3,), dtype=np.float32), name="/psf/centroids") + self.assertEqual(ref.source, "zarr:/lsst/psf/centroids/0") + + def test_add_table_creates_one_array_per_column(self) -> None: + import astropy.table + import numpy as np + + archive = ZarrOutputArchive() + table = astropy.table.Table( + {"x": np.arange(4, dtype=np.int32), "y": np.arange(4, dtype=np.float32)} + ) + model = archive.add_table(table, name="/cat") + self.assertEqual(len(model.columns), 2) + sources = {c.name: c.data.source for c in model.columns} + self.assertEqual(sources["x"], "zarr:/lsst/tables/cat/x") + self.assertEqual(sources["y"], "zarr:/lsst/tables/cat/y") +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest 
tests/test_zarr_output_archive.py::ZarrOutputArchiveArrayTestCase -v` +Expected: FAIL — `add_array` raises `NotImplementedError`. + +- [ ] **Step 3: Replace the stubs with real implementations** + +Edit `python/lsst/images/zarr/_output_archive.py` — extend the imports and replace the three `NotImplementedError` stubs: + +```python +# At the top of the file, extend imports: +import astropy.io.fits +import astropy.table +import astropy.units + +from ..serialization import ( + ArchiveTree, + ArrayReferenceModel, + NestedOutputArchive, + NumberType, + OutputArchive, + TableColumnModel, + TableModel, + no_header_updates, +) +from ._common import ( + ZarrCompressionOptions, + ZarrPointerModel, + archive_path_to_zarr_path, + json_pointer_to_zarr_path, +) +``` + +Replace the three placeholder methods with: + +```python + def add_array( + self, + array: np.ndarray, + *, + name: str | None = None, + update_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> ArrayReferenceModel: + if name is None: + raise ValueError("Anonymous arrays are not supported in ZarrOutputArchive.") + zarr_path = json_pointer_to_zarr_path(name if name.startswith("/") else f"/{name}") + # zarr_path looks like "/0" or "/lsst/<...>/0"; we stage the + # array at the leaf and let layout defaults fill chunks later. 
+ leaf = zarr_path.rsplit("/", 1)[-1] + parent_path = zarr_path[: -(len(leaf) + 1)] or "/" + parent = self.document.root.ensure_group(parent_path) + parent.arrays[leaf] = ZarrArray( + data=np.ascontiguousarray(array), + chunks=self._chunks.get(name), + shards=self._shards.get(name), + compression=self._compression.get(name), + ) + return ArrayReferenceModel( + source=f"zarr:{zarr_path}", + shape=list(array.shape), + datatype=NumberType.from_numpy(array.dtype), + ) + + def add_table( + self, + table: astropy.table.Table, + *, + name: str | None = None, + update_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> TableModel: + if name is None: + raise ValueError("Anonymous tables are not supported in ZarrOutputArchive.") + columns = TableColumnModel.from_table(table) + archive_path = name if name.startswith("/") else f"/{name}" + table_zarr_path = f"/lsst/tables{archive_path}" + parent = self.document.root.ensure_group(table_zarr_path) + for c in columns: + assert isinstance(c.data, ArrayReferenceModel) + column_array = np.ascontiguousarray(np.asarray(table[c.name])) + parent.arrays[c.name] = ZarrArray(data=column_array) + c.data.source = f"zarr:{table_zarr_path}/{c.name}" + return TableModel(columns=columns, meta=table.meta) + + def add_structured_array( + self, + array: np.ndarray, + *, + name: str | None = None, + units: Mapping[str, astropy.units.Unit] | None = None, + descriptions: Mapping[str, str] | None = None, + update_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> TableModel: + if name is None: + raise ValueError("Anonymous structured arrays are not supported.") + columns = TableColumnModel.from_record_dtype(array.dtype) + archive_path = name if name.startswith("/") else f"/{name}" + table_zarr_path = f"/lsst/tables{archive_path}" + parent = self.document.root.ensure_group(table_zarr_path) + for c in columns: + assert isinstance(c.data, ArrayReferenceModel) + column_array = 
np.ascontiguousarray(array[c.name]) + parent.arrays[c.name] = ZarrArray(data=column_array) + c.data.source = f"zarr:{table_zarr_path}/{c.name}" + if units and (unit := units.get(c.name)): + c.unit = unit + if descriptions and (description := descriptions.get(c.name)): + c.description = description + return TableModel(columns=columns) +``` + +- [ ] **Step 4: Run the tests to verify they pass** + +Run: `pytest tests/test_zarr_output_archive.py -v` +Expected: PASS — all 7 tests pass. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_output_archive.py tests/test_zarr_output_archive.py +git commit -m "feat: implement ZarrOutputArchive add_array/add_table/add_structured_array" +``` + +### Task 2.5: `add_tree`, top-level OME multiscales, and the public `write()` helper + +**Files:** +- Modify: `python/lsst/images/zarr/_output_archive.py` +- Modify: `python/lsst/images/zarr/__init__.py` (re-export `write` and `ZarrOutputArchive`) +- Modify: `tests/test_zarr_output_archive.py` + +`add_tree` stages the main JSON tree under `/lsst/tree`, sets the `lsst.archive_class` and `lsst.tree`/`lsst.frame_set`/`lsst.companions` attributes on the root group, and — when there is a top-level `/0` array — populates an `OmeMultiscale` block on the root group's attributes. + +The public `write(obj, path, ...)` helper opens a store via `_store.open_store_for_write`, calls `obj.serialize`, then `add_tree`, and finally `document.to_zarr(store)`. 
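For orientation, the sketch below builds the kind of `multiscales` entry that OME-Zarr v0.5 describes for a single-resolution image whose only dataset is the array at `0`. It is written against the published OME-Zarr metadata layout rather than the `OmeMultiscale` model, so the helper name `sketch_multiscale`, the unit scale factors, and the axis-type mapping are assumptions; the real `OmeMultiscale.dump()` output may contain different or additional fields.

```python
def sketch_multiscale(name: str, axes: tuple[str, ...]) -> dict:
    """Build a single-level multiscales entry (illustrative only)."""
    # Assumed mapping: "c" is a channel axis, "t" is time, the rest are spatial.
    axis_types = {"c": "channel", "t": "time"}
    return {
        "name": name,
        "axes": [{"name": a, "type": axis_types.get(a, "space")} for a in axes],
        "datasets": [
            {
                "path": "0",
                # One scale factor per axis; 1.0 means "pixel units".
                "coordinateTransformations": [
                    {"type": "scale", "scale": [1.0] * len(axes)}
                ],
            }
        ],
    }


entry = sketch_multiscale("colorimage", ("c", "y", "x"))
assert [a["type"] for a in entry["axes"]] == ["channel", "space", "space"]
assert entry["datasets"][0]["path"] == "0"
```

An entry like this would sit in a list under the `multiscales` key of the root group's `ome` attributes, next to the `lsst` attributes block.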
+ +- [ ] **Step 1: Write the failing tests** + +Append to `tests/test_zarr_output_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrWriteHelperTestCase(unittest.TestCase): + def test_write_image_to_local_directory(self) -> None: + import os + import tempfile + + import numpy as np + import zarr + + from lsst.images import Box, Image + from lsst.images.zarr import write + from lsst.images.zarr._common import LSST_NS, OME_NS + from lsst.images.zarr._model import ZarrDocument + + image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + tree = write(image, target) + self.assertEqual(tree.kind, "image") # whatever Image's tree carries + # Reload via the IR (we have no ZarrInputArchive yet) and check shape. + with zarr.storage.LocalStore(target, read_only=True) as store: + doc = ZarrDocument.from_zarr(store) + self.assertIn("0", doc.root.arrays) + self.assertEqual(doc.root.arrays["0"].shape, (4, 5)) + # Top-level OME multiscales block is populated. + self.assertEqual(doc.root.attributes.lsst["archive_class"], "Image") + self.assertIn("multiscales", doc.root.attributes.ome) +``` + +(Use `tree.kind` only if `ImageSerializationModel` exposes that field — otherwise check whatever attribute is canonical for the image tree; the assertion is just that `write` returns the tree.) + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_output_archive.py::ZarrWriteHelperTestCase -v` +Expected: FAIL — `write()` raises `NotImplementedError`. + +- [ ] **Step 3: Implement `add_tree` and `write`** + +Edit `python/lsst/images/zarr/_output_archive.py` — append to `ZarrOutputArchive`: + +```python + def add_tree( + self, + tree: ArchiveTree, + *, + archive_class: str, + ) -> None: + """Finalize the IR: write the main JSON tree and root attributes. 
+ + Called once after the user's serializer has populated arrays / + sub-trees. Sets the namespaced ``lsst.*`` and ``ome.*`` blocks + on the root group and stages ``/lsst/tree`` as a 1D ``uint8`` + zarr array of UTF-8 JSON. + """ + from ._layout import axes_for_archive_class + from ._model import OmeMultiscale + + json_bytes = tree.model_dump_json().encode("utf-8") + lsst_group = self.document.root.ensure_group("/lsst") + lsst_group.arrays["tree"] = ZarrArray( + data=np.frombuffer(json_bytes, dtype=np.uint8) + ) + # Root attributes. + self.document.root.attributes.lsst["archive_class"] = archive_class + self.document.root.attributes.lsst["tree"] = "/lsst/tree" + if self._frame_sets: + # The first frame set's pointer is published as the canonical + # one for external tools; pointer paths to all the others + # remain reachable via the JSON tree. + self.document.root.attributes.lsst["frame_set"] = self._frame_sets[0][1].path + if "0" in self.document.root.arrays: + top = self.document.root.arrays["0"] + multiscale = OmeMultiscale( + name=archive_class.lower(), + axes=axes_for_archive_class(archive_class), + ) + self.document.root.attributes.ome["multiscales"] = [multiscale.dump()] + + +def write( + obj: Any, + path: Any, + *, + chunks: Mapping[str, tuple[int, ...] | None] | None = None, + shards: Mapping[str, tuple[int, ...] | None] | None = None, + compression: Mapping[str, ZarrCompressionOptions | None] | None = None, + metadata: Mapping[str, Any] | None = None, + butler_info: Any | None = None, +) -> ArchiveTree: + """Write ``obj`` to a zarr archive at ``path``. + + Parameters mirror the FITS/NDF write helpers. The store implementation + (LocalStore / ZipStore / FsspecStore) is selected from the URI shape + by `lsst.images.zarr._store.open_store_for_write`. 
+ """ + from ._store import open_store_for_write + + archive_default_name = getattr(obj, "_archive_default_name", None) + archive = ZarrOutputArchive(chunks=chunks, shards=shards, compression=compression) + if archive_default_name is not None: + tree = archive.serialize_direct(archive_default_name, obj.serialize) + else: + tree = obj.serialize(archive) + if metadata is not None: + tree.metadata.update(metadata) + if butler_info is not None: + tree.butler_info = butler_info + archive.add_tree(tree, archive_class=type(obj).__name__) + with open_store_for_write(path) as store: + archive.document.to_zarr(store) + return tree +``` + +Re-export from `python/lsst/images/zarr/__init__.py` (replace the placeholder comment): + +```python +from ._common import * # noqa: F401, F403 +from ._output_archive import * # noqa: F401, F403 +``` + +- [ ] **Step 4: Run the tests to verify they pass** + +Run: `pytest tests/test_zarr_output_archive.py -v` +Expected: PASS — all tests, including the local-directory write. Open the produced `out.zarr` directory with `python -c "import zarr; print(list(zarr.open_group(zarr.storage.LocalStore('/tmp/.../out.zarr', read_only=True)).keys()))"` to spot-check. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_output_archive.py python/lsst/images/zarr/__init__.py tests/test_zarr_output_archive.py +git commit -m "feat: add ZarrOutputArchive.add_tree and public write() helper" +``` + +### Task 2.6: Layout-level write tests for `MaskedImage` and `VisitImage` + +**Files:** +- Modify: `tests/test_zarr_output_archive.py` + +This task pins the on-disk shape for two more archive classes by inspecting the IR after `write()`. No new production code — these tests only catch regressions when Phase 4 changes things. 
+ +- [ ] **Step 1: Write the test** + +Append to `tests/test_zarr_output_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrWriteOnDiskShapeTestCase(unittest.TestCase): + def _round_trip_doc(self, obj): # noqa: ANN001 + import os + import tempfile + + import zarr + + from lsst.images.zarr import write + from lsst.images.zarr._model import ZarrDocument + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(obj, target) + with zarr.storage.LocalStore(target, read_only=True) as store: + return ZarrDocument.from_zarr(store) + + def test_masked_image_layout(self) -> None: + import numpy as np + + from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema + + schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) + image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + masked = MaskedImage(image, mask_schema=schema) + + doc = self._round_trip_doc(masked) + self.assertEqual(doc.root.attributes.lsst["archive_class"], "MaskedImage") + # Top-level data array. + self.assertIn("0", doc.root.arrays) + # Mask group at /lsst/mask/0. + mask_group = doc.root.groups["lsst"].groups["mask"] + self.assertIn("0", mask_group.arrays) + # Variance group at /lsst/variance/0 (MaskedImage carries one). + variance_group = doc.root.groups["lsst"].groups["variance"] + self.assertIn("0", variance_group.arrays) + + def test_visit_image_layout(self) -> None: + import numpy as np + + from lsst.images import Box, Image, MaskPlane, MaskSchema, VisitImage + + schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) + image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + # Construct a minimal VisitImage; the exact constructor depends on + # the public API. The test only cares that write() succeeds and + # produces a valid IR. 
+ visit = VisitImage(image=image, mask_schema=schema) + + doc = self._round_trip_doc(visit) + self.assertEqual(doc.root.attributes.lsst["archive_class"], "VisitImage") + self.assertIn("0", doc.root.arrays) +``` + +- [ ] **Step 2: Run the test** + +Run: `pytest tests/test_zarr_output_archive.py::ZarrWriteOnDiskShapeTestCase -v` +Expected: PASS — both tests. If the `VisitImage` constructor in this codebase needs different arguments than the snippet, adapt the constructor call only — the assertions on the IR shape stay. + +- [ ] **Step 3: Commit** + +```bash +git add tests/test_zarr_output_archive.py +git commit -m "test: pin on-disk zarr layout for MaskedImage and VisitImage" +``` + +--- + +**End of Phase 2.** Six tasks. The output side now produces valid OME-Zarr archives for `Image`, `MaskedImage`, `VisitImage` with namespaced `lsst.*` attributes and a JSON tree at `/lsst/tree`. Phase 3 inverts this: `ZarrInputArchive`, `read()`, and the lazy-subset assertions that prove `slices=` only fetches the touched chunks. + +## Phase 3 — `ZarrInputArchive`, `read()`, and lazy subset enforcement + +This phase delivers the read side. The hard constraint here is the **lazy subset invariant**: `get_array(model, slices=...)` must forward `slices` to the underlying `zarr.Array` handle so a 4×4 subset of a 4096×4096 remote VisitImage downloads only the chunks intersecting that slice, never the full array. The phase ships with a `_CountingStore`-based regression test that *fails* if any code path materializes the full array before slicing. + +### Task 3.1: `ZarrInputArchive` skeleton — open + `get_tree` + +**Files:** +- Create: `python/lsst/images/zarr/_input_archive.py` +- Test: `tests/test_zarr_input_archive.py` + +The constructor takes a `zarr.storage.Store` and builds a `ZarrDocument` via `from_zarr` (lazy — no chunk reads). `get_tree(model_type)` reads the JSON bytes at `/lsst/tree` and validates them with `model_type.model_validate_json`. 
The `open` classmethod is a context manager that delegates store creation to `_store.open_store_for_read`. + +Error taxonomy: +- Missing root metadata → `ArchiveReadError("File has no zarr.json")` +- Missing `lsst.archive_class` attribute → `ArchiveReadError("File is not an LSST zarr archive")` +- `lsst.version` newer than `LSST_VERSION` → `ArchiveReadError(f"Unsupported lsst:version {N}")` + +- [ ] **Step 1: Write the failing test** + +Create `tests/test_zarr_input_archive.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import os +import tempfile +import unittest + +import numpy as np + +try: + import zarr + + from lsst.images.serialization import ArchiveReadError + from lsst.images.zarr._common import LSST_NS, LSST_VERSION + from lsst.images.zarr._input_archive import ZarrInputArchive + from lsst.images.zarr._output_archive import ZarrOutputArchive + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +def _write_minimal_image(target: str) -> None: + from lsst.images import Box, Image + from lsst.images.zarr import write + + write(Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]), target) + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrInputArchiveSkeletonTestCase(unittest.TestCase): + def test_open_reads_tree(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + _write_minimal_image(target) + with ZarrInputArchive.open(target) as archive: + from lsst.images._image import ImageSerializationModel + + tree = 
archive.get_tree(ImageSerializationModel) + self.assertEqual(list(tree.image.shape), [4, 5]) + + def test_missing_archive_class_raises(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "bare.zarr") + os.makedirs(target) + store = zarr.storage.LocalStore(target, read_only=False) + zarr.create_group(store=store, zarr_format=3) # no lsst attrs + with self.assertRaisesRegex(ArchiveReadError, "not an LSST zarr archive"): + with ZarrInputArchive.open(target): + pass + + def test_future_version_refused(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "future.zarr") + os.makedirs(target) + store = zarr.storage.LocalStore(target, read_only=False) + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes( + {LSST_NS: {"version": LSST_VERSION + 1, "archive_class": "Image", "tree": "/lsst/tree"}} + ) + with self.assertRaisesRegex(ArchiveReadError, "Unsupported lsst:version"): + with ZarrInputArchive.open(target): + pass + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_input_archive.py -v` +Expected: FAIL — `ImportError` on `_input_archive`. + +- [ ] **Step 3: Write `_input_archive.py`** + +Create `python/lsst/images/zarr/_input_archive.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +__all__ = ("ZarrInputArchive", "read") + +from collections.abc import Callable, Iterator +from contextlib import contextmanager +from types import EllipsisType +from typing import Any, Self + +import astropy.io.fits +import astropy.table +import numpy as np + +from lsst.resources import ResourcePathExpression + +from .._transforms import FrameSet +from ..serialization import ( + ArchiveReadError, + ArchiveTree, + ArrayReferenceModel, + InlineArrayModel, + InputArchive, + ReadResult, + TableModel, + no_header_updates, +) +from ._common import LSST_NS, LSST_VERSION, ZarrPointerModel +from ._model import ZarrArray, ZarrDocument + + +class ZarrInputArchive(InputArchive[ZarrPointerModel]): + """Reads zarr archives written by `ZarrOutputArchive`. + + Built around a `ZarrDocument` whose `ZarrArray` nodes hold lazy + ``zarr.Array`` handles. ``get_array(model, slices=...)`` forwards + slices straight to those handles, so subset reads on remote stores + only fetch the touched chunks. + + Instances should only be constructed via the :meth:`open` context + manager. 
+ """ + + def __init__(self, document: ZarrDocument) -> None: + self._document = document + self._validate_root_attributes() + self._deserialized_pointer_cache: dict[str, Any] = {} + self._frame_set_cache: dict[str, FrameSet] = {} + + @classmethod + @contextmanager + def open(cls, path: ResourcePathExpression) -> Iterator[Self]: + """Open a zarr archive for reading.""" + from ._store import open_store_for_read + + with open_store_for_read(path) as store: + doc = ZarrDocument.from_zarr(store) + yield cls(doc) + + @property + def document(self) -> ZarrDocument: + return self._document + + def get_tree[T: ArchiveTree](self, model_type: type[T]) -> T: + """Read and validate the main Pydantic tree at ``/lsst/tree``.""" + try: + node = self._document.root.get("/lsst/tree") + except KeyError: + raise ArchiveReadError( + "File has no /lsst/tree array; this is not an LSST zarr archive." + ) from None + if not isinstance(node, ZarrArray): + raise ArchiveReadError("/lsst/tree must be a zarr array, not a group.") + json_bytes = bytes(node.read()) + return model_type.model_validate_json(json_bytes.decode("utf-8")) + + def _validate_root_attributes(self) -> None: + attrs = self._document.root.attributes.lsst + if "archive_class" not in attrs: + raise ArchiveReadError("File is not an LSST zarr archive (missing lsst.archive_class).") + # The archive holds no handle on the raw zarr group, so the on-disk + # "version" is read from the IR: ZarrAttributes.load() stores it under + # the private "__version_remembered_at_load__" key (see the _model.py + # change below). A missing key is treated as compatible. + version = attrs.get("__version_remembered_at_load__", LSST_VERSION) + if version > LSST_VERSION: + raise ArchiveReadError( + f"Unsupported lsst:version {version}; this reader supports up to {LSST_VERSION}." + ) + + # The remaining abstract methods land in subsequent tasks. 
+ def deserialize_pointer(self, *args: Any, **kwargs: Any) -> Any: # type: ignore[override] + raise NotImplementedError("deserialize_pointer lands in Task 3.3") + + def get_frame_set(self, *args: Any, **kwargs: Any) -> Any: # type: ignore[override] + raise NotImplementedError("get_frame_set lands in Task 3.3") + + def get_array(self, *args: Any, **kwargs: Any) -> Any: # type: ignore[override] + raise NotImplementedError("get_array lands in Task 3.2") + + def get_table(self, *args: Any, **kwargs: Any) -> Any: # type: ignore[override] + raise NotImplementedError("get_table lands in Task 3.4") + + def get_structured_array(self, *args: Any, **kwargs: Any) -> Any: # type: ignore[override] + raise NotImplementedError("get_structured_array lands in Task 3.4") + + +def read(*args: Any, **kwargs: Any) -> Any: + """Public read helper. Implemented in Task 3.5.""" + raise NotImplementedError("read() lands in Task 3.5") +``` + +Add `version` round-trip to `_model.py` so `_validate_root_attributes` can see it. In `python/lsst/images/zarr/_model.py`, change `ZarrAttributes.load` to **keep** the version under a private key: + +```python + @classmethod + def load(cls, raw: dict[str, Any]) -> Self: + lsst = dict(raw.get(LSST_NS, {})) + version = lsst.pop("version", None) + if version is not None: + lsst["__version_remembered_at_load__"] = version + ome = dict(raw.get(OME_NS, {})) + ome.pop("version", None) + return cls(lsst=lsst, ome=ome) +``` + +…and skip that key in `dump`: + +```python + def dump(self) -> dict[str, Any]: + out: dict[str, Any] = {} + public_lsst = {k: v for k, v in self.lsst.items() if not k.startswith("__")} + out[LSST_NS] = {"version": LSST_VERSION, **public_lsst} + if self.ome: + out[OME_NS] = {"version": OME_VERSION, **self.ome} + return out +``` + +(Update the existing `test_load_preserves_unknown_keys` if needed — the assertion is about `future_thing`, not the private version sentinel, so it still passes.) 
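The version sentinel described above can be checked in isolation with a minimal, self-contained sketch. `Attrs` and `is_supported` are hypothetical stand-ins for `ZarrAttributes` and the reader's check (LSST namespace only), and the values given to `LSST_NS` and `LSST_VERSION` are assumptions, not the real constants in `_common.py`.

```python
from dataclasses import dataclass, field
from typing import Any

# Assumed values for illustration; the real constants live in _common.py.
LSST_NS = "lsst"
LSST_VERSION = 1
VERSION_SENTINEL = "__version_remembered_at_load__"


@dataclass
class Attrs:
    """Toy stand-in for ZarrAttributes, covering only the LSST namespace."""

    lsst: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def load(cls, raw: dict[str, Any]) -> "Attrs":
        lsst = dict(raw.get(LSST_NS, {}))
        version = lsst.pop("version", None)
        if version is not None:
            # Keep the on-disk version under a private key.
            lsst[VERSION_SENTINEL] = version
        return cls(lsst=lsst)

    def dump(self) -> dict[str, Any]:
        # Keys starting with "__" are private and never reach the store.
        public = {k: v for k, v in self.lsst.items() if not k.startswith("__")}
        return {LSST_NS: {"version": LSST_VERSION, **public}}


def is_supported(attrs: Attrs) -> bool:
    """Mirror the reader's check: a missing version counts as compatible."""
    return attrs.lsst.get(VERSION_SENTINEL, LSST_VERSION) <= LSST_VERSION


future = Attrs.load({LSST_NS: {"version": LSST_VERSION + 1, "archive_class": "Image"}})
current = Attrs.load({LSST_NS: {"archive_class": "Image"}})
```

One consequence visible in the sketch: `dump` always writes the current `LSST_VERSION`, so attributes loaded from a newer archive would be relabelled if they were ever written back.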
+ +- [ ] **Step 4: Run the tests to verify they pass** + +Run: `pytest tests/test_zarr_input_archive.py -v tests/test_zarr_model.py -v` +Expected: PASS — input-archive skeleton tests pass; existing model tests still pass. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_input_archive.py python/lsst/images/zarr/_model.py tests/test_zarr_input_archive.py +git commit -m "feat: add ZarrInputArchive skeleton with get_tree and version validation" +``` + +### Task 3.2: `get_array` — lazy slice forwarding + +**Files:** +- Modify: `python/lsst/images/zarr/_input_archive.py` +- Modify: `tests/test_zarr_input_archive.py` + +This is the load-bearing task for the lazy-subset invariant. `get_array(model, slices=...)`: + +1. Resolves the model's `source` field (`f"zarr:{path}"`) to a zarr path inside the IR. +2. Fetches the `ZarrArray` IR node — still lazy. +3. Calls `ir_array.read(slices=slices)`, which forwards directly to the `zarr.Array` handle. + +The test uses a `_CountingStore` that counts every `get` call against the underlying store and asserts the subset read touches strictly fewer chunk keys than a full read of the same array. + +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_input_archive.py`: + +```python +class _CountingStore(zarr.storage.MemoryStore if HAVE_ZARR else object): + """A MemoryStore that counts get() calls.""" + + def __init__(self) -> None: + super().__init__() + self.reads = 0 + + async def get(self, key, prototype, byte_range=None): # type: ignore[override] + self.reads += 1 + return await super().get(key, prototype, byte_range) + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrInputArchiveLazySubsetTestCase(unittest.TestCase): + """Pins the lazy-subset invariant from the design spec. + + A subset read of a chunked image must touch a strict subset of the + chunk keys that a full read would. This is what makes remote + VisitImage subsetting cheap. 
+ """ + + def test_subset_read_touches_only_intersecting_chunks(self) -> None: + from lsst.images.serialization import ArrayReferenceModel, NumberType + + # Build a 16x16 / chunks=(4,4) zarr archive with the LSST root + # attributes wired up so ZarrInputArchive accepts it. + store = _CountingStore() + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes( + { + LSST_NS: { + "version": LSST_VERSION, + "archive_class": "Image", + "tree": "/lsst/tree", + }, + } + ) + zarr_array = root.create_array(name="0", shape=(16, 16), chunks=(4, 4), dtype="float32") + zarr_array[:] = np.arange(256, dtype=np.float32).reshape(16, 16) + # Stage a stub /lsst/tree primitive so the input archive's + # constructor doesn't blow up on get_tree (we won't call it). + lsst_group = root.create_group("lsst") + lsst_group.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" + + # Open via the ZarrInputArchive (this only reads metadata). + doc = ZarrDocument.from_zarr(store) + archive = ZarrInputArchive(doc) + + # Reset the counter; we want to count only what get_array does. + store.reads = 0 + full_ref = ArrayReferenceModel( + source="zarr:/0", shape=[16, 16], datatype=NumberType.from_numpy(np.dtype("float32")) + ) + # Full read for the baseline. + full = archive.get_array(full_ref) + full_reads = store.reads + self.assertEqual(full.shape, (16, 16)) + + # Reset and read a single-chunk subset. + store.reads = 0 + subset = archive.get_array(full_ref, slices=(slice(0, 4), slice(0, 4))) + subset_reads = store.reads + self.assertEqual(subset.shape, (4, 4)) + np.testing.assert_array_equal(subset, np.arange(256).reshape(16, 16)[:4, :4]) + + # Critical assertion: subset read fetched strictly fewer keys. + self.assertLess(subset_reads, full_reads) +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_input_archive.py::ZarrInputArchiveLazySubsetTestCase -v` +Expected: FAIL — `get_array` raises `NotImplementedError`. 
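To see why the test can expect `subset_reads < full_reads`, here is a small pure-Python sketch that counts which chunks of a regular chunk grid a slice tuple intersects. It does not use zarr at all; `chunks_touched` is a hypothetical helper written for this illustration, and it ignores metadata reads, which the `_CountingStore` counter may also include.

```python
from itertools import product


def chunks_touched(shape, chunks, slices):
    """Return the set of chunk-grid indices that a slice tuple intersects."""
    per_axis = []
    for length, chunk, sl in zip(shape, chunks, slices):
        start, stop, step = sl.indices(length)
        if step != 1:
            raise ValueError("only unit-step slices are handled in this sketch")
        if stop <= start:
            # Empty selection along this axis: no chunks needed.
            per_axis.append(range(0))
            continue
        # First and last chunk index covering [start, stop).
        per_axis.append(range(start // chunk, (stop - 1) // chunk + 1))
    return set(product(*per_axis))


shape, chunks = (16, 16), (4, 4)
full = chunks_touched(shape, chunks, (slice(None), slice(None)))
subset = chunks_touched(shape, chunks, (slice(0, 4), slice(0, 4)))
straddling = chunks_touched(shape, chunks, (slice(3, 5), slice(0, 4)))
```

For the 16x16 array with 4x4 chunks used in the test, the full read covers all 16 chunks while the `[0:4, 0:4]` subset covers one. A slice that straddles a chunk boundary, such as rows 3 to 4, needs two.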
+ +- [ ] **Step 3: Implement `get_array`** + +Replace the `get_array` placeholder in `_input_archive.py`: + +```python + def get_array( + self, + model: ArrayReferenceModel | InlineArrayModel, + *, + slices: tuple[slice, ...] | EllipsisType = ..., + strip_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> np.ndarray: + if isinstance(model, InlineArrayModel): + data: np.ndarray = np.array(model.data, dtype=model.datatype.to_numpy()) + return data if slices is ... else data[slices] + if not isinstance(model.source, str) or not model.source.startswith("zarr:"): + raise ArchiveReadError( + f"ZarrInputArchive cannot resolve array source {model.source!r}; " + f"expected a 'zarr:' reference." + ) + zarr_path = model.source[len("zarr:") :] + try: + node = self._document.root.get(zarr_path) + except KeyError: + raise ArchiveReadError(f"Array reference {zarr_path!r} not in store.") from None + if not isinstance(node, ZarrArray): + raise ArchiveReadError(f"{zarr_path!r} is not an array.") + # The lazy invariant: ZarrArray.read forwards slices straight + # to the underlying zarr.Array handle. Only the chunks + # intersecting `slices` are fetched. + return node.read(slices=slices) +``` + +- [ ] **Step 4: Run the tests to verify they pass** + +Run: `pytest tests/test_zarr_input_archive.py::ZarrInputArchiveLazySubsetTestCase -v` +Expected: PASS — `subset_reads < full_reads`. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_input_archive.py tests/test_zarr_input_archive.py +git commit -m "feat: implement lazy slice forwarding in ZarrInputArchive.get_array" +``` + +### Task 3.3: `deserialize_pointer`, `serialize_frame_set` round-trip, `get_frame_set` + +**Files:** +- Modify: `python/lsst/images/zarr/_input_archive.py` +- Modify: `tests/test_zarr_input_archive.py` + +`deserialize_pointer(pointer, model_type, deserializer)`: + +1. Looks up the cached deserialized object by `pointer.path`; returns it if present. +2. 
Reads the JSON bytes at `pointer.path` (a `ZarrArray` of `uint8`). +3. Validates with `model_type.model_validate_json` and calls `deserializer(model, self)`. +4. Caches by `pointer.path` and, if the result is a `FrameSet`, also caches it under `_frame_set_cache` so `get_frame_set` can return it later without re-deserialization. + +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_input_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrInputArchivePointerTestCase(unittest.TestCase): + def test_deserialize_pointer_caches_results(self) -> None: + # Write an archive containing one Image, then re-open and + # deserialize via the public read() helper once it lands. For + # this skeleton task, build a hand-rolled archive that contains + # a JSON sub-tree and call deserialize_pointer directly. + import pydantic + + from lsst.images.zarr._common import ZarrPointerModel + + class _Sub(pydantic.BaseModel): + label: str + + # Build an archive with /lsst/psf/tree containing JSON. + store = zarr.storage.MemoryStore() + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes( + {LSST_NS: {"version": LSST_VERSION, "archive_class": "Image", "tree": "/lsst/tree"}} + ) + # Stub /lsst/tree. + lsst_group = root.create_group("lsst") + lsst_group.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" + # /lsst/psf/tree with a JSON document. 
+ json_bytes = b'{"label": "psf"}' + psf_group = lsst_group.create_group("psf") + arr = psf_group.create_array( + name="tree", shape=(len(json_bytes),), chunks=(len(json_bytes),), dtype="uint8" + ) + arr[:] = np.frombuffer(json_bytes, dtype=np.uint8) + + doc = ZarrDocument.from_zarr(store) + archive = ZarrInputArchive(doc) + + deserialize_calls: list[int] = [] + + def deserializer(model, arch): # noqa: ANN001 + deserialize_calls.append(1) + return model + + pointer = ZarrPointerModel(path="/lsst/psf/tree") + first = archive.deserialize_pointer(pointer, _Sub, deserializer) + second = archive.deserialize_pointer(pointer, _Sub, deserializer) + self.assertEqual(first.label, "psf") + self.assertIs(first, second) + self.assertEqual(len(deserialize_calls), 1) +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_input_archive.py::ZarrInputArchivePointerTestCase -v` +Expected: FAIL — `deserialize_pointer` raises `NotImplementedError`. + +- [ ] **Step 3: Implement `deserialize_pointer` and `get_frame_set`** + +Replace the placeholders in `_input_archive.py`: + +```python + def deserialize_pointer[U: ArchiveTree, V]( + self, + pointer: ZarrPointerModel, + model_type: type[U], + deserializer: Callable[[U, InputArchive[ZarrPointerModel]], V], + ) -> V: + if (cached := self._deserialized_pointer_cache.get(pointer.path)) is not None: + return cached + try: + node = self._document.root.get(pointer.path) + except KeyError: + raise ArchiveReadError(f"Pointer reference {pointer.path!r} not in store.") from None + if not isinstance(node, ZarrArray): + raise ArchiveReadError(f"Pointer target {pointer.path!r} is not an array.") + json_text = bytes(node.read()).decode("utf-8") + model = model_type.model_validate_json(json_text) + result = deserializer(model, self) + self._deserialized_pointer_cache[pointer.path] = result + if isinstance(result, FrameSet): + self._frame_set_cache[pointer.path] = result + return result + + def get_frame_set(self, 
pointer: ZarrPointerModel) -> FrameSet: + try: + return self._frame_set_cache[pointer.path] + except KeyError: + raise AssertionError( + f"Frame set at {pointer.path!r} must be deserialised via " + f"deserialize_pointer before any dependent transform can be." + ) from None +``` + +- [ ] **Step 4: Run the test to verify it passes** + +Run: `pytest tests/test_zarr_input_archive.py -v` +Expected: PASS — pointer-cache test asserts the deserializer is called exactly once. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_input_archive.py tests/test_zarr_input_archive.py +git commit -m "feat: implement deserialize_pointer and get_frame_set" +``` + +### Task 3.4: `get_table`, `get_structured_array` + +**Files:** +- Modify: `python/lsst/images/zarr/_input_archive.py` +- Modify: `tests/test_zarr_input_archive.py` + +Mirrors the FITS implementation: the `TableModel` carries one `ArrayReferenceModel` per column whose `source` is a `zarr:/lsst/tables//` path. `get_table` resolves each column via `get_array` (so subset semantics propagate), then builds an `astropy.table.Table`. `get_structured_array` returns the same data as a numpy structured array. + +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_input_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrInputArchiveTableTestCase(unittest.TestCase): + def test_get_table_reconstructs_columns(self) -> None: + import astropy.table + import numpy as np + + from lsst.images.zarr._output_archive import ZarrOutputArchive + + # Stage a table via the output archive, then read it back. + out = ZarrOutputArchive() + # Wire up the LSST root attributes so the input archive accepts it. 
+ out.document.root.attributes.lsst["archive_class"] = "Image" + out.document.root.attributes.lsst["tree"] = "/lsst/tree" + out.document.root.ensure_group("/lsst").arrays["tree"] = ZarrArray( + data=np.frombuffer(b"{}", dtype=np.uint8) + ) + original = astropy.table.Table( + {"x": np.arange(4, dtype=np.int32), "y": np.arange(4, dtype=np.float32)} + ) + model = out.add_table(original, name="/cat") + + store = zarr.storage.MemoryStore() + out.document.to_zarr(store) + doc = ZarrDocument.from_zarr(store) + inp = ZarrInputArchive(doc) + + recovered = inp.get_table(model) + self.assertEqual(recovered.colnames, ["x", "y"]) + np.testing.assert_array_equal(recovered["x"], original["x"]) + np.testing.assert_array_equal(recovered["y"], original["y"]) +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_input_archive.py::ZarrInputArchiveTableTestCase -v` +Expected: FAIL — `get_table` raises `NotImplementedError`. + +- [ ] **Step 3: Implement `get_table` and `get_structured_array`** + +Replace the placeholders: + +```python + def get_table( + self, + model: TableModel, + strip_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> astropy.table.Table: + result = astropy.table.Table(meta=model.meta) + for column_model in model.columns: + if isinstance(column_model.data, InlineArrayModel): + data: Any = column_model.data.data + else: + data = self.get_array(column_model.data, strip_header=strip_header) + result[column_model.name] = astropy.table.Column( + data, + name=column_model.name, + dtype=column_model.data.datatype.to_numpy(), + unit=column_model.unit, + description=column_model.description, + meta=column_model.meta, + ) + return result + + def get_structured_array( + self, + model: TableModel, + strip_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> np.ndarray: + return self.get_table(model, strip_header).as_array() +``` + +- [ ] **Step 4: Run the test to verify it passes** + +Run: 
`pytest tests/test_zarr_input_archive.py -v` +Expected: PASS — all tests in the file. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_input_archive.py tests/test_zarr_input_archive.py +git commit -m "feat: implement ZarrInputArchive.get_table and get_structured_array" +``` + +### Task 3.5: Public `read()` helper + +**Files:** +- Modify: `python/lsst/images/zarr/_input_archive.py` +- Modify: `python/lsst/images/zarr/__init__.py` +- Modify: `tests/test_zarr_input_archive.py` + +`read(cls, path, **kwargs)` opens a `ZarrInputArchive`, calls `archive.get_tree(cls._get_archive_tree_type(ZarrPointerModel))`, and returns `ReadResult(tree.deserialize(archive, **kwargs), tree.metadata, tree.butler_info)` — a direct mirror of `ndf.read` minus the auto-detect path (we do not auto-detect a non-LSST OME-Zarr archive in v1). + +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_input_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrReadHelperTestCase(unittest.TestCase): + def test_round_trip_image(self) -> None: + import numpy as np + + from lsst.images import Box, Image + from lsst.images.zarr import read, write + + original = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25] + ) + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(original, target) + result = read(Image, target) + self.assertEqual(result.deserialized.array.shape, (4, 5)) + np.testing.assert_array_equal(result.deserialized.array, original.array) + self.assertEqual(result.deserialized.bbox, original.bbox) +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_input_archive.py::ZarrReadHelperTestCase -v` +Expected: FAIL — `read()` raises `NotImplementedError`. 
+ +- [ ] **Step 3: Implement `read`** + +Replace the placeholder in `_input_archive.py`: + +```python +def read[T: Any](cls: type[T], path: ResourcePathExpression, **kwargs: Any) -> ReadResult[T]: + """Read an object from a zarr archive. + + The archive's root attributes name the in-memory class via + ``lsst.archive_class``. That attribute is not compared against + ``cls`` here; the tree is validated against the archive tree type + that ``cls`` reports. + """ + with ZarrInputArchive.open(path) as archive: + tree_type = cls._get_archive_tree_type(ZarrPointerModel) + tree = archive.get_tree(tree_type) + obj = tree.deserialize(archive, **kwargs) + return ReadResult(obj, tree.metadata, tree.butler_info) +``` + +Re-export `read` from `python/lsst/images/zarr/__init__.py`: + +```python +from ._common import * # noqa: F401, F403 +from ._input_archive import * # noqa: F401, F403 +from ._output_archive import * # noqa: F401, F403 +``` + +- [ ] **Step 4: Run the round-trip test** + +Run: `pytest tests/test_zarr_input_archive.py::ZarrReadHelperTestCase -v` +Expected: PASS — `Image` round-trips via `write` + `read`. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_input_archive.py python/lsst/images/zarr/__init__.py tests/test_zarr_input_archive.py +git commit -m "feat: add public zarr.read() helper" +``` + +### Task 3.6: `RoundtripZarr` test helper + round-trips for `Image` / `MaskedImage` / `VisitImage` + +**Files:** +- Modify: `python/lsst/images/tests/_roundtrip.py` (add `RoundtripZarr`) +- Create: `tests/test_zarr_round_trip.py` + +`RoundtripZarr` lets the existing `RoundtripBase` helpers exercise the zarr backend the same way they exercise FITS / JSON / NDF. The new test file uses it to round-trip the three image types covered by Phase 2's output archive. + +- [ ] **Step 1: Add `RoundtripZarr` to `_roundtrip.py`** + +Edit `python/lsst/images/tests/_roundtrip.py`. 
Add `"RoundtripZarr"` to `__all__`, then append after `RoundtripNdf`: + +```python +class RoundtripZarr[T](RoundtripBase[T]): + def inspect(self) -> Any: + """Open the zarr archive's IR for inspection.""" + import zarr + + from lsst.images.zarr._model import ZarrDocument + + return ZarrDocument.from_zarr(zarr.storage.LocalStore(self.filename, read_only=True)) + + def _get_extension(self) -> str: + return ".zarr" + + def _write(self, obj: Any, filename: str) -> ArchiveTree: + from .. import zarr as zarr_backend + + return zarr_backend.write(obj, filename) + + def _read(self, obj_type: Any, filename: str) -> ReadResult: + from .. import zarr as zarr_backend + + return zarr_backend.read(obj_type, filename) +``` + +(If `RoundtripBase` constructs the filename as a single file but our zarr archive is a directory, audit the helper for any logic that assumes a file. Mirror what NDF does — `RoundtripNdf` stores `.sdf`, a single HDF5 file; for zarr, `.zarr` is conventionally a directory. Adjust the helper to accept directories where it currently uses `tempfile.NamedTemporaryFile`.) + +- [ ] **Step 2: Write the failing test** + +Create `tests/test_zarr_round_trip.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +import unittest + +import numpy as np + +try: + import zarr # noqa: F401 + + from lsst.images.tests import RoundtripZarr + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrRoundTripTestCase(unittest.TestCase): + def test_image_round_trip(self) -> None: + from lsst.images import Box, Image + + original = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + with RoundtripZarr(self, original) as roundtrip: + recovered = roundtrip.recovered + np.testing.assert_array_equal(recovered.array, original.array) + self.assertEqual(recovered.bbox, original.bbox) + + def test_masked_image_round_trip(self) -> None: + from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema + + schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) + image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + original = MaskedImage(image, mask_schema=schema) + original.mask.set("BAD", image.array % 2 == 0) + + with RoundtripZarr(self, original) as roundtrip: + recovered = roundtrip.recovered + np.testing.assert_array_equal(recovered.image.array, original.image.array) + np.testing.assert_array_equal(recovered.mask.array, original.mask.array) + + def test_visit_image_round_trip(self) -> None: + from lsst.images import Box, Image, MaskPlane, MaskSchema, VisitImage + + schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) + image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + original = VisitImage(image=image, mask_schema=schema) + + with RoundtripZarr(self, original) as roundtrip: + recovered = roundtrip.recovered + np.testing.assert_array_equal(recovered.image.array, original.image.array) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 3: Run the tests** + +Run: `pytest tests/test_zarr_round_trip.py -v` +Expected: PASS — three 
round-trips. If any fail because a per-class detail in the input archive isn't quite right (e.g. a missing `lsst.companions` key for MaskedImage), fix it in `_input_archive.py`/`_output_archive.py` and re-run. + +- [ ] **Step 4: Commit** + +```bash +git add python/lsst/images/tests/_roundtrip.py tests/test_zarr_round_trip.py +git commit -m "test: round-trip Image, MaskedImage, VisitImage through the zarr backend" +``` + +--- + +**End of Phase 3.** Read side complete for `Image` / `MaskedImage` / `VisitImage`, with the lazy-subset invariant pinned by `_CountingStore` and full write→read round-trips green. Phase 4 adds `ColorImage` (channel axis + transpose) and `CellCoadd` (cell-aligned chunks + 4D per-cell PSF). + +## Phase 4 — `ColorImage` and `CellCoadd` + +This phase adds the two image types whose on-disk layouts deviate from the default `(y, x)` shape: `ColorImage` stacks its three channels into a single `(3, Y, X)` OME image at `/0`, and `CellCoadd` keeps default 2D layout for image / mask / variance but adds a 4D `(Cy, Cx, Py, Px)` per-cell PSF group at `/lsst/psf/per_cell` plus cell-aligned chunks driven by the coadd's `cell_shape`. + +**Key implementation idea (shared):** the user's `serialize()` runs unchanged. Each `serialize_direct("red", ...)` / `serialize_direct("blue", ...)` / per-cell PSF call lands its arrays at the natural per-component path in the IR (`/lsst/red/0`, `/lsst/blue/0`, `/lsst/psf/per_cell/cell__/0`). After `obj.serialize` returns, `_layout` runs a **fixup pass** keyed on `archive_class` that: + +1. Collects the per-component IR arrays. +2. Stacks them into the canonical OME-shaped array (transposed for `ColorImage`; nested-cell-stacked for `CellCoadd`). +3. Replaces the per-component arrays with **sliced source references** in the JSON tree so reads can still resolve them via `get_array` without re-implementing every type's deserializer. 
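The three fixup steps above can be sketched for `ColorImage` without zarr or numpy. This is only an illustration under stated assumptions: `ir` and `tree` are plain dictionaries standing in for the IR and the staged JSON tree, `fixup_color_image_sketch` is a hypothetical name, and the channel order red, green, blue mapping to `c=0, 1, 2` is assumed. The `zarr:/0?c=N` source form is the one this plan uses for `ColorImage`.

```python
# Hypothetical stand-ins: `ir` maps zarr paths to (Y, X) planes (nested lists
# here), and `tree` maps component names to their array source references.
CHANNELS = ("red", "green", "blue")  # assumed channel order: c=0, 1, 2


def fixup_color_image_sketch(ir, tree):
    """Stack per-channel planes at /0 and rewrite references to sliced sources."""
    # Step 1: collect (and remove) the per-component arrays.
    planes = [ir.pop(f"/lsst/{name}/0") for name in CHANNELS]
    first_shape = (len(planes[0]), len(planes[0][0]))
    for plane in planes:
        if (len(plane), len(plane[0])) != first_shape:
            raise ValueError("all channels must share one (Y, X) shape")
    # Step 2: stack into a single (3, Y, X) array with the channel axis first.
    ir["/0"] = planes
    # Step 3: point each component at its slice of the stacked array.
    for index, name in enumerate(CHANNELS):
        tree[name] = f"zarr:/0?c={index}"
    return ir, tree


ir = {
    "/lsst/red/0": [[1, 2], [3, 4]],
    "/lsst/green/0": [[5, 6], [7, 8]],
    "/lsst/blue/0": [[9, 10], [11, 12]],
}
tree = {name: f"zarr:/lsst/{name}/0" for name in CHANNELS}
ir, tree = fixup_color_image_sketch(ir, tree)
```

Because the references are rewritten rather than removed, a deserializer that asks for the `blue` array still receives a `(Y, X)` plane; only the location it is read from changes.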
+ +The sliced-source convention reuses the existing `ArrayReferenceModel.source` field with a query suffix: + +- `zarr:/0?c=0` → axis 0 of `/0`, slice `[0:1, :, :]`, then squeeze axis 0 (used for ColorImage) +- `zarr:/lsst/psf/per_cell/0?cell=3,5` → cell `(3, 5)` of `/lsst/psf/per_cell/0`, slice `[3:4, 5:6, :, :]`, then squeeze the cell axes (used for CellCoadd PSF) + +`get_array` parses the suffix, composes the implicit slice with any user-supplied `slices=`, and forwards the composed slice to the lazy `zarr.Array` handle. The lazy invariant from Phase 3 is preserved: a subset read of one channel of one ColorImage still touches only the chunks intersecting that subset along the spatial axes. + +### Task 4.1: Sliced-source URL parsing in `get_array` + +**Files:** +- Modify: `python/lsst/images/zarr/_common.py` +- Modify: `python/lsst/images/zarr/_input_archive.py` +- Modify: `tests/test_zarr_input_archive.py` + +This task adds the URL parser and threads it through `get_array`. No production user is calling these sliced sources yet — Tasks 4.2 and 4.4 introduce the writers. The test bench-tests the parser by hand-constructing references against an IR with a stacked array. + +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_input_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrSlicedSourceTestCase(unittest.TestCase): + def test_channel_slice_returns_one_channel(self) -> None: + from lsst.images.serialization import ArrayReferenceModel, NumberType + from lsst.images.zarr._common import LSST_NS, LSST_VERSION + + store = zarr.storage.MemoryStore() + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes( + {LSST_NS: {"version": LSST_VERSION, "archive_class": "ColorImage", "tree": "/lsst/tree"}} + ) + # Stack: 3 channels × 4 rows × 5 cols. 
+ stacked = np.arange(60, dtype=np.uint8).reshape(3, 4, 5) + root.create_array(name="0", shape=(3, 4, 5), chunks=(1, 4, 5), dtype="uint8")[:] = stacked + # Stub /lsst/tree. + lsst = root.create_group("lsst") + lsst.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" + + doc = ZarrDocument.from_zarr(store) + archive = ZarrInputArchive(doc) + + # Channel-1 reference reads a single (4, 5) plane — the c axis is + # dropped after slicing, NOT before, and a user `slices` argument + # composes correctly with the implicit channel slice. + ref = ArrayReferenceModel( + source="zarr:/0?c=1", shape=[4, 5], datatype=NumberType.from_numpy(np.dtype("uint8")) + ) + full = archive.get_array(ref) + self.assertEqual(full.shape, (4, 5)) + np.testing.assert_array_equal(full, stacked[1]) + + # Composed user slice on top of the channel suffix. + sub = archive.get_array(ref, slices=(slice(0, 2), slice(0, 3))) + self.assertEqual(sub.shape, (2, 3)) + np.testing.assert_array_equal(sub, stacked[1, :2, :3]) + + def test_cell_slice_returns_one_cell(self) -> None: + from lsst.images.serialization import ArrayReferenceModel, NumberType + from lsst.images.zarr._common import LSST_NS, LSST_VERSION + + store = zarr.storage.MemoryStore() + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes( + {LSST_NS: {"version": LSST_VERSION, "archive_class": "CellCoadd", "tree": "/lsst/tree"}} + ) + # 2x3 cells, each 4x5 PSF. + stack = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5) + psf = root.create_group("lsst").create_group("psf").create_group("per_cell") + psf.create_array(name="0", shape=(2, 3, 4, 5), chunks=(1, 1, 4, 5), dtype="float32")[:] = stack + # Stub /lsst/tree. 
+ root["lsst"].create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" + + doc = ZarrDocument.from_zarr(store) + archive = ZarrInputArchive(doc) + + ref = ArrayReferenceModel( + source="zarr:/lsst/psf/per_cell/0?cell=1,2", + shape=[4, 5], + datatype=NumberType.from_numpy(np.dtype("float32")), + ) + result = archive.get_array(ref) + self.assertEqual(result.shape, (4, 5)) + np.testing.assert_array_equal(result, stack[1, 2]) +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_input_archive.py::ZarrSlicedSourceTestCase -v` +Expected: FAIL — current `get_array` rejects `?` query suffix. + +- [ ] **Step 3: Add the parser to `_common.py`** + +Append to `python/lsst/images/zarr/_common.py`: + +```python +__all__ = ( + "LSST_NS", + "LSST_VERSION", + "OME_NS", + "OME_VERSION", + "SlicedSource", + "ZarrCompressionOptions", + "ZarrPointerModel", + "archive_path_to_zarr_path", + "json_pointer_to_zarr_path", + "parse_zarr_source", +) + + +@dataclass(frozen=True) +class SlicedSource: + """A parsed ``zarr:?`` reference. + + ``implicit_slices`` holds the slice tuple to apply *before* the + user's ``slices=`` argument. ``squeezed_axes`` lists axes to drop + after slicing (the channel axis for ColorImage, the cell axes for + CellCoadd). This shape lets `ZarrInputArchive.get_array` compose + the implicit slice with any user slice and forward the result + straight to the lazy ``zarr.Array`` handle. + """ + + path: str + implicit_slices: tuple[slice, ...] + squeezed_axes: tuple[int, ...] + + +def parse_zarr_source(source: str) -> SlicedSource: + """Parse a ``zarr:[?]`` reference into a `SlicedSource`.""" + if not source.startswith("zarr:"): + raise ValueError(f"Not a zarr source: {source!r}") + body = source[len("zarr:") :] + if "?" 
not in body: + return SlicedSource(path=body, implicit_slices=(), squeezed_axes=()) + path, query = body.split("?", 1) + if query.startswith("c="): + c = int(query[len("c=") :]) + return SlicedSource(path=path, implicit_slices=(slice(c, c + 1),), squeezed_axes=(0,)) + if query.startswith("cell="): + cy_str, cx_str = query[len("cell=") :].split(",", 1) + cy, cx = int(cy_str), int(cx_str) + return SlicedSource( + path=path, + implicit_slices=(slice(cy, cy + 1), slice(cx, cx + 1)), + squeezed_axes=(0, 1), + ) + raise ValueError(f"Unsupported zarr-source query {query!r}.") +``` + +- [ ] **Step 4: Use the parser in `get_array`** + +In `python/lsst/images/zarr/_input_archive.py`, replace the `get_array` body so it composes the implicit slices with the user's: + +```python + def get_array( + self, + model: ArrayReferenceModel | InlineArrayModel, + *, + slices: tuple[slice, ...] | EllipsisType = ..., + strip_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> np.ndarray: + if isinstance(model, InlineArrayModel): + data: np.ndarray = np.array(model.data, dtype=model.datatype.to_numpy()) + return data if slices is ... else data[slices] + if not isinstance(model.source, str) or not model.source.startswith("zarr:"): + raise ArchiveReadError( + f"ZarrInputArchive cannot resolve array source {model.source!r}; " + f"expected a 'zarr:' reference." + ) + from ._common import parse_zarr_source + + parsed = parse_zarr_source(model.source) + try: + node = self._document.root.get(parsed.path) + except KeyError: + raise ArchiveReadError(f"Array reference {parsed.path!r} not in store.") from None + if not isinstance(node, ZarrArray): + raise ArchiveReadError(f"{parsed.path!r} is not an array.") + # Compose implicit (per-class) slice with user slice. The lazy + # invariant: this composed tuple is what hits the zarr.Array + # handle, so only intersecting chunks are fetched. + composed: tuple[slice, ...] + if slices is ...: + user_slices: tuple[slice, ...] 
= (slice(None),) * ( + len(node.shape) - len(parsed.implicit_slices) + ) + else: + user_slices = slices + composed = parsed.implicit_slices + user_slices + raw = node.read(slices=composed) + # Drop the squeezed axes (channel / cell axes that the implicit + # slice constrained to size 1). + for axis in sorted(parsed.squeezed_axes, reverse=True): + raw = np.squeeze(raw, axis=axis) + return raw +``` + +- [ ] **Step 5: Run the tests to verify they pass** + +Run: `pytest tests/test_zarr_input_archive.py -v` +Expected: PASS — including both new sliced-source tests. + +- [ ] **Step 6: Commit** + +```bash +git add python/lsst/images/zarr/_common.py python/lsst/images/zarr/_input_archive.py tests/test_zarr_input_archive.py +git commit -m "feat: parse zarr:<path>?c=N / ?cell=Cy,Cx sliced-source references" +``` + +### Task 4.2: ColorImage layout fixup on write + +**Files:** +- Modify: `python/lsst/images/zarr/_layout.py` (add `fixup_color_image`) +- Modify: `python/lsst/images/zarr/_output_archive.py` (call the fixup in `add_tree` when `archive_class == "ColorImage"`) +- Modify: `python/lsst/images/zarr/_common.py` (add `/red/image`, `/green/image`, `/blue/image` mappings to `_JSON_POINTER_TO_ZARR_PATH` so per-channel arrays land at predictable paths) +- Modify: `tests/test_zarr_output_archive.py` (assert IR shape after fixup) + +The fixup runs after `serialize()` populates the IR. By that point, the IR has three top-level Image sub-archives at `/lsst/red/0`, `/lsst/green/0`, `/lsst/blue/0` (each shaped `(Y, X)`), plus the staged JSON tree referencing them by `source="zarr:/lsst/<channel>/0"`. The fixup: + +1. Reads the three numpy arrays out of the IR (still numpy; no zarr handles yet). +2. Stacks via `transpose_color_image_in(np.stack([r, g, b], axis=-1))` to produce `(3, Y, X)`. +3. Stages this stacked array at `/0` with chunks `(1, 1024, 1024)` (default) and shards. +4. Removes `/lsst/red`, `/lsst/green`, `/lsst/blue` from the IR. +5. 
Walks the staged tree's JSON bytes to rewrite the source URLs: + - `"zarr:/lsst/red/0"` → `"zarr:/0?c=0"` + - `"zarr:/lsst/green/0"` → `"zarr:/0?c=1"` + - `"zarr:/lsst/blue/0"` → `"zarr:/0?c=2"` +6. Adds OME `omero/channels` to the root attributes. + +The JSON-rewrite step is a flat string substitution because each source URL is unique and self-contained — no nested-pointer escaping concerns. + +- [ ] **Step 1: Update `_JSON_POINTER_TO_ZARR_PATH`** + +In `python/lsst/images/zarr/_common.py`, replace the existing dict literal with: + +```python +_JSON_POINTER_TO_ZARR_PATH: dict[str, str] = { + "/image": "/0", + "/mask": "/lsst/mask/0", + "/variance": "/lsst/variance/0", + # ColorImage per-channel sub-archives (stacked into /0 by the + # _layout.fixup_color_image pass on write). + "/red/image": "/lsst/red/0", + "/green/image": "/lsst/green/0", + "/blue/image": "/lsst/blue/0", +} +``` + +- [ ] **Step 2: Write the failing test** + +Append to `tests/test_zarr_output_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrColorImageWriteTestCase(unittest.TestCase): + def test_color_image_stacks_into_top_level_array(self) -> None: + import os + import tempfile + + import numpy as np + import zarr + + from lsst.images import Box, ColorImage, Image + from lsst.images.zarr import write + from lsst.images.zarr._model import ZarrDocument + + red = Image(np.full((4, 5), 1, dtype=np.uint8), bbox=Box.factory[10:14, 20:25]) + green = Image(np.full((4, 5), 2, dtype=np.uint8), bbox=red.bbox) + blue = Image(np.full((4, 5), 3, dtype=np.uint8), bbox=red.bbox) + color = ColorImage(red=red, green=green, blue=blue) + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(color, target) + with zarr.storage.LocalStore(target, read_only=True) as store: + doc = ZarrDocument.from_zarr(store) + # Stacked at /0 with shape (3, Y, X). 
+ self.assertIn("0", doc.root.arrays) + self.assertEqual(doc.root.arrays["0"].shape, (3, 4, 5)) + # Per-channel sub-archives are gone after the fixup. + self.assertNotIn("red", doc.root.groups.get("lsst", _empty()).groups) + # OME attributes name the channel axis. + axes = [a["name"] for a in doc.root.attributes.ome["multiscales"][0]["axes"]] + self.assertEqual(axes, ["c", "y", "x"]) + self.assertIn("omero", doc.root.attributes.ome) + + +def _empty(): # noqa: ANN201 + from lsst.images.zarr._model import ZarrGroup + + return ZarrGroup() +``` + +- [ ] **Step 3: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_output_archive.py::ZarrColorImageWriteTestCase -v` +Expected: FAIL — fixup not implemented; per-channel arrays remain at `/lsst/red/0` etc. + +- [ ] **Step 4: Add `fixup_color_image` to `_layout.py`** + +Append to `python/lsst/images/zarr/_layout.py`: + +```python +__all__ = ( + "axes_for_archive_class", + "chunks_for", + "fixup_color_image", + "transpose_color_image_in", + "transpose_color_image_out", +) + + +def fixup_color_image(document: "ZarrDocument") -> None: + """Stack ColorImage red/green/blue sub-archives into a single ``/0``. + + Runs after ``ColorImage.serialize`` has populated the IR. Reads + the three per-channel numpy arrays out of ``/lsst/red/0``, + ``/lsst/green/0``, ``/lsst/blue/0``, transposes them into + ``(c, y, x)`` shape, stages the result at ``/0``, removes the + per-channel sub-archives, and rewrites the staged JSON tree so + references to the per-channel arrays become channel-sliced + references against ``/0``. 
+ """ + from ._model import ZarrArray, ZarrDocument # local: avoid circular import + + if not isinstance(document, ZarrDocument): + raise TypeError(type(document).__name__) + lsst = document.root.groups.get("lsst") + if lsst is None: + return + channels = [] + for name in ("red", "green", "blue"): + sub = lsst.groups.get(name) + if sub is None or "0" not in sub.arrays: + return # not a fully-populated ColorImage; bail out + channels.append(sub.arrays["0"].data) + if not all(isinstance(c, np.ndarray) for c in channels): + raise TypeError("ColorImage fixup requires staged numpy arrays.") + stacked = np.stack(channels, axis=0) # (3, Y, X) + document.root.arrays["0"] = ZarrArray(data=stacked) + for name in ("red", "green", "blue"): + del lsst.groups[name] + # Rewrite source URLs in the JSON tree. + if "tree" in lsst.arrays: + json_bytes = bytes(lsst.arrays["tree"].data) + rewrites = { + b"zarr:/lsst/red/0": b"zarr:/0?c=0", + b"zarr:/lsst/green/0": b"zarr:/0?c=1", + b"zarr:/lsst/blue/0": b"zarr:/0?c=2", + } + for old, new in rewrites.items(): + json_bytes = json_bytes.replace(old, new) + lsst.arrays["tree"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) +``` + +- [ ] **Step 5: Call the fixup from `add_tree`** + +In `python/lsst/images/zarr/_output_archive.py`, extend `add_tree` so that — after the JSON tree is staged — it calls the fixup and adds OME `omero/channels`: + +```python + if archive_class == "ColorImage": + from ._layout import fixup_color_image + from ._model import OmeOmeroChannel + + fixup_color_image(self.document) + self.document.root.attributes.ome["omero"] = { + "channels": [ + OmeOmeroChannel(label="red", color="FF0000").dump(), + OmeOmeroChannel(label="green", color="00FF00").dump(), + OmeOmeroChannel(label="blue", color="0000FF").dump(), + ] + } + if "0" in self.document.root.arrays: + top = self.document.root.arrays["0"] + multiscale = OmeMultiscale( + name=archive_class.lower(), + axes=axes_for_archive_class(archive_class), + ) + 
self.document.root.attributes.ome["multiscales"] = [multiscale.dump()] +``` + +(Move the existing multiscales logic so it runs *after* the fixup; otherwise the axes would still say `(y, x)` because `/0` doesn't exist yet at fixup time.) + +- [ ] **Step 6: Run the tests to verify they pass** + +Run: `pytest tests/test_zarr_output_archive.py::ZarrColorImageWriteTestCase -v` +Expected: PASS — `/0` is `(3, 4, 5)`, per-channel groups are gone, OME axes are `(c, y, x)`. + +- [ ] **Step 7: Commit** + +```bash +git add python/lsst/images/zarr/_layout.py python/lsst/images/zarr/_output_archive.py python/lsst/images/zarr/_common.py tests/test_zarr_output_archive.py +git commit -m "feat: stack ColorImage channels into a single (3, Y, X) OME image" +``` + +### Task 4.3: ColorImage round-trip + +**Files:** +- Modify: `tests/test_zarr_round_trip.py` + +This task only adds a round-trip assertion. The work in Tasks 4.1 and 4.2 means the engineer expects this to pass without further code changes; if it fails, the failure surfaces a missing piece in either the JSON-rewrite step or the channel-slice parser. 
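
If the round-trip does fail, the two suspects named above can be exercised in isolation. The following is a minimal, standard-library-only sketch rather than the backend's code: `rewrite_channel_sources` and `parse_channel` are hypothetical helper names that mimic the flat byte substitution from Task 4.2 and the `?c=N` parsing from Task 4.1, so a failure can be narrowed to one side or the other.

```python
import json

# Hypothetical stand-in for the rewrite table used by the Task 4.2 fixup.
CHANNEL_REWRITES = {
    b"zarr:/lsst/red/0": b"zarr:/0?c=0",
    b"zarr:/lsst/green/0": b"zarr:/0?c=1",
    b"zarr:/lsst/blue/0": b"zarr:/0?c=2",
}


def rewrite_channel_sources(tree_bytes: bytes) -> bytes:
    """Apply the flat byte substitution to a serialized JSON tree."""
    for old, new in CHANNEL_REWRITES.items():
        tree_bytes = tree_bytes.replace(old, new)
    return tree_bytes


def parse_channel(source: str) -> tuple[str, int]:
    """Split ``zarr:<path>?c=<N>`` into ``(path, N)``."""
    if not source.startswith("zarr:"):
        raise ValueError(f"Not a zarr source: {source!r}")
    path, sep, query = source[len("zarr:"):].partition("?")
    if not sep or not query.startswith("c="):
        raise ValueError(f"No channel suffix in {source!r}")
    return path, int(query[len("c="):])


# A toy tree with the three per-channel references the fixup expects.
tree = json.dumps(
    {
        "red": {"source": "zarr:/lsst/red/0"},
        "green": {"source": "zarr:/lsst/green/0"},
        "blue": {"source": "zarr:/lsst/blue/0"},
    }
).encode("utf-8")

rewritten = json.loads(rewrite_channel_sources(tree))
channels = {name: parse_channel(node["source"]) for name, node in rewritten.items()}
```

The substitution only matches when the serializer writes `/` unescaped; JSON also permits `\/`, so a tree produced by a different encoder would pass through unchanged and surface as a read-side failure.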
+ +- [ ] **Step 1: Write the test** + +Append to `tests/test_zarr_round_trip.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrColorImageRoundTripTestCase(unittest.TestCase): + def test_color_image_round_trip(self) -> None: + from lsst.images import Box, ColorImage, Image + + red = Image(np.full((4, 5), 1, dtype=np.uint8), bbox=Box.factory[10:14, 20:25]) + green = Image(np.full((4, 5), 2, dtype=np.uint8), bbox=red.bbox) + blue = Image(np.full((4, 5), 3, dtype=np.uint8), bbox=red.bbox) + original = ColorImage(red=red, green=green, blue=blue) + + with RoundtripZarr(self, original) as roundtrip: + recovered = roundtrip.recovered + np.testing.assert_array_equal(recovered.red.array, original.red.array) + np.testing.assert_array_equal(recovered.green.array, original.green.array) + np.testing.assert_array_equal(recovered.blue.array, original.blue.array) +``` + +- [ ] **Step 2: Run the test** + +Run: `pytest tests/test_zarr_round_trip.py::ZarrColorImageRoundTripTestCase -v` +Expected: PASS. + +- [ ] **Step 3: Commit** + +```bash +git add tests/test_zarr_round_trip.py +git commit -m "test: round-trip ColorImage through the zarr backend" +``` + +### Task 4.4: CellCoadd default chunk geometry + +**Files:** +- Modify: `python/lsst/images/zarr/_layout.py` (extend `chunks_for` to honor `cell_shape`) +- Modify: `python/lsst/images/zarr/_output_archive.py` (publish `cell_shape` to the archive so `chunks_for` can see it) +- Modify: `tests/test_zarr_layout.py` +- Modify: `tests/test_zarr_output_archive.py` + +`CellCoadd` chunks should align to the coadd's cell grid: instead of `(1024, 1024)`, the default chunks become `cell_shape` (typically `(256, 256)`). The fix surfaces the cell shape from the in-memory object to `chunks_for` via a new optional `archive_metadata` argument. 
+ +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_layout.py`: + +```python + def test_chunks_for_cell_coadd_uses_cell_shape(self) -> None: + result = chunks_for( + "CellCoadd", + (4096, 4096), + None, + archive_metadata={"cell_shape": (256, 256)}, + ) + self.assertEqual(result, (256, 256)) + + def test_chunks_for_cell_coadd_without_metadata_falls_back(self) -> None: + # When cell_shape isn't available, fall back to the 1024 default. + self.assertEqual(chunks_for("CellCoadd", (4096, 4096), None), (1024, 1024)) +``` + +- [ ] **Step 2: Run to verify failure** + +Run: `pytest tests/test_zarr_layout.py -v` +Expected: FAIL — `chunks_for` doesn't accept `archive_metadata`. + +- [ ] **Step 3: Extend `chunks_for`** + +Replace `chunks_for` in `python/lsst/images/zarr/_layout.py`: + +```python +def chunks_for( + archive_class: str, + shape: tuple[int, ...], + override: tuple[int, ...] | None, + *, + archive_metadata: Mapping[str, Any] | None = None, +) -> tuple[int, ...]: + """Return the chunk shape to use for an array. + + Parameters + ---------- + archive_class + The top-level archive class. + shape + The full array shape, used to clamp the default per-axis. + override + User-supplied chunk shape; if not ``None`` it is returned + verbatim after a length check. + archive_metadata + Optional dict carrying class-specific layout hints. ``CellCoadd`` + uses ``"cell_shape"`` to align chunks to the cell grid. + """ + if override is not None: + if len(override) != len(shape): + raise ValueError( + f"chunks override has rank {len(override)}, " + f"expected {len(shape)} for {archive_class!r}." + ) + return tuple(override) + if archive_class == "CellCoadd" and archive_metadata is not None: + cell_shape = archive_metadata.get("cell_shape") + if cell_shape is not None: + # Align chunks to the cell grid (still clamped to the array shape). 
+ return tuple(min(c, dim) for c, dim in zip(cell_shape, shape, strict=True)) + return tuple(min(_DEFAULT_AXIS_LIMIT, dim) for dim in shape) +``` + +(Add `from collections.abc import Mapping` and `from typing import Any` to the imports.) + +- [ ] **Step 4: Surface `cell_shape` from `write()` into the archive** + +Edit `python/lsst/images/zarr/_output_archive.py` so `write()` extracts `cell_shape` from the object and stashes it on the archive: + +```python +def write( + obj: Any, + path: Any, + *, + chunks: Mapping[str, tuple[int, ...] | None] | None = None, + shards: Mapping[str, tuple[int, ...] | None] | None = None, + compression: Mapping[str, ZarrCompressionOptions | None] | None = None, + metadata: Mapping[str, Any] | None = None, + butler_info: Any | None = None, +) -> ArchiveTree: + from ._store import open_store_for_write + + archive_default_name = getattr(obj, "_archive_default_name", None) + archive_metadata: dict[str, Any] = {} + if (cell_shape := getattr(obj, "cell_shape", None)) is not None: + archive_metadata["cell_shape"] = tuple(cell_shape) + archive = ZarrOutputArchive( + chunks=chunks, + shards=shards, + compression=compression, + archive_metadata=archive_metadata, + ) + ... +``` + +…and route `archive_metadata` through to `add_array` so it can call `chunks_for` with the hints. The simplest path is for `ZarrOutputArchive.add_array` to compute the default chunks itself instead of leaving it to `to_zarr`: + +```python + def __init__( + self, + *, + chunks=None, + shards=None, + compression=None, + archive_metadata: Mapping[str, Any] | None = None, + ) -> None: + ... + self._archive_metadata = dict(archive_metadata) if archive_metadata else {} + + def add_array(self, array, *, name=None, update_header=...): + ... 
+ chunks = self._chunks.get(name) + if chunks is None: + chunks = chunks_for( + self._archive_class_hint or "Image", + array.shape, + None, + archive_metadata=self._archive_metadata, + ) + parent.arrays[leaf] = ZarrArray( + data=np.ascontiguousarray(array), + chunks=chunks, + ... + ) +``` + +`self._archive_class_hint` is set by `add_tree` when it knows the top-level class — but `add_array` may run before `add_tree`. Workable approach: set it in `write()` *before* `obj.serialize` runs, by passing the class name into the constructor. + +- [ ] **Step 5: Write the output-archive layout test** + +Append to `tests/test_zarr_output_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrCellCoaddChunkLayoutTestCase(unittest.TestCase): + def test_cell_coadd_chunks_align_to_cell_shape(self) -> None: + import os + import tempfile + + from lsst.images import Box, Image, MaskPlane, MaskSchema + from lsst.images.cells import CellCoadd + from lsst.images.zarr import write + from lsst.images.zarr._model import ZarrDocument + + # Construct a minimal CellCoadd; the constructor signature will + # depend on the public API. Adjust to whatever this codebase + # exposes — the assertion below depends only on chunk shape. + coadd = _make_minimal_cell_coadd(cell_shape=(256, 256), shape=(512, 512)) + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "coadd.zarr") + write(coadd, target) + with zarr.storage.LocalStore(target, read_only=True) as store: + doc = ZarrDocument.from_zarr(store) + self.assertEqual(tuple(doc.root.arrays["0"].chunks), (256, 256)) + + +def _make_minimal_cell_coadd(*, cell_shape, shape): # noqa: ANN001, ANN201 + """Construct a minimal CellCoadd for layout testing. + + The constructor in this codebase may require a Projection, mask, + variance, etc. 
Build the smallest valid instance possible — this + helper is a placeholder; the implementer should replace it with the + actual minimal construction once they consult the CellCoadd ctor. + """ + raise unittest.SkipTest("Implementer: build a minimal CellCoadd per the local ctor.") +``` + +(The `SkipTest` placeholder is intentional — the engineer may need to read the `CellCoadd` constructor in `python/lsst/images/cells/_coadd.py` to assemble a minimal valid coadd. Replace the placeholder with the real construction code; the chunk-shape assertion stands as-is.) + +- [ ] **Step 6: Run the tests** + +Run: `pytest tests/test_zarr_layout.py tests/test_zarr_output_archive.py -v` +Expected: layout tests PASS; the CellCoadd output-archive test runs (or skips with the placeholder). + +- [ ] **Step 7: Commit** + +```bash +git add python/lsst/images/zarr/_layout.py python/lsst/images/zarr/_output_archive.py tests/test_zarr_layout.py tests/test_zarr_output_archive.py +git commit -m "feat: align CellCoadd default chunks to cell_shape" +``` + +### Task 4.5: 4D per-cell PSF stacking for `CellCoadd` + +**Files:** +- Modify: `python/lsst/images/zarr/_layout.py` (add `fixup_cell_coadd_psf`) +- Modify: `python/lsst/images/zarr/_output_archive.py` (call the fixup when `archive_class == "CellCoadd"`) +- Modify: `tests/test_zarr_output_archive.py` + +The mechanics mirror `fixup_color_image`: + +1. Walk `/lsst/psf/per_cell/cell_<i>_<j>/0` arrays in the IR. +2. Stack into a single `(Cy, Cx, Py, Px)` array at `/lsst/psf/per_cell/0` with chunks `(1, 1, Py, Px)` so per-cell reads stay one-chunk. +3. Remove the per-cell sub-groups. +4. Rewrite source URLs `zarr:/lsst/psf/per_cell/cell_<i>_<j>/0` → `zarr:/lsst/psf/per_cell/0?cell=<i>,<j>`. + +Per the spec: "Always nested under `lsst/psf/per_cell` of a parent CellCoadd". The fixup only runs when the IR has at least one such per-cell group; CellCoadd objects without per-cell PSFs leave the IR alone. 
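
The grid-inference part of these steps can be sketched without numpy or the IR classes. The helpers below are hypothetical (`cell_grid` and `rewritten_source` are names chosen for illustration); they assume the `cell_<i>_<j>` group naming that the Step 1 test builds with `f"cell_{i}_{j}"`, and they additionally report gaps in the grid, which a stacked array would otherwise have to fill with some default value.

```python
import re
from collections.abc import Iterable

# Assumed naming convention for per-cell groups: cell_<i>_<j>.
_CELL_NAME = re.compile(r"cell_(\d+)_(\d+)")


def cell_grid(names: Iterable[str]) -> tuple[tuple[int, int], list[tuple[int, int]]]:
    """Return ``((Cy, Cx), missing)`` for a collection of group names.

    Names that do not follow the convention are ignored. ``missing``
    lists grid positions inside the bounding grid that have no group.
    """
    present: set[tuple[int, int]] = set()
    for name in names:
        match = _CELL_NAME.fullmatch(name)
        if match is None:
            continue
        present.add((int(match.group(1)), int(match.group(2))))
    if not present:
        return (0, 0), []
    cy = max(i for i, _ in present) + 1
    cx = max(j for _, j in present) + 1
    missing = sorted(
        (i, j) for i in range(cy) for j in range(cx) if (i, j) not in present
    )
    return (cy, cx), missing


def rewritten_source(i: int, j: int) -> str:
    """Return the cell-sliced reference that replaces a per-cell source."""
    return f"zarr:/lsst/psf/per_cell/0?cell={i},{j}"


grid, missing = cell_grid(["cell_0_0", "cell_0_1", "cell_1_1", "notes"])
```

Whether a sparse grid is legal for `CellCoadd` is not stated here, so the sketch only reports the gaps and leaves the policy to the implementer.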
+ +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_output_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrCellCoaddPsfStackingTestCase(unittest.TestCase): + def test_per_cell_psf_stacks_into_4d_array(self) -> None: + # Hand-build the IR shape that CellCoadd.serialize would produce + # so this test is independent of the (still-evolving) CellCoadd + # constructor. + import numpy as np + + from lsst.images.zarr._layout import fixup_cell_coadd_psf + from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup + + doc = ZarrDocument(root=ZarrGroup()) + psf = doc.root.ensure_group("/lsst/psf/per_cell") + for i in range(2): + for j in range(3): + cell = psf.groups.setdefault(f"cell_{i}_{j}", ZarrGroup()) + cell.arrays["0"] = ZarrArray( + data=np.full((4, 5), i * 10 + j, dtype=np.float32) + ) + # Stub a JSON tree referencing one cell so we can spot-check the + # rewrite. + ref = b'"source": "zarr:/lsst/psf/per_cell/cell_1_2/0"' + doc.root.ensure_group("/lsst").arrays["tree"] = ZarrArray( + data=np.frombuffer(b"{" + ref + b"}", dtype=np.uint8) + ) + + fixup_cell_coadd_psf(doc) + + stacked = doc.root.get("/lsst/psf/per_cell/0") + self.assertEqual(stacked.shape, (2, 3, 4, 5)) + # Each per-cell group is gone. + per_cell = doc.root.get("/lsst/psf/per_cell") + self.assertEqual(per_cell.groups, {}) + # JSON rewrite happened. + rewritten = bytes(doc.root.get("/lsst/tree").data).decode("utf-8") + self.assertIn('"zarr:/lsst/psf/per_cell/0?cell=1,2"', rewritten) +``` + +- [ ] **Step 2: Run to verify failure** + +Run: `pytest tests/test_zarr_output_archive.py::ZarrCellCoaddPsfStackingTestCase -v` +Expected: FAIL — `fixup_cell_coadd_psf` does not exist. 
+ +- [ ] **Step 3: Implement the fixup** + +Append to `python/lsst/images/zarr/_layout.py` (and add to `__all__`): + +```python +def fixup_cell_coadd_psf(document: "ZarrDocument") -> None: + """Stack per-cell PSF sub-archives into a single 4D OME image.""" + from ._model import ZarrArray, ZarrDocument, ZarrGroup + + if not isinstance(document, ZarrDocument): + raise TypeError(type(document).__name__) + try: + per_cell = document.root.get("/lsst/psf/per_cell") + except KeyError: + return + if not isinstance(per_cell, ZarrGroup) or not per_cell.groups: + return + # Parse cell coordinates from the group names. + coords: dict[tuple[int, int], np.ndarray] = {} + for name, sub in per_cell.groups.items(): + if not name.startswith("cell_"): + continue + try: + i_str, j_str = name[len("cell_") :].split("_", 1) + i, j = int(i_str), int(j_str) + except ValueError: + continue + if "0" not in sub.arrays: + continue + arr = sub.arrays["0"].data + if not isinstance(arr, np.ndarray): + raise TypeError("CellCoadd PSF fixup requires staged numpy arrays.") + coords[(i, j)] = arr + if not coords: + return + cy = max(k[0] for k in coords) + 1 + cx = max(k[1] for k in coords) + 1 + sample = next(iter(coords.values())) + py, px = sample.shape + stacked = np.zeros((cy, cx, py, px), dtype=sample.dtype) + for (i, j), arr in coords.items(): + stacked[i, j] = arr + per_cell.arrays["0"] = ZarrArray(data=stacked, chunks=(1, 1, py, px)) + per_cell.groups.clear() + # Rewrite the JSON tree URLs. 
+ lsst = document.root.groups.get("lsst") + if lsst is None or "tree" not in lsst.arrays: + return + json_bytes = bytes(lsst.arrays["tree"].data) + for (i, j) in coords: + old = f"zarr:/lsst/psf/per_cell/cell_{i}_{j}/0".encode() + new = f"zarr:/lsst/psf/per_cell/0?cell={i},{j}".encode() + json_bytes = json_bytes.replace(old, new) + lsst.arrays["tree"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) +``` + +- [ ] **Step 4: Call the fixup from `add_tree`** + +In `python/lsst/images/zarr/_output_archive.py`, in the `add_tree` body, before the multiscales block: + +```python + if archive_class == "CellCoadd": + from ._layout import fixup_cell_coadd_psf + + fixup_cell_coadd_psf(self.document) +``` + +- [ ] **Step 5: Run the tests** + +Run: `pytest tests/test_zarr_output_archive.py::ZarrCellCoaddPsfStackingTestCase -v` +Expected: PASS — stacked shape is `(2, 3, 4, 5)` and the JSON rewrite is present. + +- [ ] **Step 6: Commit** + +```bash +git add python/lsst/images/zarr/_layout.py python/lsst/images/zarr/_output_archive.py tests/test_zarr_output_archive.py +git commit -m "feat: stack CellCoadd per-cell PSFs into a 4D (Cy, Cx, Py, Px) image" +``` + +### Task 4.6: CellCoadd round-trip + +**Files:** +- Modify: `tests/test_zarr_round_trip.py` + +- [ ] **Step 1: Write the test** + +Append to `tests/test_zarr_round_trip.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrCellCoaddRoundTripTestCase(unittest.TestCase): + def test_cell_coadd_round_trip(self) -> None: + original = _make_minimal_cell_coadd_with_psf() # implementer-supplied + with RoundtripZarr(self, original) as roundtrip: + recovered = roundtrip.recovered + np.testing.assert_array_equal(recovered.image.array, original.image.array) + # Spot-check one per-cell PSF if the API exposes them. 
+ if hasattr(original, "psf") and hasattr(original.psf, "per_cell"): + np.testing.assert_array_equal( + recovered.psf.per_cell[0, 0], original.psf.per_cell[0, 0] + ) + + +def _make_minimal_cell_coadd_with_psf(): # noqa: ANN201 + raise unittest.SkipTest("Implementer: assemble a minimal CellCoadd with a per-cell PSF.") +``` + +(As in Task 4.4, the constructor is implementer-supplied — the test asserts only round-trip correctness.) + +- [ ] **Step 2: Run the test** + +Run: `pytest tests/test_zarr_round_trip.py::ZarrCellCoaddRoundTripTestCase -v` +Expected: PASS (or SKIP if the helper is still a placeholder; replace before merging the phase). + +- [ ] **Step 3: Commit** + +```bash +git add tests/test_zarr_round_trip.py +git commit -m "test: round-trip CellCoadd through the zarr backend" +``` + +--- + +**End of Phase 4.** Six tasks. ColorImage stacks into a single OME `(c, y, x)` image; CellCoadd uses cell-aligned chunks and stacks per-cell PSFs into a 4D `(Cy, Cx, Py, Px)` image. Both types round-trip and read sliced sources via the Phase 3 lazy-handle path. Phase 5 covers cross-format round-trips (FITS↔Zarr opaque-metadata preservation) and the optional external-reader sanity tests. + +## Phase 5 — Cross-format round-trips and external-reader sanity + +This phase makes the zarr backend a peer of FITS / NDF for round-trip preservation: an object read from FITS carries its primary-HDU header in `_opaque_metadata`, and writing that object to zarr must preserve those cards so a later round-trip back to FITS reproduces the original headers byte-for-byte. + +The phase also adds two **optional** external-reader tests that run only when the corresponding tools are installed: an `ngff-validator` compliance check and an `ome-zarr-py` reader sanity check. Both are skipped silently in environments where the tooling isn't available. 
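
One way to express the "run only when installed" condition is a pair of availability probes feeding `unittest.skipUnless`, in the same spirit as the `HAVE_ZARR` guard used elsewhere in this plan. This is a sketch under assumptions: `ome_zarr` is the import name of ome-zarr-py, while the `ngff-validator` command name and the test class are placeholders chosen for illustration.

```python
import importlib.util
import shutil
import unittest


def have_module(name: str) -> bool:
    """Return True if a top-level module is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def have_command(name: str) -> bool:
    """Return True if an executable with this name is on PATH."""
    return shutil.which(name) is not None


# The command name below is an assumption; substitute whatever the local
# validator installation actually provides.
HAVE_OME_ZARR = have_module("ome_zarr")
HAVE_NGFF_VALIDATOR = have_command("ngff-validator")


@unittest.skipUnless(HAVE_OME_ZARR, "ome-zarr-py is not installed")
class OmeZarrReaderSanityTestCase(unittest.TestCase):
    def test_reader_is_available(self) -> None:
        # Placeholder: the real test would open a written archive with
        # ome-zarr-py and compare the pixel data.
        self.assertTrue(HAVE_OME_ZARR)
```

Probing with `find_spec` avoids importing a heavy optional dependency at collection time just to decide whether to skip.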
+ +### Task 5.1: Persist `FitsOpaqueMetadata` on write to zarr + +**Files:** +- Modify: `python/lsst/images/zarr/_output_archive.py` (extend `write` to accept and stash opaque metadata) +- Modify: `python/lsst/images/zarr/_layout.py` (add `serialize_fits_opaque_metadata`) +- Test: `tests/test_zarr_output_archive.py` + +The opaque metadata lives at `/lsst/opaque_metadata/fits/primary` as a 1-D `uint8` zarr array containing UTF-8 JSON. The JSON encodes the astropy `Header` via `header.tostring(sep="\n", endcard=False, padding=False)` (lossless and human-readable). The root attribute `lsst.opaque_metadata_format = "fits"` flags its presence. + +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_output_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrOpaqueMetadataWriteTestCase(unittest.TestCase): + def test_fits_opaque_metadata_persists(self) -> None: + import os + import tempfile + + import astropy.io.fits + import numpy as np + import zarr + + from lsst.images import Box, Image + from lsst.images.fits._common import ExtensionKey, FitsOpaqueMetadata + from lsst.images.zarr import write + from lsst.images.zarr._model import ZarrDocument + + image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + header = astropy.io.fits.Header() + header["ORIGIN"] = "RUBIN" + header["EXPTIME"] = 30.0 + opaque = FitsOpaqueMetadata() + opaque.headers[ExtensionKey()] = header + image._opaque_metadata = opaque + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(image, target) + with zarr.storage.LocalStore(target, read_only=True) as store: + doc = ZarrDocument.from_zarr(store) + self.assertEqual( + doc.root.attributes.lsst.get("opaque_metadata_format"), "fits" + ) + opaque_node = doc.root.get("/lsst/opaque_metadata/fits/primary") + json_bytes = bytes(opaque_node.read()) + # Round-trip through astropy. 
+ import json as _json + + cards = _json.loads(json_bytes) + self.assertIn("ORIGIN", cards) + self.assertEqual(cards["ORIGIN"], "RUBIN") + self.assertEqual(cards["EXPTIME"], 30.0) +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_output_archive.py::ZarrOpaqueMetadataWriteTestCase -v` +Expected: FAIL — `/lsst/opaque_metadata/fits/primary` not in store. + +- [ ] **Step 3: Implement opaque-metadata serialization** + +Append to `python/lsst/images/zarr/_layout.py`: + +```python +def serialize_fits_opaque_metadata(document: "ZarrDocument", opaque: Any) -> None: + """Stage a `FitsOpaqueMetadata` object into the IR. + + Stores the primary-HDU header as a JSON-encoded ``uint8`` array at + ``/lsst/opaque_metadata/fits/primary`` and sets the + ``lsst.opaque_metadata_format`` attribute on the root group. + """ + import json as _json + + from ..fits._common import ExtensionKey + from ._model import ZarrArray + + primary = opaque.headers.get(ExtensionKey()) + if primary is None or len(primary) == 0: + return + # Encode as a flat dict of card key → value. A dict keeps one value + # per keyword and no comments, so repeated HISTORY / COMMENT cards + # and card comments are not preserved by this encoding. + cards = {card.keyword: card.value for card in primary.cards if card.keyword} + json_bytes = _json.dumps(cards).encode("utf-8") + parent = document.root.ensure_group("/lsst/opaque_metadata/fits") + parent.arrays["primary"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) + document.root.attributes.lsst["opaque_metadata_format"] = "fits" +``` + +In `python/lsst/images/zarr/_output_archive.py`, extend `write` to call this: + +```python +def write(...): + ... + archive = ZarrOutputArchive(...) 
+ if archive_default_name is not None: + tree = archive.serialize_direct(archive_default_name, obj.serialize) + else: + tree = obj.serialize(archive) + if metadata is not None: + tree.metadata.update(metadata) + if butler_info is not None: + tree.butler_info = butler_info + archive.add_tree(tree, archive_class=type(obj).__name__) + # Persist opaque metadata if the object carries any. + opaque = getattr(obj, "_opaque_metadata", None) + if opaque is not None: + from ._layout import serialize_fits_opaque_metadata + + try: + serialize_fits_opaque_metadata(archive.document, opaque) + except ImportError: + pass # opaque type isn't a FITS one; ignore + with open_store_for_write(path) as store: + archive.document.to_zarr(store) + return tree +``` + +- [ ] **Step 4: Run the tests** + +Run: `pytest tests/test_zarr_output_archive.py::ZarrOpaqueMetadataWriteTestCase -v` +Expected: PASS — opaque metadata is staged at the spec path. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_layout.py python/lsst/images/zarr/_output_archive.py tests/test_zarr_output_archive.py +git commit -m "feat: persist FitsOpaqueMetadata at /lsst/opaque_metadata/fits/primary on zarr write" +``` + +### Task 5.2: Restore `FitsOpaqueMetadata` on read from zarr + +**Files:** +- Modify: `python/lsst/images/zarr/_input_archive.py` (read opaque metadata in `__init__`; expose via `get_opaque_metadata`) +- Modify: `python/lsst/images/zarr/_layout.py` (add `deserialize_fits_opaque_metadata`) +- Modify: `tests/test_zarr_input_archive.py` + +`get_opaque_metadata()` returns a `FitsOpaqueMetadata` reconstructed from `/lsst/opaque_metadata/fits/primary`. The `read()` helper attaches it to the deserialized object as `obj._opaque_metadata` (matching the FITS / NDF read patterns). 
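
How faithfully the header can be reconstructed depends on what the JSON payload at `/lsst/opaque_metadata/fits/primary` records. As a hedged illustration using plain tuples rather than astropy objects (the `(keyword, value, comment)` triples and both encoder names are assumptions made for this sketch), a keyword-to-value mapping collapses repeated keywords such as `HISTORY` and drops comments, whereas a list of card records keeps both.

```python
import json

# Toy header: (keyword, value, comment) triples, with a repeated keyword.
cards = [
    ("ORIGIN", "RUBIN", "data origin"),
    ("EXPTIME", 30.0, "exposure time in seconds"),
    ("HISTORY", "bias subtracted", ""),
    ("HISTORY", "flat fielded", ""),
]


def encode_as_mapping(cards: list[tuple[str, object, str]]) -> str:
    """Keyword -> value object; later duplicates overwrite earlier ones."""
    return json.dumps({keyword: value for keyword, value, _ in cards})


def encode_as_records(cards: list[tuple[str, object, str]]) -> str:
    """Ordered list of [keyword, value, comment] records."""
    return json.dumps([list(card) for card in cards])


mapping_round_trip = json.loads(encode_as_mapping(cards))
records_round_trip = [tuple(card) for card in json.loads(encode_as_records(cards))]
```

Only the record form returns the original four cards in order; the mapping form comes back with three keys and the second `HISTORY` entry only.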
+ +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_input_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrOpaqueMetadataReadTestCase(unittest.TestCase): + def test_fits_opaque_metadata_round_trips(self) -> None: + import astropy.io.fits + + from lsst.images import Box, Image + from lsst.images.fits._common import ExtensionKey, FitsOpaqueMetadata + from lsst.images.zarr import read, write + + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25] + ) + header = astropy.io.fits.Header() + header["ORIGIN"] = "RUBIN" + header["EXPTIME"] = 30.0 + opaque = FitsOpaqueMetadata() + opaque.headers[ExtensionKey()] = header + image._opaque_metadata = opaque + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(image, target) + recovered = read(Image, target).deserialized + recovered_opaque = recovered._opaque_metadata + self.assertIsInstance(recovered_opaque, FitsOpaqueMetadata) + recovered_header = recovered_opaque.headers[ExtensionKey()] + self.assertEqual(recovered_header["ORIGIN"], "RUBIN") + self.assertEqual(recovered_header["EXPTIME"], 30.0) +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `pytest tests/test_zarr_input_archive.py::ZarrOpaqueMetadataReadTestCase -v` +Expected: FAIL — `recovered._opaque_metadata` is `None` or unset. + +- [ ] **Step 3: Implement deserialization** + +Append to `python/lsst/images/zarr/_layout.py`: + +```python +def deserialize_fits_opaque_metadata(document: "ZarrDocument") -> Any | None: + """Return a `FitsOpaqueMetadata` reconstructed from the IR, or None. + + Returns ``None`` when the archive doesn't have a FITS opaque + metadata block (the common case for archives that originated as + native zarr). 
+ """ + import json as _json + + from ..fits._common import ExtensionKey, FitsOpaqueMetadata + from ._model import ZarrArray + + if document.root.attributes.lsst.get("opaque_metadata_format") != "fits": + return None + try: + node = document.root.get("/lsst/opaque_metadata/fits/primary") + except KeyError: + return None + if not isinstance(node, ZarrArray): + return None + json_bytes = bytes(node.read()).decode("utf-8") + cards = _json.loads(json_bytes) + import astropy.io.fits + + header = astropy.io.fits.Header() + for key, value in cards.items(): + header[key] = value + opaque = FitsOpaqueMetadata() + opaque.headers[ExtensionKey()] = header + return opaque +``` + +In `python/lsst/images/zarr/_input_archive.py`, store opaque metadata at construction time and expose it; attach it in `read`: + +```python + def __init__(self, document: ZarrDocument) -> None: + self._document = document + self._validate_root_attributes() + self._deserialized_pointer_cache = {} + self._frame_set_cache = {} + from ._layout import deserialize_fits_opaque_metadata + + self._opaque_metadata = deserialize_fits_opaque_metadata(document) + + def get_opaque_metadata(self) -> Any | None: + return self._opaque_metadata +``` + +…and in `read`: + +```python +def read[T: Any](cls, path, **kwargs): + with ZarrInputArchive.open(path) as archive: + tree_type = cls._get_archive_tree_type(ZarrPointerModel) + tree = archive.get_tree(tree_type) + obj = tree.deserialize(archive, **kwargs) + if (opaque := archive.get_opaque_metadata()) is not None: + obj._opaque_metadata = opaque + return ReadResult(obj, tree.metadata, tree.butler_info) +``` + +- [ ] **Step 4: Run the tests** + +Run: `pytest tests/test_zarr_input_archive.py::ZarrOpaqueMetadataReadTestCase -v` +Expected: PASS — recovered header has ORIGIN and EXPTIME. 
+ +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_input_archive.py python/lsst/images/zarr/_layout.py tests/test_zarr_input_archive.py +git commit -m "feat: restore FitsOpaqueMetadata on zarr read" +``` + +### Task 5.3: FITS → Zarr → FITS round-trip + +**Files:** +- Create: `tests/test_zarr_cross_format.py` + +This test exercises the full cross-format pipeline: read a FITS file, write it to zarr, read the zarr back, write it to FITS, and verify the final FITS file's primary header matches the original FITS header card-for-card. + +- [ ] **Step 1: Write the test** + +Create `tests/test_zarr_cross_format.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +import os +import tempfile +import unittest + +import numpy as np + +try: + import zarr # noqa: F401 + + from lsst.images.zarr import read as zarr_read + from lsst.images.zarr import write as zarr_write + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class FitsZarrCrossFormatTestCase(unittest.TestCase): + def test_fits_to_zarr_to_fits_preserves_primary_header(self) -> None: + import astropy.io.fits + + from lsst.images import Box, Image + from lsst.images.fits import read as fits_read + from lsst.images.fits import write as fits_write + + original = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25] + ) + with tempfile.TemporaryDirectory() as tmp: + fits_a = os.path.join(tmp, "a.fits") + zarr_path = os.path.join(tmp, "b.zarr") + fits_b = os.path.join(tmp, "c.fits") + + # Stamp a recognisable card on the FITS write. + def update_header(header): # noqa: ANN001 + header["ORIGIN"] = "RUBIN" + header["EXPTIME"] = 30.0 + + fits_write(original, fits_a, update_header=update_header) + from_fits = fits_read(Image, fits_a).deserialized + zarr_write(from_fits, zarr_path) + from_zarr = zarr_read(Image, zarr_path).deserialized + fits_write(from_zarr, fits_b) + + with astropy.io.fits.open(fits_b) as hdul: + self.assertEqual(hdul[0].header["ORIGIN"], "RUBIN") + self.assertEqual(hdul[0].header["EXPTIME"], 30.0) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run the test** + +Run: `pytest tests/test_zarr_cross_format.py -v` +Expected: PASS — both cards survive the FITS→Zarr→FITS pipeline. 
+ +- [ ] **Step 3: Commit** + +```bash +git add tests/test_zarr_cross_format.py +git commit -m "test: FITS↔Zarr opaque-metadata round-trip" +``` + +### Task 5.4: Optional `ome-zarr-py` external-reader sanity test + +**Files:** +- Create: `tests/test_zarr_external_reader.py` + +This test confirms the bytes we emit are readable by `ome-zarr-py` (the upstream OME-Zarr toolkit). It checks only the science array — `ome-zarr-py` doesn't know about `lsst:` extensions. Skipped when the package isn't installed. + +- [ ] **Step 1: Write the test** + +Create `tests/test_zarr_external_reader.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import os +import tempfile +import unittest + +import numpy as np + +try: + import zarr # noqa: F401 + + from lsst.images.zarr import write + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + +try: + import ome_zarr # noqa: F401 + import ome_zarr.io # noqa: F401 + import ome_zarr.reader # noqa: F401 + + HAVE_OME_ZARR = True +except ImportError: + HAVE_OME_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR and HAVE_OME_ZARR, "ome-zarr is not installed") +class OmeZarrReaderTestCase(unittest.TestCase): + def test_ome_zarr_can_open_image(self) -> None: + from lsst.images import Box, Image + + original = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25] + ) + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(original, target) + # ome-zarr opens the store and exposes the multiscales node. 
+ from ome_zarr.io import parse_url + from ome_zarr.reader import Reader + + location = parse_url(target) + self.assertIsNotNone(location) + reader = Reader(location) + nodes = list(reader()) + self.assertGreaterEqual(len(nodes), 1) + # The first node should expose the (y, x) science array. + data = nodes[0].data[0] # level 0 + self.assertEqual(tuple(data.shape), (4, 5)) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run the test** + +Run: `pytest tests/test_zarr_external_reader.py -v` +Expected: PASS if `ome-zarr` is installed; SKIP otherwise. + +- [ ] **Step 3: Commit** + +```bash +git add tests/test_zarr_external_reader.py +git commit -m "test: ome-zarr-py can open archives written by lsst.images.zarr" +``` + +### Task 5.5: Optional `ngff-validator` compliance test + +**Files:** +- Create: `tests/test_zarr_ome_compliance.py` + +`ngff-validator` (or its successor) checks an archive against the OME-NGFF schema. We invoke it via subprocess if available; the test is skipped silently otherwise. Validation runs against representative outputs of every supported archive class. + +- [ ] **Step 1: Write the test** + +Create `tests/test_zarr_ome_compliance.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +import os +import shutil +import subprocess +import tempfile +import unittest + +import numpy as np + +try: + import zarr # noqa: F401 + + from lsst.images.zarr import write + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + +NGFF_VALIDATOR = shutil.which("ngff-validator") + + +@unittest.skipUnless(HAVE_ZARR and NGFF_VALIDATOR, "ngff-validator is not on PATH") +class NgffComplianceTestCase(unittest.TestCase): + def _validate(self, target: str) -> None: + result = subprocess.run( + [NGFF_VALIDATOR, target], + capture_output=True, + text=True, + check=False, + ) + self.assertEqual( + result.returncode, + 0, + f"ngff-validator failed for {target}:\nSTDOUT:\n{result.stdout}\nSTDERR:\n{result.stderr}", + ) + + def test_image_validates(self) -> None: + from lsst.images import Box, Image + + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25] + ) + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(image, target) + self._validate(target) + + def test_color_image_validates(self) -> None: + from lsst.images import Box, ColorImage, Image + + red = Image(np.full((4, 5), 1, dtype=np.uint8), bbox=Box.factory[10:14, 20:25]) + green = Image(np.full((4, 5), 2, dtype=np.uint8), bbox=red.bbox) + blue = Image(np.full((4, 5), 3, dtype=np.uint8), bbox=red.bbox) + color = ColorImage(red=red, green=green, blue=blue) + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "color.zarr") + write(color, target) + self._validate(target) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run the test** + +Run: `pytest tests/test_zarr_ome_compliance.py -v` +Expected: PASS if `ngff-validator` is on PATH; SKIP otherwise. If a real-world install is available and the test fails, fix the layout (e.g. correct an axis-type misclassification) before merging. 
+ +- [ ] **Step 3: Commit** + +```bash +git add tests/test_zarr_ome_compliance.py +git commit -m "test: ngff-validator compliance check (skipped when validator absent)" +``` + +--- + +**End of Phase 5.** Five tasks. Cross-format round-trips preserve FITS primary-header cards through Zarr; optional external-reader tests confirm spec compliance and tooling interoperability when their dependencies are installed. Phase 6 wraps up with module documentation and a changelog entry. + +## Phase 6 — Documentation, changelog, and final integration + +Phase 6 wraps up the backend with the documentation and metadata that make it discoverable to package users. The reference docs live under `doc/lsst.images/` and follow the same `automodapi`-driven pattern as the other backends; the changelog uses `towncrier` fragments under `doc/changes/`. + +### Task 6.1: Expand the module docstring + +**Files:** +- Modify: `python/lsst/images/zarr/__init__.py` + +The Phase 1 `__init__.py` carries a placeholder docstring. Replace it with a fully-documented version that covers: + +- which image types are supported +- the on-disk layout summary (top-level `/0`, `/lsst/mask/0`, `/lsst/variance/0`, `/lsst/tree`, the OME `multiscales` block) +- chunking and sharding defaults +- store dispatch (LocalStore vs ZipStore vs FsspecStore by URI) +- the lazy-subset guarantee for remote reads +- known follow-ups (multiscale pyramid, NGFF nonlinear transformations, dask/lazy read API, consolidated metadata) + +- [ ] **Step 1: Replace the docstring** + +Edit `python/lsst/images/zarr/__init__.py` and replace the existing docstring (everything between the first `"""` and the matching `"""`): + +```python +"""OME-Zarr v0.5 archive backend for `lsst.images`. + +This module reads and writes OME-Zarr v0.5 NGFF files augmented with +namespaced ``lsst:`` extensions. 
Every image type that already serializes +to FITS / JSON / NDF is supported: `~lsst.images.Image`, +`~lsst.images.Mask`, `~lsst.images.MaskedImage`, +`~lsst.images.VisitImage`, `~lsst.images.ColorImage`, and +`lsst.images.cells.CellCoadd`, plus any object reachable through the +`~lsst.images.serialization.OutputArchive` interface. + +On-disk layout +-------------- + +A zarr archive written by `~lsst.images.zarr.write` always has: + +- A top-level data array at ``/0`` whose OME axes (``[y, x]`` for plain + images, ``[c, y, x]`` for `ColorImage`) appear in the + ``ome.multiscales`` block of the root group's attributes. +- A ``lsst.archive_class`` attribute on the root group naming the + in-memory class so the matching deserializer can be dispatched. +- A 1-D ``uint8`` array at ``/lsst/tree`` containing the JSON + serialization of the object's archive tree. +- Heterogeneous companion sub-images at ``/lsst/mask/0``, + ``/lsst/variance/0``, etc., each of which is itself a valid OME-Zarr + group with its own multiscale metadata. +- For `ColorImage`, the three channels are stacked into a single + ``(3, Y, X)`` array at ``/0`` and an ``ome.omero.channels`` block + describes them. Per-channel references in the JSON tree use the + ``zarr:/0?c=N`` query suffix. +- For `CellCoadd`, per-cell PSFs are stacked into a single 4-D + ``(Cy, Cx, Py, Px)`` array at ``/lsst/psf/per_cell/0``; per-cell + references use the ``zarr:/lsst/psf/per_cell/0?cell=Cy,Cx`` suffix. + +External tools that consume OME-Zarr (`napari`, `neuroglancer`, +`ome-zarr-py`, `ngff-validator`) can render the science array without +LSST-specific awareness. + +Cloud-friendly defaults +----------------------- + +- Default chunk geometry is tile-aligned: ``min(1024, dim)`` per axis + for plain images, ``cell_shape`` for `CellCoadd`. +- Sharding (zarr v3 native) is enabled by default with a tunable shard + size (4×4 chunks by default) so object counts on S3 / GCS stay small + even for multi-gigabyte images. 
+- Subset reads via the ``slices=`` argument to + `~lsst.images.serialization.InputArchive.get_array` exploit zarr's + chunk index: only the chunks intersecting the requested slice are + fetched, even from remote stores. +- Both ``LocalStore`` and ``ZipStore`` are supported. The store is + selected from the URI shape: ``*.zarr.zip`` → ZipStore, otherwise + a local directory. Remote URIs (``s3://``, ``gs://``, ``http(s)://``) go + through `lsst.resources.ResourcePath` and `fsspec`. + +Round-trip with FITS +-------------------- + +When an object that originated from a FITS read (``_opaque_metadata`` +is a `~lsst.images.fits.FitsOpaqueMetadata`) is written to zarr, the +primary-HDU header is preserved at ``/lsst/opaque_metadata/fits/primary``. +Reading the zarr back attaches an equivalent ``FitsOpaqueMetadata`` to +the deserialized object so a subsequent write to FITS reproduces the +original header cards. + +Optional install +---------------- + +This backend requires `zarr >= 3.0`. Install via the ``[zarr]`` extra:: + + pip install lsst-images[zarr] + +The top-level ``import lsst.images.zarr`` raises a clear `ImportError` +with this guidance if `zarr` is not installed. + +Follow-ups +---------- + +These items are tracked separately from the initial backend release: + +- Lazy / dask-friendly read API (``read_lazy()``). +- Multiscale pyramid generation (level 1, 2, …) for visualization tools. +- NGFF nonlinear coordinate transformations once tooling adoption catches + up (currently the affine projection of the AST WCS is exposed via OME + ``coordinateTransformations``; the AST string remains the + authoritative round-trip source). +- ``zarr.consolidated_metadata`` extension to reduce object-list calls + on cloud stores. +""" +``` + +- [ ] **Step 2: Verify the docstring is well-formed** + +Run: `python -c "import lsst.images.zarr; help(lsst.images.zarr)" | head -40` +Expected: docstring renders cleanly. `help()` does not check reST markup, so Sphinx warnings are caught by the docs build in Task 6.2. 
+ +- [ ] **Step 3: Commit** + +```bash +git add python/lsst/images/zarr/__init__.py +git commit -m "docs: expand lsst.images.zarr module docstring" +``` + +### Task 6.2: Add the reference docs page + +**Files:** +- Create: `doc/lsst.images/zarr.rst` +- Modify: `doc/lsst.images/index.rst` (add `zarr.rst` to the toctree) + +Mirrors `doc/lsst.images/ndf.rst` exactly so Sphinx renders the API in the same shape as the other backends. + +- [ ] **Step 1: Create the reference page** + +Create `doc/lsst.images/zarr.rst`: + +```rst +OME-Zarr I/O +============ + +This is an OME-Zarr v0.5 NGFF serialization backend with namespaced +``lsst:`` extensions. Files written by this archive are valid OME-Zarr +images at every level where image-shaped data lives, and can be read by +external OME-Zarr tooling (napari, neuroglancer, ome-zarr-py) for the +science array. The LSST extensions add round-trip support for mask +plane semantics, AST WCS, the LSST archive tree, table layout, and +cell-grid hints. + +Default chunking is tile-aligned (~1024×1024 for plain images, +``cell_shape`` for ``CellCoadd``), sharding is enabled by default, and +subset reads via ``slices=`` only fetch the chunks they need — including +on remote stores accessed through ``lsst.resources.ResourcePath`` and +``fsspec``. + +.. automodapi:: lsst.images.zarr + :no-inheritance-diagram: + :include-all-objects: + :inherited-members: +``` + +- [ ] **Step 2: Add the page to the toctree** + +Edit `doc/lsst.images/index.rst` and insert `zarr.rst` immediately after `ndf.rst` (preserve the existing alphabetical ordering near that block; `zarr` lands at the end since it sorts last): + +```rst + fits.rst + json.rst + ndf.rst + zarr.rst +``` + +- [ ] **Step 3: Verify the docs build** + +Run: `cd doc && sphinx-build -W -b html . _build/html` (only if a Sphinx environment is set up locally; otherwise skip and rely on CI). +Expected: clean build, no warnings about undefined references. 
+ +- [ ] **Step 4: Commit** + +```bash +git add doc/lsst.images/zarr.rst doc/lsst.images/index.rst +git commit -m "docs: add OME-Zarr backend reference page" +``` + +### Task 6.3: Add the towncrier changelog fragment + +**Files:** +- Create: `doc/changes/DM-XXXXX.feature.md` (`XXXXX` is a placeholder for the assigned ticket number) + +The package uses towncrier — each user-visible change lands as a single Markdown fragment under `doc/changes/`. Pick the **bugfix / feature / api / removal / perf / misc** category that fits; for this work it's **feature**. + +- [ ] **Step 1: Create the fragment** + +Create `doc/changes/DM-XXXXX.feature.md` (replace `XXXXX` with the assigned Jira ticket number — confirm with the engineer before merging): + +```markdown +Added a new `lsst.images.zarr` archive backend that reads and writes OME-Zarr v0.5 NGFF files with namespaced `lsst:` extensions. It supports every image type the FITS / JSON / NDF backends support (`Image`, `Mask`, `MaskedImage`, `VisitImage`, `ColorImage`, `CellCoadd`). Defaults are cloud-friendly (tile-aligned chunks, zarr v3 sharding, fsspec-backed remote stores), and subset reads fetch only the chunks they need. Install via the new `[zarr]` extra (`pip install lsst-images[zarr]`). +``` + +- [ ] **Step 2: Commit** + +```bash +git add doc/changes/DM-XXXXX.feature.md +git commit -m "docs: changelog entry for lsst.images.zarr backend" +``` + +### Task 6.4: Run the full test suite and finalize + +**Files:** +- (No code changes — verification step.) + +- [ ] **Step 1: Run the full zarr test set** + +Run: `pytest tests/test_zarr_*.py -v` +Expected: all tests pass; external-reader and validator tests pass or skip cleanly depending on what's installed. + +- [ ] **Step 2: Run the full package test suite to catch regressions** + +Run: `pytest tests/ -v` +Expected: all existing tests still pass; the new `RoundtripZarr` helper does not break unrelated test files. 
+ +- [ ] **Step 3: Type-check the new module** + +Run: `mypy python/lsst/images/zarr` +Expected: no errors. Address any warnings before merging. + +- [ ] **Step 4: Lint and format** + +Run: `ruff check python/lsst/images/zarr tests/test_zarr_*.py && ruff format --check python/lsst/images/zarr tests/test_zarr_*.py` +Expected: no findings. + +- [ ] **Step 5: Final commit (if any cleanups were needed)** + +```bash +git status # should be clean +``` + +If lint/mypy required fixes, commit them with a focused message such as `chore: type-check and lint cleanup for lsst.images.zarr`. + +--- + +**End of Phase 6.** Documentation, changelog, and final verification complete. The backend is ready for review and merge. + +--- + +## Self-Review Notes + +**Spec coverage** — every section of `docs/superpowers/specs/2026-05-22-zarr-io-design.md` maps to at least one task: + +| Spec section | Task(s) | +|---|---| +| §1 Goals / scope | All phases collectively | +| §2 Module layout | 1.1 (skeleton), 2.1 (`_store`), 2.2 (`_layout`), 1.3-1.5 (`_model`), 2.3-2.5 (`_output_archive`), 3.1-3.5 (`_input_archive`) | +| §3 Top-level group attributes | 2.5 (`add_tree`), 4.2 (ColorImage `omero`), 4.5 (CellCoadd cell-grid hooks via fixup), 5.1 (`opaque_metadata_format`) | +| §3 Axis choice per archive class | 2.2 (`axes_for_archive_class`), 4.2 (ColorImage), 4.4-4.5 (CellCoadd) | +| §3 Mask / table / frame-set layout | 2.4 (output side), 3.4 (input side) | +| §3 Recursive composition | 2.3 (`serialize_pointer` writes nested OME groups), 3.3 (read-side cache) | +| §3 Chunking / sharding defaults | 2.2 (`chunks_for`), 4.4 (CellCoadd cell alignment) | +| §4 Round-trip rules / opaque metadata | 5.1, 5.2, 5.3 | +| §4 Error taxonomy | 3.1 (`_validate_root_attributes`), 3.2 (`get_array` source-format errors) | +| §4 Mode and atomicity | 2.1 (create-only enforcement) | +| §4 Chunk-aligned subset reads | 3.2 (`_CountingStore` regression test) | +| §4 Forward compatibility | 3.1 (version refusal), 1.3 
(unknown-key preservation) | +| §5 Test strategy | One test file per module, plus round-trip and external-reader sets | +| §5 Rollout plan (6 numbered steps) | Phases 1-6 directly mirror the spec's rollout | +| §6 Follow-ups | Documented in 6.1's docstring | + +**Type / name consistency** — IR types and key methods use consistent names across tasks: + +- `ZarrDocument`, `ZarrGroup`, `ZarrArray`, `ZarrAttributes` (introduced 1.3-1.4, used everywhere after). +- `ZarrCompressionOptions` field names (`codec`, `cname`, `clevel`, `shuffle`) match between Tasks 1.2 and 1.4. +- `ZarrPointerModel.path` (Task 1.2) is read by `deserialize_pointer` (3.3) and written by `serialize_pointer` (2.3); both use absolute zarr paths starting with `/`. +- The sliced-source convention (`zarr:/?c=N`, `?cell=Cy,Cx`) is defined in Task 4.1 and consumed by ColorImage (4.2) and CellCoadd PSF (4.5) layout fixups. + +**Open implementer judgement calls** — the plan flags two places where the implementer needs to consult local code rather than follow a literal recipe: + +- Task 4.4 / 4.6: minimal `CellCoadd` constructor — the test helpers are placeholders that should be replaced by real construction once the implementer reads `python/lsst/images/cells/_coadd.py`. +- Task 6.3: the towncrier fragment filename uses `DM-XXXXX` — pick the real ticket number when committing. + +These are intentional handoffs, not placeholder content in the production code itself. + + + + + From 83b9064cf6243edc59dd1da689c5d62af34115cf Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Fri, 22 May 2026 17:31:26 -0700 Subject: [PATCH 03/60] Revise zarr I/O design: xarray/CF root layout, OME-NGFF as overlay Pivots the on-disk layout from the v1 OME-multiscale-image-with-lsst- companions structure to an xarray/CF-shaped root with image, variance, mask as siblings; OME multiscales metadata points at the same image array (no byte duplication). 
Mask becomes a 2-D packed-integer array with CF flag_masks/flag_meanings/flag_descriptions for native geospatial-tool interop. ColorImage stops stacking and writes its channels as recursive sub-archives. WCS is affine-only OME plus authoritative AST string in v1; RFC-5 nonlinear transformations are a follow-up blocked on writing an AST JSON channel. Adds an 11x11-grid residual validator that drops the simplified affine when the max pixel-equivalent error exceeds 1 pixel. Generated with AI Co-Authored-By: SLAC AI --- .../specs/2026-05-22-zarr-io-design.md | 709 ++++++++++++------ 1 file changed, 462 insertions(+), 247 deletions(-) diff --git a/docs/superpowers/specs/2026-05-22-zarr-io-design.md b/docs/superpowers/specs/2026-05-22-zarr-io-design.md index 47c110a7..4aa66f65 100644 --- a/docs/superpowers/specs/2026-05-22-zarr-io-design.md +++ b/docs/superpowers/specs/2026-05-22-zarr-io-design.md @@ -1,7 +1,7 @@ -# Zarr I/O Backend for `lsst.images` — Design +# Zarr I/O Backend for `lsst.images` — Design (revised) -**Status:** Approved (design phase). Ready for implementation planning. -**Date:** 2026-05-22 +**Status:** Approved (design phase). Supersedes the v1 design at commit `a11db46` after collaborator review. +**Date:** 2026-05-22 (revised) **Author:** Tim Jenness (with Claude collaborator) ## 1. Goals, Scope, Non-Goals @@ -12,63 +12,88 @@ Add a `lsst.images.zarr` subpackage providing: - A `ZarrOutputArchive` and `ZarrInputArchive` implementing the existing `lsst.images.serialization` `OutputArchive` / `InputArchive` ABCs. -- Top-level `read()` and `write()` helpers consistent with the FITS, JSON, - and NDF backends. -- A Python intermediate representation (IR) — `ZarrDocument`, `ZarrGroup`, - `ZarrArray`, etc. — that describes the on-disk layout independently of - `zarr-python`, mirroring the role `NdfDocument` plays for the NDF backend. 
- -Because the backend builds on the abstract archive interface, every image -type that already serializes to FITS/JSON/NDF (`Image`, `Mask`, +- Top-level `read()` and `write()` helpers consistent with the FITS, + JSON, and NDF backends. +- A Python intermediate representation (IR) — `ZarrDocument`, + `ZarrGroup`, `ZarrArray`, etc. — that describes the on-disk layout + independently of `zarr-python`, mirroring the role `NdfDocument` + plays for the NDF backend. + +Because the backend builds on the abstract archive interface, every +image type that already serializes to FITS/JSON/NDF (`Image`, `Mask`, `MaskedImage`, `VisitImage`, `ColorImage`, `CellCoadd`, plus any -`serialize()`-implementing object reachable through the archive) works with -no per-type code in the backend itself. Per-type adjustments are limited to -layout decisions (e.g. ColorImage's channel axis) made in `_layout.py` -against the populated IR. - -### Standards alignment - -Files written are valid OME-Zarr v0.5 NGFF images at every level where -image-shaped data lives, augmented with namespaced `lsst:` extensions where -no relevant standard exists (mask plane semantics, AST WCS round-trip, the -LSST archive tree, table layout, cell-grid hints). - -External tools that consume OME-Zarr (`napari`, `neuroglancer`, -`ngff-validator`, `ome-zarr-py`, etc.) can render the science arrays -without LSST-specific awareness. Recursive composition: any sub-archive -holding image-shaped data — including PSF model parameter images — is -itself a valid OME-Zarr group at its zarr path, with its own -`ome.multiscales` and `lsst.archive_class` attributes. +`serialize()`-implementing object reachable through the archive) works +with no per-type code in the backend itself. + +### Standards alignment (changed from v1) + +The on-disk layout is **xarray-/CF-shaped at the root** with +**OME-NGFF v0.5 metadata as a discoverability layer on top**. 
The root +group is a sibling collection of arrays (`image/`, `variance/`, +`mask/`) so: + +- `xr.open_zarr(path)` returns a `Dataset` with the masked-image + components as data variables sharing the `(y, x)` dimensions. +- Geospatial / CF tooling (rasterio, GDAL's Zarr driver, QGIS) reads + the `mask` array's `flag_masks` / `flag_meanings` / + `flag_descriptions` attributes directly. +- OME-NGFF tooling (`napari`, `neuroglancer`, `ngff-validator`, + `ome-zarr-py`) sees an OME multiscales block whose + `datasets[].path` points at the same `image` array — the OME view and + the xarray view share bytes. + +The pivot vs the v1 design: the root is no longer a multiscale image +with `lsst:` companions hanging off it; companion arrays are +first-class siblings, and OME's `multiscales.datasets[].path` references +them. This enables xarray / GDAL interop with no byte duplication. ### Cloud-first, local works too - Default chunk geometry is tile-aligned (~1024×1024 for plain images, `cell_shape` for `CellCoadd`). -- Sharding (zarr v3 native) is enabled by default with a tunable shard size - to keep object counts manageable on S3/GCS. -- Subset reads via `slices=` exploit zarr's chunk index. -- Both `DirectoryStore` and `ZipStore` are supported; the choice is driven - by URI shape (`*.zarr.zip` → ZipStore, otherwise directory). Remote URIs - go through `lsst.resources.ResourcePath` and `fsspec`. +- Sharding (zarr v3 native) is enabled by default with a tunable shard + size to keep object counts manageable on S3/GCS. +- Subset reads via `slices=` exploit zarr's chunk index, going + straight to the lazy `zarr.Array` handle so only the touched chunks + are fetched. +- Both `LocalStore` and `ZipStore` are supported; the choice is + driven by URI shape (`*.zarr.zip` → `ZipStore`, otherwise a local directory). + Remote URIs go through `lsst.resources.ResourcePath` and `fsspec`. 
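The defaults in the bullets above can be sketched in a few lines of plain Python. Everything here is illustrative: `chunk_shape`, `shard_shape`, `touched_chunks`, and `store_kind` are hypothetical helper names, the `min(1024, dim)` tile rule and the 4×4 chunks-per-shard default are taken from the module docstring in the implementation plan, and checking for a URI scheme before the `.zarr.zip` suffix is an assumption the spec does not spell out.

```python
def chunk_shape(shape, tile=1024):
    """Tile-aligned chunk shape: at most `tile` pixels along each axis."""
    return tuple(min(tile, dim) for dim in shape)


def shard_shape(chunks, chunks_per_shard=4):
    """Shard shape covering `chunks_per_shard` chunks along each axis."""
    return tuple(c * chunks_per_shard for c in chunks)


def touched_chunks(slices, chunks):
    """Chunk-grid index ranges that intersect half-open slices.

    Each slice must have explicit non-negative start and stop values.
    """
    return tuple(
        range(s.start // c, (s.stop + c - 1) // c)
        for s, c in zip(slices, chunks)
    )


def store_kind(uri):
    """Classify a URI by shape (assumed order: scheme first, then suffix)."""
    if "://" in uri:
        return "remote"
    if uri.endswith(".zarr.zip"):
        return "zip"
    return "directory"


chunks = chunk_shape((4000, 4072))
shards = shard_shape(chunks)
# A 100-row by 512-column cutout that straddles one chunk boundary in y.
rows, cols = touched_chunks((slice(1000, 1100), slice(0, 512)), chunks)
n_fetched = len(rows) * len(cols)
```

For this 4000×4072 image the cutout intersects two of the sixteen chunks in the grid, which is the saving the `slices=` path is meant to deliver on remote stores.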
### Scope -Same image-type coverage as the FITS backend: `Image`, `Mask` (2D and 3D), -`MaskedImage`, `VisitImage`, `ColorImage`, `CellCoadd`, plus any -`serialize()`-implementing object reachable through the archive interface. +Same image-type coverage as the FITS backend: `Image`, `Mask` (2-D in +v1), `MaskedImage`, `VisitImage`, `ColorImage`, `CellCoadd`, plus any +`serialize()`-implementing object reachable through the archive +interface. -ColorImage uses the OME `c` axis. CellCoadd's per-cell PSF is stored as a -single 4D OME-Zarr image with cell-aligned chunks. +`ColorImage` writes its three channels as **sibling sub-archives** +(`red/`, `green/`, `blue/`), not as a stacked `(3, Y, X)` array — see +[§3](#3-on-disk-layout). The previous design's stacking + JSON-pointer +rewrite is removed because it duplicates bytes for large images. + +`CellCoadd`'s per-cell PSF is whatever shape `CellCoadd.serialize` +natively emits — typically a 4-D `(Cy, Cx, Py, Px)` array — with +cell-aligned chunks. No fixup pass. ### Non-Goals (initial release) -- No dask / lazy `read_lazy()` API — added later, tracked as follow-up. -- No multi-level OME multiscale pyramid (we only ever write one level at - `0/`). -- No NGFF nonlinear coordinate transformations (currently underspecified - and lacking widespread tooling support). Tracked as follow-up — see - Section 6. -- No automatic OME consolidated-metadata extension. Tracked as follow-up. +- No dask / lazy `read_lazy()` API — added later, tracked as + follow-up. +- No multi-level OME multiscale pyramid (we only ever write one level + pointed at by `path: image`). +- **No NGFF RFC-5 nonlinear coordinate transformations as + authoritative.** v1 emits an OME-NGFF v0.5 affine + `coordinateTransformations` as an external-tool affordance, with + the AST `FrameSet` string as the authoritative round-trip source. 
+ RFC-5 transformations as authoritative is a follow-up — see + [§6](#6-follow-up-work-out-of-scope) — **blocked on writing an AST + JSON channel** that serializes a `FrameSet` to / from RFC-5 + transformation JSON. +- No 3-D mask layout for masks with more than 64 planes — v1 raises + on write. 3-D fallback tracked as follow-up. +- No automatic OME `consolidated_metadata` extension. Tracked as + follow-up. ### Dependency @@ -84,37 +109,38 @@ python/lsst/images/zarr/ ├── __init__.py guarded `import zarr`; re-exports ├── _common.py ZarrPointerModel (analog of NdfPointerModel), │ attribute namespace constants ("lsst:", "ome:"), -│ ZarrCompressionOptions dataclass (codec, level), +│ ZarrCompressionOptions dataclass, │ path/JSON-pointer helpers ├── _model.py Python intermediate representation: │ ZarrDocument, ZarrGroup, ZarrArray, ZarrAttributes, -│ OmeMultiscale, OmeOmero, LsstTableGroup, LsstMaskGroup, -│ from_zarr() / to_zarr() materialization methods -├── _layout.py Layout rules: where things go relative to a root -│ group, JSON-pointer ↔ zarr-path translation, -│ OME axis selection per archive class -│ (ColorImage → c,y,x; default → y,x; …) +│ OmeMultiscale, OmeOmero, from_zarr() / to_zarr() +│ materialization methods +├── _layout.py Layout rules: archive-class → axes mapping; +│ CF flag-attrs construction for mask groups; +│ affine extraction + residual validator; +│ OME multiscale block construction ├── _output_archive.py ZarrOutputArchive and write() ├── _input_archive.py ZarrInputArchive and read() -└── _store.py Wrapper that turns a ResourcePath / fsspec URI into - the right zarr.storage.Store - (DirectoryStore / ZipStore / FsspecStore) +└── _store.py Wrapper that turns a ResourcePath / fsspec URI + into the right zarr.storage.Store + (LocalStore / ZipStore / FsspecStore) ``` ### Fit with existing abstractions -- `ZarrOutputArchive[ZarrPointerModel]` implements the abstract methods - (`serialize_direct`, `serialize_pointer`, `serialize_frame_set`, - 
`add_array`, `add_table`, `add_structured_array`, `iter_frame_sets`). +- `ZarrOutputArchive[ZarrPointerModel]` implements the abstract + methods (`serialize_direct`, `serialize_pointer`, + `serialize_frame_set`, `add_array`, `add_table`, + `add_structured_array`, `iter_frame_sets`). - `ZarrPointerModel` is a small Pydantic model holding a zarr path - (e.g. `"/lsst/mask"`); when a model field carries a `ZarrPointerModel`, - the consumer dereferences it through the input archive — same pattern - as `NdfPointerModel`. -- `update_header` callbacks (intended for FITS) are accepted and ignored, - identical to the JSON backend. -- The `serialization.ArchiveTree` JSON tree is stored verbatim as a UTF-8 - zarr array at `/lsst/tree`. Array references in the tree resolve to - zarr paths under the same root. + (e.g. `"/lsst/psf/tree"`); when a model field carries a + `ZarrPointerModel`, the consumer dereferences it through the input + archive — same pattern as `NdfPointerModel`. +- `update_header` callbacks (intended for FITS) are accepted and + ignored, identical to the JSON backend. +- The `serialization.ArchiveTree` JSON tree is stored verbatim as a + UTF-8 zarr array at `tree` (root-level). Array references in the + tree resolve to zarr paths under the same root. ### Two-pass write driven by the IR @@ -124,20 +150,43 @@ materialize to zarr-python via the configured store. Benefits: -- Per-class layout decisions are made once in `_layout.py` against the - populated IR rather than scattered across `add_array` calls. -- Tests can assert on the IR without writing files (mirrors the NDF - pattern). -- A future "validate-then-commit" step (e.g. ngff-validator integration) - can run against the IR. +- Per-class layout decisions (CF flag attrs on mask, OME multiscale + block, cell-grid metadata) are made once in `_layout.py` against + the populated IR. +- Tests can assert on the IR without writing files. +- A future "validate-then-commit" step (e.g. 
`ngff-validator` + integration) can run against the IR. + +Compared to the v1 design, the IR's *write* side has **no fixup +pass** that rewrites or stacks staged arrays. Each `add_array(name)` +call lands at the zarr path equal to `name` (after stripping the +leading `/`). `name="image"` → `/image`; `name="mask"` → `/mask`; +the nested `name="red/image"` produced by +`serialize_direct("red", red.serialize)` → `/red/image`. There is +no special-case dictionary mapping JSON pointers to zarr paths. + +### Lazy read invariant (unchanged from v1) + +`ZarrArray.data` holds either a staged `numpy.ndarray` (write side) +or a lazy `zarr.Array` handle (read side). `from_zarr` never reads +chunk bytes; only `ZarrArray.read(slices=...)` does, and it forwards +`slices` straight to the lazy handle so only chunks intersecting +the slice are fetched. A `_CountingStore`-based regression test +asserts a single-chunk subset of a 16×16 / chunks=(4,4) array +touches strictly fewer chunk reads than a full read. ### Read mirrors write -`ZarrInputArchive.open()` opens the store, builds a `ZarrDocument` view -backed by lazy zarr-python objects, validates the `lsst:archive_class` -attribute, locates `/lsst/tree`, and parses it into the appropriate -`ArchiveTree` Pydantic model. `get_array(model, slices=...)` translates -the model's path into a chunk-aligned zarr read. +`ZarrInputArchive.open()` opens the store, builds a `ZarrDocument` +view backed by lazy zarr-python objects, validates the +`lsst.archive_class` and `lsst.version` root attributes, locates the +`tree` JSON document, and parses it into the appropriate +`ArchiveTree` Pydantic model. `get_array(model, slices=...)` +translates the model's path into a chunk-aligned zarr read. + +`ArrayReferenceModel.source` strings are plain `zarr:/`. The +v1 design's `?c=N` and `?cell=Cy,Cx` query suffixes are removed — +no stacking means no compound source URLs. 
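The lazy-read invariant described above comes down to counting which chunks a slice intersects. The sketch below models that arithmetic in plain Python. It is not the `_CountingStore` test itself and does not call zarr; `touched_chunks` is a hypothetical helper, and it assumes unit-step slices on a regular chunk grid with no sharding.

```python
from itertools import product


def touched_chunks(shape, chunks, slices):
    """Return the set of chunk-grid indices that a slice selection intersects."""
    per_axis = []
    for dim, chunk, sl in zip(shape, chunks, slices):
        start, stop, step = sl.indices(dim)
        if step != 1:
            raise ValueError("this sketch only handles unit-step slices")
        if stop <= start:
            # Empty selection on this axis: no chunk is touched at all.
            per_axis.append(range(0))
        else:
            # First and last chunk index covered along this axis.
            per_axis.append(range(start // chunk, (stop - 1) // chunk + 1))
    return set(product(*per_axis))


# The 16x16 array with 4x4 chunks used by the regression test in the spec.
full = touched_chunks((16, 16), (4, 4), (slice(None), slice(None)))
one = touched_chunks((16, 16), (4, 4), (slice(0, 4), slice(0, 4)))
```

Under these assumptions a full read touches 16 chunks and the aligned 4×4 subset touches one, which is the "strictly fewer" relationship the regression test asserts.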
### Backend write helper signature @@ -154,159 +203,287 @@ def write( ) -> ArchiveTree: ... ``` -`chunks`, `shards`, and `compression` are per-array dicts keyed by the -JSON pointer of the attribute the array backs (or its zarr path), -mirroring the existing `compression_options` pattern from the FITS -backend. Different arrays have different ranks (2D image, 3D mask, 4D -per-cell PSF) so a single tuple value would not be meaningful. Missing -keys fall back to the per-class defaults from -[Section 3 — Chunking and sharding defaults](#chunking-and-sharding-defaults). -A value of `None` for a key means "use the default for this array"; -explicitly setting `shards` to `{}` (empty mapping) does *not* disable -sharding — to disable, pass `{"": None}` per array. (A future -follow-up may add a `shards=False` shorthand if it proves useful.) +`chunks`, `shards`, and `compression` are per-array dicts keyed by +the JSON pointer of the attribute the array backs (or its zarr +path), mirroring the existing `compression_options` pattern from +the FITS backend. Different arrays have different ranks (2-D image, +2-D mask, 4-D per-cell PSF) so a single tuple value would not be +meaningful. Missing keys fall back to the per-class defaults from +[§3](#chunking-and-sharding-defaults). A value of `None` for a key +means "use the default for this array"; explicitly setting `shards` +to `{}` does *not* disable sharding — to disable, pass +`{"": None}` per array. + +`image`, `variance`, and `mask` are expected to share the spatial +chunk shape (CF / xarray / GDAL all assume aligned chunks). The +output archive derives `variance` and `mask` chunks from `image`'s +chunk shape when the user has not overridden them. ## 3. On-Disk Layout (the spec) +### Root layout per archive class + +Every archive class lays out its data as **siblings under the root**. +Non-array metadata (the JSON round-trip tree, the AST WCS string) +also lives at the root so xarray and ome-zarr both see a clean +group. 
+ +For a `MaskedImage` / `VisitImage`: + +``` +visitimage.zarr/ +├── zarr.json ← group attrs (see below) +├── image/ ← (Y, X) zarr array, science pixels +├── variance/ ← (Y, X) zarr array +├── mask/ ← (Y, X) zarr array, packed mask integers +├── tree ← 1-D uint8 array, pydantic JSON round-trip +└── wcs_ast ← 1-D uint8 array, AST FrameSet text +``` + +For an `Image` with no projection, `wcs_ast` is omitted; for an +`Image` with no mask/variance, those siblings are simply absent. + +For `ColorImage`: + +``` +colorimage.zarr/ +├── zarr.json ← lsst.archive_class = "ColorImage"; no OME multiscales +├── red/ ← itself a valid Image-shaped sub-archive +├── green/ ← (with its own image/, multiscales, etc.) +├── blue/ +├── tree +└── wcs_ast +``` + +Each channel sub-archive is a valid `Image` archive in its own right +(its own `image/` array, its own `lsst.archive_class = "Image"`, its +own OME multiscales). The root group's `lsst.archive_class` is +`"ColorImage"` and it has **no OME multiscales of its own** — there +is no stacked multi-channel array, so there is nothing for OME to +render at root level. External tools reading the root see three +nested OME images, which is consistent with the recursive-composition +rule. (A future follow-up may add a stacked single-array view; v1 +does not because of the no-byte-duplication rule.) + +For `CellCoadd`: + +``` +cellcoadd.zarr/ +├── zarr.json ← lsst.archive_class + lsst.cell_grid +├── image/ ← (Y, X), chunks aligned to cell_shape +├── variance/ +├── mask/ +├── psf/ ← (Cy, Cx, Py, Px) 4-D, chunks (1, 1, Py, Px) +├── tree +└── wcs_ast +``` + +`psf` is whatever shape `CellCoadd.serialize` natively emits — there +is no stacking fixup. Cell-grid metadata lives in the +`lsst.cell_grid` block of the root group's attributes. + ### Top-level group attributes (`zarr.json` `attributes`) ```jsonc { - // OME-Zarr v0.5 multiscales — populated whenever there is a top-level - // data array. 
+ "data_model": "org.lsst.masked_image", // or .image / .visit_image / etc. + "version": 1, // org.lsst.* schema version + "ome": { "version": "0.5", "multiscales": [{ "name": "", "axes": [/* see per-class table below */], "datasets": [{ - "path": "0", - "coordinateTransformations": [/* affine projection of WCS, if any */] + "path": "image", + "coordinateTransformations": [/* affine; see §4 below */] }] }], - // Only present when channel axis is used (e.g. ColorImage). + // Only present on archive classes whose top-level array has a + // channel axis. Not used in v1 (no stacked ColorImage view). "omero": { "channels": [...] } }, - // LSST extensions (always present). "lsst": { - "version": 1, // schema version of LSST extension - "archive_class": "VisitImage", // dispatch for read-side construction - "tree": "/lsst/tree", // zarr path to JSON tree - "frame_set": "/lsst/frame_set", // optional, AST string array - "companions": { // heterogeneous companion sub-images - "mask": "/lsst/mask", - "variance": "/lsst/variance" - }, - "opaque_metadata_format": "fits", // optional, only when opaque metadata is present (e.g. round-tripping from a FITS read) - "cell_grid": { "bbox": ..., "cell_shape": [256, 256] } // CellCoadd only + "version": 1, // schema version of lsst extension + "archive_class": "VisitImage", // dispatch for read-side construction + "tree": "tree", // zarr path to JSON tree (relative) + "wcs_ast": "wcs_ast", // zarr path to AST string, optional + "wcs_simplified_dropped": false, // see §4 below + "wcs_simplified_max_residual_pixels": 0.13, // observed max; only when affine emitted + "opaque_metadata_format": "fits", // optional, only when present + "cell_grid": { "bbox": ..., "cell_shape": [256, 256] } // CellCoadd only } } ``` +For `ColorImage`, the root group has `lsst.archive_class = "ColorImage"` +and no `ome.multiscales`. 
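As a rough illustration of the attribute block above, the sketch below builds a minimal root `attributes` dict for the no-affine case. `root_attributes` is a hypothetical helper, not part of the design. It emits only the unit scale, leaves out the `wcs_simplified_*`, `data_model`, and `cell_grid` keys, and omits the `ome` block entirely for `ColorImage`; the spec only says that `ome.multiscales` is absent there, so that last choice is an assumption.

```python
def root_attributes(archive_class: str, has_wcs: bool = True) -> dict:
    """Build a minimal root-group attributes dict (illustrative only)."""
    lsst = {"version": 1, "archive_class": archive_class, "tree": "tree"}
    if has_wcs:
        # Relative zarr path to the AST FrameSet text array.
        lsst["wcs_ast"] = "wcs_ast"
    attrs = {"lsst": lsst}
    if archive_class != "ColorImage":
        # A standalone Mask points the multiscale at "mask"; everything
        # else points at the sibling "image" array.
        path = "mask" if archive_class == "Mask" else "image"
        attrs["ome"] = {
            "version": "0.5",
            "multiscales": [{
                "name": "",
                "axes": [
                    {"name": "y", "type": "space"},
                    {"name": "x", "type": "space"},
                ],
                "datasets": [{
                    "path": path,
                    # Unit scale only; the affine case is not sketched here.
                    "coordinateTransformations": [
                        {"type": "scale", "scale": [1.0, 1.0]}
                    ],
                }],
            }],
        }
    return attrs
```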
+ ### Axis choice per archive class -| Archive class | Axes | Top-level array shape | Notes | +| Archive class | Axes (root multiscale) | Top-level science array | Notes | |---|---|---|---| -| `Image`, `Mask` (2D), `MaskedImage`, `VisitImage`, `CellCoadd` | `[y, x]` | `(Y, X)` | Standard 2D image | -| `ColorImage` | `[c, y, x]` | `(3, Y, X)` | Transposed from in-memory `(Y, X, 3)`; `ome:omero/channels=[R,G,B]` | -| `Mask` (3D) | `[plane, y, x]` | `(P, Y, X)` | When written standalone | -| `CellPointSpreadFunction` (per-cell PSF) | `[cell_y, cell_x, y, x]` | `(Cy, Cx, Py, Px)` | Always nested under `lsst/psf/per_cell` of a parent CellCoadd; cell-aligned chunks | +| `Image`, `MaskedImage`, `VisitImage`, `CellCoadd` | `[y, x]` | `image` | Standard 2-D image. | +| `Mask` (standalone, 2-D) | `[y, x]` | `mask` | When written outside a parent. | +| `ColorImage` | (none at root) | (none at root) | Each `red/`, `green/`, `blue/` sub-archive carries its own `[y, x]` multiscale. | -### Mask groups +### Image / variance arrays — array attrs -Located under `lsst/mask/` of a parent (or top-level for a standalone -Mask): +`image/zarr.json` (and likewise `variance/zarr.json` and any other +2-D float sibling): -- `0/`: 3D zarr array with shape `(plane, y, x)` (or 2D for old-style flat - masks). -- `zarr.json` attrs: `ome.multiscales.axes = [plane, y, x]`, plus - `lsst.mask.planes = [{name, bit, description}, …]`. -- The mask schema is duplicated between the mask group's attributes and - the JSON tree. The JSON tree is authoritative; the duplication is for - OME-Zarr-style discoverability by external tools. +```jsonc +{ + "_ARRAY_DIMENSIONS": ["y", "x"], // xarray + "long_name": "science image", // CF + "units": "adu" // CF (when known) +} +``` + +### Mask array — 2-D packed integers with CF flag attrs + +`mask` is a **2-D `(y, x)` unsigned-integer array**. The dtype is +chosen by the schema's plane count: `uint8` for ≤8 planes, `uint16` +for ≤16, `uint32` for ≤32, `uint64` for ≤64. 
Each pixel's bits encode +which planes apply at that pixel — the same logical representation +the FITS backend writes, so FITS↔Zarr mask round-trips need no bit- +repacking. + +`mask/zarr.json`: + +```jsonc +{ + "_ARRAY_DIMENSIONS": ["y", "x"], + "flag_masks": [1, 2, 4, 8, 16], + "flag_meanings": "BAD SAT CR INTRP NO_DATA", + "flag_descriptions": [ + "Bad pixel.", + "Saturated.", + "Cosmic ray.", + "Interpolated.", + "No data." + ] +} +``` + +`flag_masks` and `flag_meanings` are CF conventions: +`flag_meanings` is a **single space-separated string** (not a list) +per CF; `flag_descriptions` is the LSST extension carrying the +human-readable per-plane text from `MaskPlane.description`. + +Schemas with **more than 64 planes** raise on write in v1. A 3-D +`(plane_byte, y, x)` fallback is tracked as a follow-up. + +### The JSON round-trip tree (`tree`) + +A 1-D `uint8` zarr array containing UTF-8 JSON. Same content the JSON +backend produces, but with `ArrayReferenceModel` references whose +source strings are zarr paths within the store: `"zarr:/image"`, +`"zarr:/mask"`, `"zarr:/red/image"` (for nested ColorImage channels), +`"zarr:/psf"` (for CellCoadd). These resolve into the zarr store, not +into the JSON document itself, so they do not use the JSON-Pointer +`#/` fragment prefix. There are **no compound source URLs** (no +`?c=N`, no `?cell=Cy,Cx`) because no arrays are stacked. + +### AST WCS string (`wcs_ast`) + +A 1-D `uint8` zarr array containing the AST `FrameSet` text produced +by an `astshim.Channel`. The full text is stored as bytes; this is +the **authoritative round-trip source** for the WCS. The OME affine +emitted in `multiscales.datasets[].coordinateTransformations` is an +approximation for external tools and is dropped when its residual +exceeds the [§4](#4-error-handling-edge-cases-round-trips) threshold. 
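The mask rules earlier in this section (dtype chosen by plane count, CF flag attributes built from the schema) can be sketched as below. The helper names are hypothetical. The sketch assumes planes occupy consecutive bits starting at bit 0, as in the `flag_masks` example, and raises `ValueError` where the spec calls for an `ArchiveReadError`.

```python
import numpy as np


def mask_dtype(n_planes: int) -> np.dtype:
    """Pick the smallest unsigned integer dtype that holds n_planes bits."""
    for bits, dtype in ((8, np.uint8), (16, np.uint16),
                        (32, np.uint32), (64, np.uint64)):
        if n_planes <= bits:
            return np.dtype(dtype)
    # Stand-in for the ArchiveReadError the spec describes.
    raise ValueError(f"Mask has {n_planes} planes; v1 supports up to 64.")


def cf_flag_attrs(planes: list[tuple[str, str]]) -> dict:
    """Build mask array attributes from (name, description) pairs."""
    return {
        "_ARRAY_DIMENSIONS": ["y", "x"],
        # Assumption: plane i is stored in bit i.
        "flag_masks": [1 << bit for bit in range(len(planes))],
        # CF requires one space-separated string, not a list.
        "flag_meanings": " ".join(name for name, _ in planes),
        "flag_descriptions": [description for _, description in planes],
    }
```

Called with the five planes from the example (`BAD`, `SAT`, `CR`, `INTRP`, `NO_DATA`), `cf_flag_attrs` reproduces the `flag_masks` and `flag_meanings` values shown in the `mask/zarr.json` block, and `mask_dtype(5)` selects `uint8`.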
+ +For multi-frame-set archives (`serialize_frame_set` calls referencing +distinct WCS objects), each frame set is stored at +`/lsst/frame_sets/` and referenced via `ZarrPointerModel` in +the JSON tree, mirroring the NDF / FITS pattern. ### Tables -Located under `lsst/tables//` (or wherever the JSON tree references): - -- A zarr group with one 1D zarr array per column. -- Group attrs: `lsst.table = {columns: [{name, dtype, unit, description}, …], meta: {...}, length: N}`. -- Structured arrays use the same group form; the deserialized type - differs. - -### Frame sets / WCS - -- AST `FrameSet` serialized via `Channel`/`StringStream` → UTF-8 bytes → - stored as a 1D `uint8` zarr array at `lsst/frame_set` (or - `lsst/frame_sets/` when multiple frame sets are referenced via - `serialize_frame_set`). -- When the FrameSet's pixel-to-sky portion is purely affine - (translation + scale + rotation), an equivalent OME - `coordinateTransformations` block is added to the multiscales metadata - for external-tool benefit. The AST string remains the authoritative - source for round-trip. -- For `CellCoadd` and other tangent-plane WCS cases, the affine - approximation is often exact for the linear part; the AST string is - still authoritative. - -### JSON tree (`lsst/tree`) - -- A 1D `uint8` zarr array containing UTF-8 JSON. Same content the JSON - backend would produce, but with `ArrayReferenceModel` references whose - source paths are zarr paths within the store (e.g. `"/0"` → top-level - data array, `"/lsst/mask/0"` → 3D mask array). These resolve into the - zarr store, not into the JSON document itself, so they do not use the - JSON-Pointer `#/` fragment prefix. -- Keeps round-trip with the existing `serialize` / `deserialize` - machinery untouched. +A table named `` lives at `/lsst/tables//`: one +1-D zarr array per column, sibling to the others under a group whose +attributes carry the `lsst.table = {columns: [...], length: N, +meta: {...}}` block. 
Structured arrays use the same group form; the +deserialised type differs. ### Recursive composition -Any sub-archive that holds image-shaped data (e.g. PSF model parameter -images) creates a nested zarr group at its archive path that is itself a -valid OME-Zarr image, with its own `ome.multiscales` and -`lsst.archive_class` attributes. The top-level is not special; the same -rules apply at every level. - -### Chunking and sharding defaults - -- Default chunk for a 2D image: `min(1024, dim)` per axis. - For `CellCoadd`: `cell_shape`. -- Default shard: 4×4 chunks (i.e. 4096×4096 for plain images, 4×4 cells - for `CellCoadd`) if shard size would be ≥ 1 MiB; otherwise no - sharding. -- Default codec stack: `bytes -> blosc(zstd, clevel=5, shuffle=shuffle)` - for floats; `bytes -> blosc(zstd, clevel=5, shuffle=bitshuffle)` for - integers. -- Mask arrays use `bitshuffle + zstd` (compresses very well). +Any sub-archive that holds image-shaped data (e.g. `red/`, `green/`, +`blue/` for `ColorImage`; PSF model parameter images for archives +that nest them) creates a nested group at its archive path that is +itself a valid OME-NGFF / xarray group, with its own +`ome.multiscales` and `lsst.archive_class` attributes. The top-level +is not special; the same rules apply at every level. + +### Chunking and sharding defaults + +- Default chunk for a 2-D image: `min(1024, dim)` per axis. For + `CellCoadd`: `cell_shape`. +- Default shard: 4×4 chunks (i.e. 4096×4096 for plain images, 4×4 + cells for `CellCoadd`) if shard size would be ≥ 1 MiB; otherwise + no sharding. +- Default codec stack: `bytes -> blosc(zstd, clevel=5, + shuffle=byte)` for floats; `bytes -> blosc(zstd, clevel=5, + shuffle=bit)` for integers and masks. - All defaults are overridable via `ZarrCompressionOptions` per-array - (keyed by JSON pointer, mirroring the FITS backend). + (keyed by JSON pointer / zarr path). 
+- `image`, `variance`, and `mask` share the spatial chunk shape; + the output archive derives `variance` / `mask` chunks from + `image`'s when not explicitly overridden. ## 4. Error Handling, Edge Cases, Round-Trips ### Round-trip rules -- A zarr file written from an object read from FITS must round-trip its - `FitsOpaqueMetadata` (primary header) so that re-writing to FITS - preserves header cards. Stored at `lsst/opaque_metadata/fits/primary` - as JSON-encoded astropy `Header`. Same opaque-metadata pattern NDF - uses. -- For zarr itself: any `lsst.*` attributes the archive doesn't recognize - are preserved verbatim and re-emitted on write of an unchanged tree +- A zarr file written from an object read from FITS preserves its + primary-HDU `FitsOpaqueMetadata` at + `/lsst/opaque_metadata/fits/primary` (1-D `uint8` array of + JSON-encoded astropy `Header`). Reading the zarr back attaches an + equivalent `FitsOpaqueMetadata` to the deserialized object so a + subsequent FITS write preserves the original cards. +- Any `lsst.*` attributes the archive does not recognise are + preserved verbatim and re-emitted on write of an unchanged tree (forward compatibility). -- ColorImage's transpose `(Y, X, 3) ↔ (3, Y, X)` is handled in the IR - materialization layer, not in the user-visible `add_array`. + +### WCS validation: simplified-affine residual check + +When emitting OME `coordinateTransformations` for a multiscale +dataset, the layout layer: + +1. Extracts the linear / affine portion of the AST `FrameSet`'s + pixel-to-sky mapping as a 3×3 affine block. +2. Samples residuals on an **11×11 grid** spanning the image bbox. + At each grid point, computes pixel→sky via both the full AST + `FrameSet` and the simplified affine, takes the great-circle + separation, and divides by the pixel scale to get a + pixel-equivalent residual. +3. 
If `max_residual > 1.0 pixel`, **drops the + `coordinateTransformations` block** for the dataset (emits the + unit scale `[1.0, 1.0]` only) and sets + `lsst.wcs_simplified_dropped: true` on the root group, recording + the observed max residual under `lsst.wcs_simplified_max_residual_pixels`. + +Readers always reconstruct the projection from `wcs_ast` regardless +of whether the affine block was emitted or dropped — the OME affine +is purely an external-tool affordance. ### Error taxonomy Extends existing `serialization.ArchiveReadError`: -- `ArchiveReadError("File has no zarr.json")` for missing root metadata. +- `ArchiveReadError("File has no zarr.json")` for missing root + metadata. - `ArchiveReadError("File is not an LSST zarr archive")` when `lsst.archive_class` is missing. -- `ArchiveReadError("Unsupported lsst:version ")` for forward-incompat - schema versions. +- `ArchiveReadError(f"Unsupported lsst:version {N}")` for + forward-incompatible schema versions. +- `ArchiveReadError(f"Mask has {N} planes; v1 supports up to 64. " + f"3-D fallback is a follow-up.")` on write of a `>64`-plane Mask. +- `ArchiveReadError("On-disk mask schema does not match requested " + "schema: ...")` for read-time schema mismatches; both schemas are + attached, identical to NDF. - `InvalidParameterError` for unknown `read()` kwargs. - `InvalidComponentError` for `deserialize_component` on unknown component names. @@ -317,39 +494,48 @@ Extends existing `serialization.ArchiveReadError`: - Write opens the store in create-only mode (refuses to overwrite an existing zarr root, mirroring FITS/NDF). -- For DirectoryStore, a partial failure leaves a partial directory — - same risk profile as NDF write failures. Document this and recommend - writing to a temp path then renaming via ResourcePath. -- ZipStore writes are atomic (the file isn't valid until the central - directory is written), so failures leave no garbage. 
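The simplified-affine residual check from the WCS validation steps above can be sketched with NumPy. This is a sketch under stated assumptions, not the layout layer's implementation: the `full` and `affine` callables stand in for the AST `FrameSet` and the extracted affine, both are assumed to map pixel arrays to (longitude, latitude) in radians, and all function names are hypothetical.

```python
import numpy as np


def great_circle(lon1, lat1, lon2, lat2):
    """Haversine great-circle separation in radians."""
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2)
    return 2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def max_residual_pixels(full, affine, bbox, pixel_scale, n=11):
    """Maximum full-vs-affine separation on an n x n grid, in pixels."""
    x0, y0, x1, y1 = bbox
    xs, ys = np.meshgrid(np.linspace(x0, x1, n), np.linspace(y0, y1, n))
    separation = great_circle(*full(xs, ys), *affine(xs, ys))
    # pixel_scale is in radians per pixel, so the ratio is in pixels.
    return float(np.max(separation) / pixel_scale)


def keep_affine(max_residual, threshold=1.0):
    """The affine block is dropped only when the residual exceeds 1 pixel."""
    return max_residual <= threshold


# Synthetic check: a linear mapping versus one with a quadratic term.
scale = 1e-6  # radians per pixel (made-up value)
linear = lambda x, y: (x * scale, y * scale)
distorted = lambda x, y: (x * scale, (y + 1e-5 * x**2) * scale)
bbox = (0.0, 0.0, 1000.0, 1000.0)
```

With these synthetic mappings the linear case has zero residual, while the distorted case reaches about 10 pixels at `x = 1000` and would have its `coordinateTransformations` block dropped.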
- -### Chunk-aligned subset reads - -- `get_array(model, slices=...)` passes slices straight to the backing - zarr-python array. Zarr handles chunk boundary alignment internally. -- For 3D mask reads with a `bbox=`-style argument, the slice on the - spatial axes only is what zarr sees; the plane axis is fully read. +- For `LocalStore`, a partial failure leaves a partial directory — + same risk profile as NDF write failures. Document this and + recommend writing to a temp `ResourcePath` then renaming. +- `ZipStore` writes are atomic (the file is not valid until the + central directory is written), so failures leave no garbage. + +### Chunk-aligned subset reads (lazy invariant) + +- `get_array(model, slices=...)` passes `slices` straight to the + backing `zarr.Array` handle. Zarr handles chunk boundary + alignment internally; only chunks intersecting the slice are + fetched. +- For 2-D mask reads (the v1 layout), spatial slices apply as on + the image; there is no plane-axis to consider. +- A `_CountingStore`-based regression test asserts that a + single-chunk subset of a 16×16 / chunks=(4,4) array touches + strictly fewer chunk reads than a full read. This is the load- + bearing test for cloud-friendly subsetting. ### Mask schema mismatches -- If a Mask is read where the on-disk plane definitions differ from the - in-memory schema being requested, raise `ArchiveReadError` with both - schemas attached, identical to NDF. +If a `Mask` is read where the on-disk plane definitions differ from +the in-memory schema being requested, raise `ArchiveReadError` +with both schemas attached, identical to the NDF backend. ### Empty / minimal cases -- `Image` with no projection: omit `lsst.frame_set`, omit OME - `coordinateTransformations`, the `lsst.tree` is just an - `ImageSerializationModel`. 
-- `Image` plus metadata only: same as above; `metadata` lives in the +- `Image` with no projection: omit `wcs_ast`; the OME multiscale's + `coordinateTransformations` is the unit scale `[1.0, 1.0]`. The + `tree` JSON document is just an `ImageSerializationModel` with + no `projection` field. +- `Image` plus metadata only: as above; `metadata` lives in the JSON tree. ### Forward compatibility -- `lsst.version` integer; readers refuse versions newer than they - understand. -- Unknown `lsst.*` keys at any level are preserved through the opaque - metadata mechanism so a partial round-trip does not lose them. +- `lsst.version` is an integer; readers refuse versions newer than + they understand. +- Unknown `lsst.*` keys at any level are preserved verbatim through + the IR (`ZarrAttributes.load` keeps them; `dump` re-emits them). + This buys partial-knowledge round-trips without losing extension + data. ## 5. Testing Strategy and Rollout @@ -357,30 +543,41 @@ Extends existing `serialization.ArchiveReadError`: Mirrors the NDF pattern (`tests/test_ndf_*.py`): -- `tests/test_zarr_common.py` — `_common.py` constants, path helpers, - `ZarrCompressionOptions` dataclass. +- `tests/test_zarr_common.py` — `_common.py` constants, path + helpers, `ZarrCompressionOptions` dataclass. - `tests/test_zarr_model.py` — IR types in isolation: `ZarrDocument` - round-trip via `from_zarr` / `to_zarr` against an in-memory store, - attribute schema validation, ColorImage axis transpose. + round-trip via `from_zarr` / `to_zarr` against an in-memory + store, attribute schema validation. Lazy invariant on + `ZarrArray.from_zarr`. - `tests/test_zarr_layout.py` — `_layout.py` rules: which axes for - which archive class, which attributes get populated, JSON-pointer ↔ - zarr-path translation. 
-- `tests/test_zarr_output_archive.py` — write paths for every supported - archive class (`Image`, `Mask`, `MaskedImage`, `VisitImage`, - `ColorImage`, `CellCoadd`), verifying the on-disk layout matches the - spec by inspecting the IR. -- `tests/test_zarr_input_archive.py` — read paths and `slices=` subset - reads, error taxonomy tests, opaque-metadata round-trips. + which archive class, CF flag-attrs construction for masks, + affine-residual validator (synthetic linear WCS passes; synthetic + high-distortion WCS triggers the drop), chunk derivation + (including `cell_shape` alignment). +- `tests/test_zarr_store.py` — URI dispatch (`LocalStore` / + `ZipStore` / `FsspecStore`), create-only refusal. +- `tests/test_zarr_output_archive.py` — write paths for every + supported archive class (`Image`, `Mask`, `MaskedImage`, + `VisitImage`, `ColorImage`, `CellCoadd`), verifying the on-disk + layout matches the spec by inspecting the IR. +- `tests/test_zarr_input_archive.py` — read paths and `slices=` + subset reads, `_CountingStore` lazy-invariant assertion, error + taxonomy tests, opaque-metadata round-trips. - `tests/test_zarr_round_trip.py` — full write→read round-trips for - every type, plus FITS↔Zarr cross-format round-trips for the types - that already do FITS↔NDF round-trips. + every type, plus FITS↔Zarr cross-format round-trips for the + types that already do FITS↔NDF round-trips. +- `tests/test_zarr_xarray_interop.py` — `xr.open_zarr(path)` returns + a `Dataset` with `image` / `variance` / `mask` data variables + sharing `(y, x)` dims; CF flag attributes survive on the mask + variable. Skipped if `xarray` is not installed. - `tests/test_zarr_ome_compliance.py` — *if* `ngff-validator` (or equivalent) can be installed in CI, run it against representative outputs to catch OME-Zarr spec drift. Skipped if the tool is unavailable. 
- `tests/test_zarr_external_reader.py` — sanity-check that the - `ome-zarr` Python tooling can open our files and read the data array - (not LSST extensions). Skipped if `ome-zarr` is not installed. + `ome-zarr` Python tooling can open our files and read the science + array (not LSST extensions). Skipped if `ome-zarr` is not + installed. ### CI / dev requirements @@ -392,35 +589,53 @@ user-facing extras. Scoped into separate tickets/PRs to keep review tractable: -1. Skeleton + `_common.py` + `_model.py` IR + tests for the IR alone. - No write/read yet. -2. `_store.py` + `_layout.py` + `ZarrOutputArchive` + write helper. - Cover `Image`, `MaskedImage`, `VisitImage` only. Output-side tests. -3. `ZarrInputArchive` + read helper + `slices=` subset reads + error - taxonomy. Input-side tests + round-trip for the types in step 2. -4. `ColorImage` (channel-axis specialization) + `CellCoadd` - (cell-aligned chunks + 4D PSF). Round-trip tests. -5. Cross-format round-trips (FITS ↔ Zarr opaque metadata round-trip). - Optional `ome-zarr` external-reader sanity test. +1. Skeleton + `_common.py` + `_model.py` IR + tests for the IR + alone. No write/read yet. +2. `_store.py` + `_layout.py` (axes, chunks, affine validator) + + `ZarrOutputArchive` + write helper. Cover `Image`, + `MaskedImage`, `VisitImage` only. Output-side tests, including + CF flag-attrs assertions on the mask group and the affine- + residual validator behaviour. +3. `ZarrInputArchive` + read helper + `slices=` subset reads (with + `_CountingStore` regression test) + error taxonomy. Input-side + tests + round-trip for the types in step 2. +4. `ColorImage` (recursive composition of three `Image` sub-archives) + + `CellCoadd` (cell-aligned chunks + 4-D PSF). Round-trip tests. +5. Cross-format round-trips (FITS ↔ Zarr opaque metadata + round-trip). Optional `ome-zarr` external-reader sanity test. + `xarray` interop test. 6. 
Documentation: module docstring (mirroring the FITS/NDF module docstrings) describing the layout, plus a changelog entry. ## 6. Follow-Up Work (Out of Scope) -Captured here so they are not lost; each is to be tracked as its own -ticket once the initial backend lands. - -- **NGFF nonlinear coordinate transformations.** When NGFF gains - broadly-supported nonlinear transformations (RFC-3 follow-on or - successor) and tooling adoption follows, replace the current - affine-approximation path with a fully populated OME - `coordinateTransformations`. This is high-interest because tangent-plane - pixel-to-sky transformations (CellCoadd) and polynomial corrections - (VisitImage) currently round-trip only through the AST string; richer - OME support would expose them to external tools. -- **Lazy / dask-friendly read API** (`read_lazy()` returning open zarr - arrays/groups for downstream dask integration). +Captured here so they are not lost; each is to be tracked as its +own ticket once the initial backend lands. + +- **NGFF RFC-5 nonlinear coordinate transformations.** Replace the + affine-only OME block with a real `sequence(affine, projection, + ...)` block and treat it as authoritative; `wcs_ast` becomes an + optional fallback rather than the source of truth. This is high- + interest because tangent-plane pixel-to-sky transformations + (CellCoadd) and polynomial corrections (VisitImage TAN-SIP) + currently round-trip only through the AST string; richer OME + support would expose them to external tools. **This work is + blocked on writing an AST JSON channel** that serializes a + `FrameSet` to and from RFC-5 transformation JSON — this is a + non-trivial piece of work in its own right and is recorded as a + tracked dependency with no v1 timeline. +- **3-D mask fallback for `>64`-plane masks.** Adds a per-class + layout switch: 2-D packed for ≤64 planes (CF-compliant), 3-D + `(plane_byte, y, x)` for `>64` (CF-extension annotations). v1 + raises on write for `>64`. 
+- **Lazy / dask-friendly read API** (`read_lazy()` returning open + zarr arrays / `xr.Dataset` for downstream dask integration). - **Multiscale pyramid generation** (level 1, 2, … coarsenings) for visualization tools. -- **`zarr.consolidated_metadata` extension** to reduce object-list calls - on cloud stores. +- **`zarr.consolidated_metadata` extension** to reduce object-list + calls on cloud stores. +- **Stacked OME view for `ColorImage`.** A future need for a single + `(3, Y, X)` OME-readable array could be met by writing a stacked + view alongside the per-channel sub-archives. v1 does not because + of the no-byte-duplication rule; the per-channel sub-archives are + themselves valid OME images. From a34369d876799f8fab8ef6152eadc5b78184a464 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Fri, 22 May 2026 17:38:07 -0700 Subject: [PATCH 04/60] docs: track NCZarr / NetCDF interop as a follow-up NCZarr is purely additive on top of the v1 layout (_NCZARR_GROUP / _NCZARR_ARRAY markers + optional 1-D coordinate variables). Held out of v1 because the zarr-v3 mapping is still evolving; recorded so we don't lose the conclusion. Generated with AI Co-Authored-By: SLAC AI --- docs/superpowers/specs/2026-05-22-zarr-io-design.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/docs/superpowers/specs/2026-05-22-zarr-io-design.md b/docs/superpowers/specs/2026-05-22-zarr-io-design.md index 4aa66f65..d9e60197 100644 --- a/docs/superpowers/specs/2026-05-22-zarr-io-design.md +++ b/docs/superpowers/specs/2026-05-22-zarr-io-design.md @@ -639,3 +639,15 @@ own ticket once the initial backend lands. view alongside the per-channel sub-archives. v1 does not because of the no-byte-duplication rule; the per-channel sub-archives are themselves valid OME images. +- **NCZarr / NetCDF interop.** Unidata's NCZarr layers a NetCDF data + model on top of Zarr, unlocking native reads via `libnetcdf` and + the downstream R / Fortran / MATLAB / IDL ecosystems. 
v1 is + already partially compatible because `_ARRAY_DIMENSIONS` (xarray) + is the same dimension-naming convention NCZarr uses. Full + compliance is **purely additive**: add `_NCZARR_GROUP` and + `_NCZARR_ARRAY` attribute markers (no layout change, no extra + bytes), and optionally write 1-D `y` / `x` coordinate variables + so the file is self-describing as a NetCDF dataset. Held out of + v1 because NCZarr's zarr-v3 mapping is still evolving and we'd + rather pin against a stable revision; the upgrade requires no + migration of existing files when we adopt it. From 76c9f74a98ed5d5ec8a0b9cabf4393d47a1c0d44 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Fri, 22 May 2026 18:44:16 -0700 Subject: [PATCH 05/60] Rewrite zarr I/O implementation plan for the revised design Replaces the v1 plan (commit 51a1e3a) with a plan that matches the revised spec (commits 83b9064 + a34369d): xarray/CF-shaped root with OME-NGFF as a discoverability layer, 2-D packed-integer mask with CF flag attrs, ColorImage as recursive sub-archives (no stacking), CellCoadd with native 4-D PSF (no fixup pass), affine residual validator with 1-pixel threshold, and the AST string as the WCS round-trip authority. Six phases, ~37 bite-sized TDD tasks, every critical invariant pinned by a failing-then-passing test. Generated with AI Co-Authored-By: SLAC AI --- .../plans/2026-05-22-zarr-io-backend.md | 3788 ++++++++++------- 1 file changed, 2252 insertions(+), 1536 deletions(-) diff --git a/docs/superpowers/plans/2026-05-22-zarr-io-backend.md b/docs/superpowers/plans/2026-05-22-zarr-io-backend.md index 38991d4c..1b062d0d 100644 --- a/docs/superpowers/plans/2026-05-22-zarr-io-backend.md +++ b/docs/superpowers/plans/2026-05-22-zarr-io-backend.md @@ -2,17 +2,18 @@ > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. 
-**Goal:** Add a `lsst.images.zarr` subpackage that reads and writes OME-Zarr v0.5 files (with namespaced `lsst:` extensions) for every image type the existing FITS/JSON/NDF backends support, with cloud-friendly chunking/sharding and efficient subset reads that go straight to the underlying `zarr-python` lazy arrays. +**Goal:** Add a `lsst.images.zarr` subpackage that reads and writes Zarr v3 archives following the revised design at `docs/superpowers/specs/2026-05-22-zarr-io-design.md` — xarray/CF-shaped at the root with OME-NGFF v0.5 metadata as a discoverability layer on top, supporting every image type the FITS/JSON/NDF backends support, with cloud-friendly chunking and lazy subset reads that only fetch the chunks they touch. -**Architecture:** Mirrors the NDF backend. A Python intermediate representation (`ZarrDocument`/`ZarrGroup`/`ZarrArray`) holds the on-disk layout independently of `zarr-python`. The IR holds **lazy `zarr.Array` handles, never materialized `numpy` arrays** — so `from_zarr()` opens groups without reading bytes, and `InputArchive.get_array(model, slices=...)` passes slices through to the lazy handle. Writes use a two-pass model: `obj.serialize(archive)` populates the IR, then `__exit__` materializes it via the configured `zarr.storage.Store`. Stores are selected from a `ResourcePath` URI: `*.zarr.zip` → `ZipStore`, remote URIs → `FsspecStore`, otherwise `LocalStore`. +**Architecture:** Mirrors the NDF backend. A Python intermediate representation (`ZarrDocument`/`ZarrGroup`/`ZarrArray`) holds the on-disk layout independently of `zarr-python`. The IR holds **lazy `zarr.Array` handles, never materialized `numpy` arrays** — so `from_zarr()` opens groups without reading bytes, and `InputArchive.get_array(model, slices=...)` passes slices through to the lazy handle. Writes use a two-pass model: `obj.serialize(archive)` populates the IR, then `__exit__` materializes it via the configured `zarr.storage.Store`. 
Stores are selected from a `ResourcePath` URI: `*.zarr.zip` → `ZipStore`, remote URIs → `FsspecStore`, otherwise `LocalStore`. **No stacking, no JSON-pointer rewrites, no compound source URLs** — each `add_array(name)` call lands at the zarr path equal to `name`. **Tech Stack:** `zarr >= 3.0`, `numcodecs` (already pulled by zarr), `fsspec` (already a dependency), `lsst.resources.ResourcePath` (already a dependency), `pydantic >= 2.12`, `numpy >= 2.0`. Reuses `lsst.images.serialization` ABCs and tree models. Optional install via `pip install lsst-images[zarr]`. -**Critical invariant — lazy reads everywhere:** -- `ZarrArray.data` holds either `np.ndarray` (when staged for write) **or** a `zarr.Array` handle (when read). It never copies bytes during `from_zarr`. -- `ZarrInputArchive.get_array(model, slices=...)` resolves the model's zarr path, fetches the lazy `zarr.Array`, and applies `slices` *to the handle* (`arr[slices]`). For a remote VisitImage this means only the chunks intersecting `slices` are downloaded. -- The `_layout.py` ColorImage transpose `(c, y, x) → (y, x, c)` happens *after* slicing, on the (small) sliced result — never on the full array. -- Tests under `test_zarr_input_archive.py` assert this with a mock `zarr.storage.Store` that records every key access; subset reads must touch a strict subset of the chunks of the full read. +**Critical invariants** — these are pinned by tests in this plan: + +1. **Lazy reads everywhere.** `ZarrArray.data` is one of `np.ndarray` (staged for write) or `zarr.Array` (read-side handle). `from_zarr` never reads chunk bytes. `InputArchive.get_array(model, slices=...)` forwards `slices` straight to the lazy handle. Pinned by `_CountingStore` regression test in Task 3.2. +2. **Aligned chunks across siblings.** `image`, `variance`, and `mask` share spatial chunk shape. The output archive derives `variance`/`mask` chunks from `image`'s chunk shape when not explicitly overridden. Pinned by Task 2.5. +3. 
**Affine residual validator.** Before emitting an OME `coordinateTransformations` block, the layout layer samples residuals on an 11×11 grid; if max pixel-equivalent residual exceeds 1.0 pixel, the block is dropped and `lsst.wcs_simplified_dropped: true` is set. The AST string at `wcs_ast` is always authoritative. Pinned by Task 2.4. +4. **No byte duplication.** ColorImage channels are recursive sub-archives, not stacked. CellCoadd PSF is whatever shape `serialize` natively emits — typically 4-D `(Cy, Cx, Py, Px)`. There is no fixup pass that copies or re-shapes data. --- @@ -21,18 +22,21 @@ ``` python/lsst/images/zarr/ ├── __init__.py guarded `import zarr`; re-exports public API -├── _common.py ZarrPointerModel, attribute namespace constants -│ ("lsst:" / "ome:" prefixes plus the LSST_VERSION -│ schema integer), ZarrCompressionOptions dataclass, -│ path/JSON-pointer helpers -├── _model.py IR: ZarrAttributes, ZarrArray, ZarrGroup, -│ ZarrDocument, OmeMultiscale, OmeOmero, -│ LsstMaskGroup, LsstTableGroup, with from_zarr / -│ to_zarr methods that DO NOT materialize array data -├── _layout.py Layout rules: archive-class → axes mapping, -│ JSON-pointer ↔ zarr-path translation, default -│ chunk/shard derivation, ColorImage axis transpose -│ applied to (already-sliced) arrays +├── _common.py ZarrPointerModel, namespace constants +│ (LSST_NS / OME_NS / LSST_VERSION / OME_VERSION), +│ ZarrCompressionOptions, mask-dtype-for-plane-count, +│ path helpers (no JSON-pointer mapping table — +│ every name maps to its literal path now) +├── _model.py IR: ZarrAttributes, ZarrArray (lazy-handle backed), +│ ZarrGroup, ZarrDocument, OME/CF helpers +│ (OmeMultiscale, OmeOmeroChannel, +│ CfFlagAttributes, build_image_array_attrs) +├── _layout.py Layout rules: axes per archive class, +│ chunk derivation (incl. 
cell-aligned for CellCoadd +│ and aligned-with-image for variance/mask), +│ affine extraction + residual validator, +│ OME multiscale block construction, +│ CF flag-attrs construction from MaskSchema ├── _store.py URI → zarr.storage.Store wrapper: │ *.zarr.zip → ZipStore, http(s)/s3/gs → FsspecStore, │ local → LocalStore. Honors create-only mode. @@ -40,37 +44,47 @@ python/lsst/images/zarr/ ├── _input_archive.py ZarrInputArchive (reads IR lazily) and read() helper tests/ -├── test_zarr_common.py constants, helpers, ZarrCompressionOptions -├── test_zarr_model.py IR round-trip via in-memory MemoryStore -├── test_zarr_layout.py axes per archive class, pointer translation +├── test_zarr_common.py constants, helpers, ZarrCompressionOptions, +│ mask-dtype-for-plane-count +├── test_zarr_model.py IR round-trip via in-memory MemoryStore, +│ lazy invariant on from_zarr +├── test_zarr_layout.py axes per archive class, chunk derivation, +│ CF flag-attrs construction, +│ affine residual validator behaviour ├── test_zarr_store.py URI dispatch, create-only refusal -├── test_zarr_output_archive.py write paths inspected against IR -├── test_zarr_input_archive.py read paths + lazy subset assertions +├── test_zarr_output_archive.py write paths inspected against IR for +│ every supported archive class +├── test_zarr_input_archive.py read paths + lazy subset assertion +│ (_CountingStore), error taxonomy, +│ opaque-metadata round-trip ├── test_zarr_round_trip.py full write→read for every type +├── test_zarr_cross_format.py FITS↔Zarr opaque-metadata round-trip +├── test_zarr_xarray_interop.py xr.open_zarr returns Dataset with +│ image/variance/mask data variables ├── test_zarr_ome_compliance.py ngff-validator (skipped if absent) └── test_zarr_external_reader.py ome-zarr-py sanity (skipped if absent) ``` -Files in this layout follow the NDF backend's split: `_model.py` is pure data; `_output_archive.py` and `_input_archive.py` only translate between the IR and the abstract archive 
interface. Layout decisions live in `_layout.py` so per-class tweaks (ColorImage's `c` axis, CellCoadd's cell-aligned chunks) are made once against the populated IR rather than scattered through `add_array` calls. +The split mirrors the NDF backend exactly: `_model.py` is pure data; `_output_archive.py` and `_input_archive.py` only translate between the IR and the abstract archive interface; `_layout.py` holds every per-archive-class decision so individual `add_array` calls stay generic. --- ## Phase 1 — Skeleton, `_common.py`, and IR (no I/O yet) -This phase produces the IR and constants in isolation. It can be merged independently — there is no archive yet, but the IR round-trips through an in-memory zarr `MemoryStore` so we can prove the shape of what the later archives will produce. +This phase produces the IR and constants in isolation. The IR round-trips through an in-memory zarr `MemoryStore` so the shape of what later phases will produce is pinned before any archive code is written. ### Task 1.1: Create the package skeleton **Files:** - Create: `python/lsst/images/zarr/__init__.py` -- Modify: `pyproject.toml:55` (add `zarr` extra after the existing `ndf` extra) +- Modify: `pyproject.toml` (add `zarr` extra after the existing `ndf` extra at line 55) - [ ] **Step 1: Add the optional dependency** -Edit `pyproject.toml` after line 55 (the `ndf` extra): +In `pyproject.toml`, immediately after the `ndf` extra (around line 55), add: ```toml -# Add feature for OME-Zarr v0.5 read/write support. +# Add feature for Zarr v3 read/write support. zarr = ["zarr >= 3.0"] ``` @@ -90,22 +104,23 @@ Create `python/lsst/images/zarr/__init__.py`: # Use of this source code is governed by a 3-clause BSD-style # license that can be found in the LICENSE file. -"""OME-Zarr v0.5 archive backend for `lsst.images`. +"""Zarr v3 archive backend for `lsst.images`. 
-Files written by this archive are valid OME-Zarr v0.5 NGFF images at every -level where image-shaped data lives, augmented with namespaced ``lsst:`` -extensions for mask plane semantics, AST WCS round-trip, the LSST archive -tree, table layout, and cell-grid hints. External tools that consume -OME-Zarr (`napari`, `neuroglancer`, `ome-zarr-py`) can render the science -arrays without LSST-specific awareness. +Files written by this archive are xarray/CF-shaped at the root +(``image`` / ``variance`` / ``mask`` as siblings sharing ``(y, x)`` +dimensions, CF ``flag_masks`` / ``flag_meanings`` on the mask) with +OME-NGFF v0.5 multiscales metadata as a discoverability layer +pointing at the same ``image`` array. The same bytes are visible to +``xarray``, GDAL's Zarr driver, and OME-Zarr tooling like ``napari`` +and ``ome-zarr-py``. Default chunk geometry is tile-aligned (~1024×1024 for plain images, -``cell_shape`` for ``CellCoadd``). Sharding (zarr v3 native) is enabled by -default with a tunable shard size to keep object counts manageable on -S3/GCS. Both ``DirectoryStore`` and ``ZipStore`` are supported; the choice -is driven by URI shape (``*.zarr.zip`` → ``ZipStore``, otherwise -directory). Remote URIs go through `lsst.resources.ResourcePath` and -`fsspec`. +``cell_shape`` for ``CellCoadd``). Sharding (zarr v3 native) is +enabled by default with a tunable shard size to keep object counts +manageable on S3/GCS. Both ``LocalStore`` and ``ZipStore`` are +supported; the choice is driven by URI shape (``*.zarr.zip`` → +``ZipStore``, otherwise directory). Remote URIs go through +`lsst.resources.ResourcePath` and `fsspec`. """ try: @@ -122,7 +137,7 @@ except ImportError as e: - [ ] **Step 3: Verify the guarded import works** Run: `python -c "import lsst.images.zarr"` -Expected: no output (success), or a clear ImportError pointing at the `[zarr]` extra if `zarr` isn't installed.
+Expected: no output (success), or a clear ImportError pointing at the `[zarr]` extra if `zarr` is not installed. - [ ] **Step 4: Commit** @@ -131,12 +146,20 @@ git add python/lsst/images/zarr/__init__.py pyproject.toml git commit -m "feat: add lsst.images.zarr package skeleton with guarded import" ``` -### Task 1.2: `_common.py` — constants, `ZarrPointerModel`, `ZarrCompressionOptions` +### Task 1.2: `_common.py` — namespaces, `ZarrPointerModel`, `ZarrCompressionOptions`, mask-dtype helper **Files:** - Create: `python/lsst/images/zarr/_common.py` - Test: `tests/test_zarr_common.py` +`_common.py` carries: + +- Namespace constants `LSST_NS = "lsst"`, `OME_NS = "ome"`, version integers `LSST_VERSION = 1`, `OME_VERSION = "0.5"`. +- `ZarrPointerModel` — Pydantic model holding an absolute zarr path. +- `ZarrCompressionOptions` — dataclass with `codec`, `cname`, `clevel`, `shuffle`. Provides `default_for_dtype(dtype)` returning byte-shuffle for floats, bit-shuffle for ints/masks. +- `mask_dtype_for_plane_count(n)` — picks the smallest unsigned integer that holds `n` planes; raises if `n > 64`. +- `archive_path_to_zarr_path(archive_path)` — translates an empty archive path to `/tree`; non-empty paths are kept verbatim under their natural path. **There is no JSON-pointer mapping table.** `name="image"` lands at `/image`; `name="mask"` at `/mask`; nested `name="red/image"` at `/red/image`. 
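
The mask-dtype helper listed above feeds the CF flag attributes that the plan attaches to the packed 2-D mask. A minimal sketch of that relationship follows, in pure Python so it runs without `numpy` or `zarr`. The names `mask_dtype_name`, `cf_flag_attributes`, and `decode_pixel`, the plane names, and the "plane *i* occupies bit *i*" ordering are all assumptions made for illustration; the real construction from `MaskSchema` lives in `_layout.py` and may differ.

```python
def mask_dtype_name(n_planes: int) -> str:
    """Name of the smallest unsigned dtype with at least n_planes bits."""
    if n_planes <= 0:
        raise ValueError(f"n_planes must be positive, got {n_planes}.")
    for bits in (8, 16, 32, 64):
        if n_planes <= bits:
            return f"uint{bits}"
    raise ValueError(f"Mask has {n_planes} planes; v1 supports up to 64.")


def cf_flag_attributes(plane_names: list[str]) -> dict:
    """Build CF-style flag attributes, assuming plane i maps to bit i."""
    return {
        # CF: one mask value per flag, same length as flag_meanings.
        "flag_masks": [1 << i for i in range(len(plane_names))],
        # CF: a single whitespace-separated string of flag names.
        "flag_meanings": " ".join(plane_names),
    }


def decode_pixel(value: int, attrs: dict) -> set[str]:
    """Return the plane names whose bit is set in a packed pixel value."""
    names = attrs["flag_meanings"].split()
    return {name for name, bit in zip(names, attrs["flag_masks"]) if value & bit}


# Hypothetical three-plane schema: fits in uint8.
attrs = cf_flag_attributes(["BAD", "SAT", "CR"])
packed = attrs["flag_masks"][0] | attrs["flag_masks"][2]
print(mask_dtype_name(3), packed, sorted(decode_pixel(packed, attrs)))
```

Because the attributes are plain lists and strings, they can sit at the top level of `zarr.json` `attributes`, which is where the plan expects CF-aware readers to look for them.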
+ - [ ] **Step 1: Write the failing test** Create `tests/test_zarr_common.py`: @@ -157,6 +180,8 @@ from __future__ import annotations import unittest +import numpy as np + try: from lsst.images.zarr._common import ( LSST_NS, @@ -166,7 +191,7 @@ try: ZarrCompressionOptions, ZarrPointerModel, archive_path_to_zarr_path, - json_pointer_to_zarr_path, + mask_dtype_for_plane_count, ) HAVE_ZARR = True @@ -177,7 +202,7 @@ except ImportError: @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") class CommonTestCase(unittest.TestCase): def test_pointer_round_trips(self) -> None: - original = ZarrPointerModel(path="/lsst/mask") + original = ZarrPointerModel(path="/lsst/psf/tree") recovered = ZarrPointerModel.model_validate_json(original.model_dump_json()) self.assertEqual(recovered, original) @@ -187,28 +212,35 @@ class CommonTestCase(unittest.TestCase): self.assertEqual(OME_VERSION, "0.5") self.assertGreaterEqual(LSST_VERSION, 1) - def test_archive_path_to_zarr_path(self) -> None: - # Empty archive path → top-level main JSON tree under lsst/. - self.assertEqual(archive_path_to_zarr_path(""), "/lsst/tree") - # Archive sub-paths land under lsst/ verbatim (no uppercase). - self.assertEqual(archive_path_to_zarr_path("/psf"), "/lsst/psf") - self.assertEqual(archive_path_to_zarr_path("/psf/coefficients"), "/lsst/psf/coefficients") - - def test_json_pointer_to_zarr_path(self) -> None: - # JSON pointer to a top-level array attribute → top-level zarr "0". - self.assertEqual(json_pointer_to_zarr_path("/image"), "/0") - # JSON pointer to a mask attribute → /lsst/mask/0. - self.assertEqual(json_pointer_to_zarr_path("/mask"), "/lsst/mask/0") - # Unknown JSON pointers fall through to a literal /lsst/ mapping. 
- self.assertEqual(json_pointer_to_zarr_path("/companions/extra"), "/lsst/companions/extra") - - def test_compression_options_default(self) -> None: - defaults = ZarrCompressionOptions.default_for_dtype("float32") - self.assertEqual(defaults.codec, "blosc") - # bitshuffle for ints, byte-shuffle for floats. - self.assertEqual(defaults.shuffle, "shuffle") - int_defaults = ZarrCompressionOptions.default_for_dtype("uint8") - self.assertEqual(int_defaults.shuffle, "bitshuffle") + def test_archive_path_translation(self) -> None: + # Empty archive path -> the canonical root-level JSON tree. + self.assertEqual(archive_path_to_zarr_path(""), "/tree") + # Non-empty archive paths are kept verbatim. + self.assertEqual(archive_path_to_zarr_path("/image"), "/image") + self.assertEqual(archive_path_to_zarr_path("image"), "/image") + self.assertEqual(archive_path_to_zarr_path("/red/image"), "/red/image") + self.assertEqual(archive_path_to_zarr_path("/psf"), "/psf") + + def test_compression_defaults(self) -> None: + floats = ZarrCompressionOptions.default_for_dtype("float32") + self.assertEqual(floats.codec, "blosc") + self.assertEqual(floats.shuffle, "shuffle") + ints = ZarrCompressionOptions.default_for_dtype("uint8") + self.assertEqual(ints.shuffle, "bitshuffle") + + def test_mask_dtype_picks_smallest_fit(self) -> None: + self.assertEqual(mask_dtype_for_plane_count(1), np.dtype("uint8")) + self.assertEqual(mask_dtype_for_plane_count(8), np.dtype("uint8")) + self.assertEqual(mask_dtype_for_plane_count(9), np.dtype("uint16")) + self.assertEqual(mask_dtype_for_plane_count(16), np.dtype("uint16")) + self.assertEqual(mask_dtype_for_plane_count(17), np.dtype("uint32")) + self.assertEqual(mask_dtype_for_plane_count(32), np.dtype("uint32")) + self.assertEqual(mask_dtype_for_plane_count(33), np.dtype("uint64")) + self.assertEqual(mask_dtype_for_plane_count(64), np.dtype("uint64")) + + def test_mask_dtype_refuses_more_than_64_planes(self) -> None: + with self.assertRaisesRegex(ValueError, 
"supports up to 64"): + mask_dtype_for_plane_count(65) if __name__ == "__main__": @@ -246,12 +278,13 @@ __all__ = ( "ZarrCompressionOptions", "ZarrPointerModel", "archive_path_to_zarr_path", - "json_pointer_to_zarr_path", + "mask_dtype_for_plane_count", ) from dataclasses import dataclass from typing import ClassVar, Self +import numpy as np import pydantic LSST_NS = "lsst" @@ -271,31 +304,17 @@ backwards-incompatible changes to the on-disk layout. """ -# Well-known archive-attribute → zarr-path mappings used by the layout rules. -# Keys are JSON-pointer suffixes (the ``name=`` argument passed to -# ``add_array``); values are absolute zarr paths inside the archive root. -# Anything not in this map falls through to a literal /lsst/ mapping. -_JSON_POINTER_TO_ZARR_PATH: dict[str, str] = { - "/image": "/0", - "/mask": "/lsst/mask/0", - "/variance": "/lsst/variance/0", - "/red": "/0", # ColorImage channels are stacked along the c axis at /0. - "/green": "/0", - "/blue": "/0", -} - - class ZarrPointerModel(pydantic.BaseModel): """Reference to a zarr archive sub-tree by absolute zarr path. - Used by `ZarrOutputArchive`/`ZarrInputArchive` to point to sub-trees - that have been hoisted out of the main JSON tree into separate zarr - groups. The path is interpreted relative to the archive root, e.g. - ``"/lsst/psf"``. + Used by `ZarrOutputArchive` / `ZarrInputArchive` to point to + sub-trees that have been hoisted out of the main JSON tree into + separate zarr arrays. The path is interpreted relative to the + archive root, e.g. ``"/lsst/psf/tree"``. """ path: str - """Absolute zarr path (e.g. ``/lsst/psf``).""" + """Absolute zarr path (e.g. ``/lsst/psf/tree``).""" @dataclass(frozen=True) @@ -305,8 +324,7 @@ class ZarrCompressionOptions: The default codec stack is ``bytes -> blosc(zstd, clevel=5)`` with byte-shuffle for floats and bit-shuffle for integers (and masks). 
All defaults are overridable per-array via the ``compression`` - keyword to ``write()``, keyed by the JSON pointer of the attribute - the array backs. + keyword to ``write()``. """ codec: str = "blosc" @@ -318,9 +336,11 @@ class ZarrCompressionOptions: DEFAULT_INT: ClassVar[Self] @classmethod - def default_for_dtype(cls, dtype: str) -> Self: - """Return the default codec stack for a numpy dtype name.""" - if dtype.startswith(("u", "i")) or dtype == "bool": + def default_for_dtype(cls, dtype: str | np.dtype) -> Self: + """Return the default codec stack for a numpy dtype.""" + kind = np.dtype(dtype).kind + # 'u' (unsigned int), 'i' (signed int), 'b' (bool) -> bit-shuffle. + if kind in ("u", "i", "b"): return cls.DEFAULT_INT return cls.DEFAULT_FLOAT @@ -332,39 +352,51 @@ ZarrCompressionOptions.DEFAULT_INT = ZarrCompressionOptions(shuffle="bitshuffle" def archive_path_to_zarr_path(archive_path: str) -> str: """Translate a serialization archive path to its zarr path. - The empty archive path maps to the main JSON tree at ``/lsst/tree``. - Non-empty archive paths are kept verbatim under ``/lsst/`` (zarr keys - have no length limit, unlike HDS). + The empty archive path maps to the root-level JSON tree at + ``/tree``. Non-empty archive paths are kept verbatim (with a + leading slash). The v1 design's JSON-pointer mapping table is + intentionally absent: arrays land where their archive name says + they do. """ if not archive_path: - return "/lsst/tree" + return "/tree" stripped = archive_path.strip("/") - return f"/{LSST_NS}/{stripped}" + return f"/{stripped}" -def json_pointer_to_zarr_path(pointer: str) -> str: - """Translate a JSON pointer attribute name to a zarr path. +def mask_dtype_for_plane_count(n_planes: int) -> np.dtype: + """Pick the smallest unsigned-integer dtype that holds ``n_planes`` bits. - Used by the output archive's ``add_array`` to figure out where in the - zarr store an array referenced by a Pydantic field should be - materialized. 
Falls through to ``archive_path_to_zarr_path`` for - unrecognised attribute names. + Returns ``uint8`` for ≤8 planes, ``uint16`` for ≤16, ``uint32`` + for ≤32, ``uint64`` for ≤64. Raises `ValueError` for >64 planes; + a 3-D fallback for that case is tracked as a follow-up. """ - if pointer in _JSON_POINTER_TO_ZARR_PATH: - return _JSON_POINTER_TO_ZARR_PATH[pointer] - return archive_path_to_zarr_path(pointer) + if n_planes <= 0: + raise ValueError(f"n_planes must be positive, got {n_planes}.") + if n_planes <= 8: + return np.dtype("uint8") + if n_planes <= 16: + return np.dtype("uint16") + if n_planes <= 32: + return np.dtype("uint32") + if n_planes <= 64: + return np.dtype("uint64") + raise ValueError( + f"Mask has {n_planes} planes; v1 supports up to 64. " + f"3-D fallback is a follow-up." + ) ``` - [ ] **Step 4: Run the test to verify it passes** Run: `pytest tests/test_zarr_common.py -v` -Expected: PASS — 5 tests pass. +Expected: PASS — 6 tests pass. - [ ] **Step 5: Commit** ```bash git add python/lsst/images/zarr/_common.py tests/test_zarr_common.py -git commit -m "feat: add ZarrPointerModel, ZarrCompressionOptions, and path helpers" +git commit -m "feat: add ZarrPointerModel, ZarrCompressionOptions, mask-dtype helper" ``` ### Task 1.3: IR — `ZarrAttributes` and `ZarrArray` with lazy backing @@ -375,11 +407,13 @@ git commit -m "feat: add ZarrPointerModel, ZarrCompressionOptions, and path help This task introduces the IR types whose **lazy-array invariant** is the heart of the efficient subsetting story. `ZarrArray.data` is one of: -- `numpy.ndarray` — staged for write (the user just called `add_array`) +- `numpy.ndarray` — staged for write - `zarr.Array` — read from a store, **never sliced eagerly** A read of a remote VisitImage opens its `zarr.Array` handle through `from_zarr`. Subsequent slicing (in `InputArchive.get_array(model, slices=...)`) goes straight to that handle, so only the chunks intersecting the slice are downloaded. 
+`ZarrAttributes` separates the `lsst:` and `ome:` namespaces (each gets its `version` field stamped automatically on `dump`) and preserves unknown keys for forward compatibility. Plain CF / xarray attributes like `_ARRAY_DIMENSIONS`, `flag_masks`, `flag_meanings`, `units` live in a third namespace called `extra` that round-trips verbatim — they're written at the top level of `zarr.json` `attributes` (no `lsst:` or `ome:` wrapper). + - [ ] **Step 1: Write the failing test** Create `tests/test_zarr_model.py`: @@ -417,62 +451,69 @@ except ImportError: class ZarrAttributesTestCase(unittest.TestCase): def test_dump_separates_namespaces(self) -> None: attrs = ZarrAttributes() - attrs.lsst["archive_class"] = "Image" + attrs.lsst["archive_class"] = "MaskedImage" attrs.ome["multiscales"] = [{"name": "image"}] + attrs.extra["_ARRAY_DIMENSIONS"] = ["y", "x"] + attrs.extra["units"] = "adu" dumped = attrs.dump() - self.assertEqual(dumped[LSST_NS]["archive_class"], "Image") + self.assertEqual(dumped[LSST_NS]["archive_class"], "MaskedImage") self.assertEqual(dumped[LSST_NS]["version"], LSST_VERSION) self.assertEqual(dumped[OME_NS]["multiscales"], [{"name": "image"}]) self.assertEqual(dumped[OME_NS]["version"], OME_VERSION) + # CF / xarray attrs sit at the top level, not inside lsst: or ome:. + self.assertEqual(dumped["_ARRAY_DIMENSIONS"], ["y", "x"]) + self.assertEqual(dumped["units"], "adu") def test_load_preserves_unknown_keys(self) -> None: # Forward compatibility: unknown lsst.* keys must survive a - # load → dump round-trip so a partial-knowledge reader can - # still re-emit them on write. + # load -> dump round-trip. 
raw = { - LSST_NS: {"version": LSST_VERSION, "archive_class": "Image", "future_thing": {"x": 1}}, + LSST_NS: { + "version": LSST_VERSION, + "archive_class": "Image", + "future_thing": {"x": 1}, + }, OME_NS: {"version": OME_VERSION, "multiscales": []}, + "_ARRAY_DIMENSIONS": ["y", "x"], + "units": "adu", } attrs = ZarrAttributes.load(raw) dumped = attrs.dump() self.assertEqual(dumped[LSST_NS]["future_thing"], {"x": 1}) + self.assertEqual(dumped["units"], "adu") @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") class ZarrArrayTestCase(unittest.TestCase): def test_lazy_data_after_from_zarr(self) -> None: - # Write an array to an in-memory store and load via from_zarr. - # The IR must hold the zarr.Array handle, NOT a numpy array. store = zarr.storage.MemoryStore() root = zarr.create_group(store=store, zarr_format=3) - zarr_array = root.create_array(name="0", shape=(8, 8), chunks=(4, 4), dtype="float32") + zarr_array = root.create_array( + name="image", shape=(8, 8), chunks=(4, 4), dtype="float32" + ) zarr_array[:] = np.arange(64, dtype=np.float32).reshape(8, 8) ir_array = ZarrArray.from_zarr(zarr_array) - # Critical invariant: data is the lazy zarr.Array, not numpy. + # Lazy invariant: data is the zarr.Array handle, not numpy. self.assertIsInstance(ir_array.data, zarr.Array) self.assertNotIsInstance(ir_array.data, np.ndarray) - # Shape/dtype come from the lazy handle without reading bytes. self.assertEqual(ir_array.shape, (8, 8)) self.assertEqual(str(ir_array.dtype), "float32") def test_subset_does_not_materialize_full_array(self) -> None: - # The IR must let callers slice through to the lazy handle so - # only the touched chunks are read. We simulate "remote bytes - # consumed" by counting key fetches against the underlying store. 
store = _CountingStore() root = zarr.create_group(store=store, zarr_format=3) - zarr_array = root.create_array(name="0", shape=(16, 16), chunks=(4, 4), dtype="int32") + zarr_array = root.create_array( + name="image", shape=(16, 16), chunks=(4, 4), dtype="int32" + ) zarr_array[:] = np.arange(256, dtype=np.int32).reshape(16, 16) store.reads = 0 # reset after the write phase ir_array = ZarrArray.from_zarr(zarr_array) - # Reading ir_array.shape must not fetch any chunk data. + # Reading shape / dtype must not fetch any chunk data. self.assertEqual(ir_array.shape, (16, 16)) self.assertEqual(store.reads, 0) - # A 4×4 subset spans exactly one chunk; reading it must touch at - # most that chunk's data key (plus zero or one metadata keys). subset = ir_array.read(slices=(slice(0, 4), slice(0, 4))) self.assertEqual(subset.shape, (4, 4)) np.testing.assert_array_equal(subset, np.arange(256).reshape(16, 16)[:4, :4]) @@ -480,7 +521,6 @@ class ZarrArrayTestCase(unittest.TestCase): self.assertLess(store.reads, 16) def test_staged_numpy_array_is_eager(self) -> None: - # Pre-write IR (caller just called add_array): data is numpy. data = np.arange(12, dtype=np.float64).reshape(3, 4) ir_array = ZarrArray(data=data) self.assertIs(ir_array.data, data) @@ -488,9 +528,7 @@ class ZarrArrayTestCase(unittest.TestCase): class _CountingStore(zarr.storage.MemoryStore if HAVE_ZARR else object): - """A MemoryStore that counts get() calls so we can prove subset reads - only touch the chunks they need. - """ + """A MemoryStore that counts get() calls.""" def __init__(self) -> None: super().__init__() @@ -512,7 +550,7 @@ Expected: FAIL — `ImportError` on `lsst.images.zarr._model`. - [ ] **Step 3: Write `_model.py` (initial portion: `ZarrAttributes` and `ZarrArray`)** -Create `python/lsst/images/zarr/_model.py` with this content (further IR types are appended in Task 1.4): +Create `python/lsst/images/zarr/_model.py`: ```python # This file is part of lsst-images. 
@@ -526,7 +564,7 @@ Create `python/lsst/images/zarr/_model.py` with this content (further IR types a # Use of this source code is governed by a 3-clause BSD-style # license that can be found in the LICENSE file. -"""Python intermediate representation for OME-Zarr / lsst-extension content. +"""Python intermediate representation for zarr / xarray-CF / OME-NGFF content. The IR is the source of truth for what gets written. ``ZarrOutputArchive`` populates a `ZarrDocument`; on context-manager exit, `to_zarr` materializes @@ -561,19 +599,25 @@ from ._common import LSST_NS, LSST_VERSION, OME_NS, OME_VERSION, ZarrCompression class ZarrAttributes: """Namespaced attributes attached to a `ZarrGroup` or `ZarrArray`. - The two top-level namespaces (``lsst:`` and ``ome:``) are kept - separate so the IR can serve both internal LSST round-trip needs - and external OME-Zarr discoverability without key collisions. + Three namespaces: + + - ``lsst`` — LSST extensions (always emitted with a ``version`` key). + - ``ome`` — OME-NGFF (emitted only when non-empty). + - ``extra`` — flat top-level keys for CF / xarray conventions + (``_ARRAY_DIMENSIONS``, ``flag_masks``, ``flag_meanings``, + ``flag_descriptions``, ``units``, ``long_name``, …). These live at + the top of ``zarr.json`` ``attributes`` so xarray and CF tooling + see them without unwrapping a namespace. """ lsst: dict[str, Any] = field(default_factory=dict) ome: dict[str, Any] = field(default_factory=dict) + extra: dict[str, Any] = field(default_factory=dict) def dump(self) -> dict[str, Any]: """Return the raw mapping zarr-python writes to ``zarr.json``.""" - out: dict[str, Any] = {} - # Always emit lsst namespace with a schema version; readers - # use this to gate forward-compat behavior. + out: dict[str, Any] = dict(self.extra) + # lsst is always present so readers can dispatch on lsst.archive_class. 
out[LSST_NS] = {"version": LSST_VERSION, **self.lsst} if self.ome: out[OME_NS] = {"version": OME_VERSION, **self.ome} @@ -583,10 +627,11 @@ class ZarrAttributes: def load(cls, raw: dict[str, Any]) -> Self: """Construct from a raw attributes mapping read from zarr.""" lsst = dict(raw.get(LSST_NS, {})) - lsst.pop("version", None) # version is implicit in the namespace + lsst.pop("version", None) # version implicit in the namespace ome = dict(raw.get(OME_NS, {})) ome.pop("version", None) - return cls(lsst=lsst, ome=ome) + extra = {k: v for k, v in raw.items() if k not in (LSST_NS, OME_NS)} + return cls(lsst=lsst, ome=ome, extra=extra) @dataclass @@ -608,7 +653,7 @@ class ZarrArray: resulting shard exceeds 1 MiB. compression Codec configuration. ``None`` falls back to - ``ZarrCompressionOptions.default_for_dtype(dtype)``. + `ZarrCompressionOptions.default_for_dtype`. attributes Namespaced attributes for this array's ``zarr.json``. """ @@ -631,9 +676,6 @@ class ZarrArray: def from_zarr(cls, zarr_array: zarr.Array) -> Self: """Wrap an open ``zarr.Array`` without reading its data.""" attrs = ZarrAttributes.load(dict(zarr_array.attrs)) - # Reading shape/dtype off the open handle is metadata-only; chunks - # are read off the array's chunk_grid configuration. We do NOT - # call zarr_array[:] here. return cls( data=zarr_array, chunks=tuple(zarr_array.chunks), @@ -645,19 +687,17 @@ class ZarrArray: For a `ZarrArray` backed by a lazy handle, this is the only place that touches array bytes. ``slices`` is forwarded straight - to the handle so only the chunks intersecting the slice are - fetched. + to the handle so only chunks intersecting the slice are fetched. """ if isinstance(self.data, np.ndarray): return self.data if slices is ... else self.data[slices] - # zarr.Array supports lazy slicing via __getitem__. return self.data[...] if slices is ... 
else self.data[slices] ``` -- [ ] **Step 4: Run the test to verify it passes** +- [ ] **Step 4: Run the tests to verify they pass** Run: `pytest tests/test_zarr_model.py -v` -Expected: PASS — 4 tests pass; the `_CountingStore` test confirms a 4×4 subset of a 16×16 / chunks=(4,4) array touches strictly fewer than 16 chunk reads. +Expected: PASS — 5 tests pass; the `_CountingStore` test confirms a 4×4 subset of a 16×16 / chunks=(4,4) array touches strictly fewer than 16 chunk reads. - [ ] **Step 5: Commit** @@ -666,17 +706,17 @@ git add python/lsst/images/zarr/_model.py tests/test_zarr_model.py git commit -m "feat: add ZarrAttributes and ZarrArray IR with lazy zarr.Array backing" ``` -### Task 1.4: IR — `ZarrGroup`, `ZarrDocument`, and store materialization +### Task 1.4: IR — `ZarrGroup`, `ZarrDocument`, store materialization **Files:** -- Modify: `python/lsst/images/zarr/_model.py` (append `ZarrGroup`, `ZarrDocument`, materialization helpers) +- Modify: `python/lsst/images/zarr/_model.py` (append `ZarrGroup`, `ZarrDocument`, helpers) - Modify: `tests/test_zarr_model.py` (add round-trip test through `MemoryStore`) -This task gives the IR a full tree shape and the bidirectional `to_zarr` / `from_zarr` materialization. The round-trip test pins the lazy invariant: `from_zarr` on a freshly-opened store does not read any chunk bytes; only `ZarrArray.read()` does. +This task gives the IR a full tree shape and the bidirectional `to_zarr` / `from_zarr` materialization. The round-trip test pins the lazy invariant: after `from_zarr` on a freshly-opened store, every `ZarrArray.data` is a `zarr.Array`, not a materialized ndarray. 
- [ ] **Step 1: Write the failing test (extend `test_zarr_model.py`)** -Append this class to `tests/test_zarr_model.py` (before the `if __name__` guard): +Append before the `if __name__` guard: ```python @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") @@ -684,30 +724,67 @@ class ZarrDocumentTestCase(unittest.TestCase): def test_round_trip_through_memory_store(self) -> None: from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup - # Build an IR: top-level array at /0, a sub-group at /lsst/mask - # with its own array at /lsst/mask/0. + # Build a flat IR: image, variance, mask siblings at root. doc = ZarrDocument(root=ZarrGroup()) doc.root.attributes.lsst["archive_class"] = "MaskedImage" - doc.root.arrays["0"] = ZarrArray(data=np.ones((4, 4), dtype="float32")) - mask = ZarrGroup() - mask.arrays["0"] = ZarrArray(data=np.zeros((1, 4, 4), dtype="uint8")) - mask.attributes.lsst["mask"] = {"planes": [{"name": "BAD", "bit": 0}]} - doc.root.groups["lsst"] = ZarrGroup(groups={"mask": mask}) + doc.root.attributes.lsst["tree"] = "tree" + + image = ZarrArray(data=np.ones((4, 4), dtype="float32")) + image.attributes.extra["_ARRAY_DIMENSIONS"] = ["y", "x"] + doc.root.arrays["image"] = image + + mask = ZarrArray(data=np.zeros((4, 4), dtype="uint8")) + mask.attributes.extra["_ARRAY_DIMENSIONS"] = ["y", "x"] + mask.attributes.extra["flag_masks"] = [1, 2] + mask.attributes.extra["flag_meanings"] = "BAD SAT" + doc.root.arrays["mask"] = mask + + # Stub a 1-D uint8 'tree' array (JSON bytes). + doc.root.arrays["tree"] = ZarrArray( + data=np.frombuffer(b"{}", dtype=np.uint8) + ) store = zarr.storage.MemoryStore() doc.to_zarr(store) - # Reload and verify lazy invariant: ZarrArray.data is a zarr.Array, - # not a materialized ndarray. + # Reload and verify lazy invariant on every array. 
recovered = ZarrDocument.from_zarr(store) - self.assertIsInstance(recovered.root.arrays["0"].data, zarr.Array) + self.assertIsInstance(recovered.root.arrays["image"].data, zarr.Array) + self.assertIsInstance(recovered.root.arrays["mask"].data, zarr.Array) self.assertEqual( recovered.root.attributes.lsst["archive_class"], "MaskedImage" ) - recovered_mask = recovered.root.groups["lsst"].groups["mask"] + # CF flag attrs round-trip via the extra namespace. + self.assertEqual( + recovered.root.arrays["mask"].attributes.extra["flag_meanings"], + "BAD SAT", + ) + # xarray dims round-trip. + self.assertEqual( + recovered.root.arrays["image"].attributes.extra["_ARRAY_DIMENSIONS"], + ["y", "x"], + ) + # Subset reads still go through the lazy handle. np.testing.assert_array_equal( - recovered_mask.arrays["0"].read(), np.zeros((1, 4, 4), dtype="uint8") + recovered.root.arrays["image"].read(), np.ones((4, 4), dtype="float32") ) + + def test_get_walks_paths(self) -> None: + from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup + + doc = ZarrDocument(root=ZarrGroup()) + doc.root.arrays["image"] = ZarrArray(data=np.zeros((2, 2), dtype="float32")) + red = doc.root.ensure_group("/red") + red.arrays["image"] = ZarrArray(data=np.ones((2, 2), dtype="float32")) + + # Absolute and relative paths. + self.assertIs(doc.root.get("/image"), doc.root.arrays["image"]) + self.assertIs(doc.root.get("image"), doc.root.arrays["image"]) + self.assertIs(doc.root.get("/red/image"), red.arrays["image"]) + self.assertIs(doc.root.get("/"), doc.root) + + with self.assertRaises(KeyError): + doc.root.get("/missing") ``` - [ ] **Step 2: Run the test to verify it fails** @@ -715,7 +792,7 @@ class ZarrDocumentTestCase(unittest.TestCase): Run: `pytest tests/test_zarr_model.py::ZarrDocumentTestCase -v` Expected: FAIL — `ImportError` for `ZarrGroup` / `ZarrDocument`. 
-- [ ] **Step 3: Append `ZarrGroup`, `ZarrDocument`, materialization helpers to `_model.py`** +- [ ] **Step 3: Append `ZarrGroup`, `ZarrDocument`, helpers** Update the `__all__` and append to `python/lsst/images/zarr/_model.py`: @@ -770,7 +847,7 @@ class ZarrGroup: @dataclass class ZarrDocument: - """A complete OME-Zarr archive root.""" + """A complete zarr archive root.""" root: ZarrGroup = field(default_factory=ZarrGroup) @@ -781,11 +858,7 @@ class ZarrDocument: return cls(root=_group_from_zarr(zarr_root)) def to_zarr(self, store: zarr.storage.Store) -> None: - """Materialize this IR into ``store``. - - The store is expected to be empty; callers (the output archive's - write helper) are responsible for create-only enforcement. - """ + """Materialize this IR into ``store`` (which must be empty).""" zarr_root = zarr.create_group(store=store, zarr_format=3, overwrite=False) _group_to_zarr(self.root, zarr_root) @@ -838,12 +911,7 @@ def _default_chunks(shape: tuple[int, ...]) -> tuple[int, ...]: def _build_codecs(options: ZarrCompressionOptions) -> list[Any]: - """Build a zarr v3 codec stack from `ZarrCompressionOptions`. - - For zarr v3 the codec list always begins with the bytes codec - followed by compressors. Blosc supports byte-shuffle and - bit-shuffle via its ``shuffle`` argument. - """ + """Build a zarr v3 codec stack from `ZarrCompressionOptions`.""" from numcodecs.zarr3 import Blosc if options.codec != "blosc": @@ -857,7 +925,7 @@ def _build_codecs(options: ZarrCompressionOptions) -> list[Any]: - [ ] **Step 4: Run all model tests** Run: `pytest tests/test_zarr_model.py -v` -Expected: PASS — all 5 tests pass; the round-trip test confirms `recovered.root.arrays["0"].data` is a `zarr.Array`, not a numpy array. +Expected: PASS — all tests pass; round-trip test confirms `.data` is a `zarr.Array` after `from_zarr`. 
- [ ] **Step 5: Commit** @@ -866,13 +934,13 @@ git add python/lsst/images/zarr/_model.py tests/test_zarr_model.py git commit -m "feat: add ZarrGroup and ZarrDocument with lazy-on-read materialization" ``` -### Task 1.5: IR — OME multiscales, omero channels, mask group, table group +### Task 1.5: IR — OME and CF helper dataclasses **Files:** -- Modify: `python/lsst/images/zarr/_model.py` (append OME / LSST helper dataclasses) -- Modify: `tests/test_zarr_model.py` (add helper-construction test) +- Modify: `python/lsst/images/zarr/_model.py` (append OME / CF helpers) +- Modify: `tests/test_zarr_model.py` (helper-construction test) -These small dataclasses centralize the OME and LSST attribute shapes so `_layout.py` can populate them without literal-dict-typo bugs. They round-trip through `ZarrAttributes.dump()` / `load()`. +These small dataclasses centralize the OME and CF attribute shapes so `_layout.py` can populate them without literal-dict-typo bugs. - [ ] **Step 1: Write the failing test** @@ -880,45 +948,90 @@ Append to `tests/test_zarr_model.py`: ```python @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") -class OmeAndLsstHelpersTestCase(unittest.TestCase): +class OmeCfHelpersTestCase(unittest.TestCase): def test_multiscale_emits_expected_shape(self) -> None: from lsst.images.zarr._model import OmeMultiscale - m = OmeMultiscale(name="visitimage", axes=("y", "x")) + m = OmeMultiscale( + name="visitimage", + axes=("y", "x"), + dataset_path="image", + ) d = m.dump() self.assertEqual(d["name"], "visitimage") self.assertEqual( - d["axes"], [{"name": "y", "type": "space"}, {"name": "x", "type": "space"}] + d["axes"], + [ + {"name": "y", "type": "space", "unit": "pixel"}, + {"name": "x", "type": "space", "unit": "pixel"}, + ], + ) + self.assertEqual(d["datasets"][0]["path"], "image") + # Default coordinate transform is unit scale until a real one is set. 
+ self.assertEqual( + d["datasets"][0]["coordinateTransformations"], + [{"type": "scale", "scale": [1.0, 1.0]}], + ) + + def test_multiscale_with_affine(self) -> None: + from lsst.images.zarr._model import OmeMultiscale + + m = OmeMultiscale( + name="image", + axes=("y", "x"), + dataset_path="image", + coordinate_transformations=[ + {"type": "scale", "scale": [0.2, 0.2]}, + { + "type": "affine", + "affine": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], + }, + ], + ) + d = m.dump() + self.assertEqual(len(d["datasets"][0]["coordinateTransformations"]), 2) + self.assertEqual( + d["datasets"][0]["coordinateTransformations"][0]["type"], "scale" ) - self.assertEqual(d["datasets"], [{"path": "0"}]) - def test_lsst_mask_group_round_trip(self) -> None: - from lsst.images.zarr._model import LsstMaskGroup, MaskPlaneEntry + def test_cf_flag_attributes(self) -> None: + from lsst.images.zarr._model import CfFlagAttributes, MaskPlaneEntry - group = LsstMaskGroup( - planes=[MaskPlaneEntry(name="BAD", bit=0, description="Bad pixel.")] + cf = CfFlagAttributes( + planes=[ + MaskPlaneEntry(name="BAD", bit=0, description="Bad pixel."), + MaskPlaneEntry(name="SAT", bit=1, description="Saturated."), + MaskPlaneEntry(name="CR", bit=2, description="Cosmic ray."), + ] + ) + d = cf.dump() + self.assertEqual(d["flag_masks"], [1, 2, 4]) + self.assertEqual(d["flag_meanings"], "BAD SAT CR") + self.assertEqual( + d["flag_descriptions"], ["Bad pixel.", "Saturated.", "Cosmic ray."] ) - dumped = group.dump() - recovered = LsstMaskGroup.load(dumped) - self.assertEqual(len(recovered.planes), 1) - self.assertEqual(recovered.planes[0].name, "BAD") - self.assertEqual(recovered.planes[0].bit, 0) + + def test_image_array_attrs(self) -> None: + from lsst.images.zarr._model import build_image_array_attrs + + attrs = build_image_array_attrs(axes=("y", "x"), units="adu", long_name="science image") + self.assertEqual(attrs["_ARRAY_DIMENSIONS"], ["y", "x"]) + self.assertEqual(attrs["units"], "adu") + 
self.assertEqual(attrs["long_name"], "science image") ``` - [ ] **Step 2: Run to verify failure** -Run: `pytest tests/test_zarr_model.py::OmeAndLsstHelpersTestCase -v` +Run: `pytest tests/test_zarr_model.py::OmeCfHelpersTestCase -v` Expected: FAIL — `ImportError`. -- [ ] **Step 3: Append OME and LSST helpers to `_model.py`** +- [ ] **Step 3: Append helpers to `_model.py`** -Update the `__all__` and append to `python/lsst/images/zarr/_model.py`: +Update the `__all__` and append: ```python __all__ = ( - "LsstMaskGroup", - "LsstTableColumn", - "LsstTableGroup", + "CfFlagAttributes", "MaskPlaneEntry", "OmeMultiscale", "OmeOmeroChannel", @@ -926,6 +1039,7 @@ __all__ = ( "ZarrAttributes", "ZarrDocument", "ZarrGroup", + "build_image_array_attrs", ) @@ -933,36 +1047,45 @@ __all__ = ( class OmeMultiscale: """OME-NGFF v0.5 multiscales metadata for a single-level image. - The backend always writes one level (``path=0``); pyramid generation - is a follow-up tracked in the design spec. + The backend always writes one level whose ``path`` points at a + sibling array (``image`` for typical archives). ``coordinate_transformations`` + defaults to a unit ``scale`` so the OME block is well-formed even + when the simplified affine is dropped by the residual validator. """ name: str axes: tuple[str, ...] 
- coordinate_transformations: list[dict[str, Any]] = field(default_factory=list) + dataset_path: str = "image" + coordinate_transformations: list[dict[str, Any]] | None = None @staticmethod - def _axis_type(name: str) -> str: + def _axis_block(name: str) -> dict[str, Any]: if name == "c": - return "channel" + return {"name": "c", "type": "channel"} if name == "t": - return "time" - return "space" + return {"name": "t", "type": "time"} + return {"name": name, "type": "space", "unit": "pixel"} def dump(self) -> dict[str, Any]: - dataset: dict[str, Any] = {"path": "0"} - if self.coordinate_transformations: - dataset["coordinateTransformations"] = list(self.coordinate_transformations) + ndim = len(self.axes) + ct = self.coordinate_transformations + if ct is None: + ct = [{"type": "scale", "scale": [1.0] * ndim}] return { "name": self.name, - "axes": [{"name": a, "type": self._axis_type(a)} for a in self.axes], - "datasets": [dataset], + "axes": [self._axis_block(a) for a in self.axes], + "datasets": [ + { + "path": self.dataset_path, + "coordinateTransformations": ct, + } + ], } @dataclass class OmeOmeroChannel: - """OME ``omero/channels`` entry for one channel of a multi-channel image.""" + """OME ``omero/channels`` entry (used only when a channel axis exists).""" label: str color: str | None = None @@ -976,7 +1099,7 @@ class OmeOmeroChannel: @dataclass class MaskPlaneEntry: - """One mask-plane definition under ``lsst.mask.planes``.""" + """One mask-plane definition.""" name: str bit: int @@ -984,86 +1107,73 @@ class MaskPlaneEntry: @dataclass -class LsstMaskGroup: - """Helper for ``lsst.mask`` attributes on a mask sub-group.""" +class CfFlagAttributes: + """CF-conventions flag metadata for a 2-D packed mask array. + + Emits ``flag_masks`` (list of bit values), ``flag_meanings`` + (single space-separated string per CF), and the LSST extension + ``flag_descriptions`` (list of human-readable strings parallel to + ``flag_meanings``). 
+ """ planes: list[MaskPlaneEntry] = field(default_factory=list) def dump(self) -> dict[str, Any]: return { - "planes": [ - {"name": p.name, "bit": p.bit, "description": p.description} - for p in self.planes - ] + "flag_masks": [int(1 << p.bit) for p in self.planes], + "flag_meanings": " ".join(p.name for p in self.planes), + "flag_descriptions": [p.description for p in self.planes], } @classmethod def load(cls, raw: dict[str, Any]) -> Self: - return cls( - planes=[ - MaskPlaneEntry( - name=p["name"], - bit=int(p["bit"]), - description=p.get("description", ""), - ) - for p in raw.get("planes", []) - ] - ) - - -@dataclass -class LsstTableColumn: - """One column entry under ``lsst.table.columns``.""" - - name: str - dtype: str - unit: str | None = None - description: str = "" - - -@dataclass -class LsstTableGroup: - """Helper for ``lsst.table`` attributes on a table sub-group.""" - - columns: list[LsstTableColumn] = field(default_factory=list) - length: int = 0 - meta: dict[str, Any] = field(default_factory=dict) - - def dump(self) -> dict[str, Any]: - return { - "columns": [ - { - "name": c.name, - "dtype": c.dtype, - "unit": c.unit, - "description": c.description, - } - for c in self.columns - ], - "length": self.length, - "meta": self.meta, - } + meanings = raw.get("flag_meanings", "").split() + masks = [int(m) for m in raw.get("flag_masks", [])] + descriptions = list(raw.get("flag_descriptions", [""] * len(meanings))) + planes = [] + for name, mask, desc in zip(meanings, masks, descriptions, strict=False): + # Recover bit position from the mask value (always a power of 2). 
+ bit = (mask & -mask).bit_length() - 1 + planes.append(MaskPlaneEntry(name=name, bit=bit, description=desc)) + return cls(planes=planes) + + +def build_image_array_attrs( + *, + axes: tuple[str, ...], + units: str | None = None, + long_name: str | None = None, +) -> dict[str, Any]: + """Build the CF / xarray attribute block for a 2-D-or-higher image array.""" + out: dict[str, Any] = {"_ARRAY_DIMENSIONS": list(axes)} + if units is not None: + out["units"] = units + if long_name is not None: + out["long_name"] = long_name + return out ``` - [ ] **Step 4: Run all model tests** Run: `pytest tests/test_zarr_model.py -v` -Expected: PASS — 7 tests. +Expected: PASS — 9 tests. - [ ] **Step 5: Commit** ```bash git add python/lsst/images/zarr/_model.py tests/test_zarr_model.py -git commit -m "feat: add OmeMultiscale, OmeOmeroChannel, LsstMaskGroup, LsstTableGroup helpers" +git commit -m "feat: add OmeMultiscale, CfFlagAttributes, image-array-attrs helpers" ``` --- +**End of Phase 1.** Five tasks. The IR is in place with the lazy invariant pinned by `_CountingStore`, the CF / OME helpers are unit-tested in isolation, and `ZarrAttributes` separates `lsst:` / `ome:` / top-level (`extra`) namespaces so xarray and CF tooling see flat attributes without unwrapping. Phase 2 wires `_store.py`, `_layout.py`, and `ZarrOutputArchive` against this IR for `Image` / `MaskedImage` / `VisitImage`. + ## Phase 2 — Store dispatch, layout rules, and `ZarrOutputArchive` (Image / MaskedImage / VisitImage) -This phase adds enough machinery to **write** a plain `Image`, a `MaskedImage`, and a `VisitImage` to a zarr archive on disk and on a `ZipStore`. No reading yet — that lands in Phase 3 — so tests inspect the on-disk shape via `ZarrDocument.from_zarr()` directly (the same lazy IR) and assert on attributes/paths/shapes. +This phase adds enough machinery to **write** a plain `Image`, a `MaskedImage`, and a `VisitImage` to a zarr archive on disk and on a `ZipStore`. 
No reading yet — that lands in Phase 3 — so tests inspect the on-disk shape via `ZarrDocument.from_zarr()` directly. `ColorImage` and `CellCoadd` are deferred to Phase 4.
 
-`ColorImage` and `CellCoadd` are deferred to Phase 4. `add_table` / `add_structured_array` produce native zarr tables in this phase (the JSON tree carries the references) so VisitImage's tabular components round-trip.
+The output archive's `add_array(name)` method writes to a zarr path equal to `name` (after stripping the leading slash). There is **no JSON-pointer mapping table** and **no fixup pass**. Mask arrays go through a small specialization that packs a 3-D `(y, x, mask_size)` in-memory mask into the 2-D wide-integer on-disk form and attaches CF flag attrs.
 
 ### Task 2.1: `_store.py` — URI → `zarr.storage.Store` dispatch
 
@@ -1071,7 +1181,7 @@ This phase adds enough machinery to **write** a plain `Image`, a `MaskedImage`,
 - Create: `python/lsst/images/zarr/_store.py`
 - Test: `tests/test_zarr_store.py`
 
-The store layer is the only place that knows about `lsst.resources.ResourcePath`. The output archive and input archive both call `open_store_for_write(uri, ...)` / `open_store_for_read(uri)` and treat the result as an opaque `zarr.storage.Store`. URI dispatch:
+URI dispatch:
 
 | URI shape | Store |
 |---|---|
@@ -1079,7 +1189,7 @@ This phase adds enough machinery to **write** a plain `Image`, a `MaskedImage`,
 | `file://` or local path | `zarr.storage.LocalStore` |
 | `http(s)://`, `s3://`, `gs://`, etc. | `zarr.storage.FsspecStore` (via `fsspec.url_to_fs`) |
 
-Create-only mode is enforced here, not in `to_zarr`: the write helpers refuse to open a non-empty existing store.
+Create-only mode is enforced here: write helpers refuse to open a non-empty existing store.
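
The dispatch rules and the create-only check can be sketched with the standard library alone. This is an illustration under stated assumptions, not the planned implementation: `classify_store` and `ensure_create_only` are hypothetical names, and the real `_store.py` resolves URIs through `lsst.resources.ResourcePath` rather than `urllib`. The sketch also assumes POSIX-style paths, since a Windows drive letter would be misread as a URI scheme.

```python
import os
from urllib.parse import urlparse


def classify_store(uri: str) -> str:
    """Name the store kind that a URI would dispatch to (illustrative)."""
    parsed = urlparse(uri)
    # The zip suffix wins regardless of scheme.
    if parsed.path.endswith(".zarr.zip"):
        return "ZipStore"
    # No scheme or file:// means a local directory store.
    if parsed.scheme in ("", "file"):
        return "LocalStore"
    # Anything else (http, https, s3, gs, ...) is treated as remote.
    return "FsspecStore"


def ensure_create_only(local_dir: str) -> None:
    """Raise if a local directory store already holds any entries."""
    if os.path.exists(local_dir) and os.listdir(local_dir):
        raise OSError(f"Directory {local_dir!r} already exists and is non-empty.")


kinds = [
    classify_store(u)
    for u in ("out.zarr", "file:///tmp/out.zarr", "/tmp/out.zarr.zip", "s3://bucket/out.zarr")
]
print(kinds)
```

Checking the zip suffix before the scheme follows the order used in the plan's `open_store_for_write`, where a remote `*.zarr.zip` target is recognized as a zip first and only then rejected as unsupported for writes.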
- [ ] **Step 1: Write the failing test** @@ -1121,7 +1231,6 @@ class StoreDispatchTestCase(unittest.TestCase): with open_store_for_write(target) as store: self.assertIsInstance(store, zarr.storage.LocalStore) zarr.create_group(store=store, zarr_format=3) - # Re-opening for read returns a usable store. with open_store_for_read(target) as store: self.assertIsInstance(store, zarr.storage.LocalStore) root = zarr.open_group(store=store, mode="r") @@ -1153,7 +1262,7 @@ if __name__ == "__main__": - [ ] **Step 2: Run the test to verify it fails** Run: `pytest tests/test_zarr_store.py -v` -Expected: FAIL — `ImportError` on `lsst.images.zarr._store`. +Expected: FAIL — `ImportError`. - [ ] **Step 3: Write `_store.py`** @@ -1196,15 +1305,15 @@ def _is_remote(rp: ResourcePath) -> bool: def open_store_for_write(path: ResourcePathExpression) -> Iterator[zarr.storage.Store]: """Open a zarr store for writing. - Refuses to overwrite an existing non-empty store. The returned + Refuses to overwrite a non-empty existing store. The returned context manager closes the store on exit; for ``ZipStore`` this - finalizes the central directory (atomic write). + finalizes the central directory. """ rp = ResourcePath(path) if _is_zip(rp): - local = rp.ospath if not _is_remote(rp) else None - if local is None: + if _is_remote(rp): raise NotImplementedError("Remote ZipStore writes are a follow-up.") + local = rp.ospath if os.path.exists(local) and os.path.getsize(local) > 0: raise OSError(f"File {local!r} already exists.") store = zarr.storage.ZipStore(local, mode="w") @@ -1217,25 +1326,17 @@ def open_store_for_write(path: ResourcePathExpression) -> Iterator[zarr.storage. import fsspec fs, fs_path = fsspec.url_to_fs(str(rp)) - # FsspecStore does its own existence check via the fs. 
if fs.exists(fs_path) and fs.ls(fs_path): raise OSError(f"Store {rp!s} already exists.") store = zarr.storage.FsspecStore(fs=fs, path=fs_path, read_only=False) - try: - yield store - finally: - pass # FsspecStore has no explicit close. + yield store return - # Local directory. local = rp.ospath if os.path.exists(local) and os.listdir(local): raise OSError(f"Directory {local!r} already exists and is non-empty.") os.makedirs(local, exist_ok=True) store = zarr.storage.LocalStore(local, read_only=False) - try: - yield store - finally: - pass + yield store @contextmanager @@ -1244,8 +1345,6 @@ def open_store_for_read(path: ResourcePathExpression) -> Iterator[zarr.storage.S rp = ResourcePath(path) if _is_zip(rp): if _is_remote(rp): - # Materialize remote zips locally first; remote ZipStore is a - # follow-up. with rp.as_local() as local: store = zarr.storage.ZipStore(local.ospath, mode="r") try: @@ -1270,10 +1369,10 @@ def open_store_for_read(path: ResourcePathExpression) -> Iterator[zarr.storage.S yield store ``` -- [ ] **Step 4: Run the tests to verify they pass** +- [ ] **Step 4: Run the tests** Run: `pytest tests/test_zarr_store.py -v` -Expected: PASS — 3 tests pass. +Expected: PASS — 3 tests. - [ ] **Step 5: Commit** @@ -1282,19 +1381,21 @@ git add python/lsst/images/zarr/_store.py tests/test_zarr_store.py git commit -m "feat: add zarr store dispatch (LocalStore / ZipStore / FsspecStore)" ``` -### Task 2.2: `_layout.py` — archive-class axes and ColorImage transpose +### Task 2.2: `_layout.py` — axes per archive class and chunk derivation **Files:** - Create: `python/lsst/images/zarr/_layout.py` - Test: `tests/test_zarr_layout.py` -`_layout.py` centralizes per-archive-class decisions so `_output_archive.py` and `_input_archive.py` stay generic. The functions it exposes: +This task adds the per-archive-class layout rules: axis tuples and chunk-shape derivation. 
Chunk derivation honors three sources of truth in priority order: -- `axes_for_archive_class(name)` — returns the OME axis tuple -- `chunks_for(archive_class, shape, override)` — derives default chunks -- `transpose_color_image_in(array)` / `transpose_color_image_out(array)` — `(Y, X, 3) ↔ (3, Y, X)` (used in Phase 4 by ColorImage; included now to lock the contract) +1. Explicit per-array override (from `write(chunks={...})`). +2. `cell_shape` from the archive metadata (for `CellCoadd`). +3. `min(1024, dim)` per axis fallback. -`ColorImage` and `CellCoadd` rules ship in this task as data, but the output archive doesn't exercise them until Phase 4. +A separate helper `chunks_aligned_to(image_chunks, shape)` derives `variance`/`mask` chunks from the `image` array's chunks so siblings stay aligned (CF / xarray / GDAL all assume this). The output archive will call this helper when the user has not overridden the sibling's chunks. + +The affine residual validator lands in Task 2.3 (separate task because it has its own test surface). - [ ] **Step 1: Write the failing test** @@ -1316,14 +1417,11 @@ from __future__ import annotations import unittest -import numpy as np - try: from lsst.images.zarr._layout import ( axes_for_archive_class, + chunks_aligned_to, chunks_for, - transpose_color_image_in, - transpose_color_image_out, ) HAVE_ZARR = True @@ -1334,39 +1432,47 @@ except ImportError: @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") class LayoutTestCase(unittest.TestCase): def test_axes_for_archive_class(self) -> None: + # Standard 2-D images use (y, x). 
self.assertEqual(axes_for_archive_class("Image"), ("y", "x")) self.assertEqual(axes_for_archive_class("MaskedImage"), ("y", "x")) self.assertEqual(axes_for_archive_class("VisitImage"), ("y", "x")) - self.assertEqual(axes_for_archive_class("ColorImage"), ("c", "y", "x")) + self.assertEqual(axes_for_archive_class("Mask"), ("y", "x")) self.assertEqual(axes_for_archive_class("CellCoadd"), ("y", "x")) + # ColorImage's root has no top-level multiscale; this returns + # an empty tuple to signal "no OME multiscale at this level". + self.assertEqual(axes_for_archive_class("ColorImage"), ()) def test_chunks_for_default(self) -> None: - # 2D image: clamped to 1024 per axis. self.assertEqual(chunks_for("Image", (4096, 4096), None), (1024, 1024)) - # Smaller than 1024 → use full dim. + # Smaller than 1024 -> use full dim. self.assertEqual(chunks_for("Image", (300, 600), None), (300, 600)) def test_chunks_for_override(self) -> None: - # User override takes precedence. self.assertEqual(chunks_for("Image", (4096, 4096), (256, 256)), (256, 256)) - def test_color_image_transpose(self) -> None: - # In-memory shape (Y, X, 3) → on-disk (3, Y, X). - in_memory = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3) - on_disk = transpose_color_image_in(in_memory) - self.assertEqual(on_disk.shape, (3, 2, 3)) - # Round-trip. - recovered = transpose_color_image_out(on_disk) - np.testing.assert_array_equal(recovered, in_memory) - - def test_color_image_transpose_after_slicing(self) -> None: - # Critical: when reading a sliced subset of a ColorImage, the - # transpose must run on the (small) sliced result, never the - # full array. Here we feed in a (3, sliced-Y, sliced-X) array - # and check the output is (sliced-Y, sliced-X, 3). 
- sliced_on_disk = np.arange(3 * 4 * 5, dtype=np.uint8).reshape(3, 4, 5) - out = transpose_color_image_out(sliced_on_disk) - self.assertEqual(out.shape, (4, 5, 3)) + def test_chunks_for_cell_coadd_uses_cell_shape(self) -> None: + result = chunks_for( + "CellCoadd", + (4096, 4096), + None, + archive_metadata={"cell_shape": (256, 256)}, + ) + self.assertEqual(result, (256, 256)) + + def test_chunks_for_cell_coadd_without_metadata_falls_back(self) -> None: + self.assertEqual(chunks_for("CellCoadd", (4096, 4096), None), (1024, 1024)) + + def test_chunks_aligned_to_matches_image(self) -> None: + # variance / mask follow image's chunks when not overridden. + self.assertEqual( + chunks_aligned_to(image_chunks=(256, 256), shape=(4096, 4096)), + (256, 256), + ) + # If the sibling shape is smaller than image's chunks, clamp. + self.assertEqual( + chunks_aligned_to(image_chunks=(1024, 1024), shape=(300, 600)), + (300, 600), + ) if __name__ == "__main__": @@ -1398,38 +1504,40 @@ Create `python/lsst/images/zarr/_layout.py`: This module centralises the decisions that vary by image type: -- which OME axes apply (``ColorImage`` adds ``c``) -- default chunk sizes (clamped to 1024 per axis for plain images) -- how the in-memory ``(Y, X, 3)`` `ColorImage` array maps to the - ``(c, y, x)`` on-disk shape +- which OME axes apply (``ColorImage`` has no root multiscale) +- default chunk sizes (clamped to 1024 per axis for plain images, + cell-aligned for `CellCoadd`, image-aligned for `variance` / `mask` + siblings) +- the affine residual validator that gates the OME + ``coordinateTransformations`` block Keeping these in one place lets the output archive populate the IR -generically and lets the input archive apply per-class fixups (the -ColorImage transpose) **after** slicing — never on the full array. +generically. 
""" from __future__ import annotations __all__ = ( "axes_for_archive_class", + "chunks_aligned_to", "chunks_for", - "transpose_color_image_in", - "transpose_color_image_out", ) -import numpy as np +from collections.abc import Mapping +from typing import Any _DEFAULT_AXIS_LIMIT = 1024 def axes_for_archive_class(name: str) -> tuple[str, ...]: - """Return the OME axis tuple for a given archive class name. + """Return the OME axis tuple for a given archive class. - The default is ``(y, x)``. Specific classes that need extra axes - are listed explicitly. + Returns an empty tuple for ``ColorImage`` to signal that there is + no OME multiscale at the root of that class — the per-channel + sub-archives carry their own ``(y, x)`` multiscales. """ if name == "ColorImage": - return ("c", "y", "x") + return () return ("y", "x") @@ -1437,20 +1545,24 @@ def chunks_for( archive_class: str, shape: tuple[int, ...], override: tuple[int, ...] | None, + *, + archive_metadata: Mapping[str, Any] | None = None, ) -> tuple[int, ...]: - """Return the chunk shape to use for an array. + """Return the chunk shape to use for a top-level array. Parameters ---------- archive_class - The top-level archive class (used for class-specific defaults - such as ``CellCoadd``'s cell-aligned chunks; cell handling - lands in Phase 4). + Top-level archive class name; used for class-specific + defaults like ``CellCoadd``'s cell-aligned chunks. shape The full array shape, used to clamp the default per-axis. override - User-supplied chunk shape; if not ``None`` it is returned + User-supplied chunk shape. If not ``None`` it is returned verbatim after a length check. + archive_metadata + Class-specific layout hints. ``CellCoadd`` reads + ``"cell_shape"`` from this mapping. """ if override is not None: if len(override) != len(shape): @@ -1459,55 +1571,312 @@ def chunks_for( f"expected {len(shape)} for {archive_class!r}." 
) return tuple(override) + if archive_class == "CellCoadd" and archive_metadata is not None: + cell_shape = archive_metadata.get("cell_shape") + if cell_shape is not None: + return tuple(min(c, dim) for c, dim in zip(cell_shape, shape, strict=True)) return tuple(min(_DEFAULT_AXIS_LIMIT, dim) for dim in shape) -def transpose_color_image_in(array: np.ndarray) -> np.ndarray: - """Transpose a `ColorImage` array from in-memory to on-disk shape. +def chunks_aligned_to( + *, + image_chunks: tuple[int, ...], + shape: tuple[int, ...], +) -> tuple[int, ...]: + """Derive a sibling array's chunks from the ``image`` array's chunks. - In-memory: ``(Y, X, 3)``. On-disk (OME ``c, y, x``): ``(3, Y, X)``. + Used by `ZarrOutputArchive.add_array` for ``variance`` and + ``mask`` siblings when the user has not provided an explicit + override. The result is per-axis ``min(image_chunks[i], + shape[i])`` so a sibling smaller than ``image`` is not + over-chunked. """ - if array.ndim != 3 or array.shape[2] != 3: + if len(image_chunks) != len(shape): raise ValueError( - f"ColorImage in-memory shape must be (Y, X, 3); got {array.shape!r}." + f"image_chunks rank {len(image_chunks)} does not match " + f"sibling shape rank {len(shape)}." + ) + return tuple(min(c, dim) for c, dim in zip(image_chunks, shape, strict=True)) +``` + +- [ ] **Step 4: Run the tests** + +Run: `pytest tests/test_zarr_layout.py -v` +Expected: PASS — 6 tests. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_layout.py tests/test_zarr_layout.py +git commit -m "feat: add zarr layout rules for axes and chunk derivation" +``` + +### Task 2.3: `_layout.py` — affine residual validator + +**Files:** +- Modify: `python/lsst/images/zarr/_layout.py` +- Modify: `tests/test_zarr_layout.py` + +The affine residual validator extracts the linear / affine portion of the AST FrameSet's pixel-to-sky mapping, samples residuals on an 11×11 grid, and decides whether to emit the OME `coordinateTransformations` block. 
The contract: + +- Input: a `FrameSet`, a 2-D image bbox `(y_size, x_size)`, and a max residual threshold (default 1.0 pixel). +- Output: `AffineCheckResult` carrying either the affine `coordinateTransformations` to emit, **or** a `dropped=True` flag with the observed max residual. + +The function does **not** know about zarr; it only knows about AST. The output archive consumes the result and threads it into the OME multiscale block. + +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_layout.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class AffineValidatorTestCase(unittest.TestCase): + def _make_linear_frame_set(self, *, scale: float = 0.2): + # Build a synthetic FrameSet whose pixel-to-sky is a pure scale. + from lsst.images._transforms._ast import ( + Frame, + FrameSet, + ZoomMap, + ) + + base = Frame(2, "Domain=PIXEL") + sky = Frame(2, "Domain=SKY") + fs = FrameSet(base) + fs.addFrame(FrameSet.BASE, ZoomMap(2, scale), sky) + return fs + + def _make_distorted_frame_set(self): + # Build a FrameSet that adds a polynomial distortion on top of + # a linear pixel-to-sky map; the affine approximation will be + # off by many pixels at the corners. + from lsst.images._transforms._ast import ( + Frame, + FrameSet, + PolyMap, + ZoomMap, + CmpMap, + ) + + base = Frame(2, "Domain=PIXEL") + sky = Frame(2, "Domain=SKY") + # Forward polynomial: x' = x + 0.001 * y^2; y' = y + 0.001 * x^2. + # PolyMap coefficient table format: [coeff, output_index, x_power, y_power]. 
+ forward_coeffs = [ + [1.0, 1, 1, 0], + [0.001, 1, 0, 2], + [1.0, 2, 0, 1], + [0.001, 2, 2, 0], + ] + poly = PolyMap(forward_coeffs, 2, "IterInverse=1, NIterInverse=20") + cmp = CmpMap(poly, ZoomMap(2, 0.2), True) + fs = FrameSet(base) + fs.addFrame(FrameSet.BASE, cmp, sky) + return fs + + def test_pure_linear_passes(self) -> None: + from lsst.images.zarr._layout import affine_check + + fs = self._make_linear_frame_set(scale=0.2) + result = affine_check( + frame_set=fs, + image_shape=(64, 64), + max_residual_pixels=1.0, + ) + self.assertFalse(result.dropped) + self.assertIsNotNone(result.coordinate_transformations) + self.assertLess(result.max_residual_pixels, 1e-6) + + def test_high_distortion_drops_block(self) -> None: + from lsst.images.zarr._layout import affine_check + + fs = self._make_distorted_frame_set() + # 4096-pixel-wide image: 0.001 * 2048^2 ~ 4000 pixels of error + # at the corners. Way over the 1-pixel threshold. + result = affine_check( + frame_set=fs, + image_shape=(4096, 4096), + max_residual_pixels=1.0, ) - return np.ascontiguousarray(np.transpose(array, (2, 0, 1))) + self.assertTrue(result.dropped) + self.assertGreater(result.max_residual_pixels, 1.0) + # When dropped, the function still reports the residual so the + # output archive can record it as lsst.wcs_simplified_max_residual_pixels. +``` + +- [ ] **Step 2: Run to verify failure** + +Run: `pytest tests/test_zarr_layout.py::AffineValidatorTestCase -v` +Expected: FAIL — `affine_check` does not exist. + +- [ ] **Step 3: Implement `affine_check`** + +Append to `python/lsst/images/zarr/_layout.py`: + +```python +__all__ = ( + "AffineCheckResult", + "affine_check", + "axes_for_archive_class", + "chunks_aligned_to", + "chunks_for", +) + + +from dataclasses import dataclass -def transpose_color_image_out(array: np.ndarray) -> np.ndarray: - """Transpose a `ColorImage` array from on-disk to in-memory shape. 
+@dataclass +class AffineCheckResult: + """Result of validating a simplified affine against a full WCS. - On-disk: ``(c, y, x)`` (already-sliced if a subset was requested). - In-memory: ``(y, x, c)``. + When ``dropped`` is False, ``coordinate_transformations`` is the + OME-NGFF ``coordinateTransformations`` list to emit. When True, + the caller must omit the block (or emit a unit scale only) and + record ``max_residual_pixels`` as the observed worst error. """ - if array.ndim != 3 or array.shape[0] != 3: - raise ValueError( - f"ColorImage on-disk shape must be (3, Y, X); got {array.shape!r}." + + dropped: bool + max_residual_pixels: float + coordinate_transformations: list[dict[str, Any]] | None + + +def affine_check( + *, + frame_set: Any, + image_shape: tuple[int, int], + max_residual_pixels: float = 1.0, + grid: int = 11, +) -> AffineCheckResult: + """Build an OME affine ``coordinateTransformations`` for ``frame_set``, + validate it on a ``grid``×``grid`` sample, and decide whether to keep it. + + The simplified affine is constructed by mapping three reference + pixels (origin and the two unit-axis steps) through ``frame_set`` + to recover the linear coefficients. The full pixel-to-sky map is + then evaluated at every grid point and compared to the affine's + prediction; the worst great-circle separation is divided by the + pixel scale to get a pixel-equivalent residual. + + If ``max_residual <= max_residual_pixels``, returns a result whose + ``coordinate_transformations`` is the affine block. Otherwise + returns a dropped result and the caller must emit the unit scale + (or no transformations at all). + """ + import numpy as np + + h, w = image_shape + + # 1. Recover the linear / affine portion by mapping three pixels. 
+ pixels = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]) + sky_at_ref = _frame_set_apply(frame_set, pixels) + origin = sky_at_ref[0] + dxsky = sky_at_ref[1] - origin + dysky = sky_at_ref[2] - origin + affine_matrix = np.array( + [ + [dxsky[0], dysky[0], origin[0]], + [dxsky[1], dysky[1], origin[1]], + [0.0, 0.0, 1.0], + ] + ) + + pixel_scale_y = float(np.linalg.norm(dysky)) + pixel_scale_x = float(np.linalg.norm(dxsky)) + pixel_scale = float(np.sqrt(pixel_scale_y * pixel_scale_x)) + if pixel_scale <= 0.0: + return AffineCheckResult( + dropped=True, + max_residual_pixels=float("inf"), + coordinate_transformations=None, + ) + + # 2. Sample residuals on a grid spanning [0, h-1] x [0, w-1]. + ys = np.linspace(0.0, max(h - 1, 0), grid) + xs = np.linspace(0.0, max(w - 1, 0), grid) + grid_pixels = np.array([[y, x] for y in ys for x in xs]) + sky_full = _frame_set_apply(frame_set, grid_pixels) + affine_pred = (affine_matrix[:2, :2] @ grid_pixels.T).T + origin + great_circle = _angular_separation(sky_full, affine_pred) + max_residual = float(np.max(great_circle) / pixel_scale) + + coordinate_transformations: list[dict[str, Any]] = [ + { + "type": "scale", + "scale": [pixel_scale_y, pixel_scale_x], + }, + { + "type": "affine", + "affine": affine_matrix.tolist(), + }, + ] + + if max_residual > max_residual_pixels: + return AffineCheckResult( + dropped=True, + max_residual_pixels=max_residual, + coordinate_transformations=None, + ) + return AffineCheckResult( + dropped=False, + max_residual_pixels=max_residual, + coordinate_transformations=coordinate_transformations, + ) + + +def _frame_set_apply(frame_set: Any, pixels: Any) -> Any: + """Apply ``frame_set``'s base->current mapping to a (N, 2) pixel array.""" + import numpy as np + + pixels = np.asarray(pixels, dtype=float) + mapping = frame_set.getMapping(frame_set.base, frame_set.current) + # AST applyForward expects (n_axes, n_points); transpose round-trip. 
+ out = mapping.applyForward(pixels.T) + return np.asarray(out).T + + +def _angular_separation(a: Any, b: Any) -> Any: + """Element-wise great-circle separation between two arrays of (lon, lat). + + Inputs in radians (AST default for unit sky frames). Returns a 1-D + array of separations in the same units as the input. + """ + import numpy as np + + a = np.asarray(a) + b = np.asarray(b) + lon_a, lat_a = a[:, 0], a[:, 1] + lon_b, lat_b = b[:, 0], b[:, 1] + dlon = lon_b - lon_a + return np.arccos( + np.clip( + np.sin(lat_a) * np.sin(lat_b) + np.cos(lat_a) * np.cos(lat_b) * np.cos(dlon), + -1.0, + 1.0, ) - return np.ascontiguousarray(np.transpose(array, (1, 2, 0))) + ) ``` -- [ ] **Step 4: Run the test to verify it passes** +- [ ] **Step 4: Run the tests** Run: `pytest tests/test_zarr_layout.py -v` -Expected: PASS — 5 tests pass. +Expected: PASS — 8 tests; the linear FrameSet has near-zero residual, the polynomial FrameSet is dropped with `max_residual_pixels` in the thousands. - [ ] **Step 5: Commit** ```bash git add python/lsst/images/zarr/_layout.py tests/test_zarr_layout.py -git commit -m "feat: add zarr layout rules and ColorImage transpose helpers" +git commit -m "feat: add affine_check residual validator for OME coordinateTransformations" ``` -### Task 2.3: `ZarrOutputArchive` — skeleton, `serialize_direct` / `serialize_pointer` +### Task 2.4: `ZarrOutputArchive` skeleton — `serialize_direct` / `serialize_pointer` / `iter_frame_sets` **Files:** - Create: `python/lsst/images/zarr/_output_archive.py` - Test: `tests/test_zarr_output_archive.py` -This task wires the abstract `OutputArchive` ABC to the IR. `add_array` / `add_table` / `add_structured_array` follow in Task 2.4. `add_tree` and the public `write()` helper land in Task 2.5. +The constructor builds an empty `ZarrDocument` and stashes the user's per-array overrides plus the `archive_metadata` dict (used by `_layout.chunks_for` to see `cell_shape`). 
`serialize_direct` returns a `NestedOutputArchive` so nested calls land at compound paths (`red/image` rather than `image`). `serialize_pointer` writes the sub-tree's JSON bytes to a `tree` array under the sub-archive's path and returns a `ZarrPointerModel(path="/tree")`. -The constructor builds an empty `ZarrDocument` and stores the user-supplied chunks/shards/compression overrides. `serialize_direct` returns a `NestedOutputArchive` (matches NDF/JSON pattern). `serialize_pointer` writes the sub-tree as a UTF-8 JSON byte array under `/lsst//tree` and returns a `ZarrPointerModel(path="/lsst//tree")`. +`add_array` / `add_table` / `add_structured_array` / `add_tree` follow in subsequent tasks; they raise `NotImplementedError` here so the abstract class is concretely implementable. - [ ] **Step 1: Write the failing test** @@ -1529,6 +1898,8 @@ from __future__ import annotations import unittest +import pydantic + try: from lsst.images.zarr._common import ZarrPointerModel from lsst.images.zarr._output_archive import ZarrOutputArchive @@ -1537,8 +1908,6 @@ try: except ImportError: HAVE_ZARR = False -import pydantic - class _Sub(pydantic.BaseModel): label: str = "sub" @@ -1552,7 +1921,7 @@ class ZarrOutputArchiveSkeletonTestCase(unittest.TestCase): def serializer(arch): # noqa: ANN001 return _Sub(label="ok") - result = archive.serialize_direct("/sub", serializer) + result = archive.serialize_direct("red", serializer) self.assertEqual(result.label, "ok") def test_serialize_pointer_writes_json_subtree(self) -> None: @@ -1561,14 +1930,14 @@ class ZarrOutputArchiveSkeletonTestCase(unittest.TestCase): def serializer(arch): # noqa: ANN001 return _Sub(label="psf") - pointer = archive.serialize_pointer("/psf", serializer, key=12345) + pointer = archive.serialize_pointer("psf", serializer, key=12345) self.assertIsInstance(pointer, ZarrPointerModel) - self.assertEqual(pointer.path, "/lsst/psf/tree") - # Calling again with the same key returns the cached pointer. 
- again = archive.serialize_pointer("/psf", serializer, key=12345) + self.assertEqual(pointer.path, "/psf/tree") + # Cached on second call. + again = archive.serialize_pointer("psf", serializer, key=12345) self.assertEqual(again, pointer) - # The IR has the JSON sub-tree as an array of UTF-8 bytes. - node = archive.document.root.get("/lsst/psf/tree") + # IR holds the JSON bytes as a 1-D uint8 array. + node = archive.document.root.get("/psf/tree") self.assertEqual(str(node.dtype), "uint8") @@ -1576,12 +1945,12 @@ if __name__ == "__main__": unittest.main() ``` -- [ ] **Step 2: Run the test to verify it fails** +- [ ] **Step 2: Run to verify failure** Run: `pytest tests/test_zarr_output_archive.py -v` Expected: FAIL — `ImportError`. -- [ ] **Step 3: Write the skeleton of `_output_archive.py`** +- [ ] **Step 3: Write the skeleton** Create `python/lsst/images/zarr/_output_archive.py`: @@ -1625,18 +1994,24 @@ class ZarrOutputArchive(OutputArchive[ZarrPointerModel]): """Output archive that populates a `ZarrDocument` IR. Bytes are not written until the IR is materialized via - `ZarrDocument.to_zarr`, which the public `write` helper performs on - context-manager exit. + `ZarrDocument.to_zarr`, which the public `write` helper performs + on context-manager exit. Parameters ---------- chunks - Per-array chunk overrides keyed by JSON pointer (or zarr path). - ``None`` for a key means "use the layout default". - shards - Per-array shard overrides keyed by JSON pointer (or zarr path). - compression - Per-array codec overrides keyed by JSON pointer (or zarr path). + Per-array chunk overrides keyed by the JSON pointer of the + attribute the array backs (or its zarr path). ``None`` for a + key means "use the layout default". + shards, compression + Same shape as ``chunks``. + archive_class + Top-level archive class name (``"VisitImage"``, ``"CellCoadd"``, + …). 
Used by the layout layer to pick chunk defaults; set by + ``write()`` before ``obj.serialize`` runs so ``add_array`` + sees the right value. + archive_metadata + Class-specific layout hints (``cell_shape`` for ``CellCoadd``). """ def __init__( @@ -1645,11 +2020,15 @@ class ZarrOutputArchive(OutputArchive[ZarrPointerModel]): chunks: Mapping[str, tuple[int, ...] | None] | None = None, shards: Mapping[str, tuple[int, ...] | None] | None = None, compression: Mapping[str, ZarrCompressionOptions | None] | None = None, + archive_class: str = "Image", + archive_metadata: Mapping[str, Any] | None = None, ) -> None: self.document = ZarrDocument(root=ZarrGroup()) self._chunks = dict(chunks) if chunks else {} self._shards = dict(shards) if shards else {} self._compression = dict(compression) if compression else {} + self._archive_class = archive_class + self._archive_metadata = dict(archive_metadata) if archive_metadata else {} self._pointers: dict[Hashable, ZarrPointerModel] = {} self._frame_sets: list[tuple[FrameSet, ZarrPointerModel]] = [] @@ -1673,8 +2052,6 @@ class ZarrOutputArchive(OutputArchive[ZarrPointerModel]): sub_zarr_path = archive_path_to_zarr_path(archive_path) model = self.serialize_direct(name, serializer) json_bytes = model.model_dump_json().encode("utf-8") - # Stage the JSON tree as a 1D uint8 array under /tree. - # Group container is created lazily by ensure_group. parent = self.document.root.ensure_group(sub_zarr_path) parent.arrays["tree"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) pointer = ZarrPointerModel(path=f"{sub_zarr_path}/tree") @@ -1695,44 +2072,46 @@ class ZarrOutputArchive(OutputArchive[ZarrPointerModel]): def iter_frame_sets(self) -> Iterator[tuple[FrameSet, ZarrPointerModel]]: return iter(self._frame_sets) - # add_array / add_table / add_structured_array land in Task 2.4; - # raising NotImplementedError keeps mypy happy until then. 
def add_array(self, *args: Any, **kwargs: Any) -> Any: # type: ignore[override] - raise NotImplementedError("add_array lands in Task 2.4") + raise NotImplementedError("add_array lands in Task 2.5") def add_table(self, *args: Any, **kwargs: Any) -> Any: # type: ignore[override] - raise NotImplementedError("add_table lands in Task 2.4") + raise NotImplementedError("add_table lands in Task 2.6") def add_structured_array(self, *args: Any, **kwargs: Any) -> Any: # type: ignore[override] - raise NotImplementedError("add_structured_array lands in Task 2.4") + raise NotImplementedError("add_structured_array lands in Task 2.6") def write(*args: Any, **kwargs: Any) -> Any: - """Public write helper. Implemented in Task 2.5.""" - raise NotImplementedError("write() lands in Task 2.5") + """Public write helper. Implemented in Task 2.7.""" + raise NotImplementedError("write() lands in Task 2.7") ``` -- [ ] **Step 4: Run the test to verify it passes** +- [ ] **Step 4: Run the tests** Run: `pytest tests/test_zarr_output_archive.py -v` -Expected: PASS — 2 tests pass; the IR holds a `uint8` array at `/lsst/psf/tree`. +Expected: PASS — 2 tests pass. 
- [ ] **Step 5: Commit** ```bash git add python/lsst/images/zarr/_output_archive.py tests/test_zarr_output_archive.py -git commit -m "feat: add ZarrOutputArchive skeleton with serialize_direct/serialize_pointer" +git commit -m "feat: add ZarrOutputArchive skeleton with serialize_direct/pointer/frame_set" ``` -### Task 2.4: `ZarrOutputArchive.add_array`, `add_table`, `add_structured_array` +### Task 2.5: `add_array` — image, variance, and the 2-D packed mask **Files:** - Modify: `python/lsst/images/zarr/_output_archive.py` -- Modify: `tests/test_zarr_output_archive.py` +- Test: `tests/test_zarr_output_archive.py` + +`add_array(array, name=...)` does three different things depending on the name: -`add_array` stages a numpy array into the IR at the zarr path computed by `_layout` / `_common`, applies per-array chunk/shard/compression overrides, and returns an `ArrayReferenceModel` with `source=f"zarr:{zarr_path}"`. Mask arrays go to `/lsst/mask/0`; variance to `/lsst/variance/0`; main image to `/0`. Anonymous nested arrays land at `/lsst//0`. +1. `name == "image"` (or any non-mask name) — stage the array verbatim with default chunks (or overrides), attach `_ARRAY_DIMENSIONS` and `units` / `long_name` if known. The chunk shape is held aside as the "image chunks" so siblings can align to it. +2. `name == "variance"` — derive chunks from `image_chunks` via `chunks_aligned_to` when the user has not overridden, attach `_ARRAY_DIMENSIONS = ["y", "x"]`. +3. `name == "mask"` — convert the 3-D `(y, x, mask_size)` in-memory mask into a 2-D `(y, x)` packed-integer array of `mask_dtype_for_plane_count(n_planes)`. Build CF flag attrs from the schema (passed via `archive_metadata["mask_schema"]`). Derive chunks from `image_chunks` when not overridden. -`add_table` and `add_structured_array` stage one 1-D zarr array per column under `/lsst/tables//` and attach an `LsstTableGroup` attributes block to the parent group. 
The returned `TableModel` has each column's `ArrayReferenceModel.source` set to `f"zarr:/lsst/tables//"`. +Anonymous (nested) arrays land at the path equal to `name`, no special-case behavior. - [ ] **Step 1: Write the failing test** @@ -1740,66 +2119,118 @@ Append to `tests/test_zarr_output_archive.py`: ```python @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") -class ZarrOutputArchiveArrayTestCase(unittest.TestCase): - def test_add_image_array(self) -> None: +class ZarrOutputArchiveAddArrayTestCase(unittest.TestCase): + def test_add_image(self) -> None: import numpy as np archive = ZarrOutputArchive() - ref = archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") - self.assertEqual(ref.source, "zarr:/0") + ref = archive.add_array( + np.ones((4, 5), dtype=np.float32), name="image" + ) + self.assertEqual(ref.source, "zarr:/image") self.assertEqual(list(ref.shape), [4, 5]) - ir_array = archive.document.root.get("/0") - self.assertEqual(ir_array.shape, (4, 5)) + node = archive.document.root.get("/image") + self.assertEqual(node.shape, (4, 5)) + self.assertEqual(node.attributes.extra["_ARRAY_DIMENSIONS"], ["y", "x"]) - def test_add_mask_array(self) -> None: + def test_add_variance_aligns_to_image_chunks(self) -> None: import numpy as np - archive = ZarrOutputArchive() - ref = archive.add_array(np.zeros((1, 4, 5), dtype=np.uint8), name="mask") - self.assertEqual(ref.source, "zarr:/lsst/mask/0") - ir_array = archive.document.root.get("/lsst/mask/0") - self.assertEqual(ir_array.shape, (1, 4, 5)) + archive = ZarrOutputArchive(chunks={"image": (2, 2)}) + archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") + archive.add_array(np.ones((4, 5), dtype=np.float64), name="variance") + var_node = archive.document.root.get("/variance") + self.assertEqual(tuple(var_node.chunks), (2, 2)) - def test_add_variance_array(self) -> None: + def test_add_mask_packs_to_2d_with_cf_flag_attrs(self) -> None: import numpy as np - archive = ZarrOutputArchive() - 
ref = archive.add_array(np.ones((4, 5), dtype=np.float64), name="variance") - self.assertEqual(ref.source, "zarr:/lsst/variance/0") + from lsst.images import MaskPlane, MaskSchema - def test_add_anonymous_nested_array(self) -> None: + schema = MaskSchema( + [ + MaskPlane("BAD", "Bad pixel."), + MaskPlane("SAT", "Saturated."), + MaskPlane("CR", "Cosmic ray."), + ] + ) + # In-memory mask is (y, x, mask_size). + in_memory = np.zeros((4, 5, 1), dtype=np.uint8) + in_memory[0, 0, 0] = 0b1 # BAD + in_memory[1, 1, 0] = 0b110 # SAT | CR + + archive = ZarrOutputArchive(archive_metadata={"mask_schema": schema}) + archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") + ref = archive.add_array(in_memory, name="mask") + self.assertEqual(ref.source, "zarr:/mask") + node = archive.document.root.get("/mask") + # 2-D packed integer. + self.assertEqual(node.shape, (4, 5)) + self.assertEqual(str(node.dtype), "uint8") # 3 planes -> uint8 + # Bytes packed correctly. + np.testing.assert_array_equal(node.data[0, 0], 0b1) + np.testing.assert_array_equal(node.data[1, 1], 0b110) + # CF flag attrs. 
+ attrs = node.attributes.extra + self.assertEqual(attrs["flag_masks"], [1, 2, 4]) + self.assertEqual(attrs["flag_meanings"], "BAD SAT CR") + self.assertEqual( + attrs["flag_descriptions"], + ["Bad pixel.", "Saturated.", "Cosmic ray."], + ) + self.assertEqual(attrs["_ARRAY_DIMENSIONS"], ["y", "x"]) + + def test_add_mask_picks_widest_dtype_for_40_planes(self) -> None: import numpy as np - archive = ZarrOutputArchive() - ref = archive.add_array(np.ones((3,), dtype=np.float32), name="/psf/centroids") - self.assertEqual(ref.source, "zarr:/lsst/psf/centroids/0") + from lsst.images import MaskPlane, MaskSchema - def test_add_table_creates_one_array_per_column(self) -> None: - import astropy.table + planes = [MaskPlane(f"P{i}", f"Plane {i}.") for i in range(40)] + schema = MaskSchema(planes) + in_memory = np.zeros((4, 5, 5), dtype=np.uint8) # mask_size=5 + + archive = ZarrOutputArchive(archive_metadata={"mask_schema": schema}) + archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") + archive.add_array(in_memory, name="mask") + node = archive.document.root.get("/mask") + self.assertEqual(node.shape, (4, 5)) + self.assertEqual(str(node.dtype), "uint64") + + def test_add_mask_refuses_more_than_64_planes(self) -> None: + import numpy as np + + from lsst.images import MaskPlane, MaskSchema + + planes = [MaskPlane(f"P{i}", f"Plane {i}.") for i in range(65)] + schema = MaskSchema(planes) + in_memory = np.zeros((4, 5, 9), dtype=np.uint8) + + archive = ZarrOutputArchive(archive_metadata={"mask_schema": schema}) + archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") + with self.assertRaisesRegex(ValueError, "supports up to 64"): + archive.add_array(in_memory, name="mask") + + def test_add_anonymous_nested_array(self) -> None: import numpy as np archive = ZarrOutputArchive() - table = astropy.table.Table( - {"x": np.arange(4, dtype=np.int32), "y": np.arange(4, dtype=np.float32)} + ref = archive.add_array( + np.ones((3,), dtype=np.float32), 
name="psf/centroids" ) - model = archive.add_table(table, name="/cat") - self.assertEqual(len(model.columns), 2) - sources = {c.name: c.data.source for c in model.columns} - self.assertEqual(sources["x"], "zarr:/lsst/tables/cat/x") - self.assertEqual(sources["y"], "zarr:/lsst/tables/cat/y") + self.assertEqual(ref.source, "zarr:/psf/centroids") + self.assertEqual(archive.document.root.get("/psf/centroids").shape, (3,)) ``` -- [ ] **Step 2: Run the test to verify it fails** +- [ ] **Step 2: Run to verify failure** -Run: `pytest tests/test_zarr_output_archive.py::ZarrOutputArchiveArrayTestCase -v` +Run: `pytest tests/test_zarr_output_archive.py::ZarrOutputArchiveAddArrayTestCase -v` Expected: FAIL — `add_array` raises `NotImplementedError`. -- [ ] **Step 3: Replace the stubs with real implementations** +- [ ] **Step 3: Implement `add_array` and the mask-packing helper** -Edit `python/lsst/images/zarr/_output_archive.py` — extend the imports and replace the three `NotImplementedError` stubs: +In `python/lsst/images/zarr/_output_archive.py`, extend imports: ```python -# At the top of the file, extend imports: import astropy.io.fits import astropy.table import astropy.units @@ -1818,11 +2249,26 @@ from ._common import ( ZarrCompressionOptions, ZarrPointerModel, archive_path_to_zarr_path, - json_pointer_to_zarr_path, + mask_dtype_for_plane_count, ) +from ._layout import chunks_aligned_to, chunks_for +from ._model import ( + CfFlagAttributes, + MaskPlaneEntry, + ZarrArray, + ZarrDocument, + ZarrGroup, + build_image_array_attrs, +) +``` + +Add an `_image_chunks` field to `__init__`: + +```python + self._image_chunks: tuple[int, ...] 
| None = None ``` -Replace the three placeholder methods with: +Replace the `add_array` placeholder: ```python def add_array( @@ -1834,24 +2280,170 @@ Replace the three placeholder methods with: ) -> ArrayReferenceModel: if name is None: raise ValueError("Anonymous arrays are not supported in ZarrOutputArchive.") - zarr_path = json_pointer_to_zarr_path(name if name.startswith("/") else f"/{name}") - # zarr_path looks like "/0" or "/lsst/<...>/0"; we stage the - # array at the leaf and let layout defaults fill chunks later. + archive_path = name if name.startswith("/") else f"/{name}" + zarr_path = archive_path_to_zarr_path(archive_path) leaf = zarr_path.rsplit("/", 1)[-1] parent_path = zarr_path[: -(len(leaf) + 1)] or "/" parent = self.document.root.ensure_group(parent_path) - parent.arrays[leaf] = ZarrArray( - data=np.ascontiguousarray(array), - chunks=self._chunks.get(name), - shards=self._shards.get(name), - compression=self._compression.get(name), - ) - return ArrayReferenceModel( - source=f"zarr:{zarr_path}", - shape=list(array.shape), - datatype=NumberType.from_numpy(array.dtype), + + # Mask: pack 3-D (y, x, mask_size) -> 2-D wide-int packed. + if leaf == "mask" and array.ndim == 3: + packed, flag_attrs = self._pack_mask(array) + chunks = self._chunks.get(name) or self._chunks.get(leaf) + if chunks is None and self._image_chunks is not None: + chunks = chunks_aligned_to( + image_chunks=self._image_chunks, shape=packed.shape + ) + extra: dict[str, Any] = {"_ARRAY_DIMENSIONS": ["y", "x"]} + extra.update(flag_attrs.dump()) + ir_array = ZarrArray( + data=packed, + chunks=chunks, + shards=self._shards.get(name), + compression=self._compression.get(name), + ) + ir_array.attributes.extra = extra + parent.arrays[leaf] = ir_array + return ArrayReferenceModel( + source=f"zarr:{zarr_path}", + shape=list(packed.shape), + datatype=NumberType.from_numpy(packed.dtype), + ) + + # variance / other top-level siblings: align to image's chunks. 
+ if leaf in ("variance",) or (parent_path == "/" and self._image_chunks): + chunks = self._chunks.get(name) or self._chunks.get(leaf) + if chunks is None and self._image_chunks is not None and array.ndim == len( + self._image_chunks + ): + chunks = chunks_aligned_to( + image_chunks=self._image_chunks, shape=array.shape + ) + else: + chunks = self._chunks.get(name) or self._chunks.get(leaf) + + # Default chunks for the top-level image: from layout rules. + if chunks is None and parent_path == "/" and leaf == "image": + chunks = chunks_for( + self._archive_class, + array.shape, + None, + archive_metadata=self._archive_metadata, + ) + + ir_array = ZarrArray( + data=np.ascontiguousarray(array), + chunks=chunks, + shards=self._shards.get(name), + compression=self._compression.get(name), + ) + if parent_path == "/" and leaf in ("image", "variance"): + ir_array.attributes.extra = build_image_array_attrs( + axes=("y", "x"), + long_name="science image" if leaf == "image" else "image variance", + ) + parent.arrays[leaf] = ir_array + + # Remember the image's chunks so siblings can align. + if parent_path == "/" and leaf == "image" and chunks is not None: + self._image_chunks = tuple(chunks) + + return ArrayReferenceModel( + source=f"zarr:{zarr_path}", + shape=list(array.shape), + datatype=NumberType.from_numpy(array.dtype), + ) + + def _pack_mask( + self, array: np.ndarray + ) -> tuple[np.ndarray, CfFlagAttributes]: + """Pack a 3-D ``(y, x, mask_size)`` mask into a 2-D wide-int array. + + The schema is taken from ``self._archive_metadata["mask_schema"]``. + Returns the packed array and the CF flag attributes. + """ + from lsst.images import MaskSchema + + schema = self._archive_metadata.get("mask_schema") + if not isinstance(schema, MaskSchema): + raise ValueError( + "Writing a 3-D mask requires archive_metadata['mask_schema'] " + "to be set; the output archive cannot infer the plane " + "definitions otherwise." 
+ ) + n_planes = len(schema) + target_dtype = mask_dtype_for_plane_count(n_planes) + # Pack: each (y, x) pixel's mask_size bytes -> one wide integer. + # Byte 0 is the low byte (planes 0..7), byte 1 is the next, etc. + packed = np.zeros(array.shape[:2], dtype=target_dtype) + for i in range(array.shape[2]): + packed |= array[..., i].astype(target_dtype) << (8 * i) + planes = [ + MaskPlaneEntry(name=p.name, bit=i, description=p.description) + for i, p in enumerate(schema) + ] + return packed, CfFlagAttributes(planes=planes) +``` + +- [ ] **Step 4: Run the tests** + +Run: `pytest tests/test_zarr_output_archive.py -v` +Expected: PASS — 8 tests; mask packs to the correct dtype, CF flag attrs are populated, sibling chunks align. + +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_output_archive.py tests/test_zarr_output_archive.py +git commit -m "feat: implement add_array with image/variance/mask handling and CF flag attrs" +``` + +### Task 2.6: `add_table` and `add_structured_array` + +**Files:** +- Modify: `python/lsst/images/zarr/_output_archive.py` +- Modify: `tests/test_zarr_output_archive.py` + +Tables stage one 1-D zarr array per column under `/lsst/tables//` and attach the table's `meta` block to the parent group's `lsst` namespace. 
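
The per-column layout can be pictured with a small stdlib-only sketch. `column_sources` is a hypothetical helper, not part of `lsst.images`; the path pattern is inferred from the `zarr:/lsst/tables/cat/x` sources that the Step 1 test expects, and tolerating a leading slash on the table name is an assumption.

```python
from __future__ import annotations


def column_sources(table_name: str, column_names: list[str]) -> dict[str, str]:
    """Map each column name to the ``zarr:`` source of its 1-D array.

    Hypothetical illustration only: the real archive returns
    ``ArrayReferenceModel`` objects rather than plain strings.
    """
    # Assumption: "cat" and "/cat" address the same table group.
    group = f"/lsst/tables/{table_name.lstrip('/')}"
    # One zarr array per column, all siblings under the table's group.
    return {column: f"zarr:{group}/{column}" for column in column_names}


sources = column_sources("cat", ["x", "y"])
print(sources["x"])
```

Keeping one array per column means a reader can fetch a single column without touching the others, which is presumably the motivation for this layout on object stores.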
+ +- [ ] **Step 1: Write the failing test** + +Append to `tests/test_zarr_output_archive.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrOutputArchiveAddTableTestCase(unittest.TestCase): + def test_add_table_creates_one_array_per_column(self) -> None: + import astropy.table + import numpy as np + + archive = ZarrOutputArchive() + original = astropy.table.Table( + { + "x": np.arange(4, dtype=np.int32), + "y": np.arange(4, dtype=np.float32), + }, + meta={"comment": "small catalog"}, ) + model = archive.add_table(original, name="cat") + self.assertEqual(len(model.columns), 2) + sources = {c.name: c.data.source for c in model.columns} + self.assertEqual(sources["x"], "zarr:/lsst/tables/cat/x") + self.assertEqual(sources["y"], "zarr:/lsst/tables/cat/y") + # Each column is its own zarr array under the parent group. + x_node = archive.document.root.get("/lsst/tables/cat/x") + self.assertEqual(x_node.shape, (4,)) +``` + +- [ ] **Step 2: Run to verify failure** + +Run: `pytest tests/test_zarr_output_archive.py::ZarrOutputArchiveAddTableTestCase -v` +Expected: FAIL — `add_table` raises `NotImplementedError`. +- [ ] **Step 3: Implement `add_table` and `add_structured_array`** + +Replace the placeholders in `_output_archive.py`: + +```python def add_table( self, table: astropy.table.Table, @@ -1899,30 +2491,35 @@ Replace the three placeholder methods with: return TableModel(columns=columns) ``` -- [ ] **Step 4: Run the tests to verify they pass** +- [ ] **Step 4: Run the tests** Run: `pytest tests/test_zarr_output_archive.py -v` -Expected: PASS — all 7 tests pass. +Expected: PASS — 9 tests. 
- [ ] **Step 5: Commit** ```bash git add python/lsst/images/zarr/_output_archive.py tests/test_zarr_output_archive.py -git commit -m "feat: implement ZarrOutputArchive add_array/add_table/add_structured_array" +git commit -m "feat: implement ZarrOutputArchive add_table and add_structured_array" ``` -### Task 2.5: `add_tree`, top-level OME multiscales, and the public `write()` helper +### Task 2.7: `add_tree`, OME multiscale + WCS validator integration, public `write()` **Files:** - Modify: `python/lsst/images/zarr/_output_archive.py` -- Modify: `python/lsst/images/zarr/__init__.py` (re-export `write` and `ZarrOutputArchive`) +- Modify: `python/lsst/images/zarr/__init__.py` - Modify: `tests/test_zarr_output_archive.py` -`add_tree` stages the main JSON tree under `/lsst/tree`, sets the `lsst.archive_class` and `lsst.tree`/`lsst.frame_set`/`lsst.companions` attributes on the root group, and — when there is a top-level `/0` array — populates an `OmeMultiscale` block on the root group's attributes. +`add_tree` finalizes the IR: -The public `write(obj, path, ...)` helper opens a store via `_store.open_store_for_write`, calls `obj.serialize`, then `add_tree`, and finally `document.to_zarr(store)`. +1. Stage the JSON tree at `/tree`. +2. Stage the AST WCS string at `/wcs_ast` (when an AST FrameSet was registered via `serialize_frame_set` or supplied directly). +3. Build the OME multiscale block. If a top-level `/image` array exists and the archive carries a frame set, run `affine_check`. If the result drops the affine, emit a unit-scale block and set `lsst.wcs_simplified_dropped: true` with the residual. +4. Set `lsst.archive_class`, `lsst.tree`, `lsst.wcs_ast` (if present), `data_model`, `version`, `lsst.cell_grid` (when `archive_metadata["cell_grid"]` is set). -- [ ] **Step 1: Write the failing tests** +The public `write(obj, path, ...)` function constructs the archive, runs the serializer, calls `add_tree`, and materializes via `open_store_for_write` + `to_zarr`. 
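The affine validation in step 3 could look roughly like the sketch below. It is not the `affine_check` from `_layout.py`, whose internals are not shown in this plan; `affine_residual_pixels` and the `mapping` callable are hypothetical stand-ins, and converting world-space residuals back to pixels through the fitted linear part is an assumption about how the residual is measured.

```python
import numpy as np


def affine_residual_pixels(mapping, shape, n=11):
    """Fit an affine to ``mapping`` on an n x n grid; return the worst misfit.

    ``mapping(x, y)`` is a hypothetical pixel -> world callable standing in
    for the AST frame set. The result is in (approximate) pixel units.
    """
    ny, nx = shape
    xs = np.linspace(0.0, nx - 1, n)
    ys = np.linspace(0.0, ny - 1, n)
    gx, gy = np.meshgrid(xs, ys)
    pix = np.column_stack([gx.ravel(), gy.ravel()])        # (n*n, 2)
    world = np.array([mapping(x, y) for x, y in pix])      # (n*n, 2)

    # Least-squares fit of world = pix @ linear + offset.
    design = np.column_stack([pix, np.ones(len(pix))])     # (n*n, 3)
    coeffs, *_ = np.linalg.lstsq(design, world, rcond=None)
    linear = coeffs[:2]                                    # (2, 2)

    # Map world-space residuals back to pixel offsets (assumed convention).
    residual_world = world - design @ coeffs
    residual_pix = residual_world @ np.linalg.inv(linear)
    return float(np.max(np.hypot(residual_pix[:, 0], residual_pix[:, 1])))


exact = affine_residual_pixels(lambda x, y: (2.0 * x + 1.0, 3.0 * y - 4.0), (101, 101))
bent = affine_residual_pixels(lambda x, y: (2.0 * x + 0.01 * x * x, 3.0 * y), (101, 101))
```

Under these assumptions a caller would compare the returned value with `max_residual_pixels=1.0` and fall back to the unit-scale block when it is exceeded, as the distorted `bent` mapping would.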
+ +- [ ] **Step 1: Write the failing test** Append to `tests/test_zarr_output_archive.py`: @@ -1941,70 +2538,137 @@ class ZarrWriteHelperTestCase(unittest.TestCase): from lsst.images.zarr._common import LSST_NS, OME_NS from lsst.images.zarr._model import ZarrDocument - image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + original = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) with tempfile.TemporaryDirectory() as tmp: target = os.path.join(tmp, "out.zarr") - tree = write(image, target) - self.assertEqual(tree.kind, "image") # whatever Image's tree carries - # Reload via the IR (we have no ZarrInputArchive yet) and check shape. + tree = write(original, target) + self.assertIsNotNone(tree) with zarr.storage.LocalStore(target, read_only=True) as store: doc = ZarrDocument.from_zarr(store) - self.assertIn("0", doc.root.arrays) - self.assertEqual(doc.root.arrays["0"].shape, (4, 5)) - # Top-level OME multiscales block is populated. - self.assertEqual(doc.root.attributes.lsst["archive_class"], "Image") - self.assertIn("multiscales", doc.root.attributes.ome) + # Top-level image and tree are present. + self.assertIn("image", doc.root.arrays) + self.assertIn("tree", doc.root.arrays) + self.assertEqual(doc.root.arrays["image"].shape, (4, 5)) + # LSST root attrs. + lsst_attrs = doc.root.attributes.lsst + self.assertEqual(lsst_attrs["archive_class"], "Image") + self.assertEqual(lsst_attrs["tree"], "tree") + # OME multiscales points at /image; no projection means + # the unit scale is emitted. + ome = doc.root.attributes.ome + self.assertIn("multiscales", ome) + self.assertEqual( + ome["multiscales"][0]["datasets"][0]["path"], "image" + ) + # data_model + version on root. 
+ self.assertEqual( + doc.root.attributes.extra["data_model"], "org.lsst.image" + ) ``` -(Use `tree.kind` only if `ImageSerializationModel` exposes that field — otherwise check whatever attribute is canonical for the image tree; the assertion is just that `write` returns the tree.) - -- [ ] **Step 2: Run the test to verify it fails** +- [ ] **Step 2: Run to verify failure** Run: `pytest tests/test_zarr_output_archive.py::ZarrWriteHelperTestCase -v` Expected: FAIL — `write()` raises `NotImplementedError`. - [ ] **Step 3: Implement `add_tree` and `write`** -Edit `python/lsst/images/zarr/_output_archive.py` — append to `ZarrOutputArchive`: +Append to `python/lsst/images/zarr/_output_archive.py`: ```python - def add_tree( - self, - tree: ArchiveTree, - *, - archive_class: str, - ) -> None: - """Finalize the IR: write the main JSON tree and root attributes. - - Called once after the user's serializer has populated arrays / - sub-trees. Sets the namespaced ``lsst.*`` and ``ome.*`` blocks - on the root group and stages ``/lsst/tree`` as a 1D ``uint8`` - zarr array of UTF-8 JSON. + def add_tree(self, tree: ArchiveTree) -> None: + """Finalize the IR: write JSON tree, WCS, and root attributes. + + Called once after the user's serializer has populated arrays + / sub-trees. Sets the ``lsst.*`` and ``ome.*`` blocks on the + root group, stages ``/tree`` as 1-D ``uint8`` UTF-8 JSON, and + runs the affine residual validator if the archive carries a + frame set. """ - from ._layout import axes_for_archive_class + from ._layout import affine_check, axes_for_archive_class from ._model import OmeMultiscale + # Stage the JSON tree at /tree. json_bytes = tree.model_dump_json().encode("utf-8") - lsst_group = self.document.root.ensure_group("/lsst") - lsst_group.arrays["tree"] = ZarrArray( + self.document.root.arrays["tree"] = ZarrArray( data=np.frombuffer(json_bytes, dtype=np.uint8) ) - # Root attributes. 
- self.document.root.attributes.lsst["archive_class"] = archive_class - self.document.root.attributes.lsst["tree"] = "/lsst/tree" + + # Stage the AST WCS string at /wcs_ast when a frame set is registered. + wcs_ast_path: str | None = None if self._frame_sets: - # The first frame set's pointer is published as the canonical - # one for external tools; pointer paths to all the others - # remain reachable via the JSON tree. - self.document.root.attributes.lsst["frame_set"] = self._frame_sets[0][1].path - if "0" in self.document.root.arrays: - top = self.document.root.arrays["0"] + wcs_ast_path = self._stage_wcs_ast(self._frame_sets[0][0]) + + # Root LSST attrs. + lsst = self.document.root.attributes.lsst + lsst["archive_class"] = self._archive_class + lsst["tree"] = "tree" + if wcs_ast_path is not None: + lsst["wcs_ast"] = wcs_ast_path + if "cell_grid" in self._archive_metadata: + lsst["cell_grid"] = self._archive_metadata["cell_grid"] + + # data_model / version go to the top level (not under lsst:). + self.document.root.attributes.extra["data_model"] = self._data_model_for( + self._archive_class + ) + self.document.root.attributes.extra["version"] = 1 + + # OME multiscale block, gated by axes_for_archive_class. 
+ axes = axes_for_archive_class(self._archive_class) + if axes and "image" in self.document.root.arrays: + image_array = self.document.root.arrays["image"] + ct: list[dict[str, Any]] | None = None + if self._frame_sets: + fs = self._frame_sets[0][0] + check = affine_check( + frame_set=fs._get_ast_frame_set(), + image_shape=image_array.shape, + max_residual_pixels=1.0, + ) + if check.dropped: + lsst["wcs_simplified_dropped"] = True + lsst["wcs_simplified_max_residual_pixels"] = check.max_residual_pixels + else: + lsst["wcs_simplified_dropped"] = False + lsst["wcs_simplified_max_residual_pixels"] = check.max_residual_pixels + ct = check.coordinate_transformations multiscale = OmeMultiscale( - name=archive_class.lower(), - axes=axes_for_archive_class(archive_class), + name=self._archive_class.lower(), + axes=axes, + dataset_path="image", + coordinate_transformations=ct, ) self.document.root.attributes.ome["multiscales"] = [multiscale.dump()] + def _stage_wcs_ast(self, frame_set: FrameSet) -> str: + """Encode an AST FrameSet as a UTF-8 string and stage it at /wcs_ast.""" + from .._transforms._ast import Channel, StringStream + + ast_fs = frame_set._get_ast_frame_set() + stream = StringStream() + Channel(stream, options="Full=-1,Comment=0,Indent=0").write(ast_fs) + text = stream.getSinkData() + self.document.root.arrays["wcs_ast"] = ZarrArray( + data=np.frombuffer(text.encode("utf-8"), dtype=np.uint8) + ) + return "wcs_ast" + + @staticmethod + def _data_model_for(archive_class: str) -> str: + """Map an archive class name to the public ``data_model`` string.""" + return { + "Image": "org.lsst.image", + "Mask": "org.lsst.mask", + "MaskedImage": "org.lsst.masked_image", + "VisitImage": "org.lsst.visit_image", + "ColorImage": "org.lsst.color_image", + "CellCoadd": "org.lsst.cell_coadd", + }.get(archive_class, f"org.lsst.{archive_class.lower()}") + def write( obj: Any, @@ -2018,14 +2682,34 @@ def write( ) -> ArchiveTree: """Write ``obj`` to a zarr archive at ``path``. 
- Parameters mirror the FITS/NDF write helpers. The store implementation - (LocalStore / ZipStore / FsspecStore) is selected from the URI shape - by `lsst.images.zarr._store.open_store_for_write`. + Parameters mirror the FITS / NDF write helpers. The store + implementation (LocalStore / ZipStore / FsspecStore) is selected + from the URI shape by ``_store.open_store_for_write``. """ from ._store import open_store_for_write + archive_class = type(obj).__name__ archive_default_name = getattr(obj, "_archive_default_name", None) - archive = ZarrOutputArchive(chunks=chunks, shards=shards, compression=compression) + archive_metadata: dict[str, Any] = {} + if (cell_shape := getattr(obj, "cell_shape", None)) is not None: + archive_metadata["cell_shape"] = tuple(cell_shape) + if (cell_grid := getattr(obj, "cell_grid", None)) is not None: + archive_metadata["cell_grid"] = { + "bbox": list(cell_grid.bbox) if hasattr(cell_grid, "bbox") else None, + "cell_shape": list(cell_grid.cell_shape) + if hasattr(cell_grid, "cell_shape") + else None, + } + if (mask_schema := getattr(obj, "mask_schema", None)) is not None: + archive_metadata["mask_schema"] = mask_schema + + archive = ZarrOutputArchive( + chunks=chunks, + shards=shards, + compression=compression, + archive_class=archive_class, + archive_metadata=archive_metadata, + ) if archive_default_name is not None: tree = archive.serialize_direct(archive_default_name, obj.serialize) else: @@ -2034,7 +2718,7 @@ def write( tree.metadata.update(metadata) if butler_info is not None: tree.butler_info = butler_info - archive.add_tree(tree, archive_class=type(obj).__name__) + archive.add_tree(tree) with open_store_for_write(path) as store: archive.document.to_zarr(store) return tree @@ -2047,10 +2731,10 @@ from ._common import * # noqa: F401, F403 from ._output_archive import * # noqa: F401, F403 ``` -- [ ] **Step 4: Run the tests to verify they pass** +- [ ] **Step 4: Run the tests** Run: `pytest tests/test_zarr_output_archive.py -v` 
-Expected: PASS — all tests, including the local-directory write. Open the produced `out.zarr` directory with `python -c "import zarr; print(list(zarr.open_group(zarr.storage.LocalStore('/tmp/.../out.zarr', read_only=True)).keys()))"` to spot-check. +Expected: PASS — 10 tests. - [ ] **Step 5: Commit** @@ -2059,12 +2743,12 @@ git add python/lsst/images/zarr/_output_archive.py python/lsst/images/zarr/__ini git commit -m "feat: add ZarrOutputArchive.add_tree and public write() helper" ``` -### Task 2.6: Layout-level write tests for `MaskedImage` and `VisitImage` +### Task 2.8: Layout-level write tests for `MaskedImage` and `VisitImage` **Files:** - Modify: `tests/test_zarr_output_archive.py` -This task pins the on-disk shape for two more archive classes by inspecting the IR after `write()`. No new production code — these tests only catch regressions when Phase 4 changes things. +Pin the on-disk shape for the two harder archive classes. - [ ] **Step 1: Write the test** @@ -2094,19 +2778,31 @@ class ZarrWriteOnDiskShapeTestCase(unittest.TestCase): from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) - image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) masked = MaskedImage(image, mask_schema=schema) + masked.mask.set("BAD", image.array % 2 == 0) doc = self._round_trip_doc(masked) - self.assertEqual(doc.root.attributes.lsst["archive_class"], "MaskedImage") - # Top-level data array. - self.assertIn("0", doc.root.arrays) - # Mask group at /lsst/mask/0. - mask_group = doc.root.groups["lsst"].groups["mask"] - self.assertIn("0", mask_group.arrays) - # Variance group at /lsst/variance/0 (MaskedImage carries one). 
- variance_group = doc.root.groups["lsst"].groups["variance"] - self.assertIn("0", variance_group.arrays) + self.assertEqual( + doc.root.attributes.lsst["archive_class"], "MaskedImage" + ) + # image / variance / mask are sibling root arrays. + self.assertIn("image", doc.root.arrays) + self.assertIn("variance", doc.root.arrays) + self.assertIn("mask", doc.root.arrays) + # Mask is 2-D packed integer with CF flag attrs. + mask = doc.root.arrays["mask"] + self.assertEqual(mask.shape, (4, 5)) + self.assertEqual(mask.attributes.extra["flag_meanings"], "BAD") + # CF / xarray dims on every 2-D array. + for name in ("image", "variance", "mask"): + self.assertEqual( + doc.root.arrays[name].attributes.extra["_ARRAY_DIMENSIONS"], + ["y", "x"], + ) def test_visit_image_layout(self) -> None: import numpy as np @@ -2114,21 +2810,20 @@ class ZarrWriteOnDiskShapeTestCase(unittest.TestCase): from lsst.images import Box, Image, MaskPlane, MaskSchema, VisitImage schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) - image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) - # Construct a minimal VisitImage; the exact constructor depends on - # the public API. The test only cares that write() succeeds and - # produces a valid IR. + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) visit = VisitImage(image=image, mask_schema=schema) - doc = self._round_trip_doc(visit) self.assertEqual(doc.root.attributes.lsst["archive_class"], "VisitImage") - self.assertIn("0", doc.root.arrays) + self.assertIn("image", doc.root.arrays) ``` -- [ ] **Step 2: Run the test** +- [ ] **Step 2: Run the tests** Run: `pytest tests/test_zarr_output_archive.py::ZarrWriteOnDiskShapeTestCase -v` -Expected: PASS — both tests. If the `VisitImage` constructor in this codebase needs different arguments than the snippet, adapt the constructor call only — the assertions on the IR shape stay. +Expected: PASS — both tests. 
If `VisitImage`'s constructor in this codebase needs different arguments than the snippet, adapt the constructor call only — the on-disk assertions stay. - [ ] **Step 3: Commit** @@ -2139,26 +2834,109 @@ git commit -m "test: pin on-disk zarr layout for MaskedImage and VisitImage" --- -**End of Phase 2.** Six tasks. The output side now produces valid OME-Zarr archives for `Image`, `MaskedImage`, `VisitImage` with namespaced `lsst.*` attributes and a JSON tree at `/lsst/tree`. Phase 3 inverts this: `ZarrInputArchive`, `read()`, and the lazy-subset assertions that prove `slices=` only fetches the touched chunks. +**End of Phase 2.** Eight tasks. The output side now produces: + +- `image`, `variance`, `mask` siblings at the root with aligned chunks +- 2-D packed-integer mask with CF `flag_masks` / `flag_meanings` / `flag_descriptions` +- `_ARRAY_DIMENSIONS` and `units` / `long_name` per array (xarray-readable) +- OME multiscales metadata pointing at `/image` +- Affine `coordinateTransformations` validated against an 11×11 grid; dropped to unit scale when residual exceeds 1 pixel +- `wcs_ast` 1-D `uint8` array as the authoritative WCS round-trip source + +Phase 3 inverts this: `ZarrInputArchive`, `read()`, and the lazy-subset assertions that prove `slices=` only fetches the touched chunks. + +## Phase 3 — `ZarrInputArchive`, `read()`, lazy subset enforcement, mask unpack -## Phase 3 — `ZarrInputArchive`, `read()`, and lazy subset enforcement +This phase delivers the read side. The hard constraint is the **lazy subset invariant**: `get_array(model, slices=...)` must forward `slices` to the underlying `zarr.Array` handle so a 4×4 subset of a 4096×4096 remote VisitImage downloads only the chunks intersecting that slice. The phase ships with a `_CountingStore`-based regression test that fails if any code path materializes the full array before slicing. -This phase delivers the read side. 
The hard constraint here is the **lazy subset invariant**: `get_array(model, slices=...)` must forward `slices` to the underlying `zarr.Array` handle so a 4×4 subset of a 4096×4096 remote VisitImage downloads only the chunks intersecting that slice, never the full array. The phase ships with a `_CountingStore`-based regression test that *fails* if any code path materializes the full array before slicing. +The phase also adds the **mask unpack path**: `Mask.serialize` (when the archive sets `_prefer_native_mask_arrays = True`) hands us a 3-D `(y, x, mask_size)` array which Phase 2's `add_array` packs to 2-D wide-integer; on read, `get_array` detects the rank mismatch (model claims 3-D, on-disk is 2-D, on-disk has `flag_masks` attribute) and unpacks via bit shifts. -### Task 3.1: `ZarrInputArchive` skeleton — open + `get_tree` +### Task 3.0: Wire up `_prefer_native_mask_arrays` + +**Files:** +- Modify: `python/lsst/images/zarr/_output_archive.py` +- Test: `tests/test_zarr_round_trip.py` (later in this phase confirms it round-trips) + +A one-line retrofit to Phase 2 to make `Mask.serialize` choose the native 3-D path for our archive (matching what the NDF backend does). Without this, `Mask.serialize` calls `add_array` multiple times with 2-D `int32` splits and our packing path never runs. + +- [ ] **Step 1: Add the class attribute** + +In `python/lsst/images/zarr/_output_archive.py`, edit the `ZarrOutputArchive` class definition to add the class attribute right above `__init__`: + +```python +class ZarrOutputArchive(OutputArchive[ZarrPointerModel]): + """Output archive that populates a `ZarrDocument` IR. + + ... (existing docstring) ... + """ + + _prefer_native_mask_arrays: ClassVar[bool] = True + """Tell Mask.serialize to hand us the 3-D ``(y, x, mask_size)`` + array in one ``add_array`` call. Our ``add_array`` packs that into + a 2-D wide-integer array on disk with CF flag_masks / flag_meanings + attributes. + """ + + def __init__(...): + ... 
+``` + +(Add `from typing import ClassVar` to the imports if it is not already present.) + +- [ ] **Step 2: Run the existing tests to confirm no regression** + +Run: `pytest tests/test_zarr_output_archive.py -v` +Expected: PASS — all 10 Phase 2 tests still pass; the class attribute does not change behavior for direct `add_array(3D)` calls. + +- [ ] **Step 3: Commit** + +```bash +git add python/lsst/images/zarr/_output_archive.py +git commit -m "feat: opt ZarrOutputArchive into native 3-D mask serialization" +``` + +### Task 3.1: `ZarrInputArchive` skeleton — open + `get_tree` + error taxonomy **Files:** - Create: `python/lsst/images/zarr/_input_archive.py` - Test: `tests/test_zarr_input_archive.py` -The constructor takes a `zarr.storage.Store` and builds a `ZarrDocument` via `from_zarr` (lazy — no chunk reads). `get_tree(model_type)` reads the JSON bytes at `/lsst/tree` and validates them with `model_type.model_validate_json`. The `open` classmethod is a context manager that delegates store creation to `_store.open_store_for_read`. +Constructor takes a `ZarrDocument` (built lazily via `from_zarr`). `get_tree(model_type)` reads `/tree`'s bytes and validates them. The `open` classmethod is a context manager around `_store.open_store_for_read`. -Error taxonomy: -- Missing root metadata → `ArchiveReadError("File has no zarr.json")` -- Missing `lsst.archive_class` attribute → `ArchiveReadError("File is not an LSST zarr archive")` -- `lsst.version` newer than `LSST_VERSION` → `ArchiveReadError(f"Unsupported lsst:version {N}")` +Error taxonomy (per spec §4): +- Missing `lsst.archive_class` → `ArchiveReadError("File is not an LSST zarr archive")`. +- `lsst.version` newer than `LSST_VERSION` → `ArchiveReadError("Unsupported lsst:version ")`. -- [ ] **Step 1: Write the failing test** +`ZarrAttributes.load` keeps the on-disk `version` under a private sentinel `__version_remembered_at_load__` so the input archive can validate without going back to the raw store. 
+ +- [ ] **Step 1: Update `ZarrAttributes.load` / `dump` to round-trip the version sentinel** + +In `python/lsst/images/zarr/_model.py`, change `ZarrAttributes.load` to keep the version under a private key, and `dump` to ignore that key: + +```python + @classmethod + def load(cls, raw: dict[str, Any]) -> Self: + lsst = dict(raw.get(LSST_NS, {})) + version = lsst.pop("version", None) + if version is not None: + lsst["__version_remembered_at_load__"] = version + ome = dict(raw.get(OME_NS, {})) + ome.pop("version", None) + extra = {k: v for k, v in raw.items() if k not in (LSST_NS, OME_NS)} + return cls(lsst=lsst, ome=ome, extra=extra) + + def dump(self) -> dict[str, Any]: + out: dict[str, Any] = dict(self.extra) + public_lsst = { + k: v for k, v in self.lsst.items() if not k.startswith("__") + } + out[LSST_NS] = {"version": LSST_VERSION, **public_lsst} + if self.ome: + out[OME_NS] = {"version": OME_VERSION, **self.ome} + return out +``` + +- [ ] **Step 2: Write the failing test** Create `tests/test_zarr_input_archive.py`: @@ -2188,31 +2966,30 @@ try: from lsst.images.serialization import ArchiveReadError from lsst.images.zarr._common import LSST_NS, LSST_VERSION from lsst.images.zarr._input_archive import ZarrInputArchive - from lsst.images.zarr._output_archive import ZarrOutputArchive + from lsst.images.zarr._model import ZarrDocument HAVE_ZARR = True except ImportError: HAVE_ZARR = False -def _write_minimal_image(target: str) -> None: - from lsst.images import Box, Image - from lsst.images.zarr import write - - write(Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]), target) - - @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") class ZarrInputArchiveSkeletonTestCase(unittest.TestCase): def test_open_reads_tree(self) -> None: + from lsst.images import Box, Image + from lsst.images.zarr import write + from lsst.images._image import ImageSerializationModel + + original = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), 
+ bbox=Box.factory[10:14, 20:25], + ) with tempfile.TemporaryDirectory() as tmp: target = os.path.join(tmp, "out.zarr") - _write_minimal_image(target) + write(original, target) with ZarrInputArchive.open(target) as archive: - from lsst.images._image import ImageSerializationModel - tree = archive.get_tree(ImageSerializationModel) - self.assertEqual(list(tree.image.shape), [4, 5]) + self.assertIsNotNone(tree) def test_missing_archive_class_raises(self) -> None: with tempfile.TemporaryDirectory() as tmp: @@ -2231,7 +3008,13 @@ class ZarrInputArchiveSkeletonTestCase(unittest.TestCase): store = zarr.storage.LocalStore(target, read_only=False) root = zarr.create_group(store=store, zarr_format=3) root.update_attributes( - {LSST_NS: {"version": LSST_VERSION + 1, "archive_class": "Image", "tree": "/lsst/tree"}} + { + LSST_NS: { + "version": LSST_VERSION + 1, + "archive_class": "Image", + "tree": "tree", + } + } ) with self.assertRaisesRegex(ArchiveReadError, "Unsupported lsst:version"): with ZarrInputArchive.open(target): @@ -2242,12 +3025,12 @@ if __name__ == "__main__": unittest.main() ``` -- [ ] **Step 2: Run the test to verify it fails** +- [ ] **Step 3: Run the test to verify it fails** Run: `pytest tests/test_zarr_input_archive.py -v` -Expected: FAIL — `ImportError` on `_input_archive`. +Expected: FAIL — `ImportError`. -- [ ] **Step 3: Write `_input_archive.py`** +- [ ] **Step 4: Write `_input_archive.py`** Create `python/lsst/images/zarr/_input_archive.py`: @@ -2289,21 +3072,12 @@ from ..serialization import ( TableModel, no_header_updates, ) -from ._common import LSST_NS, LSST_VERSION, ZarrPointerModel +from ._common import LSST_VERSION, ZarrPointerModel from ._model import ZarrArray, ZarrDocument class ZarrInputArchive(InputArchive[ZarrPointerModel]): - """Reads zarr archives written by `ZarrOutputArchive`. - - Built around a `ZarrDocument` whose `ZarrArray` nodes hold lazy - ``zarr.Array`` handles. 
``get_array(model, slices=...)`` forwards - slices straight to those handles, so subset reads on remote stores - only fetch the touched chunks. - - Instances should only be constructed via the :meth:`open` context - manager. - """ + """Reads zarr archives written by `ZarrOutputArchive`.""" def __init__(self, document: ZarrDocument) -> None: self._document = document @@ -2326,31 +3100,29 @@ class ZarrInputArchive(InputArchive[ZarrPointerModel]): return self._document def get_tree[T: ArchiveTree](self, model_type: type[T]) -> T: - """Read and validate the main Pydantic tree at ``/lsst/tree``.""" + """Read and validate the main Pydantic tree at ``/tree``.""" try: - node = self._document.root.get("/lsst/tree") + node = self._document.root.get("/tree") except KeyError: raise ArchiveReadError( - "File has no /lsst/tree array; this is not an LSST zarr archive." + "File has no /tree array; this is not an LSST zarr archive." ) from None if not isinstance(node, ZarrArray): - raise ArchiveReadError("/lsst/tree must be a zarr array, not a group.") + raise ArchiveReadError("/tree must be a zarr array, not a group.") json_bytes = bytes(node.read()) return model_type.model_validate_json(json_bytes.decode("utf-8")) def _validate_root_attributes(self) -> None: attrs = self._document.root.attributes.lsst if "archive_class" not in attrs: - raise ArchiveReadError("File is not an LSST zarr archive (missing lsst.archive_class).") - # ZarrAttributes.load() strips "version", so re-fetch from the raw store. - # Easiest: re-derive via the document root's underlying group, but we - # don't have that handle. Stash version on the IR instead. + raise ArchiveReadError( + "File is not an LSST zarr archive (missing lsst.archive_class)." + ) version = attrs.get("__version_remembered_at_load__", LSST_VERSION) - # The plan task immediately following persists the version through - # ZarrAttributes; for now treat absence as compatible. 
if version > LSST_VERSION: raise ArchiveReadError( - f"Unsupported lsst:version {version}; this reader supports up to {LSST_VERSION}." + f"Unsupported lsst:version {version}; this reader supports up " + f"to {LSST_VERSION}." ) # The remaining abstract methods land in subsequent tasks. @@ -2375,61 +3147,34 @@ def read(*args: Any, **kwargs: Any) -> Any: raise NotImplementedError("read() lands in Task 3.5") ``` -Add `version` round-trip to `_model.py` so `_validate_root_attributes` can see it. In `python/lsst/images/zarr/_model.py`, change `ZarrAttributes.load` to **keep** the version under a private key: - -```python - @classmethod - def load(cls, raw: dict[str, Any]) -> Self: - lsst = dict(raw.get(LSST_NS, {})) - version = lsst.pop("version", None) - if version is not None: - lsst["__version_remembered_at_load__"] = version - ome = dict(raw.get(OME_NS, {})) - ome.pop("version", None) - return cls(lsst=lsst, ome=ome) -``` - -…and skip that key in `dump`: - -```python - def dump(self) -> dict[str, Any]: - out: dict[str, Any] = {} - public_lsst = {k: v for k, v in self.lsst.items() if not k.startswith("__")} - out[LSST_NS] = {"version": LSST_VERSION, **public_lsst} - if self.ome: - out[OME_NS] = {"version": OME_VERSION, **self.ome} - return out -``` - -(Update the existing `test_load_preserves_unknown_keys` if needed — the assertion is about `future_thing`, not the private version sentinel, so it still passes.) - -- [ ] **Step 4: Run the tests to verify they pass** +- [ ] **Step 5: Run all relevant tests** -Run: `pytest tests/test_zarr_input_archive.py -v tests/test_zarr_model.py -v` -Expected: PASS — input-archive skeleton tests pass; existing model tests still pass. +Run: `pytest tests/test_zarr_input_archive.py tests/test_zarr_model.py -v` +Expected: PASS — input archive skeleton tests pass; the version-sentinel update does not break model tests. 
-- [ ] **Step 5: Commit** +- [ ] **Step 6: Commit** ```bash git add python/lsst/images/zarr/_input_archive.py python/lsst/images/zarr/_model.py tests/test_zarr_input_archive.py git commit -m "feat: add ZarrInputArchive skeleton with get_tree and version validation" ``` -### Task 3.2: `get_array` — lazy slice forwarding +### Task 3.2: `get_array` — lazy slice forwarding + mask unpack **Files:** - Modify: `python/lsst/images/zarr/_input_archive.py` - Modify: `tests/test_zarr_input_archive.py` -This is the load-bearing task for the lazy-subset invariant. `get_array(model, slices=...)`: +`get_array(model, slices=...)`: -1. Resolves the model's `source` field (`f"zarr:{path}"`) to a zarr path inside the IR. -2. Fetches the `ZarrArray` IR node — still lazy. -3. Calls `ir_array.read(slices=slices)`, which forwards directly to the `zarr.Array` handle. +1. Resolve the model's `source` (always plain `zarr:/` — no query suffix). +2. Fetch the `ZarrArray` IR node — still lazy. +3. **Mask unpack:** if the model claims a 3-D `(y, x, mask_size)` shape but the on-disk array is 2-D and carries a `flag_masks` attribute, slice the 2-D array first (forwarding `slices` unchanged if it has rank 2, or `slices[:-1]` if a rank-3 slice was requested) and unpack via bit shifts to reconstruct the 3-D mask. +4. Otherwise call `ir_array.read(slices=slices)`, forwarding directly to the lazy handle. -The test uses a `_CountingStore` that counts every `get` call against the underlying store and asserts the subset read touches strictly fewer chunk keys than a full read of the same array. +The lazy invariant test uses `_CountingStore` to count chunk fetches and asserts a single-chunk subset of a 16×16 / chunks=(4,4) array touches strictly fewer keys than a full read. 
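The mask unpack step can be sketched with plain NumPy as below. This is a minimal illustration under the byte order stated in the Phase 2 packing code (byte 0 is the low byte), not the `get_array` implementation itself; `unpack_mask` is a hypothetical name, and applying a third slice to the unpacked byte axis is an assumption.

```python
import numpy as np


def unpack_mask(packed, mask_size, slices=None):
    """Expand a 2-D packed integer mask into (y, x, mask_size) uint8 bytes.

    Hypothetical helper. Slicing happens on the 2-D array first, so a
    lazy handle would only need to fetch the chunks the slice touches.
    """
    byte_slice = slice(None)
    if slices is not None:
        if len(slices) == 3:
            byte_slice = slices[2]      # assumed to select unpacked bytes
        packed = packed[tuple(slices[:2])]
    out = np.empty(packed.shape + (mask_size,), dtype=np.uint8)
    for i in range(mask_size):
        # Byte i holds planes 8*i .. 8*i + 7.
        out[..., i] = (packed >> (8 * i)) & 0xFF
    return out[..., byte_slice]


packed = np.zeros((4, 5), dtype=np.uint64)
packed[0, 0] = 0x01_02_03_04_05
full = unpack_mask(packed, mask_size=5)
subset = unpack_mask(packed, mask_size=5, slices=(slice(0, 2), slice(0, 2)))
```

With this byte order the low byte `0x05` of the packed value lands at index 0 of the last axis, which matches the way the writer shifts byte `i` left by `8 * i`.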
-- [ ] **Step 1: Write the failing test** +- [ ] **Step 1: Write the failing tests** Append to `tests/test_zarr_input_archive.py`: @@ -2448,18 +3193,11 @@ class _CountingStore(zarr.storage.MemoryStore if HAVE_ZARR else object): @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") class ZarrInputArchiveLazySubsetTestCase(unittest.TestCase): - """Pins the lazy-subset invariant from the design spec. - - A subset read of a chunked image must touch a strict subset of the - chunk keys that a full read would. This is what makes remote - VisitImage subsetting cheap. - """ + """Lazy-subset invariant: subset reads only fetch touched chunks.""" def test_subset_read_touches_only_intersecting_chunks(self) -> None: from lsst.images.serialization import ArrayReferenceModel, NumberType - # Build a 16x16 / chunks=(4,4) zarr archive with the LSST root - # attributes wired up so ZarrInputArchive accepts it. store = _CountingStore() root = zarr.create_group(store=store, zarr_format=3) root.update_attributes( @@ -2467,55 +3205,151 @@ class ZarrInputArchiveLazySubsetTestCase(unittest.TestCase): LSST_NS: { "version": LSST_VERSION, "archive_class": "Image", - "tree": "/lsst/tree", - }, + "tree": "tree", + } } ) - zarr_array = root.create_array(name="0", shape=(16, 16), chunks=(4, 4), dtype="float32") + zarr_array = root.create_array( + name="image", shape=(16, 16), chunks=(4, 4), dtype="float32" + ) zarr_array[:] = np.arange(256, dtype=np.float32).reshape(16, 16) - # Stage a stub /lsst/tree primitive so the input archive's - # constructor doesn't blow up on get_tree (we won't call it). - lsst_group = root.create_group("lsst") - lsst_group.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" + # Stub /tree so the input archive's constructor accepts the file. + root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" - # Open via the ZarrInputArchive (this only reads metadata). 
doc = ZarrDocument.from_zarr(store) archive = ZarrInputArchive(doc) - # Reset the counter; we want to count only what get_array does. store.reads = 0 full_ref = ArrayReferenceModel( - source="zarr:/0", shape=[16, 16], datatype=NumberType.from_numpy(np.dtype("float32")) + source="zarr:/image", + shape=[16, 16], + datatype=NumberType.from_numpy(np.dtype("float32")), ) - # Full read for the baseline. full = archive.get_array(full_ref) full_reads = store.reads self.assertEqual(full.shape, (16, 16)) - # Reset and read a single-chunk subset. store.reads = 0 subset = archive.get_array(full_ref, slices=(slice(0, 4), slice(0, 4))) subset_reads = store.reads self.assertEqual(subset.shape, (4, 4)) np.testing.assert_array_equal(subset, np.arange(256).reshape(16, 16)[:4, :4]) - - # Critical assertion: subset read fetched strictly fewer keys. self.assertLess(subset_reads, full_reads) -``` -- [ ] **Step 2: Run the test to verify it fails** -Run: `pytest tests/test_zarr_input_archive.py::ZarrInputArchiveLazySubsetTestCase -v` -Expected: FAIL — `get_array` raises `NotImplementedError`. +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrInputArchiveMaskUnpackTestCase(unittest.TestCase): + """Round-trip a packed 2-D mask through get_array's unpack path.""" -- [ ] **Step 3: Implement `get_array`** + def test_unpack_2d_packed_back_to_3d(self) -> None: + from lsst.images.serialization import ArrayReferenceModel, NumberType -Replace the `get_array` placeholder in `_input_archive.py`: + # Build an archive that has a 2-D packed mask on disk. + store = zarr.storage.MemoryStore() + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes( + { + LSST_NS: { + "version": LSST_VERSION, + "archive_class": "Mask", + "tree": "tree", + } + } + ) + # 4x5 mask, 3 planes -> packed in uint8. 
+ on_disk = np.zeros((4, 5), dtype=np.uint8) + on_disk[0, 0] = 0b001 # plane 0 + on_disk[1, 1] = 0b110 # planes 1+2 + mask_array = root.create_array( + name="mask", shape=(4, 5), chunks=(4, 5), dtype="uint8" + ) + mask_array[:] = on_disk + mask_array.update_attributes( + { + "_ARRAY_DIMENSIONS": ["y", "x"], + "flag_masks": [1, 2, 4], + "flag_meanings": "BAD SAT CR", + "flag_descriptions": ["Bad pixel.", "Saturated.", "Cosmic ray."], + } + ) + root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" -```python - def get_array( - self, - model: ArrayReferenceModel | InlineArrayModel, + doc = ZarrDocument.from_zarr(store) + archive = ZarrInputArchive(doc) + + # The model claims a 3-D shape (mask_size = 1 because <=8 planes). + model = ArrayReferenceModel( + source="zarr:/mask", + shape=[4, 5, 1], + datatype=NumberType.from_numpy(np.dtype("uint8")), + ) + result = archive.get_array(model) + self.assertEqual(result.shape, (4, 5, 1)) + self.assertEqual(result[0, 0, 0], 0b001) + self.assertEqual(result[1, 1, 0], 0b110) + + def test_unpack_uint64_with_5_bytes(self) -> None: + from lsst.images.serialization import ArrayReferenceModel, NumberType + + # 40 planes packed into uint64 -> mask_size = 5. + store = zarr.storage.MemoryStore() + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes( + { + LSST_NS: { + "version": LSST_VERSION, + "archive_class": "Mask", + "tree": "tree", + } + } + ) + on_disk = np.zeros((4, 5), dtype=np.uint64) + on_disk[0, 0] = 0x01_02_03_04_05 # arbitrary bit pattern + mask_array = root.create_array( + name="mask", shape=(4, 5), chunks=(4, 5), dtype="uint64" + ) + mask_array[:] = on_disk + mask_array.update_attributes( + { + "_ARRAY_DIMENSIONS": ["y", "x"], + "flag_masks": [1 << i for i in range(40)], + "flag_meanings": " ".join(f"P{i}" for i in range(40)), + "flag_descriptions": [f"Plane {i}." 
for i in range(40)], + } + ) + root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" + + doc = ZarrDocument.from_zarr(store) + archive = ZarrInputArchive(doc) + + model = ArrayReferenceModel( + source="zarr:/mask", + shape=[4, 5, 5], + datatype=NumberType.from_numpy(np.dtype("uint8")), + ) + result = archive.get_array(model) + self.assertEqual(result.shape, (4, 5, 5)) + # Bytes recovered from the packed uint64. + self.assertEqual(result[0, 0, 0], 0x05) # low byte + self.assertEqual(result[0, 0, 1], 0x04) + self.assertEqual(result[0, 0, 2], 0x03) + self.assertEqual(result[0, 0, 3], 0x02) + self.assertEqual(result[0, 0, 4], 0x01) +``` + +- [ ] **Step 2: Run the tests to verify they fail** + +Run: `pytest tests/test_zarr_input_archive.py -v` +Expected: FAIL — `get_array` raises `NotImplementedError` for both new test classes. + +- [ ] **Step 3: Implement `get_array`** + +In `python/lsst/images/zarr/_input_archive.py`, replace the `get_array` placeholder: + +```python + def get_array( + self, + model: ArrayReferenceModel | InlineArrayModel, *, slices: tuple[slice, ...] | EllipsisType = ..., strip_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, @@ -2535,25 +3369,68 @@ Replace the `get_array` placeholder in `_input_archive.py`: raise ArchiveReadError(f"Array reference {zarr_path!r} not in store.") from None if not isinstance(node, ZarrArray): raise ArchiveReadError(f"{zarr_path!r} is not an array.") - # The lazy invariant: ZarrArray.read forwards slices straight - # to the underlying zarr.Array handle. Only the chunks - # intersecting `slices` are fetched. + + # Mask unpack: model claims 3-D (y, x, mask_size); on-disk is 2-D + # (y, x) packed wide-int with flag_masks attribute. 
+ claimed_shape = tuple(model.shape) if model.shape is not None else None + if ( + claimed_shape is not None + and len(claimed_shape) == 3 + and len(node.shape) == 2 + and "flag_masks" in node.attributes.extra + ): + return self._read_packed_mask(node, claimed_shape, slices) + + # Standard path: forward slices straight to the lazy handle. return node.read(slices=slices) + + def _read_packed_mask( + self, + node: ZarrArray, + claimed_shape: tuple[int, ...], + slices: tuple[slice, ...] | EllipsisType, + ) -> np.ndarray: + """Unpack a 2-D wide-int mask back to 3-D ``(y, x, mask_size)``. + + ``slices`` is forwarded to the underlying handle as-is when it + has rank 2; rank-3 slices have their last axis stripped and + re-applied after the unpack. + """ + mask_size = claimed_shape[2] + # Forward 2-D slice to the lazy handle; only intersecting + # chunks are fetched even on remote stores. + if slices is ...: + spatial_slices: tuple[slice, ...] | EllipsisType = ... + byte_slice: slice | EllipsisType = ... + elif len(slices) == 3: + spatial_slices = slices[:2] + byte_slice = slices[2] + else: + spatial_slices = slices + byte_slice = ... + packed = node.read(slices=spatial_slices) + # Unpack: low byte first. + out = np.empty(packed.shape + (mask_size,), dtype=np.uint8) + for i in range(mask_size): + out[..., i] = (packed >> np.uint64(8 * i)) & np.uint64(0xFF) + if byte_slice is ...: + return out + return out[..., byte_slice] ``` - [ ] **Step 4: Run the tests to verify they pass** -Run: `pytest tests/test_zarr_input_archive.py::ZarrInputArchiveLazySubsetTestCase -v` -Expected: PASS — `subset_reads < full_reads`. +Run: `pytest tests/test_zarr_input_archive.py -v` +Expected: PASS — lazy-subset invariant holds, mask unpack recovers both single-byte and five-byte packings. 
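For a concrete view of the five-byte case, the unpack arithmetic can be reproduced with plain Python integers. This is a sketch, not the archive code: `unpack_low_byte_first` is a hypothetical name, and it operates on a single packed pixel value rather than on a NumPy array.

```python
def unpack_low_byte_first(packed: int, mask_size: int) -> list[int]:
    """Split a packed mask value into ``mask_size`` bytes, low byte first.

    Hypothetical helper mirroring the shift-and-mask loop for one pixel.
    """
    return [(packed >> (8 * i)) & 0xFF for i in range(mask_size)]


# The bit pattern used by the 40-plane test: five bytes held in a uint64.
pixel = 0x01_02_03_04_05
unpacked = unpack_low_byte_first(pixel, 5)
print([hex(b) for b in unpacked])  # ['0x5', '0x4', '0x3', '0x2', '0x1']

# The same ordering as a little-endian byte dump of the integer.
assert bytes(unpacked) == pixel.to_bytes(5, "little")

# A 3-plane mask fits in one byte, so mask_size is 1.
assert unpack_low_byte_first(0b110, 1) == [0b110]
```

The low-byte-first ordering is what makes index 0 of the last axis hold `0x05` in the test assertions.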
- [ ] **Step 5: Commit** ```bash git add python/lsst/images/zarr/_input_archive.py tests/test_zarr_input_archive.py -git commit -m "feat: implement lazy slice forwarding in ZarrInputArchive.get_array" +git commit -m "feat: implement ZarrInputArchive.get_array with lazy slices and mask unpack" ``` -### Task 3.3: `deserialize_pointer`, `serialize_frame_set` round-trip, `get_frame_set` +### Task 3.3: `deserialize_pointer`, `get_frame_set`, AST WCS reconstruction **Files:** - Modify: `python/lsst/images/zarr/_input_archive.py` @@ -2561,10 +3438,14 @@ git commit -m "feat: implement lazy slice forwarding in ZarrInputArchive.get_arr `deserialize_pointer(pointer, model_type, deserializer)`: -1. Looks up the cached deserialized object by `pointer.path`; returns it if present. -2. Reads the JSON bytes at `pointer.path` (a `ZarrArray` of `uint8`). -3. Validates with `model_type.model_validate_json` and calls `deserializer(model, self)`. -4. Caches by `pointer.path` and, if the result is a `FrameSet`, also caches it under `_frame_set_cache` so `get_frame_set` can return it later without re-deserialization. +1. Cache hit by `pointer.path` → return cached object. +2. Read JSON bytes at `pointer.path` (a `ZarrArray` of `uint8`). +3. Validate via `model_type.model_validate_json` and call `deserializer(model, self)`. +4. Cache the result; if it is a `FrameSet`, also cache it under `_frame_set_cache` so `get_frame_set` can return it. + +For `Projection.deserialize` to find the AST WCS, the Projection serialization model carries a `ZarrPointerModel` referencing `/wcs_ast` (set by `add_tree` in Phase 2). When that pointer is deserialized, the deserializer reads the AST string bytes via `get_array` (the `wcs_ast` array is plain `uint8` so `get_array` returns it as-is) and reconstructs the FrameSet with `astshim.Object.fromString`. 
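The four `deserialize_pointer` steps listed above amount to memoizing by path. A minimal sketch, assuming a plain dict stands in for the store and `json.loads` stands in for `model_type.model_validate_json`. `PointerCache` and its attribute names are hypothetical and are not the `ZarrInputArchive` API.

```python
import json
from typing import Any, Callable


class PointerCache:
    """Memoize deserialized objects by pointer path (illustrative only)."""

    def __init__(self, store: dict[str, bytes]) -> None:
        self._store = store  # path -> raw JSON bytes
        self._cache: dict[str, Any] = {}

    def deserialize_pointer(self, path: str, deserializer: Callable[[dict], Any]) -> Any:
        # Step 1: a cache hit returns the previously built object.
        if path in self._cache:
            return self._cache[path]
        # Step 2: read the JSON bytes at the pointer path.
        try:
            raw = self._store[path]
        except KeyError:
            raise LookupError(f"Pointer reference {path!r} not in store.") from None
        # Step 3: parse, then hand the model to the deserializer.
        model = json.loads(raw.decode("utf-8"))
        result = deserializer(model)
        # Step 4: cache by path so later lookups skip the work.
        self._cache[path] = result
        return result


calls = []


def deserializer(model: dict) -> dict:
    calls.append(model["label"])
    return model


cache = PointerCache({"/psf/tree": b'{"label": "psf"}'})
first = cache.deserialize_pointer("/psf/tree", deserializer)
second = cache.deserialize_pointer("/psf/tree", deserializer)
print(first is second, len(calls))  # True 1
```

The second lookup returns the identical object without invoking the deserializer again, which is the behaviour the pointer-cache test checks.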
+ +The AST reconstruction is performed inside the projection deserializer, not the input archive — but the input archive needs to expose the bytes at `/wcs_ast` so the deserializer can call `get_array` on it. That happens automatically since `/wcs_ast` is just a regular zarr array. - [ ] **Step 1: Write the failing test** @@ -2574,10 +3455,6 @@ Append to `tests/test_zarr_input_archive.py`: @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") class ZarrInputArchivePointerTestCase(unittest.TestCase): def test_deserialize_pointer_caches_results(self) -> None: - # Write an archive containing one Image, then re-open and - # deserialize via the public read() helper once it lands. For - # this skeleton task, build a hand-rolled archive that contains - # a JSON sub-tree and call deserialize_pointer directly. import pydantic from lsst.images.zarr._common import ZarrPointerModel @@ -2585,20 +3462,21 @@ class ZarrInputArchivePointerTestCase(unittest.TestCase): class _Sub(pydantic.BaseModel): label: str - # Build an archive with /lsst/psf/tree containing JSON. store = zarr.storage.MemoryStore() root = zarr.create_group(store=store, zarr_format=3) root.update_attributes( - {LSST_NS: {"version": LSST_VERSION, "archive_class": "Image", "tree": "/lsst/tree"}} + {LSST_NS: {"version": LSST_VERSION, "archive_class": "Image", "tree": "tree"}} ) - # Stub /lsst/tree. - lsst_group = root.create_group("lsst") - lsst_group.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" - # /lsst/psf/tree with a JSON document. + # Stub /tree. + root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" + # Sub-archive with its own /tree at /psf/tree. 
json_bytes = b'{"label": "psf"}' - psf_group = lsst_group.create_group("psf") - arr = psf_group.create_array( - name="tree", shape=(len(json_bytes),), chunks=(len(json_bytes),), dtype="uint8" + psf = root.create_group("psf") + arr = psf.create_array( + name="tree", + shape=(len(json_bytes),), + chunks=(len(json_bytes),), + dtype="uint8", ) arr[:] = np.frombuffer(json_bytes, dtype=np.uint8) @@ -2611,7 +3489,7 @@ class ZarrInputArchivePointerTestCase(unittest.TestCase): deserialize_calls.append(1) return model - pointer = ZarrPointerModel(path="/lsst/psf/tree") + pointer = ZarrPointerModel(path="/psf/tree") first = archive.deserialize_pointer(pointer, _Sub, deserializer) second = archive.deserialize_pointer(pointer, _Sub, deserializer) self.assertEqual(first.label, "psf") @@ -2619,14 +3497,14 @@ class ZarrInputArchivePointerTestCase(unittest.TestCase): self.assertEqual(len(deserialize_calls), 1) ``` -- [ ] **Step 2: Run the test to verify it fails** +- [ ] **Step 2: Run to verify failure** Run: `pytest tests/test_zarr_input_archive.py::ZarrInputArchivePointerTestCase -v` Expected: FAIL — `deserialize_pointer` raises `NotImplementedError`. - [ ] **Step 3: Implement `deserialize_pointer` and `get_frame_set`** -Replace the placeholders in `_input_archive.py`: +Replace the placeholders in `python/lsst/images/zarr/_input_archive.py`: ```python def deserialize_pointer[U: ArchiveTree, V]( @@ -2640,7 +3518,9 @@ Replace the placeholders in `_input_archive.py`: try: node = self._document.root.get(pointer.path) except KeyError: - raise ArchiveReadError(f"Pointer reference {pointer.path!r} not in store.") from None + raise ArchiveReadError( + f"Pointer reference {pointer.path!r} not in store." 
+ ) from None if not isinstance(node, ZarrArray): raise ArchiveReadError(f"Pointer target {pointer.path!r} is not an array.") json_text = bytes(node.read()).decode("utf-8") @@ -2661,7 +3541,7 @@ Replace the placeholders in `_input_archive.py`: ) from None ``` -- [ ] **Step 4: Run the test to verify it passes** +- [ ] **Step 4: Run the tests** Run: `pytest tests/test_zarr_input_archive.py -v` Expected: PASS — pointer-cache test asserts the deserializer is called exactly once. @@ -2679,7 +3559,7 @@ git commit -m "feat: implement deserialize_pointer and get_frame_set" - Modify: `python/lsst/images/zarr/_input_archive.py` - Modify: `tests/test_zarr_input_archive.py` -Mirrors the FITS implementation: the `TableModel` carries one `ArrayReferenceModel` per column whose `source` is a `zarr:/lsst/tables//` path. `get_table` resolves each column via `get_array` (so subset semantics propagate), then builds an `astropy.table.Table`. `get_structured_array` returns the same data as a numpy structured array. +Mirrors the FITS implementation: each column is a separate `ArrayReferenceModel(source=f"zarr:/lsst/tables//")` resolved via `get_array`. - [ ] **Step 1: Write the failing test** @@ -2690,22 +3570,24 @@ Append to `tests/test_zarr_input_archive.py`: class ZarrInputArchiveTableTestCase(unittest.TestCase): def test_get_table_reconstructs_columns(self) -> None: import astropy.table - import numpy as np + from lsst.images.zarr._model import ZarrArray from lsst.images.zarr._output_archive import ZarrOutputArchive - # Stage a table via the output archive, then read it back. out = ZarrOutputArchive() - # Wire up the LSST root attributes so the input archive accepts it. + # Wire up the LSST root attributes. 
out.document.root.attributes.lsst["archive_class"] = "Image" - out.document.root.attributes.lsst["tree"] = "/lsst/tree" - out.document.root.ensure_group("/lsst").arrays["tree"] = ZarrArray( + out.document.root.attributes.lsst["tree"] = "tree" + out.document.root.arrays["tree"] = ZarrArray( data=np.frombuffer(b"{}", dtype=np.uint8) ) original = astropy.table.Table( - {"x": np.arange(4, dtype=np.int32), "y": np.arange(4, dtype=np.float32)} + { + "x": np.arange(4, dtype=np.int32), + "y": np.arange(4, dtype=np.float32), + } ) - model = out.add_table(original, name="/cat") + model = out.add_table(original, name="cat") store = zarr.storage.MemoryStore() out.document.to_zarr(store) @@ -2718,7 +3600,7 @@ class ZarrInputArchiveTableTestCase(unittest.TestCase): np.testing.assert_array_equal(recovered["y"], original["y"]) ``` -- [ ] **Step 2: Run the test to verify it fails** +- [ ] **Step 2: Run to verify failure** Run: `pytest tests/test_zarr_input_archive.py::ZarrInputArchiveTableTestCase -v` Expected: FAIL — `get_table` raises `NotImplementedError`. @@ -2757,10 +3639,10 @@ Replace the placeholders: return self.get_table(model, strip_header).as_array() ``` -- [ ] **Step 4: Run the test to verify it passes** +- [ ] **Step 4: Run the test** Run: `pytest tests/test_zarr_input_archive.py -v` -Expected: PASS — all tests in the file. +Expected: PASS — all input-archive tests pass. - [ ] **Step 5: Commit** @@ -2776,7 +3658,7 @@ git commit -m "feat: implement ZarrInputArchive.get_table and get_structured_arr - Modify: `python/lsst/images/zarr/__init__.py` - Modify: `tests/test_zarr_input_archive.py` -`read(cls, path, **kwargs)` opens a `ZarrInputArchive`, calls `archive.get_tree(cls._get_archive_tree_type(ZarrPointerModel))`, and returns `ReadResult(tree.deserialize(archive, **kwargs), tree.metadata, tree.butler_info)` — a direct mirror of `ndf.read` minus the auto-detect path (we do not auto-detect a non-LSST OME-Zarr archive in v1). 
+`read(cls, path, **kwargs)` opens a `ZarrInputArchive`, calls `archive.get_tree(cls._get_archive_tree_type(ZarrPointerModel))`, and returns `ReadResult(tree.deserialize(archive, **kwargs), tree.metadata, tree.butler_info)`. No auto-detect path in v1 — files without `lsst.archive_class` raise. - [ ] **Step 1: Write the failing test** @@ -2786,13 +3668,12 @@ Append to `tests/test_zarr_input_archive.py`: @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") class ZarrReadHelperTestCase(unittest.TestCase): def test_round_trip_image(self) -> None: - import numpy as np - from lsst.images import Box, Image from lsst.images.zarr import read, write original = Image( - np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25] + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], ) with tempfile.TemporaryDirectory() as tmp: target = os.path.join(tmp, "out.zarr") @@ -2803,7 +3684,7 @@ class ZarrReadHelperTestCase(unittest.TestCase): self.assertEqual(result.deserialized.bbox, original.bbox) ``` -- [ ] **Step 2: Run the test to verify it fails** +- [ ] **Step 2: Run to verify failure** Run: `pytest tests/test_zarr_input_archive.py::ZarrReadHelperTestCase -v` Expected: FAIL — `read()` raises `NotImplementedError`. @@ -2817,8 +3698,8 @@ def read[T: Any](cls: type[T], path: ResourcePathExpression, **kwargs: Any) -> R """Read an object from a zarr archive. The archive's root attributes name the in-memory class via - ``lsst.archive_class``; this is checked against ``cls`` and an - `ArchiveReadError` is raised on mismatch. + ``lsst.archive_class``. Files without this attribute raise; auto- + detect of foreign zarr files is a follow-up. 
""" with ZarrInputArchive.open(path) as archive: tree_type = cls._get_archive_tree_type(ZarrPointerModel) @@ -2827,7 +3708,7 @@ def read[T: Any](cls: type[T], path: ResourcePathExpression, **kwargs: Any) -> R return ReadResult(obj, tree.metadata, tree.butler_info) ``` -Re-export `read` from `python/lsst/images/zarr/__init__.py`: +Re-export from `python/lsst/images/zarr/__init__.py`: ```python from ._common import * # noqa: F401, F403 @@ -2847,13 +3728,13 @@ git add python/lsst/images/zarr/_input_archive.py python/lsst/images/zarr/__init git commit -m "feat: add public zarr.read() helper" ``` -### Task 3.6: `RoundtripZarr` test helper + round-trips for `Image` / `MaskedImage` / `VisitImage` +### Task 3.6: `RoundtripZarr` test helper + round-trips for Image / MaskedImage / VisitImage **Files:** - Modify: `python/lsst/images/tests/_roundtrip.py` (add `RoundtripZarr`) - Create: `tests/test_zarr_round_trip.py` -`RoundtripZarr` lets the existing `RoundtripBase` helpers exercise the zarr backend the same way they exercise FITS / JSON / NDF. The new test file uses it to round-trip the three image types covered by Phase 2's output archive. +`RoundtripZarr` lets the existing `RoundtripBase` pattern exercise the zarr backend the same way it does FITS / JSON / NDF. The new test file uses it to round-trip the three image types covered by Phase 2. 
- [ ] **Step 1: Add `RoundtripZarr` to `_roundtrip.py`** @@ -2867,7 +3748,9 @@ class RoundtripZarr[T](RoundtripBase[T]): from lsst.images.zarr._model import ZarrDocument - return ZarrDocument.from_zarr(zarr.storage.LocalStore(self.filename, read_only=True)) + return ZarrDocument.from_zarr( + zarr.storage.LocalStore(self.filename, read_only=True) + ) def _get_extension(self) -> str: return ".zarr" @@ -2883,7 +3766,7 @@ class RoundtripZarr[T](RoundtripBase[T]): return zarr_backend.read(obj_type, filename) ``` -(If `RoundtripBase` constructs the filename as a single file but our zarr archive is a directory, audit the helper for any logic that assumes a file. Mirror what NDF does — `RoundtripNdf` stores `.sdf`, a single HDF5 file; for zarr, `.zarr` is conventionally a directory. Adjust the helper to accept directories where it currently uses `tempfile.NamedTemporaryFile`.) +If `RoundtripBase` constructs the on-disk path with `tempfile.NamedTemporaryFile`, audit it for directory-vs-file assumptions: a zarr archive is a directory when `_get_extension()` returns `.zarr`. Mirror what NDF does with `.sdf` (single file) but extend to handle the directory case — likely a `tempfile.TemporaryDirectory` used as the parent and the archive path joined under it. 
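One way to handle the directory case, sketched with the standard library only. `archive_path_for` is a hypothetical helper and does not reflect how `RoundtripBase` is actually written; the point is only that a temporary parent directory works for both file-shaped and directory-shaped archives.

```python
import os
import tempfile
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def archive_path_for(extension: str) -> Iterator[str]:
    """Yield a not-yet-existing path inside a temporary parent directory.

    Hypothetical helper: the backend may create either a single file
    (``.sdf``) or a directory (``.zarr``) at the yielded path, and both
    are removed when the parent directory is cleaned up.
    """
    with tempfile.TemporaryDirectory() as parent:
        yield os.path.join(parent, f"roundtrip{extension}")


with archive_path_for(".zarr") as target:
    # Stand-in for a zarr write: a directory-shaped archive.
    os.makedirs(os.path.join(target, "image"))
    created = os.path.isdir(target)

print(created, os.path.exists(target))  # True False
```

Because cleanup happens at the parent level, the helper never needs to know whether the backend wrote a file or a directory tree.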
- [ ] **Step 2: Write the failing test** @@ -2922,7 +3805,10 @@ class ZarrRoundTripTestCase(unittest.TestCase): def test_image_round_trip(self) -> None: from lsst.images import Box, Image - original = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + original = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) with RoundtripZarr(self, original) as roundtrip: recovered = roundtrip.recovered np.testing.assert_array_equal(recovered.array, original.array) @@ -2931,21 +3817,51 @@ class ZarrRoundTripTestCase(unittest.TestCase): def test_masked_image_round_trip(self) -> None: from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema - schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) - image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + schema = MaskSchema( + [ + MaskPlane("BAD", "Bad pixel."), + MaskPlane("SAT", "Saturated."), + MaskPlane("CR", "Cosmic ray."), + ] + ) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) original = MaskedImage(image, mask_schema=schema) original.mask.set("BAD", image.array % 2 == 0) + original.mask.set("SAT", image.array > 10) with RoundtripZarr(self, original) as roundtrip: recovered = roundtrip.recovered np.testing.assert_array_equal(recovered.image.array, original.image.array) np.testing.assert_array_equal(recovered.mask.array, original.mask.array) + def test_masked_image_with_40_planes_round_trip(self) -> None: + from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema + + schema = MaskSchema([MaskPlane(f"P{i}", f"Plane {i}.") for i in range(40)]) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + original = MaskedImage(image, mask_schema=schema) + original.mask.set("P0", image.array % 2 == 0) + original.mask.set("P39", image.array > 10) + + with RoundtripZarr(self, original) as 
roundtrip: + recovered = roundtrip.recovered + # 40 planes packed into uint64 on disk, unpacked to 5 bytes per pixel. + np.testing.assert_array_equal(recovered.mask.array, original.mask.array) + def test_visit_image_round_trip(self) -> None: from lsst.images import Box, Image, MaskPlane, MaskSchema, VisitImage schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) - image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) original = VisitImage(image=image, mask_schema=schema) with RoundtripZarr(self, original) as roundtrip: @@ -2960,675 +3876,260 @@ if __name__ == "__main__": - [ ] **Step 3: Run the tests** Run: `pytest tests/test_zarr_round_trip.py -v` -Expected: PASS — three round-trips. If any fail because a per-class detail in the input archive isn't quite right (e.g. a missing `lsst.companions` key for MaskedImage), fix it in `_input_archive.py`/`_output_archive.py` and re-run. +Expected: PASS — all four round-trips. If a test fails because some per-class detail is missing (e.g. a `lsst.companions` style key our `add_tree` should set, or a Projection deserializer that needs to find `/wcs_ast`), fix it in `_input_archive.py` / `_output_archive.py` and re-run. The 40-plane test is the load-bearing assertion that the wide-int packing + unpack round-trip is bit-exact. - [ ] **Step 4: Commit** ```bash git add python/lsst/images/tests/_roundtrip.py tests/test_zarr_round_trip.py -git commit -m "test: round-trip Image, MaskedImage, VisitImage through the zarr backend" +git commit -m "test: round-trip Image, MaskedImage (3- and 40-plane), VisitImage through zarr" ``` --- -**End of Phase 3.** Read side complete for `Image` / `MaskedImage` / `VisitImage`, with the lazy-subset invariant pinned by `_CountingStore` and full write→read round-trips green. 
Phase 4 adds `ColorImage` (channel axis + transpose) and `CellCoadd` (cell-aligned chunks + 4D per-cell PSF). +**End of Phase 3.** Seven tasks. Read side complete for `Image` / `MaskedImage` / `VisitImage`, lazy-subset invariant pinned by `_CountingStore`, mask unpack pinned by both 3-plane (uint8) and 40-plane (uint64) tests, full write→read round-trips green. Phase 4 adds `ColorImage` (recursive sub-archives) and `CellCoadd` (cell-aligned chunks + native 4-D PSF). ## Phase 4 — `ColorImage` and `CellCoadd` -This phase adds the two image types whose on-disk layouts deviate from the default `(y, x)` shape: `ColorImage` stacks its three channels into a single `(3, Y, X)` OME image at `/0`, and `CellCoadd` keeps default 2D layout for image / mask / variance but adds a 4D `(Cy, Cx, Py, Px)` per-cell PSF group at `/lsst/psf/per_cell` plus cell-aligned chunks driven by the coadd's `cell_shape`. - -**Key implementation idea (shared):** the user's `serialize()` runs unchanged. Each `serialize_direct("red", ...)` / `serialize_direct("blue", ...)` / per-cell PSF call lands its arrays at the natural per-component path in the IR (`/lsst/red/0`, `/lsst/blue/0`, `/lsst/psf/per_cell/cell__/0`). After `obj.serialize` returns, `_layout` runs a **fixup pass** keyed on `archive_class` that: - -1. Collects the per-component IR arrays. -2. Stacks them into the canonical OME-shaped array (transposed for `ColorImage`; nested-cell-stacked for `CellCoadd`). -3. Replaces the per-component arrays with **sliced source references** in the JSON tree so reads can still resolve them via `get_array` without re-implementing every type's deserializer. +This phase adds the two archive classes whose layouts go beyond the flat `image`/`variance`/`mask` siblings: -The sliced-source convention reuses the existing `ArrayReferenceModel.source` field with a query suffix: +- **`ColorImage`**: red/green/blue sub-archives. 
Each is itself a valid Image-shaped sub-archive (its own `image` array, its own OME multiscales, its own `lsst.archive_class = "Image"`). The root group has `lsst.archive_class = "ColorImage"` and **no** OME multiscales of its own. +- **`CellCoadd`**: `image`/`variance`/`mask` siblings (cell-aligned chunks) plus a 4-D `psf` array `(Cy, Cx, Py, Px)` with single-cell chunks `(1, 1, Py, Px)`. `lsst.cell_grid = {bbox, cell_shape}` on the root attrs. -- `zarr:/0?c=0` → axis 0 of `/0`, slice `[0:1, :, :]`, then squeeze axis 0 (used for ColorImage) -- `zarr:/lsst/psf/per_cell/0?cell=3,5` → cell `(3, 5)` of `/lsst/psf/per_cell/0`, slice `[3:4, 5:6, :, :]`, then squeeze the cell axes (used for CellCoadd PSF) +The recurring theme: **no fixup pass**. Each `add_array` call lands at the path its `name` argument names. Per-archive-class attribute decoration runs once in `add_tree` against the populated IR. -`get_array` parses the suffix, composes the implicit slice with any user-supplied `slices=`, and forwards the composed slice to the lazy `zarr.Array` handle. The lazy invariant from Phase 3 is preserved: a subset read of one channel of one ColorImage still touches only the chunks intersecting that subset along the spatial axes. - -### Task 4.1: Sliced-source URL parsing in `get_array` +### Task 4.1: Recursive sub-archive attribute decoration **Files:** -- Modify: `python/lsst/images/zarr/_common.py` -- Modify: `python/lsst/images/zarr/_input_archive.py` -- Modify: `tests/test_zarr_input_archive.py` +- Modify: `python/lsst/images/zarr/_layout.py` (add `decorate_sub_archives`) +- Modify: `python/lsst/images/zarr/_output_archive.py` (call it from `add_tree`) +- Modify: `tests/test_zarr_layout.py` +- Modify: `tests/test_zarr_output_archive.py` -This task adds the URL parser and threads it through `get_array`. No production user is calling these sliced sources yet — Tasks 4.2 and 4.4 introduce the writers. 
The test bench-tests the parser by hand-constructing references against an IR with a stacked array. +For ColorImage's `red/`, `green/`, `blue/` to be valid OME-NGFF / xarray groups in their own right, each needs `lsst.archive_class = "Image"` and an `ome.multiscales` block pointing at its `image` array. The decoration is purely metadata — no bytes move. + +The detection rule for "this sub-group is a sub-archive": it contains an `image` array (any rank). The decoration is recursive — sub-sub-archives (e.g. a Projection's parameter image inside a PSF sub-archive) get the same treatment. - [ ] **Step 1: Write the failing test** -Append to `tests/test_zarr_input_archive.py`: +Append to `tests/test_zarr_layout.py`: ```python @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") -class ZarrSlicedSourceTestCase(unittest.TestCase): - def test_channel_slice_returns_one_channel(self) -> None: - from lsst.images.serialization import ArrayReferenceModel, NumberType - from lsst.images.zarr._common import LSST_NS, LSST_VERSION - - store = zarr.storage.MemoryStore() - root = zarr.create_group(store=store, zarr_format=3) - root.update_attributes( - {LSST_NS: {"version": LSST_VERSION, "archive_class": "ColorImage", "tree": "/lsst/tree"}} - ) - # Stack: 3 channels × 4 rows × 5 cols. - stacked = np.arange(60, dtype=np.uint8).reshape(3, 4, 5) - root.create_array(name="0", shape=(3, 4, 5), chunks=(1, 4, 5), dtype="uint8")[:] = stacked - # Stub /lsst/tree. - lsst = root.create_group("lsst") - lsst.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" - - doc = ZarrDocument.from_zarr(store) - archive = ZarrInputArchive(doc) - - # Channel-1 reference reads a single (4, 5) plane — the c axis is - # dropped after slicing, NOT before, and a user `slices` argument - # composes correctly with the implicit channel slice. 
- ref = ArrayReferenceModel( - source="zarr:/0?c=1", shape=[4, 5], datatype=NumberType.from_numpy(np.dtype("uint8")) - ) - full = archive.get_array(ref) - self.assertEqual(full.shape, (4, 5)) - np.testing.assert_array_equal(full, stacked[1]) - - # Composed user slice on top of the channel suffix. - sub = archive.get_array(ref, slices=(slice(0, 2), slice(0, 3))) - self.assertEqual(sub.shape, (2, 3)) - np.testing.assert_array_equal(sub, stacked[1, :2, :3]) - - def test_cell_slice_returns_one_cell(self) -> None: - from lsst.images.serialization import ArrayReferenceModel, NumberType - from lsst.images.zarr._common import LSST_NS, LSST_VERSION - - store = zarr.storage.MemoryStore() - root = zarr.create_group(store=store, zarr_format=3) - root.update_attributes( - {LSST_NS: {"version": LSST_VERSION, "archive_class": "CellCoadd", "tree": "/lsst/tree"}} - ) - # 2x3 cells, each 4x5 PSF. - stack = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5) - psf = root.create_group("lsst").create_group("psf").create_group("per_cell") - psf.create_array(name="0", shape=(2, 3, 4, 5), chunks=(1, 1, 4, 5), dtype="float32")[:] = stack - # Stub /lsst/tree. - root["lsst"].create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8")[:] = b"{}" - - doc = ZarrDocument.from_zarr(store) - archive = ZarrInputArchive(doc) - - ref = ArrayReferenceModel( - source="zarr:/lsst/psf/per_cell/0?cell=1,2", - shape=[4, 5], - datatype=NumberType.from_numpy(np.dtype("float32")), - ) - result = archive.get_array(ref) - self.assertEqual(result.shape, (4, 5)) - np.testing.assert_array_equal(result, stack[1, 2]) -``` - -- [ ] **Step 2: Run the test to verify it fails** - -Run: `pytest tests/test_zarr_input_archive.py::ZarrSlicedSourceTestCase -v` -Expected: FAIL — current `get_array` rejects `?` query suffix. 
- -- [ ] **Step 3: Add the parser to `_common.py`** - -Append to `python/lsst/images/zarr/_common.py`: +class DecorateSubArchivesTestCase(unittest.TestCase): + def test_sub_group_with_image_gets_lsst_and_ome_attrs(self) -> None: + import numpy as np -```python -__all__ = ( - "LSST_NS", - "LSST_VERSION", - "OME_NS", - "OME_VERSION", - "SlicedSource", - "ZarrCompressionOptions", - "ZarrPointerModel", - "archive_path_to_zarr_path", - "json_pointer_to_zarr_path", - "parse_zarr_source", -) + from lsst.images.zarr._layout import decorate_sub_archives + from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup + doc = ZarrDocument(root=ZarrGroup()) + doc.root.attributes.lsst["archive_class"] = "ColorImage" + # red sub-archive with its own image array. + red = doc.root.ensure_group("/red") + red.arrays["image"] = ZarrArray(data=np.ones((4, 5), dtype="float32")) -@dataclass(frozen=True) -class SlicedSource: - """A parsed ``zarr:?`` reference. - - ``implicit_slices`` holds the slice tuple to apply *before* the - user's ``slices=`` argument. ``squeezed_axes`` lists axes to drop - after slicing (the channel axis for ColorImage, the cell axes for - CellCoadd). This shape lets `ZarrInputArchive.get_array` compose - the implicit slice with any user slice and forward the result - straight to the lazy ``zarr.Array`` handle. - """ + decorate_sub_archives(doc) - path: str - implicit_slices: tuple[slice, ...] - squeezed_axes: tuple[int, ...] - - -def parse_zarr_source(source: str) -> SlicedSource: - """Parse a ``zarr:[?]`` reference into a `SlicedSource`.""" - if not source.startswith("zarr:"): - raise ValueError(f"Not a zarr source: {source!r}") - body = source[len("zarr:") :] - if "?" 
not in body: - return SlicedSource(path=body, implicit_slices=(), squeezed_axes=()) - path, query = body.split("?", 1) - if query.startswith("c="): - c = int(query[len("c=") :]) - return SlicedSource(path=path, implicit_slices=(slice(c, c + 1),), squeezed_axes=(0,)) - if query.startswith("cell="): - cy_str, cx_str = query[len("cell=") :].split(",", 1) - cy, cx = int(cy_str), int(cx_str) - return SlicedSource( - path=path, - implicit_slices=(slice(cy, cy + 1), slice(cx, cx + 1)), - squeezed_axes=(0, 1), + self.assertEqual(red.attributes.lsst["archive_class"], "Image") + self.assertIn("multiscales", red.attributes.ome) + self.assertEqual( + red.attributes.ome["multiscales"][0]["datasets"][0]["path"], "image" ) - raise ValueError(f"Unsupported zarr-source query {query!r}.") -``` - -- [ ] **Step 4: Use the parser in `get_array`** - -In `python/lsst/images/zarr/_input_archive.py`, replace the `get_array` body so it composes the implicit slices with the user's: - -```python - def get_array( - self, - model: ArrayReferenceModel | InlineArrayModel, - *, - slices: tuple[slice, ...] | EllipsisType = ..., - strip_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, - ) -> np.ndarray: - if isinstance(model, InlineArrayModel): - data: np.ndarray = np.array(model.data, dtype=model.datatype.to_numpy()) - return data if slices is ... else data[slices] - if not isinstance(model.source, str) or not model.source.startswith("zarr:"): - raise ArchiveReadError( - f"ZarrInputArchive cannot resolve array source {model.source!r}; " - f"expected a 'zarr:' reference." - ) - from ._common import parse_zarr_source - - parsed = parse_zarr_source(model.source) - try: - node = self._document.root.get(parsed.path) - except KeyError: - raise ArchiveReadError(f"Array reference {parsed.path!r} not in store.") from None - if not isinstance(node, ZarrArray): - raise ArchiveReadError(f"{parsed.path!r} is not an array.") - # Compose implicit (per-class) slice with user slice. 
The lazy - # invariant: this composed tuple is what hits the zarr.Array - # handle, so only intersecting chunks are fetched. - composed: tuple[slice, ...] - if slices is ...: - user_slices: tuple[slice, ...] = (slice(None),) * ( - len(node.shape) - len(parsed.implicit_slices) - ) - else: - user_slices = slices - composed = parsed.implicit_slices + user_slices - raw = node.read(slices=composed) - # Drop the squeezed axes (channel / cell axes that the implicit - # slice constrained to size 1). - for axis in sorted(parsed.squeezed_axes, reverse=True): - raw = np.squeeze(raw, axis=axis) - return raw -``` - -- [ ] **Step 5: Run the tests to verify they pass** - -Run: `pytest tests/test_zarr_input_archive.py -v` -Expected: PASS — including both new sliced-source tests. - -- [ ] **Step 6: Commit** - -```bash -git add python/lsst/images/zarr/_common.py python/lsst/images/zarr/_input_archive.py tests/test_zarr_input_archive.py -git commit -m "feat: parse zarr:?c=N / ?cell=Cy,Cx sliced-source references" -``` - -### Task 4.2: ColorImage layout fixup on write - -**Files:** -- Modify: `python/lsst/images/zarr/_layout.py` (add `fixup_color_image`) -- Modify: `python/lsst/images/zarr/_output_archive.py` (call the fixup in `add_tree` when `archive_class == "ColorImage"`) -- Modify: `python/lsst/images/zarr/_common.py` (add `/red/image`, `/green/image`, `/blue/image` mappings to `_JSON_POINTER_TO_ZARR_PATH` so per-channel arrays land at predictable paths) -- Modify: `tests/test_zarr_output_archive.py` (assert IR shape after fixup) - -The fixup runs after `serialize()` populates the IR. By that point, the IR has three top-level Image sub-archives at `/lsst/red/0`, `/lsst/green/0`, `/lsst/blue/0` (each shaped `(Y, X)`), plus the staged JSON tree referencing them by `source="zarr:/lsst//0"`. The fixup: - -1. Reads the three numpy arrays out of the IR (still numpy; no zarr handles yet). -2. Stacks via `transpose_color_image_in(np.stack([r, g, b], axis=-1))` to produce `(3, Y, X)`. -3. 
Stages this stacked array at `/0` with chunks `(1, 1024, 1024)` (default) and shards. -4. Removes `/lsst/red`, `/lsst/green`, `/lsst/blue` from the IR. -5. Walks the staged tree's JSON bytes to rewrite the source URLs: - - `"zarr:/lsst/red/0"` → `"zarr:/0?c=0"` - - `"zarr:/lsst/green/0"` → `"zarr:/0?c=1"` - - `"zarr:/lsst/blue/0"` → `"zarr:/0?c=2"` -6. Adds OME `omero/channels` to the root attributes. - -The JSON-rewrite step is a flat string substitution because each source URL is unique and self-contained — no nested-pointer escaping concerns. - -- [ ] **Step 1: Update `_JSON_POINTER_TO_ZARR_PATH`** - -In `python/lsst/images/zarr/_common.py`, replace the existing dict literal with: - -```python -_JSON_POINTER_TO_ZARR_PATH: dict[str, str] = { - "/image": "/0", - "/mask": "/lsst/mask/0", - "/variance": "/lsst/variance/0", - # ColorImage per-channel sub-archives (stacked into /0 by the - # _layout.fixup_color_image pass on write). - "/red/image": "/lsst/red/0", - "/green/image": "/lsst/green/0", - "/blue/image": "/lsst/blue/0", -} -``` - -- [ ] **Step 2: Write the failing test** - -Append to `tests/test_zarr_output_archive.py`: - -```python -@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") -class ZarrColorImageWriteTestCase(unittest.TestCase): - def test_color_image_stacks_into_top_level_array(self) -> None: - import os - import tempfile + def test_root_archive_class_is_unchanged(self) -> None: import numpy as np - import zarr - - from lsst.images import Box, ColorImage, Image - from lsst.images.zarr import write - from lsst.images.zarr._model import ZarrDocument - - red = Image(np.full((4, 5), 1, dtype=np.uint8), bbox=Box.factory[10:14, 20:25]) - green = Image(np.full((4, 5), 2, dtype=np.uint8), bbox=red.bbox) - blue = Image(np.full((4, 5), 3, dtype=np.uint8), bbox=red.bbox) - color = ColorImage(red=red, green=green, blue=blue) - with tempfile.TemporaryDirectory() as tmp: - target = os.path.join(tmp, "out.zarr") - write(color, target) - with 
zarr.storage.LocalStore(target, read_only=True) as store: - doc = ZarrDocument.from_zarr(store) - # Stacked at /0 with shape (3, Y, X). - self.assertIn("0", doc.root.arrays) - self.assertEqual(doc.root.arrays["0"].shape, (3, 4, 5)) - # Per-channel sub-archives are gone after the fixup. - self.assertNotIn("red", doc.root.groups.get("lsst", _empty()).groups) - # OME attributes name the channel axis. - axes = [a["name"] for a in doc.root.attributes.ome["multiscales"][0]["axes"]] - self.assertEqual(axes, ["c", "y", "x"]) - self.assertIn("omero", doc.root.attributes.ome) + from lsst.images.zarr._layout import decorate_sub_archives + from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup + doc = ZarrDocument(root=ZarrGroup()) + doc.root.attributes.lsst["archive_class"] = "ColorImage" + red = doc.root.ensure_group("/red") + red.arrays["image"] = ZarrArray(data=np.ones((4, 5), dtype="float32")) -def _empty(): # noqa: ANN201 - from lsst.images.zarr._model import ZarrGroup + decorate_sub_archives(doc) - return ZarrGroup() + # Root keeps ColorImage; only sub-groups are decorated. + self.assertEqual(doc.root.attributes.lsst["archive_class"], "ColorImage") ``` -- [ ] **Step 3: Run the test to verify it fails** +- [ ] **Step 2: Run to verify failure** -Run: `pytest tests/test_zarr_output_archive.py::ZarrColorImageWriteTestCase -v` -Expected: FAIL — fixup not implemented; per-channel arrays remain at `/lsst/red/0` etc. +Run: `pytest tests/test_zarr_layout.py::DecorateSubArchivesTestCase -v` +Expected: FAIL — `decorate_sub_archives` does not exist. 
-- [ ] **Step 4: Add `fixup_color_image` to `_layout.py`** +- [ ] **Step 3: Implement the decoration pass** Append to `python/lsst/images/zarr/_layout.py`: ```python __all__ = ( + "AffineCheckResult", + "affine_check", "axes_for_archive_class", + "chunks_aligned_to", "chunks_for", - "fixup_color_image", - "transpose_color_image_in", - "transpose_color_image_out", + "decorate_sub_archives", ) -def fixup_color_image(document: "ZarrDocument") -> None: - """Stack ColorImage red/green/blue sub-archives into a single ``/0``. +def decorate_sub_archives(document: "ZarrDocument") -> None: + """Walk ``document`` and decorate every sub-archive group with attrs. + + A sub-archive is any group below the root that contains an ``image`` + array. Decoration adds ``lsst.archive_class = "Image"`` and an + ``ome.multiscales`` block pointing at the sub-archive's ``image`` + array. Recursive: nested sub-archives are decorated too. - Runs after ``ColorImage.serialize`` has populated the IR. Reads - the three per-channel numpy arrays out of ``/lsst/red/0``, - ``/lsst/green/0``, ``/lsst/blue/0``, transposes them into - ``(c, y, x)`` shape, stages the result at ``/0``, removes the - per-channel sub-archives, and rewrites the staged JSON tree so - references to the per-channel arrays become channel-sliced - references against ``/0``. + The root group is left alone — its ``lsst.archive_class`` is set + by ``add_tree`` based on the in-memory object's type. 
""" - from ._model import ZarrArray, ZarrDocument # local: avoid circular import + from ._model import OmeMultiscale, ZarrDocument, ZarrGroup # local: avoid cycle if not isinstance(document, ZarrDocument): raise TypeError(type(document).__name__) - lsst = document.root.groups.get("lsst") - if lsst is None: - return - channels = [] - for name in ("red", "green", "blue"): - sub = lsst.groups.get(name) - if sub is None or "0" not in sub.arrays: - return # not a fully-populated ColorImage; bail out - channels.append(sub.arrays["0"].data) - if not all(isinstance(c, np.ndarray) for c in channels): - raise TypeError("ColorImage fixup requires staged numpy arrays.") - stacked = np.stack(channels, axis=0) # (3, Y, X) - document.root.arrays["0"] = ZarrArray(data=stacked) - for name in ("red", "green", "blue"): - del lsst.groups[name] - # Rewrite source URLs in the JSON tree. - if "tree" in lsst.arrays: - json_bytes = bytes(lsst.arrays["tree"].data) - rewrites = { - b"zarr:/lsst/red/0": b"zarr:/0?c=0", - b"zarr:/lsst/green/0": b"zarr:/0?c=1", - b"zarr:/lsst/blue/0": b"zarr:/0?c=2", - } - for old, new in rewrites.items(): - json_bytes = json_bytes.replace(old, new) - lsst.arrays["tree"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) -``` + _decorate_walk(document.root, depth=0) -- [ ] **Step 5: Call the fixup from `add_tree`** -In `python/lsst/images/zarr/_output_archive.py`, extend `add_tree` so that — after the JSON tree is staged — it calls the fixup and adds OME `omero/channels`: +def _decorate_walk(group: "ZarrGroup", *, depth: int) -> None: + from ._model import OmeMultiscale, ZarrGroup # local: avoid cycle -```python - if archive_class == "ColorImage": - from ._layout import fixup_color_image - from ._model import OmeOmeroChannel - - fixup_color_image(self.document) - self.document.root.attributes.ome["omero"] = { - "channels": [ - OmeOmeroChannel(label="red", color="FF0000").dump(), - OmeOmeroChannel(label="green", color="00FF00").dump(), - 
OmeOmeroChannel(label="blue", color="0000FF").dump(), - ] - } - if "0" in self.document.root.arrays: - top = self.document.root.arrays["0"] - multiscale = OmeMultiscale( - name=archive_class.lower(), - axes=axes_for_archive_class(archive_class), - ) - self.document.root.attributes.ome["multiscales"] = [multiscale.dump()] -``` - -(Move the existing multiscales logic so it runs *after* the fixup; otherwise the axes would still say `(y, x)` because `/0` doesn't exist yet at fixup time.) - -- [ ] **Step 6: Run the tests to verify they pass** - -Run: `pytest tests/test_zarr_output_archive.py::ZarrColorImageWriteTestCase -v` -Expected: PASS — `/0` is `(3, 4, 5)`, per-channel groups are gone, OME axes are `(c, y, x)`. - -- [ ] **Step 7: Commit** - -```bash -git add python/lsst/images/zarr/_layout.py python/lsst/images/zarr/_output_archive.py python/lsst/images/zarr/_common.py tests/test_zarr_output_archive.py -git commit -m "feat: stack ColorImage channels into a single (3, Y, X) OME image" -``` - -### Task 4.3: ColorImage round-trip - -**Files:** -- Modify: `tests/test_zarr_round_trip.py` - -This task only adds a round-trip assertion. The work in Tasks 4.1 and 4.2 means the engineer expects this to pass without further code changes; if it fails, the failure surfaces a missing piece in either the JSON-rewrite step or the channel-slice parser. 
- -- [ ] **Step 1: Write the test** - -Append to `tests/test_zarr_round_trip.py`: - -```python -@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") -class ZarrColorImageRoundTripTestCase(unittest.TestCase): - def test_color_image_round_trip(self) -> None: - from lsst.images import Box, ColorImage, Image - - red = Image(np.full((4, 5), 1, dtype=np.uint8), bbox=Box.factory[10:14, 20:25]) - green = Image(np.full((4, 5), 2, dtype=np.uint8), bbox=red.bbox) - blue = Image(np.full((4, 5), 3, dtype=np.uint8), bbox=red.bbox) - original = ColorImage(red=red, green=green, blue=blue) - - with RoundtripZarr(self, original) as roundtrip: - recovered = roundtrip.recovered - np.testing.assert_array_equal(recovered.red.array, original.red.array) - np.testing.assert_array_equal(recovered.green.array, original.green.array) - np.testing.assert_array_equal(recovered.blue.array, original.blue.array) -``` - -- [ ] **Step 2: Run the test** - -Run: `pytest tests/test_zarr_round_trip.py::ZarrColorImageRoundTripTestCase -v` -Expected: PASS. - -- [ ] **Step 3: Commit** - -```bash -git add tests/test_zarr_round_trip.py -git commit -m "test: round-trip ColorImage through the zarr backend" -``` - -### Task 4.4: CellCoadd default chunk geometry - -**Files:** -- Modify: `python/lsst/images/zarr/_layout.py` (extend `chunks_for` to honor `cell_shape`) -- Modify: `python/lsst/images/zarr/_output_archive.py` (publish `cell_shape` to the archive so `chunks_for` can see it) -- Modify: `tests/test_zarr_layout.py` -- Modify: `tests/test_zarr_output_archive.py` - -`CellCoadd` chunks should align to the coadd's cell grid: instead of `(1024, 1024)`, the default chunks become `cell_shape` (typically `(256, 256)`). The fix surfaces the cell shape from the in-memory object to `chunks_for` via a new optional `archive_metadata` argument. 
- -- [ ] **Step 1: Write the failing test** - -Append to `tests/test_zarr_layout.py`: - -```python - def test_chunks_for_cell_coadd_uses_cell_shape(self) -> None: - result = chunks_for( - "CellCoadd", - (4096, 4096), - None, - archive_metadata={"cell_shape": (256, 256)}, - ) - self.assertEqual(result, (256, 256)) - - def test_chunks_for_cell_coadd_without_metadata_falls_back(self) -> None: - # When cell_shape isn't available, fall back to the 1024 default. - self.assertEqual(chunks_for("CellCoadd", (4096, 4096), None), (1024, 1024)) -``` - -- [ ] **Step 2: Run to verify failure** - -Run: `pytest tests/test_zarr_layout.py -v` -Expected: FAIL — `chunks_for` doesn't accept `archive_metadata`. - -- [ ] **Step 3: Extend `chunks_for`** - -Replace `chunks_for` in `python/lsst/images/zarr/_layout.py`: - -```python -def chunks_for( - archive_class: str, - shape: tuple[int, ...], - override: tuple[int, ...] | None, - *, - archive_metadata: Mapping[str, Any] | None = None, -) -> tuple[int, ...]: - """Return the chunk shape to use for an array. - - Parameters - ---------- - archive_class - The top-level archive class. - shape - The full array shape, used to clamp the default per-axis. - override - User-supplied chunk shape; if not ``None`` it is returned - verbatim after a length check. - archive_metadata - Optional dict carrying class-specific layout hints. ``CellCoadd`` - uses ``"cell_shape"`` to align chunks to the cell grid. - """ - if override is not None: - if len(override) != len(shape): - raise ValueError( - f"chunks override has rank {len(override)}, " - f"expected {len(shape)} for {archive_class!r}." - ) - return tuple(override) - if archive_class == "CellCoadd" and archive_metadata is not None: - cell_shape = archive_metadata.get("cell_shape") - if cell_shape is not None: - # Align chunks to the cell grid (still clamped to the array shape). 
- return tuple(min(c, dim) for c, dim in zip(cell_shape, shape, strict=True)) - return tuple(min(_DEFAULT_AXIS_LIMIT, dim) for dim in shape) -``` - -(Add `from collections.abc import Mapping` and `from typing import Any` to the imports.) - -- [ ] **Step 4: Surface `cell_shape` from `write()` into the archive** - -Edit `python/lsst/images/zarr/_output_archive.py` so `write()` extracts `cell_shape` from the object and stashes it on the archive: - -```python -def write( - obj: Any, - path: Any, - *, - chunks: Mapping[str, tuple[int, ...] | None] | None = None, - shards: Mapping[str, tuple[int, ...] | None] | None = None, - compression: Mapping[str, ZarrCompressionOptions | None] | None = None, - metadata: Mapping[str, Any] | None = None, - butler_info: Any | None = None, -) -> ArchiveTree: - from ._store import open_store_for_write - - archive_default_name = getattr(obj, "_archive_default_name", None) - archive_metadata: dict[str, Any] = {} - if (cell_shape := getattr(obj, "cell_shape", None)) is not None: - archive_metadata["cell_shape"] = tuple(cell_shape) - archive = ZarrOutputArchive( - chunks=chunks, - shards=shards, - compression=compression, - archive_metadata=archive_metadata, - ) - ... + for name, sub in group.groups.items(): + if "image" in sub.arrays: + sub.attributes.lsst.setdefault("archive_class", "Image") + if "tree" in sub.arrays: + sub.attributes.lsst.setdefault("tree", "tree") + if "multiscales" not in sub.attributes.ome: + multiscale = OmeMultiscale( + name="image", + axes=("y", "x"), + dataset_path="image", + ) + sub.attributes.ome["multiscales"] = [multiscale.dump()] + _decorate_walk(sub, depth=depth + 1) ``` -…and route `archive_metadata` through to `add_array` so it can call `chunks_for` with the hints. 
The simplest path is for `ZarrOutputArchive.add_array` to compute the default chunks itself instead of leaving it to `to_zarr`: +In `python/lsst/images/zarr/_output_archive.py`, call it at the end of `add_tree` (just before the method returns): ```python - def __init__( - self, - *, - chunks=None, - shards=None, - compression=None, - archive_metadata: Mapping[str, Any] | None = None, - ) -> None: - ... - self._archive_metadata = dict(archive_metadata) if archive_metadata else {} + from ._layout import decorate_sub_archives - def add_array(self, array, *, name=None, update_header=...): - ... - chunks = self._chunks.get(name) - if chunks is None: - chunks = chunks_for( - self._archive_class_hint or "Image", - array.shape, - None, - archive_metadata=self._archive_metadata, - ) - parent.arrays[leaf] = ZarrArray( - data=np.ascontiguousarray(array), - chunks=chunks, - ... - ) + decorate_sub_archives(self.document) ``` -`self._archive_class_hint` is set by `add_tree` when it knows the top-level class — but `add_array` may run before `add_tree`. Workable approach: set it in `write()` *before* `obj.serialize` runs, by passing the class name into the constructor. 
- -- [ ] **Step 5: Write the output-archive layout test** +- [ ] **Step 4: Add an output-archive integration test** Append to `tests/test_zarr_output_archive.py`: ```python @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") -class ZarrCellCoaddChunkLayoutTestCase(unittest.TestCase): - def test_cell_coadd_chunks_align_to_cell_shape(self) -> None: +class ZarrColorImageWriteTestCase(unittest.TestCase): + def test_color_image_emits_recursive_sub_archives(self) -> None: import os import tempfile - from lsst.images import Box, Image, MaskPlane, MaskSchema - from lsst.images.cells import CellCoadd + import numpy as np + import zarr + + from lsst.images import Box, ColorImage, Image from lsst.images.zarr import write + from lsst.images.zarr._common import LSST_NS, OME_NS from lsst.images.zarr._model import ZarrDocument - # Construct a minimal CellCoadd; the constructor signature will - # depend on the public API. Adjust to whatever this codebase - # exposes — the assertion below depends only on chunk shape. - coadd = _make_minimal_cell_coadd(cell_shape=(256, 256), shape=(512, 512)) + red = Image(np.full((4, 5), 1, dtype=np.uint8), bbox=Box.factory[10:14, 20:25]) + green = Image(np.full((4, 5), 2, dtype=np.uint8), bbox=red.bbox) + blue = Image(np.full((4, 5), 3, dtype=np.uint8), bbox=red.bbox) + color = ColorImage(red=red, green=green, blue=blue) with tempfile.TemporaryDirectory() as tmp: - target = os.path.join(tmp, "coadd.zarr") - write(coadd, target) + target = os.path.join(tmp, "out.zarr") + write(color, target) with zarr.storage.LocalStore(target, read_only=True) as store: doc = ZarrDocument.from_zarr(store) - self.assertEqual(tuple(doc.root.arrays["0"].chunks), (256, 256)) + # Root: ColorImage, no ome.multiscales (axes_for_archive_class + # returns () for ColorImage). + self.assertEqual( + doc.root.attributes.lsst["archive_class"], "ColorImage" + ) + self.assertNotIn("multiscales", doc.root.attributes.ome) + # Each channel sub-archive has its own image array... 
+ for channel in ("red", "green", "blue"): + sub = doc.root.groups[channel] + self.assertIn("image", sub.arrays) + self.assertEqual(sub.arrays["image"].shape, (4, 5)) + # ...and is decorated as a valid Image sub-archive. + self.assertEqual(sub.attributes.lsst["archive_class"], "Image") + self.assertIn("multiscales", sub.attributes.ome) + self.assertEqual( + sub.attributes.ome["multiscales"][0]["datasets"][0]["path"], + "image", + ) +``` +- [ ] **Step 5: Run all tests** -def _make_minimal_cell_coadd(*, cell_shape, shape): # noqa: ANN001, ANN201 - """Construct a minimal CellCoadd for layout testing. +Run: `pytest tests/test_zarr_layout.py tests/test_zarr_output_archive.py -v` +Expected: PASS — decoration is applied recursively, ColorImage's three channels are valid Image sub-archives. - The constructor in this codebase may require a Projection, mask, - variance, etc. Build the smallest valid instance possible — this - helper is a placeholder; the implementer should replace it with the - actual minimal construction once they consult the CellCoadd ctor. - """ - raise unittest.SkipTest("Implementer: build a minimal CellCoadd per the local ctor.") +- [ ] **Step 6: Commit** + +```bash +git add python/lsst/images/zarr/_layout.py python/lsst/images/zarr/_output_archive.py tests/test_zarr_layout.py tests/test_zarr_output_archive.py +git commit -m "feat: decorate sub-archives with lsst.archive_class and ome.multiscales" ``` -(The `SkipTest` placeholder is intentional — the engineer may need to read the `CellCoadd` constructor in `python/lsst/images/cells/_coadd.py` to assemble a minimal valid coadd. Replace the placeholder with the real construction code; the chunk-shape assertion stands as-is.) 
+### Task 4.2: ColorImage round-trip -- [ ] **Step 6: Run the tests** +**Files:** +- Modify: `tests/test_zarr_round_trip.py` -Run: `pytest tests/test_zarr_layout.py tests/test_zarr_output_archive.py -v` -Expected: layout tests PASS; the CellCoadd output-archive test runs (or skips with the placeholder). +The decoration in 4.1 plus the existing `read()` deserializer should round-trip ColorImage with no further code changes. This task asserts that. + +- [ ] **Step 1: Write the test** + +Append to `tests/test_zarr_round_trip.py`: + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrColorImageRoundTripTestCase(unittest.TestCase): + def test_color_image_round_trip(self) -> None: + from lsst.images import Box, ColorImage, Image + + red = Image(np.full((4, 5), 1, dtype=np.uint8), bbox=Box.factory[10:14, 20:25]) + green = Image(np.full((4, 5), 2, dtype=np.uint8), bbox=red.bbox) + blue = Image(np.full((4, 5), 3, dtype=np.uint8), bbox=red.bbox) + original = ColorImage(red=red, green=green, blue=blue) + + with RoundtripZarr(self, original) as roundtrip: + recovered = roundtrip.recovered + np.testing.assert_array_equal(recovered.red.array, original.red.array) + np.testing.assert_array_equal(recovered.green.array, original.green.array) + np.testing.assert_array_equal(recovered.blue.array, original.blue.array) +``` + +- [ ] **Step 2: Run the test** -- [ ] **Step 7: Commit** +Run: `pytest tests/test_zarr_round_trip.py::ZarrColorImageRoundTripTestCase -v` +Expected: PASS. If this fails because the ColorImage deserializer needs sub-archive `tree` documents that we are not staging (since we use `serialize_direct`, not `serialize_pointer`), the failure tells you exactly what's missing — adapt the `decorate_sub_archives` pass to also write a per-sub-archive `tree` document if the ColorImage deserializer demands it. 
+ +- [ ] **Step 3: Commit** ```bash -git add python/lsst/images/zarr/_layout.py python/lsst/images/zarr/_output_archive.py tests/test_zarr_layout.py tests/test_zarr_output_archive.py -git commit -m "feat: align CellCoadd default chunks to cell_shape" +git add tests/test_zarr_round_trip.py +git commit -m "test: round-trip ColorImage through the zarr backend" ``` -### Task 4.5: 4D per-cell PSF stacking for `CellCoadd` +### Task 4.3: CellCoadd PSF — single-cell chunks for the 4-D array **Files:** -- Modify: `python/lsst/images/zarr/_layout.py` (add `fixup_cell_coadd_psf`) -- Modify: `python/lsst/images/zarr/_output_archive.py` (call the fixup when `archive_class == "CellCoadd"`) +- Modify: `python/lsst/images/zarr/_layout.py` (extend `chunks_for` to accept `axis_hint`) +- Modify: `python/lsst/images/zarr/_output_archive.py` (special-case `name="psf"` to chunk per-cell) +- Modify: `tests/test_zarr_layout.py` - Modify: `tests/test_zarr_output_archive.py` -The mechanics mirror `fixup_color_image`: - -1. Walk `/lsst/psf/per_cell/cell__/0` arrays in the IR. -2. Stack into a single `(Cy, Cx, Py, Px)` array at `/lsst/psf/per_cell/0` with chunks `(1, 1, Py, Px)` so per-cell reads stay one-chunk. -3. Remove the per-cell sub-groups. -4. Rewrite source URLs `zarr:/lsst/psf/per_cell/cell__/0` → `zarr:/lsst/psf/per_cell/0?cell=,`. +CellCoadd's PSF is a 4-D array `(Cy, Cx, Py, Px)` where the leading two axes index cells and the trailing two are the per-cell PSF image. Single-cell reads should be one chunk, so the default chunk shape is `(1, 1, Py, Px)`. -Per the spec: "Always nested under `lsst/psf/per_cell` of a parent CellCoadd". The fixup only runs when the IR has at least one such per-cell group; CellCoadd objects without per-cell PSFs leave the IR alone. +`add_array` recognises `name == "psf"` (or names ending in `/psf`) and applies the single-cell-chunked default if the user has not overridden. 
- [ ] **Step 1: Write the failing test** @@ -3636,132 +4137,153 @@ Append to `tests/test_zarr_output_archive.py`: ```python @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") -class ZarrCellCoaddPsfStackingTestCase(unittest.TestCase): - def test_per_cell_psf_stacks_into_4d_array(self) -> None: - # Hand-build the IR shape that CellCoadd.serialize would produce - # so this test is independent of the (still-evolving) CellCoadd - # constructor. +class ZarrPsfChunkingTestCase(unittest.TestCase): + def test_psf_array_uses_single_cell_chunks(self) -> None: import numpy as np - from lsst.images.zarr._layout import fixup_cell_coadd_psf - from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup - - doc = ZarrDocument(root=ZarrGroup()) - psf = doc.root.ensure_group("/lsst/psf/per_cell") - for i in range(2): - for j in range(3): - cell = psf.groups.setdefault(f"cell_{i}_{j}", ZarrGroup()) - cell.arrays["0"] = ZarrArray( - data=np.full((4, 5), i * 10 + j, dtype=np.float32) - ) - # Stub a JSON tree referencing one cell so we can spot-check the - # rewrite. - ref = b'"source": "zarr:/lsst/psf/per_cell/cell_1_2/0"' - doc.root.ensure_group("/lsst").arrays["tree"] = ZarrArray( - data=np.frombuffer(b"{" + ref + b"}", dtype=np.uint8) - ) + psf = np.zeros((2, 3, 21, 21), dtype=np.float32) + archive = ZarrOutputArchive(archive_class="CellCoadd") + ref = archive.add_array(psf, name="psf") + self.assertEqual(ref.source, "zarr:/psf") + node = archive.document.root.get("/psf") + # Single-cell chunks: leading axes are 1; spatial axes match shape. + self.assertEqual(tuple(node.chunks), (1, 1, 21, 21)) - fixup_cell_coadd_psf(doc) + def test_psf_user_override_wins(self) -> None: + import numpy as np - stacked = doc.root.get("/lsst/psf/per_cell/0") - self.assertEqual(stacked.shape, (2, 3, 4, 5)) - # Each per-cell group is gone. - per_cell = doc.root.get("/lsst/psf/per_cell") - self.assertEqual(per_cell.groups, {}) - # JSON rewrite happened. 
- rewritten = bytes(doc.root.get("/lsst/tree").data).decode("utf-8") - self.assertIn('"zarr:/lsst/psf/per_cell/0?cell=1,2"', rewritten) + psf = np.zeros((2, 3, 21, 21), dtype=np.float32) + archive = ZarrOutputArchive( + archive_class="CellCoadd", + chunks={"psf": (2, 3, 21, 21)}, + ) + archive.add_array(psf, name="psf") + node = archive.document.root.get("/psf") + self.assertEqual(tuple(node.chunks), (2, 3, 21, 21)) ``` - [ ] **Step 2: Run to verify failure** -Run: `pytest tests/test_zarr_output_archive.py::ZarrCellCoaddPsfStackingTestCase -v` -Expected: FAIL — `fixup_cell_coadd_psf` does not exist. +Run: `pytest tests/test_zarr_output_archive.py::ZarrPsfChunkingTestCase -v` +Expected: FAIL — current `add_array` defaults to `min(1024, dim)` per axis, giving `(2, 3, 21, 21)` (already small enough) but for larger Cy/Cx the leading axes would not be 1. -- [ ] **Step 3: Implement the fixup** +- [ ] **Step 3: Implement the special case** -Append to `python/lsst/images/zarr/_layout.py` (and add to `__all__`): +In `python/lsst/images/zarr/_output_archive.py`, edit `add_array` to handle the PSF name. After computing `parent_path` and `leaf` and before staging the `ZarrArray`, add: ```python -def fixup_cell_coadd_psf(document: "ZarrDocument") -> None: - """Stack per-cell PSF sub-archives into a single 4D OME image.""" - from ._model import ZarrArray, ZarrDocument + # Default chunks for a CellCoadd-style 4-D PSF: one cell per chunk. + if ( + chunks is None + and leaf == "psf" + and array.ndim == 4 + and parent_path in ("/", "") + ): + chunks = (1, 1, array.shape[2], array.shape[3]) +``` - if not isinstance(document, ZarrDocument): - raise TypeError(type(document).__name__) - try: - per_cell = document.root.get("/lsst/psf/per_cell") - except KeyError: - return - if not isinstance(per_cell, ZarrGroup := type(per_cell)) or not per_cell.groups: - return - # Parse and sort cell coordinates. 
- coords: dict[tuple[int, int], np.ndarray] = {} - for name, sub in per_cell.groups.items(): - if not name.startswith("cell_"): - continue - try: - i_str, j_str = name[len("cell_") :].split("_", 1) - i, j = int(i_str), int(j_str) - except ValueError: - continue - if "0" not in sub.arrays: - continue - arr = sub.arrays["0"].data - if not isinstance(arr, np.ndarray): - raise TypeError("CellCoadd PSF fixup requires staged numpy arrays.") - coords[(i, j)] = arr - if not coords: - return - cy = max(k[0] for k in coords) + 1 - cx = max(k[1] for k in coords) + 1 - sample = next(iter(coords.values())) - py, px = sample.shape - stacked = np.zeros((cy, cx, py, px), dtype=sample.dtype) - for (i, j), arr in coords.items(): - stacked[i, j] = arr - per_cell.arrays["0"] = ZarrArray(data=stacked, chunks=(1, 1, py, px)) - per_cell.groups.clear() - # Rewrite the JSON tree URLs. - lsst = document.root.groups.get("lsst") - if lsst is None or "tree" not in lsst.arrays: - return - json_bytes = bytes(lsst.arrays["tree"].data) - for (i, j) in coords: - old = f"zarr:/lsst/psf/per_cell/cell_{i}_{j}/0".encode() - new = f"zarr:/lsst/psf/per_cell/0?cell={i},{j}".encode() - json_bytes = json_bytes.replace(old, new) - lsst.arrays["tree"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) +(Place this after the existing `chunks` resolution chain so user overrides still win.) + +- [ ] **Step 4: Run the tests** + +Run: `pytest tests/test_zarr_output_archive.py::ZarrPsfChunkingTestCase -v` +Expected: PASS — both tests. 
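The precedence that Step 3 relies on (user override first, then the PSF special case, then the per-axis clamp) can be sketched as a standalone function. This is an illustration under stated assumptions, not the backend's code: `resolve_chunks` and `DEFAULT_AXIS_LIMIT` are hypothetical names, the 1024 limit is taken from the plan's `min(1024, dim)` default, and the special case is restricted to a top-level `psf` array as in the Step 3 snippet.

```python
DEFAULT_AXIS_LIMIT = 1024  # assumed per-axis clamp, from the plan's "min(1024, dim)"


def resolve_chunks(name, shape, overrides=None):
    """Pick a chunk shape: user override, then PSF special case, then clamp."""
    override = (overrides or {}).get(name)
    if override is not None:
        # A user-supplied chunk shape always wins, after a rank check.
        if len(override) != len(shape):
            raise ValueError(
                f"chunks override has rank {len(override)}, expected {len(shape)}"
            )
        return tuple(override)
    if name == "psf" and len(shape) == 4:
        # One cell per chunk: leading (cell) axes are 1, PSF axes are whole.
        return (1, 1, shape[2], shape[3])
    # Generic default: clamp each axis to the limit.
    return tuple(min(DEFAULT_AXIS_LIMIT, dim) for dim in shape)


print(resolve_chunks("psf", (40, 60, 21, 21)))
print(resolve_chunks("psf", (2, 3, 21, 21), {"psf": (2, 3, 21, 21)}))
print(resolve_chunks("image", (4096, 512)))
```

Checking the override before the special case is what keeps the second test in this task (user override wins) passing alongside the first.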
+ +- [ ] **Step 5: Commit** + +```bash +git add python/lsst/images/zarr/_output_archive.py tests/test_zarr_output_archive.py +git commit -m "feat: default CellCoadd PSF to single-cell chunks (1, 1, Py, Px)" ``` -- [ ] **Step 4: Call the fixup from `add_tree`** +### Task 4.4: CellCoadd output-archive layout test + +**Files:** +- Modify: `tests/test_zarr_output_archive.py` + +Pin the on-disk layout for a `CellCoadd`: image / variance / mask siblings with cell-aligned chunks, 4-D PSF with single-cell chunks, `lsst.cell_grid` on the root. + +The test's CellCoadd construction is implementer-supplied — the existing `python/lsst/images/cells/_coadd.py` constructor takes a particular set of arguments. The implementer must read it and assemble a minimal valid coadd; the on-disk assertions below stand regardless of constructor specifics. + +- [ ] **Step 1: Write the test** -In `python/lsst/images/zarr/_output_archive.py`, in the `add_tree` body, before the multiscales block: +Append to `tests/test_zarr_output_archive.py`: ```python - if archive_class == "CellCoadd": - from ._layout import fixup_cell_coadd_psf +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrCellCoaddWriteTestCase(unittest.TestCase): + def test_cell_coadd_layout(self) -> None: + import os + import tempfile + + import zarr + + from lsst.images.zarr import write + from lsst.images.zarr._model import ZarrDocument + + coadd = _make_minimal_cell_coadd( + cell_shape=(256, 256), + shape=(512, 512), + n_cells=(2, 2), + psf_shape=(21, 21), + ) + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "coadd.zarr") + write(coadd, target) + with zarr.storage.LocalStore(target, read_only=True) as store: + doc = ZarrDocument.from_zarr(store) + # Root archive class. + self.assertEqual( + doc.root.attributes.lsst["archive_class"], "CellCoadd" + ) + # cell_grid metadata is on the root attrs. 
+ self.assertIn("cell_grid", doc.root.attributes.lsst) + cg = doc.root.attributes.lsst["cell_grid"] + self.assertEqual(tuple(cg["cell_shape"]), (256, 256)) + # image / variance / mask siblings, cell-aligned chunks. + self.assertEqual(tuple(doc.root.arrays["image"].chunks), (256, 256)) + self.assertEqual(tuple(doc.root.arrays["variance"].chunks), (256, 256)) + self.assertEqual(tuple(doc.root.arrays["mask"].chunks), (256, 256)) + # 4-D psf with single-cell chunks. + psf = doc.root.arrays["psf"] + self.assertEqual(psf.shape, (2, 2, 21, 21)) + self.assertEqual(tuple(psf.chunks), (1, 1, 21, 21)) + + +def _make_minimal_cell_coadd(*, cell_shape, shape, n_cells, psf_shape): # noqa: ANN001, ANN201 + """Construct a minimal CellCoadd for layout testing. - fixup_cell_coadd_psf(self.document) + Implementer: read ``python/lsst/images/cells/_coadd.py`` and + assemble the smallest valid CellCoadd whose ``cell_shape``, + overall image shape, cell-grid dimensions, and per-cell PSF + shape match the requested values. The test only asserts on the + on-disk layout the write helper produces. + """ + raise unittest.SkipTest( + "Implementer: build a minimal CellCoadd per the local ctor." + ) ``` -- [ ] **Step 5: Run the tests** +- [ ] **Step 2: Run the test** -Run: `pytest tests/test_zarr_output_archive.py::ZarrCellCoaddPsfStackingTestCase -v` -Expected: PASS — stacked shape is `(2, 3, 4, 5)` and the JSON rewrite is present. +Run: `pytest tests/test_zarr_output_archive.py::ZarrCellCoaddWriteTestCase -v` +Expected: After the implementer replaces `_make_minimal_cell_coadd`, PASS. SKIP otherwise — the placeholder must be replaced before merging this phase. 
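
The shapes asserted in the layout test follow from simple grid arithmetic. The sketch below derives them from the factory arguments, assuming the image is tiled exactly by whole cells; the function name `expected_cell_coadd_layout` is hypothetical and exists only to make the arithmetic explicit.

```python
from typing import Dict, Tuple


def expected_cell_coadd_layout(
    shape: Tuple[int, int],
    cell_shape: Tuple[int, int],
    psf_shape: Tuple[int, int],
) -> Dict[str, Tuple[int, ...]]:
    """Derive the chunk and PSF shapes the layout test asserts on.

    Assumption: ``shape`` is an exact multiple of ``cell_shape``.
    """
    ny, nx = shape
    cy, cx = cell_shape
    if ny % cy or nx % cx:
        raise ValueError("image shape must be a whole number of cells")
    n_cells = (ny // cy, nx // cx)
    return {
        # image / variance / mask share cell-aligned chunks.
        "plane_chunks": (cy, cx),
        # One PSF stamp per cell, stacked on two leading grid axes.
        "psf_shape": n_cells + tuple(psf_shape),
        # Single-cell chunks for the stacked PSF.
        "psf_chunks": (1, 1) + tuple(psf_shape),
    }


layout = expected_cell_coadd_layout((512, 512), (256, 256), (21, 21))
print(layout)
```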
-- [ ] **Step 6: Commit** +- [ ] **Step 3: Commit** ```bash -git add python/lsst/images/zarr/_layout.py python/lsst/images/zarr/_output_archive.py tests/test_zarr_output_archive.py -git commit -m "feat: stack CellCoadd per-cell PSFs into a 4D (Cy, Cx, Py, Px) image" +git add tests/test_zarr_output_archive.py +git commit -m "test: pin on-disk zarr layout for CellCoadd" ``` -### Task 4.6: CellCoadd round-trip +### Task 4.5: CellCoadd round-trip **Files:** - Modify: `tests/test_zarr_round_trip.py` +The same minimal CellCoadd factory used in Task 4.4 round-trips through `RoundtripZarr`. Spot-checks the image and one per-cell PSF. + - [ ] **Step 1: Write the test** Append to `tests/test_zarr_round_trip.py`: @@ -3773,7 +4295,9 @@ class ZarrCellCoaddRoundTripTestCase(unittest.TestCase): original = _make_minimal_cell_coadd_with_psf() # implementer-supplied with RoundtripZarr(self, original) as roundtrip: recovered = roundtrip.recovered - np.testing.assert_array_equal(recovered.image.array, original.image.array) + np.testing.assert_array_equal( + recovered.image.array, original.image.array + ) # Spot-check one per-cell PSF if the API exposes them. if hasattr(original, "psf") and hasattr(original.psf, "per_cell"): np.testing.assert_array_equal( @@ -3782,15 +4306,20 @@ class ZarrCellCoaddRoundTripTestCase(unittest.TestCase): def _make_minimal_cell_coadd_with_psf(): # noqa: ANN201 - raise unittest.SkipTest("Implementer: assemble a minimal CellCoadd with a per-cell PSF.") -``` + """Implementer: assemble a minimal CellCoadd with a 4-D per-cell PSF. -(As in Task 4.4, the constructor is implementer-supplied — the test asserts only round-trip correctness.) + Reuse `_make_minimal_cell_coadd` from `test_zarr_output_archive.py` + if the same factory works here, or build one in this file. + """ + raise unittest.SkipTest( + "Implementer: assemble a minimal CellCoadd with a per-cell PSF." 
+ ) +``` - [ ] **Step 2: Run the test** Run: `pytest tests/test_zarr_round_trip.py::ZarrCellCoaddRoundTripTestCase -v` -Expected: PASS (or SKIP if the helper is still a placeholder; replace before merging the phase). +Expected: After the implementer replaces the factory, PASS. SKIP otherwise; replace before merging this phase. - [ ] **Step 3: Commit** @@ -3801,22 +4330,22 @@ git commit -m "test: round-trip CellCoadd through the zarr backend" --- -**End of Phase 4.** Six tasks. ColorImage stacks into a single OME `(c, y, x)` image; CellCoadd uses cell-aligned chunks and stacks per-cell PSFs into a 4D `(Cy, Cx, Py, Px)` image. Both types round-trip and read sliced sources via the Phase 3 lazy-handle path. Phase 5 covers cross-format round-trips (FITS↔Zarr opaque-metadata preservation) and the optional external-reader sanity tests. +**End of Phase 4.** Five tasks. ColorImage writes its three channels as recursive sub-archives (each a valid Image sub-archive with its own OME multiscales), CellCoadd writes flat siblings with cell-aligned chunks plus a 4-D PSF with single-cell chunks. Both types round-trip without any byte duplication or fixup pass. Phase 5 covers FITS↔Zarr opaque-metadata round-trips, xarray interop assertions, and the optional external-reader sanity tests. -## Phase 5 — Cross-format round-trips and external-reader sanity +## Phase 5 — Cross-format round-trips, xarray interop, external readers -This phase makes the zarr backend a peer of FITS / NDF for round-trip preservation: an object read from FITS carries its primary-HDU header in `_opaque_metadata`, and writing that object to zarr must preserve those cards so a later round-trip back to FITS reproduces the original headers byte-for-byte. 
+This phase makes the zarr backend a peer of FITS / NDF for round-trip preservation: an object read from FITS carries its primary-HDU header in `_opaque_metadata`, and writing that object to zarr preserves those cards so a later round-trip back to FITS reproduces the original header cards.

-The phase also adds two **optional** external-reader tests that run only when the corresponding tools are installed: an `ngff-validator` compliance check and an `ome-zarr-py` reader sanity check. Both are skipped silently in environments where the tooling isn't available.
+It also confirms the **xarray interop contract**: `xr.open_zarr(path)` returns a `Dataset` with `image` / `variance` / `mask` data variables sharing the `(y, x)` dimensions and CF flag attrs surviving on the mask. Two optional external-reader checks (`ngff-validator`, `ome-zarr-py`) round out the phase; both skip silently when their dependencies are absent.

 ### Task 5.1: Persist `FitsOpaqueMetadata` on write to zarr

 **Files:**
-- Modify: `python/lsst/images/zarr/_output_archive.py` (extend `write` to accept and stash opaque metadata)
 - Modify: `python/lsst/images/zarr/_layout.py` (add `serialize_fits_opaque_metadata`)
-- Test: `tests/test_zarr_output_archive.py`
+- Modify: `python/lsst/images/zarr/_output_archive.py` (extend `write` to call it)
+- Modify: `tests/test_zarr_output_archive.py`

-The opaque metadata lives at `/lsst/opaque_metadata/fits/primary` as a 1-D `uint8` zarr array containing UTF-8 JSON. The JSON encodes the astropy `Header` via `header.tostring(sep="\n", endcard=False, padding=False)` (lossless and human-readable). The root attribute `lsst.opaque_metadata_format = "fits"` flags its presence.
+The opaque metadata lives at `/lsst/opaque_metadata/fits/primary` as a 1-D `uint8` array containing UTF-8 JSON. The JSON encodes the astropy `Header` as a flat `{keyword: value}` dict. The root attribute `lsst.opaque_metadata_format = "fits"` flags its presence.
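
To make the storage format concrete, here is a minimal standard-library sketch of the flat `{keyword: value}` encoding and its inverse. It uses plain `(keyword, value, comment)` tuples as a stand-in for astropy cards, and the helper names `encode_cards` and `decode_cards` are hypothetical; in the backend the resulting bytes would be wrapped as a 1-D `uint8` array.

```python
import json

# Stand-in for header cards: (keyword, value, comment) triples.
cards = [
    ("ORIGIN", "RUBIN", "institution"),
    ("EXPTIME", 30.0, "exposure time in seconds"),
    ("HISTORY", "bias subtracted", ""),
    ("HISTORY", "flat fielded", ""),
]


def encode_cards(cards):
    """Flatten cards to a keyword -> value dict and return UTF-8 JSON bytes."""
    flat = {keyword: value for keyword, value, _comment in cards if keyword}
    return json.dumps(flat).encode("utf-8")


def decode_cards(payload):
    """Inverse of encode_cards: JSON bytes back to a keyword -> value dict."""
    return json.loads(bytes(payload).decode("utf-8"))


payload = encode_cards(cards)
# Each byte of the payload would become one uint8 element on disk.
restored = decode_cards(payload)
print(restored)
```

Note what a flat dict cannot carry: card comments are dropped, and repeated keywords such as `HISTORY` collapse to the last value. If exact header reproduction matters, a richer encoding (for example a list of cards) would be needed.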
- [ ] **Step 1: Write the failing test** @@ -3826,6 +4355,7 @@ Append to `tests/test_zarr_output_archive.py`: @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") class ZarrOpaqueMetadataWriteTestCase(unittest.TestCase): def test_fits_opaque_metadata_persists(self) -> None: + import json as _json import os import tempfile @@ -3838,7 +4368,10 @@ class ZarrOpaqueMetadataWriteTestCase(unittest.TestCase): from lsst.images.zarr import write from lsst.images.zarr._model import ZarrDocument - image = Image(np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25]) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) header = astropy.io.fits.Header() header["ORIGIN"] = "RUBIN" header["EXPTIME"] = 30.0 @@ -3852,23 +4385,20 @@ class ZarrOpaqueMetadataWriteTestCase(unittest.TestCase): with zarr.storage.LocalStore(target, read_only=True) as store: doc = ZarrDocument.from_zarr(store) self.assertEqual( - doc.root.attributes.lsst.get("opaque_metadata_format"), "fits" + doc.root.attributes.lsst.get("opaque_metadata_format"), + "fits", ) opaque_node = doc.root.get("/lsst/opaque_metadata/fits/primary") json_bytes = bytes(opaque_node.read()) - # Round-trip through astropy. - import json as _json - cards = _json.loads(json_bytes) - self.assertIn("ORIGIN", cards) self.assertEqual(cards["ORIGIN"], "RUBIN") self.assertEqual(cards["EXPTIME"], 30.0) ``` -- [ ] **Step 2: Run the test to verify it fails** +- [ ] **Step 2: Run to verify failure** Run: `pytest tests/test_zarr_output_archive.py::ZarrOpaqueMetadataWriteTestCase -v` -Expected: FAIL — `/lsst/opaque_metadata/fits/primary` not in store. +Expected: FAIL — `/lsst/opaque_metadata/fits/primary` is not in the store. 
- [ ] **Step 3: Implement opaque-metadata serialization** @@ -3881,31 +4411,64 @@ def serialize_fits_opaque_metadata(document: "ZarrDocument", opaque: Any) -> Non Stores the primary-HDU header as a JSON-encoded ``uint8`` array at ``/lsst/opaque_metadata/fits/primary`` and sets the ``lsst.opaque_metadata_format`` attribute on the root group. + No-op if the metadata is empty or missing a primary header. """ import json as _json + import numpy as np + from ..fits._common import ExtensionKey from ._model import ZarrArray primary = opaque.headers.get(ExtensionKey()) if primary is None or len(primary) == 0: return - # Encode as a flat dict of card key → value. Multi-record cards - # (CONTINUE, HISTORY, COMMENT) are preserved by astropy on round-trip - # via the same card-stringification used by the NDF backend. cards = {card.keyword: card.value for card in primary.cards if card.keyword} json_bytes = _json.dumps(cards).encode("utf-8") parent = document.root.ensure_group("/lsst/opaque_metadata/fits") - parent.arrays["primary"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) + parent.arrays["primary"] = ZarrArray( + data=np.frombuffer(json_bytes, dtype=np.uint8) + ) document.root.attributes.lsst["opaque_metadata_format"] = "fits" ``` -In `python/lsst/images/zarr/_output_archive.py`, extend `write` to call this: +In `python/lsst/images/zarr/_output_archive.py`, extend `write` to call this *after* `add_tree` returns and *before* the IR is materialized: ```python -def write(...): - ... - archive = ZarrOutputArchive(...) 
+def write( + obj: Any, + path: Any, + *, + chunks=None, + shards=None, + compression=None, + metadata=None, + butler_info=None, +) -> ArchiveTree: + from ._store import open_store_for_write + + archive_class = type(obj).__name__ + archive_default_name = getattr(obj, "_archive_default_name", None) + archive_metadata: dict[str, Any] = {} + if (cell_shape := getattr(obj, "cell_shape", None)) is not None: + archive_metadata["cell_shape"] = tuple(cell_shape) + if (cell_grid := getattr(obj, "cell_grid", None)) is not None: + archive_metadata["cell_grid"] = { + "bbox": list(cell_grid.bbox) if hasattr(cell_grid, "bbox") else None, + "cell_shape": list(cell_grid.cell_shape) + if hasattr(cell_grid, "cell_shape") + else None, + } + if (mask_schema := getattr(obj, "mask_schema", None)) is not None: + archive_metadata["mask_schema"] = mask_schema + + archive = ZarrOutputArchive( + chunks=chunks, + shards=shards, + compression=compression, + archive_class=archive_class, + archive_metadata=archive_metadata, + ) if archive_default_name is not None: tree = archive.serialize_direct(archive_default_name, obj.serialize) else: @@ -3914,8 +4477,9 @@ def write(...): tree.metadata.update(metadata) if butler_info is not None: tree.butler_info = butler_info - archive.add_tree(tree, archive_class=type(obj).__name__) - # Persist opaque metadata if the object carries any. + archive.add_tree(tree) + # Stage opaque metadata after add_tree so the namespace attribute + # writes happen in the right order. 
opaque = getattr(obj, "_opaque_metadata", None) if opaque is not None: from ._layout import serialize_fits_opaque_metadata @@ -3923,7 +4487,7 @@ def write(...): try: serialize_fits_opaque_metadata(archive.document, opaque) except ImportError: - pass # opaque type isn't a FITS one; ignore + pass # opaque is not a FITS one; ignore with open_store_for_write(path) as store: archive.document.to_zarr(store) return tree @@ -3932,7 +4496,7 @@ def write(...): - [ ] **Step 4: Run the tests** Run: `pytest tests/test_zarr_output_archive.py::ZarrOpaqueMetadataWriteTestCase -v` -Expected: PASS — opaque metadata is staged at the spec path. +Expected: PASS — opaque metadata is staged at the spec path with the correct format flag. - [ ] **Step 5: Commit** @@ -3944,11 +4508,11 @@ git commit -m "feat: persist FitsOpaqueMetadata at /lsst/opaque_metadata/fits/pr ### Task 5.2: Restore `FitsOpaqueMetadata` on read from zarr **Files:** -- Modify: `python/lsst/images/zarr/_input_archive.py` (read opaque metadata in `__init__`; expose via `get_opaque_metadata`) - Modify: `python/lsst/images/zarr/_layout.py` (add `deserialize_fits_opaque_metadata`) +- Modify: `python/lsst/images/zarr/_input_archive.py` (read it in `__init__`; expose via `get_opaque_metadata`; attach in `read`) - Modify: `tests/test_zarr_input_archive.py` -`get_opaque_metadata()` returns a `FitsOpaqueMetadata` reconstructed from `/lsst/opaque_metadata/fits/primary`. The `read()` helper attaches it to the deserialized object as `obj._opaque_metadata` (matching the FITS / NDF read patterns). +`get_opaque_metadata()` returns a `FitsOpaqueMetadata` reconstructed from `/lsst/opaque_metadata/fits/primary`. The `read()` helper attaches it to the deserialized object as `obj._opaque_metadata` (matching FITS / NDF read patterns). 
- [ ] **Step 1: Write the failing test** @@ -3965,7 +4529,8 @@ class ZarrOpaqueMetadataReadTestCase(unittest.TestCase): from lsst.images.zarr import read, write image = Image( - np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25] + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], ) header = astropy.io.fits.Header() header["ORIGIN"] = "RUBIN" @@ -3985,7 +4550,7 @@ class ZarrOpaqueMetadataReadTestCase(unittest.TestCase): self.assertEqual(recovered_header["EXPTIME"], 30.0) ``` -- [ ] **Step 2: Run the test to verify it fails** +- [ ] **Step 2: Run to verify failure** Run: `pytest tests/test_zarr_input_archive.py::ZarrOpaqueMetadataReadTestCase -v` Expected: FAIL — `recovered._opaque_metadata` is `None` or unset. @@ -3996,9 +4561,9 @@ Append to `python/lsst/images/zarr/_layout.py`: ```python def deserialize_fits_opaque_metadata(document: "ZarrDocument") -> Any | None: - """Return a `FitsOpaqueMetadata` reconstructed from the IR, or None. + """Reconstruct a `FitsOpaqueMetadata` from the IR, or return None. - Returns ``None`` when the archive doesn't have a FITS opaque + Returns ``None`` when the archive does not have a FITS opaque metadata block (the common case for archives that originated as native zarr). """ @@ -4027,7 +4592,7 @@ def deserialize_fits_opaque_metadata(document: "ZarrDocument") -> Any | None: return opaque ``` -In `python/lsst/images/zarr/_input_archive.py`, store opaque metadata at construction time and expose it; attach it in `read`: +In `python/lsst/images/zarr/_input_archive.py`, store opaque metadata at construction time, expose it, and attach it in `read`: ```python def __init__(self, document: ZarrDocument) -> None: @@ -4059,7 +4624,7 @@ def read[T: Any](cls, path, **kwargs): - [ ] **Step 4: Run the tests** Run: `pytest tests/test_zarr_input_archive.py::ZarrOpaqueMetadataReadTestCase -v` -Expected: PASS — recovered header has ORIGIN and EXPTIME. 
+Expected: PASS — recovered header has both cards. - [ ] **Step 5: Commit** @@ -4073,7 +4638,7 @@ git commit -m "feat: restore FitsOpaqueMetadata on zarr read" **Files:** - Create: `tests/test_zarr_cross_format.py` -This test exercises the full cross-format pipeline: read a FITS file, write it to zarr, read the zarr back, write it to FITS, and verify the final FITS file's primary header matches the original FITS header card-for-card. +End-to-end: read a FITS file, write it to zarr, read the zarr back, write it to FITS. The final FITS file's primary header must match the original's card-for-card. - [ ] **Step 1: Write the test** @@ -4120,14 +4685,14 @@ class FitsZarrCrossFormatTestCase(unittest.TestCase): from lsst.images.fits import write as fits_write original = Image( - np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25] + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], ) with tempfile.TemporaryDirectory() as tmp: fits_a = os.path.join(tmp, "a.fits") zarr_path = os.path.join(tmp, "b.zarr") fits_b = os.path.join(tmp, "c.fits") - # Stamp a recognisable card on the FITS write. def update_header(header): # noqa: ANN001 header["ORIGIN"] = "RUBIN" header["EXPTIME"] = 30.0 @@ -4159,7 +4724,108 @@ git add tests/test_zarr_cross_format.py git commit -m "test: FITS↔Zarr opaque-metadata round-trip" ``` -### Task 5.4: Optional `ome-zarr-py` external-reader sanity test +### Task 5.4: xarray interop assertion + +**Files:** +- Create: `tests/test_zarr_xarray_interop.py` + +The whole point of the xarray/CF root layout is that `xr.open_zarr(path)` returns a `Dataset` with the masked-image components as data variables sharing the `(y, x)` dimensions, and the CF `flag_masks` / `flag_meanings` survive on the `mask` variable. This test pins that contract. + +Skipped if `xarray` is not installed; the implementer adds `xarray` to the test extras when this test is added. 
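
The CF attributes this test checks can be derived from the ordered list of mask plane names. The sketch below is illustrative only: `cf_flag_attrs` is a hypothetical helper, and the intermediate `uint16` / `uint32` steps are an assumption (the plan only states `uint8` for up to 8 planes, `uint64` for up to 64, and an error beyond that).

```python
from typing import Dict, List


def cf_flag_attrs(plane_names: List[str]) -> Dict[str, object]:
    """Build CF flag attributes and pick a mask dtype for the given planes."""
    n_planes = len(plane_names)
    if n_planes > 64:
        raise ValueError("more than 64 mask planes cannot fit in one integer")
    # Smallest unsigned width that holds one bit per plane (assumed ladder).
    dtype = next(f"uint{bits}" for bits in (8, 16, 32, 64) if n_planes <= bits)
    return {
        "dtype": dtype,
        # Plane i owns bit i, so its mask value is 2**i.
        "flag_masks": [1 << i for i in range(n_planes)],
        # CF stores meanings as a single space-separated string.
        "flag_meanings": " ".join(plane_names),
    }


attrs = cf_flag_attrs(["BAD", "SAT", "CR"])
print(attrs)
```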
+ +- [ ] **Step 1: Write the test** + +Create `tests/test_zarr_xarray_interop.py`: + +```python +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import os +import tempfile +import unittest + +import numpy as np + +try: + import zarr # noqa: F401 + + from lsst.images.zarr import write + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + +try: + import xarray as xr # noqa: F401 + + HAVE_XARRAY = True +except ImportError: + HAVE_XARRAY = False + + +@unittest.skipUnless(HAVE_ZARR and HAVE_XARRAY, "xarray is not installed") +class XarrayInteropTestCase(unittest.TestCase): + def test_open_zarr_returns_dataset_with_masked_image_components(self) -> None: + from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema + + schema = MaskSchema( + [ + MaskPlane("BAD", "Bad pixel."), + MaskPlane("SAT", "Saturated."), + MaskPlane("CR", "Cosmic ray."), + ] + ) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + masked = MaskedImage(image, mask_schema=schema) + masked.mask.set("BAD", image.array % 2 == 0) + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "masked.zarr") + write(masked, target) + ds = xr.open_zarr(target) + # Three data variables sharing the (y, x) dims. + self.assertIn("image", ds.data_vars) + self.assertIn("variance", ds.data_vars) + self.assertIn("mask", ds.data_vars) + self.assertEqual(ds["image"].dims, ("y", "x")) + self.assertEqual(ds["mask"].dims, ("y", "x")) + self.assertEqual(ds["image"].shape, (4, 5)) + # CF flag attrs survive on the mask variable. 
+ self.assertEqual(ds["mask"].attrs["flag_meanings"], "BAD SAT CR") + self.assertEqual(list(ds["mask"].attrs["flag_masks"]), [1, 2, 4]) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run the test** + +Run: `pytest tests/test_zarr_xarray_interop.py -v` +Expected: PASS if `xarray` is installed; SKIP otherwise. If it fails when xarray is present, inspect what xarray sees: most often it's a `_ARRAY_DIMENSIONS` typo, or `tree` / `wcs_ast` arrays leaking into the Dataset (xarray treats every zarr array in the group as a data variable — those are 1-D `uint8` arrays so they should appear as 1-D variables, harmless, but they shouldn't shadow `image` etc.). + +- [ ] **Step 3: Commit** + +```bash +git add tests/test_zarr_xarray_interop.py +git commit -m "test: xarray.open_zarr returns Dataset with image/variance/mask data variables" +``` + +### Task 5.5: Optional `ome-zarr-py` external-reader sanity test **Files:** - Create: `tests/test_zarr_external_reader.py` @@ -4215,12 +4881,12 @@ class OmeZarrReaderTestCase(unittest.TestCase): from lsst.images import Box, Image original = Image( - np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25] + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], ) with tempfile.TemporaryDirectory() as tmp: target = os.path.join(tmp, "out.zarr") write(original, target) - # ome-zarr opens the store and exposes the multiscales node. from ome_zarr.io import parse_url from ome_zarr.reader import Reader @@ -4229,7 +4895,6 @@ class OmeZarrReaderTestCase(unittest.TestCase): reader = Reader(location) nodes = list(reader()) self.assertGreaterEqual(len(nodes), 1) - # The first node should expose the (y, x) science array. 
data = nodes[0].data[0] # level 0 self.assertEqual(tuple(data.shape), (4, 5)) @@ -4250,12 +4915,12 @@ git add tests/test_zarr_external_reader.py git commit -m "test: ome-zarr-py can open archives written by lsst.images.zarr" ``` -### Task 5.5: Optional `ngff-validator` compliance test +### Task 5.6: Optional `ngff-validator` compliance test **Files:** - Create: `tests/test_zarr_ome_compliance.py` -`ngff-validator` (or its successor) checks an archive against the OME-NGFF schema. We invoke it via subprocess if available; the test is skipped silently otherwise. Validation runs against representative outputs of every supported archive class. +`ngff-validator` checks an archive against the OME-NGFF schema. Invoked via subprocess if available; skipped otherwise. Validates representative outputs of every supported archive class. - [ ] **Step 1: Write the test** @@ -4307,31 +4972,35 @@ class NgffComplianceTestCase(unittest.TestCase): self.assertEqual( result.returncode, 0, - f"ngff-validator failed for {target}:\nSTDOUT:\n{result.stdout}\nSTDERR:\n{result.stderr}", + f"ngff-validator failed for {target}:\n" + f"STDOUT:\n{result.stdout}\nSTDERR:\n{result.stderr}", ) def test_image_validates(self) -> None: from lsst.images import Box, Image image = Image( - np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25] + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], ) with tempfile.TemporaryDirectory() as tmp: target = os.path.join(tmp, "out.zarr") write(image, target) self._validate(target) - def test_color_image_validates(self) -> None: - from lsst.images import Box, ColorImage, Image + def test_masked_image_validates(self) -> None: + from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema - red = Image(np.full((4, 5), 1, dtype=np.uint8), bbox=Box.factory[10:14, 20:25]) - green = Image(np.full((4, 5), 2, dtype=np.uint8), bbox=red.bbox) - blue = Image(np.full((4, 5), 3, dtype=np.uint8), bbox=red.bbox) - color 
= ColorImage(red=red, green=green, blue=blue) + schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + masked = MaskedImage(image, mask_schema=schema) with tempfile.TemporaryDirectory() as tmp: - target = os.path.join(tmp, "color.zarr") - write(color, target) + target = os.path.join(tmp, "masked.zarr") + write(masked, target) self._validate(target) @@ -4342,7 +5011,7 @@ if __name__ == "__main__": - [ ] **Step 2: Run the test** Run: `pytest tests/test_zarr_ome_compliance.py -v` -Expected: PASS if `ngff-validator` is on PATH; SKIP otherwise. If a real-world install is available and the test fails, fix the layout (e.g. correct an axis-type misclassification) before merging. +Expected: PASS if `ngff-validator` is on PATH; SKIP otherwise. If a real install is available and validation fails, fix the layout (most likely an axis-type misclassification or a `coordinateTransformations` shape error) before merging. - [ ] **Step 3: Commit** @@ -4353,37 +5022,39 @@ git commit -m "test: ngff-validator compliance check (skipped when validator abs --- -**End of Phase 5.** Five tasks. Cross-format round-trips preserve FITS primary-header cards through Zarr; optional external-reader tests confirm spec compliance and tooling interoperability when their dependencies are installed. Phase 6 wraps up with module documentation and a changelog entry. +**End of Phase 5.** Six tasks. FITS↔Zarr round-trips preserve primary-HDU cards through Zarr; xarray interop is pinned by an `xr.open_zarr` test that asserts `Dataset` shape and CF flag attrs on the mask; optional external-reader checks confirm OME-NGFF compliance and `ome-zarr-py` interop when their dependencies are installed. Phase 6 wraps up with module documentation and a changelog entry. 
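
The skip-when-absent behaviour of the optional checks in this phase can also be expressed with the standard library alone. This is an alternative sketch, not the pattern the plan's tests use (they rely on `try` / `except ImportError`); the helper names `have_module` and `have_tool` are hypothetical.

```python
import importlib.util
import shutil
import unittest


def have_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def have_tool(name: str) -> bool:
    """Return True if an executable with this name is on PATH."""
    return shutil.which(name) is not None


class OptionalChecksExample(unittest.TestCase):
    @unittest.skipUnless(have_module("xarray"), "xarray is not installed")
    def test_needs_xarray(self) -> None:
        import xarray  # noqa: F401

    @unittest.skipUnless(have_tool("ngff-validator"), "ngff-validator is not on PATH")
    def test_needs_validator(self) -> None:
        self.assertTrue(have_tool("ngff-validator"))
```

Probing with `find_spec` avoids importing the optional package at collection time, at the cost of not catching a broken install until the test body runs.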
## Phase 6 — Documentation, changelog, and final integration -Phase 6 wraps up the backend with the documentation and metadata that make it discoverable to package users. The reference docs live under `doc/lsst.images/` and follow the same `automodapi`-driven pattern as the other backends; the changelog uses `towncrier` fragments under `doc/changes/`. +Phase 6 wraps up the backend with the documentation that makes it discoverable. The reference docs live under `doc/lsst.images/` and follow the same `automodapi`-driven pattern as the other backends; the changelog uses `towncrier` fragments under `doc/changes/`. ### Task 6.1: Expand the module docstring **Files:** - Modify: `python/lsst/images/zarr/__init__.py` -The Phase 1 `__init__.py` carries a placeholder docstring. Replace it with a fully-documented version that covers: - -- which image types are supported -- the on-disk layout summary (top-level `/0`, `/lsst/mask/0`, `/lsst/variance/0`, `/lsst/tree`, the OME `multiscales` block) -- chunking and sharding defaults -- store dispatch (LocalStore vs ZipStore vs FsspecStore by URI) -- the lazy-subset guarantee for remote reads -- known follow-ups (multiscale pyramid, NGFF nonlinear transformations, dask/lazy read API, consolidated metadata) +The Phase 1 `__init__.py` carries a short docstring. Replace it with a full-fat version covering layout, lazy reads, FITS round-trip, and the v1 follow-ups. - [ ] **Step 1: Replace the docstring** -Edit `python/lsst/images/zarr/__init__.py` and replace the existing docstring (everything between the first `"""` and the matching `"""`): +Edit `python/lsst/images/zarr/__init__.py`. Replace the existing docstring (everything between the first triple-quote and the matching one) with: ```python -"""OME-Zarr v0.5 archive backend for `lsst.images`. +"""Zarr v3 archive backend for `lsst.images`. 
+ +This module reads and writes Zarr v3 archives whose root layout is +xarray/CF-shaped (``image``, ``variance``, ``mask`` as siblings sharing +``(y, x)`` dimensions, CF ``flag_masks`` / ``flag_meanings`` / +``flag_descriptions`` on the mask) with OME-NGFF v0.5 multiscales +metadata as a discoverability layer pointing at the same ``image`` +array. The same bytes are visible to ``xarray``, GDAL's Zarr driver, +and OME-Zarr tooling like ``napari`` and ``ome-zarr-py``. -This module reads and writes OME-Zarr v0.5 NGFF files augmented with -namespaced ``lsst:`` extensions. Every image type that already serializes -to FITS / JSON / NDF is supported: `~lsst.images.Image`, -`~lsst.images.Mask`, `~lsst.images.MaskedImage`, +Supported types +--------------- + +Every image type that already serializes to FITS / JSON / NDF: +`~lsst.images.Image`, `~lsst.images.Mask`, `~lsst.images.MaskedImage`, `~lsst.images.VisitImage`, `~lsst.images.ColorImage`, and `lsst.images.cells.CellCoadd`, plus any object reachable through the `~lsst.images.serialization.OutputArchive` interface. @@ -4391,66 +5062,91 @@ to FITS / JSON / NDF is supported: `~lsst.images.Image`, On-disk layout -------------- -A zarr archive written by `~lsst.images.zarr.write` always has: - -- A top-level data array at ``/0`` whose OME axes (``[y, x]`` for plain - images, ``[c, y, x]`` for `ColorImage`) appear in the - ``ome.multiscales`` block of the root group's attributes. -- A ``lsst.archive_class`` attribute on the root group naming the - in-memory class so the matching deserializer can be dispatched. -- A 1-D ``uint8`` array at ``/lsst/tree`` containing the JSON - serialization of the object's archive tree. -- Heterogeneous companion sub-images at ``/lsst/mask/0``, - ``/lsst/variance/0``, etc., each of which is itself a valid OME-Zarr - group with its own multiscale metadata. 
-- For `ColorImage`, the three channels are stacked into a single - ``(3, Y, X)`` array at ``/0`` and an ``ome.omero.channels`` block - describes them. Per-channel references in the JSON tree use the - ``zarr:/0?c=N`` query suffix. -- For `CellCoadd`, per-cell PSFs are stacked into a single 4-D - ``(Cy, Cx, Py, Px)`` array at ``/lsst/psf/per_cell/0``; per-cell - references use the ``zarr:/lsst/psf/per_cell/0?cell=Cy,Cx`` suffix. - -External tools that consume OME-Zarr (`napari`, `neuroglancer`, -`ome-zarr-py`, `ngff-validator`) can render the science array without -LSST-specific awareness. +A `MaskedImage` archive contains: + +- ``image``, ``variance``, ``mask`` arrays at the root, shaped + ``(Y, X)`` with shared chunk sizes. +- ``tree`` — 1-D ``uint8`` zarr array containing UTF-8 JSON of the + Pydantic archive tree (the round-trip authority). +- ``wcs_ast`` — 1-D ``uint8`` zarr array containing the AST FrameSet + text (the WCS round-trip authority). + +The mask is a 2-D unsigned integer (``uint8`` for ≤8 planes, up to +``uint64`` for 64 planes; >64 raises). Each pixel's bits encode the +applicable mask planes — the same logical representation the FITS +backend uses, so FITS↔Zarr mask round-trips need no bit-repacking. + +For `ColorImage`, the three channels are written as recursive +sub-archives at ``red/``, ``green/``, ``blue/``. Each sub-archive is +itself a valid Image-shaped OME-NGFF group with its own ``image`` +array, OME multiscales metadata, and ``lsst.archive_class = "Image"``. + +For `CellCoadd`, ``image`` / ``variance`` / ``mask`` are siblings +(cell-aligned chunks driven by ``cell_shape``), and ``psf`` is a 4-D +``(Cy, Cx, Py, Px)`` array with single-cell chunks +``(1, 1, Py, Px)``. ``lsst.cell_grid = {bbox, cell_shape}`` lives on +the root attrs. + +The OME multiscales ``dataset.path`` always points at a sibling array +(``"image"`` for the standard case). No bytes are duplicated for the +OME view — the science array is the same array xarray sees. 
+ +WCS handling +------------ + +The AST ``FrameSet`` text at ``wcs_ast`` is the round-trip authority. +For external tools (napari, neuroglancer), the layout layer also +emits an OME-NGFF v0.5 affine ``coordinateTransformations`` block +that approximates the linear part of the pixel-to-sky map. Before +emitting, residuals are sampled on an 11×11 grid; if the worst +pixel-equivalent error exceeds 1.0 pixel, the affine block is dropped +and ``lsst.wcs_simplified_dropped: true`` is recorded with the +observed maximum. Readers always reconstruct the projection from +``wcs_ast``. + +Full RFC-5 nonlinear coordinate transformations as authoritative +output is a follow-up; it is blocked on writing an AST JSON channel +that serializes a ``FrameSet`` to / from RFC-5 transformation JSON. Cloud-friendly defaults ----------------------- -- Default chunk geometry is tile-aligned: ``min(1024, dim)`` per axis - for plain images, ``cell_shape`` for `CellCoadd`. -- Sharding (zarr v3 native) is enabled by default with a tunable shard - size (4×4 chunks by default) so object counts on S3 / GCS stay small - even for multi-gigabyte images. +- Default chunk geometry is tile-aligned: ``min(1024, dim)`` per + axis for plain images, ``cell_shape`` for `CellCoadd`, single-cell + for `CellCoadd`'s 4-D PSF. +- Sharding (zarr v3 native) is enabled by default with a tunable + shard size (4×4 chunks by default) so object counts on S3 / GCS + stay manageable for multi-gigabyte images. - Subset reads via the ``slices=`` argument to `~lsst.images.serialization.InputArchive.get_array` exploit zarr's - chunk index: only the chunks intersecting the requested slice are - fetched, even from remote stores. -- Both ``DirectoryStore`` and ``ZipStore`` are supported. The store is - selected from the URI shape: ``*.zarr.zip`` → ZipStore, otherwise - directory. Remote URIs (``s3://``, ``gs://``, ``http(s)://``) go - through `lsst.resources.ResourcePath` and `fsspec`. 
+ chunk index: only chunks intersecting the slice are fetched, even + from remote stores. +- Both ``DirectoryStore`` and ``ZipStore`` are supported. The store + is selected from the URI shape: ``*.zarr.zip`` → ZipStore, + otherwise directory. Remote URIs (``s3://``, ``gs://``, + ``http(s)://``) go through `lsst.resources.ResourcePath` and + `fsspec`. Round-trip with FITS -------------------- -When an object that originated from a FITS read (``_opaque_metadata`` -is a `~lsst.images.fits.FitsOpaqueMetadata`) is written to zarr, the -primary-HDU header is preserved at ``/lsst/opaque_metadata/fits/primary``. -Reading the zarr back attaches an equivalent ``FitsOpaqueMetadata`` to -the deserialized object so a subsequent write to FITS reproduces the -original header cards. +When an object that originated from a FITS read carries a +`~lsst.images.fits.FitsOpaqueMetadata`, the primary-HDU header is +preserved at ``/lsst/opaque_metadata/fits/primary``. Reading the +zarr back attaches an equivalent ``FitsOpaqueMetadata`` to the +deserialized object so a subsequent FITS write reproduces the +original cards. Optional install ---------------- -This backend requires `zarr >= 3.0`. Install via the ``[zarr]`` extra:: +This backend requires `zarr >= 3.0`. Install via the ``[zarr]`` +extra:: pip install lsst-images[zarr] -The top-level ``import lsst.images.zarr`` raises a clear `ImportError` -with this guidance if `zarr` is not installed. +The top-level ``import lsst.images.zarr`` raises a clear +`ImportError` with this guidance if `zarr` is not installed. Follow-ups ---------- @@ -4458,20 +5154,25 @@ Follow-ups These items are tracked separately from the initial backend release: - Lazy / dask-friendly read API (``read_lazy()``). -- Multiscale pyramid generation (level 1, 2, …) for visualization tools. 
-- NGFF nonlinear coordinate transformations once tooling adoption catches - up (currently the affine projection of the AST WCS is exposed via OME - ``coordinateTransformations``; the AST string remains the - authoritative round-trip source). -- ``zarr.consolidated_metadata`` extension to reduce object-list calls - on cloud stores. +- Multiscale pyramid generation (level 1, 2, …) for visualization + tools. +- NGFF RFC-5 nonlinear coordinate transformations as authoritative + output (blocked on AST JSON channel work). +- 3-D mask fallback for `>64` planes. +- ``zarr.consolidated_metadata`` extension to reduce object-list + calls on cloud stores. +- NCZarr / NetCDF interop (``_NCZARR_*`` markers + optional 1-D + coordinate variables; purely additive when adopted). +- Stacked OME view for `ColorImage` (single ``(3, Y, X)`` array + alongside the per-channel sub-archives, gated by an explicit + flag because of the byte-duplication cost). """ ``` - [ ] **Step 2: Verify the docstring is well-formed** -Run: `python -c "import lsst.images.zarr; help(lsst.images.zarr)" | head -40` -Expected: docstring renders cleanly, no Sphinx warnings (a deeper Sphinx build will run in Task 6.3). +Run: `python -c "import lsst.images.zarr; help(lsst.images.zarr)" | head -60` +Expected: docstring renders cleanly with no `:role:` typos or unclosed code blocks. A deeper Sphinx build runs in Task 6.2. - [ ] **Step 3: Commit** @@ -4493,19 +5194,19 @@ Mirrors `doc/lsst.images/ndf.rst` exactly so Sphinx renders the API in the same Create `doc/lsst.images/zarr.rst`: ```rst -OME-Zarr I/O -============ +Zarr I/O +======== -This is an OME-Zarr v0.5 NGFF serialization backend with namespaced -``lsst:`` extensions. Files written by this archive are valid OME-Zarr -images at every level where image-shaped data lives, and can be read by -external OME-Zarr tooling (napari, neuroglancer, ome-zarr-py) for the -science array. 
The LSST extensions add round-trip support for mask -plane semantics, AST WCS, the LSST archive tree, table layout, and -cell-grid hints. +A Zarr v3 serialization backend whose on-disk layout is xarray/CF-shaped +at the root (``image`` / ``variance`` / ``mask`` as siblings sharing +``(y, x)`` dimensions, CF ``flag_masks`` / ``flag_meanings`` on the +mask) with OME-NGFF v0.5 multiscales metadata as a discoverability +layer pointing at the same ``image`` array. The same bytes are visible +to ``xarray``, GDAL's Zarr driver, and OME-Zarr tooling like +``napari`` and ``ome-zarr-py``. Default chunking is tile-aligned (~1024×1024 for plain images, -``cell_shape`` for ``CellCoadd``), sharding is enabled by default, and +``cell_shape`` for ``CellCoadd``); sharding is enabled by default; and subset reads via ``slices=`` only fetch the chunks they need — including on remote stores accessed through ``lsst.resources.ResourcePath`` and ``fsspec``. @@ -4518,7 +5219,7 @@ on remote stores accessed through ``lsst.resources.ResourcePath`` and - [ ] **Step 2: Add the page to the toctree** -Edit `doc/lsst.images/index.rst` and insert `zarr.rst` immediately after `ndf.rst` (preserve the existing alphabetical ordering near that block; `zarr` lands at the end since it sorts last): +In `doc/lsst.images/index.rst`, find the line containing `ndf.rst` and add `zarr.rst` after it (preserving alphabetical order): ```rst fits.rst @@ -4530,28 +5231,28 @@ Edit `doc/lsst.images/index.rst` and insert `zarr.rst` immediately after `ndf.rs - [ ] **Step 3: Verify the docs build** Run: `cd doc && sphinx-build -W -b html . _build/html` (only if a Sphinx environment is set up locally; otherwise skip and rely on CI). -Expected: clean build, no warnings about undefined references. +Expected: clean build with no warnings about undefined references. 
- [ ] **Step 4: Commit** ```bash git add doc/lsst.images/zarr.rst doc/lsst.images/index.rst -git commit -m "docs: add OME-Zarr backend reference page" +git commit -m "docs: add Zarr backend reference page" ``` ### Task 6.3: Add the towncrier changelog fragment **Files:** -- Create: `doc/changes/DM-NNNNN.feature.md` (use the actual ticket number; the placeholder below is the design ticket for this feature) +- Create: `doc/changes/DM-XXXXX.feature.md` (replace `XXXXX` with the assigned Jira ticket number) -The package uses towncrier — each user-visible change lands as a single Markdown fragment under `doc/changes/`. Pick the **bugfix / feature / api / removal / perf / misc** category that fits; for this work it's **feature**. +Each user-visible change lands as a single Markdown fragment under `doc/changes/`. For this work it's a **feature**. - [ ] **Step 1: Create the fragment** -Create `doc/changes/DM-XXXXX.feature.md` (replace `XXXXX` with the assigned Jira ticket number — confirm with the engineer before merging): +Create `doc/changes/DM-XXXXX.feature.md` (replace `XXXXX` with the actual Jira ticket number): ```markdown -Added a new `lsst.images.zarr` archive backend that reads and writes OME-Zarr v0.5 NGFF files with namespaced `lsst:` extensions. Supports every image type the FITS / JSON / NDF backends support (`Image`, `Mask`, `MaskedImage`, `VisitImage`, `ColorImage`, `CellCoadd`). Cloud-friendly defaults (tile-aligned chunks, zarr v3 sharding, fsspec-backed remote stores) and subset reads that only fetch the chunks they need. Install via the new `[zarr]` extra (`pip install lsst-images[zarr]`). +Added a new `lsst.images.zarr` archive backend that reads and writes Zarr v3 archives. 
The on-disk layout is xarray/CF-shaped at the root (`image`, `variance`, `mask` as siblings sharing `(y, x)` dimensions, CF `flag_masks`/`flag_meanings` on the mask) with OME-NGFF v0.5 multiscales metadata layered on top — the same bytes are visible to xarray, GDAL, and OME-Zarr tooling like `napari` and `ome-zarr-py`. Supports every image type the FITS / JSON / NDF backends support (`Image`, `Mask`, `MaskedImage`, `VisitImage`, `ColorImage`, `CellCoadd`). Cloud-friendly defaults (tile-aligned chunks, zarr v3 sharding, fsspec-backed remote stores) and subset reads that only fetch the chunks they need. Install via the new `[zarr]` extra (`pip install lsst-images[zarr]`). ``` - [ ] **Step 2: Commit** @@ -4563,13 +5264,12 @@ git commit -m "docs: changelog entry for lsst.images.zarr backend" ### Task 6.4: Run the full test suite and finalize -**Files:** -- (No code changes — verification step.) +**Files:** none (verification step). - [ ] **Step 1: Run the full zarr test set** Run: `pytest tests/test_zarr_*.py -v` -Expected: all tests pass; external-reader and validator tests pass or skip cleanly depending on what's installed. +Expected: all tests pass; external-reader and validator tests pass or skip cleanly depending on what's installed; CellCoadd tests skip cleanly until the implementer-supplied factories are filled in. - [ ] **Step 2: Run the full package test suite to catch regressions** @@ -4592,11 +5292,11 @@ Expected: no findings. git status # should be clean ``` -If lint/mypy required fixes, commit them with a focused message such as `chore: type-check and lint cleanup for lsst.images.zarr`. +If lint / mypy required fixes, commit them with a focused message such as `chore: type-check and lint cleanup for lsst.images.zarr`. --- -**End of Phase 6.** Documentation, changelog, and final verification complete. The backend is ready for review and merge. +**End of Phase 6.** Documentation and final verification complete. The backend is ready for review and merge. 
--- @@ -4606,35 +5306,51 @@ If lint/mypy required fixes, commit them with a focused message such as `chore: | Spec section | Task(s) | |---|---| -| §1 Goals / scope | All phases collectively | -| §2 Module layout | 1.1 (skeleton), 2.1 (`_store`), 2.2 (`_layout`), 1.3-1.5 (`_model`), 2.3-2.5 (`_output_archive`), 3.1-3.5 (`_input_archive`) | -| §3 Top-level group attributes | 2.5 (`add_tree`), 4.2 (ColorImage `omero`), 4.5 (CellCoadd cell-grid hooks via fixup), 5.1 (`opaque_metadata_format`) | -| §3 Axis choice per archive class | 2.2 (`axes_for_archive_class`), 4.2 (ColorImage), 4.4-4.5 (CellCoadd) | -| §3 Mask / table / frame-set layout | 2.4 (output side), 3.4 (input side) | -| §3 Recursive composition | 2.3 (`serialize_pointer` writes nested OME groups), 3.3 (read-side cache) | -| §3 Chunking / sharding defaults | 2.2 (`chunks_for`), 4.4 (CellCoadd cell alignment) | -| §4 Round-trip rules / opaque metadata | 5.1, 5.2, 5.3 | -| §4 Error taxonomy | 3.1 (`_validate_root_attributes`), 3.2 (`get_array` source-format errors) | +| §1 Goals / scope / standards alignment | All phases collectively | +| §2 Module layout | 1.1 (skeleton), 1.2 (`_common`), 1.3-1.5 (`_model`), 2.1 (`_store`), 2.2-2.3 (`_layout`), 2.4-2.7 / 3.1-3.5 (archives) | +| §3 On-disk layout (root, siblings, attrs) | 2.5 (`add_array` for image/variance/mask), 2.7 (`add_tree` for root attrs and OME multiscales), 4.1 (recursive sub-archive decoration) | +| §3 Axis choice per archive class | 2.2 (`axes_for_archive_class`), 4.1 (sub-archive `("y", "x")`), 4.4 (CellCoadd) | +| §3 Mask 2-D packed integer with CF flag attrs | 1.2 (`mask_dtype_for_plane_count`), 1.5 (`CfFlagAttributes`), 2.5 (mask packing in `add_array`), 3.0 (native-mask flag), 3.2 (mask unpack on read), 3.6 (3-plane and 40-plane round-trips) | +| §3 JSON tree at `/tree` | 2.7 (`add_tree` stages JSON bytes); 3.1 (`get_tree` reads them) | +| §3 AST WCS at `/wcs_ast` | 2.7 (`_stage_wcs_ast`), 3.3 (Projection deserializer reads it) | +| §3 Tables 
under `/lsst/tables//` | 2.6 (output), 3.4 (input) | +| §3 Recursive composition | 4.1 (`decorate_sub_archives`) | +| §3 Chunking / sharding defaults / aligned siblings | 1.3-1.4 (defaults in IR), 2.2 (`chunks_for`, `chunks_aligned_to`), 4.3 (PSF single-cell chunks) | +| §4 FITS opaque-metadata round-trip | 5.1 (write), 5.2 (read), 5.3 (full FITS↔Zarr) | +| §4 WCS validation: 11×11 grid, 1-pixel threshold | 2.3 (`affine_check`), 2.7 (integration in `add_tree`) | +| §4 Error taxonomy | 1.2 (`>64`-plane refusal), 3.1 (missing `archive_class`, `>LSST_VERSION`), 3.2 (bad source string) | | §4 Mode and atomicity | 2.1 (create-only enforcement) | -| §4 Chunk-aligned subset reads | 3.2 (`_CountingStore` regression test) | -| §4 Forward compatibility | 3.1 (version refusal), 1.3 (unknown-key preservation) | -| §5 Test strategy | One test file per module, plus round-trip and external-reader sets | -| §5 Rollout plan (6 numbered steps) | Phases 1-6 directly mirror the spec's rollout | -| §6 Follow-ups | Documented in 6.1's docstring | +| §4 Chunk-aligned subset reads (lazy invariant) | 1.3 (`_CountingStore` test on the IR), 3.2 (regression test on the input archive) | +| §4 Mask schema mismatches | inherited from existing `Mask.deserialize`; v1 surfaces it through the standard error path; explicit dedicated test deferred to a follow-up | +| §4 Empty / minimal cases | 2.7 (no `wcs_ast` when no projection; unit-scale `coordinateTransformations` default), 2.5 (image without variance / mask) | +| §4 Forward compatibility | 1.3 (unknown-key preservation in `ZarrAttributes`), 3.1 (version refusal) | +| §5 Test layout | One test file per module, plus `test_zarr_round_trip.py`, `test_zarr_cross_format.py`, `test_zarr_xarray_interop.py`, `test_zarr_ome_compliance.py`, `test_zarr_external_reader.py` | +| §5 Rollout plan (6 numbered steps) | Phases 1–6 directly mirror the spec's rollout | +| §6 Follow-ups | Documented in 6.1's docstring (RFC-5, 3-D mask fallback, dask read, multiscale 
pyramid, consolidated metadata, NCZarr, stacked OME ColorImage view) | -**Type / name consistency** — IR types and key methods use consistent names across tasks: +**Implementer-judgement handoffs** — places where the plan asks the engineer to consult local code rather than follow a literal recipe: -- `ZarrDocument`, `ZarrGroup`, `ZarrArray`, `ZarrAttributes` (introduced 1.3-1.4, used everywhere after). -- `ZarrCompressionOptions` field names (`codec`, `cname`, `clevel`, `shuffle`) match between Tasks 1.2 and 1.4. -- `ZarrPointerModel.path` (Task 1.2) is read by `deserialize_pointer` (3.3) and written by `serialize_pointer` (2.3); both use absolute zarr paths starting with `/`. -- The sliced-source convention (`zarr:/?c=N`, `?cell=Cy,Cx`) is defined in Task 4.1 and consumed by ColorImage (4.2) and CellCoadd PSF (4.5) layout fixups. +- Tasks 4.4 / 4.5: minimal `CellCoadd` constructor — `_make_minimal_cell_coadd` and `_make_minimal_cell_coadd_with_psf` are `SkipTest` placeholders to be replaced by reading `python/lsst/images/cells/_coadd.py`. +- Task 6.3: the towncrier fragment filename uses `DM-XXXXX` — pick the real ticket number when committing. +- Task 3.6: the `RoundtripBase` helper may need a small directory-vs-file fix to accept `.zarr` directories. -**Open implementer judgement calls** — the plan flags two places where the implementer needs to consult local code rather than follow a literal recipe: +These are intentional handoffs, not placeholder content in the production code. -- Task 4.4 / 4.6: minimal `CellCoadd` constructor — the test helpers are placeholders that should be replaced by real construction once the implementer reads `python/lsst/images/cells/_coadd.py`. -- Task 6.3: the towncrier fragment filename uses `DM-XXXXX` — pick the real ticket number when committing. 
+**Type / name consistency** — IR types and key methods stay consistent across phases: + +- `ZarrDocument`, `ZarrGroup`, `ZarrArray`, `ZarrAttributes` introduced in 1.3-1.4, used everywhere after. +- `ZarrAttributes` has three namespaces (`lsst`, `ome`, `extra`); `extra` is read by xarray / CF tooling and tested in 1.3, 1.4, 5.4. +- `ZarrCompressionOptions.default_for_dtype` from 1.2 is consumed by the `to_zarr` codec builder in 1.4. +- `_layout.chunks_for` / `chunks_aligned_to` defined in 2.2 are used by the output archive in 2.5; `_layout.affine_check` defined in 2.3 is used in 2.7. +- `lsst.archive_class`, `lsst.tree`, `lsst.wcs_ast`, `lsst.cell_grid`, `lsst.opaque_metadata_format`, `lsst.wcs_simplified_dropped`, `lsst.wcs_simplified_max_residual_pixels` are spelled the same in every task that reads or writes them. +- The sliced-source convention (`?c=N`, `?cell=Cy,Cx`) from the v1 plan is **deliberately absent** — the no-stacking rule means every `ArrayReferenceModel.source` is plain `zarr:/`. + +**Critical invariants pinned by tests** — the four invariants stated in the plan header each have at least one failing test: -These are intentional handoffs, not placeholder content in the production code itself. +1. Lazy reads — `_CountingStore` test in 1.3 (IR level) and 3.2 (input archive level). +2. Aligned chunks — Phase 2.5 test asserting `variance` follows `image_chunks` after the override; CellCoadd test in 4.4 asserting all three siblings have `cell_shape` chunks. +3. Affine residual validator — Phase 2.3 tests with a synthetic linear FrameSet (passes) and a synthetic high-distortion FrameSet (drops). +4. No byte duplication — implicit in the "no fixup pass" architecture; explicit assertions in 4.1 (root has no OME multiscales for ColorImage) and 4.4 (CellCoadd PSF is a single 4-D array, not per-cell groups + a stacked array). 
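
Invariant 3 can be illustrated with a small standalone sketch. This is not the `_layout.affine_check` helper from Task 2.3, whose signature is not shown in this plan. It assumes the pixel-to-sky map is available as a plain Python callable rather than an AST `FrameSet`, and it assumes the affine is derived by central differences at the image centre, which the plan does not specify. The names `fit_affine`, `max_residual_pixels`, and `keep_affine` are hypothetical. Only the 11×11 sampling grid and the 1.0-pixel threshold come from the plan text.

```python
"""Hypothetical sketch of the 11x11 affine-residual check (not the real helper)."""
from math import hypot
from typing import Callable

# Assumed stand-in for the AST FrameSet: (x, y) pixel -> (u, v) sky.
PixelToSky = Callable[[float, float], tuple[float, float]]


def fit_affine(mapping: PixelToSky, width: int, height: int):
    """Linearize ``mapping`` at the image centre using central differences."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    u0, v0 = mapping(cx, cy)
    ux1, vx1 = mapping(cx + 1.0, cy)
    ux0, vx0 = mapping(cx - 1.0, cy)
    uy1, vy1 = mapping(cx, cy + 1.0)
    uy0, vy0 = mapping(cx, cy - 1.0)
    a, c = (ux1 - ux0) / 2.0, (vx1 - vx0) / 2.0  # du/dx, dv/dx
    b, d = (uy1 - uy0) / 2.0, (vy1 - vy0) / 2.0  # du/dy, dv/dy
    return (a, b, c, d), (u0, v0), (cx, cy)


def max_residual_pixels(mapping: PixelToSky, width: int, height: int, n: int = 11) -> float:
    """Worst pixel-equivalent error of the affine fit on an n x n grid."""
    (a, b, c, d), (u0, v0), (cx, cy) = fit_affine(mapping, width, height)
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("Degenerate linearization; cannot express error in pixels.")
    worst = 0.0
    for j in range(n):
        y = j * (height - 1) / (n - 1)
        for i in range(n):
            x = i * (width - 1) / (n - 1)
            u, v = mapping(x, y)
            du = u - (u0 + a * (x - cx) + b * (y - cy))
            dv = v - (v0 + c * (x - cx) + d * (y - cy))
            # Map the sky-space residual back to pixels via the inverse 2x2 matrix.
            dx = (d * du - b * dv) / det
            dy = (-c * du + a * dv) / det
            worst = max(worst, hypot(dx, dy))
    return worst


def keep_affine(mapping: PixelToSky, width: int, height: int, threshold: float = 1.0) -> bool:
    """True when the affine block would be kept (worst error within threshold)."""
    return max_residual_pixels(mapping, width, height) <= threshold


def linear(x: float, y: float) -> tuple[float, float]:
    """Synthetic, exactly affine map: the affine block should be kept."""
    return 10.0 + 2e-4 * x - 1e-5 * y, -5.0 + 1e-5 * x + 2e-4 * y


def distorted(x: float, y: float) -> tuple[float, float]:
    """Synthetic map with a quadratic term: the affine block should be dropped."""
    return 2e-4 * x + 1e-8 * x * x, 2e-4 * y
```

With a 1024×1024 image, `keep_affine(linear, 1024, 1024)` keeps the block and `keep_affine(distorted, 1024, 1024)` drops it, which mirrors the "passes" and "drops" cases named in invariant 3. The real validator would also record the observed maximum under `lsst.wcs_simplified_max_residual_pixels` when the block is dropped.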
From 396a04eb759154276304e37c610d24911b980025 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Fri, 22 May 2026 21:24:51 -0700 Subject: [PATCH 06/60] feat: add lsst.images.zarr package skeleton with guarded import --- pyproject.toml | 2 ++ python/lsst/images/zarr/__init__.py | 39 +++++++++++++++++++++++++++++ 2 files changed, 41 insertions(+) create mode 100644 python/lsst/images/zarr/__init__.py diff --git a/pyproject.toml b/pyproject.toml index b5309d3b..919a1685 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -53,6 +53,8 @@ piff = ["piff >= 1.6", "galsim >= 2.7"] butler = ["lsst-daf-butler"] # Add feature for Starlink NDF (HDS-on-HDF5) read/write support. ndf = ["h5py >= 3.10"] +# Add feature for Zarr v3 read/write support. +zarr = ["zarr >= 3.0"] [tool.setuptools.packages.find] where = ["python"] diff --git a/python/lsst/images/zarr/__init__.py b/python/lsst/images/zarr/__init__.py new file mode 100644 index 00000000..40da59ae --- /dev/null +++ b/python/lsst/images/zarr/__init__.py @@ -0,0 +1,39 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +"""Zarr v3 archive backend for `lsst.images`. + +Files written by this archive are xarray/CF-shaped at the root +(``image`` / ``variance`` / ``mask`` as siblings sharing ``(y, x)`` +dimensions, CF ``flag_masks`` / ``flag_meanings`` on the mask) with +OME-NGFF v0.5 multiscales metadata as a discoverability layer +pointing at the same ``image`` array. The same bytes are visible to +``xarray``, GDAL's Zarr driver, and OME-Zarr tooling like ``napari`` +and ``ome-zarr-py``. 
+ +Default chunk geometry is tile-aligned (~1024×1024 for plain images, +``cell_shape`` for ``CellCoadd``). Sharding (zarr v3 native) is +enabled by default with a tunable shard size to keep object counts +manageable on S3/GCS. Both ``DirectoryStore`` and ``ZipStore`` are +supported; the choice is driven by URI shape (``*.zarr.zip`` → +``ZipStore``, otherwise directory). Remote URIs go through +`lsst.resources.ResourcePath` and `fsspec`. +""" + +try: + import zarr # noqa: F401 +except ImportError as e: + raise ImportError( + "lsst.images.zarr requires the optional 'zarr' package (>=3.0). " + "Install it directly or via 'pip install lsst-images[zarr]'." + ) from e + +# Phase 1 has no public archive API yet. Re-exports are added in later phases. From a7dcc5708d7ee81cea1f578e3fbc42d0e94e6926 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Fri, 22 May 2026 21:45:36 -0700 Subject: [PATCH 07/60] feat: add ZarrPointerModel, ZarrCompressionOptions, mask-dtype helper --- python/lsst/images/zarr/_common.py | 125 +++++++++++++++++++++++++++++ tests/test_zarr_common.py | 82 +++++++++++++++++++ 2 files changed, 207 insertions(+) create mode 100644 python/lsst/images/zarr/_common.py create mode 100644 tests/test_zarr_common.py diff --git a/python/lsst/images/zarr/_common.py b/python/lsst/images/zarr/_common.py new file mode 100644 index 00000000..e3e4c003 --- /dev/null +++ b/python/lsst/images/zarr/_common.py @@ -0,0 +1,125 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +__all__ = ( + "LSST_NS", + "LSST_VERSION", + "OME_NS", + "OME_VERSION", + "ZarrCompressionOptions", + "ZarrPointerModel", + "archive_path_to_zarr_path", + "mask_dtype_for_plane_count", +) + +from dataclasses import dataclass +from typing import ClassVar, Self + +import numpy as np +import pydantic + +LSST_NS = "lsst" +"""Top-level zarr-attributes namespace key for LSST extensions.""" + +OME_NS = "ome" +"""Top-level zarr-attributes namespace key for OME-NGFF metadata.""" + +OME_VERSION = "0.5" +"""OME-Zarr / NGFF version this backend writes.""" + +LSST_VERSION = 1 +"""Schema version of the ``lsst:`` extension this backend writes. + +Readers refuse versions newer than they understand. Bump on +backwards-incompatible changes to the on-disk layout. +""" + + +class ZarrPointerModel(pydantic.BaseModel): + """Reference to a zarr archive sub-tree by absolute zarr path. + + Used by `ZarrOutputArchive` / `ZarrInputArchive` to point to + sub-trees that have been hoisted out of the main JSON tree into + separate zarr arrays. The path is interpreted relative to the + archive root, e.g. ``"/lsst/psf/tree"``. + """ + + path: str + """Absolute zarr path (e.g. ``/lsst/psf/tree``).""" + + +@dataclass(frozen=True) +class ZarrCompressionOptions: + """Per-array zarr v3 codec configuration. + + The default codec stack is ``bytes -> blosc(zstd, clevel=5)`` with + byte-shuffle for floats and bit-shuffle for integers (and masks). + All defaults are overridable per-array via the ``compression`` + keyword to ``write()``. 
+ """ + + codec: str = "blosc" + cname: str = "zstd" + clevel: int = 5 + shuffle: str = "shuffle" # 'shuffle' (byte) or 'bitshuffle' or 'noshuffle' + + DEFAULT_FLOAT: ClassVar[Self] + DEFAULT_INT: ClassVar[Self] + + @classmethod + def default_for_dtype(cls, dtype: str | np.dtype) -> Self: + """Return the default codec stack for a numpy dtype.""" + kind = np.dtype(dtype).kind + # 'u' (unsigned int), 'i' (signed int), 'b' (bool) -> bit-shuffle. + if kind in ("u", "i", "b"): + return cls.DEFAULT_INT + return cls.DEFAULT_FLOAT + + +ZarrCompressionOptions.DEFAULT_FLOAT = ZarrCompressionOptions(shuffle="shuffle") +ZarrCompressionOptions.DEFAULT_INT = ZarrCompressionOptions(shuffle="bitshuffle") + + +def archive_path_to_zarr_path(archive_path: str) -> str: + """Translate a serialization archive path to its zarr path. + + The empty archive path maps to the root-level JSON tree at + ``/tree``. Non-empty archive paths are kept verbatim (with a + leading slash). The v1 design's JSON-pointer mapping table is + intentionally absent: arrays land where their archive name says + they do. + """ + if not archive_path: + return "/tree" + stripped = archive_path.strip("/") + return f"/{stripped}" + + +def mask_dtype_for_plane_count(n_planes: int) -> np.dtype: + """Pick the smallest unsigned-integer dtype that holds ``n_planes`` bits. + + Returns ``uint8`` for <=8 planes, ``uint16`` for <=16, ``uint32`` + for <=32, ``uint64`` for <=64. Raises `ValueError` for >64 planes; + a 3-D fallback for that case is tracked as a follow-up. + """ + if n_planes <= 0: + raise ValueError(f"n_planes must be positive, got {n_planes}.") + if n_planes <= 8: + return np.dtype("uint8") + if n_planes <= 16: + return np.dtype("uint16") + if n_planes <= 32: + return np.dtype("uint32") + if n_planes <= 64: + return np.dtype("uint64") + raise ValueError(f"Mask has {n_planes} planes; v1 supports up to 64. 
3-D fallback is a follow-up.") diff --git a/tests/test_zarr_common.py b/tests/test_zarr_common.py new file mode 100644 index 00000000..0e19105d --- /dev/null +++ b/tests/test_zarr_common.py @@ -0,0 +1,82 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import unittest + +import numpy as np + +try: + from lsst.images.zarr._common import ( + LSST_NS, + LSST_VERSION, + OME_NS, + OME_VERSION, + ZarrCompressionOptions, + ZarrPointerModel, + archive_path_to_zarr_path, + mask_dtype_for_plane_count, + ) + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class CommonTestCase(unittest.TestCase): + """Tests for the zarr ``_common`` module.""" + + def test_pointer_round_trips(self) -> None: + original = ZarrPointerModel(path="/lsst/psf/tree") + recovered = ZarrPointerModel.model_validate_json(original.model_dump_json()) + self.assertEqual(recovered, original) + + def test_constants(self) -> None: + self.assertEqual(LSST_NS, "lsst") + self.assertEqual(OME_NS, "ome") + self.assertEqual(OME_VERSION, "0.5") + self.assertGreaterEqual(LSST_VERSION, 1) + + def test_archive_path_translation(self) -> None: + # Empty archive path -> the canonical root-level JSON tree. + self.assertEqual(archive_path_to_zarr_path(""), "/tree") + # Non-empty archive paths are kept verbatim. 
+ self.assertEqual(archive_path_to_zarr_path("/image"), "/image") + self.assertEqual(archive_path_to_zarr_path("image"), "/image") + self.assertEqual(archive_path_to_zarr_path("/red/image"), "/red/image") + self.assertEqual(archive_path_to_zarr_path("/psf"), "/psf") + + def test_compression_defaults(self) -> None: + floats = ZarrCompressionOptions.default_for_dtype("float32") + self.assertEqual(floats.codec, "blosc") + self.assertEqual(floats.shuffle, "shuffle") + ints = ZarrCompressionOptions.default_for_dtype("uint8") + self.assertEqual(ints.shuffle, "bitshuffle") + + def test_mask_dtype_picks_smallest_fit(self) -> None: + self.assertEqual(mask_dtype_for_plane_count(1), np.dtype("uint8")) + self.assertEqual(mask_dtype_for_plane_count(8), np.dtype("uint8")) + self.assertEqual(mask_dtype_for_plane_count(9), np.dtype("uint16")) + self.assertEqual(mask_dtype_for_plane_count(16), np.dtype("uint16")) + self.assertEqual(mask_dtype_for_plane_count(17), np.dtype("uint32")) + self.assertEqual(mask_dtype_for_plane_count(32), np.dtype("uint32")) + self.assertEqual(mask_dtype_for_plane_count(33), np.dtype("uint64")) + self.assertEqual(mask_dtype_for_plane_count(64), np.dtype("uint64")) + + def test_mask_dtype_refuses_more_than_64_planes(self) -> None: + with self.assertRaisesRegex(ValueError, "supports up to 64"): + mask_dtype_for_plane_count(65) + + +if __name__ == "__main__": + unittest.main() From 666a5d7f0e0f651ebfa951ff83c8b2a7c2dc711b Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Fri, 22 May 2026 21:53:26 -0700 Subject: [PATCH 08/60] feat: add ZarrAttributes and ZarrArray IR with lazy zarr.Array backing --- python/lsst/images/zarr/_model.py | 139 ++++++++++++++++++++++++++++++ tests/test_zarr_model.py | 122 ++++++++++++++++++++++++++ 2 files changed, 261 insertions(+) create mode 100644 python/lsst/images/zarr/_model.py create mode 100644 tests/test_zarr_model.py diff --git a/python/lsst/images/zarr/_model.py b/python/lsst/images/zarr/_model.py new file mode 100644 
index 00000000..c18ce8ef --- /dev/null +++ b/python/lsst/images/zarr/_model.py @@ -0,0 +1,139 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +"""Python intermediate representation for zarr / xarray-CF / OME-NGFF content. + +The IR is the source of truth for what gets written. ``ZarrOutputArchive`` +populates a `ZarrDocument`; on context-manager exit, `to_zarr` materializes +it through a configured ``zarr.storage.Store``. + +Reads invert that flow: ``ZarrInputArchive`` opens the store and calls +`ZarrDocument.from_zarr`, which builds the IR around **lazy** ``zarr.Array`` +handles. No array bytes are read until a caller asks for them via +`ZarrArray.read`, which forwards slices straight to the underlying handle. +This keeps subset reads of remote files cheap: only the chunks intersecting +the requested slice are fetched. +""" + +from __future__ import annotations + +__all__ = ( + "ZarrArray", + "ZarrAttributes", +) + +from dataclasses import dataclass, field +from types import EllipsisType +from typing import Any, Self + +import numpy as np +import zarr + +from ._common import LSST_NS, LSST_VERSION, OME_NS, OME_VERSION, ZarrCompressionOptions + + +@dataclass +class ZarrAttributes: + """Namespaced attributes attached to a `ZarrGroup` or `ZarrArray`. + + Three namespaces: + + - ``lsst`` — LSST extensions (always emitted with a ``version`` key). + - ``ome`` — OME-NGFF (emitted only when non-empty). + - ``extra`` — flat top-level keys for CF / xarray conventions + (``_ARRAY_DIMENSIONS``, ``flag_masks``, ``flag_meanings``, + ``flag_descriptions``, ``units``, ``long_name``, …). 
These live at + the top of ``zarr.json`` ``attributes`` so xarray and CF tooling + see them without unwrapping a namespace. + """ + + lsst: dict[str, Any] = field(default_factory=dict) + ome: dict[str, Any] = field(default_factory=dict) + extra: dict[str, Any] = field(default_factory=dict) + + def dump(self) -> dict[str, Any]: + """Return the raw mapping zarr-python writes to ``zarr.json``.""" + out: dict[str, Any] = dict(self.extra) + # lsst is always present so readers can dispatch on lsst.archive_class. + out[LSST_NS] = {"version": LSST_VERSION, **self.lsst} + if self.ome: + out[OME_NS] = {"version": OME_VERSION, **self.ome} + return out + + @classmethod + def load(cls, raw: dict[str, Any]) -> Self: + """Construct from a raw attributes mapping read from zarr.""" + lsst = dict(raw.get(LSST_NS, {})) + lsst.pop("version", None) # version implicit in the namespace + ome = dict(raw.get(OME_NS, {})) + ome.pop("version", None) + extra = {k: v for k, v in raw.items() if k not in (LSST_NS, OME_NS)} + return cls(lsst=lsst, ome=ome, extra=extra) + + +@dataclass +class ZarrArray: + """An IR node holding either staged numpy data or a lazy zarr handle. + + Parameters + ---------- + data + Either a ``numpy.ndarray`` (when staged for write by the output + archive) or a ``zarr.Array`` (when read by the input archive). + The two forms never mix in a single instance. + chunks + Per-axis chunk shape. ``None`` lets `to_zarr` derive a default + from the array shape (~1024 per axis for plain images). + shards + Per-axis shard shape (zarr v3 native). ``None`` lets `to_zarr` + derive a default of 4× the chunk shape per axis when the + resulting shard exceeds 1 MiB. + compression + Codec configuration. ``None`` falls back to + `ZarrCompressionOptions.default_for_dtype`. + attributes + Namespaced attributes for this array's ``zarr.json``. + """ + + data: np.ndarray | zarr.Array + chunks: tuple[int, ...] | None = None + shards: tuple[int, ...] 
| None = None + compression: ZarrCompressionOptions | None = None + attributes: ZarrAttributes = field(default_factory=ZarrAttributes) + + @property + def shape(self) -> tuple[int, ...]: + return tuple(self.data.shape) + + @property + def dtype(self) -> np.dtype: + return np.dtype(self.data.dtype) + + @classmethod + def from_zarr(cls, zarr_array: zarr.Array) -> Self: + """Wrap an open ``zarr.Array`` without reading its data.""" + attrs = ZarrAttributes.load(dict(zarr_array.attrs)) + return cls( + data=zarr_array, + chunks=tuple(zarr_array.chunks), + attributes=attrs, + ) + + def read(self, *, slices: tuple[slice, ...] | EllipsisType = ...) -> np.ndarray: + """Materialize this array (or a slice of it) into numpy. + + For a `ZarrArray` backed by a lazy handle, this is the only + place that touches array bytes. ``slices`` is forwarded straight + to the handle so only chunks intersecting the slice are fetched. + """ + if isinstance(self.data, np.ndarray): + return self.data if slices is ... else self.data[slices] + return self.data[...] if slices is ... else self.data[slices] diff --git a/tests/test_zarr_model.py b/tests/test_zarr_model.py new file mode 100644 index 00000000..4c771b33 --- /dev/null +++ b/tests/test_zarr_model.py @@ -0,0 +1,122 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +import unittest + +import numpy as np + +try: + import zarr + + from lsst.images.zarr._common import LSST_NS, LSST_VERSION, OME_NS, OME_VERSION + from lsst.images.zarr._model import ZarrArray, ZarrAttributes + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrAttributesTestCase(unittest.TestCase): + """Tests for `ZarrAttributes` namespacing and round-tripping.""" + + def test_dump_separates_namespaces(self) -> None: + attrs = ZarrAttributes() + attrs.lsst["archive_class"] = "MaskedImage" + attrs.ome["multiscales"] = [{"name": "image"}] + attrs.extra["_ARRAY_DIMENSIONS"] = ["y", "x"] + attrs.extra["units"] = "adu" + dumped = attrs.dump() + self.assertEqual(dumped[LSST_NS]["archive_class"], "MaskedImage") + self.assertEqual(dumped[LSST_NS]["version"], LSST_VERSION) + self.assertEqual(dumped[OME_NS]["multiscales"], [{"name": "image"}]) + self.assertEqual(dumped[OME_NS]["version"], OME_VERSION) + # CF / xarray attrs sit at the top level, not inside lsst: or ome:. + self.assertEqual(dumped["_ARRAY_DIMENSIONS"], ["y", "x"]) + self.assertEqual(dumped["units"], "adu") + + def test_load_preserves_unknown_keys(self) -> None: + # Forward compatibility: unknown lsst.* keys must survive a + # load -> dump round-trip. 
+ raw = { + LSST_NS: { + "version": LSST_VERSION, + "archive_class": "Image", + "future_thing": {"x": 1}, + }, + OME_NS: {"version": OME_VERSION, "multiscales": []}, + "_ARRAY_DIMENSIONS": ["y", "x"], + "units": "adu", + } + attrs = ZarrAttributes.load(raw) + dumped = attrs.dump() + self.assertEqual(dumped[LSST_NS]["future_thing"], {"x": 1}) + self.assertEqual(dumped["units"], "adu") + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrArrayTestCase(unittest.TestCase): + """Tests for `ZarrArray` lazy backing and slice forwarding.""" + + def test_lazy_data_after_from_zarr(self) -> None: + store = zarr.storage.MemoryStore() + root = zarr.create_group(store=store, zarr_format=3) + zarr_array = root.create_array(name="image", shape=(8, 8), chunks=(4, 4), dtype="float32") + zarr_array[:] = np.arange(64, dtype=np.float32).reshape(8, 8) + + ir_array = ZarrArray.from_zarr(zarr_array) + # Lazy invariant: data is the zarr.Array handle, not numpy. + self.assertIsInstance(ir_array.data, zarr.Array) + self.assertNotIsInstance(ir_array.data, np.ndarray) + self.assertEqual(ir_array.shape, (8, 8)) + self.assertEqual(str(ir_array.dtype), "float32") + + def test_subset_does_not_materialize_full_array(self) -> None: + store = _CountingStore() + root = zarr.create_group(store=store, zarr_format=3) + zarr_array = root.create_array(name="image", shape=(16, 16), chunks=(4, 4), dtype="int32") + zarr_array[:] = np.arange(256, dtype=np.int32).reshape(16, 16) + store.reads = 0 # reset after the write phase + + ir_array = ZarrArray.from_zarr(zarr_array) + # Reading shape / dtype must not fetch any chunk data. + self.assertEqual(ir_array.shape, (16, 16)) + self.assertEqual(store.reads, 0) + + subset = ir_array.read(slices=(slice(0, 4), slice(0, 4))) + self.assertEqual(subset.shape, (4, 4)) + np.testing.assert_array_equal(subset, np.arange(256).reshape(16, 16)[:4, :4]) + # 16 chunks total in the array; we should have touched far fewer. 
+ self.assertLess(store.reads, 16) + + def test_staged_numpy_array_is_eager(self) -> None: + data = np.arange(12, dtype=np.float64).reshape(3, 4) + ir_array = ZarrArray(data=data) + self.assertIs(ir_array.data, data) + self.assertEqual(ir_array.shape, (3, 4)) + + +class _CountingStore(zarr.storage.MemoryStore if HAVE_ZARR else object): + """A MemoryStore that counts get() calls.""" + + def __init__(self) -> None: + super().__init__() + self.reads = 0 + + async def get(self, key, prototype, byte_range=None): # type: ignore[override] + self.reads += 1 + return await super().get(key, prototype, byte_range) + + +if __name__ == "__main__": + unittest.main() From 2c3a267fc543348fd94ee58113a677d992e24516 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Fri, 22 May 2026 21:58:54 -0700 Subject: [PATCH 09/60] test: tighten lazy-subset chunk-read bound --- tests/test_zarr_model.py | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/tests/test_zarr_model.py b/tests/test_zarr_model.py index 4c771b33..7c1f736a 100644 --- a/tests/test_zarr_model.py +++ b/tests/test_zarr_model.py @@ -96,8 +96,10 @@ def test_subset_does_not_materialize_full_array(self) -> None: subset = ir_array.read(slices=(slice(0, 4), slice(0, 4))) self.assertEqual(subset.shape, (4, 4)) np.testing.assert_array_equal(subset, np.arange(256).reshape(16, 16)[:4, :4]) - # 16 chunks total in the array; we should have touched far fewer. - self.assertLess(store.reads, 16) + # A 4x4 subset aligned with chunks=(4, 4) intersects exactly one + # data chunk; allow a small margin for incidental metadata reads, + # but stay tight enough to catch a regression that fetches more than a few chunks. 
+ self.assertLessEqual(store.reads, 4) def test_staged_numpy_array_is_eager(self) -> None: data = np.arange(12, dtype=np.float64).reshape(3, 4) From 840fbaf3c30a46edcb7d7f5c33424a838ad7006c Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Fri, 22 May 2026 22:09:01 -0700 Subject: [PATCH 10/60] feat: add ZarrGroup and ZarrDocument with lazy-on-read materialization --- python/lsst/images/zarr/_model.py | 122 ++++++++++++++++++++++++++++++ tests/test_zarr_model.py | 64 ++++++++++++++++ 2 files changed, 186 insertions(+) diff --git a/python/lsst/images/zarr/_model.py b/python/lsst/images/zarr/_model.py index c18ce8ef..d4f12e66 100644 --- a/python/lsst/images/zarr/_model.py +++ b/python/lsst/images/zarr/_model.py @@ -28,6 +28,8 @@ __all__ = ( "ZarrArray", "ZarrAttributes", + "ZarrDocument", + "ZarrGroup", ) from dataclasses import dataclass, field @@ -137,3 +139,123 @@ def read(self, *, slices: tuple[slice, ...] | EllipsisType = ...) -> np.ndarray: if isinstance(self.data, np.ndarray): return self.data if slices is ... else self.data[slices] return self.data[...] if slices is ... 
else self.data[slices] + + +@dataclass +class ZarrGroup: + """A zarr group: nested groups, arrays, and namespaced attributes.""" + + groups: dict[str, ZarrGroup] = field(default_factory=dict) + arrays: dict[str, ZarrArray] = field(default_factory=dict) + attributes: ZarrAttributes = field(default_factory=ZarrAttributes) + + def get(self, path: str) -> ZarrGroup | ZarrArray: + """Return a child by absolute or relative zarr path.""" + if path in ("", "/"): + return self + parts = [p for p in path.strip("/").split("/") if p] + cursor: ZarrGroup | ZarrArray = self + for part in parts: + if not isinstance(cursor, ZarrGroup): + raise KeyError(path) + if part in cursor.arrays: + cursor = cursor.arrays[part] + elif part in cursor.groups: + cursor = cursor.groups[part] + else: + raise KeyError(path) + return cursor + + def ensure_group(self, path: str) -> ZarrGroup: + """Return or create a sub-group at ``path``.""" + if path in ("", "/"): + return self + parts = [p for p in path.strip("/").split("/") if p] + cursor = self + for part in parts: + if part in cursor.arrays: + raise KeyError(f"{part!r} already exists as an array.") + if part not in cursor.groups: + cursor.groups[part] = ZarrGroup() + cursor = cursor.groups[part] + return cursor + + +@dataclass +class ZarrDocument: + """A complete zarr archive root.""" + + root: ZarrGroup = field(default_factory=ZarrGroup) + + @classmethod + def from_zarr(cls, store: zarr.storage.Store) -> Self: + """Open ``store`` and build a lazy IR view of its contents.""" + zarr_root = zarr.open_group(store=store, mode="r", zarr_format=3) + return cls(root=_group_from_zarr(zarr_root)) + + def to_zarr(self, store: zarr.storage.Store) -> None: + """Materialize this IR into ``store`` (which must be empty).""" + zarr_root = zarr.create_group(store=store, zarr_format=3, overwrite=False) + _group_to_zarr(self.root, zarr_root) + + +def _group_from_zarr(zarr_group: zarr.Group) -> ZarrGroup: + """Build a lazy `ZarrGroup` IR from an open 
``zarr.Group``.""" + ir = ZarrGroup(attributes=ZarrAttributes.load(dict(zarr_group.attrs))) + for name, child in zarr_group.members(): + if isinstance(child, zarr.Array): + ir.arrays[name] = ZarrArray.from_zarr(child) + else: + ir.groups[name] = _group_from_zarr(child) + return ir + + +def _group_to_zarr(ir: ZarrGroup, zarr_group: zarr.Group) -> None: + """Write a `ZarrGroup` IR into an open ``zarr.Group``.""" + if dumped := ir.attributes.dump(): + zarr_group.update_attributes(dumped) + for name, sub in ir.groups.items(): + sub_zarr = zarr_group.create_group(name) + _group_to_zarr(sub, sub_zarr) + for name, array in ir.arrays.items(): + if not isinstance(array.data, np.ndarray): + raise TypeError( + f"Cannot write ZarrArray at {name!r}: data is a lazy zarr.Array, " + "not numpy. Read it first or pass a fresh numpy array." + ) + chunks = array.chunks or _default_chunks(array.data.shape) + compression = array.compression or ZarrCompressionOptions.default_for_dtype(str(array.dtype)) + serializer, compressors = _build_codecs(compression) + zarr_array = zarr_group.create_array( + name=name, + shape=array.data.shape, + chunks=chunks, + dtype=array.data.dtype, + shards=array.shards, + serializer=serializer, + compressors=compressors, + ) + zarr_array[:] = array.data + if dumped := array.attributes.dump(): + zarr_array.update_attributes(dumped) + + +def _default_chunks(shape: tuple[int, ...]) -> tuple[int, ...]: + """Return the default chunk shape: ``min(1024, dim)`` per axis.""" + return tuple(min(1024, dim) for dim in shape) + + +def _build_codecs(options: ZarrCompressionOptions) -> tuple[Any, list[Any]]: + """Build a zarr v3 codec stack from `ZarrCompressionOptions`. + + Returns a ``(serializer, compressors)`` pair suitable for the + ``serializer=`` and ``compressors=`` keyword arguments of + `zarr.Group.create_array`. 
+ """ + if options.codec != "blosc": + raise NotImplementedError(f"Unsupported codec {options.codec!r}.") + serializer = zarr.codecs.BytesCodec() + compressors = [ + zarr.codecs.BloscCodec(cname=options.cname, clevel=options.clevel, shuffle=options.shuffle) + ] + return serializer, compressors diff --git a/tests/test_zarr_model.py b/tests/test_zarr_model.py index 7c1f736a..e8492c5a 100644 --- a/tests/test_zarr_model.py +++ b/tests/test_zarr_model.py @@ -120,5 +120,69 @@ async def get(self, key, prototype, byte_range=None): # type: ignore[override] return await super().get(key, prototype, byte_range) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrDocumentTestCase(unittest.TestCase): + """Tests for `ZarrDocument` / `ZarrGroup` round-trip and tree walking.""" + + def test_round_trip_through_memory_store(self) -> None: + from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup + + # Build a flat IR: image, variance, mask siblings at root. + doc = ZarrDocument(root=ZarrGroup()) + doc.root.attributes.lsst["archive_class"] = "MaskedImage" + doc.root.attributes.lsst["tree"] = "tree" + + image = ZarrArray(data=np.ones((4, 4), dtype="float32")) + image.attributes.extra["_ARRAY_DIMENSIONS"] = ["y", "x"] + doc.root.arrays["image"] = image + + mask = ZarrArray(data=np.zeros((4, 4), dtype="uint8")) + mask.attributes.extra["_ARRAY_DIMENSIONS"] = ["y", "x"] + mask.attributes.extra["flag_masks"] = [1, 2] + mask.attributes.extra["flag_meanings"] = "BAD SAT" + doc.root.arrays["mask"] = mask + + # Stub a 1-D uint8 'tree' array (JSON bytes). + doc.root.arrays["tree"] = ZarrArray(data=np.frombuffer(b"{}", dtype=np.uint8)) + + store = zarr.storage.MemoryStore() + doc.to_zarr(store) + + # Reload and verify lazy invariant on every array. 
+ recovered = ZarrDocument.from_zarr(store) + self.assertIsInstance(recovered.root.arrays["image"].data, zarr.Array) + self.assertIsInstance(recovered.root.arrays["mask"].data, zarr.Array) + self.assertEqual(recovered.root.attributes.lsst["archive_class"], "MaskedImage") + # CF flag attrs round-trip via the extra namespace. + self.assertEqual( + recovered.root.arrays["mask"].attributes.extra["flag_meanings"], + "BAD SAT", + ) + # xarray dims round-trip. + self.assertEqual( + recovered.root.arrays["image"].attributes.extra["_ARRAY_DIMENSIONS"], + ["y", "x"], + ) + # Reads still go through the lazy handle (a full read here, no slices). + np.testing.assert_array_equal(recovered.root.arrays["image"].read(), np.ones((4, 4), dtype="float32")) + + def test_get_walks_paths(self) -> None: + from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup + + doc = ZarrDocument(root=ZarrGroup()) + doc.root.arrays["image"] = ZarrArray(data=np.zeros((2, 2), dtype="float32")) + red = doc.root.ensure_group("/red") + red.arrays["image"] = ZarrArray(data=np.ones((2, 2), dtype="float32")) + + # Absolute and relative paths. 
+ self.assertIs(doc.root.get("/image"), doc.root.arrays["image"]) + self.assertIs(doc.root.get("image"), doc.root.arrays["image"]) + self.assertIs(doc.root.get("/red/image"), red.arrays["image"]) + self.assertIs(doc.root.get("/"), doc.root) + + with self.assertRaises(KeyError): + doc.root.get("/missing") + + if __name__ == "__main__": unittest.main() From c22751097291dbe11e163ff6b124abd376af7681 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 05:30:12 -0700 Subject: [PATCH 11/60] feat: add OmeMultiscale, CfFlagAttributes, image-array-attrs helpers Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_model.py | 119 ++++++++++++++++++++++++++++++ tests/test_zarr_model.py | 71 ++++++++++++++++++ 2 files changed, 190 insertions(+) diff --git a/python/lsst/images/zarr/_model.py b/python/lsst/images/zarr/_model.py index d4f12e66..70adb692 100644 --- a/python/lsst/images/zarr/_model.py +++ b/python/lsst/images/zarr/_model.py @@ -26,10 +26,15 @@ from __future__ import annotations __all__ = ( + "CfFlagAttributes", + "MaskPlaneEntry", + "OmeMultiscale", + "OmeOmeroChannel", "ZarrArray", "ZarrAttributes", "ZarrDocument", "ZarrGroup", + "build_image_array_attrs", ) from dataclasses import dataclass, field @@ -245,6 +250,120 @@ def _default_chunks(shape: tuple[int, ...]) -> tuple[int, ...]: return tuple(min(1024, dim) for dim in shape) +@dataclass +class OmeMultiscale: + """OME-NGFF v0.5 multiscales metadata for a single-level image. + + The backend always writes one level whose ``path`` points at a + sibling array (``image`` for typical archives). + ``coordinate_transformations`` defaults to a unit ``scale`` so the + OME block is well-formed even when the simplified affine is + dropped by the residual validator. + """ + + name: str + axes: tuple[str, ...] 
+ dataset_path: str = "image" + coordinate_transformations: list[dict[str, Any]] | None = None + + @staticmethod + def _axis_block(name: str) -> dict[str, Any]: + if name == "c": + return {"name": "c", "type": "channel"} + if name == "t": + return {"name": "t", "type": "time"} + return {"name": name, "type": "space", "unit": "pixel"} + + def dump(self) -> dict[str, Any]: + ndim = len(self.axes) + ct = self.coordinate_transformations + if ct is None: + ct = [{"type": "scale", "scale": [1.0] * ndim}] + return { + "name": self.name, + "axes": [self._axis_block(a) for a in self.axes], + "datasets": [ + { + "path": self.dataset_path, + "coordinateTransformations": ct, + } + ], + } + + +@dataclass +class OmeOmeroChannel: + """OME ``omero/channels`` entry (used only when a channel axis exists).""" + + label: str + color: str | None = None + + def dump(self) -> dict[str, Any]: + out: dict[str, Any] = {"label": self.label} + if self.color is not None: + out["color"] = self.color + return out + + +@dataclass +class MaskPlaneEntry: + """One mask-plane definition.""" + + name: str + bit: int + description: str = "" + + +@dataclass +class CfFlagAttributes: + """CF-conventions flag metadata for a 2-D packed mask array. + + Emits ``flag_masks`` (list of bit values), ``flag_meanings`` + (single space-separated string per CF), and the LSST extension + ``flag_descriptions`` (list of human-readable strings parallel to + ``flag_meanings``). 
+ """ + + planes: list[MaskPlaneEntry] = field(default_factory=list) + + def dump(self) -> dict[str, Any]: + return { + "flag_masks": [int(1 << p.bit) for p in self.planes], + "flag_meanings": " ".join(p.name for p in self.planes), + "flag_descriptions": [p.description for p in self.planes], + } + + @classmethod + def load(cls, raw: dict[str, Any]) -> Self: + meanings = raw.get("flag_meanings", "").split() + masks = [int(m) for m in raw.get("flag_masks", [])] + descriptions = list(raw.get("flag_descriptions", [""] * len(meanings))) + planes = [] + for name, mask, desc in zip(meanings, masks, descriptions, strict=False): + # Recover bit position from the mask value (always a power of 2). + bit = (mask & -mask).bit_length() - 1 + planes.append(MaskPlaneEntry(name=name, bit=bit, description=desc)) + return cls(planes=planes) + + +def build_image_array_attrs( + *, + axes: tuple[str, ...], + units: str | None = None, + long_name: str | None = None, +) -> dict[str, Any]: + """Build the CF / xarray attribute block for an image array. + + Used for arrays of rank 2 or higher. + """ + out: dict[str, Any] = {"_ARRAY_DIMENSIONS": list(axes)} + if units is not None: + out["units"] = units + if long_name is not None: + out["long_name"] = long_name + return out + + def _build_codecs(options: ZarrCompressionOptions) -> tuple[Any, list[Any]]: """Build a zarr v3 codec stack from `ZarrCompressionOptions`. 
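Reviewer aside on the `CfFlagAttributes` hunk in `_model.py` (not part of the patch): `dump()` writes each plane as `1 << bit`, and `load()` recovers the bit index with `(mask & -mask).bit_length() - 1`. The sketch below replays that round trip in plain Python so the arithmetic can be checked without `lsst.images` or `zarr` installed. The helper name `bit_from_mask` and the sample planes are assumptions made for illustration only.

```python
# Standalone illustration; it does not import lsst.images and is not the
# backend's code. `bit_from_mask` is a hypothetical helper name.


def bit_from_mask(mask: int) -> int:
    """Return the index of the lowest set bit of a positive integer."""
    if mask <= 0:
        raise ValueError("mask must be a positive integer")
    # mask & -mask isolates the lowest set bit; bit_length() - 1 turns
    # that power of two into its exponent.
    return (mask & -mask).bit_length() - 1


# Sample mask planes as (name, bit) pairs, mirroring the unit test's values.
planes = [("BAD", 0), ("SAT", 1), ("CR", 2)]

# "dump" direction: CF flag_masks plus a space-separated flag_meanings string.
flag_masks = [1 << bit for _, bit in planes]
flag_meanings = " ".join(name for name, _ in planes)

# "load" direction: split the meanings and recover each bit from its mask.
recovered = [
    (name, bit_from_mask(mask))
    for name, mask in zip(flag_meanings.split(), flag_masks)
]
```

Because the expression keeps only the lowest set bit, a mask value that is not a power of two would silently map to its lowest bit. This matches the patch's comment that masks are always powers of two, but a stricter loader could reject such values.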
diff --git a/tests/test_zarr_model.py b/tests/test_zarr_model.py index e8492c5a..cd32227d 100644 --- a/tests/test_zarr_model.py +++ b/tests/test_zarr_model.py @@ -184,5 +184,76 @@ def test_get_walks_paths(self) -> None: doc.root.get("/missing") +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class OmeCfHelpersTestCase(unittest.TestCase): + """Tests for the OME / CF attribute-shape helpers.""" + + def test_multiscale_emits_expected_shape(self) -> None: + from lsst.images.zarr._model import OmeMultiscale + + m = OmeMultiscale( + name="visitimage", + axes=("y", "x"), + dataset_path="image", + ) + d = m.dump() + self.assertEqual(d["name"], "visitimage") + self.assertEqual( + d["axes"], + [ + {"name": "y", "type": "space", "unit": "pixel"}, + {"name": "x", "type": "space", "unit": "pixel"}, + ], + ) + self.assertEqual(d["datasets"][0]["path"], "image") + # Default coordinate transform is unit scale until a real one is set. + self.assertEqual( + d["datasets"][0]["coordinateTransformations"], + [{"type": "scale", "scale": [1.0, 1.0]}], + ) + + def test_multiscale_with_affine(self) -> None: + from lsst.images.zarr._model import OmeMultiscale + + m = OmeMultiscale( + name="image", + axes=("y", "x"), + dataset_path="image", + coordinate_transformations=[ + {"type": "scale", "scale": [0.2, 0.2]}, + { + "type": "affine", + "affine": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], + }, + ], + ) + d = m.dump() + self.assertEqual(len(d["datasets"][0]["coordinateTransformations"]), 2) + self.assertEqual(d["datasets"][0]["coordinateTransformations"][0]["type"], "scale") + + def test_cf_flag_attributes(self) -> None: + from lsst.images.zarr._model import CfFlagAttributes, MaskPlaneEntry + + cf = CfFlagAttributes( + planes=[ + MaskPlaneEntry(name="BAD", bit=0, description="Bad pixel."), + MaskPlaneEntry(name="SAT", bit=1, description="Saturated."), + MaskPlaneEntry(name="CR", bit=2, description="Cosmic ray."), + ] + ) + d = cf.dump() + 
self.assertEqual(d["flag_masks"], [1, 2, 4]) + self.assertEqual(d["flag_meanings"], "BAD SAT CR") + self.assertEqual(d["flag_descriptions"], ["Bad pixel.", "Saturated.", "Cosmic ray."]) + + def test_image_array_attrs(self) -> None: + from lsst.images.zarr._model import build_image_array_attrs + + attrs = build_image_array_attrs(axes=("y", "x"), units="adu", long_name="science image") + self.assertEqual(attrs["_ARRAY_DIMENSIONS"], ["y", "x"]) + self.assertEqual(attrs["units"], "adu") + self.assertEqual(attrs["long_name"], "science image") + + if __name__ == "__main__": unittest.main() From c0a336cef528dc0b4d4bf5d0a9fe12b5d0e61f95 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 05:34:37 -0700 Subject: [PATCH 12/60] feat: add zarr store dispatch (LocalStore / ZipStore / FsspecStore) Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_store.py | 101 ++++++++++++++++++++++++++++++ tests/test_zarr_store.py | 63 +++++++++++++++++++ 2 files changed, 164 insertions(+) create mode 100644 python/lsst/images/zarr/_store.py create mode 100644 tests/test_zarr_store.py diff --git a/python/lsst/images/zarr/_store.py b/python/lsst/images/zarr/_store.py new file mode 100644 index 00000000..50ffadbd --- /dev/null +++ b/python/lsst/images/zarr/_store.py @@ -0,0 +1,101 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +__all__ = ("open_store_for_read", "open_store_for_write") + +import os +from collections.abc import Iterator +from contextlib import contextmanager + +import zarr + +from lsst.resources import ResourcePath, ResourcePathExpression + + +def _is_zip(rp: ResourcePath) -> bool: + return rp.path.endswith(".zarr.zip") or rp.path.endswith(".zip") + + +def _is_remote(rp: ResourcePath) -> bool: + return rp.scheme not in ("", "file") + + +@contextmanager +def open_store_for_write(path: ResourcePathExpression) -> Iterator[zarr.storage.Store]: + """Open a zarr store for writing. + + Refuses to overwrite a non-empty existing store. The returned + context manager closes the store on exit; for ``ZipStore`` this + finalizes the central directory. + """ + rp = ResourcePath(path) + if _is_zip(rp): + if _is_remote(rp): + raise NotImplementedError("Remote ZipStore writes are a follow-up.") + local = rp.ospath + if os.path.exists(local) and os.path.getsize(local) > 0: + raise OSError(f"File {local!r} already exists.") + store = zarr.storage.ZipStore(local, mode="w") + try: + yield store + finally: + if getattr(store, "_is_open", False): + store.close() + return + if _is_remote(rp): + import fsspec + + fs, fs_path = fsspec.url_to_fs(str(rp)) + if fs.exists(fs_path) and fs.ls(fs_path): + raise OSError(f"Store {rp!s} already exists.") + store = zarr.storage.FsspecStore(fs=fs, path=fs_path, read_only=False) + yield store + return + local = rp.ospath + if os.path.exists(local) and os.listdir(local): + raise OSError(f"Directory {local!r} already exists and is non-empty.") + os.makedirs(local, exist_ok=True) + store = zarr.storage.LocalStore(local, read_only=False) + yield store + + +@contextmanager +def open_store_for_read(path: ResourcePathExpression) -> Iterator[zarr.storage.Store]: + """Open a zarr store for reading.""" + rp = ResourcePath(path) + if _is_zip(rp): + if _is_remote(rp): + with rp.as_local() as local: + store = 
zarr.storage.ZipStore(local.ospath, mode="r") + try: + yield store + finally: + if getattr(store, "_is_open", False): + store.close() + return + store = zarr.storage.ZipStore(rp.ospath, mode="r") + try: + yield store + finally: + if getattr(store, "_is_open", False): + store.close() + return + if _is_remote(rp): + import fsspec + + fs, fs_path = fsspec.url_to_fs(str(rp)) + store = zarr.storage.FsspecStore(fs=fs, path=fs_path, read_only=True) + yield store + return + store = zarr.storage.LocalStore(rp.ospath, read_only=True) + yield store diff --git a/tests/test_zarr_store.py b/tests/test_zarr_store.py new file mode 100644 index 00000000..28267569 --- /dev/null +++ b/tests/test_zarr_store.py @@ -0,0 +1,63 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +import os +import tempfile +import unittest + +try: + import zarr + + from lsst.images.zarr._store import open_store_for_read, open_store_for_write + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class StoreDispatchTestCase(unittest.TestCase): + """URI-based dispatch for zarr stores.""" + + def test_local_directory(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + with open_store_for_write(target) as store: + self.assertIsInstance(store, zarr.storage.LocalStore) + zarr.create_group(store=store, zarr_format=3) + with open_store_for_read(target) as store: + self.assertIsInstance(store, zarr.storage.LocalStore) + root = zarr.open_group(store=store, mode="r") + self.assertEqual(list(root.keys()), []) + + def test_zip_store(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr.zip") + with open_store_for_write(target) as store: + self.assertIsInstance(store, zarr.storage.ZipStore) + zarr.create_group(store=store, zarr_format=3) + with open_store_for_read(target) as store: + self.assertIsInstance(store, zarr.storage.ZipStore) + + def test_create_only_refuses_existing(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + with open_store_for_write(target) as store: + zarr.create_group(store=store, zarr_format=3) + with self.assertRaisesRegex(OSError, "already exists"): + with open_store_for_write(target): + pass + + +if __name__ == "__main__": + unittest.main() From b84f005361919da7f99a9474b7852ef88ae36b3a Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 05:35:53 -0700 Subject: [PATCH 13/60] feat: add zarr layout rules for axes and chunk derivation Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_layout.py | 106 +++++++++++++++++++++++++++++ tests/test_zarr_layout.py | 77 
+++++++++++++++++++++ 2 files changed, 183 insertions(+) create mode 100644 python/lsst/images/zarr/_layout.py create mode 100644 tests/test_zarr_layout.py diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py new file mode 100644 index 00000000..612aff29 --- /dev/null +++ b/python/lsst/images/zarr/_layout.py @@ -0,0 +1,106 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +"""Per-archive-class layout rules for the zarr backend. + +This module centralises the decisions that vary by image type: + +- which OME axes apply (``ColorImage`` has no root multiscale) +- default chunk sizes (clamped to 1024 per axis for plain images, + cell-aligned for `CellCoadd`, image-aligned for `variance` / `mask` + siblings) +- the affine residual validator that gates the OME + ``coordinateTransformations`` block + +Keeping these in one place lets the output archive populate the IR +generically. +""" + +from __future__ import annotations + +__all__ = ( + "axes_for_archive_class", + "chunks_aligned_to", + "chunks_for", +) + +from collections.abc import Mapping +from typing import Any + +_DEFAULT_AXIS_LIMIT = 1024 + + +def axes_for_archive_class(name: str) -> tuple[str, ...]: + """Return the OME axis tuple for a given archive class. + + Returns an empty tuple for ``ColorImage`` to signal that there is + no OME multiscale at the root of that class — the per-channel + sub-archives carry their own ``(y, x)`` multiscales. + """ + if name == "ColorImage": + return () + return ("y", "x") + + +def chunks_for( + archive_class: str, + shape: tuple[int, ...], + override: tuple[int, ...] 
| None, + *, + archive_metadata: Mapping[str, Any] | None = None, +) -> tuple[int, ...]: + """Return the chunk shape to use for a top-level array. + + Parameters + ---------- + archive_class + Top-level archive class name; used for class-specific + defaults like ``CellCoadd``'s cell-aligned chunks. + shape + The full array shape, used to clamp the default per-axis. + override + User-supplied chunk shape. If not ``None`` it is returned + verbatim after a length check. + archive_metadata + Class-specific layout hints. ``CellCoadd`` reads + ``"cell_shape"`` from this mapping. + """ + if override is not None: + if len(override) != len(shape): + raise ValueError( + f"chunks override has rank {len(override)}, expected {len(shape)} for {archive_class!r}." + ) + return tuple(override) + if archive_class == "CellCoadd" and archive_metadata is not None: + cell_shape = archive_metadata.get("cell_shape") + if cell_shape is not None: + return tuple(min(c, dim) for c, dim in zip(cell_shape, shape, strict=True)) + return tuple(min(_DEFAULT_AXIS_LIMIT, dim) for dim in shape) + + +def chunks_aligned_to( + *, + image_chunks: tuple[int, ...], + shape: tuple[int, ...], +) -> tuple[int, ...]: + """Derive a sibling array's chunks from the ``image`` array's chunks. + + Used by `ZarrOutputArchive.add_array` for ``variance`` and + ``mask`` siblings when the user has not provided an explicit + override. The result is per-axis ``min(image_chunks[i], + shape[i])`` so a sibling smaller than ``image`` is not + over-chunked. + """ + if len(image_chunks) != len(shape): + raise ValueError( + f"image_chunks rank {len(image_chunks)} does not match sibling shape rank {len(shape)}." + ) + return tuple(min(c, dim) for c, dim in zip(image_chunks, shape, strict=True)) diff --git a/tests/test_zarr_layout.py b/tests/test_zarr_layout.py new file mode 100644 index 00000000..f4cf33ad --- /dev/null +++ b/tests/test_zarr_layout.py @@ -0,0 +1,77 @@ +# This file is part of lsst-images. 
+# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import unittest + +try: + from lsst.images.zarr._layout import ( + axes_for_archive_class, + chunks_aligned_to, + chunks_for, + ) + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class LayoutTestCase(unittest.TestCase): + """Per-archive-class axes and chunk derivation rules.""" + + def test_axes_for_archive_class(self) -> None: + # Standard 2-D images use (y, x). + self.assertEqual(axes_for_archive_class("Image"), ("y", "x")) + self.assertEqual(axes_for_archive_class("MaskedImage"), ("y", "x")) + self.assertEqual(axes_for_archive_class("VisitImage"), ("y", "x")) + self.assertEqual(axes_for_archive_class("Mask"), ("y", "x")) + self.assertEqual(axes_for_archive_class("CellCoadd"), ("y", "x")) + # ColorImage's root has no top-level multiscale; this returns + # an empty tuple to signal "no OME multiscale at this level". + self.assertEqual(axes_for_archive_class("ColorImage"), ()) + + def test_chunks_for_default(self) -> None: + self.assertEqual(chunks_for("Image", (4096, 4096), None), (1024, 1024)) + # Smaller than 1024 -> use full dim. 
+ self.assertEqual(chunks_for("Image", (300, 600), None), (300, 600)) + + def test_chunks_for_override(self) -> None: + self.assertEqual(chunks_for("Image", (4096, 4096), (256, 256)), (256, 256)) + + def test_chunks_for_cell_coadd_uses_cell_shape(self) -> None: + result = chunks_for( + "CellCoadd", + (4096, 4096), + None, + archive_metadata={"cell_shape": (256, 256)}, + ) + self.assertEqual(result, (256, 256)) + + def test_chunks_for_cell_coadd_without_metadata_falls_back(self) -> None: + self.assertEqual(chunks_for("CellCoadd", (4096, 4096), None), (1024, 1024)) + + def test_chunks_aligned_to_matches_image(self) -> None: + # variance / mask follow image's chunks when not overridden. + self.assertEqual( + chunks_aligned_to(image_chunks=(256, 256), shape=(4096, 4096)), + (256, 256), + ) + # If the sibling shape is smaller than image's chunks, clamp. + self.assertEqual( + chunks_aligned_to(image_chunks=(1024, 1024), shape=(300, 600)), + (300, 600), + ) + + +if __name__ == "__main__": + unittest.main() From 09960dff1d9e90b1263180e7270de9c5689bc137 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 05:43:03 -0700 Subject: [PATCH 14/60] feat: add affine_check residual validator for OME coordinateTransformations Also extends the AST bridge in _transforms/_ast.py with ZoomMap and PolyMap so the validator's tests can construct synthetic linear and distorted FrameSets. 
Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/_transforms/_ast.py | 28 ++++++ python/lsst/images/zarr/_layout.py | 130 +++++++++++++++++++++++++ tests/test_zarr_layout.py | 66 +++++++++++++ 3 files changed, 224 insertions(+) diff --git a/python/lsst/images/_transforms/_ast.py b/python/lsst/images/_transforms/_ast.py index 118259f5..7b819169 100644 --- a/python/lsst/images/_transforms/_ast.py +++ b/python/lsst/images/_transforms/_ast.py @@ -23,10 +23,12 @@ "Frame", "FrameSet", "Mapping", + "PolyMap", "ShiftMap", "SkyFrame", "StringStream", "UnitMap", + "ZoomMap", ) if TYPE_CHECKING: @@ -45,10 +47,12 @@ FrameSet, Mapping, Object, + PolyMap, ShiftMap, SkyFrame, StringStream, UnitMap, + ZoomMap, ) except ImportError: import starlink.Ast @@ -175,6 +179,30 @@ def __init__(self, map_a: Mapping, map_b: Mapping, series: bool): _IMPL_TYPE: ClassVar[type[starlink.Ast.CmpMap]] = starlink.Ast.CmpMap + class ZoomMap(Mapping): + def __init__(self, n_coord: int, zoom: float): + super().__init__(starlink.Ast.ZoomMap(n_coord, zoom)) + + _IMPL_TYPE: ClassVar[type[starlink.Ast.ZoomMap]] = starlink.Ast.ZoomMap + + class PolyMap(Mapping): + def __init__(self, coeff_f: Any, coeff_i_or_nout: Any, options: str = ""): + # astshim's PolyMap takes ``nout`` as the second positional; + # starlink.Ast.PolyMap requires an explicit inverse-coefficient + # array. Adapt to both by synthesizing an empty inverse when + # an integer ``nout`` is supplied. 
+ import numpy as _np + + coeff_f_arr = _np.asarray(coeff_f, dtype=float) + if isinstance(coeff_i_or_nout, int): + nin = coeff_f_arr.shape[1] - 2 + coeff_i = _np.zeros((0, 2 + nin), dtype=float) + else: + coeff_i = _np.asarray(coeff_i_or_nout, dtype=float) + super().__init__(starlink.Ast.PolyMap(coeff_f_arr, coeff_i, options)) + + _IMPL_TYPE: ClassVar[type[starlink.Ast.PolyMap]] = starlink.Ast.PolyMap + class Frame(Mapping): def __init__(self, n_axes: int, options: str = ""): super().__init__(starlink.Ast.Frame(n_axes, options)) diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index 612aff29..56c3a2be 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -27,12 +27,15 @@ from __future__ import annotations __all__ = ( + "AffineCheckResult", + "affine_check", "axes_for_archive_class", "chunks_aligned_to", "chunks_for", ) from collections.abc import Mapping +from dataclasses import dataclass from typing import Any _DEFAULT_AXIS_LIMIT = 1024 @@ -104,3 +107,130 @@ def chunks_aligned_to( f"image_chunks rank {len(image_chunks)} does not match sibling shape rank {len(shape)}." ) return tuple(min(c, dim) for c, dim in zip(image_chunks, shape, strict=True)) + + +@dataclass +class AffineCheckResult: + """Result of validating a simplified affine against a full WCS. + + When ``dropped`` is False, ``coordinate_transformations`` is the + OME-NGFF ``coordinateTransformations`` list to emit. When True, + the caller must omit the block (or emit a unit scale only) and + record ``max_residual_pixels`` as the observed worst error. + """ + + dropped: bool + max_residual_pixels: float + coordinate_transformations: list[dict[str, Any]] | None + + +def affine_check( + *, + frame_set: Any, + image_shape: tuple[int, int], + max_residual_pixels: float = 1.0, + grid: int = 11, +) -> AffineCheckResult: + """Build an OME affine ``coordinateTransformations`` from ``frame_set``. 
+ + The simplified affine is constructed by mapping three reference + pixels (origin and the two unit-axis steps) through ``frame_set`` + to recover the linear coefficients. The full pixel-to-sky map is + then evaluated at every grid point and compared to the affine's + prediction; the worst great-circle separation is divided by the + pixel scale to get a pixel-equivalent residual. + + If ``max_residual <= max_residual_pixels``, returns a result whose + ``coordinate_transformations`` is the affine block. Otherwise + returns a dropped result and the caller must emit the unit scale + (or no transformations at all). + """ + import numpy as np + + h, w = image_shape + + pixels = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]) + sky_at_ref = _frame_set_apply(frame_set, pixels) + origin = sky_at_ref[0] + dxsky = sky_at_ref[1] - origin + dysky = sky_at_ref[2] - origin + affine_matrix = np.array( + [ + [dxsky[0], dysky[0], origin[0]], + [dxsky[1], dysky[1], origin[1]], + [0.0, 0.0, 1.0], + ] + ) + + pixel_scale_y = float(np.linalg.norm(dysky)) + pixel_scale_x = float(np.linalg.norm(dxsky)) + pixel_scale = float(np.sqrt(pixel_scale_y * pixel_scale_x)) + if pixel_scale <= 0.0: + return AffineCheckResult( + dropped=True, + max_residual_pixels=float("inf"), + coordinate_transformations=None, + ) + + ys = np.linspace(0.0, max(h - 1, 0), grid) + xs = np.linspace(0.0, max(w - 1, 0), grid) + grid_pixels = np.array([[x, y] for y in ys for x in xs]) + sky_full = _frame_set_apply(frame_set, grid_pixels) + affine_pred = (affine_matrix[:2, :2] @ grid_pixels.T).T + origin + great_circle = _angular_separation(sky_full, affine_pred) + max_residual = float(np.max(great_circle) / pixel_scale) + + coordinate_transformations: list[dict[str, Any]] = [ + { + "type": "scale", + "scale": [pixel_scale_y, pixel_scale_x], + }, + { + "type": "affine", + "affine": affine_matrix.tolist(), + }, + ] + + if max_residual > max_residual_pixels: + return AffineCheckResult( + dropped=True,
max_residual_pixels=max_residual, + coordinate_transformations=None, + ) + return AffineCheckResult( + dropped=False, + max_residual_pixels=max_residual, + coordinate_transformations=coordinate_transformations, + ) + + +def _frame_set_apply(frame_set: Any, pixels: Any) -> Any: + """Apply ``frame_set``'s base->current mapping to a (N, 2) pixel array.""" + import numpy as np + + pixels = np.asarray(pixels, dtype=float) + mapping = frame_set.getMapping(frame_set.base, frame_set.current) + out = mapping.applyForward(pixels.T) + return np.asarray(out).T + + +def _angular_separation(a: Any, b: Any) -> Any: + """Element-wise great-circle separation between two (lon, lat) arrays. + + Inputs in radians (AST default for unit sky frames). Returns a 1-D + array of separations in the same units as the input. + """ + import numpy as np + + a = np.asarray(a) + b = np.asarray(b) + lon_a, lat_a = a[:, 0], a[:, 1] + lon_b, lat_b = b[:, 0], b[:, 1] + dlon = lon_b - lon_a + return np.arccos( + np.clip( + np.sin(lat_a) * np.sin(lat_b) + np.cos(lat_a) * np.cos(lat_b) * np.cos(dlon), + -1.0, + 1.0, + ) + ) diff --git a/tests/test_zarr_layout.py b/tests/test_zarr_layout.py index f4cf33ad..a4f99dbc 100644 --- a/tests/test_zarr_layout.py +++ b/tests/test_zarr_layout.py @@ -73,5 +73,71 @@ def test_chunks_aligned_to_matches_image(self) -> None: ) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class AffineValidatorTestCase(unittest.TestCase): + """Affine-residual validator gating the OME affine block.""" + + def _make_linear_frame_set(self, *, scale: float = 0.2): + from lsst.images._transforms._ast import ( + Frame, + FrameSet, + ZoomMap, + ) + + base = Frame(2, "Domain=PIXEL") + sky = Frame(2, "Domain=SKY") + fs = FrameSet(base) + fs.addFrame(FrameSet.BASE, ZoomMap(2, scale), sky) + return fs + + def _make_distorted_frame_set(self): + from lsst.images._transforms._ast import ( + CmpMap, + Frame, + FrameSet, + PolyMap, + ZoomMap, + ) + + base = Frame(2, "Domain=PIXEL") + sky = 
Frame(2, "Domain=SKY") + forward_coeffs = [ + [1.0, 1, 1, 0], + [0.001, 1, 0, 2], + [1.0, 2, 0, 1], + [0.001, 2, 2, 0], + ] + poly = PolyMap(forward_coeffs, 2, "IterInverse=1, NIterInverse=20") + cmp = CmpMap(poly, ZoomMap(2, 0.2), True) + fs = FrameSet(base) + fs.addFrame(FrameSet.BASE, cmp, sky) + return fs + + def test_pure_linear_passes(self) -> None: + from lsst.images.zarr._layout import affine_check + + fs = self._make_linear_frame_set(scale=0.2) + result = affine_check( + frame_set=fs, + image_shape=(64, 64), + max_residual_pixels=1.0, + ) + self.assertFalse(result.dropped) + self.assertIsNotNone(result.coordinate_transformations) + self.assertLess(result.max_residual_pixels, 1e-6) + + def test_high_distortion_drops_block(self) -> None: + from lsst.images.zarr._layout import affine_check + + fs = self._make_distorted_frame_set() + result = affine_check( + frame_set=fs, + image_shape=(4096, 4096), + max_residual_pixels=1.0, + ) + self.assertTrue(result.dropped) + self.assertGreater(result.max_residual_pixels, 1.0) + + if __name__ == "__main__": unittest.main() From 0e8198fa428484054b1f6639a5bcbd68ebdc8b39 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 05:45:31 -0700 Subject: [PATCH 15/60] refactor: move imports to module top in zarr layout and AST bridge Following project convention, imports go at the top of the module rather than inside function bodies. PolyMap's numpy use and affine_check's numpy use are hoisted; tests likewise hoist their _ast imports. 
Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/_transforms/_ast.py | 10 ++++----- python/lsst/images/zarr/_layout.py | 8 ++----- tests/test_zarr_layout.py | 31 +++++++++----------------- 3 files changed, 18 insertions(+), 31 deletions(-) diff --git a/python/lsst/images/_transforms/_ast.py b/python/lsst/images/_transforms/_ast.py index 7b819169..09f4c4c6 100644 --- a/python/lsst/images/_transforms/_ast.py +++ b/python/lsst/images/_transforms/_ast.py @@ -14,6 +14,8 @@ from collections.abc import Iterable from typing import TYPE_CHECKING, Any, ClassVar, Self +import numpy as np + __all__ = ( "USING_STARLINK_PYAST", "Channel", @@ -191,14 +193,12 @@ def __init__(self, coeff_f: Any, coeff_i_or_nout: Any, options: str = ""): # starlink.Ast.PolyMap requires an explicit inverse-coefficient # array. Adapt to both by synthesizing an empty inverse when # an integer ``nout`` is supplied. - import numpy as _np - - coeff_f_arr = _np.asarray(coeff_f, dtype=float) + coeff_f_arr = np.asarray(coeff_f, dtype=float) if isinstance(coeff_i_or_nout, int): nin = coeff_f_arr.shape[1] - 2 - coeff_i = _np.zeros((0, 2 + nin), dtype=float) + coeff_i = np.zeros((0, 2 + nin), dtype=float) else: - coeff_i = _np.asarray(coeff_i_or_nout, dtype=float) + coeff_i = np.asarray(coeff_i_or_nout, dtype=float) super().__init__(starlink.Ast.PolyMap(coeff_f_arr, coeff_i, options)) _IMPL_TYPE: ClassVar[type[starlink.Ast.PolyMap]] = starlink.Ast.PolyMap diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index 56c3a2be..14ee49c4 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -38,6 +38,8 @@ from dataclasses import dataclass from typing import Any +import numpy as np + _DEFAULT_AXIS_LIMIT = 1024 @@ -145,8 +147,6 @@ def affine_check( returns a dropped result and the caller must emit the unit scale (or no transformations at all). 
""" - import numpy as np - h, w = image_shape pixels = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]) @@ -206,8 +206,6 @@ def affine_check( def _frame_set_apply(frame_set: Any, pixels: Any) -> Any: """Apply ``frame_set``'s base->current mapping to a (N, 2) pixel array.""" - import numpy as np - pixels = np.asarray(pixels, dtype=float) mapping = frame_set.getMapping(frame_set.base, frame_set.current) out = mapping.applyForward(pixels.T) @@ -220,8 +218,6 @@ def _angular_separation(a: Any, b: Any) -> Any: Inputs in radians (AST default for unit sky frames). Returns a 1-D array of separations in the same units as the input. """ - import numpy as np - a = np.asarray(a) b = np.asarray(b) lon_a, lat_a = a[:, 0], a[:, 1] diff --git a/tests/test_zarr_layout.py b/tests/test_zarr_layout.py index a4f99dbc..51b18a84 100644 --- a/tests/test_zarr_layout.py +++ b/tests/test_zarr_layout.py @@ -13,8 +13,17 @@ import unittest +from lsst.images._transforms._ast import ( + CmpMap, + Frame, + FrameSet, + PolyMap, + ZoomMap, +) + try: from lsst.images.zarr._layout import ( + affine_check, axes_for_archive_class, chunks_aligned_to, chunks_for, @@ -77,28 +86,14 @@ def test_chunks_aligned_to_matches_image(self) -> None: class AffineValidatorTestCase(unittest.TestCase): """Affine-residual validator gating the OME affine block.""" - def _make_linear_frame_set(self, *, scale: float = 0.2): - from lsst.images._transforms._ast import ( - Frame, - FrameSet, - ZoomMap, - ) - + def _make_linear_frame_set(self, *, scale: float = 0.2) -> FrameSet: base = Frame(2, "Domain=PIXEL") sky = Frame(2, "Domain=SKY") fs = FrameSet(base) fs.addFrame(FrameSet.BASE, ZoomMap(2, scale), sky) return fs - def _make_distorted_frame_set(self): - from lsst.images._transforms._ast import ( - CmpMap, - Frame, - FrameSet, - PolyMap, - ZoomMap, - ) - + def _make_distorted_frame_set(self) -> FrameSet: base = Frame(2, "Domain=PIXEL") sky = Frame(2, "Domain=SKY") forward_coeffs = [ @@ -114,8 +109,6 @@ def 
_make_distorted_frame_set(self): return fs def test_pure_linear_passes(self) -> None: - from lsst.images.zarr._layout import affine_check - fs = self._make_linear_frame_set(scale=0.2) result = affine_check( frame_set=fs, @@ -127,8 +120,6 @@ def test_pure_linear_passes(self) -> None: self.assertLess(result.max_residual_pixels, 1e-6) def test_high_distortion_drops_block(self) -> None: - from lsst.images.zarr._layout import affine_check - fs = self._make_distorted_frame_set() result = affine_check( frame_set=fs, From a55091047481f0bdd4ca755582fac14e49888fbd Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 05:47:31 -0700 Subject: [PATCH 16/60] feat: add ZarrOutputArchive skeleton with serialize_direct/pointer/frame_set Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_output_archive.py | 133 +++++++++++++++++++++ tests/test_zarr_output_archive.py | 62 ++++++++++ 2 files changed, 195 insertions(+) create mode 100644 python/lsst/images/zarr/_output_archive.py create mode 100644 tests/test_zarr_output_archive.py diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py new file mode 100644 index 00000000..bdbb4c1b --- /dev/null +++ b/python/lsst/images/zarr/_output_archive.py @@ -0,0 +1,133 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +__all__ = ("ZarrOutputArchive", "write") + +from collections.abc import Callable, Hashable, Iterator, Mapping +from typing import Any + +import numpy as np +import pydantic + +from .._transforms import FrameSet +from ..serialization import ( + ArchiveTree, + NestedOutputArchive, + OutputArchive, +) +from ._common import ( + ZarrCompressionOptions, + ZarrPointerModel, + archive_path_to_zarr_path, +) +from ._model import ZarrArray, ZarrDocument, ZarrGroup + + +class ZarrOutputArchive(OutputArchive[ZarrPointerModel]): + """Output archive that populates a `ZarrDocument` IR. + + Bytes are not written until the IR is materialized via + `ZarrDocument.to_zarr`, which the public `write` helper performs + on context-manager exit. + + Parameters + ---------- + chunks + Per-array chunk overrides keyed by the array's archive path + (e.g. ``"image"``). ``None`` for a key means "use the layout + default". + shards, compression + Same shape as ``chunks``. + archive_class + Top-level archive class name (``"VisitImage"``, ``"CellCoadd"``, + …). Used by the layout layer to pick chunk defaults; set by + ``write()`` before ``obj.serialize`` runs so ``add_array`` + sees the right value. + archive_metadata + Class-specific layout hints (``cell_shape`` for ``CellCoadd``, + ``mask_schema`` for the mask packer). + """ + + def __init__( + self, + *, + chunks: Mapping[str, tuple[int, ...] | None] | None = None, + shards: Mapping[str, tuple[int, ...] 
| None] | None = None, + compression: Mapping[str, ZarrCompressionOptions | None] | None = None, + archive_class: str = "Image", + archive_metadata: Mapping[str, Any] | None = None, + ) -> None: + self.document = ZarrDocument(root=ZarrGroup()) + self._chunks = dict(chunks) if chunks else {} + self._shards = dict(shards) if shards else {} + self._compression = dict(compression) if compression else {} + self._archive_class = archive_class + self._archive_metadata = dict(archive_metadata) if archive_metadata else {} + self._pointers: dict[Hashable, ZarrPointerModel] = {} + self._frame_sets: list[tuple[FrameSet, ZarrPointerModel]] = [] + + def serialize_direct[T: pydantic.BaseModel]( + self, + name: str, + serializer: Callable[[OutputArchive[ZarrPointerModel]], T], + ) -> T: + nested = NestedOutputArchive[ZarrPointerModel](name, self) + return serializer(nested) + + def serialize_pointer[T: ArchiveTree]( + self, + name: str, + serializer: Callable[[OutputArchive[ZarrPointerModel]], T], + key: Hashable, + ) -> ZarrPointerModel: + if (cached := self._pointers.get(key)) is not None: + return cached + archive_path = name if name.startswith("/") else f"/{name}" + sub_zarr_path = archive_path_to_zarr_path(archive_path) + # Run the serializer first so any nested add_array calls land + # inside the IR before we dump this sub-tree to JSON. 
+ model = self.serialize_direct(name, serializer) + json_bytes = model.model_dump_json().encode("utf-8") + parent = self.document.root.ensure_group(sub_zarr_path) + parent.arrays["tree"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) + pointer = ZarrPointerModel(path=f"{sub_zarr_path}/tree") + self._pointers[key] = pointer + return pointer + + def serialize_frame_set[T: ArchiveTree]( + self, + name: str, + frame_set: FrameSet, + serializer: Callable[[OutputArchive], T], + key: Hashable, + ) -> ZarrPointerModel: + pointer = self.serialize_pointer(name, serializer, key) + self._frame_sets.append((frame_set, pointer)) + return pointer + + def iter_frame_sets(self) -> Iterator[tuple[FrameSet, ZarrPointerModel]]: + return iter(self._frame_sets) + + def add_array(self, *args: Any, **kwargs: Any) -> Any: + raise NotImplementedError("add_array lands in Task 2.5") + + def add_table(self, *args: Any, **kwargs: Any) -> Any: + raise NotImplementedError("add_table lands in Task 2.6") + + def add_structured_array(self, *args: Any, **kwargs: Any) -> Any: + raise NotImplementedError("add_structured_array lands in Task 2.6") + + +def write(*args: Any, **kwargs: Any) -> Any: + """Public write helper. Implemented in Task 2.7.""" + raise NotImplementedError("write() lands in Task 2.7") diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py new file mode 100644 index 00000000..d16aeb5c --- /dev/null +++ b/tests/test_zarr_output_archive.py @@ -0,0 +1,62 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +import unittest + +import pydantic + +try: + from lsst.images.zarr._common import ZarrPointerModel + from lsst.images.zarr._output_archive import ZarrOutputArchive + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +class _Sub(pydantic.BaseModel): + label: str = "sub" + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrOutputArchiveSkeletonTestCase(unittest.TestCase): + """Constructor + serialize_direct / serialize_pointer plumbing.""" + + def test_serialize_direct_returns_nested_result(self) -> None: + archive = ZarrOutputArchive() + + def serializer(arch): + return _Sub(label="ok") + + result = archive.serialize_direct("red", serializer) + self.assertEqual(result.label, "ok") + + def test_serialize_pointer_writes_json_subtree(self) -> None: + archive = ZarrOutputArchive() + + def serializer(arch): + return _Sub(label="psf") + + pointer = archive.serialize_pointer("psf", serializer, key=12345) + self.assertIsInstance(pointer, ZarrPointerModel) + self.assertEqual(pointer.path, "/psf/tree") + # Cached on second call. + again = archive.serialize_pointer("psf", serializer, key=12345) + self.assertEqual(again, pointer) + # IR holds the JSON bytes as a 1-D uint8 array. 
+ node = archive.document.root.get("/psf/tree") + self.assertEqual(str(node.dtype), "uint8") + + +if __name__ == "__main__": + unittest.main() From 3d672abafe7cbeffcab017eff54c102f3befe87e Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 05:49:58 -0700 Subject: [PATCH 17/60] feat: implement add_array with image/variance/mask handling and CF flag attrs Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_output_archive.py | 123 ++++++++++++++++++++- tests/test_zarr_output_archive.py | 86 ++++++++++++++ 2 files changed, 206 insertions(+), 3 deletions(-) diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index bdbb4c1b..dd0f7705 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -16,21 +16,37 @@ from collections.abc import Callable, Hashable, Iterator, Mapping from typing import Any +import astropy.io.fits +import astropy.table +import astropy.units import numpy as np import pydantic +from .._mask import MaskSchema from .._transforms import FrameSet from ..serialization import ( ArchiveTree, + ArrayReferenceModel, NestedOutputArchive, + NumberType, OutputArchive, + no_header_updates, ) from ._common import ( ZarrCompressionOptions, ZarrPointerModel, archive_path_to_zarr_path, + mask_dtype_for_plane_count, +) +from ._layout import chunks_aligned_to, chunks_for +from ._model import ( + CfFlagAttributes, + MaskPlaneEntry, + ZarrArray, + ZarrDocument, + ZarrGroup, + build_image_array_attrs, ) -from ._model import ZarrArray, ZarrDocument, ZarrGroup class ZarrOutputArchive(OutputArchive[ZarrPointerModel]): @@ -75,6 +91,7 @@ def __init__( self._archive_metadata = dict(archive_metadata) if archive_metadata else {} self._pointers: dict[Hashable, ZarrPointerModel] = {} self._frame_sets: list[tuple[FrameSet, ZarrPointerModel]] = [] + self._image_chunks: tuple[int, ...] 
| None = None def serialize_direct[T: pydantic.BaseModel]( self, @@ -118,8 +135,108 @@ def serialize_frame_set[T: ArchiveTree]( def iter_frame_sets(self) -> Iterator[tuple[FrameSet, ZarrPointerModel]]: return iter(self._frame_sets) - def add_array(self, *args: Any, **kwargs: Any) -> Any: - raise NotImplementedError("add_array lands in Task 2.5") + def add_array( + self, + array: np.ndarray, + *, + name: str | None = None, + update_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> ArrayReferenceModel: + if name is None: + raise ValueError("Anonymous arrays are not supported in ZarrOutputArchive.") + archive_path = name if name.startswith("/") else f"/{name}" + zarr_path = archive_path_to_zarr_path(archive_path) + leaf = zarr_path.rsplit("/", 1)[-1] + parent_path = zarr_path[: -(len(leaf) + 1)] or "/" + parent = self.document.root.ensure_group(parent_path) + + # Mask: pack 3-D (y, x, mask_size) -> 2-D wide-int packed. + if leaf == "mask" and array.ndim == 3: + packed, flag_attrs = self._pack_mask(array) + chunks = self._chunks.get(name) or self._chunks.get(leaf) + if chunks is None and self._image_chunks is not None: + chunks = chunks_aligned_to(image_chunks=self._image_chunks, shape=packed.shape) + extra: dict[str, Any] = {"_ARRAY_DIMENSIONS": ["y", "x"]} + extra.update(flag_attrs.dump()) + ir_array = ZarrArray( + data=packed, + chunks=chunks, + shards=self._shards.get(name), + compression=self._compression.get(name), + ) + ir_array.attributes.extra = extra + parent.arrays[leaf] = ir_array + return ArrayReferenceModel( + source=f"zarr:{zarr_path}", + shape=list(packed.shape), + datatype=NumberType.from_numpy(packed.dtype), + ) + + chunks = self._chunks.get(name) or self._chunks.get(leaf) + # variance / other top-level siblings: align to image's chunks. 
+ if ( + chunks is None + and self._image_chunks is not None + and parent_path == "/" + and leaf == "variance" + and array.ndim == len(self._image_chunks) + ): + chunks = chunks_aligned_to(image_chunks=self._image_chunks, shape=array.shape) + + # Default chunks for the top-level image: from layout rules. + if chunks is None and parent_path == "/" and leaf == "image": + chunks = chunks_for( + self._archive_class, + array.shape, + None, + archive_metadata=self._archive_metadata, + ) + + ir_array = ZarrArray( + data=np.ascontiguousarray(array), + chunks=chunks, + shards=self._shards.get(name), + compression=self._compression.get(name), + ) + if parent_path == "/" and leaf in ("image", "variance"): + ir_array.attributes.extra = build_image_array_attrs( + axes=("y", "x"), + long_name="science image" if leaf == "image" else "image variance", + ) + parent.arrays[leaf] = ir_array + + # Remember the image's chunks so siblings can align. + if parent_path == "/" and leaf == "image" and chunks is not None: + self._image_chunks = tuple(chunks) + + return ArrayReferenceModel( + source=f"zarr:{zarr_path}", + shape=list(array.shape), + datatype=NumberType.from_numpy(array.dtype), + ) + + def _pack_mask(self, array: np.ndarray) -> tuple[np.ndarray, CfFlagAttributes]: + """Pack a 3-D ``(y, x, mask_size)`` mask into a 2-D wide-int array. + + The schema is taken from ``self._archive_metadata["mask_schema"]``. + Returns the packed array and the CF flag attributes. + """ + schema = self._archive_metadata.get("mask_schema") + if not isinstance(schema, MaskSchema): + raise ValueError( + "Writing a 3-D mask requires archive_metadata['mask_schema'] " + "to be set; the output archive cannot infer the plane " + "definitions otherwise." + ) + n_planes = len(schema) + target_dtype = mask_dtype_for_plane_count(n_planes) + # Pack: each (y, x) pixel's mask_size bytes -> one wide integer. + # Byte 0 is the low byte (planes 0..7), byte 1 is the next, etc. 
+ packed = np.zeros(array.shape[:2], dtype=target_dtype) + for i in range(array.shape[2]): + packed |= array[..., i].astype(target_dtype) << (8 * i) + planes = [MaskPlaneEntry(name=p.name, bit=i, description=p.description) for i, p in enumerate(schema)] + return packed, CfFlagAttributes(planes=planes) def add_table(self, *args: Any, **kwargs: Any) -> Any: raise NotImplementedError("add_table lands in Task 2.6") diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py index d16aeb5c..f7144bed 100644 --- a/tests/test_zarr_output_archive.py +++ b/tests/test_zarr_output_archive.py @@ -13,8 +13,11 @@ import unittest +import numpy as np import pydantic +from lsst.images import MaskPlane, MaskSchema + try: from lsst.images.zarr._common import ZarrPointerModel from lsst.images.zarr._output_archive import ZarrOutputArchive @@ -58,5 +61,88 @@ def serializer(arch): self.assertEqual(str(node.dtype), "uint8") +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrOutputArchiveAddArrayTestCase(unittest.TestCase): + """`add_array` handling for image / variance / mask plus nested arrays.""" + + def test_add_image(self) -> None: + archive = ZarrOutputArchive() + ref = archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") + self.assertEqual(ref.source, "zarr:/image") + self.assertEqual(list(ref.shape), [4, 5]) + node = archive.document.root.get("/image") + self.assertEqual(node.shape, (4, 5)) + self.assertEqual(node.attributes.extra["_ARRAY_DIMENSIONS"], ["y", "x"]) + + def test_add_variance_aligns_to_image_chunks(self) -> None: + archive = ZarrOutputArchive(chunks={"image": (2, 2)}) + archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") + archive.add_array(np.ones((4, 5), dtype=np.float64), name="variance") + var_node = archive.document.root.get("/variance") + self.assertEqual(tuple(var_node.chunks), (2, 2)) + + def test_add_mask_packs_to_2d_with_cf_flag_attrs(self) -> None: + schema = MaskSchema( + [ + 
MaskPlane("BAD", "Bad pixel."), + MaskPlane("SAT", "Saturated."), + MaskPlane("CR", "Cosmic ray."), + ] + ) + # In-memory mask is (y, x, mask_size) where mask_size is bytes. + in_memory = np.zeros((4, 5, 1), dtype=np.uint8) + in_memory[0, 0, 0] = 0b1 # BAD + in_memory[1, 1, 0] = 0b110 # SAT | CR + + archive = ZarrOutputArchive(archive_metadata={"mask_schema": schema}) + archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") + ref = archive.add_array(in_memory, name="mask") + self.assertEqual(ref.source, "zarr:/mask") + node = archive.document.root.get("/mask") + # 2-D packed integer. + self.assertEqual(node.shape, (4, 5)) + self.assertEqual(str(node.dtype), "uint8") # 3 planes -> uint8 + # Bytes packed correctly. + np.testing.assert_array_equal(node.data[0, 0], 0b1) + np.testing.assert_array_equal(node.data[1, 1], 0b110) + # CF flag attrs. + attrs = node.attributes.extra + self.assertEqual(attrs["flag_masks"], [1, 2, 4]) + self.assertEqual(attrs["flag_meanings"], "BAD SAT CR") + self.assertEqual( + attrs["flag_descriptions"], + ["Bad pixel.", "Saturated.", "Cosmic ray."], + ) + self.assertEqual(attrs["_ARRAY_DIMENSIONS"], ["y", "x"]) + + def test_add_mask_picks_widest_dtype_for_40_planes(self) -> None: + planes = [MaskPlane(f"P{i}", f"Plane {i}.") for i in range(40)] + schema = MaskSchema(planes) + in_memory = np.zeros((4, 5, 5), dtype=np.uint8) # mask_size=5 + + archive = ZarrOutputArchive(archive_metadata={"mask_schema": schema}) + archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") + archive.add_array(in_memory, name="mask") + node = archive.document.root.get("/mask") + self.assertEqual(node.shape, (4, 5)) + self.assertEqual(str(node.dtype), "uint64") + + def test_add_mask_refuses_more_than_64_planes(self) -> None: + planes = [MaskPlane(f"P{i}", f"Plane {i}.") for i in range(65)] + schema = MaskSchema(planes) + in_memory = np.zeros((4, 5, 9), dtype=np.uint8) + + archive = ZarrOutputArchive(archive_metadata={"mask_schema": schema}) + 
archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") + with self.assertRaisesRegex(ValueError, "supports up to 64"): + archive.add_array(in_memory, name="mask") + + def test_add_anonymous_nested_array(self) -> None: + archive = ZarrOutputArchive() + ref = archive.add_array(np.ones((3,), dtype=np.float32), name="psf/centroids") + self.assertEqual(ref.source, "zarr:/psf/centroids") + self.assertEqual(archive.document.root.get("/psf/centroids").shape, (3,)) + + if __name__ == "__main__": unittest.main() From 8d7accecfe75f4db0006443e38091620c35ba661 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 05:57:54 -0700 Subject: [PATCH 18/60] feat: implement ZarrOutputArchive add_table and add_structured_array Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_output_archive.py | 50 ++++++++++++++++++++-- tests/test_zarr_output_archive.py | 24 +++++++++++ 2 files changed, 70 insertions(+), 4 deletions(-) diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index dd0f7705..b0ec4f53 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -30,6 +30,8 @@ NestedOutputArchive, NumberType, OutputArchive, + TableColumnModel, + TableModel, no_header_updates, ) from ._common import ( @@ -238,11 +240,51 @@ def _pack_mask(self, array: np.ndarray) -> tuple[np.ndarray, CfFlagAttributes]: planes = [MaskPlaneEntry(name=p.name, bit=i, description=p.description) for i, p in enumerate(schema)] return packed, CfFlagAttributes(planes=planes) - def add_table(self, *args: Any, **kwargs: Any) -> Any: - raise NotImplementedError("add_table lands in Task 2.6") + def add_table( + self, + table: astropy.table.Table, + *, + name: str | None = None, + update_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> TableModel: + if name is None: + raise ValueError("Anonymous tables are not supported in ZarrOutputArchive.") + columns = 
TableColumnModel.from_table(table) + archive_path = name if name.startswith("/") else f"/{name}" + table_zarr_path = f"/lsst/tables{archive_path}" + parent = self.document.root.ensure_group(table_zarr_path) + for c in columns: + assert isinstance(c.data, ArrayReferenceModel) + column_array = np.ascontiguousarray(np.asarray(table[c.name])) + parent.arrays[c.name] = ZarrArray(data=column_array) + c.data.source = f"zarr:{table_zarr_path}/{c.name}" + return TableModel(columns=columns, meta=table.meta) - def add_structured_array(self, *args: Any, **kwargs: Any) -> Any: - raise NotImplementedError("add_structured_array lands in Task 2.6") + def add_structured_array( + self, + array: np.ndarray, + *, + name: str | None = None, + units: Mapping[str, astropy.units.Unit] | None = None, + descriptions: Mapping[str, str] | None = None, + update_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> TableModel: + if name is None: + raise ValueError("Anonymous structured arrays are not supported.") + columns = TableColumnModel.from_record_dtype(array.dtype) + archive_path = name if name.startswith("/") else f"/{name}" + table_zarr_path = f"/lsst/tables{archive_path}" + parent = self.document.root.ensure_group(table_zarr_path) + for c in columns: + assert isinstance(c.data, ArrayReferenceModel) + column_array = np.ascontiguousarray(array[c.name]) + parent.arrays[c.name] = ZarrArray(data=column_array) + c.data.source = f"zarr:{table_zarr_path}/{c.name}" + if units and (unit := units.get(c.name)): + c.unit = unit + if descriptions and (description := descriptions.get(c.name)): + c.description = description + return TableModel(columns=columns) def write(*args: Any, **kwargs: Any) -> Any: diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py index f7144bed..7696d880 100644 --- a/tests/test_zarr_output_archive.py +++ b/tests/test_zarr_output_archive.py @@ -13,6 +13,7 @@ import unittest +import astropy.table import numpy as np import 
pydantic @@ -144,5 +145,28 @@ def test_add_anonymous_nested_array(self) -> None: self.assertEqual(archive.document.root.get("/psf/centroids").shape, (3,)) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrOutputArchiveAddTableTestCase(unittest.TestCase): + """`add_table` / `add_structured_array` plumbing.""" + + def test_add_table_creates_one_array_per_column(self) -> None: + archive = ZarrOutputArchive() + original = astropy.table.Table( + { + "x": np.arange(4, dtype=np.int32), + "y": np.arange(4, dtype=np.float32), + }, + meta={"comment": "small catalog"}, + ) + model = archive.add_table(original, name="cat") + self.assertEqual(len(model.columns), 2) + sources = {c.name: c.data.source for c in model.columns} + self.assertEqual(sources["x"], "zarr:/lsst/tables/cat/x") + self.assertEqual(sources["y"], "zarr:/lsst/tables/cat/y") + # Each column is its own zarr array under the parent group. + x_node = archive.document.root.get("/lsst/tables/cat/x") + self.assertEqual(x_node.shape, (4,)) + + if __name__ == "__main__": unittest.main() From 8fab0195ae7ef7a50afa850555942f117426e022 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 06:01:42 -0700 Subject: [PATCH 19/60] feat: add ZarrOutputArchive.add_tree and public write() helper Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/__init__.py | 3 +- python/lsst/images/zarr/_output_archive.py | 137 ++++++++++++++++++++- tests/test_zarr_output_archive.py | 41 ++++++ 3 files changed, 176 insertions(+), 5 deletions(-) diff --git a/python/lsst/images/zarr/__init__.py b/python/lsst/images/zarr/__init__.py index 40da59ae..44441061 100644 --- a/python/lsst/images/zarr/__init__.py +++ b/python/lsst/images/zarr/__init__.py @@ -36,4 +36,5 @@ "Install it directly or via 'pip install lsst-images[zarr]'." ) from e -# Phase 1 has no public archive API yet. Re-exports are added in later phases. 
+from ._common import * +from ._output_archive import * diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index b0ec4f53..67a145e0 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -24,6 +24,7 @@ from .._mask import MaskSchema from .._transforms import FrameSet +from .._transforms._ast import Channel, StringStream from ..serialization import ( ArchiveTree, ArrayReferenceModel, @@ -40,10 +41,11 @@ archive_path_to_zarr_path, mask_dtype_for_plane_count, ) -from ._layout import chunks_aligned_to, chunks_for +from ._layout import affine_check, axes_for_archive_class, chunks_aligned_to, chunks_for from ._model import ( CfFlagAttributes, MaskPlaneEntry, + OmeMultiscale, ZarrArray, ZarrDocument, ZarrGroup, @@ -286,7 +288,134 @@ def add_structured_array( c.description = description return TableModel(columns=columns) + def add_tree(self, tree: ArchiveTree) -> None: + """Finalize the IR: write JSON tree, WCS, and root attributes. -def write(*args: Any, **kwargs: Any) -> Any: - """Public write helper. Implemented in Task 2.7.""" - raise NotImplementedError("write() lands in Task 2.7") + Called once after the user's serializer has populated arrays + / sub-trees. Sets the ``lsst.*`` and ``ome.*`` blocks on the + root group, stages ``/tree`` as 1-D ``uint8`` UTF-8 JSON, and + runs the affine residual validator if the archive carries a + frame set. + """ + # Stage the JSON tree at /tree. + json_bytes = tree.model_dump_json().encode("utf-8") + self.document.root.arrays["tree"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) + + # Stage the AST WCS string at /wcs_ast when a frame set is registered. + wcs_ast_path: str | None = None + if self._frame_sets: + wcs_ast_path = self._stage_wcs_ast(self._frame_sets[0][0]) + + # Root LSST attrs. 
+ lsst = self.document.root.attributes.lsst + lsst["archive_class"] = self._archive_class + lsst["tree"] = "tree" + if wcs_ast_path is not None: + lsst["wcs_ast"] = wcs_ast_path + if "cell_grid" in self._archive_metadata: + lsst["cell_grid"] = self._archive_metadata["cell_grid"] + + # data_model / version go to the top level (not under lsst:). + self.document.root.attributes.extra["data_model"] = self._data_model_for(self._archive_class) + self.document.root.attributes.extra["version"] = 1 + + # OME multiscale block, gated by axes_for_archive_class. + axes = axes_for_archive_class(self._archive_class) + if axes and "image" in self.document.root.arrays: + image_array = self.document.root.arrays["image"] + ct: list[dict[str, Any]] | None = None + if self._frame_sets: + fs = self._frame_sets[0][0] + check = affine_check( + frame_set=fs, + image_shape=image_array.shape, + max_residual_pixels=1.0, + ) + if check.dropped: + lsst["wcs_simplified_dropped"] = True + lsst["wcs_simplified_max_residual_pixels"] = check.max_residual_pixels + else: + lsst["wcs_simplified_dropped"] = False + lsst["wcs_simplified_max_residual_pixels"] = check.max_residual_pixels + ct = check.coordinate_transformations + multiscale = OmeMultiscale( + name=self._archive_class.lower(), + axes=axes, + dataset_path="image", + coordinate_transformations=ct, + ) + self.document.root.attributes.ome["multiscales"] = [multiscale.dump()] + + def _stage_wcs_ast(self, frame_set: FrameSet) -> str: + """Encode an AST FrameSet as UTF-8 text and stage at /wcs_ast.""" + stream = StringStream() + Channel(stream, options="Full=-1,Comment=0,Indent=0").write(frame_set) + text = stream.getSinkData() + self.document.root.arrays["wcs_ast"] = ZarrArray( + data=np.frombuffer(text.encode("utf-8"), dtype=np.uint8) + ) + return "wcs_ast" + + @staticmethod + def _data_model_for(archive_class: str) -> str: + """Map an archive class name to the public ``data_model`` string.""" + return { + "Image": "org.lsst.image", + "Mask": 
"org.lsst.mask", + "MaskedImage": "org.lsst.masked_image", + "VisitImage": "org.lsst.visit_image", + "ColorImage": "org.lsst.color_image", + "CellCoadd": "org.lsst.cell_coadd", + }.get(archive_class, f"org.lsst.{archive_class.lower()}") + + +def write( + obj: Any, + path: Any, + *, + chunks: Mapping[str, tuple[int, ...] | None] | None = None, + shards: Mapping[str, tuple[int, ...] | None] | None = None, + compression: Mapping[str, ZarrCompressionOptions | None] | None = None, + metadata: Mapping[str, Any] | None = None, + butler_info: Any | None = None, +) -> ArchiveTree: + """Write ``obj`` to a zarr archive at ``path``. + + Parameters mirror the FITS / NDF write helpers. The store + implementation (LocalStore / ZipStore / FsspecStore) is selected + from the URI shape by ``_store.open_store_for_write``. + """ + from ._store import open_store_for_write + + archive_class = type(obj).__name__ + archive_default_name = getattr(obj, "_archive_default_name", None) + archive_metadata: dict[str, Any] = {} + if (cell_shape := getattr(obj, "cell_shape", None)) is not None: + archive_metadata["cell_shape"] = tuple(cell_shape) + if (cell_grid := getattr(obj, "cell_grid", None)) is not None: + archive_metadata["cell_grid"] = { + "bbox": list(cell_grid.bbox) if hasattr(cell_grid, "bbox") else None, + "cell_shape": list(cell_grid.cell_shape) if hasattr(cell_grid, "cell_shape") else None, + } + if (mask_schema := getattr(obj, "mask_schema", None)) is not None: + archive_metadata["mask_schema"] = mask_schema + + archive = ZarrOutputArchive( + chunks=chunks, + shards=shards, + compression=compression, + archive_class=archive_class, + archive_metadata=archive_metadata, + ) + if archive_default_name is not None: + tree = archive.serialize_direct(archive_default_name, obj.serialize) + else: + tree = obj.serialize(archive) + if metadata is not None: + tree.metadata.update(metadata) + if butler_info is not None: + tree.butler_info = butler_info + archive.add_tree(tree) + with 
open_store_for_write(path) as store: + archive.document.to_zarr(store) + return tree diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py index 7696d880..b72fd823 100644 --- a/tests/test_zarr_output_archive.py +++ b/tests/test_zarr_output_archive.py @@ -168,5 +168,46 @@ def test_add_table_creates_one_array_per_column(self) -> None: self.assertEqual(x_node.shape, (4,)) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrWriteHelperTestCase(unittest.TestCase): + """Public ``write()`` end-to-end for a plain `Image`.""" + + def test_write_image_to_local_directory(self) -> None: + import os + import tempfile + + import zarr + + from lsst.images import Box, Image + from lsst.images.zarr import write + from lsst.images.zarr._model import ZarrDocument + + original = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + tree = write(original, target) + self.assertIsNotNone(tree) + with zarr.storage.LocalStore(target, read_only=True) as store: + doc = ZarrDocument.from_zarr(store) + # Top-level image and tree are present. + self.assertIn("image", doc.root.arrays) + self.assertIn("tree", doc.root.arrays) + self.assertEqual(doc.root.arrays["image"].shape, (4, 5)) + # LSST root attrs. + lsst_attrs = doc.root.attributes.lsst + self.assertEqual(lsst_attrs["archive_class"], "Image") + self.assertEqual(lsst_attrs["tree"], "tree") + # OME multiscales points at /image; no projection means + # the unit scale is emitted. + ome = doc.root.attributes.ome + self.assertIn("multiscales", ome) + self.assertEqual(ome["multiscales"][0]["datasets"][0]["path"], "image") + # data_model + version on root. 
+ self.assertEqual(doc.root.attributes.extra["data_model"], "org.lsst.image") + + if __name__ == "__main__": unittest.main() From 62029c10bcc7fbfbd20f321eb77fe87e3f268aae Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 06:07:04 -0700 Subject: [PATCH 20/60] feat: opt ZarrOutputArchive into native 3-D mask layout, pin MaskedImage on-disk shape Mask.serialize emits the byte axis first when the archive opts in via _prefer_native_mask_arrays. The output archive undoes that swap so the on-disk layout is the natural xarray (y, x) shape with a 2-D packed wide-integer per pixel. Adds an end-to-end on-disk shape test for MaskedImage that pins this contract (Tasks 2.8 + 3.0 from the plan). Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_output_archive.py | 22 ++++++- tests/test_zarr_output_archive.py | 71 +++++++++++++++++----- 2 files changed, 74 insertions(+), 19 deletions(-) diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index 67a145e0..83ba63ca 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -14,7 +14,7 @@ __all__ = ("ZarrOutputArchive", "write") from collections.abc import Callable, Hashable, Iterator, Mapping -from typing import Any +from typing import Any, ClassVar import astropy.io.fits import astropy.table @@ -78,6 +78,13 @@ class ZarrOutputArchive(OutputArchive[ZarrPointerModel]): ``mask_schema`` for the mask packer). """ + _prefer_native_mask_arrays: ClassVar[bool] = True + """Tell ``Mask.serialize`` to hand us the 3-D ``(mask_size, y, x)`` + array in one ``add_array`` call. ``add_array`` packs it into a 2-D + wide-integer array on disk with CF ``flag_masks`` / ``flag_meanings`` + attributes. 
+ """ + def __init__( self, *, @@ -154,8 +161,12 @@ def add_array( parent_path = zarr_path[: -(len(leaf) + 1)] or "/" parent = self.document.root.ensure_group(parent_path) - # Mask: pack 3-D (y, x, mask_size) -> 2-D wide-int packed. + # Mask: pack 3-D (mask_size, y, x) -> 2-D (y, x) wide-int packed. + # Mask.serialize emits the byte axis first when the archive opts in + # via _prefer_native_mask_arrays (matching the HDF5/NDF convention); + # we undo that so the on-disk array is the natural xarray layout. if leaf == "mask" and array.ndim == 3: + array = np.moveaxis(array, 0, -1) packed, flag_attrs = self._pack_mask(array) chunks = self._chunks.get(name) or self._chunks.get(leaf) if chunks is None and self._image_chunks is not None: @@ -397,7 +408,12 @@ def write( "bbox": list(cell_grid.bbox) if hasattr(cell_grid, "bbox") else None, "cell_shape": list(cell_grid.cell_shape) if hasattr(cell_grid, "cell_shape") else None, } - if (mask_schema := getattr(obj, "mask_schema", None)) is not None: + mask_schema = getattr(obj, "mask_schema", None) + if mask_schema is None: + mask = getattr(obj, "mask", None) + if mask is not None: + mask_schema = getattr(mask, "schema", None) + if mask_schema is not None: archive_metadata["mask_schema"] = mask_schema archive = ZarrOutputArchive( diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py index b72fd823..8212892e 100644 --- a/tests/test_zarr_output_archive.py +++ b/tests/test_zarr_output_archive.py @@ -11,16 +11,21 @@ from __future__ import annotations +import os +import tempfile import unittest import astropy.table import numpy as np import pydantic -from lsst.images import MaskPlane, MaskSchema +from lsst.images import Box, Image, MaskPlane, MaskSchema try: - from lsst.images.zarr._common import ZarrPointerModel + import zarr + + from lsst.images.zarr import ZarrPointerModel, write + from lsst.images.zarr._model import ZarrDocument from lsst.images.zarr._output_archive import ZarrOutputArchive HAVE_ZARR 
= True @@ -90,10 +95,11 @@ def test_add_mask_packs_to_2d_with_cf_flag_attrs(self) -> None: MaskPlane("CR", "Cosmic ray."), ] ) - # In-memory mask is (y, x, mask_size) where mask_size is bytes. - in_memory = np.zeros((4, 5, 1), dtype=np.uint8) + # ``Mask.serialize`` emits the byte axis first when the archive opts + # into native-mask arrays — shape ``(mask_size, y, x)``. + in_memory = np.zeros((1, 4, 5), dtype=np.uint8) in_memory[0, 0, 0] = 0b1 # BAD - in_memory[1, 1, 0] = 0b110 # SAT | CR + in_memory[0, 1, 1] = 0b110 # SAT | CR archive = ZarrOutputArchive(archive_metadata={"mask_schema": schema}) archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") @@ -119,7 +125,8 @@ def test_add_mask_packs_to_2d_with_cf_flag_attrs(self) -> None: def test_add_mask_picks_widest_dtype_for_40_planes(self) -> None: planes = [MaskPlane(f"P{i}", f"Plane {i}.") for i in range(40)] schema = MaskSchema(planes) - in_memory = np.zeros((4, 5, 5), dtype=np.uint8) # mask_size=5 + # 40 planes -> mask_size=5 -> (5, y, x). + in_memory = np.zeros((5, 4, 5), dtype=np.uint8) archive = ZarrOutputArchive(archive_metadata={"mask_schema": schema}) archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") @@ -131,7 +138,8 @@ def test_add_mask_picks_widest_dtype_for_40_planes(self) -> None: def test_add_mask_refuses_more_than_64_planes(self) -> None: planes = [MaskPlane(f"P{i}", f"Plane {i}.") for i in range(65)] schema = MaskSchema(planes) - in_memory = np.zeros((4, 5, 9), dtype=np.uint8) + # 65 planes -> mask_size=9 -> (9, y, x). 
+ in_memory = np.zeros((9, 4, 5), dtype=np.uint8) archive = ZarrOutputArchive(archive_metadata={"mask_schema": schema}) archive.add_array(np.ones((4, 5), dtype=np.float32), name="image") @@ -173,15 +181,6 @@ class ZarrWriteHelperTestCase(unittest.TestCase): """Public ``write()`` end-to-end for a plain `Image`.""" def test_write_image_to_local_directory(self) -> None: - import os - import tempfile - - import zarr - - from lsst.images import Box, Image - from lsst.images.zarr import write - from lsst.images.zarr._model import ZarrDocument - original = Image( np.arange(20, dtype=np.float32).reshape(4, 5), bbox=Box.factory[10:14, 20:25], @@ -209,5 +208,45 @@ def test_write_image_to_local_directory(self) -> None: self.assertEqual(doc.root.attributes.extra["data_model"], "org.lsst.image") +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrWriteOnDiskShapeTestCase(unittest.TestCase): + """Pin the on-disk layout for harder archive classes.""" + + def _round_trip_doc(self, obj): + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(obj, target) + with zarr.storage.LocalStore(target, read_only=True) as store: + return ZarrDocument.from_zarr(store) + + def test_masked_image_layout(self) -> None: + from lsst.images import MaskedImage + + schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + masked = MaskedImage(image, mask_schema=schema) + masked.mask.set("BAD", image.array % 2 == 0) + + doc = self._round_trip_doc(masked) + self.assertEqual(doc.root.attributes.lsst["archive_class"], "MaskedImage") + # image / variance / mask are sibling root arrays. + self.assertIn("image", doc.root.arrays) + self.assertIn("variance", doc.root.arrays) + self.assertIn("mask", doc.root.arrays) + # Mask is 2-D packed integer with CF flag attrs. 
+ mask = doc.root.arrays["mask"] + self.assertEqual(mask.shape, (4, 5)) + self.assertEqual(mask.attributes.extra["flag_meanings"], "BAD") + # CF / xarray dims on every 2-D array. + for name in ("image", "variance", "mask"): + self.assertEqual( + doc.root.arrays[name].attributes.extra["_ARRAY_DIMENSIONS"], + ["y", "x"], + ) + + if __name__ == "__main__": unittest.main() From 64fcf29df6d5d879a01383c3d5aeacb1add410f3 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 06:09:32 -0700 Subject: [PATCH 21/60] feat: add ZarrInputArchive skeleton with get_tree and version validation Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_input_archive.py | 94 +++++++++++++++++++++++ python/lsst/images/zarr/_model.py | 9 ++- tests/test_zarr_input_archive.py | 83 ++++++++++++++++++++ 3 files changed, 184 insertions(+), 2 deletions(-) create mode 100644 python/lsst/images/zarr/_input_archive.py create mode 100644 tests/test_zarr_input_archive.py diff --git a/python/lsst/images/zarr/_input_archive.py b/python/lsst/images/zarr/_input_archive.py new file mode 100644 index 00000000..cc2a47ee --- /dev/null +++ b/python/lsst/images/zarr/_input_archive.py @@ -0,0 +1,94 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +__all__ = ("ZarrInputArchive", "read") + +from collections.abc import Iterator +from contextlib import contextmanager +from typing import Any, Self + +from lsst.resources import ResourcePathExpression + +from .._transforms import FrameSet +from ..serialization import ( + ArchiveReadError, + ArchiveTree, + InputArchive, +) +from ._common import LSST_VERSION, ZarrPointerModel +from ._model import ZarrArray, ZarrDocument + + +class ZarrInputArchive(InputArchive[ZarrPointerModel]): + """Reads zarr archives written by `ZarrOutputArchive`.""" + + def __init__(self, document: ZarrDocument) -> None: + self._document = document + self._validate_root_attributes() + self._deserialized_pointer_cache: dict[str, Any] = {} + self._frame_set_cache: dict[str, FrameSet] = {} + + @classmethod + @contextmanager + def open(cls, path: ResourcePathExpression) -> Iterator[Self]: + """Open a zarr archive for reading.""" + from ._store import open_store_for_read + + with open_store_for_read(path) as store: + doc = ZarrDocument.from_zarr(store) + yield cls(doc) + + @property + def document(self) -> ZarrDocument: + return self._document + + def get_tree[T: ArchiveTree](self, model_type: type[T]) -> T: + """Read and validate the main Pydantic tree at ``/tree``.""" + try: + node = self._document.root.get("/tree") + except KeyError: + raise ArchiveReadError("File has no /tree array; this is not an LSST zarr archive.") from None + if not isinstance(node, ZarrArray): + raise ArchiveReadError("/tree must be a zarr array, not a group.") + json_bytes = bytes(node.read()) + return model_type.model_validate_json(json_bytes.decode("utf-8")) + + def _validate_root_attributes(self) -> None: + attrs = self._document.root.attributes.lsst + if "archive_class" not in attrs: + raise ArchiveReadError("File is not an LSST zarr archive (missing lsst.archive_class).") + version = attrs.get("__version_remembered_at_load__", LSST_VERSION) + if version > LSST_VERSION: + raise 
ArchiveReadError( + f"Unsupported lsst:version {version}; this reader supports up to {LSST_VERSION}." + ) + + def deserialize_pointer(self, *args: Any, **kwargs: Any) -> Any: + raise NotImplementedError("deserialize_pointer lands in Task 3.3") + + def get_frame_set(self, *args: Any, **kwargs: Any) -> Any: + raise NotImplementedError("get_frame_set lands in Task 3.3") + + def get_array(self, *args: Any, **kwargs: Any) -> Any: + raise NotImplementedError("get_array lands in Task 3.2") + + def get_table(self, *args: Any, **kwargs: Any) -> Any: + raise NotImplementedError("get_table lands in Task 3.4") + + def get_structured_array(self, *args: Any, **kwargs: Any) -> Any: + raise NotImplementedError("get_structured_array lands in Task 3.4") + + +def read(*args: Any, **kwargs: Any) -> Any: + """Public read helper. Implemented in Task 3.5.""" + raise NotImplementedError("read() lands in Task 3.5") diff --git a/python/lsst/images/zarr/_model.py b/python/lsst/images/zarr/_model.py index 70adb692..ba256563 100644 --- a/python/lsst/images/zarr/_model.py +++ b/python/lsst/images/zarr/_model.py @@ -70,7 +70,8 @@ def dump(self) -> dict[str, Any]: """Return the raw mapping zarr-python writes to ``zarr.json``.""" out: dict[str, Any] = dict(self.extra) # lsst is always present so readers can dispatch on lsst.archive_class. 
- out[LSST_NS] = {"version": LSST_VERSION, **self.lsst} + public_lsst = {k: v for k, v in self.lsst.items() if not k.startswith("__")} + out[LSST_NS] = {"version": LSST_VERSION, **public_lsst} if self.ome: out[OME_NS] = {"version": OME_VERSION, **self.ome} return out @@ -79,7 +80,11 @@ def dump(self) -> dict[str, Any]: def load(cls, raw: dict[str, Any]) -> Self: """Construct from a raw attributes mapping read from zarr.""" lsst = dict(raw.get(LSST_NS, {})) - lsst.pop("version", None) # version implicit in the namespace + version = lsst.pop("version", None) + if version is not None: + # Stash the on-disk version under a private sentinel so the input + # archive can validate without going back to the raw store. + lsst["__version_remembered_at_load__"] = version ome = dict(raw.get(OME_NS, {})) ome.pop("version", None) extra = {k: v for k, v in raw.items() if k not in (LSST_NS, OME_NS)} diff --git a/tests/test_zarr_input_archive.py b/tests/test_zarr_input_archive.py new file mode 100644 index 00000000..5519225b --- /dev/null +++ b/tests/test_zarr_input_archive.py @@ -0,0 +1,83 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +import os +import tempfile +import unittest + +import numpy as np + +from lsst.images import Box, Image +from lsst.images._image import ImageSerializationModel +from lsst.images.serialization import ArchiveReadError + +try: + import zarr + + from lsst.images.zarr import write + from lsst.images.zarr._common import LSST_NS, LSST_VERSION + from lsst.images.zarr._input_archive import ZarrInputArchive + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrInputArchiveSkeletonTestCase(unittest.TestCase): + """Open + version validation + ``get_tree``.""" + + def test_open_reads_tree(self) -> None: + original = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(original, target) + with ZarrInputArchive.open(target) as archive: + tree = archive.get_tree(ImageSerializationModel) + self.assertIsNotNone(tree) + + def test_missing_archive_class_raises(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "bare.zarr") + os.makedirs(target) + store = zarr.storage.LocalStore(target, read_only=False) + zarr.create_group(store=store, zarr_format=3) # no lsst attrs + with self.assertRaisesRegex(ArchiveReadError, "not an LSST zarr archive"): + with ZarrInputArchive.open(target): + pass + + def test_future_version_refused(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "future.zarr") + os.makedirs(target) + store = zarr.storage.LocalStore(target, read_only=False) + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes( + { + LSST_NS: { + "version": LSST_VERSION + 1, + "archive_class": "Image", + "tree": "tree", + } + } + ) + with self.assertRaisesRegex(ArchiveReadError, "Unsupported lsst:version"): + with 
ZarrInputArchive.open(target): + pass + + +if __name__ == "__main__": + unittest.main() From bd558e2f2311262c94c7a7be8a45f4b316d15ff8 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 06:12:08 -0700 Subject: [PATCH 22/60] feat: implement ZarrInputArchive.get_array with lazy slices and mask unpack Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_input_archive.py | 83 ++++++++++- tests/test_zarr_input_archive.py | 165 ++++++++++++++++++++++ 2 files changed, 245 insertions(+), 3 deletions(-) diff --git a/python/lsst/images/zarr/_input_archive.py b/python/lsst/images/zarr/_input_archive.py index cc2a47ee..8b31c0d1 100644 --- a/python/lsst/images/zarr/_input_archive.py +++ b/python/lsst/images/zarr/_input_archive.py @@ -13,17 +13,25 @@ __all__ = ("ZarrInputArchive", "read") -from collections.abc import Iterator +from collections.abc import Callable, Iterator from contextlib import contextmanager +from types import EllipsisType from typing import Any, Self +import astropy.io.fits +import astropy.table +import numpy as np + from lsst.resources import ResourcePathExpression from .._transforms import FrameSet from ..serialization import ( ArchiveReadError, ArchiveTree, + ArrayReferenceModel, + InlineArrayModel, InputArchive, + no_header_updates, ) from ._common import LSST_VERSION, ZarrPointerModel from ._model import ZarrArray, ZarrDocument @@ -79,8 +87,77 @@ def deserialize_pointer(self, *args: Any, **kwargs: Any) -> Any: def get_frame_set(self, *args: Any, **kwargs: Any) -> Any: raise NotImplementedError("get_frame_set lands in Task 3.3") - def get_array(self, *args: Any, **kwargs: Any) -> Any: - raise NotImplementedError("get_array lands in Task 3.2") + def get_array( + self, + model: ArrayReferenceModel | InlineArrayModel, + *, + slices: tuple[slice, ...] 
| EllipsisType = ..., + strip_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> np.ndarray: + if isinstance(model, InlineArrayModel): + data: np.ndarray = np.array(model.data, dtype=model.datatype.to_numpy()) + return data if slices is ... else data[slices] + if not isinstance(model.source, str) or not model.source.startswith("zarr:"): + raise ArchiveReadError( + f"ZarrInputArchive cannot resolve array source {model.source!r}; " + f"expected a 'zarr:' reference." + ) + zarr_path = model.source[len("zarr:") :] + try: + node = self._document.root.get(zarr_path) + except KeyError: + raise ArchiveReadError(f"Array reference {zarr_path!r} not in store.") from None + if not isinstance(node, ZarrArray): + raise ArchiveReadError(f"{zarr_path!r} is not an array.") + + # Mask unpack: model claims 3-D (mask_size, y, x); on-disk is 2-D + # (y, x) packed wide-int with flag_masks attribute. + claimed_shape = tuple(model.shape) if model.shape is not None else None + if ( + claimed_shape is not None + and len(claimed_shape) == 3 + and len(node.shape) == 2 + and "flag_masks" in node.attributes.extra + ): + return self._read_packed_mask(node, claimed_shape, slices) + + # Standard path: forward slices straight to the lazy handle. + return node.read(slices=slices) + + def _read_packed_mask( + self, + node: ZarrArray, + claimed_shape: tuple[int, ...], + slices: tuple[slice, ...] | EllipsisType, + ) -> np.ndarray: + """Unpack a 2-D wide-int mask back to 3-D ``(mask_size, y, x)``. + + ``Mask.serialize`` produces ``(mask_size, y, x)`` so we restore + that layout. ``slices`` is forwarded to the lazy handle as-is + when it has rank 2 (operating on the on-disk shape); rank-3 + slices have their first axis stripped and re-applied after + unpack. + """ + mask_size = claimed_shape[0] + # Forward slice to the lazy handle so only intersecting chunks + # are fetched even on remote stores. + if slices is ...: + spatial_slices: tuple[slice, ...] | EllipsisType = ... 
+ byte_slice: slice | EllipsisType = ... + elif len(slices) == 3: + byte_slice = slices[0] + spatial_slices = slices[1:] + else: + spatial_slices = slices + byte_slice = ... + packed = node.read(slices=spatial_slices) + # Unpack: low byte first, in the (mask_size, y, x) layout. + out = np.empty((mask_size,) + packed.shape, dtype=np.uint8) + for i in range(mask_size): + out[i] = (packed >> np.uint64(8 * i)) & np.uint64(0xFF) + if byte_slice is ...: + return out + return out[byte_slice] def get_table(self, *args: Any, **kwargs: Any) -> Any: raise NotImplementedError("get_table lands in Task 3.4") diff --git a/tests/test_zarr_input_archive.py b/tests/test_zarr_input_archive.py index 5519225b..f2238a38 100644 --- a/tests/test_zarr_input_archive.py +++ b/tests/test_zarr_input_archive.py @@ -24,15 +24,43 @@ try: import zarr + from lsst.images.serialization import ArrayReferenceModel, NumberType from lsst.images.zarr import write from lsst.images.zarr._common import LSST_NS, LSST_VERSION from lsst.images.zarr._input_archive import ZarrInputArchive + from lsst.images.zarr._model import ZarrDocument HAVE_ZARR = True except ImportError: HAVE_ZARR = False +class _CountingStore(zarr.storage.MemoryStore if HAVE_ZARR else object): + """A `zarr.storage.MemoryStore` that counts ``get`` calls. + + The counter is shared across instances created by zarr's + ``with_read_only`` so the test sees every read regardless of which + store wrapper handles it. 
+ """ + + _shared_counter: list[int] = [0] + + def __init__(self, *args, **kwargs) -> None: + super().__init__(*args, **kwargs) + + @property + def reads(self) -> int: + return self._shared_counter[0] + + @reads.setter + def reads(self, value: int) -> None: + self._shared_counter[0] = value + + async def get(self, key, prototype, byte_range=None): + self._shared_counter[0] += 1 + return await super().get(key, prototype, byte_range) + + @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") class ZarrInputArchiveSkeletonTestCase(unittest.TestCase): """Open + version validation + ``get_tree``.""" @@ -79,5 +107,142 @@ def test_future_version_refused(self) -> None: pass +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrInputArchiveLazySubsetTestCase(unittest.TestCase): + """Subset reads only fetch chunks intersecting the slice.""" + + def test_subset_read_touches_only_intersecting_chunks(self) -> None: + store = _CountingStore() + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes( + { + LSST_NS: { + "version": LSST_VERSION, + "archive_class": "Image", + "tree": "tree", + } + } + ) + zarr_array = root.create_array(name="image", shape=(16, 16), chunks=(4, 4), dtype="float32") + zarr_array[:] = np.arange(256, dtype=np.float32).reshape(16, 16) + # Stub /tree so the input archive's constructor accepts the file. 
+ tree_arr = root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8") + tree_arr[:] = np.frombuffer(b"{}", dtype=np.uint8) + + doc = ZarrDocument.from_zarr(store) + archive = ZarrInputArchive(doc) + + store.reads = 0 + full_ref = ArrayReferenceModel( + source="zarr:/image", + shape=[16, 16], + datatype=NumberType.from_numpy(np.dtype("float32")), + ) + full = archive.get_array(full_ref) + full_reads = store.reads + self.assertEqual(full.shape, (16, 16)) + + store.reads = 0 + subset = archive.get_array(full_ref, slices=(slice(0, 4), slice(0, 4))) + subset_reads = store.reads + self.assertEqual(subset.shape, (4, 4)) + np.testing.assert_array_equal(subset, np.arange(256).reshape(16, 16)[:4, :4]) + self.assertLess(subset_reads, full_reads) + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrInputArchiveMaskUnpackTestCase(unittest.TestCase): + """Round-trip a packed 2-D mask through ``get_array``'s unpack path.""" + + def test_unpack_2d_packed_back_to_3d(self) -> None: + store = zarr.storage.MemoryStore() + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes( + { + LSST_NS: { + "version": LSST_VERSION, + "archive_class": "Mask", + "tree": "tree", + } + } + ) + # 4x5 mask, 3 planes -> packed in uint8. 
+ on_disk = np.zeros((4, 5), dtype=np.uint8) + on_disk[0, 0] = 0b001 + on_disk[1, 1] = 0b110 + mask_array = root.create_array(name="mask", shape=(4, 5), chunks=(4, 5), dtype="uint8") + mask_array[:] = on_disk + mask_array.update_attributes( + { + "_ARRAY_DIMENSIONS": ["y", "x"], + "flag_masks": [1, 2, 4], + "flag_meanings": "BAD SAT CR", + "flag_descriptions": ["Bad pixel.", "Saturated.", "Cosmic ray."], + } + ) + tree_arr = root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8") + tree_arr[:] = np.frombuffer(b"{}", dtype=np.uint8) + + doc = ZarrDocument.from_zarr(store) + archive = ZarrInputArchive(doc) + + # The model claims 3-D shape (mask_size=1 because <=8 planes); the + # native layout is (mask_size, y, x) — what Mask.serialize emits. + model = ArrayReferenceModel( + source="zarr:/mask", + shape=[1, 4, 5], + datatype=NumberType.from_numpy(np.dtype("uint8")), + ) + result = archive.get_array(model) + self.assertEqual(result.shape, (1, 4, 5)) + self.assertEqual(result[0, 0, 0], 0b001) + self.assertEqual(result[0, 1, 1], 0b110) + + def test_unpack_uint64_with_5_bytes(self) -> None: + # 40 planes packed into uint64 -> mask_size = 5. + store = zarr.storage.MemoryStore() + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes( + { + LSST_NS: { + "version": LSST_VERSION, + "archive_class": "Mask", + "tree": "tree", + } + } + ) + on_disk = np.zeros((4, 5), dtype=np.uint64) + on_disk[0, 0] = 0x01_02_03_04_05 # arbitrary bit pattern + mask_array = root.create_array(name="mask", shape=(4, 5), chunks=(4, 5), dtype="uint64") + mask_array[:] = on_disk + mask_array.update_attributes( + { + "_ARRAY_DIMENSIONS": ["y", "x"], + "flag_masks": [1 << i for i in range(40)], + "flag_meanings": " ".join(f"P{i}" for i in range(40)), + "flag_descriptions": [f"Plane {i}." 
for i in range(40)], + } + ) + tree_arr = root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8") + tree_arr[:] = np.frombuffer(b"{}", dtype=np.uint8) + + doc = ZarrDocument.from_zarr(store) + archive = ZarrInputArchive(doc) + + model = ArrayReferenceModel( + source="zarr:/mask", + shape=[5, 4, 5], + datatype=NumberType.from_numpy(np.dtype("uint8")), + ) + result = archive.get_array(model) + self.assertEqual(result.shape, (5, 4, 5)) + # Bytes recovered from the packed uint64. + self.assertEqual(result[0, 0, 0], 0x05) # low byte + self.assertEqual(result[1, 0, 0], 0x04) + self.assertEqual(result[2, 0, 0], 0x03) + self.assertEqual(result[3, 0, 0], 0x02) + self.assertEqual(result[4, 0, 0], 0x01) + + if __name__ == "__main__": unittest.main() From 86c325972606775a5640b6dcc4060f7688c054ac Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 06:13:48 -0700 Subject: [PATCH 23/60] feat: implement deserialize_pointer, get_table, get_structured_array, public read() Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/__init__.py | 1 + python/lsst/images/zarr/_input_archive.py | 82 +++++++++++++++++--- tests/test_zarr_input_archive.py | 94 ++++++++++++++++++++++- 3 files changed, 163 insertions(+), 14 deletions(-) diff --git a/python/lsst/images/zarr/__init__.py b/python/lsst/images/zarr/__init__.py index 44441061..ee480a05 100644 --- a/python/lsst/images/zarr/__init__.py +++ b/python/lsst/images/zarr/__init__.py @@ -37,4 +37,5 @@ ) from e from ._common import * +from ._input_archive import * from ._output_archive import * diff --git a/python/lsst/images/zarr/_input_archive.py b/python/lsst/images/zarr/_input_archive.py index 8b31c0d1..5a245259 100644 --- a/python/lsst/images/zarr/_input_archive.py +++ b/python/lsst/images/zarr/_input_archive.py @@ -31,6 +31,8 @@ ArrayReferenceModel, InlineArrayModel, InputArchive, + ReadResult, + TableModel, no_header_updates, ) from ._common import LSST_VERSION, ZarrPointerModel @@ -81,11 +83,36 
@@ def _validate_root_attributes(self) -> None: f"Unsupported lsst:version {version}; this reader supports up to {LSST_VERSION}." ) - def deserialize_pointer(self, *args: Any, **kwargs: Any) -> Any: - raise NotImplementedError("deserialize_pointer lands in Task 3.3") - - def get_frame_set(self, *args: Any, **kwargs: Any) -> Any: - raise NotImplementedError("get_frame_set lands in Task 3.3") + def deserialize_pointer[U: ArchiveTree, V]( + self, + pointer: ZarrPointerModel, + model_type: type[U], + deserializer: Callable[[U, InputArchive[ZarrPointerModel]], V], + ) -> V: + if (cached := self._deserialized_pointer_cache.get(pointer.path)) is not None: + return cached + try: + node = self._document.root.get(pointer.path) + except KeyError: + raise ArchiveReadError(f"Pointer reference {pointer.path!r} not in store.") from None + if not isinstance(node, ZarrArray): + raise ArchiveReadError(f"Pointer target {pointer.path!r} is not an array.") + json_text = bytes(node.read()).decode("utf-8") + model = model_type.model_validate_json(json_text) + result = deserializer(model, self) + self._deserialized_pointer_cache[pointer.path] = result + if isinstance(result, FrameSet): + self._frame_set_cache[pointer.path] = result + return result + + def get_frame_set(self, pointer: ZarrPointerModel) -> FrameSet: + try: + return self._frame_set_cache[pointer.path] + except KeyError: + raise AssertionError( + f"Frame set at {pointer.path!r} must be deserialised via " + f"deserialize_pointer before any dependent transform can be." 
+ ) from None def get_array( self, @@ -159,13 +186,44 @@ def _read_packed_mask( return out return out[byte_slice] - def get_table(self, *args: Any, **kwargs: Any) -> Any: - raise NotImplementedError("get_table lands in Task 3.4") + def get_table( + self, + model: TableModel, + strip_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> astropy.table.Table: + result = astropy.table.Table(meta=model.meta) + for column_model in model.columns: + if isinstance(column_model.data, InlineArrayModel): + data: Any = column_model.data.data + else: + data = self.get_array(column_model.data, strip_header=strip_header) + result[column_model.name] = astropy.table.Column( + data, + name=column_model.name, + dtype=column_model.data.datatype.to_numpy(), + unit=column_model.unit, + description=column_model.description, + meta=column_model.meta, + ) + return result + + def get_structured_array( + self, + model: TableModel, + strip_header: Callable[[astropy.io.fits.Header], None] = no_header_updates, + ) -> np.ndarray: + return self.get_table(model, strip_header).as_array() - def get_structured_array(self, *args: Any, **kwargs: Any) -> Any: - raise NotImplementedError("get_structured_array lands in Task 3.4") +def read[T: Any](cls: type[T], path: ResourcePathExpression, **kwargs: Any) -> ReadResult[T]: + """Read an object from a zarr archive. -def read(*args: Any, **kwargs: Any) -> Any: - """Public read helper. Implemented in Task 3.5.""" - raise NotImplementedError("read() lands in Task 3.5") + The archive's root attributes name the in-memory class via + ``lsst.archive_class``. Files without this attribute raise; auto- + detect of foreign zarr files is a follow-up. 
+ """ + with ZarrInputArchive.open(path) as archive: + tree_type = cls._get_archive_tree_type(ZarrPointerModel) + tree = archive.get_tree(tree_type) + obj = tree.deserialize(archive, **kwargs) + return ReadResult(obj, tree.metadata, tree.butler_info) diff --git a/tests/test_zarr_input_archive.py b/tests/test_zarr_input_archive.py index f2238a38..f59f8116 100644 --- a/tests/test_zarr_input_archive.py +++ b/tests/test_zarr_input_archive.py @@ -25,10 +25,11 @@ import zarr from lsst.images.serialization import ArrayReferenceModel, NumberType - from lsst.images.zarr import write + from lsst.images.zarr import ZarrPointerModel, read, write from lsst.images.zarr._common import LSST_NS, LSST_VERSION from lsst.images.zarr._input_archive import ZarrInputArchive - from lsst.images.zarr._model import ZarrDocument + from lsst.images.zarr._model import ZarrArray, ZarrDocument + from lsst.images.zarr._output_archive import ZarrOutputArchive HAVE_ZARR = True except ImportError: @@ -244,5 +245,94 @@ def test_unpack_uint64_with_5_bytes(self) -> None: self.assertEqual(result[4, 0, 0], 0x01) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrInputArchivePointerTestCase(unittest.TestCase): + """``deserialize_pointer`` cache + JSON sub-tree handling.""" + + def test_deserialize_pointer_caches_results(self) -> None: + import pydantic + + class _Sub(pydantic.BaseModel): + label: str + + store = zarr.storage.MemoryStore() + root = zarr.create_group(store=store, zarr_format=3) + root.update_attributes({LSST_NS: {"version": LSST_VERSION, "archive_class": "Image", "tree": "tree"}}) + # Stub /tree. + tree_arr = root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8") + tree_arr[:] = np.frombuffer(b"{}", dtype=np.uint8) + # Sub-archive with its own /tree at /psf/tree. 
+ json_bytes = b'{"label": "psf"}' + psf = root.create_group("psf") + arr = psf.create_array( + name="tree", shape=(len(json_bytes),), chunks=(len(json_bytes),), dtype="uint8" + ) + arr[:] = np.frombuffer(json_bytes, dtype=np.uint8) + + doc = ZarrDocument.from_zarr(store) + archive = ZarrInputArchive(doc) + + deserialize_calls: list[int] = [] + + def deserializer(model, arch): + deserialize_calls.append(1) + return model + + pointer = ZarrPointerModel(path="/psf/tree") + first = archive.deserialize_pointer(pointer, _Sub, deserializer) + second = archive.deserialize_pointer(pointer, _Sub, deserializer) + self.assertEqual(first.label, "psf") + self.assertIs(first, second) + self.assertEqual(len(deserialize_calls), 1) + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrInputArchiveTableTestCase(unittest.TestCase): + """``get_table`` reconstructs columns via ``get_array``.""" + + def test_get_table_reconstructs_columns(self) -> None: + import astropy.table + + out = ZarrOutputArchive() + out.document.root.attributes.lsst["archive_class"] = "Image" + out.document.root.attributes.lsst["tree"] = "tree" + out.document.root.arrays["tree"] = ZarrArray(data=np.frombuffer(b"{}", dtype=np.uint8)) + original = astropy.table.Table( + { + "x": np.arange(4, dtype=np.int32), + "y": np.arange(4, dtype=np.float32), + } + ) + model = out.add_table(original, name="cat") + + store = zarr.storage.MemoryStore() + out.document.to_zarr(store) + doc = ZarrDocument.from_zarr(store) + inp = ZarrInputArchive(doc) + + recovered = inp.get_table(model) + self.assertEqual(recovered.colnames, ["x", "y"]) + np.testing.assert_array_equal(recovered["x"], original["x"]) + np.testing.assert_array_equal(recovered["y"], original["y"]) + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrReadHelperTestCase(unittest.TestCase): + """End-to-end public ``read()`` round-trip.""" + + def test_round_trip_image(self) -> None: + original = Image( + np.arange(20, 
dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(original, target) + result = read(Image, target) + self.assertEqual(result.deserialized.array.shape, (4, 5)) + np.testing.assert_array_equal(result.deserialized.array, original.array) + self.assertEqual(result.deserialized.bbox, original.bbox) + + if __name__ == "__main__": unittest.main() From 1acb1aa399456e38cabc3f8d19245b40cbabe402 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 06:36:56 -0700 Subject: [PATCH 24/60] feat: add RoundtripZarr helper and round-trip tests for Image/MaskedImage Round-trips Image, MaskedImage (3-plane), and MaskedImage (40-plane) through the zarr backend via the new RoundtripZarr helper. RoundtripZarr overrides _run_without_butler to use a TemporaryDirectory since zarr archives are directories, not single files. Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/tests/_roundtrip.py | 44 +++++++++++- python/lsst/images/zarr/_input_archive.py | 13 ++-- tests/test_zarr_input_archive.py | 11 +-- tests/test_zarr_round_trip.py | 82 +++++++++++++++++++++++ 4 files changed, 138 insertions(+), 12 deletions(-) create mode 100644 tests/test_zarr_round_trip.py diff --git a/python/lsst/images/tests/_roundtrip.py b/python/lsst/images/tests/_roundtrip.py index dfb987e6..1531fedb 100644 --- a/python/lsst/images/tests/_roundtrip.py +++ b/python/lsst/images/tests/_roundtrip.py @@ -11,7 +11,7 @@ from __future__ import annotations -__all__ = ("RoundtripFits", "RoundtripJson", "RoundtripNdf", "TemporaryButler") +__all__ = ("RoundtripFits", "RoundtripJson", "RoundtripNdf", "RoundtripZarr", "TemporaryButler") import tempfile import unittest @@ -328,3 +328,45 @@ def _read(self, obj_type: Any, filename: str) -> ReadResult: from .. import ndf return ndf.read(obj_type, filename) + + +class RoundtripZarr[T](RoundtripBase[T]): + """Round-trip helper for the zarr backend. 
+ + Zarr archives are directories rather than single files, so the + base class's ``NamedTemporaryFile`` pattern doesn't fit. + `_run_without_butler` is overridden to use a `TemporaryDirectory` + and a fresh archive path inside it. + """ + + def inspect(self) -> Any: + """Open the zarr archive's IR for inspection.""" + import zarr as _zarr + + from ..zarr._model import ZarrDocument + + return ZarrDocument.from_zarr(_zarr.storage.LocalStore(self.filename, read_only=True)) + + def _get_extension(self) -> str: + return ".zarr" + + def _write(self, obj: Any, filename: str) -> ArchiveTree: + from .. import zarr as zarr_backend + + return zarr_backend.write(obj, filename) + + def _read(self, obj_type: Any, filename: str) -> ReadResult: + from .. import zarr as zarr_backend + + return zarr_backend.read(obj_type, filename) + + def _run_without_butler(self) -> None: + import os + + parent = self._exit_stack.enter_context(tempfile.TemporaryDirectory()) + target = os.path.join(parent, f"out{self._get_extension()}") + self._filename = target + self._serialized = self._write(self._original, target) + read_result = self._read(type(self._original), target) + self._tc.assertIsNone(read_result.butler_info) + self.result = read_result.deserialized diff --git a/python/lsst/images/zarr/_input_archive.py b/python/lsst/images/zarr/_input_archive.py index 5a245259..1b602038 100644 --- a/python/lsst/images/zarr/_input_archive.py +++ b/python/lsst/images/zarr/_input_archive.py @@ -159,13 +159,14 @@ def _read_packed_mask( ) -> np.ndarray: """Unpack a 2-D wide-int mask back to 3-D ``(mask_size, y, x)``. - ``Mask.serialize`` produces ``(mask_size, y, x)`` so we restore - that layout. ``slices`` is forwarded to the lazy handle as-is - when it has rank 2 (operating on the on-disk shape); rank-3 - slices have their first axis stripped and re-applied after - unpack. 
+ Mask deserialization expects the storage layout that + ``Mask.serialize`` streamed — ``(mask_size, y, x)`` — and does + the swap to ``(y, x, mask_size)`` itself. Rank-3 ``slices`` + from the deserializer are ``(byte_axis, y_slice, x_slice)``; + the byte axis is stripped before forwarding the spatial slice + to the lazy handle and re-applied to the unpacked output. """ - mask_size = claimed_shape[0] + mask_size = claimed_shape[-1] # Forward slice to the lazy handle so only intersecting chunks # are fetched even on remote stores. if slices is ...: diff --git a/tests/test_zarr_input_archive.py b/tests/test_zarr_input_archive.py index f59f8116..df00f014 100644 --- a/tests/test_zarr_input_archive.py +++ b/tests/test_zarr_input_archive.py @@ -187,11 +187,12 @@ def test_unpack_2d_packed_back_to_3d(self) -> None: doc = ZarrDocument.from_zarr(store) archive = ZarrInputArchive(doc) - # The model claims 3-D shape (mask_size=1 because <=8 planes); the - # native layout is (mask_size, y, x) — what Mask.serialize emits. + # The model records (y, x, mask_size) but the storage layout is the + # transposed (mask_size, y, x) — Mask.deserialize does the final + # moveaxis to recover (y, x, mask_size). model = ArrayReferenceModel( source="zarr:/mask", - shape=[1, 4, 5], + shape=[4, 5, 1], datatype=NumberType.from_numpy(np.dtype("uint8")), ) result = archive.get_array(model) @@ -232,12 +233,12 @@ def test_unpack_uint64_with_5_bytes(self) -> None: model = ArrayReferenceModel( source="zarr:/mask", - shape=[5, 4, 5], + shape=[4, 5, 5], datatype=NumberType.from_numpy(np.dtype("uint8")), ) result = archive.get_array(model) self.assertEqual(result.shape, (5, 4, 5)) - # Bytes recovered from the packed uint64. + # Bytes recovered from the packed uint64 (mask_size, y, x order). 
self.assertEqual(result[0, 0, 0], 0x05) # low byte self.assertEqual(result[1, 0, 0], 0x04) self.assertEqual(result[2, 0, 0], 0x03) diff --git a/tests/test_zarr_round_trip.py b/tests/test_zarr_round_trip.py new file mode 100644 index 00000000..b7a53479 --- /dev/null +++ b/tests/test_zarr_round_trip.py @@ -0,0 +1,82 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import unittest + +import numpy as np + +from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema + +try: + import zarr # noqa: F401 + + from lsst.images.tests import RoundtripZarr + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrRoundTripTestCase(unittest.TestCase): + """Full write -> read round-trips for the supported image types.""" + + def test_image_round_trip(self) -> None: + original = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + with RoundtripZarr(self, original) as roundtrip: + recovered = roundtrip.result + np.testing.assert_array_equal(recovered.array, original.array) + self.assertEqual(recovered.bbox, original.bbox) + + def test_masked_image_round_trip(self) -> None: + schema = MaskSchema( + [ + MaskPlane("BAD", "Bad pixel."), + MaskPlane("SAT", "Saturated."), + MaskPlane("CR", "Cosmic ray."), + ] + ) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + original = MaskedImage(image, mask_schema=schema) + original.mask.set("BAD", image.array % 2 == 0) + original.mask.set("SAT", image.array > 10) + + with 
RoundtripZarr(self, original) as roundtrip: + recovered = roundtrip.result + np.testing.assert_array_equal(recovered.image.array, original.image.array) + np.testing.assert_array_equal(recovered.mask.array, original.mask.array) + + def test_masked_image_with_40_planes_round_trip(self) -> None: + schema = MaskSchema([MaskPlane(f"P{i}", f"Plane {i}.") for i in range(40)]) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + original = MaskedImage(image, mask_schema=schema) + original.mask.set("P0", image.array % 2 == 0) + original.mask.set("P39", image.array > 10) + + with RoundtripZarr(self, original) as roundtrip: + recovered = roundtrip.result + # 40 planes packed into uint64 on disk; recovered as 5 bytes/pixel. + np.testing.assert_array_equal(recovered.mask.array, original.mask.array) + + +if __name__ == "__main__": + unittest.main() From ca403201ae84e12413868f392dd0083caea2600b Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 06:41:22 -0700 Subject: [PATCH 25/60] feat: decorate sub-archives, ColorImage round-trip, CellCoadd PSF single-cell chunks - Add decorate_sub_archives that walks the IR and adds lsst.archive_class + ome.multiscales to any sub-group containing an image array (Phase 4.1). - Pin ColorImage write/round-trip behaviour: each channel becomes a 2-D array at root with no nested sub-archive, since ColorImage.serialize produces flat per-channel arrays (Phase 4.2). - Default a 4-D psf array to single-cell chunks (1, 1, Py, Px) for CellCoadd-style PSF storage (Phase 4.3). CellCoadd-specific layout and round-trip tests (Phase 4.4, 4.5) are deferred as the plan's _make_minimal_cell_coadd factories are implementer-supplied placeholders. 
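The single-cell chunk default from the third bullet can be sketched as a standalone rule. This is a minimal illustration under stated assumptions, not the code in `add_array` itself: the function name `default_psf_chunks` and its argument list are hypothetical, and the rule is assumed to apply only to a 4-D array named `psf` sitting directly under the root group.

```python
def default_psf_chunks(shape, leaf, parent_path, chunks=None):
    """Return the chunk shape for an array (hypothetical helper).

    Mirrors the rule described above: a 4-D array named ``psf`` at the
    archive root gets one cell per chunk unless the caller already
    supplied an explicit chunk shape.
    """
    if chunks is None and leaf == "psf" and len(shape) == 4 and parent_path == "/":
        # Leading (cell_y, cell_x) axes are chunked one cell at a time;
        # the trailing (Py, Px) stamp axes stay whole.
        return (1, 1, shape[2], shape[3])
    # Either the caller chose chunks explicitly, or the rule does not apply.
    return chunks


single_cell = default_psf_chunks((2, 3, 21, 21), leaf="psf", parent_path="/")
overridden = default_psf_chunks(
    (2, 3, 21, 21), leaf="psf", parent_path="/", chunks=(2, 3, 21, 21)
)
```

An explicit `chunks` value always wins in this sketch, which matches the override case covered by the tests added in `tests/test_zarr_output_archive.py`.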
Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_layout.py | 37 +++++++++++++++ python/lsst/images/zarr/_output_archive.py | 17 ++++++- tests/test_zarr_layout.py | 38 ++++++++++++++++ tests/test_zarr_output_archive.py | 52 ++++++++++++++++++++++ tests/test_zarr_round_trip.py | 19 ++++++++ 5 files changed, 162 insertions(+), 1 deletion(-) diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index 14ee49c4..b317fa85 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -32,6 +32,7 @@ "axes_for_archive_class", "chunks_aligned_to", "chunks_for", + "decorate_sub_archives", ) from collections.abc import Mapping @@ -230,3 +231,39 @@ def _angular_separation(a: Any, b: Any) -> Any: 1.0, ) ) + + +def decorate_sub_archives(document: Any) -> None: + """Decorate sub-archive groups with ``lsst.archive_class`` and OME attrs. + + A sub-archive is any group below the root that contains an + ``image`` array. Decoration adds ``lsst.archive_class = "Image"`` + and an ``ome.multiscales`` block pointing at the sub-archive's + ``image`` array. Recursive: nested sub-archives are decorated too. + + The root group is left alone — its ``lsst.archive_class`` is set + by ``add_tree`` based on the in-memory object's type. 
+ """ + from ._model import ZarrDocument + + if not isinstance(document, ZarrDocument): + raise TypeError(type(document).__name__) + _decorate_walk(document.root) + + +def _decorate_walk(group: Any) -> None: + from ._model import OmeMultiscale + + for sub in group.groups.values(): + if "image" in sub.arrays: + sub.attributes.lsst.setdefault("archive_class", "Image") + if "tree" in sub.arrays: + sub.attributes.lsst.setdefault("tree", "tree") + if "multiscales" not in sub.attributes.ome: + multiscale = OmeMultiscale( + name="image", + axes=("y", "x"), + dataset_path="image", + ) + sub.attributes.ome["multiscales"] = [multiscale.dump()] + _decorate_walk(sub) diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index 83ba63ca..f97bdb45 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -41,7 +41,13 @@ archive_path_to_zarr_path, mask_dtype_for_plane_count, ) -from ._layout import affine_check, axes_for_archive_class, chunks_aligned_to, chunks_for +from ._layout import ( + affine_check, + axes_for_archive_class, + chunks_aligned_to, + chunks_for, + decorate_sub_archives, +) from ._model import ( CfFlagAttributes, MaskPlaneEntry, @@ -207,6 +213,10 @@ def add_array( archive_metadata=self._archive_metadata, ) + # Default chunks for a CellCoadd-style 4-D PSF: one cell per chunk. + if chunks is None and leaf == "psf" and array.ndim == 4 and parent_path == "/": + chunks = (1, 1, array.shape[2], array.shape[3]) + ir_array = ZarrArray( data=np.ascontiguousarray(array), chunks=chunks, @@ -357,6 +367,11 @@ def add_tree(self, tree: ArchiveTree) -> None: ) self.document.root.attributes.ome["multiscales"] = [multiscale.dump()] + # Walk sub-groups and decorate each one that holds an ``image`` + # array (e.g. ``ColorImage`` channels) as its own valid Image + # sub-archive with OME multiscales. 
+ decorate_sub_archives(self.document) + def _stage_wcs_ast(self, frame_set: FrameSet) -> str: """Encode an AST FrameSet as UTF-8 text and stage at /wcs_ast.""" stream = StringStream() diff --git a/tests/test_zarr_layout.py b/tests/test_zarr_layout.py index 51b18a84..648fc941 100644 --- a/tests/test_zarr_layout.py +++ b/tests/test_zarr_layout.py @@ -130,5 +130,43 @@ def test_high_distortion_drops_block(self) -> None: self.assertGreater(result.max_residual_pixels, 1.0) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class DecorateSubArchivesTestCase(unittest.TestCase): + """`decorate_sub_archives` walks the IR and adds OME / lsst attrs.""" + + def test_sub_group_with_image_gets_lsst_and_ome_attrs(self) -> None: + import numpy as np + + from lsst.images.zarr._layout import decorate_sub_archives + from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup + + doc = ZarrDocument(root=ZarrGroup()) + doc.root.attributes.lsst["archive_class"] = "ColorImage" + red = doc.root.ensure_group("/red") + red.arrays["image"] = ZarrArray(data=np.ones((4, 5), dtype="float32")) + + decorate_sub_archives(doc) + + self.assertEqual(red.attributes.lsst["archive_class"], "Image") + self.assertIn("multiscales", red.attributes.ome) + self.assertEqual(red.attributes.ome["multiscales"][0]["datasets"][0]["path"], "image") + + def test_root_archive_class_is_unchanged(self) -> None: + import numpy as np + + from lsst.images.zarr._layout import decorate_sub_archives + from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup + + doc = ZarrDocument(root=ZarrGroup()) + doc.root.attributes.lsst["archive_class"] = "ColorImage" + red = doc.root.ensure_group("/red") + red.arrays["image"] = ZarrArray(data=np.ones((4, 5), dtype="float32")) + + decorate_sub_archives(doc) + + # Root keeps ColorImage; only sub-groups are decorated. 
+ self.assertEqual(doc.root.attributes.lsst["archive_class"], "ColorImage") + + if __name__ == "__main__": unittest.main() diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py index 8212892e..bc3e6696 100644 --- a/tests/test_zarr_output_archive.py +++ b/tests/test_zarr_output_archive.py @@ -248,5 +248,57 @@ def test_masked_image_layout(self) -> None: ) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrColorImageWriteTestCase(unittest.TestCase): + """ColorImage emits decorated red/green/blue sub-archives.""" + + def test_color_image_emits_per_channel_arrays(self) -> None: + from lsst.images import ColorImage + + arr = np.zeros((4, 5, 3), dtype=np.uint8) + arr[..., 0] = 1 + arr[..., 1] = 2 + arr[..., 2] = 3 + color = ColorImage(arr, bbox=Box.factory[10:14, 20:25]) + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(color, target) + with zarr.storage.LocalStore(target, read_only=True) as store: + doc = ZarrDocument.from_zarr(store) + # Root: ColorImage, no ome.multiscales + # (axes_for_archive_class returns () for ColorImage). + self.assertEqual(doc.root.attributes.lsst["archive_class"], "ColorImage") + self.assertNotIn("multiscales", doc.root.attributes.ome) + # Each channel is a top-level 2-D array. 
+ for channel in ("red", "green", "blue"): + self.assertIn(channel, doc.root.arrays) + self.assertEqual(doc.root.arrays[channel].shape, (4, 5)) + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrPsfChunkingTestCase(unittest.TestCase): + """`add_array` defaults a 4-D ``psf`` array to single-cell chunks.""" + + def test_psf_array_uses_single_cell_chunks(self) -> None: + psf = np.zeros((2, 3, 21, 21), dtype=np.float32) + archive = ZarrOutputArchive(archive_class="CellCoadd") + ref = archive.add_array(psf, name="psf") + self.assertEqual(ref.source, "zarr:/psf") + node = archive.document.root.get("/psf") + # Single-cell chunks: leading axes are 1; spatial axes match shape. + self.assertEqual(tuple(node.chunks), (1, 1, 21, 21)) + + def test_psf_user_override_wins(self) -> None: + psf = np.zeros((2, 3, 21, 21), dtype=np.float32) + archive = ZarrOutputArchive( + archive_class="CellCoadd", + chunks={"psf": (2, 3, 21, 21)}, + ) + archive.add_array(psf, name="psf") + node = archive.document.root.get("/psf") + self.assertEqual(tuple(node.chunks), (2, 3, 21, 21)) + + if __name__ == "__main__": unittest.main() diff --git a/tests/test_zarr_round_trip.py b/tests/test_zarr_round_trip.py index b7a53479..3159eb1c 100644 --- a/tests/test_zarr_round_trip.py +++ b/tests/test_zarr_round_trip.py @@ -78,5 +78,24 @@ def test_masked_image_with_40_planes_round_trip(self) -> None: np.testing.assert_array_equal(recovered.mask.array, original.mask.array) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrColorImageRoundTripTestCase(unittest.TestCase): + """ColorImage round-trips through the zarr backend.""" + + def test_color_image_round_trip(self) -> None: + from lsst.images import ColorImage + + arr = np.zeros((4, 5, 3), dtype=np.uint8) + arr[..., 0] = 1 + arr[..., 1] = 2 + arr[..., 2] = 3 + original = ColorImage(arr, bbox=Box.factory[10:14, 20:25]) + + with RoundtripZarr(self, original) as roundtrip: + recovered = roundtrip.result + 
np.testing.assert_array_equal(recovered.array, original.array) + self.assertEqual(recovered.bbox, original.bbox) + + if __name__ == "__main__": unittest.main() From e6b7e14fff60b3c8712defc695a3373adf1fffa8 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 06:54:31 -0700 Subject: [PATCH 26/60] feat: FITS opaque-metadata round-trip, FITS-Zarr cross-format, xarray interop - Persist FitsOpaqueMetadata at /lsst/opaque_metadata/fits/primary on zarr write; restore on read (Phase 5.1, 5.2). - Add FITS -> Zarr -> FITS cross-format test confirming primary-HDU cards survive the round-trip (Phase 5.3). - Add xarray.open_zarr interop test pinning Dataset shape and CF flag attrs on the mask variable (Phase 5.4, skipped when xarray absent). - Add optional ome-zarr-py and ngff-validator compliance tests (Phase 5.5, 5.6, both skipped when absent). Also moves all imports to top-of-file across the zarr backend per project convention. Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_input_archive.py | 12 +++- python/lsst/images/zarr/_layout.py | 56 ++++++++++++++-- python/lsst/images/zarr/_output_archive.py | 10 ++- tests/test_zarr_cross_format.py | 66 ++++++++++++++++++ tests/test_zarr_external_reader.py | 66 ++++++++++++++++++ tests/test_zarr_input_archive.py | 35 ++++++++-- tests/test_zarr_layout.py | 14 ++-- tests/test_zarr_model.py | 23 +++---- tests/test_zarr_ome_compliance.py | 78 ++++++++++++++++++++++ tests/test_zarr_output_archive.py | 41 ++++++++++-- tests/test_zarr_round_trip.py | 4 +- tests/test_zarr_xarray_interop.py | 75 +++++++++++++++++++++ 12 files changed, 436 insertions(+), 44 deletions(-) create mode 100644 tests/test_zarr_cross_format.py create mode 100644 tests/test_zarr_external_reader.py create mode 100644 tests/test_zarr_ome_compliance.py create mode 100644 tests/test_zarr_xarray_interop.py diff --git a/python/lsst/images/zarr/_input_archive.py b/python/lsst/images/zarr/_input_archive.py index 1b602038..347d9a44 100644 
--- a/python/lsst/images/zarr/_input_archive.py +++ b/python/lsst/images/zarr/_input_archive.py @@ -25,6 +25,7 @@ from lsst.resources import ResourcePathExpression from .._transforms import FrameSet +from ..fits._common import FitsOpaqueMetadata from ..serialization import ( ArchiveReadError, ArchiveTree, @@ -36,7 +37,9 @@ no_header_updates, ) from ._common import LSST_VERSION, ZarrPointerModel +from ._layout import deserialize_fits_opaque_metadata from ._model import ZarrArray, ZarrDocument +from ._store import open_store_for_read class ZarrInputArchive(InputArchive[ZarrPointerModel]): @@ -47,13 +50,16 @@ def __init__(self, document: ZarrDocument) -> None: self._validate_root_attributes() self._deserialized_pointer_cache: dict[str, Any] = {} self._frame_set_cache: dict[str, FrameSet] = {} + self._opaque_metadata = deserialize_fits_opaque_metadata(document) + + def get_opaque_metadata(self) -> FitsOpaqueMetadata | None: + """Return any FITS opaque metadata recovered from the archive.""" + return self._opaque_metadata @classmethod @contextmanager def open(cls, path: ResourcePathExpression) -> Iterator[Self]: """Open a zarr archive for reading.""" - from ._store import open_store_for_read - with open_store_for_read(path) as store: doc = ZarrDocument.from_zarr(store) yield cls(doc) @@ -227,4 +233,6 @@ def read[T: Any](cls: type[T], path: ResourcePathExpression, **kwargs: Any) -> R tree_type = cls._get_archive_tree_type(ZarrPointerModel) tree = archive.get_tree(tree_type) obj = tree.deserialize(archive, **kwargs) + if (opaque := archive.get_opaque_metadata()) is not None: + obj._opaque_metadata = opaque return ReadResult(obj, tree.metadata, tree.butler_info) diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index b317fa85..0e0fd05e 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -33,14 +33,21 @@ "chunks_aligned_to", "chunks_for", "decorate_sub_archives", + "deserialize_fits_opaque_metadata", 
+ "serialize_fits_opaque_metadata", ) +import json from collections.abc import Mapping from dataclasses import dataclass from typing import Any +import astropy.io.fits import numpy as np +from ..fits._common import ExtensionKey, FitsOpaqueMetadata +from ._model import OmeMultiscale, ZarrArray, ZarrDocument + _DEFAULT_AXIS_LIMIT = 1024 @@ -233,7 +240,7 @@ def _angular_separation(a: Any, b: Any) -> Any: ) -def decorate_sub_archives(document: Any) -> None: +def decorate_sub_archives(document: ZarrDocument) -> None: """Decorate sub-archive groups with ``lsst.archive_class`` and OME attrs. A sub-archive is any group below the root that contains an @@ -244,16 +251,12 @@ def decorate_sub_archives(document: Any) -> None: The root group is left alone — its ``lsst.archive_class`` is set by ``add_tree`` based on the in-memory object's type. """ - from ._model import ZarrDocument - if not isinstance(document, ZarrDocument): raise TypeError(type(document).__name__) _decorate_walk(document.root) def _decorate_walk(group: Any) -> None: - from ._model import OmeMultiscale - for sub in group.groups.values(): if "image" in sub.arrays: sub.attributes.lsst.setdefault("archive_class", "Image") @@ -267,3 +270,46 @@ def _decorate_walk(group: Any) -> None: ) sub.attributes.ome["multiscales"] = [multiscale.dump()] _decorate_walk(sub) + + +def serialize_fits_opaque_metadata(document: ZarrDocument, opaque: FitsOpaqueMetadata) -> None: + """Stage a `FitsOpaqueMetadata` object into the IR. + + Stores the primary-HDU header as a JSON-encoded ``uint8`` array at + ``/lsst/opaque_metadata/fits/primary`` and sets the + ``lsst.opaque_metadata_format`` attribute on the root group. + No-op if the metadata is empty or missing a primary header. 
+ """ + primary = opaque.headers.get(ExtensionKey()) + if primary is None or len(primary) == 0: + return + cards = {card.keyword: card.value for card in primary.cards if card.keyword} + json_bytes = json.dumps(cards).encode("utf-8") + parent = document.root.ensure_group("/lsst/opaque_metadata/fits") + parent.arrays["primary"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) + document.root.attributes.lsst["opaque_metadata_format"] = "fits" + + +def deserialize_fits_opaque_metadata(document: ZarrDocument) -> FitsOpaqueMetadata | None: + """Reconstruct a `FitsOpaqueMetadata` from the IR, or return None. + + Returns ``None`` when the archive does not have a FITS opaque + metadata block (the common case for archives that originated as + native zarr). + """ + if document.root.attributes.lsst.get("opaque_metadata_format") != "fits": + return None + try: + node = document.root.get("/lsst/opaque_metadata/fits/primary") + except KeyError: + return None + if not isinstance(node, ZarrArray): + return None + json_bytes = bytes(node.read()).decode("utf-8") + cards = json.loads(json_bytes) + header = astropy.io.fits.Header() + for key, value in cards.items(): + header[key] = value + opaque = FitsOpaqueMetadata() + opaque.headers[ExtensionKey()] = header + return opaque diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index f97bdb45..cba5cc70 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -25,6 +25,7 @@ from .._mask import MaskSchema from .._transforms import FrameSet from .._transforms._ast import Channel, StringStream +from ..fits._common import FitsOpaqueMetadata from ..serialization import ( ArchiveTree, ArrayReferenceModel, @@ -47,6 +48,7 @@ chunks_aligned_to, chunks_for, decorate_sub_archives, + serialize_fits_opaque_metadata, ) from ._model import ( CfFlagAttributes, @@ -57,6 +59,7 @@ ZarrGroup, build_image_array_attrs, ) +from ._store import 
open_store_for_write class ZarrOutputArchive(OutputArchive[ZarrPointerModel]): @@ -411,8 +414,6 @@ def write( implementation (LocalStore / ZipStore / FsspecStore) is selected from the URI shape by ``_store.open_store_for_write``. """ - from ._store import open_store_for_write - archive_class = type(obj).__name__ archive_default_name = getattr(obj, "_archive_default_name", None) archive_metadata: dict[str, Any] = {} @@ -447,6 +448,11 @@ def write( if butler_info is not None: tree.butler_info = butler_info archive.add_tree(tree) + # Stage opaque metadata after add_tree so the namespace attribute + # writes happen in the right order. + opaque = getattr(obj, "_opaque_metadata", None) + if isinstance(opaque, FitsOpaqueMetadata): + serialize_fits_opaque_metadata(archive.document, opaque) with open_store_for_write(path) as store: archive.document.to_zarr(store) return tree diff --git a/tests/test_zarr_cross_format.py b/tests/test_zarr_cross_format.py new file mode 100644 index 00000000..27fef9e6 --- /dev/null +++ b/tests/test_zarr_cross_format.py @@ -0,0 +1,66 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. 
+ +from __future__ import annotations + +import os +import tempfile +import unittest + +import astropy.io.fits +import numpy as np + +from lsst.images import Box, Image +from lsst.images.fits import read as fits_read +from lsst.images.fits import write as fits_write + +try: + import zarr # noqa: F401 + + from lsst.images.zarr import read as zarr_read + from lsst.images.zarr import write as zarr_write + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class FitsZarrCrossFormatTestCase(unittest.TestCase): + """End-to-end FITS -> Zarr -> FITS preserves the primary header.""" + + def test_fits_to_zarr_to_fits_preserves_primary_header(self) -> None: + original = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + with tempfile.TemporaryDirectory() as tmp: + fits_a = os.path.join(tmp, "a.fits") + zarr_path = os.path.join(tmp, "b.zarr") + fits_b = os.path.join(tmp, "c.fits") + + def update_header(header): + header["ORIGIN"] = "RUBIN" + header["EXPTIME"] = 30.0 + + fits_write(original, fits_a, update_header=update_header) + from_fits = fits_read(Image, fits_a).deserialized + zarr_write(from_fits, zarr_path) + from_zarr = zarr_read(Image, zarr_path).deserialized + fits_write(from_zarr, fits_b) + + with astropy.io.fits.open(fits_b) as hdul: + self.assertEqual(hdul[0].header["ORIGIN"], "RUBIN") + self.assertEqual(hdul[0].header["EXPTIME"], 30.0) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_zarr_external_reader.py b/tests/test_zarr_external_reader.py new file mode 100644 index 00000000..852e6520 --- /dev/null +++ b/tests/test_zarr_external_reader.py @@ -0,0 +1,66 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). 
+# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import os +import tempfile +import unittest + +import numpy as np + +from lsst.images import Box, Image + +try: + import zarr # noqa: F401 + + from lsst.images.zarr import write + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + +try: + import ome_zarr + import ome_zarr.io + import ome_zarr.reader # noqa: F401 + + HAVE_OME_ZARR = True +except ImportError: + HAVE_OME_ZARR = False + + +@unittest.skipUnless(HAVE_ZARR and HAVE_OME_ZARR, "ome-zarr is not installed") +class OmeZarrReaderTestCase(unittest.TestCase): + """``ome-zarr-py`` can open archives written by ``lsst.images.zarr``.""" + + def test_ome_zarr_can_open_image(self) -> None: + from ome_zarr.io import parse_url + from ome_zarr.reader import Reader + + original = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(original, target) + location = parse_url(target) + self.assertIsNotNone(location) + reader = Reader(location) + nodes = list(reader()) + self.assertGreaterEqual(len(nodes), 1) + data = nodes[0].data[0] # level 0 + self.assertEqual(tuple(data.shape), (4, 5)) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_zarr_input_archive.py b/tests/test_zarr_input_archive.py index df00f014..eed51fe4 100644 --- a/tests/test_zarr_input_archive.py +++ b/tests/test_zarr_input_archive.py @@ -15,10 +15,14 @@ import tempfile import unittest +import astropy.io.fits +import astropy.table import numpy as np +import pydantic from lsst.images import Box, Image from lsst.images._image import ImageSerializationModel +from lsst.images.fits._common import ExtensionKey, FitsOpaqueMetadata from 
lsst.images.serialization import ArchiveReadError try: @@ -251,8 +255,6 @@ class ZarrInputArchivePointerTestCase(unittest.TestCase): """``deserialize_pointer`` cache + JSON sub-tree handling.""" def test_deserialize_pointer_caches_results(self) -> None: - import pydantic - class _Sub(pydantic.BaseModel): label: str @@ -292,8 +294,6 @@ class ZarrInputArchiveTableTestCase(unittest.TestCase): """``get_table`` reconstructs columns via ``get_array``.""" def test_get_table_reconstructs_columns(self) -> None: - import astropy.table - out = ZarrOutputArchive() out.document.root.attributes.lsst["archive_class"] = "Image" out.document.root.attributes.lsst["tree"] = "tree" @@ -335,5 +335,32 @@ def test_round_trip_image(self) -> None: self.assertEqual(result.deserialized.bbox, original.bbox) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrOpaqueMetadataReadTestCase(unittest.TestCase): + """FITS opaque metadata is restored on read.""" + + def test_fits_opaque_metadata_round_trips(self) -> None: + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + header = astropy.io.fits.Header() + header["ORIGIN"] = "RUBIN" + header["EXPTIME"] = 30.0 + opaque = FitsOpaqueMetadata() + opaque.headers[ExtensionKey()] = header + image._opaque_metadata = opaque + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(image, target) + recovered = read(Image, target).deserialized + recovered_opaque = recovered._opaque_metadata + self.assertIsInstance(recovered_opaque, FitsOpaqueMetadata) + recovered_header = recovered_opaque.headers[ExtensionKey()] + self.assertEqual(recovered_header["ORIGIN"], "RUBIN") + self.assertEqual(recovered_header["EXPTIME"], 30.0) + + if __name__ == "__main__": unittest.main() diff --git a/tests/test_zarr_layout.py b/tests/test_zarr_layout.py index 648fc941..8697488f 100644 --- a/tests/test_zarr_layout.py +++ b/tests/test_zarr_layout.py @@ -13,6 +13,8 @@ import 
unittest +import numpy as np + from lsst.images._transforms._ast import ( CmpMap, Frame, @@ -27,7 +29,9 @@ axes_for_archive_class, chunks_aligned_to, chunks_for, + decorate_sub_archives, ) + from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup HAVE_ZARR = True except ImportError: @@ -135,11 +139,6 @@ class DecorateSubArchivesTestCase(unittest.TestCase): """`decorate_sub_archives` walks the IR and adds OME / lsst attrs.""" def test_sub_group_with_image_gets_lsst_and_ome_attrs(self) -> None: - import numpy as np - - from lsst.images.zarr._layout import decorate_sub_archives - from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup - doc = ZarrDocument(root=ZarrGroup()) doc.root.attributes.lsst["archive_class"] = "ColorImage" red = doc.root.ensure_group("/red") @@ -152,11 +151,6 @@ def test_sub_group_with_image_gets_lsst_and_ome_attrs(self) -> None: self.assertEqual(red.attributes.ome["multiscales"][0]["datasets"][0]["path"], "image") def test_root_archive_class_is_unchanged(self) -> None: - import numpy as np - - from lsst.images.zarr._layout import decorate_sub_archives - from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup - doc = ZarrDocument(root=ZarrGroup()) doc.root.attributes.lsst["archive_class"] = "ColorImage" red = doc.root.ensure_group("/red") diff --git a/tests/test_zarr_model.py b/tests/test_zarr_model.py index cd32227d..e21a14a6 100644 --- a/tests/test_zarr_model.py +++ b/tests/test_zarr_model.py @@ -19,7 +19,16 @@ import zarr from lsst.images.zarr._common import LSST_NS, LSST_VERSION, OME_NS, OME_VERSION - from lsst.images.zarr._model import ZarrArray, ZarrAttributes + from lsst.images.zarr._model import ( + CfFlagAttributes, + MaskPlaneEntry, + OmeMultiscale, + ZarrArray, + ZarrAttributes, + ZarrDocument, + ZarrGroup, + build_image_array_attrs, + ) HAVE_ZARR = True except ImportError: @@ -125,8 +134,6 @@ class ZarrDocumentTestCase(unittest.TestCase): """Tests for `ZarrDocument` / `ZarrGroup` 
round-trip and tree walking.""" def test_round_trip_through_memory_store(self) -> None: - from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup - # Build a flat IR: image, variance, mask siblings at root. doc = ZarrDocument(root=ZarrGroup()) doc.root.attributes.lsst["archive_class"] = "MaskedImage" @@ -167,8 +174,6 @@ def test_round_trip_through_memory_store(self) -> None: np.testing.assert_array_equal(recovered.root.arrays["image"].read(), np.ones((4, 4), dtype="float32")) def test_get_walks_paths(self) -> None: - from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup - doc = ZarrDocument(root=ZarrGroup()) doc.root.arrays["image"] = ZarrArray(data=np.zeros((2, 2), dtype="float32")) red = doc.root.ensure_group("/red") @@ -189,8 +194,6 @@ class OmeCfHelpersTestCase(unittest.TestCase): """Tests for the OME / CF attribute-shape helpers.""" def test_multiscale_emits_expected_shape(self) -> None: - from lsst.images.zarr._model import OmeMultiscale - m = OmeMultiscale( name="visitimage", axes=("y", "x"), @@ -213,8 +216,6 @@ def test_multiscale_emits_expected_shape(self) -> None: ) def test_multiscale_with_affine(self) -> None: - from lsst.images.zarr._model import OmeMultiscale - m = OmeMultiscale( name="image", axes=("y", "x"), @@ -232,8 +233,6 @@ def test_multiscale_with_affine(self) -> None: self.assertEqual(d["datasets"][0]["coordinateTransformations"][0]["type"], "scale") def test_cf_flag_attributes(self) -> None: - from lsst.images.zarr._model import CfFlagAttributes, MaskPlaneEntry - cf = CfFlagAttributes( planes=[ MaskPlaneEntry(name="BAD", bit=0, description="Bad pixel."), @@ -247,8 +246,6 @@ def test_cf_flag_attributes(self) -> None: self.assertEqual(d["flag_descriptions"], ["Bad pixel.", "Saturated.", "Cosmic ray."]) def test_image_array_attrs(self) -> None: - from lsst.images.zarr._model import build_image_array_attrs - attrs = build_image_array_attrs(axes=("y", "x"), units="adu", long_name="science image") 
self.assertEqual(attrs["_ARRAY_DIMENSIONS"], ["y", "x"]) self.assertEqual(attrs["units"], "adu") diff --git a/tests/test_zarr_ome_compliance.py b/tests/test_zarr_ome_compliance.py new file mode 100644 index 00000000..095c4c6a --- /dev/null +++ b/tests/test_zarr_ome_compliance.py @@ -0,0 +1,78 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). +# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import os +import shutil +import subprocess +import tempfile +import unittest + +import numpy as np + +from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema + +try: + import zarr # noqa: F401 + + from lsst.images.zarr import write + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + +NGFF_VALIDATOR = shutil.which("ngff-validator") + + +@unittest.skipUnless(HAVE_ZARR and NGFF_VALIDATOR, "ngff-validator is not on PATH") +class NgffComplianceTestCase(unittest.TestCase): + """Archives written by zarr backend validate against NGFF schema.""" + + def _validate(self, target: str) -> None: + result = subprocess.run( + [NGFF_VALIDATOR, target], + capture_output=True, + text=True, + check=False, + ) + self.assertEqual( + result.returncode, + 0, + f"ngff-validator failed for {target}:\nSTDOUT:\n{result.stdout}\nSTDERR:\n{result.stderr}", + ) + + def test_image_validates(self) -> None: + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(image, target) + self._validate(target) + + def test_masked_image_validates(self) -> None: + schema = MaskSchema([MaskPlane("BAD", "Bad 
pixel.")]) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + masked = MaskedImage(image, mask_schema=schema) + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "masked.zarr") + write(masked, target) + self._validate(target) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py index bc3e6696..3c4e2cc7 100644 --- a/tests/test_zarr_output_archive.py +++ b/tests/test_zarr_output_archive.py @@ -11,15 +11,18 @@ from __future__ import annotations +import json import os import tempfile import unittest +import astropy.io.fits import astropy.table import numpy as np import pydantic -from lsst.images import Box, Image, MaskPlane, MaskSchema +from lsst.images import Box, ColorImage, Image, MaskedImage, MaskPlane, MaskSchema +from lsst.images.fits._common import ExtensionKey, FitsOpaqueMetadata try: import zarr @@ -220,8 +223,6 @@ def _round_trip_doc(self, obj): return ZarrDocument.from_zarr(store) def test_masked_image_layout(self) -> None: - from lsst.images import MaskedImage - schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) image = Image( np.arange(20, dtype=np.float32).reshape(4, 5), @@ -253,8 +254,6 @@ class ZarrColorImageWriteTestCase(unittest.TestCase): """ColorImage emits decorated red/green/blue sub-archives.""" def test_color_image_emits_per_channel_arrays(self) -> None: - from lsst.images import ColorImage - arr = np.zeros((4, 5, 3), dtype=np.uint8) arr[..., 0] = 1 arr[..., 1] = 2 @@ -300,5 +299,37 @@ def test_psf_user_override_wins(self) -> None: self.assertEqual(tuple(node.chunks), (2, 3, 21, 21)) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class ZarrOpaqueMetadataWriteTestCase(unittest.TestCase): + """FITS opaque metadata persists at /lsst/opaque_metadata/fits/primary.""" + + def test_fits_opaque_metadata_persists(self) -> None: + image = Image( + np.arange(20, 
dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + header = astropy.io.fits.Header() + header["ORIGIN"] = "RUBIN" + header["EXPTIME"] = 30.0 + opaque = FitsOpaqueMetadata() + opaque.headers[ExtensionKey()] = header + image._opaque_metadata = opaque + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(image, target) + with zarr.storage.LocalStore(target, read_only=True) as store: + doc = ZarrDocument.from_zarr(store) + self.assertEqual( + doc.root.attributes.lsst.get("opaque_metadata_format"), + "fits", + ) + opaque_node = doc.root.get("/lsst/opaque_metadata/fits/primary") + json_bytes = bytes(opaque_node.read()) + cards = json.loads(json_bytes) + self.assertEqual(cards["ORIGIN"], "RUBIN") + self.assertEqual(cards["EXPTIME"], 30.0) + + if __name__ == "__main__": unittest.main() diff --git a/tests/test_zarr_round_trip.py b/tests/test_zarr_round_trip.py index 3159eb1c..c0b30a4b 100644 --- a/tests/test_zarr_round_trip.py +++ b/tests/test_zarr_round_trip.py @@ -15,7 +15,7 @@ import numpy as np -from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema +from lsst.images import Box, ColorImage, Image, MaskedImage, MaskPlane, MaskSchema try: import zarr # noqa: F401 @@ -83,8 +83,6 @@ class ZarrColorImageRoundTripTestCase(unittest.TestCase): """ColorImage round-trips through the zarr backend.""" def test_color_image_round_trip(self) -> None: - from lsst.images import ColorImage - arr = np.zeros((4, 5, 3), dtype=np.uint8) arr[..., 0] = 1 arr[..., 1] = 2 diff --git a/tests/test_zarr_xarray_interop.py b/tests/test_zarr_xarray_interop.py new file mode 100644 index 00000000..e560c6e3 --- /dev/null +++ b/tests/test_zarr_xarray_interop.py @@ -0,0 +1,75 @@ +# This file is part of lsst-images. +# +# Developed for the LSST Data Management System. +# This product includes software developed by the LSST Project +# (https://www.lsst.org). 
+# See the COPYRIGHT file at the top-level directory of this distribution +# for details of code ownership. +# +# Use of this source code is governed by a 3-clause BSD-style +# license that can be found in the LICENSE file. + +from __future__ import annotations + +import os +import tempfile +import unittest + +import numpy as np + +from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema + +try: + import zarr # noqa: F401 + + from lsst.images.zarr import write + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + +try: + import xarray as xr + + HAVE_XARRAY = True +except ImportError: + HAVE_XARRAY = False + + +@unittest.skipUnless(HAVE_ZARR and HAVE_XARRAY, "xarray is not installed") +class XarrayInteropTestCase(unittest.TestCase): + """``xr.open_zarr`` returns a Dataset with the masked-image siblings.""" + + def test_open_zarr_returns_dataset_with_masked_image_components(self) -> None: + schema = MaskSchema( + [ + MaskPlane("BAD", "Bad pixel."), + MaskPlane("SAT", "Saturated."), + MaskPlane("CR", "Cosmic ray."), + ] + ) + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + masked = MaskedImage(image, mask_schema=schema) + masked.mask.set("BAD", image.array % 2 == 0) + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "masked.zarr") + write(masked, target) + ds = xr.open_zarr(target) + # Three data variables sharing the (y, x) dims. + self.assertIn("image", ds.data_vars) + self.assertIn("variance", ds.data_vars) + self.assertIn("mask", ds.data_vars) + self.assertEqual(ds["image"].dims, ("y", "x")) + self.assertEqual(ds["mask"].dims, ("y", "x")) + self.assertEqual(ds["image"].shape, (4, 5)) + # CF flag attrs survive on the mask variable. 
+ self.assertEqual(ds["mask"].attrs["flag_meanings"], "BAD SAT CR") + self.assertEqual(list(ds["mask"].attrs["flag_masks"]), [1, 2, 4]) + + +if __name__ == "__main__": + unittest.main() From f0be49632ff6492bed0701e6378e1b362323c58b Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 06:57:59 -0700 Subject: [PATCH 27/60] docs: expand zarr module docstring, add reference page, changelog fragment Phase 6 finalization: - Replace the placeholder __init__.py docstring with a full overview covering supported types, on-disk layout, WCS handling, cloud defaults, FITS round-trip, and optional install. - Add doc/lsst.images/zarr.rst as the API reference page mirroring the other backends and wire it into the toctree. - Add doc/changes/DM-55041.feature.md towncrier fragment. Generated with AI Co-Authored-By: SLAC AI --- doc/changes/DM-55041.feature.md | 1 + doc/lsst.images/index.rst | 1 + doc/lsst.images/zarr.rst | 16 +++++ python/lsst/images/zarr/__init__.py | 102 ++++++++++++++++++++++++---- 4 files changed, 105 insertions(+), 15 deletions(-) create mode 100644 doc/changes/DM-55041.feature.md create mode 100644 doc/lsst.images/zarr.rst diff --git a/doc/changes/DM-55041.feature.md b/doc/changes/DM-55041.feature.md new file mode 100644 index 00000000..064c0e91 --- /dev/null +++ b/doc/changes/DM-55041.feature.md @@ -0,0 +1 @@ +Added a new `lsst.images.zarr` archive backend that reads and writes Zarr v3 archives. The on-disk layout is xarray/CF-shaped at the root (`image`, `variance`, `mask` as siblings sharing `(y, x)` dimensions, CF `flag_masks`/`flag_meanings` on the mask) with OME-NGFF v0.5 multiscales metadata layered on top — the same bytes are visible to xarray, GDAL, and OME-Zarr tooling like `napari` and `ome-zarr-py`. Supports `Image`, `Mask`, `MaskedImage`, and `ColorImage`. Cloud-friendly defaults (tile-aligned chunks, fsspec-backed remote stores) and subset reads that only fetch the chunks they need. 
Install via the new `[zarr]` extra (`pip install lsst-images[zarr]`). diff --git a/doc/lsst.images/index.rst b/doc/lsst.images/index.rst index e819d3a0..37af78b7 100644 --- a/doc/lsst.images/index.rst +++ b/doc/lsst.images/index.rst @@ -32,4 +32,5 @@ API Reference fits.rst json.rst ndf.rst + zarr.rst tests.rst diff --git a/doc/lsst.images/zarr.rst b/doc/lsst.images/zarr.rst new file mode 100644 index 00000000..fa770962 --- /dev/null +++ b/doc/lsst.images/zarr.rst @@ -0,0 +1,16 @@ +Zarr I/O +======== + +A Zarr v3 serialization backend whose on-disk layout is xarray/CF-shaped at the root (``image`` / ``variance`` / ``mask`` as siblings sharing ``(y, x)`` dimensions, CF ``flag_masks`` / ``flag_meanings`` on the mask) with OME-NGFF v0.5 multiscales metadata as a discoverability layer pointing at the same ``image`` array. +The same bytes are visible to ``xarray``, GDAL's Zarr driver, and OME-Zarr tooling like ``napari`` and ``ome-zarr-py``. + +Default chunking is tile-aligned (~1024×1024 for plain images, ``cell_shape`` for ``CellCoadd``); subset reads via ``slices=`` only fetch the chunks they need — including on remote stores accessed through ``lsst.resources.ResourcePath`` and ``fsspec``. + +This backend requires the optional ``zarr >= 3.0`` package. Install via the ``[zarr]`` extra:: + + pip install lsst-images[zarr] + +.. automodapi:: lsst.images.zarr + :no-inheritance-diagram: + :include-all-objects: + :inherited-members: diff --git a/python/lsst/images/zarr/__init__.py b/python/lsst/images/zarr/__init__.py index ee480a05..f844900b 100644 --- a/python/lsst/images/zarr/__init__.py +++ b/python/lsst/images/zarr/__init__.py @@ -11,21 +11,93 @@ """Zarr v3 archive backend for `lsst.images`. 
-Files written by this archive are xarray/CF-shaped at the root -(``image`` / ``variance`` / ``mask`` as siblings sharing ``(y, x)`` -dimensions, CF ``flag_masks`` / ``flag_meanings`` on the mask) with -OME-NGFF v0.5 multiscales metadata as a discoverability layer -pointing at the same ``image`` array. The same bytes are visible to -``xarray``, GDAL's Zarr driver, and OME-Zarr tooling like ``napari`` -and ``ome-zarr-py``. - -Default chunk geometry is tile-aligned (~1024×1024 for plain images, -``cell_shape`` for ``CellCoadd``). Sharding (zarr v3 native) is -enabled by default with a tunable shard size to keep object counts -manageable on S3/GCS. Both ``DirectoryStore`` and ``ZipStore`` are -supported; the choice is driven by URI shape (``*.zarr.zip`` → -``ZipStore``, otherwise directory). Remote URIs go through -`lsst.resources.ResourcePath` and `fsspec`. +This module reads and writes Zarr v3 archives whose root layout is +xarray/CF-shaped (``image``, ``variance``, ``mask`` as siblings sharing +``(y, x)`` dimensions, CF ``flag_masks`` / ``flag_meanings`` / +``flag_descriptions`` on the mask) with OME-NGFF v0.5 multiscales +metadata as a discoverability layer pointing at the same ``image`` +array. The same bytes are visible to ``xarray``, GDAL's Zarr driver, +and OME-Zarr tooling like ``napari`` and ``ome-zarr-py``. + +Supported types +--------------- + +Every image type that already serializes to FITS / JSON / NDF: +`~lsst.images.Image`, `~lsst.images.Mask`, `~lsst.images.MaskedImage`, +`~lsst.images.VisitImage`, `~lsst.images.ColorImage`, plus any object +reachable through the `~lsst.images.serialization.OutputArchive` +interface. + +On-disk layout +-------------- + +A `MaskedImage` archive contains: + +- ``image``, ``variance``, ``mask`` arrays at the root, shaped + ``(Y, X)`` with shared chunk sizes. +- ``tree`` — 1-D ``uint8`` zarr array containing UTF-8 JSON of the + Pydantic archive tree (the round-trip authority). 
+- ``wcs_ast`` — 1-D ``uint8`` zarr array containing the AST FrameSet + text (the WCS round-trip authority), when a projection exists. + +The mask is a 2-D unsigned integer (``uint8`` for ≤8 planes, up to +``uint64`` for 64 planes; >64 raises). Each pixel's bits encode the +applicable mask planes. + +For `ColorImage`, the three channels are written as flat 2-D arrays +at ``red``, ``green``, ``blue``. + +For ``CellCoadd``, ``image`` / ``variance`` / ``mask`` are siblings +(cell-aligned chunks driven by ``cell_shape``), and ``psf`` is a 4-D +``(Cy, Cx, Py, Px)`` array with single-cell chunks +``(1, 1, Py, Px)``. + +WCS handling +------------ + +The AST ``FrameSet`` text at ``wcs_ast`` is the round-trip authority. +The layout layer also emits an OME-NGFF v0.5 affine +``coordinateTransformations`` block that approximates the linear part +of the pixel-to-sky map. Before emitting, residuals are sampled on an +11×11 grid; if the worst pixel-equivalent error exceeds 1.0 pixel, +the affine block is dropped and ``lsst.wcs_simplified_dropped: true`` +is recorded with the observed maximum. + +Cloud-friendly defaults +----------------------- + +- Default chunk geometry is tile-aligned: ``min(1024, dim)`` per + axis for plain images, ``cell_shape`` for ``CellCoadd``, + single-cell for ``CellCoadd``'s 4-D PSF. +- Subset reads via ``slices=`` to + `~lsst.images.serialization.InputArchive.get_array` exploit zarr's + chunk index: only chunks intersecting the slice are fetched, even + from remote stores. +- Both ``LocalStore`` and ``ZipStore`` are supported; the choice + is driven by URI shape (``*.zarr.zip`` → ``ZipStore``, otherwise + directory). Remote URIs (``s3://``, ``gs://``, ``http(s)://``) go + through `lsst.resources.ResourcePath` and `fsspec`. + +Round-trip with FITS +-------------------- + +When an object that originated from a FITS read carries a +`~lsst.images.fits.FitsOpaqueMetadata`, the primary-HDU header is +preserved at ``/lsst/opaque_metadata/fits/primary``. 
Reading the +zarr back attaches an equivalent ``FitsOpaqueMetadata`` to the +deserialized object so a subsequent FITS write reproduces the +original cards. + +Optional install +---------------- + +This backend requires ``zarr >= 3.0``. Install via the ``[zarr]`` +extra:: + + pip install lsst-images[zarr] + +The top-level ``import lsst.images.zarr`` raises a clear +`ImportError` with this guidance if `zarr` is not installed. """ try: From 25a98698ef99e39d20ec7cca1505f7ae262680b2 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 07:21:17 -0700 Subject: [PATCH 28/60] fix: emit zarr v3 dimension_names so xarray.open_zarr works xarray's v3 backend reads dim names from the array's native dimension_names metadata field, not from _ARRAY_DIMENSIONS in attributes. Promote the CF attr to the v3 metadata field on write, and mirror back on read so older v2-style consumers still see _ARRAY_DIMENSIONS. Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_model.py | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/python/lsst/images/zarr/_model.py b/python/lsst/images/zarr/_model.py index ba256563..8ee476c0 100644 --- a/python/lsst/images/zarr/_model.py +++ b/python/lsst/images/zarr/_model.py @@ -133,6 +133,12 @@ def dtype(self) -> np.dtype: def from_zarr(cls, zarr_array: zarr.Array) -> Self: """Wrap an open ``zarr.Array`` without reading its data.""" attrs = ZarrAttributes.load(dict(zarr_array.attrs)) + # Mirror native zarr v3 ``dimension_names`` into the xarray v2-style + # ``_ARRAY_DIMENSIONS`` attribute when only the v3 form is present, + # so downstream consumers see both. 
+ dim_names = getattr(zarr_array.metadata, "dimension_names", None) + if dim_names and "_ARRAY_DIMENSIONS" not in attrs.extra: + attrs.extra["_ARRAY_DIMENSIONS"] = list(dim_names) return cls( data=zarr_array, chunks=tuple(zarr_array.chunks), @@ -236,6 +242,10 @@ def _group_to_zarr(ir: ZarrGroup, zarr_group: zarr.Group) -> None: chunks = array.chunks or _default_chunks(array.data.shape) compression = array.compression or ZarrCompressionOptions.default_for_dtype(str(array.dtype)) serializer, compressors = _build_codecs(compression) + # Promote ``_ARRAY_DIMENSIONS`` from the CF-style attribute to the + # native zarr v3 ``dimension_names`` metadata field; xarray's v3 + # backend reads from there, not from attributes. + dim_names = array.attributes.extra.get("_ARRAY_DIMENSIONS") zarr_array = zarr_group.create_array( name=name, shape=array.data.shape, @@ -244,6 +254,7 @@ def _group_to_zarr(ir: ZarrGroup, zarr_group: zarr.Group) -> None: shards=array.shards, serializer=serializer, compressors=compressors, + dimension_names=list(dim_names) if dim_names is not None else None, ) zarr_array[:] = array.data if dumped := array.attributes.dump(): From 842fdc0fbc748f96d30d21f0b14256eea58c8dbe Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 07:25:07 -0700 Subject: [PATCH 29/60] feat: default dimension_names to [None] so xarray can open every group xarray.open_zarr walks every array in a group and refuses to open the parent if any array lacks dimension_names. Default to a list of None sentinels for arrays that do not carry an explicit _ARRAY_DIMENSIONS attr (e.g. tree, wcs_ast, opaque_metadata blobs). Adds two xarray-backed read tests (skipped when xarray is missing) and lists xarray + zarr as runtime requirements. 
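The fallback described in this commit message can be sketched in isolation. This is only an illustration, not the code in `_model.py`: `resolve_dimension_names` is a hypothetical helper name, and its `extra` argument stands in for `array.attributes.extra`.

```python
from __future__ import annotations

from collections.abc import Mapping, Sequence
from typing import Any


def resolve_dimension_names(extra: Mapping[str, Any], ndim: int) -> list[str | None]:
    """Return one dimension name per axis, with ``None`` for unnamed axes.

    Hypothetical helper that mirrors the fallback in ``_group_to_zarr``.
    """
    names: Sequence[str] | None = extra.get("_ARRAY_DIMENSIONS")
    if names is None:
        # No CF-style names were recorded (e.g. a JSON or WCS byte blob),
        # so emit one ``None`` sentinel per axis.
        return [None] * ndim
    return list(names)


# An image array with explicit names, and a 1-D byte blob without any.
image_names = resolve_dimension_names({"_ARRAY_DIMENSIONS": ("y", "x")}, ndim=2)
blob_names = resolve_dimension_names({}, ndim=1)
```

Under this sketch every array ends up with a `dimension_names` list of length `ndim`, which is the property the commit relies on when xarray walks the group.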
Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_model.py | 10 ++++++++-- requirements.txt | 2 ++ tests/test_zarr_xarray_interop.py | 25 +++++++++++++++++++++++-- 3 files changed, 33 insertions(+), 4 deletions(-) diff --git a/python/lsst/images/zarr/_model.py b/python/lsst/images/zarr/_model.py index 8ee476c0..399ae28d 100644 --- a/python/lsst/images/zarr/_model.py +++ b/python/lsst/images/zarr/_model.py @@ -244,8 +244,14 @@ def _group_to_zarr(ir: ZarrGroup, zarr_group: zarr.Group) -> None: serializer, compressors = _build_codecs(compression) # Promote ``_ARRAY_DIMENSIONS`` from the CF-style attribute to the # native zarr v3 ``dimension_names`` metadata field; xarray's v3 - # backend reads from there, not from attributes. + # backend reads from there, not from attributes, and refuses to + # open the parent group if *any* array lacks the field. Arrays + # without explicit names fall back to ``[None] * ndim``. dim_names = array.attributes.extra.get("_ARRAY_DIMENSIONS") + if dim_names is None: + dim_names = [None] * array.data.ndim + else: + dim_names = list(dim_names) zarr_array = zarr_group.create_array( name=name, shape=array.data.shape, @@ -254,7 +260,7 @@ def _group_to_zarr(ir: ZarrGroup, zarr_group: zarr.Group) -> None: shards=array.shards, serializer=serializer, compressors=compressors, - dimension_names=list(dim_names) if dim_names is not None else None, + dimension_names=dim_names, ) zarr_array[:] = array.data if dumped := array.attributes.dump(): diff --git a/requirements.txt b/requirements.txt index 6bd55a85..0fcd11bd 100644 --- a/requirements.txt +++ b/requirements.txt @@ -9,3 +9,5 @@ astro-metadata-translator @ git+https://github.com/lsst/astro_metadata_translato starlink-pyast >= 4.0.0 scipy >= 1.13 shapely >= 2.1 +xarray >= 2024.1 +zarr >= 3.0 diff --git a/tests/test_zarr_xarray_interop.py b/tests/test_zarr_xarray_interop.py index e560c6e3..648c111b 100644 --- a/tests/test_zarr_xarray_interop.py +++ 
b/tests/test_zarr_xarray_interop.py @@ -40,7 +40,7 @@ class XarrayInteropTestCase(unittest.TestCase): """``xr.open_zarr`` returns a Dataset with the masked-image siblings.""" - def test_open_zarr_returns_dataset_with_masked_image_components(self) -> None: + def _make_masked_image(self) -> MaskedImage: schema = MaskSchema( [ MaskPlane("BAD", "Bad pixel."), @@ -54,11 +54,15 @@ def test_open_zarr_returns_dataset_with_masked_image_components(self) -> None: ) masked = MaskedImage(image, mask_schema=schema) masked.mask.set("BAD", image.array % 2 == 0) + return masked + + def test_open_zarr_returns_dataset_with_masked_image_components(self) -> None: + masked = self._make_masked_image() with tempfile.TemporaryDirectory() as tmp: target = os.path.join(tmp, "masked.zarr") write(masked, target) - ds = xr.open_zarr(target) + ds = xr.open_zarr(target, consolidated=False) # Three data variables sharing the (y, x) dims. self.assertIn("image", ds.data_vars) self.assertIn("variance", ds.data_vars) @@ -70,6 +74,23 @@ def test_open_zarr_returns_dataset_with_masked_image_components(self) -> None: self.assertEqual(ds["mask"].attrs["flag_meanings"], "BAD SAT CR") self.assertEqual(list(ds["mask"].attrs["flag_masks"]), [1, 2, 4]) + def test_open_zarr_data_values_match_in_memory(self) -> None: + """The bytes xarray reads are the same bytes the archive wrote.""" + masked = self._make_masked_image() + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "masked.zarr") + write(masked, target) + ds = xr.open_zarr(target, consolidated=False) + np.testing.assert_array_equal(ds["image"].values, masked.image.array) + np.testing.assert_array_equal(ds["variance"].values, masked.variance.array) + # Mask on disk is a 2-D packed wide-int; compare against the + # equivalent packing of the in-memory (y, x, mask_size) array. 
+ packed = np.zeros(masked.mask.array.shape[:2], dtype=ds["mask"].dtype) + for i in range(masked.mask.array.shape[2]): + packed |= masked.mask.array[..., i].astype(ds["mask"].dtype) << (8 * i) + np.testing.assert_array_equal(ds["mask"].values, packed) + if __name__ == "__main__": unittest.main() From e8b739dbbb1debe72de7627701976953473096d5 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 07:39:20 -0700 Subject: [PATCH 30/60] feat: consolidate zarr metadata on write so xr.open_zarr is one fetch MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit After to_zarr() materialises the IR, call zarr.consolidate_metadata so the root zarr.json carries every child array's metadata in one place. This silences xarray's 'Failed to open Zarr store with consolidated metadata' RuntimeWarning and turns opening a multi-array archive into a single round-trip — significant on remote stores. ZipStore does not support consolidation (raises TypeError); we swallow that so zip writes continue to work without consolidation. Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_output_archive.py | 9 +++++++++ tests/test_zarr_xarray_interop.py | 16 ++++++++++++++++ 2 files changed, 25 insertions(+) diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index cba5cc70..b71b97b0 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -21,6 +21,7 @@ import astropy.units import numpy as np import pydantic +import zarr from .._mask import MaskSchema from .._transforms import FrameSet @@ -455,4 +456,12 @@ def write( serialize_fits_opaque_metadata(archive.document, opaque) with open_store_for_write(path) as store: archive.document.to_zarr(store) + # Consolidate metadata so a single read fetches the whole + # hierarchy's zarr.json contents — significant perf win on + # remote stores. 
ZipStore does not support consolidation; it + # raises TypeError, which we ignore so zip writes still work. + try: + zarr.consolidate_metadata(store) + except TypeError: + pass return tree diff --git a/tests/test_zarr_xarray_interop.py b/tests/test_zarr_xarray_interop.py index 648c111b..525b47d8 100644 --- a/tests/test_zarr_xarray_interop.py +++ b/tests/test_zarr_xarray_interop.py @@ -74,6 +74,22 @@ def test_open_zarr_returns_dataset_with_masked_image_components(self) -> None: self.assertEqual(ds["mask"].attrs["flag_meanings"], "BAD SAT CR") self.assertEqual(list(ds["mask"].attrs["flag_masks"]), [1, 2, 4]) + def test_open_zarr_uses_consolidated_metadata(self) -> None: + """``write()`` consolidates metadata so xr.open_zarr uses one fetch.""" + import warnings + + masked = self._make_masked_image() + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "masked.zarr") + write(masked, target) + # Default ``consolidated=None`` means "use it if available"; + # if it isn't present xarray emits a ``RuntimeWarning`` and + # falls back to walking every array. Promote that warning to + # an error to confirm the consolidated path is taken. + with warnings.catch_warnings(): + warnings.simplefilter("error", RuntimeWarning) + xr.open_zarr(target) + def test_open_zarr_data_values_match_in_memory(self) -> None: """The bytes xarray reads are the same bytes the archive wrote.""" masked = self._make_masked_image() From d838ef5ec0f901eb0c03f79cb2ba8c43a6bf4f19 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 07:50:00 -0700 Subject: [PATCH 31/60] feat: store FITS opaque metadata as (N, 80) byte array, byte-exact round-trip Switch FitsOpaqueMetadata storage from a JSON {keyword: value} blob to the raw card stream Header.tostring() emits, reshaped as a 2-D (card, char) uint8 array. 
The JSON form was lossy: COMMENT/HISTORY (empty-keyword cards) were filtered out, inline card comments were dropped, types were coerced through json, CONTINUE/HIERARCH cards were not faithfully captured. The (N, 80) form is what FITS users expect and round-trips byte-for- byte through Header.fromstring (which transparently strips the END card and trailing 2880-byte padding). Dim names ('card', 'char') are written into the v3 dimension_names metadata so xarray sees the structure. Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_layout.py | 39 +++++++++++++++++++----------- tests/test_zarr_input_archive.py | 37 ++++++++++++++++++++++++++++ tests/test_zarr_output_archive.py | 17 +++++++++---- 3 files changed, 74 insertions(+), 19 deletions(-) diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index 0e0fd05e..34dd54ad 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -37,7 +37,6 @@ "serialize_fits_opaque_metadata", ) -import json from collections.abc import Mapping from dataclasses import dataclass from typing import Any @@ -275,18 +274,31 @@ def _decorate_walk(group: Any) -> None: def serialize_fits_opaque_metadata(document: ZarrDocument, opaque: FitsOpaqueMetadata) -> None: """Stage a `FitsOpaqueMetadata` object into the IR. - Stores the primary-HDU header as a JSON-encoded ``uint8`` array at - ``/lsst/opaque_metadata/fits/primary`` and sets the - ``lsst.opaque_metadata_format`` attribute on the root group. - No-op if the metadata is empty or missing a primary header. + Stores the primary-HDU header as a 2-D ``(N, 80)`` ``uint8`` array + at ``/lsst/opaque_metadata/fits/primary`` — one row per FITS card, + one column per character — and sets ``lsst.opaque_metadata_format + = "fits"`` on the root group. 
The bytes are + ``astropy.io.fits.Header.tostring()`` output verbatim (cards + + ``END`` + padding to a 2880-byte block), so the round-trip is + byte-exact and preserves comments, ``HISTORY``, ``COMMENT``, + ``CONTINUE``, and ``HIERARCH`` cards. No-op if the metadata is + empty or missing a primary header. """ primary = opaque.headers.get(ExtensionKey()) if primary is None or len(primary) == 0: return - cards = {card.keyword: card.value for card in primary.cards if card.keyword} - json_bytes = json.dumps(cards).encode("utf-8") + text = primary.tostring() + if len(text) % 80 != 0: + raise ValueError( + f"Header.tostring() returned {len(text)} bytes; expected a " + "multiple of 80 (one 80-char FITS card per row)." + ) + n_cards = len(text) // 80 + cards = np.frombuffer(text.encode("ascii"), dtype=np.uint8).reshape(n_cards, 80) parent = document.root.ensure_group("/lsst/opaque_metadata/fits") - parent.arrays["primary"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) + ir_array = ZarrArray(data=np.ascontiguousarray(cards)) + ir_array.attributes.extra["_ARRAY_DIMENSIONS"] = ["card", "char"] + parent.arrays["primary"] = ir_array document.root.attributes.lsst["opaque_metadata_format"] = "fits" @@ -295,7 +307,9 @@ def deserialize_fits_opaque_metadata(document: ZarrDocument) -> FitsOpaqueMetada Returns ``None`` when the archive does not have a FITS opaque metadata block (the common case for archives that originated as - native zarr). + native zarr). ``Header.fromstring`` parses cards up to the ``END`` + marker and drops the padding, so the recovered header carries + only the real cards. 
""" if document.root.attributes.lsst.get("opaque_metadata_format") != "fits": return None @@ -305,11 +319,8 @@ def deserialize_fits_opaque_metadata(document: ZarrDocument) -> FitsOpaqueMetada return None if not isinstance(node, ZarrArray): return None - json_bytes = bytes(node.read()).decode("utf-8") - cards = json.loads(json_bytes) - header = astropy.io.fits.Header() - for key, value in cards.items(): - header[key] = value + text = bytes(node.read()).decode("ascii") + header = astropy.io.fits.Header.fromstring(text) opaque = FitsOpaqueMetadata() opaque.headers[ExtensionKey()] = header return opaque diff --git a/tests/test_zarr_input_archive.py b/tests/test_zarr_input_archive.py index eed51fe4..041b83a5 100644 --- a/tests/test_zarr_input_archive.py +++ b/tests/test_zarr_input_archive.py @@ -361,6 +361,43 @@ def test_fits_opaque_metadata_round_trips(self) -> None: self.assertEqual(recovered_header["ORIGIN"], "RUBIN") self.assertEqual(recovered_header["EXPTIME"], 30.0) + def test_fits_opaque_metadata_preserves_full_card_fidelity(self) -> None: + """Comments, HISTORY, COMMENT, and HIERARCH all survive round-trip.""" + image = Image( + np.arange(20, dtype=np.float32).reshape(4, 5), + bbox=Box.factory[10:14, 20:25], + ) + header = astropy.io.fits.Header() + header["ORIGIN"] = ("RUBIN", "Source observatory") + header["EXPTIME"] = (30.0, "[s] Total exposure time") + header["HIERARCH LSST INSTRUMENT"] = "LSSTCam" + header.add_history("Bias-subtracted on 2026-05-21") + header.add_history("ISR completed 2026-05-22") + header.add_comment("This file was generated for testing.") + opaque = FitsOpaqueMetadata() + opaque.headers[ExtensionKey()] = header + image._opaque_metadata = opaque + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr") + write(image, target) + recovered = read(Image, target).deserialized + recovered_header = recovered._opaque_metadata.headers[ExtensionKey()] + # Byte-exact equality of the serialized card stream. 
+ self.assertEqual(recovered_header.tostring(), header.tostring()) + # Spot-check the round-tripped values + comments. + self.assertEqual(recovered_header.comments["ORIGIN"], "Source observatory") + self.assertEqual(recovered_header.comments["EXPTIME"], "[s] Total exposure time") + self.assertEqual(recovered_header["HIERARCH LSST INSTRUMENT"], "LSSTCam") + self.assertEqual( + list(recovered_header["HISTORY"]), + ["Bias-subtracted on 2026-05-21", "ISR completed 2026-05-22"], + ) + self.assertEqual( + list(recovered_header["COMMENT"]), + ["This file was generated for testing."], + ) + if __name__ == "__main__": unittest.main() diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py index 3c4e2cc7..ae27c54d 100644 --- a/tests/test_zarr_output_archive.py +++ b/tests/test_zarr_output_archive.py @@ -11,7 +11,6 @@ from __future__ import annotations -import json import os import tempfile import unittest @@ -325,10 +324,18 @@ def test_fits_opaque_metadata_persists(self) -> None: "fits", ) opaque_node = doc.root.get("/lsst/opaque_metadata/fits/primary") - json_bytes = bytes(opaque_node.read()) - cards = json.loads(json_bytes) - self.assertEqual(cards["ORIGIN"], "RUBIN") - self.assertEqual(cards["EXPTIME"], 30.0) + # ``(N, 80)`` byte array with explicit dim names. + self.assertEqual(len(opaque_node.shape), 2) + self.assertEqual(opaque_node.shape[1], 80) + self.assertEqual( + opaque_node.attributes.extra["_ARRAY_DIMENSIONS"], + ["card", "char"], + ) + # Recover the original header from the raw bytes. 
+ text = bytes(opaque_node.read()).decode("ascii") + recovered = astropy.io.fits.Header.fromstring(text) + self.assertEqual(recovered["ORIGIN"], "RUBIN") + self.assertEqual(recovered["EXPTIME"], 30.0) if __name__ == "__main__": From 460e06c9a8207294582af83d8746ca4d95e78d1f Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 08:00:15 -0700 Subject: [PATCH 32/60] perf: store tree, wcs_ast, and FITS header as single-chunk arrays MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit These byte-stream meta arrays are always read whole — we never request a slice of the JSON tree or AST WCS text. Force a single chunk so a remote read is one fetch instead of ceil(size/1024). Affects: - root /tree (Pydantic archive tree) - sub-archive //tree (serialize_pointer JSON) - root /wcs_ast (AST FrameSet text) - /lsst/opaque_metadata/fits/primary (FITS card stream) For a 50 KB tree this drops the read count from 49 chunks to 1. Compression (Blosc/zstd) still applies and gives ~3-5x on JSON text. Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_layout.py | 5 +++-- python/lsst/images/zarr/_output_archive.py | 15 +++++++++------ 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index 34dd54ad..c2976aad 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -294,9 +294,10 @@ def serialize_fits_opaque_metadata(document: ZarrDocument, opaque: FitsOpaqueMet "multiple of 80 (one 80-char FITS card per row)." ) n_cards = len(text) // 80 - cards = np.frombuffer(text.encode("ascii"), dtype=np.uint8).reshape(n_cards, 80) + cards = np.ascontiguousarray(np.frombuffer(text.encode("ascii"), dtype=np.uint8).reshape(n_cards, 80)) parent = document.root.ensure_group("/lsst/opaque_metadata/fits") - ir_array = ZarrArray(data=np.ascontiguousarray(cards)) + # Single chunk: the header is always read whole. 
+ ir_array = ZarrArray(data=cards, chunks=cards.shape) ir_array.attributes.extra["_ARRAY_DIMENSIONS"] = ["card", "char"] parent.arrays["primary"] = ir_array document.root.attributes.lsst["opaque_metadata_format"] = "fits" diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index b71b97b0..75624bdf 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -137,7 +137,9 @@ def serialize_pointer[T: ArchiveTree]( model = self.serialize_direct(name, serializer) json_bytes = model.model_dump_json().encode("utf-8") parent = self.document.root.ensure_group(sub_zarr_path) - parent.arrays["tree"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) + # Single-chunk storage: the tree is always read whole. + tree_data = np.frombuffer(json_bytes, dtype=np.uint8) + parent.arrays["tree"] = ZarrArray(data=tree_data, chunks=tree_data.shape) pointer = ZarrPointerModel(path=f"{sub_zarr_path}/tree") self._pointers[key] = pointer return pointer @@ -322,9 +324,10 @@ def add_tree(self, tree: ArchiveTree) -> None: runs the affine residual validator if the archive carries a frame set. """ - # Stage the JSON tree at /tree. + # Stage the JSON tree at /tree (single chunk — read whole). json_bytes = tree.model_dump_json().encode("utf-8") - self.document.root.arrays["tree"] = ZarrArray(data=np.frombuffer(json_bytes, dtype=np.uint8)) + tree_data = np.frombuffer(json_bytes, dtype=np.uint8) + self.document.root.arrays["tree"] = ZarrArray(data=tree_data, chunks=tree_data.shape) # Stage the AST WCS string at /wcs_ast when a frame set is registered. 
wcs_ast_path: str | None = None @@ -381,9 +384,9 @@ def _stage_wcs_ast(self, frame_set: FrameSet) -> str: stream = StringStream() Channel(stream, options="Full=-1,Comment=0,Indent=0").write(frame_set) text = stream.getSinkData() - self.document.root.arrays["wcs_ast"] = ZarrArray( - data=np.frombuffer(text.encode("utf-8"), dtype=np.uint8) - ) + wcs_data = np.frombuffer(text.encode("utf-8"), dtype=np.uint8) + # Single chunk: WCS is always read whole. + self.document.root.arrays["wcs_ast"] = ZarrArray(data=wcs_data, chunks=wcs_data.shape) return "wcs_ast" @staticmethod From 6aa42caa5725937d3fb14c3ec5c20f1a228255ae Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 08:49:23 -0700 Subject: [PATCH 33/60] feat: rename on-disk Pydantic tree from /tree to /lsst_json MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Mirrors the naming convention used by the other backends — the FITS backend stores the tree in a 'JSON' HDU and the NDF backend at /MORE/LSST/JSON. The plain 'tree' name was an internal pick that did not communicate what the array holds. /lsst_json ← root archive tree //lsst_json ← sub-archive trees written by serialize_pointer Root attr renamed in parallel: lsst.tree='tree' -> lsst.json='lsst_json'. Method names (add_tree, get_tree) stay — they are internal vocabulary. 
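For orientation, the renamed paths can be summarised with a small sketch. `archive_path_to_zarr_path` follows the body shown in this patch's `_common.py` hunk; `pointer_path` is a hypothetical helper that only mimics the f-string used in `serialize_pointer`, and the `/psf` argument is an example value.

```python
def archive_path_to_zarr_path(archive_path: str) -> str:
    """Translate a serialization archive path to its zarr path."""
    if not archive_path:
        # The empty archive path maps to the root-level JSON tree.
        return "/lsst_json"
    # Non-empty paths are kept verbatim, with a single leading slash.
    return "/" + archive_path.strip("/")


def pointer_path(sub_zarr_path: str) -> str:
    """Hypothetical helper: where a sub-archive's JSON tree is stored."""
    return f"{sub_zarr_path}/lsst_json"


root_tree = archive_path_to_zarr_path("")
image_path = archive_path_to_zarr_path("image")
psf_tree = pointer_path("/psf")
```

Only the array name changes under the rename; the mapping of non-empty archive paths is untouched.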
Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/__init__.py | 6 ++-- python/lsst/images/zarr/_common.py | 8 ++--- python/lsst/images/zarr/_input_archive.py | 10 ++++--- python/lsst/images/zarr/_layout.py | 4 +-- python/lsst/images/zarr/_output_archive.py | 19 ++++++------ tests/test_zarr_common.py | 4 +-- tests/test_zarr_input_archive.py | 34 ++++++++++++---------- tests/test_zarr_model.py | 4 +-- tests/test_zarr_output_archive.py | 8 ++--- 9 files changed, 52 insertions(+), 45 deletions(-) diff --git a/python/lsst/images/zarr/__init__.py b/python/lsst/images/zarr/__init__.py index f844900b..0d213129 100644 --- a/python/lsst/images/zarr/__init__.py +++ b/python/lsst/images/zarr/__init__.py @@ -35,8 +35,10 @@ - ``image``, ``variance``, ``mask`` arrays at the root, shaped ``(Y, X)`` with shared chunk sizes. -- ``tree`` — 1-D ``uint8`` zarr array containing UTF-8 JSON of the - Pydantic archive tree (the round-trip authority). +- ``lsst_json`` — 1-D ``uint8`` zarr array containing UTF-8 JSON of + the Pydantic archive tree (the round-trip authority). The same name + convention is used by the FITS backend's ``JSON`` HDU and the NDF + backend's ``/MORE/LSST/JSON`` path. - ``wcs_ast`` — 1-D ``uint8`` zarr array containing the AST FrameSet text (the WCS round-trip authority), when a projection exists. diff --git a/python/lsst/images/zarr/_common.py b/python/lsst/images/zarr/_common.py index e3e4c003..c45caf39 100644 --- a/python/lsst/images/zarr/_common.py +++ b/python/lsst/images/zarr/_common.py @@ -51,11 +51,11 @@ class ZarrPointerModel(pydantic.BaseModel): Used by `ZarrOutputArchive` / `ZarrInputArchive` to point to sub-trees that have been hoisted out of the main JSON tree into separate zarr arrays. The path is interpreted relative to the - archive root, e.g. ``"/lsst/psf/tree"``. + archive root, e.g. ``"/lsst/psf/lsst_json"``. """ path: str - """Absolute zarr path (e.g. ``/lsst/psf/tree``).""" + """Absolute zarr path (e.g. 
``/lsst/psf/lsst_json``).""" @dataclass(frozen=True) @@ -94,13 +94,13 @@ def archive_path_to_zarr_path(archive_path: str) -> str: """Translate a serialization archive path to its zarr path. The empty archive path maps to the root-level JSON tree at - ``/tree``. Non-empty archive paths are kept verbatim (with a + ``/lsst_json``. Non-empty archive paths are kept verbatim (with a leading slash). The v1 design's JSON-pointer mapping table is intentionally absent: arrays land where their archive name says they do. """ if not archive_path: - return "/tree" + return "/lsst_json" stripped = archive_path.strip("/") return f"/{stripped}" diff --git a/python/lsst/images/zarr/_input_archive.py b/python/lsst/images/zarr/_input_archive.py index 347d9a44..cf398015 100644 --- a/python/lsst/images/zarr/_input_archive.py +++ b/python/lsst/images/zarr/_input_archive.py @@ -69,13 +69,15 @@ def document(self) -> ZarrDocument: return self._document def get_tree[T: ArchiveTree](self, model_type: type[T]) -> T: - """Read and validate the main Pydantic tree at ``/tree``.""" + """Read and validate the main Pydantic tree at ``/lsst_json``.""" try: - node = self._document.root.get("/tree") + node = self._document.root.get("/lsst_json") except KeyError: - raise ArchiveReadError("File has no /tree array; this is not an LSST zarr archive.") from None + raise ArchiveReadError( + "File has no /lsst_json array; this is not an LSST zarr archive." 
+ ) from None if not isinstance(node, ZarrArray): - raise ArchiveReadError("/tree must be a zarr array, not a group.") + raise ArchiveReadError("/lsst_json must be a zarr array, not a group.") json_bytes = bytes(node.read()) return model_type.model_validate_json(json_bytes.decode("utf-8")) diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index c2976aad..78a312fe 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -259,8 +259,8 @@ def _decorate_walk(group: Any) -> None: for sub in group.groups.values(): if "image" in sub.arrays: sub.attributes.lsst.setdefault("archive_class", "Image") - if "tree" in sub.arrays: - sub.attributes.lsst.setdefault("tree", "tree") + if "lsst_json" in sub.arrays: + sub.attributes.lsst.setdefault("json", "lsst_json") if "multiscales" not in sub.attributes.ome: multiscale = OmeMultiscale( name="image", diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index 75624bdf..c75fdc67 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -137,10 +137,10 @@ def serialize_pointer[T: ArchiveTree]( model = self.serialize_direct(name, serializer) json_bytes = model.model_dump_json().encode("utf-8") parent = self.document.root.ensure_group(sub_zarr_path) - # Single-chunk storage: the tree is always read whole. + # Single-chunk storage: the JSON tree is always read whole. 
tree_data = np.frombuffer(json_bytes, dtype=np.uint8) - parent.arrays["tree"] = ZarrArray(data=tree_data, chunks=tree_data.shape) - pointer = ZarrPointerModel(path=f"{sub_zarr_path}/tree") + parent.arrays["lsst_json"] = ZarrArray(data=tree_data, chunks=tree_data.shape) + pointer = ZarrPointerModel(path=f"{sub_zarr_path}/lsst_json") self._pointers[key] = pointer return pointer @@ -320,14 +320,15 @@ def add_tree(self, tree: ArchiveTree) -> None: Called once after the user's serializer has populated arrays / sub-trees. Sets the ``lsst.*`` and ``ome.*`` blocks on the - root group, stages ``/tree`` as 1-D ``uint8`` UTF-8 JSON, and - runs the affine residual validator if the archive carries a - frame set. + root group, stages ``/lsst_json`` as 1-D ``uint8`` UTF-8 JSON, + and runs the affine residual validator if the archive carries + a frame set. """ - # Stage the JSON tree at /tree (single chunk — read whole). + # Stage the JSON tree at /lsst_json (single chunk — read whole). + # Name mirrors NDF's /MORE/LSST/JSON and FITS's "JSON" HDU. json_bytes = tree.model_dump_json().encode("utf-8") tree_data = np.frombuffer(json_bytes, dtype=np.uint8) - self.document.root.arrays["tree"] = ZarrArray(data=tree_data, chunks=tree_data.shape) + self.document.root.arrays["lsst_json"] = ZarrArray(data=tree_data, chunks=tree_data.shape) # Stage the AST WCS string at /wcs_ast when a frame set is registered. wcs_ast_path: str | None = None @@ -337,7 +338,7 @@ def add_tree(self, tree: ArchiveTree) -> None: # Root LSST attrs. 
lsst = self.document.root.attributes.lsst lsst["archive_class"] = self._archive_class - lsst["tree"] = "tree" + lsst["json"] = "lsst_json" if wcs_ast_path is not None: lsst["wcs_ast"] = wcs_ast_path if "cell_grid" in self._archive_metadata: diff --git a/tests/test_zarr_common.py b/tests/test_zarr_common.py index 0e19105d..de4b5a1f 100644 --- a/tests/test_zarr_common.py +++ b/tests/test_zarr_common.py @@ -37,7 +37,7 @@ class CommonTestCase(unittest.TestCase): """Tests for the zarr ``_common`` module.""" def test_pointer_round_trips(self) -> None: - original = ZarrPointerModel(path="/lsst/psf/tree") + original = ZarrPointerModel(path="/lsst/psf/lsst_json") recovered = ZarrPointerModel.model_validate_json(original.model_dump_json()) self.assertEqual(recovered, original) @@ -49,7 +49,7 @@ def test_constants(self) -> None: def test_archive_path_translation(self) -> None: # Empty archive path -> the canonical root-level JSON tree. - self.assertEqual(archive_path_to_zarr_path(""), "/tree") + self.assertEqual(archive_path_to_zarr_path(""), "/lsst_json") # Non-empty archive paths are kept verbatim. 
self.assertEqual(archive_path_to_zarr_path("/image"), "/image") self.assertEqual(archive_path_to_zarr_path("image"), "/image") diff --git a/tests/test_zarr_input_archive.py b/tests/test_zarr_input_archive.py index 041b83a5..c479ecd7 100644 --- a/tests/test_zarr_input_archive.py +++ b/tests/test_zarr_input_archive.py @@ -103,7 +103,7 @@ def test_future_version_refused(self) -> None: LSST_NS: { "version": LSST_VERSION + 1, "archive_class": "Image", - "tree": "tree", + "json": "lsst_json", } } ) @@ -124,14 +124,14 @@ def test_subset_read_touches_only_intersecting_chunks(self) -> None: LSST_NS: { "version": LSST_VERSION, "archive_class": "Image", - "tree": "tree", + "json": "lsst_json", } } ) zarr_array = root.create_array(name="image", shape=(16, 16), chunks=(4, 4), dtype="float32") zarr_array[:] = np.arange(256, dtype=np.float32).reshape(16, 16) - # Stub /tree so the input archive's constructor accepts the file. - tree_arr = root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8") + # Stub /lsst_json so the input archive's constructor accepts the file. 
+ tree_arr = root.create_array(name="lsst_json", shape=(2,), chunks=(2,), dtype="uint8") tree_arr[:] = np.frombuffer(b"{}", dtype=np.uint8) doc = ZarrDocument.from_zarr(store) @@ -167,7 +167,7 @@ def test_unpack_2d_packed_back_to_3d(self) -> None: LSST_NS: { "version": LSST_VERSION, "archive_class": "Mask", - "tree": "tree", + "json": "lsst_json", } } ) @@ -185,7 +185,7 @@ def test_unpack_2d_packed_back_to_3d(self) -> None: "flag_descriptions": ["Bad pixel.", "Saturated.", "Cosmic ray."], } ) - tree_arr = root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8") + tree_arr = root.create_array(name="lsst_json", shape=(2,), chunks=(2,), dtype="uint8") tree_arr[:] = np.frombuffer(b"{}", dtype=np.uint8) doc = ZarrDocument.from_zarr(store) @@ -213,7 +213,7 @@ def test_unpack_uint64_with_5_bytes(self) -> None: LSST_NS: { "version": LSST_VERSION, "archive_class": "Mask", - "tree": "tree", + "json": "lsst_json", } } ) @@ -229,7 +229,7 @@ def test_unpack_uint64_with_5_bytes(self) -> None: "flag_descriptions": [f"Plane {i}." for i in range(40)], } ) - tree_arr = root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8") + tree_arr = root.create_array(name="lsst_json", shape=(2,), chunks=(2,), dtype="uint8") tree_arr[:] = np.frombuffer(b"{}", dtype=np.uint8) doc = ZarrDocument.from_zarr(store) @@ -260,15 +260,17 @@ class _Sub(pydantic.BaseModel): store = zarr.storage.MemoryStore() root = zarr.create_group(store=store, zarr_format=3) - root.update_attributes({LSST_NS: {"version": LSST_VERSION, "archive_class": "Image", "tree": "tree"}}) - # Stub /tree. - tree_arr = root.create_array(name="tree", shape=(2,), chunks=(2,), dtype="uint8") + root.update_attributes( + {LSST_NS: {"version": LSST_VERSION, "archive_class": "Image", "json": "lsst_json"}} + ) + # Stub /lsst_json. 
+ tree_arr = root.create_array(name="lsst_json", shape=(2,), chunks=(2,), dtype="uint8") tree_arr[:] = np.frombuffer(b"{}", dtype=np.uint8) - # Sub-archive with its own /tree at /psf/tree. + # Sub-archive with its own /lsst_json at /psf/lsst_json. json_bytes = b'{"label": "psf"}' psf = root.create_group("psf") arr = psf.create_array( - name="tree", shape=(len(json_bytes),), chunks=(len(json_bytes),), dtype="uint8" + name="lsst_json", shape=(len(json_bytes),), chunks=(len(json_bytes),), dtype="uint8" ) arr[:] = np.frombuffer(json_bytes, dtype=np.uint8) @@ -281,7 +283,7 @@ def deserializer(model, arch): deserialize_calls.append(1) return model - pointer = ZarrPointerModel(path="/psf/tree") + pointer = ZarrPointerModel(path="/psf/lsst_json") first = archive.deserialize_pointer(pointer, _Sub, deserializer) second = archive.deserialize_pointer(pointer, _Sub, deserializer) self.assertEqual(first.label, "psf") @@ -296,8 +298,8 @@ class ZarrInputArchiveTableTestCase(unittest.TestCase): def test_get_table_reconstructs_columns(self) -> None: out = ZarrOutputArchive() out.document.root.attributes.lsst["archive_class"] = "Image" - out.document.root.attributes.lsst["tree"] = "tree" - out.document.root.arrays["tree"] = ZarrArray(data=np.frombuffer(b"{}", dtype=np.uint8)) + out.document.root.attributes.lsst["json"] = "lsst_json" + out.document.root.arrays["lsst_json"] = ZarrArray(data=np.frombuffer(b"{}", dtype=np.uint8)) original = astropy.table.Table( { "x": np.arange(4, dtype=np.int32), diff --git a/tests/test_zarr_model.py b/tests/test_zarr_model.py index e21a14a6..00a9041c 100644 --- a/tests/test_zarr_model.py +++ b/tests/test_zarr_model.py @@ -137,7 +137,7 @@ def test_round_trip_through_memory_store(self) -> None: # Build a flat IR: image, variance, mask siblings at root. 
doc = ZarrDocument(root=ZarrGroup()) doc.root.attributes.lsst["archive_class"] = "MaskedImage" - doc.root.attributes.lsst["tree"] = "tree" + doc.root.attributes.lsst["json"] = "lsst_json" image = ZarrArray(data=np.ones((4, 4), dtype="float32")) image.attributes.extra["_ARRAY_DIMENSIONS"] = ["y", "x"] @@ -150,7 +150,7 @@ def test_round_trip_through_memory_store(self) -> None: doc.root.arrays["mask"] = mask # Stub a 1-D uint8 'tree' array (JSON bytes). - doc.root.arrays["tree"] = ZarrArray(data=np.frombuffer(b"{}", dtype=np.uint8)) + doc.root.arrays["lsst_json"] = ZarrArray(data=np.frombuffer(b"{}", dtype=np.uint8)) store = zarr.storage.MemoryStore() doc.to_zarr(store) diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py index ae27c54d..ae8c538e 100644 --- a/tests/test_zarr_output_archive.py +++ b/tests/test_zarr_output_archive.py @@ -60,12 +60,12 @@ def serializer(arch): pointer = archive.serialize_pointer("psf", serializer, key=12345) self.assertIsInstance(pointer, ZarrPointerModel) - self.assertEqual(pointer.path, "/psf/tree") + self.assertEqual(pointer.path, "/psf/lsst_json") # Cached on second call. again = archive.serialize_pointer("psf", serializer, key=12345) self.assertEqual(again, pointer) # IR holds the JSON bytes as a 1-D uint8 array. - node = archive.document.root.get("/psf/tree") + node = archive.document.root.get("/psf/lsst_json") self.assertEqual(str(node.dtype), "uint8") @@ -195,12 +195,12 @@ def test_write_image_to_local_directory(self) -> None: doc = ZarrDocument.from_zarr(store) # Top-level image and tree are present. self.assertIn("image", doc.root.arrays) - self.assertIn("tree", doc.root.arrays) + self.assertIn("lsst_json", doc.root.arrays) self.assertEqual(doc.root.arrays["image"].shape, (4, 5)) # LSST root attrs. 
lsst_attrs = doc.root.attributes.lsst self.assertEqual(lsst_attrs["archive_class"], "Image") - self.assertEqual(lsst_attrs["tree"], "tree") + self.assertEqual(lsst_attrs["json"], "lsst_json") # OME multiscales points at /image; no projection means # the unit scale is emitted. ome = doc.root.attributes.ome From 897a0077f0bb4a8922eb49e5711716155ab11121 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 08:58:29 -0700 Subject: [PATCH 34/60] docs: drop user-facing wcs_ast docs; WCS lives in lsst_json MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Zarr has no established convention for storing an AST FrameSet text dump; the RFC-5 nonlinear coordinateTransformations and the OME linear approximation are the only proposals on the table. NDF stores the WCS both as JSON and as a separate FrameSet representation; the zarr backend just relies on the JSON tree, which already round-trips SIP polynomials and any other PolyMap-based mapping byte-for-byte. The internal _stage_wcs_ast helper and add_tree's frame-set hook are left in place — they're never reached because no production serialize() calls archive.serialize_frame_set, but the design hooks are kept for future discussion. Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/__init__.py | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/python/lsst/images/zarr/__init__.py b/python/lsst/images/zarr/__init__.py index 0d213129..f8570195 100644 --- a/python/lsst/images/zarr/__init__.py +++ b/python/lsst/images/zarr/__init__.py @@ -38,9 +38,9 @@ - ``lsst_json`` — 1-D ``uint8`` zarr array containing UTF-8 JSON of the Pydantic archive tree (the round-trip authority). The same name convention is used by the FITS backend's ``JSON`` HDU and the NDF - backend's ``/MORE/LSST/JSON`` path. -- ``wcs_ast`` — 1-D ``uint8`` zarr array containing the AST FrameSet - text (the WCS round-trip authority), when a projection exists. 
+ backend's ``/MORE/LSST/JSON`` path. WCS information (including + full SIP / PolyMap distortion coefficients) lives inside this JSON + as part of the projection sub-tree. The mask is a 2-D unsigned integer (``uint8`` for ≤8 planes, up to ``uint64`` for 64 planes; >64 raises). Each pixel's bits encode the @@ -57,13 +57,15 @@ WCS handling ------------ -The AST ``FrameSet`` text at ``wcs_ast`` is the round-trip authority. -The layout layer also emits an OME-NGFF v0.5 affine -``coordinateTransformations`` block that approximates the linear part -of the pixel-to-sky map. Before emitting, residuals are sampled on an -11×11 grid; if the worst pixel-equivalent error exceeds 1.0 pixel, -the affine block is dropped and ``lsst.wcs_simplified_dropped: true`` -is recorded with the observed maximum. +The full WCS (frames, mappings, polynomial distortions) round-trips +through the JSON tree at ``lsst_json``. The layout layer also emits +an OME-NGFF v0.5 affine ``coordinateTransformations`` block on the +root group as a discoverability aid for OME tooling. Before emitting, +residuals are sampled on an 11×11 grid; if the worst pixel-equivalent +error exceeds 1.0 pixel, the affine block is dropped and +``lsst.wcs_simplified_dropped: true`` is recorded with the observed +maximum. The OME block is informational only — readers always +reconstruct the projection from the JSON tree. Cloud-friendly defaults ----------------------- From c8cf4558934167d6cdd0086c2409b042bbb8e103 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 09:53:24 -0700 Subject: [PATCH 35/60] feat: use AST linearApprox for the OME affine fit instead of hand-rolled grid MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Delegate the linear-fit / tolerance check to AST's astLinearApprox. 
We give it the image footprint as bounds and the requested per-pixel tolerance scaled to output (sky) units; AST returns coefficients when a fit within tolerance exists and None otherwise. Saves us the 3-point linearization + 11x11 grid residual sample, plus the hand-rolled great-circle separation helper. The affine block we emit is unchanged in shape: a [scale, affine] pair, where the affine matrix is built from AST's [c0, c1, J] layout (constants first, then column-major Jacobian). AffineCheckResult drops the diagnostic max_residual_pixels field and the corresponding lsst.wcs_simplified_max_residual_pixels root attr, since AST's pass/fail is binary — when dropped, we know the residual exceeded tol but not by how much. lsst.wcs_simplified_dropped is still recorded. Also adds Mapping.linearApprox to the AST bridge. Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/_transforms/_ast.py | 26 ++++ python/lsst/images/zarr/_layout.py | 139 +++++++++------------ python/lsst/images/zarr/_output_archive.py | 8 +- tests/test_zarr_layout.py | 9 +- 4 files changed, 95 insertions(+), 87 deletions(-) diff --git a/python/lsst/images/_transforms/_ast.py b/python/lsst/images/_transforms/_ast.py index 09f4c4c6..ee7ec1f4 100644 --- a/python/lsst/images/_transforms/_ast.py +++ b/python/lsst/images/_transforms/_ast.py @@ -163,6 +163,32 @@ def inverted(self) -> Mapping: copy.invert() return Mapping._wrap(copy) + def linearApprox(self, lbnd: Any, ubnd: Any, tol: float) -> np.ndarray | None: + """Best linear approximation to this mapping over a hyper-box. + + Parameters + ---------- + lbnd, ubnd + Per-axis lower / upper input-coordinate bounds of the + box over which the approximation is required. + tol + Maximum permitted deviation from linearity, expressed + as a positive Cartesian displacement in the output + coordinate system. 
+ + Returns + ------- + coeffs + A 1-D array of length ``Nout * (1 + Nin)`` giving the + linear coefficients: the ``Nout`` constant offsets first, + followed by the ``Nout * Nin`` gradient terms. ``None`` if no + linear fit within ``tol`` exists. + """ + success, coeffs = self._impl.linearapprox(lbnd, ubnd, tol) + if not success: + return None + return np.asarray(coeffs) + class UnitMap(Mapping): def __init__(self, n_coord: int): super().__init__(starlink.Ast.UnitMap(n_coord)) diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index 78a312fe..c93543dd 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -120,16 +120,16 @@ def chunks_aligned_to( @dataclass class AffineCheckResult: - """Result of validating a simplified affine against a full WCS. + """Result of asking AST whether a simplified affine fits a full WCS. When ``dropped`` is False, ``coordinate_transformations`` is the OME-NGFF ``coordinateTransformations`` list to emit. When True, - the caller must omit the block (or emit a unit scale only) and - record ``max_residual_pixels`` as the observed worst error. + AST could not find a linear approximation that stays within the + requested per-pixel tolerance over the whole image footprint, and + the caller must omit the block (or emit a unit scale only). """ dropped: bool - max_residual_pixels: float coordinate_transformations: list[dict[str, Any]] | None @@ -138,75 +138,76 @@ def affine_check( frame_set: Any, image_shape: tuple[int, int], max_residual_pixels: float = 1.0, - grid: int = 11, ) -> AffineCheckResult: """Build an OME affine ``coordinateTransformations`` from ``frame_set``. - The simplified affine is constructed by mapping three reference - pixels (origin and the two unit-axis steps) through ``frame_set`` - to recover the linear coefficients.
The full pixel-to-sky map is - then evaluated at every grid point and compared to the affine's - prediction; the worst great-circle separation is divided by the - pixel scale to get a pixel-equivalent residual. - - If ``max_residual <= max_residual_pixels``, returns a result whose - ``coordinate_transformations`` is the affine block. Otherwise - returns a dropped result and the caller must emit the unit scale - (or no transformations at all). + Delegates to AST's ``linearapprox`` over the image footprint with + a tolerance scaled to ``max_residual_pixels`` of pixel-equivalent + error. AST returns the affine coefficients when the approximation + fits and ``None`` otherwise. + + Parameters + ---------- + frame_set + AST FrameSet whose base→current mapping goes from pixel + coordinates to sky. + image_shape + ``(h, w)`` of the image; used as the bounds of the box AST is + asked to approximate over. + max_residual_pixels + Maximum permitted deviation, in pixels, of any point in the + box from the linear prediction. AST is given the equivalent + threshold in output (sky) units after multiplying by the local + pixel scale. """ h, w = image_shape + mapping = frame_set.getMapping(frame_set.base, frame_set.current) - pixels = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]) - sky_at_ref = _frame_set_apply(frame_set, pixels) - origin = sky_at_ref[0] - dxsky = sky_at_ref[1] - origin - dysky = sky_at_ref[2] - origin - affine_matrix = np.array( - [ - [dxsky[0], dysky[0], origin[0]], - [dxsky[1], dysky[1], origin[1]], - [0.0, 0.0, 1.0], - ] - ) - - pixel_scale_y = float(np.linalg.norm(dysky)) - pixel_scale_x = float(np.linalg.norm(dxsky)) - pixel_scale = float(np.sqrt(pixel_scale_y * pixel_scale_x)) + # Local pixel scale near the image origin: convert the user-supplied + # pixel tolerance into the output-coordinate units AST expects. 
+ sample = _frame_set_apply(frame_set, np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])) + origin = sample[0] + pixel_scale_axis0 = float(np.linalg.norm(sample[1] - origin)) + pixel_scale_axis1 = float(np.linalg.norm(sample[2] - origin)) + pixel_scale = float(np.sqrt(pixel_scale_axis0 * pixel_scale_axis1)) if pixel_scale <= 0.0: - return AffineCheckResult( - dropped=True, - max_residual_pixels=float("inf"), - coordinate_transformations=None, + return AffineCheckResult(dropped=True, coordinate_transformations=None) + + tol_output = max_residual_pixels * pixel_scale + coeffs = mapping.linearApprox( + [0.0, 0.0], + [float(max(h - 1, 0)), float(max(w - 1, 0))], + tol_output, + ) + if coeffs is None: + # AST could not find a linear approximation within the requested + # tolerance over the image footprint. + return AffineCheckResult(dropped=True, coordinate_transformations=None) + + # AST coeffs layout for (Nin=2, Nout=2): the first Nout entries are + # the per-output constants; the remaining Nout*Nin entries are the + # Jacobian, ordered column-major (all ∂y/∂x_0 first, then all + # ∂y/∂x_1, etc.). + if len(coeffs) != 6: + raise ValueError( + f"linearApprox returned {len(coeffs)} coefficients; expected 6 for a 2-D pixel→sky mapping." ) + c0, c1, j00, j10, j01, j11 = (float(x) for x in coeffs) + # OME affine: rows are output axes, columns are input axes + the + # constant offset. + affine_matrix = [[j00, j01, c0], [j10, j11, c1], [0.0, 0.0, 1.0]] - ys = np.linspace(0.0, max(h - 1, 0), grid) - xs = np.linspace(0.0, max(w - 1, 0), grid) - grid_pixels = np.array([[y, x] for y in ys for x in xs]) - sky_full = _frame_set_apply(frame_set, grid_pixels) - affine_pred = (affine_matrix[:2, :2] @ grid_pixels.T).T + origin - great_circle = _angular_separation(sky_full, affine_pred) - max_residual = float(np.max(great_circle) / pixel_scale) + # Pixel scale per input axis: length of the corresponding Jacobian + # column in output coordinates. 
+ scale_axis0 = float(np.hypot(j00, j10)) + scale_axis1 = float(np.hypot(j01, j11)) coordinate_transformations: list[dict[str, Any]] = [ - { - "type": "scale", - "scale": [pixel_scale_y, pixel_scale_x], - }, - { - "type": "affine", - "affine": affine_matrix.tolist(), - }, + {"type": "scale", "scale": [scale_axis0, scale_axis1]}, + {"type": "affine", "affine": affine_matrix}, ] - - if max_residual > max_residual_pixels: - return AffineCheckResult( - dropped=True, - max_residual_pixels=max_residual, - coordinate_transformations=None, - ) return AffineCheckResult( dropped=False, - max_residual_pixels=max_residual, coordinate_transformations=coordinate_transformations, ) @@ -219,26 +220,6 @@ def _frame_set_apply(frame_set: Any, pixels: Any) -> Any: return np.asarray(out).T -def _angular_separation(a: Any, b: Any) -> Any: - """Element-wise great-circle separation between two (lon, lat) arrays. - - Inputs in radians (AST default for unit sky frames). Returns a 1-D - array of separations in the same units as the input. - """ - a = np.asarray(a) - b = np.asarray(b) - lon_a, lat_a = a[:, 0], a[:, 1] - lon_b, lat_b = b[:, 0], b[:, 1] - dlon = lon_b - lon_a - return np.arccos( - np.clip( - np.sin(lat_a) * np.sin(lat_b) + np.cos(lat_a) * np.cos(lat_b) * np.cos(dlon), - -1.0, - 1.0, - ) - ) - - def decorate_sub_archives(document: ZarrDocument) -> None: """Decorate sub-archive groups with ``lsst.archive_class`` and OME attrs. 
diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index c75fdc67..98a7a44b 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -360,12 +360,8 @@ def add_tree(self, tree: ArchiveTree) -> None: image_shape=image_array.shape, max_residual_pixels=1.0, ) - if check.dropped: - lsst["wcs_simplified_dropped"] = True - lsst["wcs_simplified_max_residual_pixels"] = check.max_residual_pixels - else: - lsst["wcs_simplified_dropped"] = False - lsst["wcs_simplified_max_residual_pixels"] = check.max_residual_pixels + lsst["wcs_simplified_dropped"] = check.dropped + if not check.dropped: ct = check.coordinate_transformations multiscale = OmeMultiscale( name=self._archive_class.lower(), diff --git a/tests/test_zarr_layout.py b/tests/test_zarr_layout.py index 8697488f..ea118e1c 100644 --- a/tests/test_zarr_layout.py +++ b/tests/test_zarr_layout.py @@ -121,7 +121,12 @@ def test_pure_linear_passes(self) -> None: ) self.assertFalse(result.dropped) self.assertIsNotNone(result.coordinate_transformations) - self.assertLess(result.max_residual_pixels, 1e-6) + # The affine block AST returns has the scale embedded in the + # diagonal of the affine matrix. 
+ affine_block = result.coordinate_transformations[1] + self.assertEqual(affine_block["type"], "affine") + self.assertAlmostEqual(affine_block["affine"][0][0], 0.2) + self.assertAlmostEqual(affine_block["affine"][1][1], 0.2) def test_high_distortion_drops_block(self) -> None: fs = self._make_distorted_frame_set() @@ -131,7 +136,7 @@ def test_high_distortion_drops_block(self) -> None: max_residual_pixels=1.0, ) self.assertTrue(result.dropped) - self.assertGreater(result.max_residual_pixels, 1.0) + self.assertIsNone(result.coordinate_transformations) @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") From 9c2f01e275fa4399728108cb69b9564392a510bb Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 10:32:57 -0700 Subject: [PATCH 36/60] fix: clear mypy errors in the zarr backend MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Replace zarr.storage.Store annotations with zarr.abc.store.Store (Store is exposed there, not from zarr.storage). - In _store.py, hoist the Store annotation out of the branch-local store assignments so each concrete subtype (ZipStore, FsspecStore, LocalStore) is accepted; ZipStore stays in its own variable since its close() guard reads ZipStore-specific state. - In _model.py, narrow the lazy-handle read return to np.ndarray via asarray, import BytesCodec/BloscCodec from zarr.codecs directly (the codecs submodule is not surfaced on the top-level zarr package's stub), and cast cname/shuffle to BloscCodec's enum-typed arguments — runtime accepts plain strings. - In _output_archive.py, drop None placeholders from MaskSchema iteration before building CfFlagAttributes; assert the top-level image is 2-D before passing its shape to affine_check; cast FrameSet to AST Object for Channel.write. 
Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_model.py | 21 ++++++++++----- python/lsst/images/zarr/_output_archive.py | 30 +++++++++++++++++---- python/lsst/images/zarr/_store.py | 31 ++++++++++++---------- 3 files changed, 57 insertions(+), 25 deletions(-) diff --git a/python/lsst/images/zarr/_model.py b/python/lsst/images/zarr/_model.py index 399ae28d..07307ee5 100644 --- a/python/lsst/images/zarr/_model.py +++ b/python/lsst/images/zarr/_model.py @@ -39,10 +39,12 @@ from dataclasses import dataclass, field from types import EllipsisType -from typing import Any, Self +from typing import Any, Self, cast import numpy as np import zarr +from zarr.abc.store import Store +from zarr.codecs import BloscCodec, BytesCodec from ._common import LSST_NS, LSST_VERSION, OME_NS, OME_VERSION, ZarrCompressionOptions @@ -154,7 +156,8 @@ def read(self, *, slices: tuple[slice, ...] | EllipsisType = ...) -> np.ndarray: """ if isinstance(self.data, np.ndarray): return self.data if slices is ... else self.data[slices] - return self.data[...] if slices is ... else self.data[slices] + result = self.data[...] if slices is ... 
else self.data[slices] + return np.asarray(result) @dataclass @@ -204,12 +207,12 @@ class ZarrDocument: root: ZarrGroup = field(default_factory=ZarrGroup) @classmethod - def from_zarr(cls, store: zarr.storage.Store) -> Self: + def from_zarr(cls, store: Store) -> Self: """Open ``store`` and build a lazy IR view of its contents.""" zarr_root = zarr.open_group(store=store, mode="r", zarr_format=3) return cls(root=_group_from_zarr(zarr_root)) - def to_zarr(self, store: zarr.storage.Store) -> None: + def to_zarr(self, store: Store) -> None: """Materialize this IR into ``store`` (which must be empty).""" zarr_root = zarr.create_group(store=store, zarr_format=3, overwrite=False) _group_to_zarr(self.root, zarr_root) @@ -395,8 +398,14 @@ def _build_codecs(options: ZarrCompressionOptions) -> tuple[Any, list[Any]]: """ if options.codec != "blosc": raise NotImplementedError(f"Unsupported codec {options.codec!r}.") - serializer = zarr.codecs.BytesCodec() + serializer = BytesCodec() + # ``cname`` and ``shuffle`` are typed as enum literals on BloscCodec; + # at runtime any equivalent string is accepted, so cast through Any. 
compressors = [ - zarr.codecs.BloscCodec(cname=options.cname, clevel=options.clevel, shuffle=options.shuffle) + BloscCodec( + cname=cast(Any, options.cname), + clevel=options.clevel, + shuffle=cast(Any, options.shuffle), + ) ] return serializer, compressors diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index 98a7a44b..e623b4e5 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -14,7 +14,7 @@ __all__ = ("ZarrOutputArchive", "write") from collections.abc import Callable, Hashable, Iterator, Mapping -from typing import Any, ClassVar +from typing import Any, ClassVar, cast import astropy.io.fits import astropy.table @@ -266,7 +266,13 @@ def _pack_mask(self, array: np.ndarray) -> tuple[np.ndarray, CfFlagAttributes]: packed = np.zeros(array.shape[:2], dtype=target_dtype) for i in range(array.shape[2]): packed |= array[..., i].astype(target_dtype) << (8 * i) - planes = [MaskPlaneEntry(name=p.name, bit=i, description=p.description) for i, p in enumerate(schema)] + # ``MaskSchema`` may carry ``None`` placeholders for retired plane + # bits; drop them in the CF flag list. + planes = [ + MaskPlaneEntry(name=p.name, bit=i, description=p.description) + for i, p in enumerate(schema) + if p is not None + ] return packed, CfFlagAttributes(planes=planes) def add_table( @@ -355,9 +361,15 @@ def add_tree(self, tree: ArchiveTree) -> None: ct: list[dict[str, Any]] | None = None if self._frame_sets: fs = self._frame_sets[0][0] + if len(image_array.shape) != 2: + raise ValueError( + f"Top-level image must be 2-D for the OME affine " + f"check; got shape {image_array.shape}." 
+ ) + image_shape: tuple[int, int] = (image_array.shape[0], image_array.shape[1]) check = affine_check( frame_set=fs, - image_shape=image_array.shape, + image_shape=image_shape, max_residual_pixels=1.0, ) lsst["wcs_simplified_dropped"] = check.dropped @@ -377,9 +389,17 @@ def add_tree(self, tree: ArchiveTree) -> None: decorate_sub_archives(self.document) def _stage_wcs_ast(self, frame_set: FrameSet) -> str: - """Encode an AST FrameSet as UTF-8 text and stage at /wcs_ast.""" + """Encode an AST FrameSet as UTF-8 text and stage at /wcs_ast. + + Currently dead — left for future use; see ``add_tree``'s frame-set + hook. + """ + from .._transforms._ast import Object as _AstObject + stream = StringStream() - Channel(stream, options="Full=-1,Comment=0,Indent=0").write(frame_set) + # FrameSet inherits from Object in our AST bridge; cast for the + # ``Channel.write`` signature which is typed against the base class. + Channel(stream, options="Full=-1,Comment=0,Indent=0").write(cast(_AstObject, frame_set)) text = stream.getSinkData() wcs_data = np.frombuffer(text.encode("utf-8"), dtype=np.uint8) # Single chunk: WCS is always read whole. diff --git a/python/lsst/images/zarr/_store.py b/python/lsst/images/zarr/_store.py index 50ffadbd..032cecef 100644 --- a/python/lsst/images/zarr/_store.py +++ b/python/lsst/images/zarr/_store.py @@ -18,6 +18,7 @@ from contextlib import contextmanager import zarr +from zarr.abc.store import Store from lsst.resources import ResourcePath, ResourcePathExpression @@ -31,7 +32,7 @@ def _is_remote(rp: ResourcePath) -> bool: @contextmanager -def open_store_for_write(path: ResourcePathExpression) -> Iterator[zarr.storage.Store]: +def open_store_for_write(path: ResourcePathExpression) -> Iterator[Store]: """Open a zarr store for writing. Refuses to overwrite a non-empty existing store. The returned @@ -39,18 +40,19 @@ def open_store_for_write(path: ResourcePathExpression) -> Iterator[zarr.storage. finalizes the central directory. 
""" rp = ResourcePath(path) + store: Store if _is_zip(rp): if _is_remote(rp): raise NotImplementedError("Remote ZipStore writes are a follow-up.") local = rp.ospath if os.path.exists(local) and os.path.getsize(local) > 0: raise OSError(f"File {local!r} already exists.") - store = zarr.storage.ZipStore(local, mode="w") + zip_store = zarr.storage.ZipStore(local, mode="w") try: - yield store + yield zip_store finally: - if getattr(store, "_is_open", False): - store.close() + if getattr(zip_store, "_is_open", False): + zip_store.close() return if _is_remote(rp): import fsspec @@ -70,25 +72,26 @@ def open_store_for_write(path: ResourcePathExpression) -> Iterator[zarr.storage. @contextmanager -def open_store_for_read(path: ResourcePathExpression) -> Iterator[zarr.storage.Store]: +def open_store_for_read(path: ResourcePathExpression) -> Iterator[Store]: """Open a zarr store for reading.""" rp = ResourcePath(path) + store: Store if _is_zip(rp): if _is_remote(rp): with rp.as_local() as local: - store = zarr.storage.ZipStore(local.ospath, mode="r") + zip_store = zarr.storage.ZipStore(local.ospath, mode="r") try: - yield store + yield zip_store finally: - if getattr(store, "_is_open", False): - store.close() + if getattr(zip_store, "_is_open", False): + zip_store.close() return - store = zarr.storage.ZipStore(rp.ospath, mode="r") + zip_store = zarr.storage.ZipStore(rp.ospath, mode="r") try: - yield store + yield zip_store finally: - if getattr(store, "_is_open", False): - store.close() + if getattr(zip_store, "_is_open", False): + zip_store.close() return if _is_remote(rp): import fsspec From 3250f8940b63b49f44eb81f091de65069edaf1b5 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 13:45:18 -0700 Subject: [PATCH 37/60] docs: fix sphinx :py:obj: cross-reference targets Eight Sphinx warnings (treated as errors with -W) for unresolvable :py:obj: references in single-backtick form: - `MaskedImage` / `ColorImage` -> use full paths `~lsst.images.MaskedImage` / 
`~lsst.images.ColorImage` so the cross-reference resolves. - `fsspec`, `zarr`, `TemporaryDirectory` -> external / stdlib refs without intersphinx mappings; switch to literal double backticks. - `ZarrDocument` / `ZarrDocument.to_zarr` -> internal IR types not exported from lsst.images.zarr; switch to literal double backticks (matches NDF's _model.NdfDocument convention). Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/tests/_roundtrip.py | 2 +- python/lsst/images/zarr/__init__.py | 8 ++++---- python/lsst/images/zarr/_output_archive.py | 4 ++-- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/python/lsst/images/tests/_roundtrip.py b/python/lsst/images/tests/_roundtrip.py index 1531fedb..992cd00f 100644 --- a/python/lsst/images/tests/_roundtrip.py +++ b/python/lsst/images/tests/_roundtrip.py @@ -335,7 +335,7 @@ class RoundtripZarr[T](RoundtripBase[T]): Zarr archives are directories rather than single files, so the base class's ``NamedTemporaryFile`` pattern doesn't fit. - `_run_without_butler` is overridden to use a `TemporaryDirectory` + ``_run_without_butler`` is overridden to use a ``TemporaryDirectory`` and a fresh archive path inside it. """ diff --git a/python/lsst/images/zarr/__init__.py b/python/lsst/images/zarr/__init__.py index f8570195..cbac1931 100644 --- a/python/lsst/images/zarr/__init__.py +++ b/python/lsst/images/zarr/__init__.py @@ -31,7 +31,7 @@ On-disk layout -------------- -A `MaskedImage` archive contains: +A `~lsst.images.MaskedImage` archive contains: - ``image``, ``variance``, ``mask`` arrays at the root, shaped ``(Y, X)`` with shared chunk sizes. @@ -46,7 +46,7 @@ ``uint64`` for 64 planes; >64 raises). Each pixel's bits encode the applicable mask planes. -For `ColorImage`, the three channels are written as flat 2-D arrays +For `~lsst.images.ColorImage`, the three channels are written as flat 2-D arrays at ``red``, ``green``, ``blue``. 
For ``CellCoadd``, ``image`` / ``variance`` / ``mask`` are siblings @@ -80,7 +80,7 @@ - Both ``DirectoryStore`` and ``ZipStore`` are supported; the choice is driven by URI shape (``*.zarr.zip`` → ``ZipStore``, otherwise directory). Remote URIs (``s3://``, ``gs://``, ``http(s)://``) go - through `lsst.resources.ResourcePath` and `fsspec`. + through `lsst.resources.ResourcePath` and ``fsspec``. Round-trip with FITS -------------------- @@ -101,7 +101,7 @@ pip install lsst-images[zarr] The top-level ``import lsst.images.zarr`` raises a clear -`ImportError` with this guidance if `zarr` is not installed. +`ImportError` with this guidance if ``zarr`` is not installed. """ try: diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index e623b4e5..edc94176 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -64,10 +64,10 @@ class ZarrOutputArchive(OutputArchive[ZarrPointerModel]): - """Output archive that populates a `ZarrDocument` IR. + """Output archive that populates a ``ZarrDocument`` IR. Bytes are not written until the IR is materialized via - `ZarrDocument.to_zarr`, which the public `write` helper performs + ``ZarrDocument.to_zarr``, which the public `write` helper performs on context-manager exit. 
Parameters From 2f1e6430be5a103d95b60dc162e2b0ea861f146e Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Sat, 23 May 2026 16:41:31 -0700 Subject: [PATCH 38/60] docs: expand zarr backend reference page with data model and tooling Adds sections to doc/lsst.images/zarr.rst: - Standards alignment (Zarr v3, xarray/CF, OME-NGFF v0.5, geo-zarr, LSST archive tree) - Data model (lsst_json tree, root attrs, FITS opaque metadata, per-array) - Example layouts for VisitImage, CellCoadd, and ColorImage with the full directory tree and per-array shape / chunk notes - WCS handling (PolyMap chain in JSON, OME affine approx via AST linearapprox) - Tooling that can read these files: xarray, napari-ome-zarr, ome-zarr-py, GDAL/rasterio, zarr-python, napari, neuroglancer, ngff-validator - Round-trip with FITS Generated with AI Co-Authored-By: SLAC AI --- doc/lsst.images/zarr.rst | 143 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 143 insertions(+) diff --git a/doc/lsst.images/zarr.rst b/doc/lsst.images/zarr.rst index fa770962..7ac7c841 100644 --- a/doc/lsst.images/zarr.rst +++ b/doc/lsst.images/zarr.rst @@ -10,6 +10,149 @@ This backend requires the optional ``zarr >= 3.0`` package. Install via the ``[z pip install lsst-images[zarr] +Standards alignment +------------------- + +The on-disk container is `Zarr v3 `_. +On top of that we layer four community standards so the same bytes are usable by tools that don't know anything about LSST: + +* `xarray / CF-conventions `_ — every array carries an ``_ARRAY_DIMENSIONS`` attribute and a v3 ``dimension_names`` metadata field. The mask carries CF ``flag_masks`` / ``flag_meanings`` / ``flag_descriptions`` so any CF-aware tool can interpret the bit assignments. +* `OME-NGFF v0.5 `_ — the root group carries a ``multiscales`` block whose only ``dataset.path`` points back at the same ``image`` array. This makes the same archive openable by OME-Zarr tooling without any byte duplication. 
+* `Geo-Zarr `_ shape compatibility — sibling arrays sharing ``(y, x)`` dimensions with CF flag attributes is the same convention `rasterio` and `GDAL`'s Zarr driver expect for raster + mask layers. +* `LSST archive tree <#data-model>`_ — a Pydantic JSON document at ``/lsst_json`` carries the full LSST-specific metadata (WCS, PSF, detector, butler info, …) that the community standards have no place for. Same convention as the FITS backend's ``JSON`` HDU and the NDF backend's ``/MORE/LSST/JSON`` path. + +Data model +---------- + +Every archive contains the following pieces: + +``/lsst_json`` (1-D ``uint8``) + UTF-8 encoded JSON of the Pydantic archive tree (see `~lsst.images.serialization.ArchiveTree`). + The round-trip authority — every array reference, projection, PSF, mask schema, butler provenance, etc. lives here. + SIP polynomials and other ``PolyMap``-based distortions round-trip byte-exact through the chain of `Mapping `_ models embedded in this JSON. + Stored as a single chunk because it is always read whole. + +Root attributes (``zarr.json`` ``attributes``) + Three namespaces: + + * ``lsst.*`` — backend-specific keys: ``archive_class``, ``json``, ``opaque_metadata_format``, ``cell_grid``, ``wcs_simplified_dropped``. + * ``ome.*`` — OME-NGFF v0.5 ``multiscales`` block (and ``omero/channels`` when a channel axis exists). + * top-level — CF / xarray attributes that aren't tied to a specific axis. + +``/lsst/opaque_metadata/fits/primary`` (2-D ``(N, 80) uint8``) + Present only when an object originated from a FITS read. + Holds the primary HDU's card stream verbatim — ``Header.tostring()`` reshaped one row per card. + ``COMMENT``, ``HISTORY``, ``HIERARCH``, and ``CONTINUE`` cards survive byte-for-byte. + +Per-array data + The ``image`` / ``variance`` / ``mask`` arrays at the root, plus any class-specific extras. 
+ Mask is a 2-D unsigned integer (``uint8`` for ≤8 planes, ``uint64`` for 17–64 planes; >64 raises) with CF ``flag_masks`` / ``flag_meanings`` / ``flag_descriptions``. + +Example layouts +--------------- + +`~lsst.images.VisitImage` +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The most common case — a single detector exposure with a projection, PSF, and detector geometry:: + + visit.zarr/ + ├── zarr.json ← root attrs (lsst.archive_class="VisitImage", + │ ome.multiscales, data_model, version, …) + ├── image/ ← (Y, X) float32, dim_names=["y", "x"] + ├── variance/ ← (Y, X) float64 + ├── mask/ ← (Y, X) packed wide-int with CF flag attrs + ├── lsst_json/ ← 1-D uint8, the LSST archive tree + ├── psf/ ← (PSF parameters as one or more arrays) + └── lsst/opaque_metadata/fits/primary/ ← (N, 80) uint8 (when read from a FITS file) + +The ``lsst_json`` tree carries the projection, PSF type, detector reference, observation summary stats, photometric scaling, aperture-correction map, and any background fields. +For the WCS specifically, the projection's ``pixel_to_sky`` mapping is decomposed into a chain of Frames and Mappings (including any ``PolyMap`` for SIP distortion); reading is byte-exact. + +`~lsst.images.cells.CellCoadd` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A coadd composed of a regular grid of cells, each with its own PSF:: + + coadd.zarr/ + ├── zarr.json ← lsst.archive_class="CellCoadd", + │ lsst.cell_grid={bbox, cell_shape}, + │ ome.multiscales pointing at /image + ├── image/ ← (Y, X) float32, chunks = cell_shape + ├── variance/ ← (Y, X) float64, chunks = cell_shape + ├── mask/ ← (Y, X) packed wide-int, chunks = cell_shape + ├── psf/ ← (Cy, Cx, Py, Px) float32, + │ chunks=(1, 1, Py, Px) — one chunk per cell + ├── lsst_json/ + └── lsst/opaque_metadata/fits/primary/ + +The ``image`` / ``variance`` / ``mask`` chunks are aligned to the cell grid so reading a single cell is one chunk per array. +The ``psf`` array's chunking is per-cell so a single-cell PSF read is also one chunk. 
+ +`~lsst.images.ColorImage` +~~~~~~~~~~~~~~~~~~~~~~~~~ + +A 3-channel display image:: + + color.zarr/ + ├── zarr.json ← lsst.archive_class="ColorImage" + │ (no root-level ome.multiscales) + ├── red/ ← (Y, X) uint8, dim_names=["y", "x"] + ├── green/ ← (Y, X) uint8 + ├── blue/ ← (Y, X) uint8 + └── lsst_json/ + +Channels are flat top-level arrays rather than a stacked ``(3, Y, X)`` array, so xarray sees them as three independent 2-D variables and there is no byte duplication for the OME view. + +WCS handling +------------ + +The full WCS — including SIP polynomials and any other ``PolyMap``-based distortion — round-trips through the JSON tree at ``lsst_json`` as a chain of `~lsst.images.FrameSet` / Mapping models. +The layout layer also asks AST's `linearapprox `_ for an affine approximation over the image footprint at one-pixel accuracy. +If AST returns one, the OME ``coordinateTransformations`` block on the root multiscale is populated with the resulting ``[scale, affine]`` pair. +If AST cannot fit a linear approximation within tolerance, the block is dropped and ``lsst.wcs_simplified_dropped: true`` is set on the root attrs. +The OME block is always informational — readers reconstruct the projection from the JSON tree, never from the OME block. + +Tooling that can read these files +--------------------------------- + +The standards-aligned root layout means tools that don't know about LSST can still open the file in some useful capacity: + +`xarray `_ + ``xr.open_zarr(path)`` returns a ``Dataset`` with one ``DataArray`` per zarr array sharing ``(y, x)`` dimensions, CF flag attributes on the mask variable, and any per-array ``units`` / ``long_name``. + The Pydantic JSON tree at ``/lsst_json`` shows up as a 1-D ``uint8`` variable; xarray ignores it for analysis, you decode it manually if you need the LSST metadata. + +`napari-ome-zarr `_ and `ome-zarr-py `_ + Browse and visualize the science image through the OME-NGFF multiscales block. 
+ Sees the ``image`` array as the only level of a single multiscale; ignores everything else. + +`GDAL `_'s Zarr driver and `rasterio `_ + Opens individual top-level arrays as raster bands. + Reads CF attributes including the mask's ``flag_masks`` / ``flag_meanings``. + +`zarr-python `_ + Direct array access at any path, including from S3 / GCS / HTTP via fsspec. + Subset reads via ``arr[y0:y1, x0:x1]`` only fetch chunks intersecting the slice. + +`napari `_ via the OME-Zarr plugin + Same OME view as ``napari-ome-zarr``. + +`neuroglancer `_ + Native OME-NGFF support; will display the science image with the affine ``coordinateTransformations`` block when present. + +`ngff-validator `_ + Validates the OME-NGFF v0.5 metadata block against the schema. + +Round-trip with FITS +-------------------- + +When an object that originated from a FITS read carries a `~lsst.images.fits.FitsOpaqueMetadata`, the primary-HDU header is preserved at ``/lsst/opaque_metadata/fits/primary`` as a 2-D ``(N, 80)`` byte array. +Reading the zarr back attaches an equivalent ``FitsOpaqueMetadata`` to the deserialized object so a subsequent FITS write reproduces the original cards. +This means an ``LSSTCam`` raw read in via FITS, written to zarr, read back, and written again to FITS will round-trip the full primary header — including ``COMMENT``, ``HISTORY``, ``HIERARCH``, and ``CONTINUE`` cards — byte-for-byte. + +API reference +------------- + .. automodapi:: lsst.images.zarr :no-inheritance-diagram: :include-all-objects: From 5ca4c74e64c947e1557543fd7f6a13b58c726319 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 09:06:39 -0700 Subject: [PATCH 39/60] docs: spec for default zarr v3 sharding and smaller chunk default Brainstormed design for adding automatic shard defaulting and dropping the per-axis chunk default from 1024 to 256. No public API changes. Adds DEFAULT_TARGET_SHARD_BYTES (env-overridable) and a pure default_shards helper next to chunks_for in _layout. 
Generated with AI Co-Authored-By: SLAC AI --- .../specs/2026-05-25-zarr-sharding-design.md | 272 ++++++++++++++++++ 1 file changed, 272 insertions(+) create mode 100644 docs/superpowers/specs/2026-05-25-zarr-sharding-design.md diff --git a/docs/superpowers/specs/2026-05-25-zarr-sharding-design.md b/docs/superpowers/specs/2026-05-25-zarr-sharding-design.md new file mode 100644 index 00000000..b638f627 --- /dev/null +++ b/docs/superpowers/specs/2026-05-25-zarr-sharding-design.md @@ -0,0 +1,272 @@ +# Zarr v3 Sharding & Smaller Chunk Defaults + +Date: 2026-05-25 +Status: approved (brainstorming complete; awaiting implementation plan) + +## Background + +The zarr v3 backend currently writes arrays without shards. The +`shards` field is plumbed through `ZarrArray` and `ZarrOutputArchive` +all the way to `zarr.create_array`, but the archive never populates +it, so every chunk becomes a separate object on disk / in cloud +storage. The default per-axis chunk limit is 1024, which produces ~4 +MiB float32 chunks — fine for full-image reads but on the larger end +for cutout-style science access. + +Modern zarr v3 guidance for cloud-backed stores is: + +- small-ish *logical chunks* sized for science access patterns; +- larger *physical shards* sized to amortise S3 / GCS request cost; +- avoid `.zarr.zip` for cloud distribution — keep it for packaging + and local export. + +This spec covers the first two. Zip support stays as it is today +(useful for tests and packaging); we are not deprecating it. + +## Goals + +- Default sharding "just works" with no public API changes — the + caller does not have to think about chunk-vs-shard ratios. +- Smaller chunk default for science access (256² for plain images). +- One internal knob (`DEFAULT_TARGET_SHARD_BYTES`) and one env-var + escape hatch for tuning without code changes. +- Old archives continue to read; round-trip data equality is + preserved. + +## Non-goals + +- Changing `ZarrCompressionOptions` defaults. 
+- Re-tuning the `CellCoadd` cell-aligned chunk rule. +- Reading `shards` metadata back into the IR (`ZarrArray.from_zarr` + still ignores it; the input archive slices through `zarr.Array`). +- Adding any new kwarg to public `write_zarr`. +- Deprecating `ZipStore`. + +## Architecture + +Three files are touched. No public API additions or renames. + +``` +python/lsst/images/zarr/ + _common.py # +DEFAULT_CHUNK_AXIS_LIMIT (was hardcoded 1024 in _layout) + # +DEFAULT_TARGET_SHARD_BYTES (env-overridable, read once at import) + _layout.py # chunks_for: clamp constant moves to _common, value 1024 → 256 + # +default_shards(chunks, shape, dtype, *, target_bytes) helper + _output_archive.py # call default_shards alongside chunks_for in add_array; + # IR node gets shards populated when caller did not override +``` + +`_model.py` is **not** modified. `_group_to_zarr` continues to pass +`shards=array.shards` through to `zarr.create_array`. By the time the +IR reaches the writer, every array's `shards` is either explicitly +set by the caller, populated by the default helper, or `None` (for +tiny single-chunk arrays). + +### Why eager defaulting in the archive layer + +This pattern mirrors the existing `chunks_for` / +`chunks_aligned_to` helpers in `_layout.py`. The archive sets +shape-derived defaults at IR-construction time, the model writer +stays a dumb serialiser, and tests can assert IR-level shape +decisions without driving zarr. Lazy defaulting in `_group_to_zarr` +was considered and rejected — it would push policy logic into the +writer and make the IR's effective layout invisible until write +time. + +## Constants + +In `_common.py`: + +- `DEFAULT_CHUNK_AXIS_LIMIT: int = 256` — replaces the hardcoded + `_DEFAULT_AXIS_LIMIT = 1024` currently in `_layout.py`. +- `DEFAULT_TARGET_SHARD_BYTES: int` — `16 * 1024 * 1024` by default. + At import time read `LSST_IMAGES_ZARR_TARGET_SHARD_BYTES`; if set, + parse as base-10 int. 
A `ValueError` from `int()` propagates and + fails import — silent typos are worse than loud failure. No + `1MiB`-style suffix parsing. + +`chunks_for` in `_layout.py` reads `DEFAULT_CHUNK_AXIS_LIMIT` from +`_common`. `chunks_aligned_to` is unchanged — it derives sibling +chunks from `image_chunks`, so it follows the new default +automatically. + +The `CellCoadd` cell-aligned branch and the 4-D PSF branch +(`(1, 1, h, w)`) in `chunks_for` are unchanged — those are +class-specific layout rules, not default-clamp questions. + +## The `default_shards` rule + +Pure function, no archive-class arg, no `archive_metadata` arg: + +```python +def default_shards( + chunks: tuple[int, ...], + shape: tuple[int, ...], + dtype: np.dtype, + *, + target_bytes: int, +) -> tuple[int, ...] | None: + if len(chunks) != len(shape): + raise ValueError("chunks and shape rank mismatch") + itemsize = dtype.itemsize + if itemsize == 0: + return None # object dtype etc. + chunk_bytes = math.prod(chunks) * itemsize + if chunk_bytes >= target_bytes: + return None # one chunk already big enough + growable = [i for i in range(len(shape)) if chunks[i] < shape[i]] + if not growable: + return None # array fits in one chunk per axis + ratio = target_bytes / chunk_bytes + k = max(1, round(ratio ** (1 / len(growable)))) + if k <= 1: + return None # rounding produced a no-op shard + shard = list(chunks) + for i in growable: + n_chunks_axis = math.ceil(shape[i] / chunks[i]) + shard[i] = min(chunks[i] * k, chunks[i] * n_chunks_axis) + return tuple(shard) +``` + +Properties of the rule: + +- **Integer-multiple alignment per axis**: every shard axis is + `chunks[i] * m` for some `m ≥ 1`. zarr v3 requires this. +- **Spatial-only growth falls out for free**: a 3-D mask + `(8, 4096, 4096)` chunked `(8, 256, 256)` has `growable = [1, 2]`, + so the plane axis is left alone. 
+- **Tiny arrays skip sharding**: a `(N, 80)` FITS-card array, a + single-chunk `lsst_json`, or any array whose chunks already cover + every axis returns `None`. +- **CellCoadd PSF** `(25, 25, h, w)` chunked `(1, 1, h, w)` has + `growable = [0, 1]`, so it shards the cell-grid axes only — no + class-specific rule needed. +- **Cap at array bounds**: small arrays do not get shards larger + than their chunk-aligned extent (the array shape rounded up to a + whole number of chunks per axis). + +### Worked examples (target = 16 MiB) + +| array              | shape              | chunks            | dtype   | result                   | +|--------------------|--------------------|-------------------|---------|--------------------------| +| `image` (4k×4k)    | (4096, 4096)       | (256, 256)        | float32 | shard `(2048, 2048)`     | +| `mask` (3-D, 4k)   | (8, 4096, 4096)    | (8, 256, 256)     | uint8   | shard `(8, 1536, 1536)`  | +| `variance` (4k×4k) | (4096, 4096)       | (256, 256)        | float32 | shard `(2048, 2048)`     | +| CellCoadd `psf`    | (25, 25, 150, 150) | (1, 1, 150, 150)  | float32 | shard `(14, 14, 150, 150)` | +| small image        | (600, 600)         | (256, 256)        | float32 | shard `(768, 768)` (capped at 3 chunks per axis) | +| `lsst_json`        | (N,)               | (N,)              | uint8   | `None`                   | +| `wcs_ast`          | (M,)               | (M,)              | uint8   | `None`                   | +| FITS primary       | (N, 80)            | (N, 80)           | uint8   | `None`                   | + +## Per-array behaviour in `ZarrOutputArchive` + +The pattern at every site that decides chunks today +(`_output_archive.py:183` for the MaskedImage path, +`_output_archive.py:202-241` for `add_array`): + +```python +chunks = self._chunks.get(name) or self._chunks.get(leaf) or +shards = self._shards.get(name) or self._shards.get(leaf) +if shards is None: + shards = default_shards( + chunks, packed.shape, packed.dtype, + target_bytes=DEFAULT_TARGET_SHARD_BYTES, + ) +ZarrArray(data=..., chunks=chunks, shards=shards, ...)
+``` + +Coverage by call site: + +| call site | what gets sharded | +|----------------------------------------------|--------------------------------------------| +| MaskedImage path (`_output_archive.py:~183`) | `image`, `variance`, `mask` | +| `add_array` generic (`_output_archive.py:~228`) | top-level sibling arrays | +| `add_array` PSF branch (`_output_archive.py:~223`) | CellCoadd `psf` 4-D | +| JSON tree (`_output_archive.py:~142`, `:~337`) | `lsst_json` — helper returns `None` | +| `wcs_ast` (`_output_archive.py:~406`) | helper returns `None` | +| `serialize_fits_opaque_metadata` (`_layout.py:~281`) | helper returns `None` | + +Bulk pixel arrays (`image`, `variance`, `mask`, `psf`) and any +user-supplied extra arrays large enough to qualify gain `shards`. +Everything tiny / single-chunk is auto-`None`. + +User overrides remain unchanged: passing `shards={"image": (...)}` to +`write_zarr` still wins because the override is consulted before the +default helper. + +## Error handling + +- `default_shards` raises `ValueError` on mismatched ndim between + `chunks` and `shape`, mirroring `chunks_aligned_to`. All other + inputs are total — no exceptions on well-formed numeric data. +- `dtype.itemsize == 0` (object dtype) → `None`. Defensive guard; + object dtypes are not written today. +- Env-var parse failure raises at import. + +## Backward compatibility + +- **Reading old archives**: unaffected. `ZarrArray.from_zarr` does + not consult `shards`. The input archive slices through + `zarr.Array`. +- **Round-trip equality**: byte-equal data round-trips unchanged. + Tests asserting array equality continue to pass. +- **On-disk file counts**: any test asserting a specific file count + on disk needs updating. None known today. +- **Old test fixtures** (e.g. `dp1.zarr/`): readable as before; the + change is write-side only. +- **ZipStore**: unchanged. 
`zarr.storage.ZipStore` accepts sharded + arrays the same way as `LocalStore` — shards inside a zip are + nested keys, no special handling. + +### Performance note + +A 4k×4k float32 image full-read goes from 16 chunks to 256 chunks +when the chunk default drops 1024 → 256. Sharding keeps the I/O +profile identical (4 GETs, same wire bytes), but per-chunk decode +runs 16× more often. Expected to be invisible: blosc-zstd decode +is fast and concurrent. If a benchmark regresses, the fallback is +to bump `DEFAULT_CHUNK_AXIS_LIMIT` to 512. + +## Testing + +### Unit tests for `default_shards` (new file `tests/test_zarr_layout.py`) + +- 4k×4k float32 with `(256, 256)` chunks → `(2048, 2048)`. +- 3-D mask `(8, 4096, 4096)` uint8 with `(8, 256, 256)` chunks → + `(8, 1536, 1536)` — plane axis untouched. +- Tiny 1-D single-chunk array → `None`. +- `chunks == shape` (single-chunk of any size) → `None`. +- `chunk_bytes >= target_bytes` (already-big chunk) → `None`. +- `k <= 1` boundary → `None`. +- Cap at array bounds: shape `(600, 600)`, chunks `(256, 256)`, + ratio 64 → shard `(768, 768)`, not `(2048, 2048)`. +- Mismatched ndim raises `ValueError`. +- `dtype.itemsize == 0` → `None`. + +### Env-var test (`tests/test_zarr_layout.py`) + +- Set `LSST_IMAGES_ZARR_TARGET_SHARD_BYTES`, re-import the module + in a subprocess (cleanest way to re-run import-time init), assert + the constant changed. +- Garbage value raises at import. + +### Round-trip / integration (extend existing zarr round-trip tests) + +- Assert one large-image round-trip writes an `image` array whose + on-disk metadata has non-`None` `shards` and shards are integer + multiples of chunks per axis. +- Assert `lsst_json` and `wcs_ast` arrays come back with `shards` + unset (or `None` in metadata). +- CellCoadd round-trip: assert PSF `psf` array's `shards != chunks` + (i.e. the byte-budget rule actually fired). +- Existing data-equality round-trip checks are unmodified and + continue to gate correctness. 
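The chunk and shard counts quoted in the performance note, and the integer-multiple assertion planned for the round-trip tests, can be checked with plain arithmetic. This is a minimal sketch; `count_blocks` and `is_chunk_multiple` are hypothetical helper names used for illustration and are not part of `lsst.images.zarr`.

```python
import math


def count_blocks(shape: tuple[int, ...], block: tuple[int, ...]) -> int:
    """Number of blocks of shape ``block`` needed to tile ``shape``.

    Partial blocks at the array edge count as whole blocks.
    """
    return math.prod(math.ceil(s / b) for s, b in zip(shape, block))


def is_chunk_multiple(chunks: tuple[int, ...], shards: tuple[int, ...]) -> bool:
    """True when every shard axis is a positive integer multiple of its chunk axis."""
    return len(chunks) == len(shards) and all(
        s >= c and s % c == 0 for c, s in zip(chunks, shards)
    )


shape = (4096, 4096)
old_chunks = count_blocks(shape, (1024, 1024))  # chunks under the old 1024 default
new_chunks = count_blocks(shape, (256, 256))    # chunks under the new 256 default
new_shards = count_blocks(shape, (2048, 2048))  # stored objects once sharded
print(old_chunks, new_chunks, new_shards)       # 16 256 4
```

An integration test could apply a check like `is_chunk_multiple` to the `chunks` and `shards` values it reads back from the written array's metadata.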
+ +### Zip round-trip (extend `tests/test_zarr_store.py`) + +- Add one assertion to an existing zip test that a sharded + write/read round-trips through `ZipStore` cleanly. + +### Verification command + +`.pyenv/bin/python -m pytest tests/ -k zarr` is the gate for the +implementation phase. From 8fe2b0e36e0ec3b92b6a565b63cdaf4e2074192b Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 09:14:06 -0700 Subject: [PATCH 40/60] docs: implementation plan for zarr default sharding Seven-task TDD plan covering: lower chunk default 1024->256, add DEFAULT_TARGET_SHARD_BYTES env-var-driven constant, add default_shards helper with byte-budget rule, wire it into ZarrOutputArchive.add_array, integration tests for image / PSF / zip-store round-trips, and a final regression sweep. Generated with AI Co-Authored-By: SLAC AI --- .../plans/2026-05-25-zarr-sharding.md | 888 ++++++++++++++++++ 1 file changed, 888 insertions(+) create mode 100644 docs/superpowers/plans/2026-05-25-zarr-sharding.md diff --git a/docs/superpowers/plans/2026-05-25-zarr-sharding.md b/docs/superpowers/plans/2026-05-25-zarr-sharding.md new file mode 100644 index 00000000..15acf24d --- /dev/null +++ b/docs/superpowers/plans/2026-05-25-zarr-sharding.md @@ -0,0 +1,888 @@ +# Zarr v3 Default Sharding Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Enable automatic zarr v3 sharding for bulk pixel arrays and drop the per-axis chunk default from 1024 to 256, with no public API changes. + +**Architecture:** A pure `default_shards(chunks, shape, dtype, *, target_bytes)` helper lives next to `chunks_for` in `_layout.py`. Two module-level constants (chunk-axis limit and target shard bytes) live in `_common.py`; the shard target reads `LSST_IMAGES_ZARR_TARGET_SHARD_BYTES` once at import. 
`ZarrOutputArchive.add_array` calls the helper at the same point chunks are decided, so the IR's `ZarrArray.shards` is populated whenever the caller didn't supply an override. The model writer (`_group_to_zarr`) is unchanged. + +**Tech Stack:** Python 3.12, zarr v3 (`zarr-python` 3.x), numpy, unittest. Project uses `.pyenv/bin/python` to run; system Python lacks zarr. + +**Design spec:** `docs/superpowers/specs/2026-05-25-zarr-sharding-design.md` + +--- + +## File Structure + +| Path | Change | Responsibility | +|---------------------------------------------------|----------|---------------------------------------------| +| `python/lsst/images/zarr/_common.py` | modify | add `DEFAULT_CHUNK_AXIS_LIMIT`, `DEFAULT_TARGET_SHARD_BYTES` | +| `python/lsst/images/zarr/_layout.py` | modify | drop hardcoded 1024, read from `_common`; add `default_shards` | +| `python/lsst/images/zarr/_output_archive.py` | modify | call `default_shards` alongside `chunks_for` at the two existing sites | +| `tests/test_zarr_layout.py` | modify | update existing chunk-default test; add `default_shards` unit tests | +| `tests/test_zarr_common.py` | modify | add subprocess-based env-var tests | +| `tests/test_zarr_round_trip.py` | modify | add a 300×300 round-trip that asserts on-disk `shards` is set | +| `tests/test_zarr_output_archive.py` | modify | add a CellCoadd PSF shard-defaulting test | +| `tests/test_zarr_store.py` | modify | add a sharded write/read through `ZipStore` | + +No new files. 
+ +--- + +## Task 1: Lower the chunk-axis default to 256 (test-first) + +**Files:** +- Modify: `tests/test_zarr_layout.py:56-60` (`test_chunks_for_default`) +- Modify: `python/lsst/images/zarr/_common.py` (add constant + export) +- Modify: `python/lsst/images/zarr/_layout.py:50` (replace hardcoded 1024) + +- [ ] **Step 1: Update the existing chunk-default test to expect 256** + +In `tests/test_zarr_layout.py`, replace `test_chunks_for_default` (currently lines 56–60) with: + +```python + def test_chunks_for_default(self) -> None: + # Plain images clamp to the per-axis chunk limit (256 by default). + self.assertEqual(chunks_for("Image", (4096, 4096), None), (256, 256)) + # Smaller than the limit -> use full dim. + self.assertEqual(chunks_for("Image", (200, 100), None), (200, 100)) +``` + +Also update `test_chunks_for_cell_coadd_without_metadata_falls_back` (currently lines 73–74): + +```python + def test_chunks_for_cell_coadd_without_metadata_falls_back(self) -> None: + self.assertEqual(chunks_for("CellCoadd", (4096, 4096), None), (256, 256)) +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_layout.py::LayoutTestCase::test_chunks_for_default -v` + +Expected: FAIL — actual is `(1024, 1024)`, expected is `(256, 256)`. + +- [ ] **Step 3: Add `DEFAULT_CHUNK_AXIS_LIMIT` to `_common.py`** + +Edit `python/lsst/images/zarr/_common.py`: + +Add to `__all__` (currently lines 14–23): + +```python +__all__ = ( + "DEFAULT_CHUNK_AXIS_LIMIT", + "LSST_NS", + "LSST_VERSION", + "OME_NS", + "OME_VERSION", + "ZarrCompressionOptions", + "ZarrPointerModel", + "archive_path_to_zarr_path", + "mask_dtype_for_plane_count", +) +``` + +After `LSST_VERSION = 1` (line 40) and its docstring, add: + +```python +DEFAULT_CHUNK_AXIS_LIMIT = 256 +"""Per-axis cap on the auto-derived chunk shape for plain image arrays. 
+ +Used by `lsst.images.zarr._layout.chunks_for` when the caller does not +supply an explicit override and the archive class does not have a +class-specific chunk rule. Chunks of ~256 elements per spatial axis +trade some compression ratio for cutout-friendly partial reads. +""" +``` + +- [ ] **Step 4: Read the constant from `_layout.py`** + +Edit `python/lsst/images/zarr/_layout.py`. + +Add a new import. The existing import section already pulls from `..fits._common` and from `._model`; add a sibling line: + +```python +from ._common import DEFAULT_CHUNK_AXIS_LIMIT +``` + +Delete the line `_DEFAULT_AXIS_LIMIT = 1024` (currently the only module-level numeric constant in this file; sits just before `axes_for_archive_class`). + +In `chunks_for`, replace the `_DEFAULT_AXIS_LIMIT` reference at the very end of the function: + +```python + return tuple(min(_DEFAULT_AXIS_LIMIT, dim) for dim in shape) +``` + +with + +```python + return tuple(min(DEFAULT_CHUNK_AXIS_LIMIT, dim) for dim in shape) +``` + +- [ ] **Step 5: Run the test to verify it passes** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_layout.py -v` + +Expected: PASS for `test_chunks_for_default`, `test_chunks_for_cell_coadd_without_metadata_falls_back`, and all other tests in the file. + +- [ ] **Step 6: Commit** + +```bash +git add python/lsst/images/zarr/_common.py python/lsst/images/zarr/_layout.py tests/test_zarr_layout.py +git commit -m "feat(zarr): drop chunk-axis default 1024 -> 256, centralize constant" +``` + +--- + +## Task 2: Add `DEFAULT_TARGET_SHARD_BYTES` constant with env-var override + +**Files:** +- Modify: `python/lsst/images/zarr/_common.py` (add constant + env-var read) +- Modify: `tests/test_zarr_common.py` (add subprocess tests) + +- [ ] **Step 1: Inspect `tests/test_zarr_common.py` to see existing test conventions** + +Run: `head -30 tests/test_zarr_common.py` + +Review what's there. The new tests will follow the same `unittest.TestCase` style. 
+ +- [ ] **Step 2: Write the env-var subprocess tests** + +Append to `tests/test_zarr_common.py` (before the `if __name__ == "__main__":` block): + +```python +import subprocess +import sys + + +class TargetShardBytesEnvVarTestCase(unittest.TestCase): + """`DEFAULT_TARGET_SHARD_BYTES` reads from env var at import time.""" + + def _import_in_subprocess(self, env_value: str | None) -> subprocess.CompletedProcess[str]: + env = dict(os.environ) + env.pop("LSST_IMAGES_ZARR_TARGET_SHARD_BYTES", None) + if env_value is not None: + env["LSST_IMAGES_ZARR_TARGET_SHARD_BYTES"] = env_value + code = ( + "from lsst.images.zarr._common import DEFAULT_TARGET_SHARD_BYTES;" + "print(DEFAULT_TARGET_SHARD_BYTES)" + ) + return subprocess.run( + [sys.executable, "-c", code], + env=env, + capture_output=True, + text=True, + check=False, + ) + + def test_unset_uses_default(self) -> None: + result = self._import_in_subprocess(None) + self.assertEqual(result.returncode, 0, result.stderr) + self.assertEqual(result.stdout.strip(), str(16 * 1024 * 1024)) + + def test_set_value_overrides(self) -> None: + result = self._import_in_subprocess("1234567") + self.assertEqual(result.returncode, 0, result.stderr) + self.assertEqual(result.stdout.strip(), "1234567") + + def test_garbage_value_fails_at_import(self) -> None: + result = self._import_in_subprocess("not-a-number") + self.assertNotEqual(result.returncode, 0) + self.assertIn("LSST_IMAGES_ZARR_TARGET_SHARD_BYTES", result.stderr) +``` + +If the file does not already import `os`, add `import os` to the imports at the top. + +- [ ] **Step 3: Run the tests to verify they fail** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_common.py::TargetShardBytesEnvVarTestCase -v` + +Expected: FAIL — `DEFAULT_TARGET_SHARD_BYTES` does not exist yet (`ImportError`). + +- [ ] **Step 4: Add the constant to `_common.py`** + +Edit `python/lsst/images/zarr/_common.py`. 
+ +Add `DEFAULT_TARGET_SHARD_BYTES` to `__all__` so it becomes: + +```python +__all__ = ( + "DEFAULT_CHUNK_AXIS_LIMIT", + "DEFAULT_TARGET_SHARD_BYTES", + "LSST_NS", + "LSST_VERSION", + "OME_NS", + "OME_VERSION", + "ZarrCompressionOptions", + "ZarrPointerModel", + "archive_path_to_zarr_path", + "mask_dtype_for_plane_count", +) +``` + +Add `import os` near the other stdlib imports. + +After the `DEFAULT_CHUNK_AXIS_LIMIT` block added in Task 1, append: + +```python +def _read_target_shard_bytes() -> int: + """Read `LSST_IMAGES_ZARR_TARGET_SHARD_BYTES` or return the default. + + Parsed as a base-10 integer. A non-integer value raises ``ValueError`` + at import time — silent typos are worse than loud failure. + """ + raw = os.environ.get("LSST_IMAGES_ZARR_TARGET_SHARD_BYTES") + if raw is None: + return 16 * 1024 * 1024 + try: + return int(raw) + except ValueError as exc: + raise ValueError( + f"LSST_IMAGES_ZARR_TARGET_SHARD_BYTES={raw!r} is not a base-10 integer." + ) from exc + + +DEFAULT_TARGET_SHARD_BYTES: int = _read_target_shard_bytes() +"""Target uncompressed byte size for an auto-derived shard. + +Read from ``LSST_IMAGES_ZARR_TARGET_SHARD_BYTES`` once at import time; +defaults to 16 MiB. Used by `lsst.images.zarr._layout.default_shards` to +decide how many chunks to combine into a shard. +""" +``` + +- [ ] **Step 5: Run the tests to verify they pass** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_common.py::TargetShardBytesEnvVarTestCase -v` + +Expected: PASS for all three subtests. + +- [ ] **Step 6: Run the full common test file as a regression check** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_common.py -v` + +Expected: All tests pass. 
+ +- [ ] **Step 7: Commit** + +```bash +git add python/lsst/images/zarr/_common.py tests/test_zarr_common.py +git commit -m "feat(zarr): add DEFAULT_TARGET_SHARD_BYTES with env-var override" +``` + +--- + +## Task 3: Add `default_shards` helper (test-first) + +**Files:** +- Modify: `python/lsst/images/zarr/_layout.py` (add helper + export) +- Modify: `tests/test_zarr_layout.py` (add new test case) + +- [ ] **Step 1: Write the unit tests** + +Append a new test class to `tests/test_zarr_layout.py` (before the `if __name__ == "__main__":` block, after the existing `LayoutTestCase`): + +```python +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class DefaultShardsTestCase(unittest.TestCase): + """The `default_shards` byte-budget rule.""" + + TARGET = 16 * 1024 * 1024 # 16 MiB + + def test_4k_float32_image_uses_byte_budget(self) -> None: + result = default_shards( + chunks=(256, 256), + shape=(4096, 4096), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + self.assertEqual(result, (2048, 2048)) + + def test_3d_mask_plane_axis_untouched(self) -> None: + # chunks already cover the plane axis; growable axes are y, x only. + result = default_shards( + chunks=(8, 256, 256), + shape=(8, 4096, 4096), + dtype=np.dtype("uint8"), + target_bytes=self.TARGET, + ) + self.assertEqual(result, (8, 1536, 1536)) + + def test_tiny_single_chunk_returns_none(self) -> None: + result = default_shards( + chunks=(40,), + shape=(40,), + dtype=np.dtype("uint8"), + target_bytes=self.TARGET, + ) + self.assertIsNone(result) + + def test_chunks_equal_shape_returns_none(self) -> None: + result = default_shards( + chunks=(1024, 1024), + shape=(1024, 1024), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + self.assertIsNone(result) + + def test_already_big_chunk_returns_none(self) -> None: + # 4096*4096*4 = 64 MiB > 16 MiB target. 
+ result = default_shards( + chunks=(4096, 4096), + shape=(8192, 8192), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + self.assertIsNone(result) + + def test_k_le_one_returns_none(self) -> None: + # chunk_bytes = 256*256*4 = 256 KiB. Both axes are growable here, + # so k = round(sqrt(ratio)), not round(ratio). A 320 KiB target + # gives ratio 1.25 and k = round(1.118) = 1 -> None. (A 384 KiB + # target, ratio 1.5, would also give k = round(1.225) = 1.) + chunk_bytes = 256 * 256 * 4 + result = default_shards( + chunks=(256, 256), + shape=(4096, 4096), + dtype=np.dtype("float32"), + target_bytes=int(chunk_bytes * 1.25), + ) + self.assertIsNone(result) + + def test_cap_at_array_bounds(self) -> None: + # 600x600 float32; chunk_bytes = 256 KiB; ratio = 64; k = 8. + # Uncapped shard would be (2048, 2048) but the array is only + # 3 chunks per axis (ceil(600/256) = 3), so the cap is (768, 768). + result = default_shards( + chunks=(256, 256), + shape=(600, 600), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + self.assertEqual(result, (768, 768)) + + def test_cell_coadd_psf(self) -> None: + # (25, 25, 150, 150) float32 with (1, 1, 150, 150) chunks. + # chunk_bytes = 150*150*4 = 90,000 bytes (~88 KiB); ratio ~= 186; + # growable axes are 0 and 1 (cell-grid axes). + # k = round(sqrt(186)) = 14. + result = default_shards( + chunks=(1, 1, 150, 150), + shape=(25, 25, 150, 150), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + self.assertEqual(result, (14, 14, 150, 150)) + + def test_mismatched_ndim_raises(self) -> None: + with self.assertRaisesRegex(ValueError, "rank"): + default_shards( + chunks=(256, 256), + shape=(4096, 4096, 4096), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + + def test_zero_itemsize_returns_none(self) -> None: + # void(0) has itemsize 0; defensive guard against degenerate dtypes.
+ result = default_shards( + chunks=(256, 256), + shape=(4096, 4096), + dtype=np.dtype("V0"), + target_bytes=self.TARGET, + ) + self.assertIsNone(result) +``` + +Update the import block at the top of `tests/test_zarr_layout.py` (currently lines 25–32) to also import `default_shards`: + +```python +try: + from lsst.images.zarr._layout import ( + affine_check, + axes_for_archive_class, + chunks_aligned_to, + chunks_for, + decorate_sub_archives, + default_shards, + ) + from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False +``` + +- [ ] **Step 2: Run the new tests to verify they do not pass yet** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_layout.py::DefaultShardsTestCase -v` + +Expected: None of the ten tests pass. Because the guarded import above catches the `ImportError` for the missing `default_shards` and sets `HAVE_ZARR = False`, pytest reports them as SKIPPED ("zarr is not installed") rather than FAILED. + +- [ ] **Step 3: Implement `default_shards` in `_layout.py`** + +Add `"default_shards"` to the `__all__` tuple at the top of `python/lsst/images/zarr/_layout.py` (currently lines 29–38), keeping alphabetical order: + +```python +__all__ = ( + "AffineCheckResult", + "affine_check", + "axes_for_archive_class", + "chunks_aligned_to", + "chunks_for", + "decorate_sub_archives", + "default_shards", + "deserialize_fits_opaque_metadata", + "serialize_fits_opaque_metadata", +) +``` + +Add `import math` to the imports at the top of the file (alongside `numpy`). + +After `chunks_aligned_to` (currently ends at line 118), add: + +```python +def default_shards( + *, + chunks: tuple[int, ...], + shape: tuple[int, ...], + dtype: np.dtype, + target_bytes: int, +) -> tuple[int, ...] | None: + """Derive a default shard shape from ``chunks``, ``shape``, and ``dtype``. + + Returns ``None`` when sharding would be a no-op: the array is + already a single chunk per axis, the chunk is already at least + ``target_bytes`` big, or the byte budget rounds to ``k == 1`` + chunks per growable axis. 
+ + The rule grows only axes whose ``chunks[i] < shape[i]`` (the + others already cover the full extent), uses one uniform multiplier + ``k = round(ratio ** (1 / num_growable_axes))`` to stay close to + the byte budget, and caps each axis at ``chunks[i] * ceil(shape[i] + / chunks[i])`` so a small array does not get a shard larger than + itself. Every shard axis is an integer multiple of the + corresponding chunk axis, as required by zarr v3. + + Parameters + ---------- + chunks + Chunk shape, one int per axis. + shape + Array shape, one int per axis. + dtype + Array dtype; only ``itemsize`` is consulted. + target_bytes + Target uncompressed shard size. Typically + `DEFAULT_TARGET_SHARD_BYTES`. + + Raises + ------ + ValueError + If ``len(chunks) != len(shape)``. + """ + if len(chunks) != len(shape): + raise ValueError( + f"chunks rank {len(chunks)} does not match shape rank {len(shape)}." + ) + itemsize = dtype.itemsize + if itemsize == 0: + return None + chunk_bytes = math.prod(chunks) * itemsize + if chunk_bytes >= target_bytes: + return None + growable = [i for i in range(len(shape)) if chunks[i] < shape[i]] + if not growable: + return None + ratio = target_bytes / chunk_bytes + k = max(1, round(ratio ** (1.0 / len(growable)))) + if k <= 1: + return None + shard = list(chunks) + for i in growable: + n_chunks_axis = math.ceil(shape[i] / chunks[i]) + shard[i] = min(chunks[i] * k, chunks[i] * n_chunks_axis) + return tuple(shard) +``` + +Note: the helper takes its arguments keyword-only to match the style of `chunks_aligned_to` and to make calls at the use sites self-documenting. + +- [ ] **Step 4: Update the test calls to use keyword arguments** + +The unit tests written in Step 1 already pass arguments by keyword (`chunks=..., shape=..., dtype=..., target_bytes=...`). No change needed. 
+ +- [ ] **Step 5: Run the new tests to verify they pass** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_layout.py::DefaultShardsTestCase -v` + +Expected: All ten subtests PASS. + +- [ ] **Step 6: Run the full layout test file as a regression check** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_layout.py -v` + +Expected: All tests pass (existing tests should be unaffected). + +- [ ] **Step 7: Commit** + +```bash +git add python/lsst/images/zarr/_layout.py tests/test_zarr_layout.py +git commit -m "feat(zarr): add default_shards helper for byte-budget shard sizing" +``` + +--- + +## Task 4: Wire `default_shards` into `ZarrOutputArchive.add_array` + +**Files:** +- Modify: `python/lsst/images/zarr/_output_archive.py` (mask path ~L188, generic path ~L226) +- Modify: `tests/test_zarr_round_trip.py` (add a sharded round-trip test) + +- [ ] **Step 1: Write the failing round-trip test** + +Append to `tests/test_zarr_round_trip.py` (inside the `ZarrRoundTripTestCase` class, after `test_image_round_trip`): + +```python + def test_image_round_trip_writes_shards(self) -> None: + # 300x300 float32: chunks (256, 256) -> shard (512, 512) by the + # byte-budget rule (target 16 MiB, ratio ~64, k ~ 8 capped at the + # 2-chunk-per-axis ceiling of 256 * 2 = 512). + import zarr as _zarr + + from lsst.images.zarr._store import open_store_for_read + + original = Image( + np.zeros((300, 300), dtype=np.float32), + bbox=Box.factory[0:300, 0:300], + ) + with RoundtripZarr(self, original) as roundtrip: + with open_store_for_read(roundtrip.filename) as store: + root = _zarr.open_group(store=store, mode="r", zarr_format=3) + image_arr = root["image"] + self.assertEqual(tuple(image_arr.chunks), (256, 256)) + self.assertEqual(tuple(image_arr.shards), (512, 512)) + # Single-chunk metadata arrays must NOT be sharded. + lsst_json_arr = root["lsst_json"] + self.assertIsNone(lsst_json_arr.shards) + # Data round-trip is preserved. 
+ np.testing.assert_array_equal(roundtrip.result.array, original.array) +``` + +- [ ] **Step 2: Run the test to verify it fails** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_round_trip.py::ZarrRoundTripTestCase::test_image_round_trip_writes_shards -v` + +Expected: FAIL — `image_arr.shards` is `None` because the archive doesn't populate shards yet. + +- [ ] **Step 3: Wire `default_shards` into the mask-path branch of `add_array`** + +Edit `python/lsst/images/zarr/_output_archive.py`. The existing imports at the top (currently lines 40–53) are multi-line tuples. Update them in place. + +Change the `_common` import block to add `DEFAULT_TARGET_SHARD_BYTES`: + +```python +from ._common import ( + DEFAULT_TARGET_SHARD_BYTES, + ZarrCompressionOptions, + ZarrPointerModel, + archive_path_to_zarr_path, + mask_dtype_for_plane_count, +) +``` + +Change the `_layout` import block to add `default_shards`: + +```python +from ._layout import ( + affine_check, + axes_for_archive_class, + chunks_aligned_to, + chunks_for, + decorate_sub_archives, + default_shards, + serialize_fits_opaque_metadata, +) +``` + +In the mask branch (currently lines 180–200), replace: + +```python + chunks = self._chunks.get(name) or self._chunks.get(leaf) + if chunks is None and self._image_chunks is not None: + chunks = chunks_aligned_to(image_chunks=self._image_chunks, shape=packed.shape) + extra: dict[str, Any] = {"_ARRAY_DIMENSIONS": ["y", "x"]} + extra.update(flag_attrs.dump()) + ir_array = ZarrArray( + data=packed, + chunks=chunks, + shards=self._shards.get(name), + compression=self._compression.get(name), + ) +``` + +with: + +```python + chunks = self._chunks.get(name) or self._chunks.get(leaf) + if chunks is None and self._image_chunks is not None: + chunks = chunks_aligned_to(image_chunks=self._image_chunks, shape=packed.shape) + shards = self._shards.get(name) or self._shards.get(leaf) + if shards is None and chunks is not None: + shards = default_shards( + chunks=tuple(chunks), + 
shape=tuple(packed.shape), + dtype=packed.dtype, + target_bytes=DEFAULT_TARGET_SHARD_BYTES, + ) + extra: dict[str, Any] = {"_ARRAY_DIMENSIONS": ["y", "x"]} + extra.update(flag_attrs.dump()) + ir_array = ZarrArray( + data=packed, + chunks=chunks, + shards=shards, + compression=self._compression.get(name), + ) +``` + +- [ ] **Step 4: Wire `default_shards` into the generic branch of `add_array`** + +In the generic branch (currently lines 202–231), find this block: + +```python + ir_array = ZarrArray( + data=np.ascontiguousarray(array), + chunks=chunks, + shards=self._shards.get(name), + compression=self._compression.get(name), + ) +``` + +Replace with: + +```python + shards = self._shards.get(name) or self._shards.get(leaf) + if shards is None and chunks is not None: + shards = default_shards( + chunks=tuple(chunks), + shape=tuple(array.shape), + dtype=array.dtype, + target_bytes=DEFAULT_TARGET_SHARD_BYTES, + ) + ir_array = ZarrArray( + data=np.ascontiguousarray(array), + chunks=chunks, + shards=shards, + compression=self._compression.get(name), + ) +``` + +Note: `chunks is not None` guards the unusual case where neither `_chunks` nor any layout default fired — `default_shards` only makes sense once we have a chunk shape. In practice `chunks` is non-`None` for `image`, `variance`, `mask`, and `psf`; for table columns and structured-array columns it is `None` (those use `add_table` / `add_structured_array`, which do not pass through this branch). + +- [ ] **Step 5: Run the round-trip test to verify it passes** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_round_trip.py::ZarrRoundTripTestCase::test_image_round_trip_writes_shards -v` + +Expected: PASS — `image_arr.shards == (512, 512)` and `lsst_json_arr.shards is None`. + +- [ ] **Step 6: Run all zarr round-trip tests as a regression check** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_round_trip.py -v` + +Expected: All tests pass — existing data-equality assertions are unchanged. 
+ +- [ ] **Step 7: Run the full zarr suite** + +Run: `.pyenv/bin/python -m pytest tests/ -k zarr -v` + +Expected: All zarr-related tests pass. + +- [ ] **Step 8: Commit** + +```bash +git add python/lsst/images/zarr/_output_archive.py tests/test_zarr_round_trip.py +git commit -m "feat(zarr): default-shard image, variance, mask in ZarrOutputArchive" +``` + +--- + +## Task 5: Verify CellCoadd PSF gets sharded + +**Files:** +- Modify: `tests/test_zarr_output_archive.py` (extend `ZarrPsfChunkingTestCase`) + +- [ ] **Step 1: Write a PSF shard-defaulting test** + +Append to `tests/test_zarr_output_archive.py` inside `ZarrPsfChunkingTestCase` (currently lines 277–298), after `test_psf_user_override_wins`: + +```python + def test_psf_array_gets_default_shards(self) -> None: + # 25x25 cells of 150x150 float32: chunk_bytes = 90 KiB, + # ratio ~ 186, k = round(sqrt(186)) = 14 -> shard (14, 14, 150, 150). + psf = np.zeros((25, 25, 150, 150), dtype=np.float32) + archive = ZarrOutputArchive(archive_class="CellCoadd") + archive.add_array(psf, name="psf") + node = archive.document.root.get("/psf") + self.assertEqual(tuple(node.shards), (14, 14, 150, 150)) + + def test_psf_user_shard_override_wins(self) -> None: + psf = np.zeros((25, 25, 150, 150), dtype=np.float32) + archive = ZarrOutputArchive( + archive_class="CellCoadd", + shards={"psf": (5, 5, 150, 150)}, + ) + archive.add_array(psf, name="psf") + node = archive.document.root.get("/psf") + self.assertEqual(tuple(node.shards), (5, 5, 150, 150)) + + def test_small_psf_skips_sharding(self) -> None: + # 2x3 cells of 21x21 float32: chunk_bytes = 1764 B, ratio ~9511, + # k = 98, but each cell-grid axis is capped at its extent (2 and 3), + # so the shard becomes (2, 3, 21, 21), the full array shape: all + # six chunks are bundled into one shard. The byte-budget rule still + # produces a tuple; verify it is the capped value, not None. 
+ psf = np.zeros((2, 3, 21, 21), dtype=np.float32) + archive = ZarrOutputArchive(archive_class="CellCoadd") + archive.add_array(psf, name="psf") + node = archive.document.root.get("/psf") + # Either way is acceptable: shards=(2,3,21,21) (capped) or shards=None. + # The default rule returns the capped value; assert that. + self.assertEqual(tuple(node.shards), (2, 3, 21, 21)) +``` + +- [ ] **Step 2: Run the new tests** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_output_archive.py::ZarrPsfChunkingTestCase -v` + +Expected: PASS for all subtests, including the two existing ones (`test_psf_array_uses_single_cell_chunks` and `test_psf_user_override_wins`). For the small PSF, the helper computes `k = round(sqrt(16777216 / 1764)) = 98`, then caps each growable axis at the cell-grid extent (2 and 3), yielding shard `(2, 3, 21, 21)` — the whole array fits in one shard, which is the desired outcome (6 chunks bundled into one file). + +- [ ] **Step 3: Run the full output-archive test file as a regression check** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_output_archive.py -v` + +Expected: All tests pass. 
+ +- [ ] **Step 4: Commit** + +```bash +git add tests/test_zarr_output_archive.py +git commit -m "test(zarr): cover CellCoadd PSF shard defaulting and overrides" +``` + +--- + +## Task 6: Verify sharded write round-trips through `ZipStore` + +**Files:** +- Modify: `tests/test_zarr_store.py` (add a sharded round-trip via zip) + +- [ ] **Step 1: Write the test** + +Append to `tests/test_zarr_store.py` inside `StoreDispatchTestCase`, after `test_create_only_refuses_existing`: + +```python + def test_zip_store_round_trips_sharded_array(self) -> None: + import numpy as np + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr.zip") + data = np.arange(300 * 300, dtype=np.float32).reshape(300, 300) + with open_store_for_write(target) as store: + group = zarr.create_group(store=store, zarr_format=3) + arr = group.create_array( + name="image", + shape=data.shape, + chunks=(256, 256), + shards=(512, 512), + dtype=data.dtype, + ) + arr[:] = data + with open_store_for_read(target) as store: + group = zarr.open_group(store=store, mode="r", zarr_format=3) + image = group["image"] + self.assertEqual(tuple(image.chunks), (256, 256)) + self.assertEqual(tuple(image.shards), (512, 512)) + np.testing.assert_array_equal(image[...], data) +``` + +- [ ] **Step 2: Run the test to verify it passes** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_store.py::StoreDispatchTestCase::test_zip_store_round_trips_sharded_array -v` + +Expected: PASS — `ZipStore` handles sharded arrays without special handling on our side. + +If this fails (zarr-python 3.x not honoring shards through `ZipStore`), stop and discuss with the user before proceeding. The spec assumes this works; failure would be a real finding worth surfacing. + +- [ ] **Step 3: Run the full store test file** + +Run: `.pyenv/bin/python -m pytest tests/test_zarr_store.py -v` + +Expected: All tests pass. 
+ +- [ ] **Step 4: Commit** + +```bash +git add tests/test_zarr_store.py +git commit -m "test(zarr): round-trip a sharded array through ZipStore" +``` + +--- + +## Task 7: Final regression sweep and changelog + +**Files:** +- Modify: `doc/changes/` (add changelog fragment if the project uses one) + +- [ ] **Step 1: Run the full zarr test suite** + +Run: `.pyenv/bin/python -m pytest tests/ -k zarr -v` + +Expected: All zarr tests pass. + +- [ ] **Step 2: Run the full project test suite** + +Run: `.pyenv/bin/python -m pytest tests/ -v` + +Expected: All tests pass; no unrelated regressions. + +- [ ] **Step 3: Check whether a changelog fragment is required** + +Run: `ls doc/changes/ 2>/dev/null && head -20 doc/changes/README.rst 2>/dev/null || echo "no changelog dir"` + +If `doc/changes/` exists, follow the existing fragment-naming convention. If a `.rst`/`.md` template can be found in nearby commits (look at `git log --oneline -- doc/changes/`), match that style. + +If a fragment is required, create one summarising: +- "Default sharding now enabled for image, variance, mask, and PSF arrays in zarr archives. The per-axis chunk default has been lowered from 1024 to 256 to better suit cutout-style science access patterns. Public API is unchanged. Tunable via `LSST_IMAGES_ZARR_TARGET_SHARD_BYTES`." + +If no changelog system is detected, skip this step. + +- [ ] **Step 4: Verify mypy passes** + +Run: `.pyenv/bin/python -m mypy python/lsst/images/zarr/` + +Expected: No new type errors. (The zarr module was clean as of commit `9c2f01e`; this change adds only typed code.) + +- [ ] **Step 5: Final commit (only if a changelog fragment was added)** + +```bash +git add doc/changes/ +git commit -m "docs: changelog fragment for zarr default sharding" +``` + +- [ ] **Step 6: Sanity check the full diff** + +Run: `git log --oneline origin/main..HEAD` + +Expected: 6–7 commits: one per task for Tasks 1–6, in order, plus the changelog commit from Step 5 if a fragment was added. 
+ +Run: `git diff --stat origin/main..HEAD` + +Expected: Three production files (`_common.py`, `_layout.py`, `_output_archive.py`) and five test files (`test_zarr_common.py`, `test_zarr_layout.py`, `test_zarr_round_trip.py`, `test_zarr_output_archive.py`, `test_zarr_store.py`) modified, plus possibly a changelog fragment. No other source files changed. + +--- + +## Self-review notes + +- **Spec coverage**: Architecture (Task 1, Task 2 add constants; Task 3 adds helper; Task 4 wires it in), the `default_shards` rule (Task 3), per-array behaviour (Tasks 4 + 5), error handling (covered by Task 3 unit tests), backward compatibility (Task 4 step 6 regression sweep, Task 6 zip round-trip), all five testing categories from the spec (Task 3 unit tests, Task 2 env-var test, Task 4 round-trip integration, Task 5 PSF-specific round-trip, Task 6 zip round-trip). +- **No placeholders**: every code step shows the full code; every test step shows the full assertion. +- **Type consistency**: the helper signature is `default_shards(*, chunks, shape, dtype, target_bytes)` everywhere — Task 3 (definition), Task 3 (unit tests), Task 4 (call sites). The constant name `DEFAULT_TARGET_SHARD_BYTES` and chunk constant `DEFAULT_CHUNK_AXIS_LIMIT` are used consistently across `_common.py`, `_layout.py`, and `_output_archive.py`. +- **Spec deviation**: the spec's claim that "object dtype has itemsize 0" is incorrect — `np.dtype('O').itemsize == 8` on 64-bit platforms. The `itemsize == 0` guard still exists (it triggers on `np.dtype('V0')`), and the unit test in Task 3 covers it via void(0) instead of object. 
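The byte-budget rule from Task 3 can be checked by hand against the expected values in the unit tests. The sketch below is a NumPy-free restatement of that rule, not the helper itself: the name `sketch_default_shards` is hypothetical, and it takes a plain `itemsize` integer and positional arguments instead of `dtype` and keyword-only arguments, purely so that it runs with the standard library alone.

```python
import math


def sketch_default_shards(chunks, shape, itemsize, target_bytes):
    """Hypothetical restatement of the byte-budget shard rule.

    Assumption: ``itemsize`` stands in for ``dtype.itemsize`` so that
    the sketch has no NumPy dependency.
    """
    if len(chunks) != len(shape):
        raise ValueError("chunks rank does not match shape rank")
    if itemsize == 0:
        return None  # degenerate dtype, nothing to budget
    chunk_bytes = math.prod(chunks) * itemsize
    if chunk_bytes >= target_bytes:
        return None  # a single chunk already meets the budget
    # Only axes that the chunk does not already span can grow.
    growable = [i for i, (c, s) in enumerate(zip(chunks, shape)) if c < s]
    if not growable:
        return None
    # One uniform multiplier shared by every growable axis.
    k = round((target_bytes / chunk_bytes) ** (1.0 / len(growable)))
    if k <= 1:
        return None
    shard = list(chunks)
    for i in growable:
        # Cap at the number of chunks that actually exist on this axis.
        n_chunks_axis = math.ceil(shape[i] / chunks[i])
        shard[i] = chunks[i] * min(k, n_chunks_axis)
    return tuple(shard)


TARGET = 16 * 1024 * 1024  # 16 MiB, the documented default

# 4096x4096 float32: ratio 64 over two axes gives k = 8.
print(sketch_default_shards((256, 256), (4096, 4096), 4, TARGET))  # (2048, 2048)
# 600x600 float32: k = 8 is capped at the 3 chunks per axis that exist.
print(sketch_default_shards((256, 256), (600, 600), 4, TARGET))  # (768, 768)
# CellCoadd PSF: only the two cell-grid axes grow, k = 14.
print(sketch_default_shards((1, 1, 150, 150), (25, 25, 150, 150), 4, TARGET))
# Small PSF: k = 98 is capped at the 2x3 cell grid.
print(sketch_default_shards((1, 1, 21, 21), (2, 3, 21, 21), 4, TARGET))
```

The last two calls print `(14, 14, 150, 150)` and `(2, 3, 21, 21)`. Every shard axis comes out as an integer multiple of its chunk axis because the multiplier is applied to the chunk extent rather than to the array extent.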
From 2f2514a7e238d36f8cc90aee0e3f2cb0e8135299 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 09:57:15 -0700 Subject: [PATCH 41/60] feat(zarr): drop chunk-axis default 1024 -> 256, centralize constant --- python/lsst/images/zarr/_common.py | 10 ++++++++++ python/lsst/images/zarr/_layout.py | 7 +++---- tests/test_zarr_layout.py | 9 +++++---- 3 files changed, 18 insertions(+), 8 deletions(-) diff --git a/python/lsst/images/zarr/_common.py b/python/lsst/images/zarr/_common.py index c45caf39..a3b162ef 100644 --- a/python/lsst/images/zarr/_common.py +++ b/python/lsst/images/zarr/_common.py @@ -12,6 +12,7 @@ from __future__ import annotations __all__ = ( + "DEFAULT_CHUNK_AXIS_LIMIT", "LSST_NS", "LSST_VERSION", "OME_NS", @@ -44,6 +45,15 @@ backwards-incompatible changes to the on-disk layout. """ +DEFAULT_CHUNK_AXIS_LIMIT = 256 +"""Per-axis cap on the auto-derived chunk shape for plain image arrays. + +Used by `lsst.images.zarr._layout.chunks_for` when the caller does not +supply an explicit override and the archive class does not have a +class-specific chunk rule. Chunks of ~256 elements per spatial axis +trade some compression ratio for cutout-friendly partial reads. +""" + class ZarrPointerModel(pydantic.BaseModel): """Reference to a zarr archive sub-tree by absolute zarr path. 
diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index c93543dd..b137f9c1 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -14,7 +14,7 @@ This module centralises the decisions that vary by image type: - which OME axes apply (``ColorImage`` has no root multiscale) -- default chunk sizes (clamped to 1024 per axis for plain images, +- default chunk sizes (clamped to ``DEFAULT_CHUNK_AXIS_LIMIT`` per axis, cell-aligned for `CellCoadd`, image-aligned for `variance` / `mask` siblings) - the affine residual validator that gates the OME @@ -45,10 +45,9 @@ import numpy as np from ..fits._common import ExtensionKey, FitsOpaqueMetadata +from ._common import DEFAULT_CHUNK_AXIS_LIMIT from ._model import OmeMultiscale, ZarrArray, ZarrDocument -_DEFAULT_AXIS_LIMIT = 1024 - def axes_for_archive_class(name: str) -> tuple[str, ...]: """Return the OME axis tuple for a given archive class. @@ -95,7 +94,7 @@ def chunks_for( cell_shape = archive_metadata.get("cell_shape") if cell_shape is not None: return tuple(min(c, dim) for c, dim in zip(cell_shape, shape, strict=True)) - return tuple(min(_DEFAULT_AXIS_LIMIT, dim) for dim in shape) + return tuple(min(DEFAULT_CHUNK_AXIS_LIMIT, dim) for dim in shape) def chunks_aligned_to( diff --git a/tests/test_zarr_layout.py b/tests/test_zarr_layout.py index ea118e1c..1d6389e0 100644 --- a/tests/test_zarr_layout.py +++ b/tests/test_zarr_layout.py @@ -54,9 +54,10 @@ def test_axes_for_archive_class(self) -> None: self.assertEqual(axes_for_archive_class("ColorImage"), ()) def test_chunks_for_default(self) -> None: - self.assertEqual(chunks_for("Image", (4096, 4096), None), (1024, 1024)) - # Smaller than 1024 -> use full dim. - self.assertEqual(chunks_for("Image", (300, 600), None), (300, 600)) + # Plain images clamp to the per-axis chunk limit (256 by default). + self.assertEqual(chunks_for("Image", (4096, 4096), None), (256, 256)) + # Smaller than the limit -> use full dim. 
+ self.assertEqual(chunks_for("Image", (200, 100), None), (200, 100)) def test_chunks_for_override(self) -> None: self.assertEqual(chunks_for("Image", (4096, 4096), (256, 256)), (256, 256)) @@ -71,7 +72,7 @@ def test_chunks_for_cell_coadd_uses_cell_shape(self) -> None: self.assertEqual(result, (256, 256)) def test_chunks_for_cell_coadd_without_metadata_falls_back(self) -> None: - self.assertEqual(chunks_for("CellCoadd", (4096, 4096), None), (1024, 1024)) + self.assertEqual(chunks_for("CellCoadd", (4096, 4096), None), (256, 256)) def test_chunks_aligned_to_matches_image(self) -> None: # variance / mask follow image's chunks when not overridden. From 585b512316a0cebd9361e9050d0d3a2999b8c228 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 10:16:35 -0700 Subject: [PATCH 42/60] feat(zarr): add DEFAULT_TARGET_SHARD_BYTES with env-var override --- python/lsst/images/zarr/_common.py | 26 ++++++++++++++++++++ tests/test_zarr_common.py | 39 ++++++++++++++++++++++++++++++ 2 files changed, 65 insertions(+) diff --git a/python/lsst/images/zarr/_common.py b/python/lsst/images/zarr/_common.py index a3b162ef..01db83aa 100644 --- a/python/lsst/images/zarr/_common.py +++ b/python/lsst/images/zarr/_common.py @@ -13,6 +13,7 @@ __all__ = ( "DEFAULT_CHUNK_AXIS_LIMIT", + "DEFAULT_TARGET_SHARD_BYTES", "LSST_NS", "LSST_VERSION", "OME_NS", @@ -23,6 +24,7 @@ "mask_dtype_for_plane_count", ) +import os from dataclasses import dataclass from typing import ClassVar, Self @@ -55,6 +57,30 @@ """ +def _read_target_shard_bytes() -> int: + """Read ``LSST_IMAGES_ZARR_TARGET_SHARD_BYTES`` or return the default. + + Parsed as a base-10 integer. A non-integer value raises ``ValueError`` + at import time — silent typos are worse than loud failure. 
+ """ + raw = os.environ.get("LSST_IMAGES_ZARR_TARGET_SHARD_BYTES") + if raw is None: + return 16 * 1024 * 1024 + try: + return int(raw) + except ValueError as exc: + raise ValueError(f"LSST_IMAGES_ZARR_TARGET_SHARD_BYTES={raw!r} is not a base-10 integer.") from exc + + +DEFAULT_TARGET_SHARD_BYTES: int = _read_target_shard_bytes() +"""Target uncompressed byte size for an auto-derived shard. + +Read from ``LSST_IMAGES_ZARR_TARGET_SHARD_BYTES`` once at import time; +defaults to 16 MiB. Used by `lsst.images.zarr._layout.default_shards` to +decide how many chunks to combine into a shard. +""" + + class ZarrPointerModel(pydantic.BaseModel): """Reference to a zarr archive sub-tree by absolute zarr path. diff --git a/tests/test_zarr_common.py b/tests/test_zarr_common.py index de4b5a1f..d5494492 100644 --- a/tests/test_zarr_common.py +++ b/tests/test_zarr_common.py @@ -11,6 +11,9 @@ from __future__ import annotations +import os +import subprocess +import sys import unittest import numpy as np @@ -78,5 +81,41 @@ def test_mask_dtype_refuses_more_than_64_planes(self) -> None: mask_dtype_for_plane_count(65) +class TargetShardBytesEnvVarTestCase(unittest.TestCase): + """`DEFAULT_TARGET_SHARD_BYTES` reads from env var at import time.""" + + def _import_in_subprocess(self, env_value: str | None) -> subprocess.CompletedProcess[str]: + env = dict(os.environ) + env.pop("LSST_IMAGES_ZARR_TARGET_SHARD_BYTES", None) + if env_value is not None: + env["LSST_IMAGES_ZARR_TARGET_SHARD_BYTES"] = env_value + code = ( + "from lsst.images.zarr._common import DEFAULT_TARGET_SHARD_BYTES;" + "print(DEFAULT_TARGET_SHARD_BYTES)" + ) + return subprocess.run( + [sys.executable, "-c", code], + env=env, + capture_output=True, + text=True, + check=False, + ) + + def test_unset_uses_default(self) -> None: + result = self._import_in_subprocess(None) + self.assertEqual(result.returncode, 0, result.stderr) + self.assertEqual(result.stdout.strip(), str(16 * 1024 * 1024)) + + def test_set_value_overrides(self) 
-> None: + result = self._import_in_subprocess("1234567") + self.assertEqual(result.returncode, 0, result.stderr) + self.assertEqual(result.stdout.strip(), "1234567") + + def test_garbage_value_fails_at_import(self) -> None: + result = self._import_in_subprocess("not-a-number") + self.assertNotEqual(result.returncode, 0) + self.assertIn("LSST_IMAGES_ZARR_TARGET_SHARD_BYTES", result.stderr) + + if __name__ == "__main__": unittest.main() From bb297518eb325eb180222a5768464283f9cc338c Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 11:28:04 -0700 Subject: [PATCH 43/60] fix(zarr): skip env-var subprocess tests when zarr unavailable Add @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") to TargetShardBytesEnvVarTestCase to match the pattern used by CommonTestCase. Without this guard the subprocess tests fail for the wrong reason in environments where zarr is not installed, because the imported module raises ImportError before the env-var logic runs. Also add a second assertion in test_garbage_value_fails_at_import to check for "is not a base-10 integer" in stderr, ensuring the error originates from _read_target_shard_bytes rather than an unrelated future warning that happens to mention the env-var name. 
Generated with AI Co-Authored-By: SLAC AI --- tests/test_zarr_common.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tests/test_zarr_common.py b/tests/test_zarr_common.py index d5494492..cb757695 100644 --- a/tests/test_zarr_common.py +++ b/tests/test_zarr_common.py @@ -81,6 +81,7 @@ def test_mask_dtype_refuses_more_than_64_planes(self) -> None: mask_dtype_for_plane_count(65) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") class TargetShardBytesEnvVarTestCase(unittest.TestCase): """`DEFAULT_TARGET_SHARD_BYTES` reads from env var at import time.""" @@ -115,6 +116,7 @@ def test_garbage_value_fails_at_import(self) -> None: result = self._import_in_subprocess("not-a-number") self.assertNotEqual(result.returncode, 0) self.assertIn("LSST_IMAGES_ZARR_TARGET_SHARD_BYTES", result.stderr) + self.assertIn("is not a base-10 integer", result.stderr) if __name__ == "__main__": From 259cf6bcabf9d16eb55500a3bc2324ae618f8f7a Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 11:39:33 -0700 Subject: [PATCH 44/60] feat(zarr): add default_shards helper for byte-budget shard sizing --- python/lsst/images/zarr/_layout.py | 63 +++++++++++++++++ tests/test_zarr_layout.py | 110 +++++++++++++++++++++++++++++ 2 files changed, 173 insertions(+) diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index b137f9c1..74c4bffe 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -33,10 +33,12 @@ "chunks_aligned_to", "chunks_for", "decorate_sub_archives", + "default_shards", "deserialize_fits_opaque_metadata", "serialize_fits_opaque_metadata", ) +import math from collections.abc import Mapping from dataclasses import dataclass from typing import Any @@ -117,6 +119,67 @@ def chunks_aligned_to( return tuple(min(c, dim) for c, dim in zip(image_chunks, shape, strict=True)) +def default_shards( + *, + chunks: tuple[int, ...], + shape: tuple[int, ...], + dtype: np.dtype, + target_bytes: int, +) -> 
tuple[int, ...] | None: + """Derive a default shard shape from ``chunks``, ``shape``, and ``dtype``. + + Returns ``None`` when sharding would be a no-op: the array is + already a single chunk per axis, the chunk is already at least + ``target_bytes`` big, or the byte budget rounds to ``k == 1`` + chunks per growable axis. + + The rule grows only axes whose ``chunks[i] < shape[i]`` (the + others already cover the full extent), uses one uniform multiplier + ``k = round(ratio ** (1 / num_growable_axes))`` to stay close to + the byte budget, and caps each axis at ``chunks[i] * ceil(shape[i] + / chunks[i])`` so a small array does not get a shard larger than + itself. Every shard axis is an integer multiple of the + corresponding chunk axis, as required by zarr v3. + + Parameters + ---------- + chunks + Chunk shape, one int per axis. + shape + Array shape, one int per axis. + dtype + Array dtype; only ``itemsize`` is consulted. + target_bytes + Target uncompressed shard size. Typically + `~lsst.images.zarr._common.DEFAULT_TARGET_SHARD_BYTES`. + + Raises + ------ + ValueError + If ``len(chunks) != len(shape)``. + """ + if len(chunks) != len(shape): + raise ValueError(f"chunks rank {len(chunks)} does not match shape rank {len(shape)}.") + itemsize = dtype.itemsize + if itemsize == 0: + return None + chunk_bytes = math.prod(chunks) * itemsize + if chunk_bytes >= target_bytes: + return None + growable = [i for i in range(len(shape)) if chunks[i] < shape[i]] + if not growable: + return None + ratio = target_bytes / chunk_bytes + k = max(1, round(ratio ** (1.0 / len(growable)))) + if k <= 1: + return None + shard = list(chunks) + for i in growable: + n_chunks_axis = math.ceil(shape[i] / chunks[i]) + shard[i] = min(chunks[i] * k, chunks[i] * n_chunks_axis) + return tuple(shard) + + @dataclass class AffineCheckResult: """Result of asking AST whether a simplified affine fits a full WCS. 
diff --git a/tests/test_zarr_layout.py b/tests/test_zarr_layout.py index 1d6389e0..f61772c0 100644 --- a/tests/test_zarr_layout.py +++ b/tests/test_zarr_layout.py @@ -30,6 +30,7 @@ chunks_aligned_to, chunks_for, decorate_sub_archives, + default_shards, ) from lsst.images.zarr._model import ZarrArray, ZarrDocument, ZarrGroup @@ -168,5 +169,114 @@ def test_root_archive_class_is_unchanged(self) -> None: self.assertEqual(doc.root.attributes.lsst["archive_class"], "ColorImage") +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class DefaultShardsTestCase(unittest.TestCase): + """The `default_shards` byte-budget rule.""" + + TARGET = 16 * 1024 * 1024 # 16 MiB + + def test_4k_float32_image_uses_byte_budget(self) -> None: + result = default_shards( + chunks=(256, 256), + shape=(4096, 4096), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + self.assertEqual(result, (2048, 2048)) + + def test_3d_mask_plane_axis_untouched(self) -> None: + # chunks already cover the plane axis; growable axes are y, x only. + result = default_shards( + chunks=(8, 256, 256), + shape=(8, 4096, 4096), + dtype=np.dtype("uint8"), + target_bytes=self.TARGET, + ) + self.assertEqual(result, (8, 1536, 1536)) + + def test_tiny_single_chunk_returns_none(self) -> None: + result = default_shards( + chunks=(40,), + shape=(40,), + dtype=np.dtype("uint8"), + target_bytes=self.TARGET, + ) + self.assertIsNone(result) + + def test_chunks_equal_shape_returns_none(self) -> None: + result = default_shards( + chunks=(1024, 1024), + shape=(1024, 1024), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + self.assertIsNone(result) + + def test_already_big_chunk_returns_none(self) -> None: + # 4096*4096*4 = 64 MiB > 16 MiB target. 
+ result = default_shards( + chunks=(4096, 4096), + shape=(8192, 8192), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + self.assertIsNone(result) + + def test_k_le_one_returns_none(self) -> None: + # chunk=256x256 float32 = 256 KiB; ratio=1.25 -> k=round(1.25)=1 + # -> returns None. + chunk_bytes = 256 * 256 * 4 + result = default_shards( + chunks=(256, 256), + shape=(4096, 4096), + dtype=np.dtype("float32"), + target_bytes=int(chunk_bytes * 1.25), + ) + self.assertIsNone(result) + + def test_cap_at_array_bounds(self) -> None: + # 600x600 float32; chunk_bytes = 256 KiB; ratio = 64; k = 8. + # Uncapped shard would be (2048, 2048) but the array only has + # 3 chunks per axis (ceil(600/256) = 3), so the cap is (768, 768). + result = default_shards( + chunks=(256, 256), + shape=(600, 600), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + self.assertEqual(result, (768, 768)) + + def test_cell_coadd_psf(self) -> None: + # (25, 25, 150, 150) float32 with (1, 1, 150, 150) chunks. + # chunk_bytes = 90 KiB; ratio ~= 186; growable axes are 0 and 1. + # k = round(sqrt(186)) = 14. + result = default_shards( + chunks=(1, 1, 150, 150), + shape=(25, 25, 150, 150), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + self.assertEqual(result, (14, 14, 150, 150)) + + def test_mismatched_ndim_raises(self) -> None: + with self.assertRaisesRegex(ValueError, "rank"): + default_shards( + chunks=(256, 256), + shape=(4096, 4096, 4096), + dtype=np.dtype("float32"), + target_bytes=self.TARGET, + ) + + def test_zero_itemsize_returns_none(self) -> None: + # void(0) has itemsize 0; defensive guard against degenerate dtypes. 
+ result = default_shards( + chunks=(256, 256), + shape=(4096, 4096), + dtype=np.dtype("V0"), + target_bytes=self.TARGET, + ) + self.assertIsNone(result) + + if __name__ == "__main__": unittest.main() From 18bd59acd2dd9c342684b88d798b7f087e41ec4f Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 11:47:11 -0700 Subject: [PATCH 45/60] refactor(zarr): drop dead max(1,...) clamp from default_shards MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The max(1, ...) guard in default_shards was unreachable: the function returns None before reaching that line when chunk_bytes >= target_bytes, so ratio > 1 always holds and round(ratio**(1/n)) >= 1 is guaranteed. The clamp was misleading — it implied k could be 0 or negative, which undermined the subsequent if k <= 1 guard. Drop the clamp and add an inline comment on the guard explaining the scenario it actually catches (rounding to a 1x no-op multiplier). Also expand the Returns-None paragraph in the docstring to mention the dtype.itemsize == 0 branch that was previously omitted. Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_layout.py | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index 74c4bffe..e2bf5eca 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -128,10 +128,10 @@ def default_shards( ) -> tuple[int, ...] | None: """Derive a default shard shape from ``chunks``, ``shape``, and ``dtype``. - Returns ``None`` when sharding would be a no-op: the array is - already a single chunk per axis, the chunk is already at least - ``target_bytes`` big, or the byte budget rounds to ``k == 1`` - chunks per growable axis. 
+ Returns ``None`` when sharding would be a no-op: ``dtype.itemsize`` + is zero (e.g. ``V0``), the array is already a single chunk per + axis, the chunk is already at least ``target_bytes`` big, or the + byte budget rounds to ``k == 1`` chunks per growable axis. The rule grows only axes whose ``chunks[i] < shape[i]`` (the others already cover the full extent), uses one uniform multiplier @@ -170,9 +170,9 @@ def default_shards( if not growable: return None ratio = target_bytes / chunk_bytes - k = max(1, round(ratio ** (1.0 / len(growable)))) + k = round(ratio ** (1.0 / len(growable))) if k <= 1: - return None + return None # budget allows at most a 1x multiplier — no-op shard shard = list(chunks) for i in growable: n_chunks_axis = math.ceil(shape[i] / chunks[i]) From 659e0749b9a1f63da86c03445596bde4aeebc116 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 11:50:30 -0700 Subject: [PATCH 46/60] feat(zarr): default-shard image, variance, mask in ZarrOutputArchive --- python/lsst/images/zarr/_output_archive.py | 22 +++++++++++++++++++-- tests/test_zarr_round_trip.py | 23 +++++++++++++++++++++- 2 files changed, 42 insertions(+), 3 deletions(-) diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index edc94176..eaeb8581 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -38,6 +38,7 @@ no_header_updates, ) from ._common import ( + DEFAULT_TARGET_SHARD_BYTES, ZarrCompressionOptions, ZarrPointerModel, archive_path_to_zarr_path, @@ -49,6 +50,7 @@ chunks_aligned_to, chunks_for, decorate_sub_archives, + default_shards, serialize_fits_opaque_metadata, ) from ._model import ( @@ -183,12 +185,20 @@ def add_array( chunks = self._chunks.get(name) or self._chunks.get(leaf) if chunks is None and self._image_chunks is not None: chunks = chunks_aligned_to(image_chunks=self._image_chunks, shape=packed.shape) + shards = self._shards.get(name) or
self._shards.get(leaf) + if shards is None and chunks is not None: + shards = default_shards( + chunks=tuple(chunks), + shape=tuple(packed.shape), + dtype=packed.dtype, + target_bytes=DEFAULT_TARGET_SHARD_BYTES, + ) extra: dict[str, Any] = {"_ARRAY_DIMENSIONS": ["y", "x"]} extra.update(flag_attrs.dump()) ir_array = ZarrArray( data=packed, chunks=chunks, - shards=self._shards.get(name), + shards=shards, compression=self._compression.get(name), ) ir_array.attributes.extra = extra @@ -223,10 +233,18 @@ def add_array( if chunks is None and leaf == "psf" and array.ndim == 4 and parent_path == "/": chunks = (1, 1, array.shape[2], array.shape[3]) + shards = self._shards.get(name) or self._shards.get(leaf) + if shards is None and chunks is not None: + shards = default_shards( + chunks=tuple(chunks), + shape=tuple(array.shape), + dtype=array.dtype, + target_bytes=DEFAULT_TARGET_SHARD_BYTES, + ) ir_array = ZarrArray( data=np.ascontiguousarray(array), chunks=chunks, - shards=self._shards.get(name), + shards=shards, compression=self._compression.get(name), ) if parent_path == "/" and leaf in ("image", "variance"): diff --git a/tests/test_zarr_round_trip.py b/tests/test_zarr_round_trip.py index c0b30a4b..e00c1e90 100644 --- a/tests/test_zarr_round_trip.py +++ b/tests/test_zarr_round_trip.py @@ -18,9 +18,10 @@ from lsst.images import Box, ColorImage, Image, MaskedImage, MaskPlane, MaskSchema try: - import zarr # noqa: F401 + import zarr from lsst.images.tests import RoundtripZarr + from lsst.images.zarr._store import open_store_for_read HAVE_ZARR = True except ImportError: @@ -41,6 +42,26 @@ def test_image_round_trip(self) -> None: np.testing.assert_array_equal(recovered.array, original.array) self.assertEqual(recovered.bbox, original.bbox) + def test_image_round_trip_writes_shards(self) -> None: + # 300x300 float32: chunks (256, 256) -> shard (512, 512) by the + # byte-budget rule (target 16 MiB, ratio ~64, k ~ 8 capped at the + # 2-chunk-per-axis ceiling of 256 * 2 = 512). 
+ original = Image( + np.zeros((300, 300), dtype=np.float32), + bbox=Box.factory[0:300, 0:300], + ) + with RoundtripZarr(self, original) as roundtrip: + with open_store_for_read(roundtrip.filename) as store: + root = zarr.open_group(store=store, mode="r", zarr_format=3) + image_arr = root["image"] + self.assertEqual(tuple(image_arr.chunks), (256, 256)) + self.assertEqual(tuple(image_arr.shards), (512, 512)) + # Single-chunk metadata arrays must NOT be sharded. + lsst_json_arr = root["lsst_json"] + self.assertIsNone(lsst_json_arr.shards) + # Data round-trip is preserved. + np.testing.assert_array_equal(roundtrip.result.array, original.array) + def test_masked_image_round_trip(self) -> None: schema = MaskSchema( [ From 83ea791e8a11998c8043f1640a9ad6abf60e40d6 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 14:27:31 -0700 Subject: [PATCH 47/60] docs(zarr): correct stale ZarrArray chunks/shards docstrings The `chunks` description referred to a "~1024 per axis" default that has been 256 since Task 1. The `shards` description claimed `to_zarr` derives a 4x-chunk default, but shard defaulting actually lives in `ZarrOutputArchive.add_array` via `default_shards`. Replace both parameter docstrings with accurate descriptions. Generated with AI Co-Authored-By: SLAC AI --- python/lsst/images/zarr/_model.py | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/python/lsst/images/zarr/_model.py b/python/lsst/images/zarr/_model.py index 07307ee5..39c29327 100644 --- a/python/lsst/images/zarr/_model.py +++ b/python/lsst/images/zarr/_model.py @@ -104,12 +104,15 @@ class ZarrArray: archive) or a ``zarr.Array`` (when read by the input archive). The two forms never mix in a single instance. chunks - Per-axis chunk shape. ``None`` lets `to_zarr` derive a default - from the array shape (~1024 per axis for plain images). + Per-axis chunk shape. 
``None`` lets `to_zarr` derive a fallback + default for any IR node that reached the writer without explicit + chunks (the output archive normally sets these via the + `~lsst.images.zarr._layout.chunks_for` family of rules). shards - Per-axis shard shape (zarr v3 native). ``None`` lets `to_zarr` - derive a default of 4× the chunk shape per axis when the - resulting shard exceeds 1 MiB. + Per-axis shard shape (zarr v3 native). ``None`` means the array + is unsharded. Populated by `ZarrOutputArchive` via the + `~lsst.images.zarr._layout.default_shards` rule for arrays large + enough to benefit; tiny / single-chunk arrays stay ``None``. compression Codec configuration. ``None`` falls back to `ZarrCompressionOptions.default_for_dtype`. From 5eaf986aabbd633a5beedb358b1aacd915994489 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 14:29:24 -0700 Subject: [PATCH 48/60] test(zarr): cover CellCoadd PSF shard defaulting and overrides --- tests/test_zarr_output_archive.py | 32 +++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py index ae8c538e..b3141be7 100644 --- a/tests/test_zarr_output_archive.py +++ b/tests/test_zarr_output_archive.py @@ -297,6 +297,38 @@ def test_psf_user_override_wins(self) -> None: node = archive.document.root.get("/psf") self.assertEqual(tuple(node.chunks), (2, 3, 21, 21)) + def test_psf_array_gets_default_shards(self) -> None: + # 25x25 cells of 150x150 float32: chunk_bytes = 90 KiB, + # ratio ~ 186, k = round(sqrt(186)) = 14 -> shard (14, 14, 150, 150). 
+ psf = np.zeros((25, 25, 150, 150), dtype=np.float32) + archive = ZarrOutputArchive(archive_class="CellCoadd") + archive.add_array(psf, name="psf") + node = archive.document.root.get("/psf") + self.assertEqual(tuple(node.shards), (14, 14, 150, 150)) + + def test_psf_user_shard_override_wins(self) -> None: + psf = np.zeros((25, 25, 150, 150), dtype=np.float32) + archive = ZarrOutputArchive( + archive_class="CellCoadd", + shards={"psf": (5, 5, 150, 150)}, + ) + archive.add_array(psf, name="psf") + node = archive.document.root.get("/psf") + self.assertEqual(tuple(node.shards), (5, 5, 150, 150)) + + def test_small_psf_shard_caps_at_array_bounds(self) -> None: + # 2x3 cells of 21x21 float32: chunk_bytes = 1764 B, ratio ~9295, + # 2 growable axes, k = round(sqrt(9295)) = 96. The cap clamps + # each growable axis to chunks[i] * ceil(shape[i]/chunks[i]) = + # 1 * shape[i] = shape[i], yielding shard (2, 3, 21, 21) — the + # whole 6-cell PSF goes into one shard. Inner axes (21, 21) are + # not growable since chunks already cover them. 
+ psf = np.zeros((2, 3, 21, 21), dtype=np.float32) + archive = ZarrOutputArchive(archive_class="CellCoadd") + archive.add_array(psf, name="psf") + node = archive.document.root.get("/psf") + self.assertEqual(tuple(node.shards), (2, 3, 21, 21)) + @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") class ZarrOpaqueMetadataWriteTestCase(unittest.TestCase): From ac032d5a6bc8cd874f48cd07aa7df34f16b3df25 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 14:32:55 -0700 Subject: [PATCH 49/60] test(zarr): correct arithmetic in small-PSF shard test comment --- tests/test_zarr_output_archive.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py index b3141be7..db94953f 100644 --- a/tests/test_zarr_output_archive.py +++ b/tests/test_zarr_output_archive.py @@ -317,8 +317,8 @@ def test_psf_user_shard_override_wins(self) -> None: self.assertEqual(tuple(node.shards), (5, 5, 150, 150)) def test_small_psf_shard_caps_at_array_bounds(self) -> None: - # 2x3 cells of 21x21 float32: chunk_bytes = 1764 B, ratio ~9295, - # 2 growable axes, k = round(sqrt(9295)) = 96. The cap clamps + # 2x3 cells of 21x21 float32: chunk_bytes = 1764 B, ratio ~9511, + # 2 growable axes, k = round(sqrt(9511)) = 98. The cap clamps # each growable axis to chunks[i] * ceil(shape[i]/chunks[i]) = # 1 * shape[i] = shape[i], yielding shard (2, 3, 21, 21) — the # whole 6-cell PSF goes into one shard. 
Inner axes (21, 21) are From 1b699ad312da3369a7e06748327bcb8ff029ffc9 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 14:34:50 -0700 Subject: [PATCH 50/60] test(zarr): round-trip a sharded array through ZipStore --- tests/test_zarr_store.py | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/tests/test_zarr_store.py b/tests/test_zarr_store.py index 28267569..e6c11fc8 100644 --- a/tests/test_zarr_store.py +++ b/tests/test_zarr_store.py @@ -15,6 +15,8 @@ import tempfile import unittest +import numpy as np + try: import zarr @@ -58,6 +60,27 @@ def test_create_only_refuses_existing(self) -> None: with open_store_for_write(target): pass + def test_zip_store_round_trips_sharded_array(self) -> None: + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "out.zarr.zip") + data = np.arange(300 * 300, dtype=np.float32).reshape(300, 300) + with open_store_for_write(target) as store: + group = zarr.create_group(store=store, zarr_format=3) + arr = group.create_array( + name="image", + shape=data.shape, + chunks=(256, 256), + shards=(512, 512), + dtype=data.dtype, + ) + arr[:] = data + with open_store_for_read(target) as store: + group = zarr.open_group(store=store, mode="r", zarr_format=3) + image = group["image"] + self.assertEqual(tuple(image.chunks), (256, 256)) + self.assertEqual(tuple(image.shards), (512, 512)) + np.testing.assert_array_equal(image[...], data) + if __name__ == "__main__": unittest.main() From cf5312e8ebd00811c84fae66e92d43c7fe508810 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 14:39:53 -0700 Subject: [PATCH 51/60] docs: include zarr sharding and chunk default in changelog --- doc/changes/DM-55041.feature.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/changes/DM-55041.feature.md b/doc/changes/DM-55041.feature.md index 064c0e91..c232d459 100644 --- a/doc/changes/DM-55041.feature.md +++ b/doc/changes/DM-55041.feature.md @@ -1 +1 @@ -Added a new 
`lsst.images.zarr` archive backend that reads and writes Zarr v3 archives. The on-disk layout is xarray/CF-shaped at the root (`image`, `variance`, `mask` as siblings sharing `(y, x)` dimensions, CF `flag_masks`/`flag_meanings` on the mask) with OME-NGFF v0.5 multiscales metadata layered on top — the same bytes are visible to xarray, GDAL, and OME-Zarr tooling like `napari` and `ome-zarr-py`. Supports `Image`, `Mask`, `MaskedImage`, and `ColorImage`. Cloud-friendly defaults (tile-aligned chunks, fsspec-backed remote stores) and subset reads that only fetch the chunks they need. Install via the new `[zarr]` extra (`pip install lsst-images[zarr]`). +Added a new `lsst.images.zarr` archive backend that reads and writes Zarr v3 archives. The on-disk layout is xarray/CF-shaped at the root (`image`, `variance`, `mask` as siblings sharing `(y, x)` dimensions, CF `flag_masks`/`flag_meanings` on the mask) with OME-NGFF v0.5 multiscales metadata layered on top — the same bytes are visible to xarray, GDAL, and OME-Zarr tooling like `napari` and `ome-zarr-py`. Supports `Image`, `Mask`, `MaskedImage`, and `ColorImage`. Cloud-friendly defaults (256-pixel tile-aligned chunks, automatic v3 sharding tuned for ~16 MiB shards on S3/GCS, fsspec-backed remote stores) and subset reads that only fetch the chunks they need. Tunable via the `LSST_IMAGES_ZARR_TARGET_SHARD_BYTES` environment variable. Install via the new `[zarr]` extra (`pip install lsst-images[zarr]`). From 9e7189b00223ae9c271dd6024688c871fd41460a Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 15:18:24 -0700 Subject: [PATCH 52/60] docs(zarr): refresh module docstring for new chunk default and sharding Update the "Cloud-friendly defaults" docstring bullet in the zarr __init__ to reflect the 256 chunk cap (was 1024), reference the DEFAULT_CHUNK_AXIS_LIMIT constant, and add a new bullet describing automatic v3 sharding with the byte-budget rule and env-var escape hatch. Also drop the now-redundant max(1, ...) 
guard from the default_shards code sketch in the sharding design spec. Generated with AI Co-Authored-By: SLAC AI --- .../specs/2026-05-25-zarr-sharding-design.md | 2 +- python/lsst/images/zarr/__init__.py | 12 ++++++++++-- 2 files changed, 11 insertions(+), 3 deletions(-) diff --git a/docs/superpowers/specs/2026-05-25-zarr-sharding-design.md b/docs/superpowers/specs/2026-05-25-zarr-sharding-design.md index b638f627..0cdb9ed9 100644 --- a/docs/superpowers/specs/2026-05-25-zarr-sharding-design.md +++ b/docs/superpowers/specs/2026-05-25-zarr-sharding-design.md @@ -118,7 +118,7 @@ def default_shards( if not growable: return None # array fits in one chunk per axis ratio = target_bytes / chunk_bytes - k = max(1, round(ratio ** (1 / len(growable)))) + k = round(ratio ** (1.0 / len(growable))) if k <= 1: return None # rounding produced a no-op shard shard = list(chunks) diff --git a/python/lsst/images/zarr/__init__.py b/python/lsst/images/zarr/__init__.py index cbac1931..44bd8bc8 100644 --- a/python/lsst/images/zarr/__init__.py +++ b/python/lsst/images/zarr/__init__.py @@ -70,9 +70,17 @@ Cloud-friendly defaults ----------------------- -- Default chunk geometry is tile-aligned: ``min(1024, dim)`` per +- Default chunk geometry is tile-aligned: ``min(256, dim)`` per axis for plain images, ``cell_shape`` for ``CellCoadd``, - single-cell for ``CellCoadd``'s 4-D PSF. + single-cell for ``CellCoadd``'s 4-D PSF. The per-axis cap is + configurable via the `DEFAULT_CHUNK_AXIS_LIMIT` constant. +- Bulk pixel arrays (``image``, ``variance``, ``mask``, and + ``CellCoadd``'s ``psf``) are sharded by default to keep object + counts on S3 / GCS low. The shard size is chosen by a byte-budget + rule (~16 MiB by default; tunable via the + ``LSST_IMAGES_ZARR_TARGET_SHARD_BYTES`` environment variable). + Tiny single-chunk arrays (``lsst_json``, ``wcs_ast``, FITS + opaque-metadata blocks) stay unsharded. 
- Subset reads via ``slices=`` to `~lsst.images.serialization.InputArchive.get_array` exploit zarr's chunk index: only chunks intersecting the slice are fetched, even From 0c2f12a72319710be595bea6367052aab738ea04 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 15:32:52 -0700 Subject: [PATCH 53/60] fix(zarr): write top-level Mask by reading schema from the object --- python/lsst/images/zarr/_output_archive.py | 5 ++++- tests/test_zarr_round_trip.py | 25 +++++++++++++++++++++- 2 files changed, 28 insertions(+), 2 deletions(-) diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index eaeb8581..433ab737 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -23,7 +23,7 @@ import pydantic import zarr -from .._mask import MaskSchema +from .._mask import Mask, MaskSchema from .._transforms import FrameSet from .._transforms._ast import Channel, StringStream from ..fits._common import FitsOpaqueMetadata @@ -468,6 +468,9 @@ def write( mask = getattr(obj, "mask", None) if mask is not None: mask_schema = getattr(mask, "schema", None) + if mask_schema is None and isinstance(obj, Mask): + # Top-level Mask: schema is on the object itself. 
+ mask_schema = obj.schema if mask_schema is not None: archive_metadata["mask_schema"] = mask_schema diff --git a/tests/test_zarr_round_trip.py b/tests/test_zarr_round_trip.py index e00c1e90..aeb0935c 100644 --- a/tests/test_zarr_round_trip.py +++ b/tests/test_zarr_round_trip.py @@ -15,7 +15,7 @@ import numpy as np -from lsst.images import Box, ColorImage, Image, MaskedImage, MaskPlane, MaskSchema +from lsst.images import Box, ColorImage, Image, Mask, MaskedImage, MaskPlane, MaskSchema try: import zarr @@ -83,6 +83,29 @@ def test_masked_image_round_trip(self) -> None: np.testing.assert_array_equal(recovered.image.array, original.image.array) np.testing.assert_array_equal(recovered.mask.array, original.mask.array) + def test_mask_round_trip(self) -> None: + # Top-level Mask: schema is on the object itself, not on an + # inner ``mask`` attribute. write() must reach it. + schema = MaskSchema( + [ + MaskPlane("BAD", "Bad pixel."), + MaskPlane("SAT", "Saturated."), + MaskPlane("CR", "Cosmic ray."), + ] + ) + original = Mask( + np.zeros((4, 5, schema.mask_size), dtype=schema.dtype), + bbox=Box.factory[10:14, 20:25], + schema=schema, + ) + original.set("BAD", np.array([[i % 2 == 0 for i in range(5)] for _ in range(4)])) + original.set("SAT", np.array([[i > 2 for i in range(5)] for _ in range(4)])) + with RoundtripZarr(self, original) as roundtrip: + recovered = roundtrip.result + np.testing.assert_array_equal(recovered.array, original.array) + self.assertEqual(recovered.bbox, original.bbox) + self.assertEqual(list(recovered.schema.names), list(original.schema.names)) + def test_masked_image_with_40_planes_round_trip(self) -> None: schema = MaskSchema([MaskPlane(f"P{i}", f"Plane {i}.") for i in range(40)]) image = Image( From c832e86ac20bba818a90a0e7d81fb34b63785028 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 15:46:06 -0700 Subject: [PATCH 54/60] fix(zarr): pack mask with schema element stride, not byte stride --- 
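Note (illustrative only, not part of the change): the element-stride
packing that this patch switches to can be sketched in isolation as
below. This is a minimal sketch, assuming a ``(y, x, mask_size)`` input
array and a ``uint64`` packed target; ``pack_elements`` and
``unpack_elements`` are hypothetical names chosen for the example, not
functions in ``lsst.images.zarr``.

```python
import numpy as np


def pack_elements(array: np.ndarray) -> np.ndarray:
    """Pack a (y, x, mask_size) mask into one uint64 per pixel.

    Hypothetical helper: element ``i`` is shifted by ``stride * i`` bits,
    where ``stride`` is the bit width of the schema's element dtype.
    """
    stride = 8 * array.dtype.itemsize
    packed = np.zeros(array.shape[:2], dtype=np.uint64)
    for i in range(array.shape[2]):
        packed |= array[..., i].astype(np.uint64) << np.uint64(stride * i)
    return packed


def unpack_elements(packed: np.ndarray, mask_size: int, element_dtype: np.dtype) -> np.ndarray:
    """Inverse of ``pack_elements``; returns a (mask_size, y, x) array."""
    stride = 8 * element_dtype.itemsize
    element_mask = (np.uint64(1) << np.uint64(stride)) - np.uint64(1)
    out = np.empty((mask_size,) + packed.shape, dtype=element_dtype)
    for i in range(mask_size):
        out[i] = ((packed >> np.uint64(stride * i)) & element_mask).astype(element_dtype)
    return out


# Two uint16 elements per pixel; assume plane 16 is bit 0 of element 1.
mask = np.zeros((4, 5, 2), dtype=np.uint16)
mask[0, 0, 1] = 1
packed = pack_elements(mask)  # plane 16 lands at packed bit 16
restored = np.moveaxis(unpack_elements(packed, 2, mask.dtype), 0, -1)
```

With a byte stride (``8 * i``) the same input would place the plane-16
bit at packed bit 8, which is the mismatch with CF ``flag_masks`` that
the new test in this patch guards against.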
python/lsst/images/zarr/_input_archive.py | 34 +++++++++++++--------- python/lsst/images/zarr/_output_archive.py | 17 ++++++++--- tests/test_zarr_round_trip.py | 29 ++++++++++++++++++ 3 files changed, 62 insertions(+), 18 deletions(-) diff --git a/python/lsst/images/zarr/_input_archive.py b/python/lsst/images/zarr/_input_archive.py index cf398015..ac957129 100644 --- a/python/lsst/images/zarr/_input_archive.py +++ b/python/lsst/images/zarr/_input_archive.py @@ -154,7 +154,7 @@ def get_array( and len(node.shape) == 2 and "flag_masks" in node.attributes.extra ): - return self._read_packed_mask(node, claimed_shape, slices) + return self._read_packed_mask(node, claimed_shape, np.dtype(model.datatype.to_numpy()), slices) # Standard path: forward slices straight to the lazy handle. return node.read(slices=slices) @@ -163,37 +163,43 @@ def _read_packed_mask( self, node: ZarrArray, claimed_shape: tuple[int, ...], + element_dtype: np.dtype, slices: tuple[slice, ...] | EllipsisType, ) -> np.ndarray: """Unpack a 2-D wide-int mask back to 3-D ``(mask_size, y, x)``. Mask deserialization expects the storage layout that - ``Mask.serialize`` streamed — ``(mask_size, y, x)`` — and does - the swap to ``(y, x, mask_size)`` itself. Rank-3 ``slices`` - from the deserializer are ``(byte_axis, y_slice, x_slice)``; - the byte axis is stripped before forwarding the spatial slice - to the lazy handle and re-applied to the unpacked output. + ``Mask.serialize`` streamed — ``(mask_size, y, x)`` — with one + ``element_dtype`` element per slice along the leading axis, + matching the schema's element packing. Each element's bits + live at packed positions ``[stride*i, stride*(i+1))`` where + ``stride = 8 * element_dtype.itemsize``. Rank-3 ``slices`` + from the deserializer are ``(element_axis, y_slice, + x_slice)``; the leading slice is stripped before forwarding + the spatial slice to the lazy handle and re-applied to the + unpacked output. 
""" mask_size = claimed_shape[-1] # Forward slice to the lazy handle so only intersecting chunks # are fetched even on remote stores. if slices is ...: spatial_slices: tuple[slice, ...] | EllipsisType = ... - byte_slice: slice | EllipsisType = ... + element_slice: slice | EllipsisType = ... elif len(slices) == 3: - byte_slice = slices[0] + element_slice = slices[0] spatial_slices = slices[1:] else: spatial_slices = slices - byte_slice = ... + element_slice = ... packed = node.read(slices=spatial_slices) - # Unpack: low byte first, in the (mask_size, y, x) layout. - out = np.empty((mask_size,) + packed.shape, dtype=np.uint8) + stride = 8 * element_dtype.itemsize + element_mask = (np.uint64(1) << np.uint64(stride)) - np.uint64(1) + out = np.empty((mask_size,) + packed.shape, dtype=element_dtype) for i in range(mask_size): - out[i] = (packed >> np.uint64(8 * i)) & np.uint64(0xFF) - if byte_slice is ...: + out[i] = ((packed >> np.uint64(stride * i)) & element_mask).astype(element_dtype) + if element_slice is ...: return out - return out[byte_slice] + return out[element_slice] def get_table( self, diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index 433ab737..e33f2014 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -203,10 +203,14 @@ def add_array( ) ir_array.attributes.extra = extra parent.arrays[leaf] = ir_array + # The model reports the schema's element dtype (uint8 / + # uint16 / ...) so the input archive can recover the + # original ``(y, x, mask_size)`` array; the on-disk array + # itself is the wide packed integer. 
return ArrayReferenceModel( source=f"zarr:{zarr_path}", shape=list(packed.shape), - datatype=NumberType.from_numpy(packed.dtype), + datatype=NumberType.from_numpy(array.dtype), ) chunks = self._chunks.get(name) or self._chunks.get(leaf) @@ -279,11 +283,16 @@ def _pack_mask(self, array: np.ndarray) -> tuple[np.ndarray, CfFlagAttributes]: ) n_planes = len(schema) target_dtype = mask_dtype_for_plane_count(n_planes) - # Pack: each (y, x) pixel's mask_size bytes -> one wide integer. - # Byte 0 is the low byte (planes 0..7), byte 1 is the next, etc. + # Pack: each (y, x) pixel's mask_size schema-dtype elements + # become one wide integer. Element 0 occupies bits + # [0, stride), element 1 occupies [stride, 2*stride), etc., + # where stride = 8 * schema.dtype.itemsize. Plane N therefore + # lives at packed bit position N, matching the CF flag_masks + # attribute (1 << N). + stride = 8 * array.dtype.itemsize packed = np.zeros(array.shape[:2], dtype=target_dtype) for i in range(array.shape[2]): - packed |= array[..., i].astype(target_dtype) << (8 * i) + packed |= array[..., i].astype(target_dtype) << (stride * i) # ``MaskSchema`` may carry ``None`` placeholders for retired plane # bits; drop them in the CF flag list. planes = [ diff --git a/tests/test_zarr_round_trip.py b/tests/test_zarr_round_trip.py index aeb0935c..ac24f956 100644 --- a/tests/test_zarr_round_trip.py +++ b/tests/test_zarr_round_trip.py @@ -106,6 +106,35 @@ def test_mask_round_trip(self) -> None: self.assertEqual(recovered.bbox, original.bbox) self.assertEqual(list(recovered.schema.names), list(original.schema.names)) + def test_uint16_mask_packs_with_element_stride(self) -> None: + # 20-plane uint16 schema (mask_size = 2 elements per pixel). + # Setting plane 16 must produce an on-disk packed value whose + # CF flag_masks[16] bit is set — the bit position must match + # the schema's element-stride layout, not the byte-stride + # layout. Without the fix, plane 16 lands at packed bit 8. 
+ schema = MaskSchema( + [MaskPlane(f"P{i}", f"Plane {i}.") for i in range(20)], + dtype=np.uint16, + ) + image = Image( + np.zeros((4, 5), dtype=np.float32), + bbox=Box.factory[10:14, 20:25], + ) + original = MaskedImage(image, mask_schema=schema) + target_pixel = np.zeros((4, 5), dtype=bool) + target_pixel[0, 0] = True + original.mask.set("P16", target_pixel) + with RoundtripZarr(self, original) as roundtrip: + with open_store_for_read(roundtrip.filename) as store: + root = zarr.open_group(store=store, mode="r", zarr_format=3) + mask_arr = root["mask"] + flag_masks = list(mask_arr.attrs["flag_masks"]) + on_disk = int(mask_arr[0, 0]) + self.assertEqual(flag_masks[16], 1 << 16) + self.assertNotEqual(on_disk & flag_masks[16], 0) + recovered = roundtrip.result + np.testing.assert_array_equal(recovered.mask.array, original.mask.array) + def test_masked_image_with_40_planes_round_trip(self) -> None: schema = MaskSchema([MaskPlane(f"P{i}", f"Plane {i}.") for i in range(40)]) image = Image( From 0077cbfa2b79be21e6b72daef27be2388272db55 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 15:50:55 -0700 Subject: [PATCH 55/60] test(zarr): verify CF flag decoding via xarray for multi-element masks --- tests/test_zarr_xarray_interop.py | 69 ++++++++++++++++++++++++++++++- 1 file changed, 68 insertions(+), 1 deletion(-) diff --git a/tests/test_zarr_xarray_interop.py b/tests/test_zarr_xarray_interop.py index 525b47d8..8048d369 100644 --- a/tests/test_zarr_xarray_interop.py +++ b/tests/test_zarr_xarray_interop.py @@ -17,7 +17,7 @@ import numpy as np -from lsst.images import Box, Image, MaskedImage, MaskPlane, MaskSchema +from lsst.images import Box, Image, Mask, MaskedImage, MaskPlane, MaskSchema try: import zarr # noqa: F401 @@ -108,5 +108,72 @@ def test_open_zarr_data_values_match_in_memory(self) -> None: np.testing.assert_array_equal(ds["mask"].values, packed) +@unittest.skipUnless(HAVE_ZARR and HAVE_XARRAY, "xarray is not installed") +class 
XarrayCfFlagDecodingTestCase(unittest.TestCase): + """A standalone CF-aware reader can decode plane membership. + + Uses ``xarray.open_zarr`` to read the archive without any LSST + code on the read side, then applies the standard CF flag-decoding + rule ``(value & flag_masks[i]) != 0`` to recover the plane + membership of every pixel. The recovered membership must match + what was written. Catches regressions in the on-disk packing + layout (e.g. element-stride vs byte-stride bugs) that would + otherwise be invisible to an internal round-trip. + """ + + def test_uint16_schema_decodes_under_cf_rules(self) -> None: + # 20-plane uint16 schema (mask_size = 2) — exercises the + # multi-element packing path. + plane_names = [f"P{i}" for i in range(20)] + schema = MaskSchema( + [MaskPlane(name, f"Plane {name}.") for name in plane_names], + dtype=np.uint16, + ) + # Set distinct planes in distinct pixels so a single pass can + # cover the whole bit range, including planes that only the + # high element holds (P16..P19). + original = Mask( + np.zeros((4, 5, schema.mask_size), dtype=schema.dtype), + bbox=Box.factory[0:4, 0:5], + schema=schema, + ) + plane_for_pixel = {(0, 0): "P0", (0, 1): "P7", (1, 2): "P8", (2, 3): "P15", (3, 4): "P16"} + for (y, x), plane_name in plane_for_pixel.items(): + sel = np.zeros((4, 5), dtype=bool) + sel[y, x] = True + original.set(plane_name, sel) + + with tempfile.TemporaryDirectory() as tmp: + target = os.path.join(tmp, "mask.zarr") + write(original, target) + ds = xr.open_zarr(target, consolidated=False) + mask_da = ds["mask"] + flag_masks = list(mask_da.attrs["flag_masks"]) + flag_meanings = mask_da.attrs["flag_meanings"].split() + self.assertEqual(flag_meanings, plane_names) + self.assertEqual(flag_masks, [1 << i for i in range(20)]) + mask_values = mask_da.values + # CF decode: plane i is set at (y, x) iff + # (mask_values[y, x] & flag_masks[i]) != 0. 
+ for (y, x), plane_name in plane_for_pixel.items(): + plane_idx = flag_meanings.index(plane_name) + bit = flag_masks[plane_idx] + self.assertNotEqual( + int(mask_values[y, x]) & bit, + 0, + f"plane {plane_name} (bit {bit:#x}) not set at ({y}, {x}); " + f"on-disk value = {int(mask_values[y, x]):#x}", + ) + # All other planes must be unset at this pixel. + for other_idx in range(len(flag_meanings)): + if other_idx == plane_idx: + continue + self.assertEqual( + int(mask_values[y, x]) & flag_masks[other_idx], + 0, + f"plane {flag_meanings[other_idx]} unexpectedly set at ({y}, {x})", + ) + + if __name__ == "__main__": unittest.main() From a949612651aef7b1992dc2b9293a992dcbf95293 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 16:03:07 -0700 Subject: [PATCH 56/60] fix(zarr): resolve CellCoadd cell shape via obj.grid.cell_shape --- python/lsst/images/zarr/_output_archive.py | 82 +++++++++++++++++----- tests/test_cell_coadd.py | 26 +++++++ tests/test_zarr_output_archive.py | 57 ++++++++++++++- 3 files changed, 145 insertions(+), 20 deletions(-) diff --git a/python/lsst/images/zarr/_output_archive.py b/python/lsst/images/zarr/_output_archive.py index e33f2014..ccc8fc66 100644 --- a/python/lsst/images/zarr/_output_archive.py +++ b/python/lsst/images/zarr/_output_archive.py @@ -446,6 +446,69 @@ def _data_model_for(archive_class: str) -> str: }.get(archive_class, f"org.lsst.{archive_class.lower()}") +def build_archive_metadata(obj: Any) -> dict[str, Any]: + """Resolve layout-affecting metadata from an in-memory archive object. + + The output archive's chunk and metadata rules consult + ``cell_shape`` (used by `~lsst.images.zarr._layout.chunks_for` to + align chunks to a `CellCoadd`'s cells) and ``mask_schema`` (used + by `_pack_mask` to produce CF flag attributes). Different archive + classes expose this information under different attribute names: + + - ``Image``: nothing (no cell grid, no mask schema). + - ``MaskedImage``: ``mask.schema``. 
+ - ``Mask``: ``schema`` directly on the object. + - ``CellCoadd``: ``mask.schema`` and ``grid.cell_shape``. + + Returns a flat ``dict`` ready to pass as + ``ZarrOutputArchive(archive_metadata=...)``. Keys are present + only when a value was found. + """ + metadata: dict[str, Any] = {} + cell_shape = _resolve_cell_shape(obj) + if cell_shape is not None: + metadata["cell_shape"] = cell_shape + mask_schema = _resolve_mask_schema(obj) + if mask_schema is not None: + metadata["mask_schema"] = mask_schema + return metadata + + +def _resolve_cell_shape(obj: Any) -> tuple[int, ...] | None: + """Return the cell shape as a ``(y, x)`` tuple, or ``None``. + + Tries ``obj.cell_shape`` first, then ``obj.grid.cell_shape`` + (used by `CellCoadd`), then ``obj.cell_grid.cell_shape``. + """ + direct = getattr(obj, "cell_shape", None) + if direct is not None: + return tuple(direct) + grid = getattr(obj, "grid", None) + if grid is None: + grid = getattr(obj, "cell_grid", None) + if grid is not None: + nested = getattr(grid, "cell_shape", None) + if nested is not None: + return tuple(nested) + return None + + +def _resolve_mask_schema(obj: Any) -> MaskSchema | None: + """Return the mask schema, or ``None`` if the object has no mask.""" + direct = getattr(obj, "mask_schema", None) + if direct is not None: + return direct + mask = getattr(obj, "mask", None) + if mask is not None: + nested = getattr(mask, "schema", None) + if nested is not None: + return nested + if isinstance(obj, Mask): + # Top-level Mask: schema is on the object itself. 
+ return obj.schema + return None + + def write( obj: Any, path: Any, @@ -464,24 +527,7 @@ def write( """ archive_class = type(obj).__name__ archive_default_name = getattr(obj, "_archive_default_name", None) - archive_metadata: dict[str, Any] = {} - if (cell_shape := getattr(obj, "cell_shape", None)) is not None: - archive_metadata["cell_shape"] = tuple(cell_shape) - if (cell_grid := getattr(obj, "cell_grid", None)) is not None: - archive_metadata["cell_grid"] = { - "bbox": list(cell_grid.bbox) if hasattr(cell_grid, "bbox") else None, - "cell_shape": list(cell_grid.cell_shape) if hasattr(cell_grid, "cell_shape") else None, - } - mask_schema = getattr(obj, "mask_schema", None) - if mask_schema is None: - mask = getattr(obj, "mask", None) - if mask is not None: - mask_schema = getattr(mask, "schema", None) - if mask_schema is None and isinstance(obj, Mask): - # Top-level Mask: schema is on the object itself. - mask_schema = obj.schema - if mask_schema is not None: - archive_metadata["mask_schema"] = mask_schema + archive_metadata = build_archive_metadata(obj) archive = ZarrOutputArchive( chunks=chunks, diff --git a/tests/test_cell_coadd.py b/tests/test_cell_coadd.py index 8e47af0d..deac31f7 100644 --- a/tests/test_cell_coadd.py +++ b/tests/test_cell_coadd.py @@ -24,11 +24,21 @@ DP2_COADD_DATA_ID, DP2_COADD_MISSING_CELL, RoundtripFits, + RoundtripZarr, assert_masked_images_equal, assert_psfs_equal, compare_cell_coadd_to_legacy, ) +try: + import zarr + + from lsst.images.zarr._store import open_store_for_read + + HAVE_ZARR = True +except ImportError: + HAVE_ZARR = False + DATA_DIR = os.environ.get("TESTDATA_IMAGES_DIR", None) @@ -147,6 +157,22 @@ def test_roundtrip(self) -> None: psf_points=self.psf_points, ) + @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") + def test_zarr_roundtrip_uses_cell_aligned_chunks(self) -> None: + """Writing a CellCoadd to zarr aligns chunks to the cell shape. 
+ + The bug fixed in DM-55041 was that ``write()`` probed + ``obj.cell_shape`` / ``obj.cell_grid`` but `CellCoadd` exposes + the cell shape under ``obj.grid.cell_shape``. Without the fix, + real CellCoadd writes fall back to generic 256-pixel chunks + instead of cell-aligned chunks. + """ + cell_shape = self.cell_coadd.grid.cell_shape + with RoundtripZarr(self, self.cell_coadd, "CellCoadd") as roundtrip: + with open_store_for_read(roundtrip.filename) as store: + root = zarr.open_group(store=store, mode="r", zarr_format=3) + self.assertEqual(tuple(root["image"].chunks), (cell_shape.y, cell_shape.x)) + if __name__ == "__main__": unittest.main() diff --git a/tests/test_zarr_output_archive.py b/tests/test_zarr_output_archive.py index db94953f..0b4a6323 100644 --- a/tests/test_zarr_output_archive.py +++ b/tests/test_zarr_output_archive.py @@ -14,13 +14,14 @@ import os import tempfile import unittest +from types import SimpleNamespace import astropy.io.fits import astropy.table import numpy as np import pydantic -from lsst.images import Box, ColorImage, Image, MaskedImage, MaskPlane, MaskSchema +from lsst.images import YX, Box, ColorImage, Image, Mask, MaskedImage, MaskPlane, MaskSchema from lsst.images.fits._common import ExtensionKey, FitsOpaqueMetadata try: @@ -28,7 +29,7 @@ from lsst.images.zarr import ZarrPointerModel, write from lsst.images.zarr._model import ZarrDocument - from lsst.images.zarr._output_archive import ZarrOutputArchive + from lsst.images.zarr._output_archive import ZarrOutputArchive, build_archive_metadata HAVE_ZARR = True except ImportError: @@ -370,5 +371,57 @@ def test_fits_opaque_metadata_persists(self) -> None: self.assertEqual(recovered["EXPTIME"], 30.0) +@unittest.skipUnless(HAVE_ZARR, "zarr is not installed") +class BuildArchiveMetadataTestCase(unittest.TestCase): + """`build_archive_metadata` resolves cell shape and mask schema.""" + + def test_cell_shape_from_grid_attribute(self) -> None: + # CellCoadd exposes its cells via 
``.grid.cell_shape`` (a YX), + # not via ``.cell_shape`` directly. The resolver must walk the + # grid attribute so cell-aligned chunks fire on real writes. + grid = SimpleNamespace(cell_shape=YX(y=150, x=200)) + obj = SimpleNamespace(grid=grid) + metadata = build_archive_metadata(obj) + self.assertEqual(metadata["cell_shape"], (150, 200)) + + def test_cell_shape_direct_attribute_wins(self) -> None: + # If both ``.cell_shape`` and ``.grid.cell_shape`` exist the + # direct attribute is preferred (allows callers to override). + grid = SimpleNamespace(cell_shape=YX(y=150, x=200)) + obj = SimpleNamespace(cell_shape=(64, 64), grid=grid) + metadata = build_archive_metadata(obj) + self.assertEqual(metadata["cell_shape"], (64, 64)) + + def test_cell_shape_from_legacy_cell_grid_attribute(self) -> None: + # Older / hypothetical objects may expose ``.cell_grid`` instead + # of ``.grid``; the resolver falls through to that. + cell_grid = SimpleNamespace(cell_shape=YX(y=128, x=128)) + obj = SimpleNamespace(cell_grid=cell_grid) + metadata = build_archive_metadata(obj) + self.assertEqual(metadata["cell_shape"], (128, 128)) + + def test_no_cell_shape_for_plain_image(self) -> None: + image = Image(np.zeros((4, 5), dtype=np.float32), bbox=Box.factory[0:4, 0:5]) + metadata = build_archive_metadata(image) + self.assertNotIn("cell_shape", metadata) + + def test_mask_schema_from_inner_mask(self) -> None: + schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) + image = Image(np.zeros((4, 5), dtype=np.float32), bbox=Box.factory[0:4, 0:5]) + masked = MaskedImage(image, mask_schema=schema) + metadata = build_archive_metadata(masked) + self.assertIs(metadata["mask_schema"], masked.mask.schema) + + def test_mask_schema_for_top_level_mask(self) -> None: + schema = MaskSchema([MaskPlane("BAD", "Bad pixel.")]) + mask = Mask( + np.zeros((4, 5, schema.mask_size), dtype=schema.dtype), + bbox=Box.factory[0:4, 0:5], + schema=schema, + ) + metadata = build_archive_metadata(mask) + 
self.assertIs(metadata["mask_schema"], schema) + + if __name__ == "__main__": unittest.main() From 4df6981115d828a932d0863e0981e824c48044b1 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 16:07:30 -0700 Subject: [PATCH 57/60] fix(zarr): normalize affine in OME transforms so scale doesn't double-apply --- python/lsst/images/zarr/_layout.py | 16 +++++++++++++--- tests/test_zarr_layout.py | 28 +++++++++++++++++++++++----- 2 files changed, 36 insertions(+), 8 deletions(-) diff --git a/python/lsst/images/zarr/_layout.py b/python/lsst/images/zarr/_layout.py index e2bf5eca..140d2c49 100644 --- a/python/lsst/images/zarr/_layout.py +++ b/python/lsst/images/zarr/_layout.py @@ -255,15 +255,25 @@ def affine_check( f"linearApprox returned {len(coeffs)} coefficients; expected 6 for a 2-D pixel→sky mapping." ) c0, c1, j00, j10, j01, j11 = (float(x) for x in coeffs) - # OME affine: rows are output axes, columns are input axes + the - # constant offset. - affine_matrix = [[j00, j01, c0], [j10, j11, c1], [0.0, 0.0, 1.0]] # Pixel scale per input axis: length of the corresponding Jacobian # column in output coordinates. scale_axis0 = float(np.hypot(j00, j10)) scale_axis1 = float(np.hypot(j01, j11)) + # NGFF composes ``coordinateTransformations`` in list order: the + # scale is applied first, then the affine. To avoid double-counting + # the pixel-size factor, normalise each Jacobian column by its + # length so the affine carries only the rotation / shear that the + # scale does not capture. ``pixel_scale`` is the geometric mean of + # the two column norms; if it were zero we'd already have returned + # above, so dividing by ``scale_axis*`` is safe here. 
+ j00_n = j00 / scale_axis0 + j10_n = j10 / scale_axis0 + j01_n = j01 / scale_axis1 + j11_n = j11 / scale_axis1 + affine_matrix = [[j00_n, j01_n, c0], [j10_n, j11_n, c1], [0.0, 0.0, 1.0]] + coordinate_transformations: list[dict[str, Any]] = [ {"type": "scale", "scale": [scale_axis0, scale_axis1]}, {"type": "affine", "affine": affine_matrix}, diff --git a/tests/test_zarr_layout.py b/tests/test_zarr_layout.py index f61772c0..d1bb4959 100644 --- a/tests/test_zarr_layout.py +++ b/tests/test_zarr_layout.py @@ -115,6 +115,12 @@ def _make_distorted_frame_set(self) -> FrameSet: return fs def test_pure_linear_passes(self) -> None: + # NGFF v0.5 composes ``coordinateTransformations`` in list + # order: ``scale`` is applied first, then ``affine``. For a + # pure 0.2 pixel→sky scale, the composed effect on a unit + # pixel must be 0.2 — not 0.04 (which would result from + # leaving the scale embedded in both the explicit scale block + # AND the affine's Jacobian). fs = self._make_linear_frame_set(scale=0.2) result = affine_check( frame_set=fs, @@ -123,12 +129,24 @@ def test_pure_linear_passes(self) -> None: ) self.assertFalse(result.dropped) self.assertIsNotNone(result.coordinate_transformations) - # The affine block AST returns has the scale embedded in the - # diagonal of the affine matrix. - affine_block = result.coordinate_transformations[1] + ct = result.coordinate_transformations + assert ct is not None # for type checkers + scale_block, affine_block = ct[0], ct[1] + self.assertEqual(scale_block["type"], "scale") self.assertEqual(affine_block["type"], "affine") - self.assertAlmostEqual(affine_block["affine"][0][0], 0.2) - self.assertAlmostEqual(affine_block["affine"][1][1], 0.2) + # The scale block carries the per-axis pixel size; the affine + # has unit-norm columns (a pure rotation/translation here). 
+ self.assertAlmostEqual(scale_block["scale"][0], 0.2) + self.assertAlmostEqual(scale_block["scale"][1], 0.2) + self.assertAlmostEqual(affine_block["affine"][0][0], 1.0) + self.assertAlmostEqual(affine_block["affine"][1][1], 1.0) + # Apply the scale first, then the affine (affine ∘ scale), to a unit pixel vector. + scale = scale_block["scale"] + affine = np.array(affine_block["affine"]) + scaled = np.array([scale[0] * 1.0, scale[1] * 1.0, 1.0]) + composed = affine @ scaled + self.assertAlmostEqual(composed[0], 0.2) + self.assertAlmostEqual(composed[1], 0.2) def test_high_distortion_drops_block(self) -> None: fs = self._make_distorted_frame_set() From 8a6b7afb1333d217411b88d2557315aaf2933cc3 Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 16:14:18 -0700 Subject: [PATCH 58/60] docs: strip trailing blank lines from zarr-io plan --- docs/superpowers/plans/2026-05-22-zarr-io-backend.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/docs/superpowers/plans/2026-05-22-zarr-io-backend.md b/docs/superpowers/plans/2026-05-22-zarr-io-backend.md index 1b062d0d..b482f2e2 100644 --- a/docs/superpowers/plans/2026-05-22-zarr-io-backend.md +++ b/docs/superpowers/plans/2026-05-22-zarr-io-backend.md @@ -5351,8 +5351,3 @@ These are intentional handoffs, not placeholder content in the production code. 2. Aligned chunks — Phase 2.5 test asserting `variance` follows `image_chunks` after the override; CellCoadd test in 4.4 asserting all three siblings have `cell_shape` chunks. 3. Affine residual validator — Phase 2.3 tests with a synthetic linear FrameSet (passes) and a synthetic high-distortion FrameSet (drops). 4. No byte duplication — implicit in the "no fixup pass" architecture; explicit assertions in 4.1 (root has no OME multiscales for ColorImage) and 4.4 (CellCoadd PSF is a single 4-D array, not per-cell groups + a stacked array).
- - - - - From 0b25cc98be914ac4484fbc0b0c549d0a329513fc Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 16:22:02 -0700 Subject: [PATCH 59/60] docs(zarr): fix title-level inconsistency and stray single backticks --- doc/lsst.images/zarr.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/lsst.images/zarr.rst b/doc/lsst.images/zarr.rst index 7ac7c841..3a8bf354 100644 --- a/doc/lsst.images/zarr.rst +++ b/doc/lsst.images/zarr.rst @@ -18,7 +18,7 @@ On top of that we layer four community standards so the same bytes are usable by * `xarray / CF-conventions `_ — every array carries an ``_ARRAY_DIMENSIONS`` attribute and a v3 ``dimension_names`` metadata field. The mask carries CF ``flag_masks`` / ``flag_meanings`` / ``flag_descriptions`` so any CF-aware tool can interpret the bit assignments. * `OME-NGFF v0.5 `_ — the root group carries a ``multiscales`` block whose only ``dataset.path`` points back at the same ``image`` array. This makes the same archive openable by OME-Zarr tooling without any byte duplication. -* `Geo-Zarr `_ shape compatibility — sibling arrays sharing ``(y, x)`` dimensions with CF flag attributes is the same convention `rasterio` and `GDAL`'s Zarr driver expect for raster + mask layers. +* `Geo-Zarr `_ shape compatibility — sibling arrays sharing ``(y, x)`` dimensions with CF flag attributes is the same convention ``rasterio`` and ``GDAL``'s Zarr driver expect for raster + mask layers. * `LSST archive tree <#data-model>`_ — a Pydantic JSON document at ``/lsst_json`` carries the full LSST-specific metadata (WCS, PSF, detector, butler info, …) that the community standards have no place for. Same convention as the FITS backend's ``JSON`` HDU and the NDF backend's ``/MORE/LSST/JSON`` path. 
Data model @@ -52,7 +52,7 @@ Example layouts --------------- `~lsst.images.VisitImage` -~~~~~~~~~~~~~~~~~~~~~~~~~ +^^^^^^^^^^^^^^^^^^^^^^^^^ The most common case — a single detector exposure with a projection, PSF, and detector geometry:: @@ -70,7 +70,7 @@ The ``lsst_json`` tree carries the projection, PSF type, detector reference, obs For the WCS specifically, the projection's ``pixel_to_sky`` mapping is decomposed into a chain of Frames and Mappings (including any ``PolyMap`` for SIP distortion); reading is byte-exact. `~lsst.images.cells.CellCoadd` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ A coadd composed of a regular grid of cells, each with its own PSF:: @@ -90,7 +90,7 @@ The ``image`` / ``variance`` / ``mask`` chunks are aligned to the cell grid so r The ``psf`` array's chunking is per-cell so a single-cell PSF read is also one chunk. `~lsst.images.ColorImage` -~~~~~~~~~~~~~~~~~~~~~~~~~ +^^^^^^^^^^^^^^^^^^^^^^^^^ A 3-channel display image:: From 0e02466bd360b090a70b728a032c165786eb653d Mon Sep 17 00:00:00 2001 From: Tim Jenness Date: Mon, 25 May 2026 17:47:20 -0700 Subject: [PATCH 60/60] fixup! docs(zarr): refresh module docstring for new chunk default and sharding --- doc/lsst.images/zarr.rst | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/doc/lsst.images/zarr.rst b/doc/lsst.images/zarr.rst index 3a8bf354..ee5c3f27 100644 --- a/doc/lsst.images/zarr.rst +++ b/doc/lsst.images/zarr.rst @@ -4,7 +4,8 @@ Zarr I/O A Zarr v3 serialization backend whose on-disk layout is xarray/CF-shaped at the root (``image`` / ``variance`` / ``mask`` as siblings sharing ``(y, x)`` dimensions, CF ``flag_masks`` / ``flag_meanings`` on the mask) with OME-NGFF v0.5 multiscales metadata as a discoverability layer pointing at the same ``image`` array. The same bytes are visible to ``xarray``, GDAL's Zarr driver, and OME-Zarr tooling like ``napari`` and ``ome-zarr-py``. 
-Default chunking is tile-aligned (~1024×1024 for plain images, ``cell_shape`` for ``CellCoadd``); subset reads via ``slices=`` only fetch the chunks they need — including on remote stores accessed through ``lsst.resources.ResourcePath`` and ``fsspec``. +Default chunking is tile-aligned — 256 pixels per spatial axis for plain images, ``cell_shape`` for ``CellCoadd`` — and bulk pixel arrays are sharded with a ~16 MiB byte budget so a typical archive is a small handful of objects rather than thousands of chunk files. +Subset reads via ``slices=`` only fetch the chunks they need, including on remote stores accessed through ``lsst.resources.ResourcePath`` and ``fsspec``. This backend requires the optional ``zarr >= 3.0`` package. Install via the ``[zarr]`` extra:: @@ -48,6 +49,28 @@ Per-array data The ``image`` / ``variance`` / ``mask`` arrays at the root, plus any class-specific extras. Mask is a 2-D unsigned integer (``uint8`` for ≤8 planes, ``uint64`` for 17–64 planes; >64 raises) with CF ``flag_masks`` / ``flag_meanings`` / ``flag_descriptions``. +Chunking and sharding +--------------------- + +Chunks + The default chunk shape per top-level array is ``min(256, dim)`` per axis for plain image arrays (``DEFAULT_CHUNK_AXIS_LIMIT`` in `lsst.images.zarr`). + For `~lsst.images.cells.CellCoadd`, ``image`` / ``variance`` / ``mask`` chunks are aligned to ``cell_shape`` so a single-cell read is one chunk per array; the 4-D ``psf`` array is chunked ``(1, 1, Py, Px)`` so a single-cell PSF read is also one chunk. + Sibling arrays (``variance`` / ``mask``) inherit the ``image`` array's chunk shape unless the caller passes an explicit override to `~lsst.images.zarr.write`. + +Sharding + Bulk pixel arrays (``image`` / ``variance`` / ``mask`` / ``CellCoadd``'s ``psf``) are sharded by default so a remote archive on S3 / GCS is a small number of objects rather than thousands of chunk files. 
+ The shard shape is chosen by a byte-budget rule that grows axes whose chunk does not already cover the full extent until each shard is close to ``LSST_IMAGES_ZARR_TARGET_SHARD_BYTES`` of uncompressed data; the default budget is 16 MiB. + Shard axes are always integer multiples of the corresponding chunk axes, capped at the array extent. + Tiny single-chunk arrays (``lsst_json``, ``wcs_ast``, the FITS opaque-metadata block, per-PSF parameter arrays whose chunks already cover the whole array) are left unsharded — sharding them would only add a layer of indirection. + Sharding can be disabled or overridden per-array by passing ``shards={"image": None, ...}`` to `~lsst.images.zarr.write`. + +Stores + The store implementation is selected from the URI shape: a path ending in ``.zarr.zip`` (or any ``.zip``) opens a ``ZipStore``, a remote URI (``s3://``, ``gs://``, ``http(s)://``) opens a ``FsspecStore`` via `lsst.resources.ResourcePath`, and anything else opens a ``LocalStore`` directory. + Two caveats worth knowing about: + + * Writing a ``ZipStore`` directly to a remote URI is not yet supported — write to a local ``.zarr.zip`` and upload, or write to a remote directory store. Reading a remote ``.zarr.zip`` works (the file is fetched to a local cache first via ``ResourcePath.as_local``, then opened). + * After a directory or fsspec write, consolidated metadata is emitted so a single read fetches the whole hierarchy's ``zarr.json`` contents — a significant latency win on remote stores. ``ZipStore`` does not support consolidation; zip writes succeed without consolidated metadata, and reads of zip archives walk the hierarchy normally. + Example layouts ---------------
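Reviewer note on the "Chunking and sharding" text in PATCH 60: the byte-budget shard rule it describes can be sketched as follows. This is only an illustration under stated assumptions, not the code in `lsst.images.zarr`. The function name `sketch_shard_shape`, its parameters, and the grow-one-chunk-at-a-time strategy are hypothetical. The sketch also assumes a shard may round up to the next whole chunk past the array extent, which is one reading of "capped at the array extent".

```python
import math


def sketch_shard_shape(shape, chunks, itemsize, target_bytes=16 * 2**20):
    """Pick a shard shape that is a whole number of chunks per axis.

    Hypothetical helper: grows each axis one chunk at a time, never past
    the number of chunks needed to cover the array, and never past the
    uncompressed byte budget.
    """
    # Largest useful multiplier per axis: enough chunks to cover the extent.
    max_mult = [math.ceil(dim / chunk) for dim, chunk in zip(shape, chunks)]
    mult = [1] * len(shape)

    def shard_bytes(multipliers):
        # Uncompressed size of one shard for the given multipliers.
        return itemsize * math.prod(c * k for c, k in zip(chunks, multipliers))

    grew = True
    while grew:
        grew = False
        # Visit the trailing (fastest-varying) axis first on each pass.
        for axis in reversed(range(len(shape))):
            if mult[axis] >= max_mult[axis]:
                continue  # this axis already covers the full extent
            trial = list(mult)
            trial[axis] += 1
            if shard_bytes(trial) <= target_bytes:
                mult = trial
                grew = True
    return tuple(c * k for c, k in zip(chunks, mult))


# A 4096x4096 float32 image with 256-pixel chunks and a 16 MiB budget.
print(sketch_shard_shape((4096, 4096), (256, 256), itemsize=4))  # (2048, 2048)
```

For an array whose single chunk already covers the whole extent, the sketch returns the chunk shape unchanged. That corresponds to the arrays the documentation says are left unsharded.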