feat(accelerator): numba granularity backend#56
Conversation
First real accelerator end-to-end on top of the merged #49 dispatch: `set_accelerator("numba")` now routes `intensity` to a numba implementation and composes it with the numpy backend for every other feature. - _detect.py: capability flags (HAS_NUMBA/HAS_JAX/HAS_JAX_GPU) via find_spec, resolved once at import. No try/except — an absent backend is never attempted, a present-but-broken one raises. - primitives/: shared host segment layer. flatten_labeled reduces a labeled (Z,Y,X) image to flat (values, seg0, coords); a single kernel set then covers 2D, 3D and future batches with no image/batch axis baked in. max_position is a host scipy.ndimage.maximum_position call for bit-exact parity with the numpy backend's tie-break. - primitives/_segment_numba.py: @njit(cache=True), single-threaded kernels — fused single-pass moments + centroid cross-sums, residual-sumsq std, CSR per-segment quantiles/MAD. - core/numba/: import-selected backend (`from cp_measure.core.numba import get_intensity`); identical dict contract, 2D and 3D. - bulk._dispatch: "numba" composes numba intensity + numpy rest; raises if numba is not installed (no silent fallback). - numba is an optional extra ([numba]); the default install stays numba-free. CI tests install .[numba] and run the correctness harness. test/test_backend_correctness.py asserts numba == numpy (2D/3D, edge on/off, rtol=1e-6), the dispatch composition, and the absent-numba raise path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move Location_MaxIntensity_* out of the host scipy per-object call and into the fused segment_moments kernel via a deterministic `>=`-last argmax (records the max pixel's coordinates in the same single pass). scipy.ndimage.maximum_position's labeled tie-break is `argsort` (quicksort) + last-write-wins, i.e. an arbitrary tied pixel that is not stable across numpy versions — so there is no stable rule to replicate. On real continuous data the max is unique, so the kernel's `>=`-last result is bit-identical to scipy (the correctness harness confirms 2D/3D, edge on/off); only exact-value ties can differ, and the kernel's rule is the more reproducible of the two. Drops the now-unused max_position_per_object host helper (and its scipy import) from the primitive layer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- flatten_labeled: derive (z,y,x) coords from numpy.nonzero(lmask) instead of materialising three full-volume mgrid arrays then masking them — same coords in the same C order, no per-call O(volume) temporaries. - label_to_idx_lut: drop the unused sorted-labels return value (now just (lut, n)); the max_position-in-kernel refactor removed its only consumer. - add a lighter segment_stats kernel (count/sum/min/max) and use it for the edge path, replacing the segment_moments call that needed throwaway zero coordinate arrays and discarded the centroid cross-sums. No behaviour change; correctness harness + full suite stay green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
flatten_labeled built the flat (values, seg0, coords) arrays with a numpy (masks>0)&isfinite mask + numpy.nonzero + two fancy-index gathers — several full-image passes plus a boolean-array allocation, and the dominant cost of the non-edge path. Replace it with flatten_numba: two grid scans (count, then fill) in a single @njit kernel, coordinates taken from the loop indices. The flat-segment kernels and the rest of the backend are unchanged — only how the flat arrays are built. Measured (single image, non-edge core): flatten step ~4-10x faster (10x at 1024^2), full core ~1.1x (256^2) / ~1.5x (1024^2); the gain grows with image size. Bit-identical output (correctness harness stays green). The numpy flatten_labeled (its only consumer) is removed; primitives/segment.py now holds just the numpy label->index lookup, the numba layer owns the flatten. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- _detect.py: drop the unused HAS_JAX / HAS_JAX_GPU flags. Besides being dead for this PR, HAS_JAX_GPU eagerly imported jax at module load whenever jax was installed, just to set a flag nothing reads. jax detection lands with the jax backend; HAS_NUMBA alone establishes the find_spec pattern. - flatten the image without a forced float64 copy: pass masked_image through ascontiguousarray without dtype=, and let flatten_numba upcast the kept values into its float64 output. Avoids a full-image float64 temporary for non-float64 inputs (e.g. float32 microscopy data); bit-identical for float64. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
label_to_idx_lut used numpy.unique(masks) — a full-image sort — to find the present labels. scipy.ndimage.find_objects (scipy is already a core dep) returns the same ascending present-label set in one O(P) pass, giving a bit-identical LUT ~3-5x faster (12.4->3.5 ms at 1024^2; 21.9->4.4 ms on a 32x240x240 volume). Trick borrowed from Alan's pure-numpy speedup (#55); unlike its percentile/MAD changes, this one preserves output exactly (verified identical). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Profiling the sparse-large regime (1024^2, 64 obj, edge on) showed skimage.find_boundaries was ~37% of the call (~20-29 ms) — the morphology dominates, not the scan. A one-pass numba inner-boundary kernel (4-neighbour check, the cp_measure_fast approach) is bit-identical to find_boundaries( mode="inner") and 12-27x faster, verified exact across (H,W) and (1,H,W). Used for 2D planes (Z==1); true 3D keeps skimage (6-neighbourhood). Single-image 1024^2/64 edge-on drops ~47->32 ms, and per-image batch ~445->264 ms. No correctness change (exact boundary match; harness stays green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address PR #54 review: - bulk._dispatch: reword the absent-numba RuntimeError from the imperative "install it via" to "you can install it via" (avoid issuing pip commands imperatively at the user). - primitives is an internal layer with no public API to curate; import label_to_idx_lut directly from primitives.segment (matching how the _segment_numba kernels are already imported) and drop the __init__ re-export. Documents the convention in the package docstring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(accelerator): numba intensity backend
8d84546 to
13f1ee7
Compare
Both Zernike features on the #49 dispatch seam, sharing one fused numba kernel (core/numba/_zernike.py::zernike_moments): per pixel it evaluates the full Zernike basis (Horner over r^2 with centrosome's LUT coeffs + z=ym+i*xm azimuthal recurrence + strict r^2>1 disk cutoff) AND scatter-adds the weighted-complex Re/Im into per-object (n,K) accumulators in one pass -- no (M,K) intermediate. Shared by both features (weights=1 for shape, weights=pixels for radial); host post-step differs only in normalisation (pi*r^2 vs pixel count) and output (magnitude only vs magnitude+phase). Centrosome stays host-side for the degree-only LUT and minimum_enclosing_circle (the conventions live there -- a basis-eval unit test asserts our Horner matches construct_zernike_polynomials bit-exactly). 2D-only (3D -> {}, matching baseline); batch-shaped via to_bzyx (single = batch-of-1, serial per-image loop, no in-function parallelism). Reuses label_to_idx_lut and a single nonzero pass instead of np.unique / masks_to_ijv. Tests: basis-eval bit-exact vs centrosome; kernel vs numpy reference; both backends vs the numpy baseline within tol (single, batch, 3D->{}); dispatch wiring. Full suite 195 passed. Benchmark (1080^2, 144 obj): zernike 632->22 ms (28.7x), radial_zernikes 476->23 ms (20.9x). Stacked on #56 (needs to_bzyx); retarget base to main once #56 merges. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Shared foundational helper used by the numba intensity/granularity/zernike backends to normalise any input (2D/3D/4D/list) to the canonical batch-of-volumes form: single image = batch of 1, returning a dict for a lone image/volume and a list of dicts for a batch. Pure numpy, no numba. Extracted to its own PR so it can be reviewed first and unblock the feature backends (#56/#57/#58). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Second accelerator on the #49/#54 dispatch seam: set_accelerator("numba") now routes `granularity` (2D) to a numba backend alongside `intensity`. Kernels (core/numba/_granularity.py, single-threaded @njit, bit-exact vs skimage): - disk erosion/dilation via row-decomposed van-Herk/Gil-Werman sliding min/max (O(r·HW)) for the background opening - 5-tap disk(1) erosion/dilation for the spectrum loop - reconstruction by dilation via the Vincent (1993) FIFO-hybrid (1 fwd + 1 bwd raster, then a ring-buffer FIFO with a per-pixel in-queue flag — O(N), no overflow), replacing the per-iteration scipy/skimage morphology Wrapper (core/numba/measuregranularity.py): mirrors the numpy baseline but keeps resampling on scipy map_coordinates, swaps the morphology for the kernels, reads per-object means back via a sparse point-query of the reconstructed image (no full-res upsample) + dense bincount, and applies the cascaded reconstruction mask (rec_g <= rec_{g-1} <= pixels, exact by monotonicity). 3D / non-2D falls back to the numpy baseline. Batch shape: new primitives/shapes.py `to_bzyx` normalises any input to the canonical (B,Z,Y,X) form (2D/3D/4D array or ragged list), single image = batch of 1, returning a dict for a lone image/volume and a list of dicts for a batch. A 3D array stays one volume (never a batch), preserving existing semantics. Correctness: morphology bit-exact vs skimage (radii 1..10 + cascade chain); wrapper matches the numpy baseline within tol for fullres + all subsample modes, single/1-px/empty/label-gap objects, batch (4D + ragged), and 3D fallback. Full suite 177 passed. Benchmark (1080², 144 objects): fullres 9000->1369 ms (6.57x), default 1178->94 ms (12.47x). Follow-up: migrate the intensity backend onto to_bzyx (golden-test guarded). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Quality pass (no behaviour change; bit-exact, 107 tests pass): - _disk_reduce: np.full for the out init (was a hand-rolled nested loop); replace the `x % k` modulo in the per-pixel VHG prefix/suffix loops with a running counter (one hoisted modulo per row-band) - get_granularity: drop the `ero.max() == 0` early-out — it scanned the whole image every one of the 16 iterations, while reconstruction already returns zeros cheaply on an all-zero seed - shapes.to_bzyx: document that it is a pure batch normaliser (no mask-broadcast; per-element ndim dispatch stays in each backend) Benchmark (1080², 144 obj): fullres 6.57x->6.86x, default 12.47x->12.58x. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CI lint (nix fmt -> ruff-format) reformats lines over the 88-col default; apply ruff format to match. Also fix a stale module docstring in _granularity.py (reconstruction is the Vincent FIFO-hybrid, not raster-until-convergence). No behaviour change; 99 granularity tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
13f1ee7 to
78788a4
Compare
Both Zernike features on the #49 dispatch seam, sharing one fused numba kernel (core/numba/_zernike.py::zernike_moments): per pixel it evaluates the full Zernike basis (Horner over r^2 with centrosome's LUT coeffs + z=ym+i*xm azimuthal recurrence + strict r^2>1 disk cutoff) AND scatter-adds the weighted-complex Re/Im into per-object (n,K) accumulators in one pass -- no (M,K) intermediate. Shared by both features (weights=1 for shape, weights=pixels for radial); host post-step differs only in normalisation (pi*r^2 vs pixel count) and output (magnitude only vs magnitude+phase). Centrosome stays host-side for the degree-only LUT and minimum_enclosing_circle (the conventions live there -- a basis-eval unit test asserts our Horner matches construct_zernike_polynomials bit-exactly). 2D-only (3D -> {}, matching baseline); batch-shaped via to_bzyx (single = batch-of-1, serial per-image loop, no in-function parallelism). Reuses label_to_idx_lut and a single nonzero pass instead of np.unique / masks_to_ijv. Tests: basis-eval bit-exact vs centrosome; kernel vs numpy reference; both backends vs the numpy baseline within tol (single, batch, 3D->{}); dispatch wiring. Full suite 195 passed. Benchmark (1080^2, 144 obj): zernike 632->22 ms (28.7x), radial_zernikes 476->23 ms (20.9x). Stacked on #56 (needs to_bzyx); retarget base to main once #56 merges. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The default config (subsample<1) resamples the reconstructed image back to the object sample points via scipy.ndimage.map_coordinates ONCE PER granularity step (16x), re-deriving the floor/frac each time though the sample points never change -- ~67% of default-config runtime. Replace those 16 calls with _granularity.bilinear_gather (order-1, mode=constant cval=0), with floor/frac precomputed once and reused across steps. The three one-off grid resamples (subsample/background) stay on map_coordinates. Matches map_coordinates(order=1) to ~3e-16 per call (FP add-order only; verified vs scipy incl. boundary points in test_bilinear_gather_matches_map_coordinates); the full-output difference vs the old map_coordinates path is ~3e-14, far inside this backend's existing rtol=1e-6 contract (granularity is a within-tolerance lane, not bit-exact). Backend golden test (vs numpy, rtol=1e-6) stays green. Default config 1080^2/144 obj: 741 -> 257 ms (2.88x faster numba backend). Full suite 179 passed, lint clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Follow-up perf (commit 89dde43): the default config resamples the reconstructed image back to the object sample points via Replaced those 16 calls with a numba Accuracy: the gather matches |
…truction Vincent/Robinson hybrid reconstruction: run three forward+backward raster geodesic passes before FIFO seeding (vs one pair) and pack FIFO coordinates as (i<<16)|j int32 decoded by shift/mask (vs i*W+j int64 with int-div/mod). The extra raster pairs cut the FIFO seed count ~50-95%, trading cheap cache-sequential scans for far fewer random-access FIFO updates; the packed coords drop the IDIV/MOD from the hot dequeue and halve the queue's cache footprint. Bit-identical to the previous kernel and to skimage.reconstruction (the result is independent of how many raster passes precede the FIFO) — the overflow-proof N+1 ring buffer + per-pixel in-queue dedup flag are kept. ~1.5x faster reconstruction; end-to-end ~3% (default subsample=0.25, reconstruction is ~20% post-gather) to ~8% (fullres subsample=1.0, reconstruction is ~81%). Non-seeding raster passes factored into _geodesic_raster_fwd/_bwd helpers. +long-geodesic regression test. Full suite 180 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Shared foundational helper used by the numba intensity/granularity/zernike backends to normalise any input (2D/3D/4D/list) to the canonical batch-of-volumes form: single image = batch of 1, returning a dict for a lone image/volume and a list of dicts for a batch. Pure numpy, no numba. Extracted to its own PR so it can be reviewed first and unblock the feature backends (#56/#57/#58). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Routes
granularity(2D) to a numba backend on the #49/#54 dispatch seam, alongsideintensity. Bit-exact morphology kernels (disk open via VHG, disk(1), Vincent FIFO-hybrid reconstruction); resampling stays on scipy. 3D falls back to numpy.New
primitives/shapes.py::to_bzyxadds the canonical(B,Z,Y,X)batch shape (single = batch-of-1; ragged lists ok), so granularity accepts a single image or a batch.Speedup (1080², 144 obj): fullres 6.57× (9000→1369 ms), default 12.47× (1178→94 ms).
Tests: kernels bit-exact vs skimage; wrapper matches baseline within tol (fullres/subsampled/edges/batch/3D). Full suite 177 passed.