feat(accelerator): numba granularity backend by timtreis · Pull Request #56 · afermg/cp_measure

timtreis · 2026-06-03T15:13:51Z

Routes granularity (2D) to a numba backend on the #49/#54 dispatch seam, alongside intensity. Bit-exact morphology kernels (disk open via VHG, disk(1), Vincent FIFO-hybrid reconstruction); resampling stays on scipy. 3D falls back to numpy.

New primitives/shapes.py::to_bzyx adds the canonical (B,Z,Y,X) batch shape (single = batch-of-1; ragged lists ok), so granularity accepts a single image or a batch.

Speedup (1080², 144 obj): fullres 6.57× (9000→1369 ms), default 12.47× (1178→94 ms).

Tests: kernels bit-exact vs skimage; wrapper matches baseline within tol (fullres/subsampled/edges/batch/3D). Full suite 177 passed.

First real accelerator end-to-end on top of the merged #49 dispatch: `set_accelerator("numba")` now routes `intensity` to a numba implementation and composes it with the numpy backend for every other feature. - _detect.py: capability flags (HAS_NUMBA/HAS_JAX/HAS_JAX_GPU) via find_spec, resolved once at import. No try/except — an absent backend is never attempted, a present-but-broken one raises. - primitives/: shared host segment layer. flatten_labeled reduces a labeled (Z,Y,X) image to flat (values, seg0, coords); a single kernel set then covers 2D, 3D and future batches with no image/batch axis baked in. max_position is a host scipy.ndimage.maximum_position call for bit-exact parity with the numpy backend's tie-break. - primitives/_segment_numba.py: @njit(cache=True), single-threaded kernels — fused single-pass moments + centroid cross-sums, residual-sumsq std, CSR per-segment quantiles/MAD. - core/numba/: import-selected backend (`from cp_measure.core.numba import get_intensity`); identical dict contract, 2D and 3D. - bulk._dispatch: "numba" composes numba intensity + numpy rest; raises if numba is not installed (no silent fallback). - numba is an optional extra ([numba]); the default install stays numba-free. CI tests install .[numba] and run the correctness harness. test/test_backend_correctness.py asserts numba == numpy (2D/3D, edge on/off, rtol=1e-6), the dispatch composition, and the absent-numba raise path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Move Location_MaxIntensity_* out of the host scipy per-object call and into the fused segment_moments kernel via a deterministic `>=`-last argmax (records the max pixel's coordinates in the same single pass). scipy.ndimage.maximum_position's labeled tie-break is `argsort` (quicksort) + last-write-wins, i.e. an arbitrary tied pixel that is not stable across numpy versions — so there is no stable rule to replicate. On real continuous data the max is unique, so the kernel's `>=`-last result is bit-identical to scipy (the correctness harness confirms 2D/3D, edge on/off); only exact-value ties can differ, and the kernel's rule is the more reproducible of the two. Drops the now-unused max_position_per_object host helper (and its scipy import) from the primitive layer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- flatten_labeled: derive (z,y,x) coords from numpy.nonzero(lmask) instead of materialising three full-volume mgrid arrays then masking them — same coords in the same C order, no per-call O(volume) temporaries. - label_to_idx_lut: drop the unused sorted-labels return value (now just (lut, n)); the max_position-in-kernel refactor removed its only consumer. - add a lighter segment_stats kernel (count/sum/min/max) and use it for the edge path, replacing the segment_moments call that needed throwaway zero coordinate arrays and discarded the centroid cross-sums. No behaviour change; correctness harness + full suite stay green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

flatten_labeled built the flat (values, seg0, coords) arrays with a numpy (masks>0)&isfinite mask + numpy.nonzero + two fancy-index gathers — several full-image passes plus a boolean-array allocation, and the dominant cost of the non-edge path. Replace it with flatten_numba: two grid scans (count, then fill) in a single @njit kernel, coordinates taken from the loop indices. The flat-segment kernels and the rest of the backend are unchanged — only how the flat arrays are built. Measured (single image, non-edge core): flatten step ~4-10x faster (10x at 1024^2), full core ~1.1x (256^2) / ~1.5x (1024^2); the gain grows with image size. Bit-identical output (correctness harness stays green). The numpy flatten_labeled (its only consumer) is removed; primitives/segment.py now holds just the numpy label->index lookup, the numba layer owns the flatten. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- _detect.py: drop the unused HAS_JAX / HAS_JAX_GPU flags. Besides being dead for this PR, HAS_JAX_GPU eagerly imported jax at module load whenever jax was installed, just to set a flag nothing reads. jax detection lands with the jax backend; HAS_NUMBA alone establishes the find_spec pattern. - flatten the image without a forced float64 copy: pass masked_image through ascontiguousarray without dtype=, and let flatten_numba upcast the kept values into its float64 output. Avoids a full-image float64 temporary for non-float64 inputs (e.g. float32 microscopy data); bit-identical for float64. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

label_to_idx_lut used numpy.unique(masks) — a full-image sort — to find the present labels. scipy.ndimage.find_objects (scipy is already a core dep) returns the same ascending present-label set in one O(P) pass, giving a bit-identical LUT ~3-5x faster (12.4->3.5 ms at 1024^2; 21.9->4.4 ms on a 32x240x240 volume). Trick borrowed from Alan's pure-numpy speedup (#55); unlike its percentile/MAD changes, this one preserves output exactly (verified identical). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Profiling the sparse-large regime (1024^2, 64 obj, edge on) showed skimage.find_boundaries was ~37% of the call (~20-29 ms) — the morphology dominates, not the scan. A one-pass numba inner-boundary kernel (4-neighbour check, the cp_measure_fast approach) is bit-identical to find_boundaries( mode="inner") and 12-27x faster, verified exact across (H,W) and (1,H,W). Used for 2D planes (Z==1); true 3D keeps skimage (6-neighbourhood). Single-image 1024^2/64 edge-on drops ~47->32 ms, and per-image batch ~445->264 ms. No correctness change (exact boundary match; harness stays green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Address PR #54 review: - bulk._dispatch: reword the absent-numba RuntimeError from the imperative "install it via" to "you can install it via" (avoid issuing pip commands imperatively at the user). - primitives is an internal layer with no public API to curate; import label_to_idx_lut directly from primitives.segment (matching how the _segment_numba kernels are already imported) and drop the __init__ re-export. Documents the convention in the package docstring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(accelerator): numba intensity backend

Both Zernike features on the #49 dispatch seam, sharing one fused numba kernel (core/numba/_zernike.py::zernike_moments): per pixel it evaluates the full Zernike basis (Horner over r^2 with centrosome's LUT coeffs + z=ym+i*xm azimuthal recurrence + strict r^2>1 disk cutoff) AND scatter-adds the weighted-complex Re/Im into per-object (n,K) accumulators in one pass -- no (M,K) intermediate. Shared by both features (weights=1 for shape, weights=pixels for radial); host post-step differs only in normalisation (pi*r^2 vs pixel count) and output (magnitude only vs magnitude+phase). Centrosome stays host-side for the degree-only LUT and minimum_enclosing_circle (the conventions live there -- a basis-eval unit test asserts our Horner matches construct_zernike_polynomials bit-exactly). 2D-only (3D -> {}, matching baseline); batch-shaped via to_bzyx (single = batch-of-1, serial per-image loop, no in-function parallelism). Reuses label_to_idx_lut and a single nonzero pass instead of np.unique / masks_to_ijv. Tests: basis-eval bit-exact vs centrosome; kernel vs numpy reference; both backends vs the numpy baseline within tol (single, batch, 3D->{}); dispatch wiring. Full suite 195 passed. Benchmark (1080^2, 144 obj): zernike 632->22 ms (28.7x), radial_zernikes 476->23 ms (20.9x). Stacked on #56 (needs to_bzyx); retarget base to main once #56 merges. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Shared foundational helper used by the numba intensity/granularity/zernike backends to normalise any input (2D/3D/4D/list) to the canonical batch-of-volumes form: single image = batch of 1, returning a dict for a lone image/volume and a list of dicts for a batch. Pure numpy, no numba. Extracted to its own PR so it can be reviewed first and unblock the feature backends (#56/#57/#58). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Second accelerator on the #49/#54 dispatch seam: set_accelerator("numba") now routes `granularity` (2D) to a numba backend alongside `intensity`. Kernels (core/numba/_granularity.py, single-threaded @njit, bit-exact vs skimage): - disk erosion/dilation via row-decomposed van-Herk/Gil-Werman sliding min/max (O(r·HW)) for the background opening - 5-tap disk(1) erosion/dilation for the spectrum loop - reconstruction by dilation via the Vincent (1993) FIFO-hybrid (1 fwd + 1 bwd raster, then a ring-buffer FIFO with a per-pixel in-queue flag — O(N), no overflow), replacing the per-iteration scipy/skimage morphology Wrapper (core/numba/measuregranularity.py): mirrors the numpy baseline but keeps resampling on scipy map_coordinates, swaps the morphology for the kernels, reads per-object means back via a sparse point-query of the reconstructed image (no full-res upsample) + dense bincount, and applies the cascaded reconstruction mask (rec_g <= rec_{g-1} <= pixels, exact by monotonicity). 3D / non-2D falls back to the numpy baseline. Batch shape: new primitives/shapes.py `to_bzyx` normalises any input to the canonical (B,Z,Y,X) form (2D/3D/4D array or ragged list), single image = batch of 1, returning a dict for a lone image/volume and a list of dicts for a batch. A 3D array stays one volume (never a batch), preserving existing semantics. Correctness: morphology bit-exact vs skimage (radii 1..10 + cascade chain); wrapper matches the numpy baseline within tol for fullres + all subsample modes, single/1-px/empty/label-gap objects, batch (4D + ragged), and 3D fallback. Full suite 177 passed. Benchmark (1080², 144 objects): fullres 9000->1369 ms (6.57x), default 1178->94 ms (12.47x). Follow-up: migrate the intensity backend onto to_bzyx (golden-test guarded). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Quality pass (no behaviour change; bit-exact, 107 tests pass): - _disk_reduce: np.full for the out init (was a hand-rolled nested loop); replace the `x % k` modulo in the per-pixel VHG prefix/suffix loops with a running counter (one hoisted modulo per row-band) - get_granularity: drop the `ero.max() == 0` early-out — it scanned the whole image every one of the 16 iterations, while reconstruction already returns zeros cheaply on an all-zero seed - shapes.to_bzyx: document that it is a pure batch normaliser (no mask-broadcast; per-element ndim dispatch stays in each backend) Benchmark (1080², 144 obj): fullres 6.57x->6.86x, default 12.47x->12.58x. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CI lint (nix fmt -> ruff-format) reformats lines over the 88-col default; apply ruff format to match. Also fix a stale module docstring in _granularity.py (reconstruction is the Vincent FIFO-hybrid, not raster-until-convergence). No behaviour change; 99 granularity tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Both Zernike features on the #49 dispatch seam, sharing one fused numba kernel (core/numba/_zernike.py::zernike_moments): per pixel it evaluates the full Zernike basis (Horner over r^2 with centrosome's LUT coeffs + z=ym+i*xm azimuthal recurrence + strict r^2>1 disk cutoff) AND scatter-adds the weighted-complex Re/Im into per-object (n,K) accumulators in one pass -- no (M,K) intermediate. Shared by both features (weights=1 for shape, weights=pixels for radial); host post-step differs only in normalisation (pi*r^2 vs pixel count) and output (magnitude only vs magnitude+phase). Centrosome stays host-side for the degree-only LUT and minimum_enclosing_circle (the conventions live there -- a basis-eval unit test asserts our Horner matches construct_zernike_polynomials bit-exactly). 2D-only (3D -> {}, matching baseline); batch-shaped via to_bzyx (single = batch-of-1, serial per-image loop, no in-function parallelism). Reuses label_to_idx_lut and a single nonzero pass instead of np.unique / masks_to_ijv. Tests: basis-eval bit-exact vs centrosome; kernel vs numpy reference; both backends vs the numpy baseline within tol (single, batch, 3D->{}); dispatch wiring. Full suite 195 passed. Benchmark (1080^2, 144 obj): zernike 632->22 ms (28.7x), radial_zernikes 476->23 ms (20.9x). Stacked on #56 (needs to_bzyx); retarget base to main once #56 merges. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The default config (subsample<1) resamples the reconstructed image back to the object sample points via scipy.ndimage.map_coordinates ONCE PER granularity step (16x), re-deriving the floor/frac each time though the sample points never change -- ~67% of default-config runtime. Replace those 16 calls with _granularity.bilinear_gather (order-1, mode=constant cval=0), with floor/frac precomputed once and reused across steps. The three one-off grid resamples (subsample/background) stay on map_coordinates. Matches map_coordinates(order=1) to ~3e-16 per call (FP add-order only; verified vs scipy incl. boundary points in test_bilinear_gather_matches_map_coordinates); the full-output difference vs the old map_coordinates path is ~3e-14, far inside this backend's existing rtol=1e-6 contract (granularity is a within-tolerance lane, not bit-exact). Backend golden test (vs numpy, rtol=1e-6) stays green. Default config 1080^2/144 obj: 741 -> 257 ms (2.88x faster numba backend). Full suite 179 passed, lint clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

timtreis · 2026-06-04T05:04:53Z

Follow-up perf (commit 89dde43): the default config resamples the reconstructed image back to the object sample points via scipy.ndimage.map_coordinates once per granularity step (16×), re-deriving the floor/frac each time though the sample points are fixed — ~67% of default-config runtime.

Replaced those 16 calls with a numba bilinear_gather (order-1, mode=constant cval=0), floor/frac precomputed once and reused. The three one-off grid resamples (subsample/background) stay on map_coordinates. Default config 1080²/144 obj: 741 → 257 ms (2.88×).

Accuracy: the gather matches map_coordinates(order=1) to ~3e-16 per call (verified vs scipy incl. boundary points), full-output delta ~3e-14 — far inside this backend's existing rtol=1e-6 contract (granularity is a within-tolerance lane, not bit-exact). Backend golden test (vs numpy) stays green; full suite 179 passed. Flagging the (machine-eps) resampling change explicitly in case you'd prefer to keep map_coordinates — trivial to revert.

…truction Vincent/Robinson hybrid reconstruction: run three forward+backward raster geodesic passes before FIFO seeding (vs one pair) and pack FIFO coordinates as (i<<16)|j int32 decoded by shift/mask (vs i*W+j int64 with int-div/mod). The extra raster pairs cut the FIFO seed count ~50-95%, trading cheap cache-sequential scans for far fewer random-access FIFO updates; the packed coords drop the IDIV/MOD from the hot dequeue and halve the queue's cache footprint. Bit-identical to the previous kernel and to skimage.reconstruction (the result is independent of how many raster passes precede the FIFO) — the overflow-proof N+1 ring buffer + per-pixel in-queue dedup flag are kept. ~1.5x faster reconstruction; end-to-end ~3% (default subsample=0.25, reconstruction is ~20% post-gather) to ~8% (fullres subsample=1.0, reconstruction is ~81%). Non-seeding raster passes factored into _geodesic_raster_fwd/_bwd helpers. +long-geodesic regression test. Full suite 180 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Shared foundational helper used by the numba intensity/granularity/zernike backends to normalise any input (2D/3D/4D/list) to the canonical batch-of-volumes form: single image = batch of 1, returning a dict for a lone image/volume and a list of dicts for a batch. Pure numpy, no numba. Extracted to its own PR so it can be reviewed first and unblock the feature backends (#56/#57/#58). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

timtreis and others added 10 commits June 2, 2026 04:47

lock: add numba to lock

eefb4d4

Merge pull request #54 from afermg/feat/accelerator-numba-intensity

4ca0a35

feat(accelerator): numba intensity backend

timtreis marked this pull request as ready for review June 3, 2026 15:36

timtreis mentioned this pull request Jun 3, 2026

refactor(numba): migrate intensity backend onto (B,Z,Y,X) #57

Draft

timtreis requested a review from afermg June 3, 2026 15:52

This was referenced Jun 3, 2026

feat(accelerator): numba zernike + radial_zernikes backend #58

Draft

feat(primitives): (B,Z,Y,X) batch-shape helper (to_bzyx) #59

Open

timtreis force-pushed the feat/accelerator-numba-granularity branch from 8d84546 to 13f1ee7 Compare June 3, 2026 18:55

timtreis changed the base branch from main to feat/bzyx-shape June 3, 2026 18:55

timtreis force-pushed the feat/bzyx-shape branch from 4851c3e to 4ed7d79 Compare June 3, 2026 19:08

timtreis and others added 3 commits June 3, 2026 21:08

timtreis force-pushed the feat/accelerator-numba-granularity branch from 13f1ee7 to 78788a4 Compare June 3, 2026 19:09

This was referenced Jun 3, 2026

feat(accelerator): numba colocalization (pearson/manders/rwc/overlap) #60

Draft

feat(accelerator): numba texture (Haralick) backend #64

Draft

feat(accelerator): numba feret (Feret diameters) backend #65

Draft

timtreis added the numba label Jun 9, 2026

timtreis force-pushed the feat/bzyx-shape branch from 4ed7d79 to e03aff4 Compare June 9, 2026 18:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(accelerator): numba granularity backend#56

feat(accelerator): numba granularity backend#56
timtreis wants to merge 16 commits into
feat/bzyx-shapefrom
feat/accelerator-numba-granularity

timtreis commented Jun 3, 2026 •

edited

Loading

Uh oh!

timtreis commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

timtreis commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

timtreis commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

timtreis commented Jun 3, 2026 •

edited

Loading