feat(accelerator): numba zernike + radial_zernikes backend#58
Draft
timtreis wants to merge 13 commits into
Draft
feat(accelerator): numba zernike + radial_zernikes backend#58timtreis wants to merge 13 commits into
timtreis wants to merge 13 commits into
Conversation
First real accelerator end-to-end on top of the merged #49 dispatch: `set_accelerator("numba")` now routes `intensity` to a numba implementation and composes it with the numpy backend for every other feature. - _detect.py: capability flags (HAS_NUMBA/HAS_JAX/HAS_JAX_GPU) via find_spec, resolved once at import. No try/except — an absent backend is never attempted, a present-but-broken one raises. - primitives/: shared host segment layer. flatten_labeled reduces a labeled (Z,Y,X) image to flat (values, seg0, coords); a single kernel set then covers 2D, 3D and future batches with no image/batch axis baked in. max_position is a host scipy.ndimage.maximum_position call for bit-exact parity with the numpy backend's tie-break. - primitives/_segment_numba.py: @njit(cache=True), single-threaded kernels — fused single-pass moments + centroid cross-sums, residual-sumsq std, CSR per-segment quantiles/MAD. - core/numba/: import-selected backend (`from cp_measure.core.numba import get_intensity`); identical dict contract, 2D and 3D. - bulk._dispatch: "numba" composes numba intensity + numpy rest; raises if numba is not installed (no silent fallback). - numba is an optional extra ([numba]); the default install stays numba-free. CI tests install .[numba] and run the correctness harness. test/test_backend_correctness.py asserts numba == numpy (2D/3D, edge on/off, rtol=1e-6), the dispatch composition, and the absent-numba raise path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move Location_MaxIntensity_* out of the host scipy per-object call and into the fused segment_moments kernel via a deterministic `>=`-last argmax (records the max pixel's coordinates in the same single pass). scipy.ndimage.maximum_position's labeled tie-break is `argsort` (quicksort) + last-write-wins, i.e. an arbitrary tied pixel that is not stable across numpy versions — so there is no stable rule to replicate. On real continuous data the max is unique, so the kernel's `>=`-last result is bit-identical to scipy (the correctness harness confirms 2D/3D, edge on/off); only exact-value ties can differ, and the kernel's rule is the more reproducible of the two. Drops the now-unused max_position_per_object host helper (and its scipy import) from the primitive layer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- flatten_labeled: derive (z,y,x) coords from numpy.nonzero(lmask) instead of materialising three full-volume mgrid arrays then masking them — same coords in the same C order, no per-call O(volume) temporaries. - label_to_idx_lut: drop the unused sorted-labels return value (now just (lut, n)); the max_position-in-kernel refactor removed its only consumer. - add a lighter segment_stats kernel (count/sum/min/max) and use it for the edge path, replacing the segment_moments call that needed throwaway zero coordinate arrays and discarded the centroid cross-sums. No behaviour change; correctness harness + full suite stay green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
flatten_labeled built the flat (values, seg0, coords) arrays with a numpy (masks>0)&isfinite mask + numpy.nonzero + two fancy-index gathers — several full-image passes plus a boolean-array allocation, and the dominant cost of the non-edge path. Replace it with flatten_numba: two grid scans (count, then fill) in a single @njit kernel, coordinates taken from the loop indices. The flat-segment kernels and the rest of the backend are unchanged — only how the flat arrays are built. Measured (single image, non-edge core): flatten step ~4-10x faster (10x at 1024^2), full core ~1.1x (256^2) / ~1.5x (1024^2); the gain grows with image size. Bit-identical output (correctness harness stays green). The numpy flatten_labeled (its only consumer) is removed; primitives/segment.py now holds just the numpy label->index lookup, the numba layer owns the flatten. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- _detect.py: drop the unused HAS_JAX / HAS_JAX_GPU flags. Besides being dead for this PR, HAS_JAX_GPU eagerly imported jax at module load whenever jax was installed, just to set a flag nothing reads. jax detection lands with the jax backend; HAS_NUMBA alone establishes the find_spec pattern. - flatten the image without a forced float64 copy: pass masked_image through ascontiguousarray without dtype=, and let flatten_numba upcast the kept values into its float64 output. Avoids a full-image float64 temporary for non-float64 inputs (e.g. float32 microscopy data); bit-identical for float64. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
label_to_idx_lut used numpy.unique(masks) — a full-image sort — to find the present labels. scipy.ndimage.find_objects (scipy is already a core dep) returns the same ascending present-label set in one O(P) pass, giving a bit-identical LUT ~3-5x faster (12.4->3.5 ms at 1024^2; 21.9->4.4 ms on a 32x240x240 volume). Trick borrowed from Alan's pure-numpy speedup (#55); unlike its percentile/MAD changes, this one preserves output exactly (verified identical). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Profiling the sparse-large regime (1024^2, 64 obj, edge on) showed skimage.find_boundaries was ~37% of the call (~20-29 ms) — the morphology dominates, not the scan. A one-pass numba inner-boundary kernel (4-neighbour check, the cp_measure_fast approach) is bit-identical to find_boundaries( mode="inner") and 12-27x faster, verified exact across (H,W) and (1,H,W). Used for 2D planes (Z==1); true 3D keeps skimage (6-neighbourhood). Single-image 1024^2/64 edge-on drops ~47->32 ms, and per-image batch ~445->264 ms. No correctness change (exact boundary match; harness stays green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address PR #54 review: - bulk._dispatch: reword the absent-numba RuntimeError from the imperative "install it via" to "you can install it via" (avoid issuing pip commands imperatively at the user). - primitives is an internal layer with no public API to curate; import label_to_idx_lut directly from primitives.segment (matching how the _segment_numba kernels are already imported) and drop the __init__ re-export. Documents the convention in the package docstring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(accelerator): numba intensity backend
8d84546 to
13f1ee7
Compare
b073742 to
2f3a204
Compare
Shared foundational helper used by the numba intensity/granularity/zernike backends to normalise any input (2D/3D/4D/list) to the canonical batch-of-volumes form: single image = batch of 1, returning a dict for a lone image/volume and a list of dicts for a batch. Pure numpy, no numba. Extracted to its own PR so it can be reviewed first and unblock the feature backends (#56/#57/#58). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both Zernike features on the #49 dispatch seam, sharing one fused numba kernel (core/numba/_zernike.py::zernike_moments): per pixel it evaluates the full Zernike basis (Horner over r^2 with centrosome's LUT coeffs + z=ym+i*xm azimuthal recurrence + strict r^2>1 disk cutoff) AND scatter-adds the weighted-complex Re/Im into per-object (n,K) accumulators in one pass -- no (M,K) intermediate. Shared by both features (weights=1 for shape, weights=pixels for radial); host post-step differs only in normalisation (pi*r^2 vs pixel count) and output (magnitude only vs magnitude+phase). Centrosome stays host-side for the degree-only LUT and minimum_enclosing_circle (the conventions live there -- a basis-eval unit test asserts our Horner matches construct_zernike_polynomials bit-exactly). 2D-only (3D -> {}, matching baseline); batch-shaped via to_bzyx (single = batch-of-1, serial per-image loop, no in-function parallelism). Reuses label_to_idx_lut and a single nonzero pass instead of np.unique / masks_to_ijv. Tests: basis-eval bit-exact vs centrosome; kernel vs numpy reference; both backends vs the numpy baseline within tol (single, batch, 3D->{}); dispatch wiring. Full suite 195 passed. Benchmark (1080^2, 144 obj): zernike 632->22 ms (28.7x), radial_zernikes 476->23 ms (20.9x). Stacked on #56 (needs to_bzyx); retarget base to main once #56 merges. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/simplify pass (no behaviour change; 26 zernike tests pass, ruff clean): - Extract the convention-bearing per-object setup shared by both wrappers (label_to_idx_lut -> nonzero -> minimum_enclosing_circle -> ym=row/xm=col normalisation -> fused kernel) into _zernike.zernike_moments_per_object, returning (vr, vi, radii, seg0). Each wrapper now only supplies the weight source (None vs pixels) and its own denominator (pi*r^2 vs pixel count), output keys, and magnitude(+phase). Removes ~15 lines of lockstep-risk duplication (the coord convention + kernel arg-packing lived in two places). - Move _zernike_basis_numpy (test-only numpy reference) out of the production module into test/test_zernike_kernels.py. - Drop now-redundant seg0.astype(int64) (label_lut is already int64) and the per-wrapper centrosome.cpmorphology / label_to_idx_lut imports. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2f3a204 to
b1c45fe
Compare
This was referenced Jun 3, 2026
timtreis
added a commit
that referenced
this pull request
Jun 9, 2026
Shared foundational helper used by the numba intensity/granularity/zernike backends to normalise any input (2D/3D/4D/list) to the canonical batch-of-volumes form: single image = batch of 1, returning a dict for a lone image/volume and a list of dicts for a batch. Pure numpy, no numba. Extracted to its own PR so it can be reviewed first and unblock the feature backends (#56/#57/#58). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked on #56 (needs
to_bzyx; retarget base tomainonce #56 merges).Both Zernike features on the #49 dispatch seam, sharing one fused numba kernel (
zernike_moments): per pixel it evaluates the Zernike basis (Horner over r² with centrosome's LUT +z=ym+i·xmrecurrence + strictr²>1cutoff) and scatter-adds the weighted-complex Re/Im into per-object accumulators in one pass — no(M,K)intermediate. Shape usesweights=1, radialweights=pixels; they differ only in normalisation (π·r² vs pixel count) and output (magnitude vs magnitude+phase).Centrosome stays host-side for the degree-only LUT and
minimum_enclosing_circle(conventions live there — a unit test asserts our basis Horner is bit-exact vsconstruct_zernike_polynomials). 2D-only (3D→{}); batch-shaped viato_bzyx(single = batch-of-1, serial per-image loop, no in-function parallelism). Useslabel_to_idx_lut+ onenonzeropass instead ofnp.unique/masks_to_ijv.Speedup (1080², 144 obj): zernike 28.7× (632→22 ms), radial_zernikes 20.9× (476→23 ms).
Tests: basis bit-exact vs centrosome; kernel vs reference; both backends vs baseline within tol (single/batch/3D→{}); dispatch. Full suite 195 passed; ruff clean.