feat(accelerator): numba feret (Feret diameters) backend by timtreis · Pull Request #65 · afermg/cp_measure

timtreis · 2026-06-04T03:42:19Z

Summary

Adds a numba feret backend on the set_accelerator("numba") dispatch seam (sibling of #56/#57/#58/#60/#64 on #59), accelerating get_feret 42–43× with bit-exact output (1080²/144 obj: 618 → 14 ms).

Where the time went

get_feret's cost is not the geometry — it's the plumbing. Step-0 profile (1080²/144 obj):

| component | ms | % | boundary |
|---|---|---|---|
| `utils.masks_to_ijv` | 521.7 | 86.1% | reducible |
| `convex_hull_ijv` | 78.9 | 13.0% | import (geometry) |
| `feret_diameter` | 3.9 | 0.6% | import (geometry) |

masks_to_ijv is a per-label numpy.where scan that re-reads the whole image once per object, then feeds every object pixel to the convex hull.
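A minimal sketch of the per-label scan pattern described above. It assumes nothing about the real `utils.masks_to_ijv` beyond what this PR states (one `numpy.where` per label, rows ordered label-ascending and row-major within a label); the name `per_label_ijv` is hypothetical.

```python
import numpy as np


def per_label_ijv(masks):
    """Illustrative per-label scan; NOT the actual utils.masks_to_ijv."""
    blocks = []
    for label in np.unique(masks):
        if label == 0:  # 0 is treated as background
            continue
        # Each numpy.where re-reads the whole image: O(H*W) work per object.
        i, j = np.where(masks == label)
        v = np.full(i.shape, label, dtype=np.int64)
        blocks.append(np.column_stack((i, j, v)))
    if not blocks:
        return np.empty((0, 3), dtype=np.int64)
    return np.vstack(blocks)
```

With N objects the image is read N times, which is why the cost grows with object count rather than with image size alone.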

The fix — two bit-exact reductions in one numba pass

core/numba/_feret.py::_boundary_ijv:

1. Replace the per-label scan with a single row-major counting-scatter into per-label offsets. This produces the same (i, j, label) rows in the same order as masks_to_ijv (label-ascending, row-major within a label). Bit-identical.
2. Feed the hull only boundary pixels. hull(object) == hull(boundary(object)): an interior pixel (all 8 neighbours share its label) can never be a hull vertex. Emitting only boundary pixels leaves convex_hull_ijv and feret_diameter bit-identical while shrinking the hull input ~17× (≈6% of pixels), collapsing 78.9 ms → 6.2 ms on top of killing the 86% scan.
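The two reductions can be sketched together in plain Python and NumPy. This is an illustration under stated assumptions, not the contents of `core/numba/_feret.py`: the names `is_boundary` and `boundary_ijv_sketch` are hypothetical, the loops are written so they could plausibly be compiled with `@njit`, and the count-then-fill structure is one way to realise a counting-scatter.

```python
import numpy as np


def is_boundary(masks, i, j):
    """True if any of the 8 neighbours is off-image or has a different label."""
    h, w = masks.shape
    lab = masks[i, j]
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            ni, nj = i + di, j + dj
            if ni < 0 or ni >= h or nj < 0 or nj >= w:
                return True  # pixel sits on the image edge
            if masks[ni, nj] != lab:
                return True
    return False


def boundary_ijv_sketch(masks):
    h, w = masks.shape
    n = int(masks.max()) if masks.size else 0
    keep = np.zeros((h, w), dtype=bool)
    counts = np.zeros(n + 1, dtype=np.int64)
    # Scan 1: mark boundary pixels and count them per label.
    for i in range(h):
        for j in range(w):
            lab = masks[i, j]
            if lab > 0 and is_boundary(masks, i, j):
                keep[i, j] = True
                counts[lab] += 1
    # Exclusive prefix sum: cursor[lab] is the first output row for that label.
    cursor = np.zeros(n + 1, dtype=np.int64)
    cursor[1:] = np.cumsum(counts)[:-1]
    out = np.empty((int(counts.sum()), 3), dtype=np.int64)
    # Scan 2: row-major scatter, so rows stay row-major within each label.
    for i in range(h):
        for j in range(w):
            if keep[i, j]:
                lab = masks[i, j]
                out[cursor[lab]] = (i, j, lab)
                cursor[lab] += 1
    return out
```

Because the scatter walks the image in row-major order and each label writes into its own contiguous offset range, the output is label-ascending and row-major within a label without any sort.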

8-connectivity + image-edge detection is load-bearing: a diagonal staircase corner (8-conn only) and an edge-clipped pixel are real hull vertices that 4-conn/edge-ignoring detection would drop.
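The invariant hull(object) == hull(boundary(object)) can be checked on a small example with the 8-connected, edge-aware boundary rule. The sketch below uses a textbook monotone-chain hull as a stand-in; it is not centrosome's `convex_hull_ijv`, and the helper names are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices with collinear points dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(pts[::-1])


def boundary_pixels(pixels, height, width):
    """Keep pixels with an off-image or non-object neighbour (8-connectivity)."""
    obj = set(pixels)

    def on_boundary(i, j):
        return any(
            not (0 <= i + di < height and 0 <= j + dj < width)
            or (i + di, j + dj) not in obj
            for di in (-1, 0, 1)
            for dj in (-1, 0, 1)
        )

    return [(i, j) for (i, j) in pixels if on_boundary(i, j)]


# A disc clipped by the top image edge (centre at row 2, radius 6).
disc = [(i, j) for i in range(20) for j in range(20)
        if (i - 2) ** 2 + (j - 8) ** 2 <= 36]
edge = boundary_pixels(disc, 20, 20)
```

Here `convex_hull(disc)` and `convex_hull(edge)` return the same vertex list while `edge` holds only a fraction of the pixels in `disc`.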

convex_hull_ijv / feret_diameter stay in centrosome (computational geometry — imported, per the reimplement/import boundary rule). Serial kernel, no prange/nogil. Batch via to_bzyx; 3D volumes → {} like the baseline.

Correctness

- test_feret_kernels.py: boundary-ijv order matches masks_to_ijv restricted to the boundary; identical hull + feret vs the full-pixel path; random masks; empty mask.
- test_feret_backend.py: golden numba-vs-numpy (2D, batch, 3D→{}); empty masks raise the same ValueError as the baseline (drop-in fidelity, since convex_hull_ijv does np.max on empty input).
- +feret in test_backend_correctness.py dispatch composition.
- Full suite green, lint clean (ruff 0.12.1).

Note for reviewers

Kernel + thin wrapper live together in core/numba/_feret.py (not the usual core/numba/measure<module>.py split) to avoid a new-file collision with sibling #58, which owns core/numba/measureobjectsizeshape.py.

sizeshape (sibling lane) — documented NO-GO

The other remaining core lane, sizeshape, was profiled and deliberately not ported: it's Amdahl-capped at ~1.13×. Its cost is dominated by regionprops_table (73%), whose single hot primitive is the convex hull (import-boundary geometry); the only mechanically-reducible group (moments) is +4.7 ms. No reimplementable hot primitive, unlike radial (geodesic) or texture (histogram). Verdict recorded; not forced into a low-value kernel.
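For readers unfamiliar with the Amdahl argument, here is a small sketch of the arithmetic. It assumes the ~1.13× figure is the limit for an infinitely fast kernel; the reducible fraction it implies is derived here and is not stated in the PR.

```python
def amdahl_speedup(reducible_fraction, kernel_speedup):
    """Overall speedup when only `reducible_fraction` of the runtime gets faster."""
    return 1.0 / ((1.0 - reducible_fraction) + reducible_fraction / kernel_speedup)


def reducible_fraction_for_cap(cap):
    """Fraction of runtime that must be reducible for the speedup limit to equal `cap`."""
    return 1.0 - 1.0 / cap


# Under the stated assumption, a 1.13x cap corresponds to about 11.5% reducible runtime.
p = reducible_fraction_for_cap(1.13)
```

Even a 100× kernel on that fraction gives `amdahl_speedup(p, 100.0)` just under 1.13, which is the sense in which the lane is capped.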

🤖 Generated with Claude Code

First real accelerator end-to-end on top of the merged #49 dispatch: `set_accelerator("numba")` now routes `intensity` to a numba implementation and composes it with the numpy backend for every other feature.

- _detect.py: capability flags (HAS_NUMBA/HAS_JAX/HAS_JAX_GPU) via find_spec, resolved once at import. No try/except: an absent backend is never attempted, a present-but-broken one raises.
- primitives/: shared host segment layer. flatten_labeled reduces a labeled (Z,Y,X) image to flat (values, seg0, coords); a single kernel set then covers 2D, 3D and future batches with no image/batch axis baked in. max_position is a host scipy.ndimage.maximum_position call for bit-exact parity with the numpy backend's tie-break.
- primitives/_segment_numba.py: @njit(cache=True), single-threaded kernels: fused single-pass moments + centroid cross-sums, residual-sumsq std, CSR per-segment quantiles/MAD.
- core/numba/: import-selected backend (`from cp_measure.core.numba import get_intensity`); identical dict contract, 2D and 3D.
- bulk._dispatch: "numba" composes numba intensity + numpy rest; raises if numba is not installed (no silent fallback).
- numba is an optional extra ([numba]); the default install stays numba-free. CI tests install .[numba] and run the correctness harness.

test/test_backend_correctness.py asserts numba == numpy (2D/3D, edge on/off, rtol=1e-6), the dispatch composition, and the absent-numba raise path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
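A minimal sketch of the find_spec detection pattern and the no-silent-fallback rule described in this commit. `importlib.util.find_spec` is standard library; the function `select_backend` and its error text are hypothetical and do not reproduce `bulk._dispatch`.

```python
import importlib.util

# Resolved once at import; find_spec locates the package without importing it.
HAS_NUMBA = importlib.util.find_spec("numba") is not None


def select_backend(name, has_numba=HAS_NUMBA):
    """Hypothetical selector: raise instead of silently falling back to numpy."""
    if name == "numpy":
        return "numpy"
    if name == "numba":
        if not has_numba:
            raise RuntimeError(
                "numba accelerator requested but numba is not installed"
            )
        return "numba"
    raise ValueError(f"unknown accelerator: {name!r}")
```

Because the flag comes from `find_spec` rather than a guarded import, a numba installation that is present but broken would still fail loudly at the point where it is actually imported.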

Move Location_MaxIntensity_* out of the host scipy per-object call and into the fused segment_moments kernel via a deterministic `>=`-last argmax (records the max pixel's coordinates in the same single pass). scipy.ndimage.maximum_position's labeled tie-break is `argsort` (quicksort) + last-write-wins, i.e. an arbitrary tied pixel that is not stable across numpy versions — so there is no stable rule to replicate. On real continuous data the max is unique, so the kernel's `>=`-last result is bit-identical to scipy (the correctness harness confirms 2D/3D, edge on/off); only exact-value ties can differ, and the kernel's rule is the more reproducible of the two. Drops the now-unused max_position_per_object host helper (and its scipy import) from the primitive layer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
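The `>=`-last rule can be illustrated in a few lines. This is a sketch of the tie-break only, with a hypothetical name `argmax_last`; the real kernel records coordinates inside the fused segment_moments pass.

```python
def argmax_last(values):
    """Index of the maximum; on exact ties the LAST occurrence wins."""
    best_idx = -1
    best = None
    for idx, v in enumerate(values):
        # `>=` (not `>`) lets a later equal value replace the current best.
        if best_idx < 0 or v >= best:
            best = v
            best_idx = idx
    return best_idx
```

For `[1, 3, 2, 3]` this returns index 3 rather than 1. The rule depends only on scan order, so it gives the same answer on every run and platform.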

- flatten_labeled: derive (z,y,x) coords from numpy.nonzero(lmask) instead of materialising three full-volume mgrid arrays then masking them; same coords in the same C order, no per-call O(volume) temporaries.
- label_to_idx_lut: drop the unused sorted-labels return value (now just (lut, n)); the max_position-in-kernel refactor removed its only consumer.
- add a lighter segment_stats kernel (count/sum/min/max) and use it for the edge path, replacing the segment_moments call that needed throwaway zero coordinate arrays and discarded the centroid cross-sums.

No behaviour change; correctness harness + full suite stay green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

flatten_labeled built the flat (values, seg0, coords) arrays with a numpy (masks>0)&isfinite mask + numpy.nonzero + two fancy-index gathers — several full-image passes plus a boolean-array allocation, and the dominant cost of the non-edge path. Replace it with flatten_numba: two grid scans (count, then fill) in a single @njit kernel, coordinates taken from the loop indices. The flat-segment kernels and the rest of the backend are unchanged — only how the flat arrays are built. Measured (single image, non-edge core): flatten step ~4-10x faster (10x at 1024^2), full core ~1.1x (256^2) / ~1.5x (1024^2); the gain grows with image size. Bit-identical output (correctness harness stays green). The numpy flatten_labeled (its only consumer) is removed; primitives/segment.py now holds just the numpy label->index lookup, the numba layer owns the flatten. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- _detect.py: drop the unused HAS_JAX / HAS_JAX_GPU flags. Besides being dead for this PR, HAS_JAX_GPU eagerly imported jax at module load whenever jax was installed, just to set a flag nothing reads. jax detection lands with the jax backend; HAS_NUMBA alone establishes the find_spec pattern.
- flatten the image without a forced float64 copy: pass masked_image through ascontiguousarray without dtype=, and let flatten_numba upcast the kept values into its float64 output. Avoids a full-image float64 temporary for non-float64 inputs (e.g. float32 microscopy data); bit-identical for float64.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

label_to_idx_lut used numpy.unique(masks) — a full-image sort — to find the present labels. scipy.ndimage.find_objects (scipy is already a core dep) returns the same ascending present-label set in one O(P) pass, giving a bit-identical LUT ~3-5x faster (12.4->3.5 ms at 1024^2; 21.9->4.4 ms on a 32x240x240 volume). Trick borrowed from Alan's pure-numpy speedup (#55); unlike its percentile/MAD changes, this one preserves output exactly (verified identical). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Profiling the sparse-large regime (1024^2, 64 obj, edge on) showed skimage.find_boundaries was ~37% of the call (~20-29 ms) — the morphology dominates, not the scan. A one-pass numba inner-boundary kernel (4-neighbour check, the cp_measure_fast approach) is bit-identical to find_boundaries( mode="inner") and 12-27x faster, verified exact across (H,W) and (1,H,W). Used for 2D planes (Z==1); true 3D keeps skimage (6-neighbourhood). Single-image 1024^2/64 edge-on drops ~47->32 ms, and per-image batch ~445->264 ms. No correctness change (exact boundary match; harness stays green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
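A plain-Python sketch of a one-pass inner-boundary check with 4 neighbours. The name `inner_boundary_4n` is hypothetical, and the edge handling (off-image neighbours are ignored) is an assumption made for this illustration; the commit's claim of exact agreement with skimage applies to its own kernel, not to this sketch.

```python
import numpy as np


def inner_boundary_4n(labels):
    """Mark labelled pixels that touch a different value among their 4 neighbours."""
    h, w = labels.shape
    out = np.zeros((h, w), dtype=bool)
    for i in range(h):
        for j in range(w):
            lab = labels[i, j]
            if lab == 0:  # background is never an inner boundary
                continue
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                # Assumption: neighbours outside the image are skipped.
                if 0 <= ni < h and 0 <= nj < w and labels[ni, nj] != lab:
                    out[i, j] = True
                    break
    return out
```

Each pixel is visited once and inspects at most four neighbours, so the cost is a single pass over the image rather than a pair of morphological operations.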

Address PR #54 review:

- bulk._dispatch: reword the absent-numba RuntimeError from the imperative "install it via" to "you can install it via" (avoid issuing pip commands imperatively at the user).
- primitives is an internal layer with no public API to curate; import label_to_idx_lut directly from primitives.segment (matching how the _segment_numba kernels are already imported) and drop the __init__ re-export. Documents the convention in the package docstring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(accelerator): numba intensity backend

Shared foundational helper used by the numba intensity/granularity/zernike backends to normalise any input (2D/3D/4D/list) to the canonical batch-of-volumes form: single image = batch of 1, returning a dict for a lone image/volume and a list of dicts for a batch. Pure numpy, no numba. Extracted to its own PR so it can be reviewed first and unblock the feature backends (#56/#57/#58). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
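A rough sketch of what such a batch normaliser might look like. Everything here is an assumption for illustration: the names `to_zyx` and `to_bzyx_sketch`, the return shape (a list of volumes plus an is-batch flag), and the choice to read a 3D array as one volume rather than a batch of 2D images. Only the error text mirrors a message quoted elsewhere in this PR.

```python
import numpy as np


def to_zyx(image):
    """Promote a 2D image to a single-plane volume; pass 3D volumes through."""
    arr = np.asarray(image)
    if arr.ndim == 2:
        return arr[np.newaxis]
    if arr.ndim == 3:
        return arr
    raise ValueError(f"expected a 2D or 3D image, got ndim={arr.ndim}")


def to_bzyx_sketch(data):
    """Return (list of ZYX volumes, is_batch). A lone image is a batch of 1."""
    if isinstance(data, (list, tuple)):
        return [to_zyx(x) for x in data], True
    arr = np.asarray(data)
    if arr.ndim == 4:  # assumed layout: (B, Z, Y, X)
        return [vol for vol in arr], True
    return [to_zyx(arr)], False
```

A caller can then run one per-volume kernel over the list and use the flag to decide whether to return a single dict or a list of dicts.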

get_feret spends ~86% of its time in utils.masks_to_ijv -- a per-label numpy.where scan that re-reads the whole image once per object -- and then feeds every object pixel to centrosome's convex_hull_ijv.

Two bit-exact reductions in one numba pass (core/numba/_feret.py::_boundary_ijv):

1. Replace the per-label scan with a single row-major counting-scatter into per-label offsets -- same (i,j,label) rows in the same order as masks_to_ijv.
2. Feed the hull only boundary pixels. hull(object) == hull(boundary): an interior pixel (all 8 neighbours share its label) is never a hull vertex. This leaves convex_hull_ijv and feret_diameter bit-identical while shrinking the hull input ~17x (~6% of pixels).

convex_hull_ijv / feret_diameter stay in centrosome (computational geometry -- imported). Serial kernel, no prange/nogil; batch via to_bzyx; 3D volumes -> {} like the baseline. Wired into _numba_registries + core/numba/__init__.

1080^2/144 obj: 620.8 -> 14.7 ms (42.1x), bit-exact vs numpy.

Tests: test_feret_kernels.py (boundary-ijv order + identical hull/feret + random + empty), test_feret_backend.py (golden 2D/batch/3D-empty/empty-raises), +feret in test_backend_correctness dispatch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ion) `get_feret` is shape-only and ignores `pixels`, but it passed `pixels` straight into `to_bzyx`, which does `numpy.asarray(pixels)` — for `pixels=None` that is a 0-D object array, so `_to_zyx` raised `ValueError: expected a 2D or 3D image, got ndim=0`. The accelerator dispatch (and the numpy baseline) call feret as `get_feret(masks, None)`, so the numba backend errored on its own documented calling convention — every existing test happened to pass a real `pixels` array, hiding it. Feed the mask into the batch-normaliser's pixel slot when `pixels is None` (the result is ignored anyway) and default the parameter to None. Add a regression test exercising the `(masks, None)` convention; result is bit-exact vs the numpy baseline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
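The root cause is easy to reproduce: `numpy.asarray(None)` yields a 0-D object array. The sketch below shows that behaviour and the shape of the fix; the helper `pixel_slot` is a hypothetical name, not a function from the PR.

```python
import numpy as np

# numpy wraps None in a zero-dimensional object array rather than raising.
none_arr = np.asarray(None)


def pixel_slot(masks, pixels=None):
    """Hypothetical helper: reuse the mask as a placeholder when pixels is None."""
    # The feret measurement is shape-only, so the placeholder is never read.
    return masks if pixels is None else pixels
```

With this substitution the batch normaliser always receives an array of the same rank as the mask, so the `(masks, None)` calling convention no longer trips the ndim check.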

timtreis and others added 11 commits June 2, 2026 04:47

lock: add numba to lock

eefb4d4

Merge pull request #54 from afermg/feat/accelerator-numba-intensity

4ca0a35

feat(accelerator): numba intensity backend

timtreis force-pushed the feat/numba-feret branch from ffb25f1 to 5e668e8 Compare June 4, 2026 03:45

timtreis force-pushed the feat/numba-feret branch from 5e668e8 to c44feed Compare June 4, 2026 03:56

timtreis added the numba label Jun 9, 2026

timtreis force-pushed the feat/bzyx-shape branch from 4ed7d79 to e03aff4 Compare June 9, 2026 18:42

timtreis wants to merge 13 commits into feat/bzyx-shape from feat/numba-feret
