feat(sanitize): central non-contiguous label-ID sanitation#89
Conversation
Relabel arbitrary or non-contiguous mask label IDs (e.g. {1,5,17}) to
contiguous 1..N as early as possible in the pipeline, so every measurement
function downstream can assume clean labels and focus on the math. The caller's
array is never mutated and results are reported against the original IDs.
- new cp_measure._sanitize: sanitize_masks() built on
skimage.segmentation.relabel_sequential (already a dependency), plus a thin
@sanitize_labels decorator that finds the label argument by name
- featurizer sanitizes once at the top of the per-mask loop and emits original
IDs as row labels (was range(1, n+1))
- @sanitize_labels applied to every core/correlation/numba get_* entry so
direct callers get the same guarantee (cheap idempotent guard)
- raises ValueError on negative / non-integer labels
- tests: gapped {1,5,17} == contiguous baseline per function, featurizer
end-to-end with original-ID rows, no-mutation, validation, 3D, registry
decoration meta-test (partial-aware, multimask excluded)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Benchmark —
|
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 0.96× | 0.98× | 0.99× |
| 512 | 0.97× | 1.00× | 1.00× |
| 1024 | 0.92× | 0.95× | 0.98× |
feret
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 0.97× | 0.96× | 0.96× |
| 512 | 0.97× | 0.96× | 0.97× |
| 1024 | 0.95× | 0.97× | 0.94× |
intensity
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 0.95× | 0.97× | 0.99× |
| 512 | 0.95× | 0.95× | 0.98× |
| 1024 | 0.94× | 0.95× | 0.96× |
manders_fold
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 0.76× | 0.91× | 0.99× |
| 512 | 0.77× | 0.82× | 0.91× |
| 1024 | 0.75× | 0.79× | 0.80× |
pearson
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 0.90× | 1.00× | 1.00× |
| 512 | 0.92× | 0.91× | 0.98× |
| 1024 | 0.89× | 0.91× | 0.95× |
rwc
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 0.95× | 0.99× | 1.01× |
| 512 | 0.96× | 0.94× | 0.94× |
| 1024 | 0.93× | 0.94× | 0.94× |
texture
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 1.00× | 1.00× | 1.01× |
| 512 | 1.04× | 1.00× | 0.99× |
| 1024 | 1.11× | 1.04× | 1.01× |
Review-driven cleanup of the sanitation layer: - collapse sanitize_masks to a single numpy.unique pass: derive the contiguity test, negative-value guard and original IDs from one sorted array; drop the separate _is_contiguous helper and the redundant max()/min()/second-unique passes - accept boolean masks (treated as one object) instead of raising — restores the pre-existing behaviour of the raw measurement functions - decorate get_correlation_overlap (the one public correlation entry that was missed), so direct callers get correct results on gapped labels - trim the over-long module/function docstrings and the featurizer comment - tests: parametrize the policy/edge cases, exercise the rank remap with non-monotonic labels, assert real multimask functions stay unsanitized, add a bool-mask case (net leaner, 109 passing) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The decorator ran numpy.unique over the whole image on every call (~10 ms on 1024²), a fixed cost that dominated cheap functions and large/few-object images (manders_fold fell to 0.40×). Replace it with a max() gate plus a bincount presence check (~4x faster), falling back to unique only for pathologically large labels. relabel_sequential still handles the rare gapped path. Measured on 1024²: manders_fold 0.74→0.91×, intensity 0.97×, feret 0.99×; expensive functions (zernike/sizeshape/texture) were already ~1.0×. Negatives still raise; bit-identical results. 109 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
featurize() sanitized each mask up front and then called the @sanitize_labels- decorated functions, so every feature re-ran the contiguity scan on the already-clean mask (~12x per object-type). Unwrap the decorator (via the __wrapped__ the wrapper already sets, partial-aware for the legacy path) so the featurizer pays the sanitation cost exactly once per mask. Direct callers still get the decorator's guarantee. Test asserts sanitize_masks is called once per object-type mask. 110 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
of note: the faster the actual runtime of a given function, the more the overhead of this hits, even when short-circuiting, f.e. in |
The per-function equivalence tests parametrized over all ~13 measurement functions were redundant: the decorator is uniform, so the coverage meta-test (every public function wrapped) + one representative direct call (zernike, which crashes on gaps without sanitation) + the featurizer end-to-end test cover the same ground. Also merged the mutation/copy cases, dropped the separate numba meta-test, and removed a dead second monkeypatch. 178 -> 102 lines. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With sanitize_masks guaranteeing contiguous 1..N at every entry, the gap-tolerant label bookkeeping inside the feature functions is dead weight. Remove it: - texture/granularity/radial_distribution/radial_zernikes: drop the internal `numpy.unique(masks)` enumeration (a full-image sort, redundant after the decorator's pass). Use len(props) / range_.size / arange(1, max+1) instead. - numba intensity: labels are 1..N, so the label->index map is just `label - 1`. Drop `lut` from flatten_numba (and the edge path) and compute it inline; nobjects = masks.max(). - delete primitives/segment.py: label_to_idx_lut had no remaining consumers. All outputs verified bit-identical on contiguous input (numpy + numba), numba gapped input maps correctly, backend-correctness + full suite (91) pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ac9950c to
b21be05
Compare
…tring flatten_numba (in this module) builds the flat arrays now; segment.py is gone. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sanitation relabels arbitrary IDs to 1..N; drop the relabel_sequential caveat. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
afermg
left a comment
There was a problem hiding this comment.
It looks good in general, my main issue is that the default behaviour is to sanitize if called directly and not sanitize if called from one of the entry points (featurizer, bulk). I think we should sanitize when using an entry-point (happy paths) and not do it otherwise (you're on your own mate).
| _2D_ONLY = {"radial_distribution", "radial_zernikes", "zernike", "feret"} | ||
|
|
||
|
|
||
| def _unwrap(fn): |
There was a problem hiding this comment.
This seems like double-work, why are we sanitising individually if we are going to unwrwap the for the featurizer? I think we should instead provide them unwrapped and only wrap when they are called from the bulk entry point (get_X_features). If the users want to pull one feature at a time they ought to understand the assumptions (and we should document them in the functions' docstrings aswell).
| assert not any(_is_sanitized(fn) for fn in get_multimask_measurements().values()) | ||
|
|
||
|
|
||
| def test_decorated_func_handles_gapped_labels(): |
There was a problem hiding this comment.
Why are we testing zernike explicitly (question for myself)? Some measurements support non-sequential labels and others don't (due to their algorithm). For simplicity I guess this behaviour is okay, but the test should swap gapped/contig once get_core_measurements() returns the sanitised versions and the direct import the unsanitized.
This is why I think the default behaviour should be unsanitized, and only do so through wrappers. We can add a flag to the bulk wrapper to sanitize functions or not. Sanitation by default sounds sensible. P.S. These are the complications I was trying to avoid by not sanitising, but having the user do it explicitly using a helper function. |
? |
Address review feedback: raw measurement functions no longer sanitize their label argument. Drop the @sanitize_labels decorator from all core and correlation get_* functions and rename the helper sanitize_labels -> sanitize (the user-facing sanitize(fn) escape hatch), removing the wrapper._sanitized state attribute. Sanitation now lives only at the entry points: the bulk registries (get_core_measurements/_3d, get_correlation_measurements) gain a sanitize=True flag, and featurize fetches the raw functions (sanitize=False) so it keeps sanitizing each mask exactly once instead of decorating-then-unwrapping. This also removes the per-call overhead on fast functions when callers import them directly. Document the contiguous-1..N assumption in every raw function's docstring, update the README label note, and drop the redundant int(label) cast. Tests rewritten accordingly (no more _sanitized flag). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
numpy 2.5 tightened/regressed its type stubs, turning two pre-existing lines red under mypy on Python 3.11+ (CI installs numpy unpinned, so this surfaced on the first commit after the release; it reproduces on the parent commit too): - bisection/linear costes returned numpy floats from a tuple[float, float] signature -> cast to float(). - numpy.ma.masked_array is now a non-callable TypeAliasType in the stubs; call the MaskedArray class directly (identical at runtime). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code-review cleanup (no behavior change on supported paths): - Collapse the 14 repeated 3-line "assumes 1..N" docstring stanzas to one line each; trim the over-long README bullet and the _sanitize module + wrapper docstrings and the featurizer mask-param doc (the policy is stated once centrally). Drop 3 stray blank-line additions and a lazy import. - sanitize_masks: raise on negative labels in both branches (the unique branch previously dropped them silently), and return an integer dtype for a bool fast-path mask instead of leaking bool downstream. - radial_distribution: astype(numpy.integer) -> int64 (the abstract type is deprecated as a dtype). Net -42 LOC. 92 tests pass; ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sorry that it's unclear. To clarify: Sanitation by default in the entry point interfaces (featurizer, bulk). We sanitize in one place (two places I think) - the definition of those high level (public) functions. If the user is pulling the raw functions, they should be unsanitized (they know what they are doing). |
afermg
left a comment
There was a problem hiding this comment.
LGTM, it also looks leaner than the previous version with decorators
PR #89 centralized label sanitation at the entry points and deleted primitives/segment.py (with label_to_idx_lut), which _zernike_scores imported. Adapt the vectorized zernike work to the merged design: - Move _zernike_scores from utils.py into a new numpy primitive module primitives/_segment.py (sibling to _segment_numba.py, no numba dependency). - Drop the deleted label_to_idx_lut: under the 1..N contiguity contract the segment index is label - 1, so the lookup is a plain arange. Guard masks.max() on size-0 arrays (the old find_objects path tolerated them). - Update test_zernike imports; remove the non-contiguous-labels test, which asserted pre-#89 behavior raw funcs no longer support. - Update test_sanitize::test_raw_function_does_not_sanitize: the vectorized get_zernike no longer raises IndexError on gapped IDs (it pads to max-label rows), so assert raw output differs from the sanitized wrapper instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#89 centralized label sanitation at the entry points and deleted primitives/segment.py (label_to_idx_lut), which _zernike_scores imported. - Drop the deleted lookup table: under the contiguous 1..N contract the segment index is label - 1, so it is a plain arange (guard masks.max() on size-0). _zernike_scores stays in utils.py (reused by #75) — no new module. - Import centrosome explicitly as `from centrosome import zernike` (no centrosome.zernike.* attribute access) in utils and the sizeshape caller. - test_zernike: drop the non-contiguous test (unsupported), drop redundant single/multi-object tests (the irregular case covers them) and parametrize it over zernike numbers 5/9/14, drop the weighted test + _weighted_reference helper (more test code than the branch it checked), rename the generator to _generate_square_objects. - test_sanitize: the vectorized get_zernike no longer raises IndexError on gapped IDs (it pads to max-label rows), so assert raw output differs from the sanitized wrapper instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* perf(sizeshape): vectorize get_zernike on foreground pixels (~8x) `get_zernike` delegated to `centrosome.zernike.zernike`, which scatters the per-pixel Zernike basis into a full `(H, W, K)` complex array (~560 MB at 1080^2, K~30) and scores via ~60 full-image `scipy.ndimage.sum` calls. Both costs scale with image area rather than object pixels, so most work lands on background. Replace it with a pure-numpy `_zernike_scores` helper that keeps the basis on the masked foreground vectors and segment-sums each moment by label with a single `numpy.bincount`. The Horner basis evaluation is copied verbatim from `centrosome.construct_zernike_polynomials` (same lookup-table coefficients, `r**2 > 1` cutoff and `z = y + i*x` convention), so results track the installed centrosome to floating-point round-off (bit-identical in practice). The helper lives in `cp_measure.utils` (alongside `masks_to_ijv`) and takes an optional pixel `weight` so the sibling `get_radial_zernikes` can reuse it for intensity-weighted moments in a follow-up PR. Measured: 8.6x at typical density (782->98 ms large tier), 3.7-44x depending on foreground fraction. No new deps. Also switches from `range(1, n+1)` to the actual unique labels (identical on contiguous masks, correct for non-contiguous). Adds golden + edge tests (empty, single-pixel r=0, non-contiguous labels, edge-touching, non-default zernike_numbers) asserting parity with centrosome. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(sizeshape): make _zernike_scores a complete reusable primitive Reuse primitives.segment.label_to_idx_lut for the label->row map (correct sizing, find_objects-based) instead of a hand-rolled reverse map keyed on masks.max(); derive labels internally so get_zernike no longer needs its own unique() pass. Single foreground gather, skip the identically-zero imaginary segment-sum for m==0 moments, and precompute the azimuthal powers once. Return (real_sums, imag_sums, radii, counts): radii feeds get_zernike's pi*r**2 normalisation, counts the intensity-weighted radial Zernikes (PR #75), which reuse this via the restored `weight` arg. Add weighted + count golden tests vs centrosome so no path ships untested. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(zernike): suppress expected single-pixel divide warning in _zernike_scores Single-pixel objects have an enclosing-circle radius of 0, so the unit-disk coordinate division is 0/0 -> NaN (discarded later by the r**2 > 1 cutoff, matching centrosome). Wrap it in numpy.errstate so the expected RuntimeWarning isn't emitted from this shared helper (it would otherwise crash callers running under -W error::RuntimeWarning). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style(test): Wrap long lines in `_weighted_reference` function * fix(zernike): adapt to post-#89 contract; address review #89 centralized label sanitation at the entry points and deleted primitives/segment.py (label_to_idx_lut), which _zernike_scores imported. - Drop the deleted lookup table: under the contiguous 1..N contract the segment index is label - 1, so it is a plain arange (guard masks.max() on size-0). _zernike_scores stays in utils.py (reused by #75) — no new module. - Import centrosome explicitly as `from centrosome import zernike` (no centrosome.zernike.* attribute access) in utils and the sizeshape caller. - test_zernike: drop the non-contiguous test (unsupported), drop redundant single/multi-object tests (the irregular case covers them) and parametrize it over zernike numbers 5/9/14, drop the weighted test + _weighted_reference helper (more test code than the branch it checked), rename the generator to _generate_square_objects. - test_sanitize: the vectorized get_zernike no longer raises IndexError on gapped IDs (it pads to max-label rows), so assert raw output differs from the sanitized wrapper instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(sanitize): simpler raw-vs-sanitized assertion, no padding specifics Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(zernike): drop redundant 3D-empty test The ndim==3 guard is a #35 band-aid; dimensionality dispatch belongs at the entry points, not asserted in a per-measurement test. test_core_measurements already covers the 3D-empty case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Alán F. Muñoz <afer.mg@gmail.com>
Relabel non-contiguous / arbitrary mask IDs (e.g.
{1,5,17}) to1..Nso measurement code can assume clean labels. Input is never mutated; results map back to original IDs.How
cp_measure._sanitize.sanitize_masks(masks) -> (clean, ids)viaskimage.segmentation.relabel_sequential(existing dep). Singleuniquepass; no-copy fast path when already contiguous; bool masks = one object; raises on negative/non-integer.@sanitize_labelson every core/correlation/numbaget_*for direct callers.