Skip to content

feat(sanitize): central non-contiguous label-ID sanitation#89

Merged
afermg merged 11 commits into
mainfrom
feat/sanitize-noncontiguous-labels
Jun 26, 2026
Merged

feat(sanitize): central non-contiguous label-ID sanitation#89
afermg merged 11 commits into
mainfrom
feat/sanitize-noncontiguous-labels

Conversation

@timtreis

@timtreis timtreis commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Relabel non-contiguous / arbitrary mask IDs (e.g. {1,5,17}) to 1..N so measurement code can assume clean labels. Input is never mutated; results map back to original IDs.

How

  • cp_measure._sanitize.sanitize_masks(masks) -> (clean, ids) via skimage.segmentation.relabel_sequential (existing dep). Single unique pass; no-copy fast path when already contiguous; bool masks = one object; raises on negative/non-integer.
  • Featurizer sanitizes once at loop entry, emits original IDs as rows.
  • @sanitize_labels on every core/correlation/numba get_* for direct callers.
  • short-circuits when mask is already contiguous

Relabel arbitrary or non-contiguous mask label IDs (e.g. {1,5,17}) to
contiguous 1..N as early as possible in the pipeline, so every measurement
function downstream can assume clean labels and focus on the math. The caller's
array is never mutated and results are reported against the original IDs.

- new cp_measure._sanitize: sanitize_masks() built on
  skimage.segmentation.relabel_sequential (already a dependency), plus a thin
  @sanitize_labels decorator that finds the label argument by name
- featurizer sanitizes once at the top of the per-mask loop and emits original
  IDs as row labels (was range(1, n+1))
- @sanitize_labels applied to every core/correlation/numba get_* entry so
  direct callers get the same guarantee (cheap idempotent guard)
- raises ValueError on negative / non-integer labels
- tests: gapped {1,5,17} == contiguous baseline per function, featurizer
  end-to-end with original-ID rows, no-mutation, validation, 3D, registry
  decoration meta-test (partial-aware, multimask excluded)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown

Benchmark — 90f2eed vs main

speedup = main/head (>1 faster, <1 slower) · median per cell · showing functions that moved ≥1.05× either way

costes

size \ objects 16 64 256
256 0.96× 0.98× 0.99×
512 0.97× 1.00× 1.00×
1024 0.92× 0.95× 0.98×

feret

size \ objects 16 64 256
256 0.97× 0.96× 0.96×
512 0.97× 0.96× 0.97×
1024 0.95× 0.97× 0.94×

intensity

size \ objects 16 64 256
256 0.95× 0.97× 0.99×
512 0.95× 0.95× 0.98×
1024 0.94× 0.95× 0.96×

manders_fold

size \ objects 16 64 256
256 0.76× 0.91× 0.99×
512 0.77× 0.82× 0.91×
1024 0.75× 0.79× 0.80×

pearson

size \ objects 16 64 256
256 0.90× 1.00× 1.00×
512 0.92× 0.91× 0.98×
1024 0.89× 0.91× 0.95×

rwc

size \ objects 16 64 256
256 0.95× 0.99× 1.01×
512 0.96× 0.94× 0.94×
1024 0.93× 0.94× 0.94×

texture

size \ objects 16 64 256
256 1.00× 1.00× 1.01×
512 1.04× 1.00× 0.99×
1024 1.11× 1.04× 1.01×

timtreis and others added 3 commits June 18, 2026 14:34
Review-driven cleanup of the sanitation layer:

- collapse sanitize_masks to a single numpy.unique pass: derive the
  contiguity test, negative-value guard and original IDs from one sorted
  array; drop the separate _is_contiguous helper and the redundant
  max()/min()/second-unique passes
- accept boolean masks (treated as one object) instead of raising — restores
  the pre-existing behaviour of the raw measurement functions
- decorate get_correlation_overlap (the one public correlation entry that was
  missed), so direct callers get correct results on gapped labels
- trim the over-long module/function docstrings and the featurizer comment
- tests: parametrize the policy/edge cases, exercise the rank remap with
  non-monotonic labels, assert real multimask functions stay unsanitized, add
  a bool-mask case (net leaner, 109 passing)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The decorator ran numpy.unique over the whole image on every call (~10 ms on
1024²), a fixed cost that dominated cheap functions and large/few-object images
(manders_fold fell to 0.40×). Replace it with a max() gate plus a bincount
presence check (~4x faster), falling back to unique only for pathologically
large labels. relabel_sequential still handles the rare gapped path.

Measured on 1024²: manders_fold 0.74→0.91×, intensity 0.97×, feret 0.99×;
expensive functions (zernike/sizeshape/texture) were already ~1.0×. Negatives
still raise; bit-identical results. 109 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
featurize() sanitized each mask up front and then called the @sanitize_labels-
decorated functions, so every feature re-ran the contiguity scan on the
already-clean mask (~12x per object-type). Unwrap the decorator (via the
__wrapped__ the wrapper already sets, partial-aware for the legacy path) so the
featurizer pays the sanitation cost exactly once per mask. Direct callers still
get the decorator's guarantee.

Test asserts sanitize_masks is called once per object-type mask. 110 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@timtreis

Copy link
Copy Markdown
Collaborator Author

of note: the faster the actual runtime of a given function, the more the overhead of this hits, even when short-circuiting, f.e. in manders

timtreis and others added 2 commits June 18, 2026 15:40
The per-function equivalence tests parametrized over all ~13 measurement
functions were redundant: the decorator is uniform, so the coverage meta-test
(every public function wrapped) + one representative direct call (zernike, which
crashes on gaps without sanitation) + the featurizer end-to-end test cover the
same ground. Also merged the mutation/copy cases, dropped the separate numba
meta-test, and removed a dead second monkeypatch. 178 -> 102 lines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With sanitize_masks guaranteeing contiguous 1..N at every entry, the gap-tolerant
label bookkeeping inside the feature functions is dead weight. Remove it:

- texture/granularity/radial_distribution/radial_zernikes: drop the internal
  `numpy.unique(masks)` enumeration (a full-image sort, redundant after the
  decorator's pass). Use len(props) / range_.size / arange(1, max+1) instead.
- numba intensity: labels are 1..N, so the label->index map is just `label - 1`.
  Drop `lut` from flatten_numba (and the edge path) and compute it inline;
  nobjects = masks.max().
- delete primitives/segment.py: label_to_idx_lut had no remaining consumers.

All outputs verified bit-identical on contiguous input (numpy + numba), numba
gapped input maps correctly, backend-correctness + full suite (91) pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@timtreis timtreis force-pushed the feat/sanitize-noncontiguous-labels branch from ac9950c to b21be05 Compare June 18, 2026 13:58
…tring

flatten_numba (in this module) builds the flat arrays now; segment.py is gone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@timtreis timtreis requested a review from afermg June 18, 2026 14:41
Sanitation relabels arbitrary IDs to 1..N; drop the relabel_sequential caveat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@afermg afermg left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good in general, my main issue is that the default behaviour is to sanitize if called directly and not sanitize if called from one of the entry points (featurizer, bulk). I think we should sanitize when using an entry-point (happy paths) and not do it otherwise (you're on your own mate).

Comment thread src/cp_measure/_sanitize.py Outdated
Comment thread src/cp_measure/core/measureobjectintensitydistribution.py
Comment thread src/cp_measure/core/measuretexture.py
Comment thread src/cp_measure/featurizer.py Outdated
_2D_ONLY = {"radial_distribution", "radial_zernikes", "zernike", "feret"}


def _unwrap(fn):

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like double-work, why are we sanitising individually if we are going to unwrwap the for the featurizer? I think we should instead provide them unwrapped and only wrap when they are called from the bulk entry point (get_X_features). If the users want to pull one feature at a time they ought to understand the assumptions (and we should document them in the functions' docstrings aswell).

Comment thread src/cp_measure/featurizer.py Outdated
Comment thread src/cp_measure/featurizer.py Outdated
Comment thread test/test_sanitize.py Outdated
Comment thread test/test_sanitize.py Outdated
assert not any(_is_sanitized(fn) for fn in get_multimask_measurements().values())


def test_decorated_func_handles_gapped_labels():

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we testing zernike explicitly (question for myself)? Some measurements support non-sequential labels and others don't (due to their algorithm). For simplicity I guess this behaviour is okay, but the test should swap gapped/contig once get_core_measurements() returns the sanitised versions and the direct import the unsanitized.

Comment thread test/test_sanitize.py
@afermg

afermg commented Jun 22, 2026

Copy link
Copy Markdown
Owner

of note: the faster the actual runtime of a given function, the more the overhead of this hits, even when short-circuiting, f.e. in manders

This is why I think the default behaviour should be unsanitized, and only do so through wrappers. We can add a flag to the bulk wrapper to sanitize functions or not. Sanitation by default sounds sensible.

P.S. These are the complications I was trying to avoid by not sanitising, but having the user do it explicitly using a helper function.

@timtreis

Copy link
Copy Markdown
Collaborator Author

This is why I think the default behaviour should be unsanitized

Sanitation by default sounds sensible.

?

timtreis and others added 3 commits June 22, 2026 16:28
Address review feedback: raw measurement functions no longer sanitize
their label argument. Drop the @sanitize_labels decorator from all core
and correlation get_* functions and rename the helper sanitize_labels ->
sanitize (the user-facing sanitize(fn) escape hatch), removing the
wrapper._sanitized state attribute.

Sanitation now lives only at the entry points: the bulk registries
(get_core_measurements/_3d, get_correlation_measurements) gain a
sanitize=True flag, and featurize fetches the raw functions
(sanitize=False) so it keeps sanitizing each mask exactly once instead of
decorating-then-unwrapping. This also removes the per-call overhead on
fast functions when callers import them directly.

Document the contiguous-1..N assumption in every raw function's
docstring, update the README label note, and drop the redundant
int(label) cast. Tests rewritten accordingly (no more _sanitized flag).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
numpy 2.5 tightened/regressed its type stubs, turning two pre-existing
lines red under mypy on Python 3.11+ (CI installs numpy unpinned, so this
surfaced on the first commit after the release; it reproduces on the
parent commit too):

- bisection/linear costes returned numpy floats from a tuple[float, float]
  signature -> cast to float().
- numpy.ma.masked_array is now a non-callable TypeAliasType in the stubs;
  call the MaskedArray class directly (identical at runtime).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code-review cleanup (no behavior change on supported paths):

- Collapse the 14 repeated 3-line "assumes 1..N" docstring stanzas to one
  line each; trim the over-long README bullet and the _sanitize module +
  wrapper docstrings and the featurizer mask-param doc (the policy is stated
  once centrally). Drop 3 stray blank-line additions and a lazy import.
- sanitize_masks: raise on negative labels in both branches (the unique
  branch previously dropped them silently), and return an integer dtype for
  a bool fast-path mask instead of leaking bool downstream.
- radial_distribution: astype(numpy.integer) -> int64 (the abstract type is
  deprecated as a dtype).

Net -42 LOC. 92 tests pass; ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@afermg

afermg commented Jun 22, 2026

Copy link
Copy Markdown
Owner

This is why I think the default behaviour should be unsanitized

Sanitation by default sounds sensible.

?

Sorry that it's unclear. To clarify:

Sanitation by default in the entry point interfaces (featurizer, bulk). We sanitize in one place (two places I think) - the definition of those high level (public) functions.

If the user is pulling the raw functions, they should be unsanitized (they know what they are doing).

@timtreis timtreis requested a review from afermg June 26, 2026 20:19

@afermg afermg left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, it also looks leaner than the previous version with decorators

@afermg afermg merged commit 2c5a6cf into main Jun 26, 2026
17 checks passed
@timtreis timtreis deleted the feat/sanitize-noncontiguous-labels branch June 26, 2026 21:09
timtreis added a commit that referenced this pull request Jun 26, 2026
PR #89 centralized label sanitation at the entry points and deleted
primitives/segment.py (with label_to_idx_lut), which _zernike_scores imported.
Adapt the vectorized zernike work to the merged design:

- Move _zernike_scores from utils.py into a new numpy primitive module
  primitives/_segment.py (sibling to _segment_numba.py, no numba dependency).
- Drop the deleted label_to_idx_lut: under the 1..N contiguity contract the
  segment index is label - 1, so the lookup is a plain arange. Guard masks.max()
  on size-0 arrays (the old find_objects path tolerated them).
- Update test_zernike imports; remove the non-contiguous-labels test, which
  asserted pre-#89 behavior raw funcs no longer support.
- Update test_sanitize::test_raw_function_does_not_sanitize: the vectorized
  get_zernike no longer raises IndexError on gapped IDs (it pads to max-label
  rows), so assert raw output differs from the sanitized wrapper instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
timtreis added a commit that referenced this pull request Jun 26, 2026
#89 centralized label sanitation at the entry points and deleted
primitives/segment.py (label_to_idx_lut), which _zernike_scores imported.

- Drop the deleted lookup table: under the contiguous 1..N contract the segment
  index is label - 1, so it is a plain arange (guard masks.max() on size-0).
  _zernike_scores stays in utils.py (reused by #75) — no new module.
- Import centrosome explicitly as `from centrosome import zernike` (no
  centrosome.zernike.* attribute access) in utils and the sizeshape caller.
- test_zernike: drop the non-contiguous test (unsupported), drop redundant
  single/multi-object tests (the irregular case covers them) and parametrize it
  over zernike numbers 5/9/14, drop the weighted test + _weighted_reference
  helper (more test code than the branch it checked), rename the generator to
  _generate_square_objects.
- test_sanitize: the vectorized get_zernike no longer raises IndexError on gapped
  IDs (it pads to max-label rows), so assert raw output differs from the
  sanitized wrapper instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
afermg added a commit that referenced this pull request Jun 30, 2026
* perf(sizeshape): vectorize get_zernike on foreground pixels (~8x)

`get_zernike` delegated to `centrosome.zernike.zernike`, which scatters the
per-pixel Zernike basis into a full `(H, W, K)` complex array (~560 MB at
1080^2, K~30) and scores via ~60 full-image `scipy.ndimage.sum` calls. Both
costs scale with image area rather than object pixels, so most work lands on
background.

Replace it with a pure-numpy `_zernike_scores` helper that keeps the basis on
the masked foreground vectors and segment-sums each moment by label with a
single `numpy.bincount`. The Horner basis evaluation is copied verbatim from
`centrosome.construct_zernike_polynomials` (same lookup-table coefficients,
`r**2 > 1` cutoff and `z = y + i*x` convention), so results track the installed
centrosome to floating-point round-off (bit-identical in practice).

The helper lives in `cp_measure.utils` (alongside `masks_to_ijv`) and takes an
optional pixel `weight` so the sibling `get_radial_zernikes` can reuse it for
intensity-weighted moments in a follow-up PR.

Measured: 8.6x at typical density (782->98 ms large tier), 3.7-44x depending on
foreground fraction. No new deps. Also switches from `range(1, n+1)` to the
actual unique labels (identical on contiguous masks, correct for
non-contiguous).

Adds golden + edge tests (empty, single-pixel r=0, non-contiguous labels,
edge-touching, non-default zernike_numbers) asserting parity with centrosome.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(sizeshape): make _zernike_scores a complete reusable primitive

Reuse primitives.segment.label_to_idx_lut for the label->row map (correct
sizing, find_objects-based) instead of a hand-rolled reverse map keyed on
masks.max(); derive labels internally so get_zernike no longer needs its own
unique() pass. Single foreground gather, skip the identically-zero imaginary
segment-sum for m==0 moments, and precompute the azimuthal powers once.

Return (real_sums, imag_sums, radii, counts): radii feeds get_zernike's
pi*r**2 normalisation, counts the intensity-weighted radial Zernikes (PR #75),
which reuse this via the restored `weight` arg. Add weighted + count golden
tests vs centrosome so no path ships untested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(zernike): suppress expected single-pixel divide warning in _zernike_scores

Single-pixel objects have an enclosing-circle radius of 0, so the unit-disk
coordinate division is 0/0 -> NaN (discarded later by the r**2 > 1 cutoff,
matching centrosome). Wrap it in numpy.errstate so the expected RuntimeWarning
isn't emitted from this shared helper (it would otherwise crash callers running
under -W error::RuntimeWarning).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(test): Wrap long lines in `_weighted_reference` function

* fix(zernike): adapt to post-#89 contract; address review

#89 centralized label sanitation at the entry points and deleted
primitives/segment.py (label_to_idx_lut), which _zernike_scores imported.

- Drop the deleted lookup table: under the contiguous 1..N contract the segment
  index is label - 1, so it is a plain arange (guard masks.max() on size-0).
  _zernike_scores stays in utils.py (reused by #75) — no new module.
- Import centrosome explicitly as `from centrosome import zernike` (no
  centrosome.zernike.* attribute access) in utils and the sizeshape caller.
- test_zernike: drop the non-contiguous test (unsupported), drop redundant
  single/multi-object tests (the irregular case covers them) and parametrize it
  over zernike numbers 5/9/14, drop the weighted test + _weighted_reference
  helper (more test code than the branch it checked), rename the generator to
  _generate_square_objects.
- test_sanitize: the vectorized get_zernike no longer raises IndexError on gapped
  IDs (it pads to max-label rows), so assert raw output differs from the
  sanitized wrapper instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(sanitize): simpler raw-vs-sanitized assertion, no padding specifics

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(zernike): drop redundant 3D-empty test

The ndim==3 guard is a #35 band-aid; dimensionality dispatch belongs at the entry
points, not asserted in a per-measurement test. test_core_measurements already
covers the 3D-empty case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Alán F. Muñoz <afer.mg@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants