* feat(synth): deterministic synthetic cell-image generator for benchmarking
Add `cp_measure.synth.generate(image_size, n_objects, n_channels, seed)` — the
shipped, importable generator for the PR-benchmark action (build step 1).
Produces a cell-like contiguous label mask (organic star-shaped cells placed by
gap-respecting dart-throwing, log-normal sizes, no degenerate ~1px objects) plus
intensity channels built from a shared smooth envelope + shared/independent
multi-scale Gaussian splats, so area, intensity, texture AND colocalisation
features all carry real signal. Output is a pure function of the inputs (version
stamped via `__version__`); placement is capacity-checked and raises loudly
rather than silently under-placing.
test/test_synth.py replaces the design's "eyeball the examples" gate with
programmatic acceptance asserts at the matrix corners (min-size×max-count,
max-size×min-count): determinism, contiguous exact count, no degenerate objects,
shape/texture/intensity signal, and a controlled sub-unity channel correlation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(synth): review hardening — single cell-extent, sturdier tests, determinism
Apply the "fix now" set from the max-effort review of the generator (no
behavioural bugs were found; these harden maintainability, the test net, and
cross-version reproducibility):
- Extract `_cell_extent(base_r, amps)` as the single definition of a cell's
radial reach, used by both the packing radius (worst-case amps) and the
rasterisation window (actual amps). Removes the reach-vs-bulge drift risk that
could silently break the no-overlap guarantee if one formula were edited.
- Strengthen the two toothless tests: texture now asserts median per-object std
is well ABOVE the read-noise floor (a splat-removed regression collapses to
~noise and fails); organic-shape now asserts a boundary radial-roughness CV
that plain disks fail (the old solidity<0.99 passed for pixelated disks).
Both verified to fail on their intended regressions.
- Determinism: stable sort for tied radii; replace rng.choice(p=...) with
inverse-CDF sampling on rng.random (version-stable draw count) so two
separately-installed envs can't diverge. Bump __version__ 0.1.0 -> 0.2.0.
- Widen the brittle seed-averaged correlation band (0.4-0.7 -> 0.35-0.8) so a
legitimate constant re-tune doesn't flip it.
- Per decisions: keep realistic PSF splat bleed; drop the unimplemented
"clusters" docstring claim.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
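The inverse-CDF replacement for `rng.choice(p=...)` mentioned in the Determinism bullet can be sketched as follows. This is a minimal illustration, not the generator's actual code: `sample_index` is a hypothetical helper name, and it assumes a `numpy.random.Generator`. The point is that each draw consumes exactly one `rng.random()` value, so the stream position does not depend on how `choice` is implemented in a given NumPy version.

```python
import numpy as np


def sample_index(rng: np.random.Generator, p) -> int:
    """Draw one index with probabilities `p` using exactly one rng.random() call.

    Hypothetical helper for illustration; not part of cp_measure.
    """
    p = np.asarray(p, dtype=float)
    cdf = np.cumsum(p)
    cdf /= cdf[-1]  # normalise so the last entry is exactly 1.0
    u = rng.random()  # single uniform draw in [0, 1)
    # side="right" maps u in [cdf[i-1], cdf[i]) to index i,
    # so zero-probability entries are never selected.
    idx = int(np.searchsorted(cdf, u, side="right"))
    return min(idx, len(p) - 1)  # guard against float round-off at the top end


rng = np.random.default_rng(42)
draws = [sample_index(rng, [0.2, 0.3, 0.5]) for _ in range(5)]
```

Re-seeding with the same value reproduces `draws`, because the number of uniform values consumed per draw is fixed at one.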
* feat(bench): symbol-level PR target mapper (build step 2)
`python -m cp_measure._bench.targets --base <ref> --head <ref>` resolves a PR
diff to exactly the measurement functions it changes, for the benchmark action.
Resolution is SYMBOL-level, not file-level: it builds a static symbol-reference
graph over the package (AST, resolving intra-package imports incl. submodule and
relative imports) and selects a feature iff its call graph transitively reaches a
changed symbol. So a shared-helper edit (e.g. utils._zernike_scores) selects only
the features that actually use it — verified on the real PRs: #74 -> {zernike},
#75 -> {radial_zernikes}, where file-closure would have over-selected the ~6
features whose modules merely import utils.
- Rooted at an explicit entry-point table (the get_* registry) so bulk.py's lazy
numba/multimask imports can't cause an entry-point to be missed; a test
cross-checks the table against the live registries by function identity.
- Reads everything from git refs (git show), diffing against the merge-base, so
it matches CI and is correct for stacked PRs given the PR's real base.
- Three distinct states: benchmarked / skipped-unsupported (multimask, numba) /
empty — a multimask-only PR is never mistaken for "no measurement change".
- Tolerates the get_ferret->get_feret cross-branch rename via name candidates.
Hermetic tests build a throwaway git repo + mini package to prove symbol-level
precision and the three states; a guarded test checks the real #74/#75 refs.
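The symbol-level selection described above can be illustrated with a deliberately reduced sketch: a single module's source, top-level functions only, and no import resolution (the real mapper handled intra-package and relative imports). `reference_graph`, `affected`, and the `get_area`/`_other` names are hypothetical and exist only for this example.

```python
import ast


def reference_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the other top-level functions it references."""
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            names = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
            graph[node.name] = (names & defined) - {node.name}
    return graph


def affected(graph, entry_points, changed):
    """Entry points whose transitive references reach a changed symbol."""
    changed = set(changed)
    hit = set()
    for entry in entry_points:
        seen, stack = set(), [entry]
        while stack:  # depth-first walk over the reference graph
            sym = stack.pop()
            if sym in seen:
                continue
            seen.add(sym)
            stack.extend(graph.get(sym, ()))
        if seen & changed:
            hit.add(entry)
    return hit


SOURCE = '''
def _zernike_scores(x): return x
def _other(x): return x
def get_zernike(x): return _zernike_scores(x)
def get_area(x): return _other(x)
'''

graph = reference_graph(SOURCE)
selected = affected(graph, ["get_zernike", "get_area"], {"_zernike_scores"})
```

Here only `get_zernike` is selected, even though both entry points live in the same file as the changed helper, which is the distinction between symbol-level and file-level resolution.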
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* revert(bench): drop the change-detection mapper; benchmark all exposed functions
Per design decision: the benchmark compares at the main exposed-function level —
run every public get_* feature base-vs-head and let the speedup table show what
changed (~1.0x = untouched). This removes the static AST symbol-graph mapper
(build step 2) entirely, along with its edge-case surface; benchmark cost is
controlled by the matrix size / per-function budget instead of pre-selection.
- Remove src/cp_measure/_bench/targets.py and test/test_targets.py (keep the
_bench package for the upcoming runner).
- Remove accidentally-committed __pycache__/*.pyc and add a .gitignore for
Python bytecode (the repo had none).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(bench): fixture/runner/comparator — benchmark all get_* head-vs-main
Build step 2 (v3): the benchmark core, three composable pieces.
- fixtures.py: build the (image_size x object_count x seed) matrix once from the
pinned synth generator, serialise to .npz with a manifest + per-array sha256
(stamps synth.__version__). Both envs load identical, checksum-verified inputs.
- run.py: `python -m cp_measure._bench.run` times EVERY public get_* function
(core arity-1, correlation arity-2, plus a [legacy] variant where a `legacy`
param exists) over the fixtures in one environment -> JSON. Channels normalised
to [0,1] (the pipeline convention; get_texture requires it). Per-call warmup +
reps (min), SIGALRM per-call timeout, thread-pinning set before numpy import.
Functions enumerated from the live registry at HEAD; a function that errors on
synth input is recorded, not fatal.
- compare.py: `python -m cp_measure._bench.compare` diffs two run JSONs into a
speedup table. speedup = main/head (>1 faster); per cell takes the min then the
median across seeds; classifies faster/slower/within-noise/new/removed/no-data.
Untouched functions land at ~1.0x — the "what changed" signal, no mapper needed.
Validated end-to-end on a smoke matrix (all 12 functions time ok incl. texture;
self-compare is 1.00x). The two-worktree/two-env orchestration is step 3 (workflow).
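The per-call timing scheme in run.py (warmup, repetitions reduced with min, a SIGALRM timeout, errors recorded rather than fatal) could look roughly like the sketch below. It is an assumption-laden illustration rather than the shipped code: `time_call` and `CallTimeout` are hypothetical names, and SIGALRM is only available on Unix and only from the main thread.

```python
import signal
import time


class CallTimeout(Exception):
    """Raised by the SIGALRM handler when a single call exceeds its budget."""


def _raise_timeout(signum, frame):
    raise CallTimeout


def time_call(fn, *args, warmup=1, reps=3, timeout_s=5.0):
    """Time fn(*args); return a record instead of propagating failures."""
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    try:
        times = []
        for i in range(warmup + reps):
            signal.setitimer(signal.ITIMER_REAL, timeout_s)  # arm per call
            try:
                t0 = time.perf_counter()
                fn(*args)
                elapsed = time.perf_counter() - t0
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # always disarm
            if i >= warmup:  # discard warmup iterations
                times.append(elapsed)
        return {"status": "ok", "times": times, "best": min(times)}
    except CallTimeout:
        return {"status": "timeout"}
    except Exception as exc:  # a function that errors is recorded, not fatal
        return {"status": "error", "error": repr(exc)}
    finally:
        # Restore whatever handler was installed before the measurement.
        signal.signal(
            signal.SIGALRM, previous if previous is not None else signal.SIG_DFL
        )


ok = time_call(sum, [1, 2, 3])
slow = time_call(time.sleep, 1.0, timeout_s=0.05)
```

Taking the minimum over repetitions reports the least-disturbed run, which is the usual choice when the goal is to compare two builds rather than to characterise latency variance.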
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(bench): two-job benchmark workflow + orchestration driver (build step 3)
Wires fixtures -> run(head) + run(main) -> compare -> sticky PR comment.
- .github/workflows/benchmark.yml: triggered by the `benchmark` label (labels
need write access, so the trigger is maintainer-gated) or workflow_dispatch.
Two-job split: `build` runs untrusted PR code with `permissions: {}` (no token
to steal, persist-credentials off); `report` holds pull-requests/issues:write
but never checks out PR code — it only renders the artifact into a sticky
`<!-- cp-bench -->` comment and removes the label. fetch-depth: 0 so `main` is
present; concurrency cancels superseded runs.
- .github/scripts/run_benchmark.sh: installs head + main in two isolated uv envs,
VENDORS head's synth.py + _bench/ into the main worktree so the generator and
tooling are identical across both runs (only cp_measure.core.* differs), builds
the fixtures once, runs both, compares.
- fixtures.py: add CI_MATRIX (bounded for hosted-runner limits, the workflow
default; full DEFAULT via dispatch) + a `python -m cp_measure._bench.fixtures`
build CLI.
Validated locally: script bash-syntax, YAML structure (tokenless build, gated
report), fixtures CLI, full test suite. End-to-end CI run is via workflow_dispatch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(bench): review cleanup — leaner comments/docstrings + small fixes
Elegance/LOC pass over the benchmark PR (net ~-50 lines, mostly verbose
docstrings) plus the real findings from the review:
Fixes:
- workflow: `gh api --paginate | head` SIGPIPE under pipefail could abort the
comment post on PRs with many comments — use a single `?per_page=100` page +
`--jq 'first'` instead. Add `if: always()` to upload-artifact so a failed run
still surfaces partial output. Drop the redundant matrix default + useless cat.
- run.py: build call-args INSIDE the guarded path so an input a function can't
handle (e.g. a 1-channel fixture) is recorded per-cell, not fatal. Record the
matrix + fixture count in meta so the comment shows which sweep ran; note the
shared-fn JIT caveat for [legacy] variants.
- compare.py: label the status column (was a blank header); guard head_t==0;
surface the matrix scope in the header.
- run_benchmark.sh: trap-based cleanup of the temp dir/worktree/venvs (was leaked).
- .gitignore: ignore local benchmark artifacts (bench-out/, *.npz).
Cleanup: trim the synth/bench/test module docstrings and synth's per-constant
comments to their load-bearing facts; collapse generate()'s numpydoc block; drop
the unused load_fixture(verify=...) flag; de-clever _norm01's constant-image path.
Kept the _cell_extent single-source helper (an earlier review's no-overlap fix).
31 tests pass, ruff clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(bench): trim to a lean regression set
Consolidate the over-exhaustive acceptance tests (356 -> 110 lines, 7 tests):
- synth: one invariants test (shape/dtype/contiguous count, no degenerate
objects, shape+size variety, intensity/texture/coloc signal) + determinism +
edges, all at a single representative config instead of parametrising every
check over both matrix corners. Drop the radial-roughness disk-vs-organic
discriminator (eccentricity spread + size variety still catch a broken gen).
- bench: merge the fixture build/load/determinism cases, fold enumerate into the
run integration test, and collapse the compare classification/render cases into
one. Drop the trivial _norm01 and standalone CLI tests.
* style: ruff-format with current ruff (88-col line wraps)
* chore: stop tracking scratch tasks/ and .claude/ (added by mistake)
* refactor(bench): report raw main/head timings, drop noise-band classification
Per single-function resolution: show each function's main vs head time as
mean (min-max) over reps x seeds plus the raw main/head ratio, and let the
maintainer read their function directly. Removes the faster/slower/within-noise
band (which a noisy/sequential run could mislabel) and any normalisation; run.py
now stores just the rep times.
* Revert "Merge pull request #80 from afermg/feat/synth-bench-generator"
This reverts commit 3809218, reversing
changes made to 7f67606.
* feat(bench): synthetic-image PR performance benchmark action
Single revertable unit; re-introduces the benchmark mechanic (reverted #80) with
the harness-source fix folded in.
- cp_measure/synth.py: deterministic synthetic cell-image generator.
- cp_measure/_bench/{fixtures,run,compare}.py: build the (size x count x seed)
fixture matrix, time every get_* head-vs-main, report raw mean (min-max) timings.
- .github/workflows/benchmark.yml + scripts/run_benchmark.sh: label-triggered
two-job workflow. The harness is checked out from main (not the PR head, which a
perf PR does not carry); the PR head is fetched as a worktree, main's synth.py +
_bench/ vendored in, and only cp_measure.core differs between the timed runs.
* Revert "feat(bench): synthetic-image PR performance benchmark action"
* demo: self-contained PR benchmark action (simplified)
Everything lives on this branch (nothing on main). on: pull_request runs the
workflow from the PR branch on every commit, times every public get_* on the PR
head vs main, and posts a sticky comment with the timing table.
- synth.py: minimal generator — n ellipses on a regular grid + a few random
Gaussian blobs per channel.
- _bench/{fixtures,run,compare}.py: build fixtures, time all get_* head-vs-main,
raw timings table.
- .github: single-job pull_request workflow (no label, no pull_request_target) +
head-based driver that vendors the tooling into a main worktree.
- includes the granularity speedup (#76) so the demo table shows a real delta.
* demo: move the whole benchmark into .github/scripts (no package module)
Remove src/cp_measure/{synth.py,_bench/} and their tests. Everything now lives in
.github/scripts/benchmark.py — a single self-contained script (generator + runner
+ comparator); each env regenerates the same seeded inputs, so nothing is shared
or vendored. Table now references the commit and emits one grid per affected
function (speedup >= 1.1x) with image size as rows and object count as columns.
* demo: extend benchmark matrix to 4 sizes x 2 counts (256–2048)
Grid now spans image sizes 256/512/1024/2048 (rows) x object counts 16/64
(cols); bump the job timeout to 45m for the larger sizes.
* demo: median per cell, 3 seeds x 3 counts, dynamic affected-threshold caption
- per-cell aggregate is now the median (over seeds x reps); speedup = median/median
- matrix: sizes 256-2048 (rows) x counts 16/64/256 (cols) x 3 seeds = 36 cells
- caption derives the cutoff from AFFECTED (≥1.1x) instead of hardcoding >1
- job timeout 60m for the larger matrix
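A minimal sketch of the median-per-cell aggregation and the caption derived from `AFFECTED`, under the assumption that each cell holds flat lists of rep times for main and head. Only the `AFFECTED` name and its 1.1 value come from the commit; the function names and the caption wording are hypothetical.

```python
from statistics import median

AFFECTED = 1.1  # cutoff named in the commit; everything else is illustrative


def cell_speedup(main_times, head_times):
    """main/head ratio of the per-cell medians; >1 means head is faster."""
    head = median(head_times)
    return median(main_times) / head if head > 0 else None


def affected_cells(speedups, cutoff=AFFECTED):
    """Keep the (size, count) cells whose speedup meets the cutoff."""
    return {c: s for c, s in speedups.items() if s is not None and s >= cutoff}


def caption(cutoff=AFFECTED):
    """Derive the caption from the constant instead of hardcoding a number."""
    return f"Functions with at least one cell at ≥{cutoff}x speedup"


speedups = {
    (256, 16): cell_speedup([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]),
    (512, 64): cell_speedup([2.0, 2.2, 2.1], [1.0, 1.0, 1.0]),
}
shown = affected_cells(speedups)
```

Because the caption reads the same constant as the filter, changing the cutoff in one place keeps the table and its description in agreement.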
* demo: drop 256px image size (unrealistically small)
Sizes now 512/1024/2048 x counts 16/64/256 x 3 seeds = 27 cells (3x3 grid).
* demo: shift matrix down to 256-1024 (drop slow 2048)
Sizes 256/512/1024 x counts 16/64/256 x 3 seeds — 2048 was too slow per commit.
* demo: report regressions too — flag functions that moved >=1.05x either way
Was speedup>=1.1x only (regression-blind: a slowdown was reported as 'no change').
Now a function is shown if any cell's main/head ratio is >=1.05 (faster) or
<=1/1.05 (slower); the header notes >1 faster / <1 slower.
* demo: slim benchmark.py — drop unused bits
- remove the n_channels param (always 2: ch0 for core, ch0+ch1 for coloc)
- drop 'from __future__ import annotations' (unneeded on the 3.12 runner)
- .gitignore: drop *.npz (no fixture files are written anymore)
* revert(granularity): it has its own PR and was only included here as a test
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Alán F. Muñoz <afer.mg@gmail.com>
Replace `get_zernike`'s `centrosome.zernike.zernike` call (scatters the basis into a full `(H, W, K)` array, scores via whole-image `scipy.ndimage.sum`; both scale with image area) with a pure-numpy `_zernike_scores` that keeps the basis on foreground pixels and segment-sums each moment with `numpy.bincount`. Basis copied from `centrosome.construct_zernike_polynomials`, so results match centrosome to round-off. `_zernike_scores(masks, zernike_indexes, weight=None) -> (real_sums, imag_sums, radii, counts)` is shared with the radial Zernikes (#75). Assumes contiguous 1..N labels (the cp_measure contract).
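The segment-sum idea can be sketched on its own, independent of the Zernike basis. This is an illustrative reduction, not the `_zernike_scores` implementation: `segment_sums` is a hypothetical name, `values` stands in for one real-valued moment image, and contiguous 1..N labels are assumed as stated above. A complex moment would be handled as two such calls, one for the real part and one for the imaginary part, since `numpy.bincount` weights must be real.

```python
import numpy as np


def segment_sums(masks: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Per-label sum of `values`, touching foreground pixels only.

    Assumes integer labels 1..N with 0 as background (illustrative helper).
    """
    fg = masks > 0
    labels = masks[fg]  # 1-D array of labels, foreground pixels only
    n = int(masks.max())
    # minlength=n + 1 reserves slot 0 for the (absent) background; drop it.
    return np.bincount(labels, weights=values[fg], minlength=n + 1)[1:]


masks = np.array([[0, 1, 1],
                  [2, 2, 0]])
values = np.array([[10.0, 1.0, 2.0],
                   [3.0, 4.0, 20.0]])
sums = segment_sums(masks, values)  # label 1 -> 1 + 2, label 2 -> 3 + 4
```

The work is proportional to the number of foreground pixels rather than to the image area times the number of labels, which is where the saving over a whole-image reduction comes from.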