Refactor applications/DynaCell into a reusable benchmark package by alxndrkalinin · Pull Request #404 · mehta-lab/VisCy

alxndrkalinin · 2026-04-07T21:29:49Z

Summary

Makes applications/dynacell into the reusable dynacell package in VisCy.

Splits config responsibilities more cleanly:

VisCy keeps reusable dynacell code plus generic model recipes/examples.
Paper-specific benchmark state such as SEC61B run configs is removed from VisCy and is expected to live in dynacell-paper.

This PR also adds opt-in validation loss support for CellDiff flow-matching so benchmark reruns can log a scalar validation curve without forcing extra validation work on every flow-matching job.

What changed

Config and package reorg

Move reusable configs from applications/dynacell/examples/configs/ into applications/dynacell/configs/.
Keep configs/recipes/ for reusable fragments and configs/examples/ for generic runnable examples.
Refresh generic fit/predict examples for CellDiff, FNet3D, and UNetViT3D, and add generic UNeXt2 fit/predict examples.
Remove committed lightning_logs/, SEC61B paper configs, and the hard-coded hcs_sec61b_3d.yml recipe.
Update the dynacell README, config-discovery tests, and local ignore rules to match the new layout.

Absorb reusable benchmark modules from `dynacell-paper`

Add dynacell.data with path-based manifest/collection/spec loaders and Pydantic schemas.
Add dynacell.reporting with Hydra config, tables, figures, and tests.
Add dynacell.evaluation with configs, spectral-PCC tooling, I/O helpers, segmentation, metrics, and tests.
Add dynacell.preprocess with the reusable config.py and zarr_utils.py utilities.

Package and CLI shape

Extend the dynacell CLI so evaluate and report route to Hydra entry points while fit/predict/test/validate still go through viscy_utils.
Ship Hydra _configs inside the Python package so installed dynacell evaluate/report commands resolve config from package data instead of the source tree.
Make dynacell.__init__ lazy so lightweight imports do not pull in the full training stack.
Add optional dependency extras for eval, report, and preprocess, plus clearer missing-extra install hints from the CLI router.

Runtime, training, and transform fixes

Add overwrite= support to HCSPredictionWriter, including explicit handling/logging when prediction channels already exist.
Adjust VisCyCLI checkpoint hparam precedence so predict/test/validate honor user config over stale checkpoint values, while fit resume still keeps checkpoint training hparams.
Add direct ckpt_path loading to DynacellFlowMatching so predict-time settings come from config rather than checkpoint metadata.
Add opt-in compute_validation_loss to DynacellFlowMatching. When enabled it logs loss/val/<idx> and aggregate loss/validate while preserving epoch-end ODE sample generation; when disabled it keeps the previous cheaper validation path.
Extract a shared _aggregate_validation_losses() helper so UNet and flow-matching use the same weighted aggregation logic.
Skip the strict HCS spatial-shape check when GPU augmentations intentionally handle cropping.
Extend BatchedRandAffined with configurable padding_mode, safe_crop_size, and safe_crop_coverage, plus warnings/tests around unsupported X/Y-rotation guarantees.

Evaluation and reporting follow-up fixes

Fix corr_coef/PCC computation and add regression tests.
Make cached evaluate_model() returns consistent with fresh evaluation output types.
Fix report barplots for mixed-result inputs where some models are missing requested metrics.
Respect use_gpu in evaluation instead of always taking the CUDA path when visible.
Raise FileNotFoundError for missing preprocess configs instead of failing less clearly.

Benchmark schema + launcher (landed earlier on this branch)

Commit the configs/benchmarks/virtual_staining/ layout with shared axes (train_sets/, targets/, model_overlays/, launcher_profiles/, predict_sets/) and per-leaf train/predict files composed via viscy_utils.compose.load_composed_config.
Migrate the CellDiff train + predict configs for ER/mito/nucleus/membrane onto schema leaves; add FNet3D paper, UNetViT3D, and UNeXt2 train leaves for ER.
Add applications/dynacell/tools/submit_benchmark_job.py with --dry-run, --print-script, --print-resolved-config, and --override. It composes a leaf, strips the launcher: + benchmark: reserved keys, freezes the resolved config to {run_root}/resolved/, and renders an sbatch script to {run_root}/slurm/.
Extend _maybe_compose_config in viscy_utils/cli.py so direct dynacell fit -c <leaf> also strips those reserved keys.
Add a ckpt_sha256_12 sidecar to the evaluation cache so repeated lookups skip full-file hashing.

Evaluation pipeline scale-up

DeepFeatureBatcher + upfront precompute_deep_features pass amortize deep-extractor forwards across FOVs and timepoints; per-side DeepFeatureBatcher.push streams raw crops with a single flush per side.
Prediction-side artifact cache (organelle masks + per-cell features for CP/DINOv3/DynaCLR/CELL-DINO) keyed alongside the existing GT cache; pred reruns hit cache after the first eval.
Controlled FOV-level parallelism (runtime.executor=process / fov_workers / threads_per_worker) via spawn-context ProcessPoolExecutor, fcntl GPU serialization lock, layered BLAS/OMP thread caps, and region timers dumped to eval_timing.csv (feat(dynacell): controlled FOV-parallel eval + thread budgeting #426).
Keep compute_pixel_metrics tensors device-resident (zero-copy ascupy/asnumpy handoff to cubic) so spectral-PCC and FSC stay on GPU on long timelapses.

Grouped multi-condition eval + re-eval campaigns

dynacell evaluate-grouped driver loads SuperModel + DINOv3 + DynaCLR + CELL-DINO once and loops over N I/O variants (typically A549 mock/denv/zikv) sharing the same target. Per-condition overlays may override io.* / save.* / runtime.* / cache / limit_positions / force_recompute; model-loading fields are guarded against override at runtime.
submit_benchmark_batch.py --array mode for cluster-scale re-eval submission; re-eval campaign generator emits grouped leaves under configs/benchmarks/virtual_staining/_internal/leaf/grouped/.
CELL-DINO GT cache warmable via dynacell precompute-gt; compute_feature_metrics=true is now the default in eval configs.

Feature metrics migration to `torch-fidelity` + linear probes

Replace the in-tree FID/KID/PRC math with torch-fidelity's eigvals-based composition; ship per-(FOV, t) FID/KID/median-cosine and dataset-level FID/KID/PRC/MIND/cosine (feat(dynacell): migrate eval metrics to torch-fidelity, add probes #423).
Add FOV-stratified linear-probe AUROC (paired [gt; pred] with GroupKFold and per-fold MADScaler fit, no leakage) + indistinguishability mapping; cross-condition probe scaffolding.
CP regionprops track gets variance + correlation pruning (vendored pycytominer math) before z-score and metric computation; per-(FOV, t) drop-target-zero + z-score for stability with few cells.
Short-circuit FID/KID to NaN at N < 2 cells; auto-shrink KID subset size for small cohorts.

3D pix2pix GAN baseline + modernization

New 3D pix2pix model (pix2pix3d_unetvit) with UNetViT3D generator + PatchGAN discriminator; benchmark train + predict leaves added under the existing schema (feat(dynacell): add 3D pix2pix GAN baseline (pix2pix3d_unetvit) #425).
StyleGAN2/R3GAN-style training recipe modernization: gradient penalty, EMA weights, refreshed LR schedules; default H200 SLURM constraint because the GAN OOMs 80 GB cards under DDP (feat(dynacell): modernize pix2pix3d training recipe (StyleGAN2/R3GAN-style) #428).

Topology / trainer-recipe ownership cleanup

Four commits (9734d07 → f9b8f1e → 5b7eaae → 3fdb7cf) untangle three layers that were previously mingled in recipes/trainer/fit_*.yml:

Runtime profile renamed. runtime_single_gpu.yml → runtime_shared.yml: the file's content (srun + PYTHONUNBUFFERED/NCCL_DEBUG/PYTHONFAULTHANDLER env vars) has nothing single-GPU-specific, and every leaf — including the 4-GPU UNeXt2 leaf — composed it. Rename + 14 leaf-base updates + schema-doc references.
New topology layer (recipes/topology/single_gpu.yml + ddp_4gpu.yml) owns trainer.accelerator/devices/strategy/num_nodes. Duplicated independently under applications/dynacell/configs/recipes/topology/ and applications/cytoland/examples/configs/recipes/topology/ because CLAUDE.md forbids cross-application imports.
Unified trainer recipes — recipes/trainer/fit.yml + predict.yml per app, replacing fit_1gpu/fit_4gpu/fit_fm_4gpu/predict_gpu. Mode invariants only (seed, log cadence, callbacks, WandbLogger class pinned with per-app project). Topology and precision are owned by the topology recipes and model overlays respectively. Every benchmark model overlay and example leaf now composes [fit.yml + topology/*.yml] and sets precision + max_epochs explicitly.
Hardware profiles (hardware_h200_single, hardware_gpu_any_long, hardware_4gpu) drop their trainer: block; they now own only launcher.sbatch.*.
Intentional behavior deltas (spelled out, not masked): strategy: ddp → auto on every single-GPU leaf (Lightning-equivalent at devices=1, just intent-honest); dynacell examples fnet3d/unetvit3d/unext2/fit.yml and benchmark unext2.yml now pin WandbLogger project=dynacell instead of falling through to Lightning's default TensorBoard; fit_1gpu-derived paths move from save_top_k: 4 to save_top_k: 5.
Preserved via leaf-body overrides: FNet3D paper keeps precision: 32-true; examples/celldiff/fit.yml keeps FM-style epoch-based checkpointing (save_top_k: -1, every_n_epochs: 10, no monitor); cytoland FCMAE pretraining keeps strategy: ddp_find_unused_parameters_true; cytoland→dynacell bridge configs keep project: dynacell override over cytoland's default.
LEGACY deletion: applications/dynacell/tools/LEGACY/examples_configs/ (pre-schema CellDiff configs) is removed outright per CLAUDE.md's "avoid backwards-compatibility hacks" rule. The equivalence tests that composed LEGACY (test_*_leaf_matches_legacy, test_fnet3d_paper_leaf_matches_ran_config, test_byte_equivalence_sec61b_train_leaf) are replaced with forward-looking composition sanity tests.
Plan + review trail: plan is in ~/.claude/plans/vectorized-sleeping-clock.md; commits ran through /review-plan and /verify-plan subagent rounds before landing.

Dependency and packaging updates

Pin microssim to a compatible Git revision and refresh uv.lock so the new dynacell[eval] extra resolves with the current workspace.
Allow direct references in hatch metadata so the workspace can build/package those dependency pins cleanly.

Why

This is the VisCy side of the dynacell-paper split:

dynacell in VisCy becomes the reusable runtime.
dynacell-paper owns benchmark-instance configs, orchestration, and manuscript-specific assets.

That gives us a cleaner boundary for future work like multi-experiment collections and benchmark-specific runners without keeping paper state in the shared application package.

The flow-matching validation-loss change is intentionally opt-in because it adds extra validation compute and the resulting metric is still stochastic; default flow-matching checkpointing remains epoch-based.

The topology / trainer-recipe cleanup at the end removes a silent-drop-DDP trap: before this change, pairing a mismatched trainer recipe and benchmark hardware profile let the hardware profile's trainer.devices win (both layers redundantly set it), which could silently run a DDP training on 1 GPU. Topology is now the single source of truth.

Verification

uvx ruff check applications/dynacell/
uvx ruff format --check applications/dynacell/
.venv/bin/python -m pytest applications/dynacell/tests -q
.venv/bin/python -m pytest applications/dynacell/tests/test_data_manifests.py applications/dynacell/tests/test_reporting_tables.py applications/dynacell/tests/test_reporting_tables_extended.py applications/dynacell/tests/test_reporting_figures.py applications/dynacell/tests/test_evaluation_metrics.py applications/dynacell/tests/test_evaluation_pipeline.py applications/dynacell/tests/test_evaluation_io.py applications/dynacell/tests/test_preprocess_config.py applications/dynacell/tests/test_preprocess_zarr_utils.py applications/dynacell/tests/test_cli_routing.py -q
.venv/bin/python -m pytest applications/dynacell/tests/test_engine.py -q
.venv/bin/python -m pytest applications/dynacell/tests/test_training_integration.py::test_celldiff_fm_validation_loss_keeps_generation applications/dynacell/tests/test_training_integration.py::test_celldiff_fm_warmup_cosine_fast_dev_run applications/dynacell/tests/test_training_integration.py::test_celldiff_fm_constant_schedule_fast_dev_run -q
.venv/bin/dynacell fit --print_config -c applications/dynacell/configs/examples/unext2/fit.yml
uv run pytest applications/dynacell/tests/test_benchmark_config_composition.py applications/dynacell/tests/test_submit_benchmark_job.py -v (benchmark schema + submit-tool tests)
Topology/trainer cleanup end-to-end:
- Composition snapshot of 35 touched leaves before vs after the cleanup; only expected intentional deltas (strategy ddp → auto at devices=1, WandbLogger added where previously default TB).
- uv run dynacell fit -c <benchmark-unext2-leaf> --trainer.fast_dev_run=true --trainer.devices=1 --trainer.strategy=auto ran one training step cleanly on an A40 interactive node (loss 1.048).
- submit_benchmark_job.py <benchmark-unext2-leaf> queued a real 4-GPU DDP job (31122607) via the new schema path.

Copilot

Pull request overview

This PR refactors applications/dynacell from a thin training app into a reusable dynacell benchmark package, reorganizing configs and bringing evaluation/reporting/preprocess/data modules (with tests) into VisCy while updating shared runtime utilities to support the new package shape.

Changes:

Reorganize Dynacell configs into configs/recipes/ + configs/examples/, remove paper-/HPC-specific SEC61B artifacts, and update docs/tests for config discovery.
Add reusable dynacell.data, dynacell.evaluation, dynacell.reporting, and dynacell.preprocess modules plus substantial test coverage.
Update shared VisCy components (CLI checkpoint merging, prediction writer overwrite behavior, affine augmentation safety options, HCS training shape checks) to support the new workflows.

Reviewed changes

Copilot reviewed 74 out of 85 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
packages/viscy-utils/src/viscy_utils/cli.py	Adjust checkpoint-hparam vs user-config precedence during LightningCLI parsing.
packages/viscy-utils/src/viscy_utils/callbacks/prediction_writer.py	Add `overwrite` option to allow reusing existing output stores.
packages/viscy-transforms/src/viscy_transforms/_affine.py	Add configurable `padding_mode` and safe-crop scale clamping for GPU affine aug.
packages/viscy-transforms/tests/test_affine.py	Extend affine tests to cover scale-floor computation and safe-crop behavior.
packages/viscy-data/src/viscy_data/hcs.py	Skip strict training shape check when GPU augmentations handle cropping.
applications/dynacell/src/dynacell/main.py	Route `evaluate`/`report` to Hydra entry points; keep fit/predict via `viscy_utils`.
applications/dynacell/src/dynacell/engine.py	Add `ckpt_path` init-time loading to decouple predict-time settings from ckpt hparams.
applications/dynacell/src/dynacell/data/_yaml.py	Add shared OmegaConf→Pydantic YAML loader.
applications/dynacell/src/dynacell/data/manifests.py	Add dataset manifest + split schemas and path-based loaders.
applications/dynacell/src/dynacell/data/collections.py	Add frozen benchmark collection schemas and loader.
applications/dynacell/src/dynacell/data/specs.py	Add benchmark-spec schema/loader for tying pipeline stages together.
applications/dynacell/src/dynacell/data/init.py	Re-export dynacell.data public API.
applications/dynacell/src/dynacell/evaluation/init.py	Initialize evaluation package.
applications/dynacell/src/dynacell/evaluation/pipeline.py	Add Hydra-based evaluation pipeline with caching + output writing.
applications/dynacell/src/dynacell/evaluation/metrics.py	Add pixel/mask/feature metric computations (optional deps guarded).
applications/dynacell/src/dynacell/evaluation/utils.py	Add feature extractors + plotting and distribution-distance helpers.
applications/dynacell/src/dynacell/evaluation/io.py	Add eval I/O dispatch helpers (Zarr vs TIFF-like) and preprocessing utilities.
applications/dynacell/src/dynacell/evaluation/segmentation.py	Add segmentation workflows with optional dependency guards.
applications/dynacell/src/dynacell/evaluation/formatting.py	Add DataFrame formatting utilities for evaluation outputs.
applications/dynacell/src/dynacell/evaluation/torch_ssim.py	Add torch-native SSIM implementation used by metrics.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/init.py	Initialize spectral_pcc subpackage.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/plot_shading_analysis.py	Add analysis plotting script for shading artifact comparison.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/plot_combined.py	Add combined metrics plotting + summary script.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/diagnostic_real.py	Add Hydra diagnostic workflow for real-data spectra/metrics analysis.
applications/dynacell/src/dynacell/reporting/init.py	Re-export reporting public API.
applications/dynacell/src/dynacell/reporting/tables.py	Add table aggregation + LaTeX rendering utilities.
applications/dynacell/src/dynacell/reporting/figures.py	Add matplotlib figure generation for aggregated metrics.
applications/dynacell/src/dynacell/reporting/cli.py	Add Hydra reporting CLI entry point and config path wiring.
applications/dynacell/src/dynacell/preprocess/init.py	Re-export preprocess utilities.
applications/dynacell/src/dynacell/preprocess/config.py	Add OmegaConf-based preprocess config loader (with fallback).
applications/dynacell/src/dynacell/preprocess/zarr_utils.py	Add OME-Zarr rewrite utility (rechunk/reshard).
applications/dynacell/configs/examples/unext2/fit.yml	Add generic UNeXt2 fit example in new config layout.
applications/dynacell/configs/examples/unext2/predict.yml	Add generic UNeXt2 predict example with init-time `ckpt_path`.
applications/dynacell/configs/examples/fnet3d/fit.yml	Update FNet3D fit example paths to new layout.
applications/dynacell/configs/examples/fnet3d/predict.yml	Add generic FNet3D predict example with init-time `ckpt_path`.
applications/dynacell/configs/examples/unetvit3d/fit.yml	Update UNetViT3D fit example paths to new layout.
applications/dynacell/configs/examples/unetvit3d/predict.yml	Update UNetViT3D predict example paths and use init-time `ckpt_path`.
applications/dynacell/configs/examples/celldiff/fit.yml	Update CellDiff fit example paths to new layout.
applications/dynacell/configs/examples/celldiff/predict.yml	Add CellDiff predict example with init-time `ckpt_path` and overlap config.
applications/dynacell/configs/recipes/trainer/fit_1gpu.yml	Add reusable 1-GPU trainer recipe.
applications/dynacell/configs/recipes/trainer/fit_4gpu.yml	Add reusable 4-GPU trainer recipe.
applications/dynacell/configs/recipes/trainer/fit_fm_4gpu.yml	Add reusable 4-GPU flow-matching trainer recipe.
applications/dynacell/configs/recipes/trainer/predict_gpu.yml	Remove `ckpt_path` from trainer recipe (moved to model init args).
applications/dynacell/configs/recipes/models/unext2_3d.yml	Add reusable UNeXt2 model recipe (z=15).
applications/dynacell/configs/recipes/models/unext2_3d_z8.yml	Add reusable UNeXt2 model recipe (z=8).
applications/dynacell/configs/recipes/models/fnet3d.yml	Add reusable FNet3D model recipe.
applications/dynacell/configs/recipes/models/fnet3d_z8.yml	Add reusable FNet3D model recipe (z=8).
applications/dynacell/configs/recipes/models/unetvit3d.yml	Add reusable UNetViT3D model recipe.
applications/dynacell/configs/recipes/models/celldiff_fm.yml	Add reusable CellDiff flow-matching model recipe.
applications/dynacell/configs/recipes/modes/spotlight.yml	Add Spotlight mode recipe for foreground-aware loss.
applications/dynacell/configs/recipes/data/hcs_phase_fluor_3d.yml	Adjust data recipe defaults (e.g., preload false).
applications/dynacell/configs/evaluate/eval.yaml	Add Hydra eval base config.
applications/dynacell/configs/evaluate/spectral_pcc/base.yaml	Add Hydra config for spectral PCC workflows.
applications/dynacell/configs/evaluate/spectral_pcc/simulate.yaml	Add simulation config for metric validation.
applications/dynacell/configs/evaluate/spectral_pcc/diagnostic_real.yaml	Add real-data diagnostic config.
applications/dynacell/configs/report/base.yaml	Add Hydra report config.
applications/dynacell/tests/test_training_integration.py	Update config discovery test to new `configs/examples` location.
applications/dynacell/tests/test_cli_routing.py	Add tests for CLI routing between Lightning and Hydra subcommands.
applications/dynacell/tests/test_data_manifests.py	Add tests for dataset manifest/collection/spec schemas and loaders.
applications/dynacell/tests/test_evaluation_io.py	Add tests for eval I/O dispatching based on path type.
applications/dynacell/tests/test_evaluation_metrics.py	Add regression tests for shared-scale pixel metrics behavior.
applications/dynacell/tests/test_evaluation_pipeline.py	Add caching regression test for evaluation pipeline.
applications/dynacell/tests/test_reporting_tables.py	Add tests for reporting tables loading/aggregation/LaTeX output.
applications/dynacell/tests/test_reporting_tables_extended.py	Add extended reporting table tests (caption/label, lower-is-better).
applications/dynacell/tests/test_reporting_figures.py	Add tests for figure creation/saving and empty-data behavior.
applications/dynacell/tests/test_preprocess_config.py	Add tests for preprocess config loader fallback behavior.
applications/dynacell/tests/test_preprocess_zarr_utils.py	Add tests for zarr rewrite correctness (data, chunks, metadata, shards).
applications/dynacell/README.md	Update usage + config layout documentation; remove paper-specific SEC61B instructions.
applications/dynacell/.gitignore	Ignore runtime artifacts (lightning_logs, outputs, pycache).
applications/dynacell/pyproject.toml	Add optional extras for eval/report/preprocess and allow direct references.
applications/dynacell/examples/configs/sec61b/run_unext2_continue.slurm	Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/run_unext2.slurm	Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/run_fnet3d_paper.slurm	Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/run_fnet3d.slurm	Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/fit_unext2.yml	Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/sec61b/fit_fnet3d_paper.yml	Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/sec61b/fit_fnet3d.yml	Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/sec61b/fit_celldiff.yml	Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/recipes/data/hcs_sec61b_3d.yml	Remove hard-coded SEC61B data recipe.
applications/dynacell/examples/configs/fnet3d/predict.yml	Remove old predict example path (superseded by new configs layout).
applications/dynacell/examples/configs/celldiff/predict.yml	Remove old predict example path (superseded by new configs layout).
CLAUDE.md	Update repo/dev workflow documentation and conventions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 78 out of 91 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Addresses two PR #404 review findings: 1. `_CacheContext._manifest_dirty` was mutated directly from helper call sites, leaking implementation detail. Adds `mark_manifest_dirty` and `consume_manifest_dirty` methods and routes every external touch through them. Only the dataclass itself now references the private field. 2. `resolve_dynaclr_encoder_cfg` used `except Exception` to detect a missing nested config key — wider than needed and against CLAUDE.md guidance. Replaced with `OmegaConf.select(..., default=None)`, which handles missing keys natively without a try/except. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Addresses a PR #404 review finding: the split GT/pred feature API had structural tests (empty inputs, column-drop, shape mismatch) but no pinned-value regression guard. Adds two tests that seed deterministic synthetic inputs and assert exact output values for CP_FID / CP_KID / CP_Median_Cosine_Similarity and the DINOv3 equivalents. If anyone later changes the column-drop, per-side z-score, or FID/KID/cosine pairing logic — or a dependency shifts numerics — these tests will fail rather than silently drifting metrics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 131 out of 143 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Restructure applications/dynacell/configs/benchmarks/virtual_staining/ so train, predict, and eval leaves for the same benchmark cell live under one directory `<org>/<train_set>/<model>/`, with shared building blocks split by consumer: - `shared/model/` — train/predict building blocks (train_sets, predict_sets, targets, model_overlays, launcher_profiles) - `shared/eval/` — HPC-bound Hydra eval groups (target, feature_extractor/dynaclr) - `<org>/<train_set>/<model>/{train.yml, predict/<predset>.yml, eval/<predset>.yaml}` Fold `configs/evaluation/` into the benchmark tree and delete the sibling root. The eval selector renames from `benchmark=<path>` to `leaf=<path>` to reflect the new filesystem model; leaves are addressed through a committed symlink tree at `virtual_staining/leaf/` because Hydra's group resolution requires a physical `leaf/` prefix. Eval leaves keep `.yaml` (Hydra only discovers `.yaml` for groups); train and predict leaves stay `.yml`. `dynacell.__main__` now injects two `hydra.searchpath` roots — the benchmark tree (for `leaf/`) and `shared/eval/` (for `target/` and `feature_extractor/dynaclr/`). The `_EXTERNAL_CONFIGS_SUBPATH` constant becomes `_EXTERNAL_SEARCHPATHS`, and the helper is renamed to `_external_configs_dirs` (plural, returning a list). Also fixes a pre-existing bug: the three unetvit3d train leaves under mito/nucleus/membrane referenced a non-existent `runtime_single_gpu.yml` launcher profile; they now use `runtime_shared.yml`. Model-overlay `base:` entries targeting `recipes/` gain an extra `..` segment to account for the new `shared/model/` depth. Tests: - test_benchmark_config_composition.py: path literals, three new (unetvit3d × {mito,nucleus,membrane}) pairs, new test_eval_leaf_symlink_resolves parametrised over all 8 eval leaves. - test_cli_routing.py: three `_external_configs_dir` patches renamed (plural) and argv selector strings updated to `leaf=`. - test_submit_benchmark_job.py: parametrised leaf paths. Docs: virtual_staining/README.md (full rewrite to match new tree), evaluation/README.md (groups table, external-users section, benchmark eval leaves section), UNEXT2_VS_FCMAE_CLASSES.md (stale path), dynacell/ README.md (sbatch example path), BENCHMARK_CONFIG_SCHEMA.md (Hydra search-path contract note about the `leaf/` symlink adapter). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l/ move The preceding refactor commit renamed train and predict leaves to <org>/<train_set>/<model>/{train.yml, predict/<predset>.yml} but the content updates that prepend the shared/model/ segment to every base: line were unstaged when that commit landed. This catches the 25 train and predict leaves up with the new shared/model/ layout, and tightens a prose comment in fnet3d_paper_fit.yml that still referenced the pre-refactor shared/targets/ path. Without this fix a fresh checkout of the reorg commit has train and predict leaves pointing at shared/train_sets/, shared/targets/, etc. — paths that no longer exist — so `load_composed_config` would raise FileNotFoundError on every leaf. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Trim the WHAT-narration from docstrings and comments in dynacell.__main__: drop the redundant "two roots are injected" enumeration (the tuple is self-describing), fix the "repo root" claim in _external_configs_dirs (the walk stops at the app's pyproject.toml, not the workspace root), and drop the "joined with commas" line from _inject_external_configs' docstring. Bind ``p = parent / sub`` via walrus inside the existence filter so the path is computed once instead of twice per subpath. In the composition tests, delete the ``EVAL_LEAVES = PREDICT_LEAVES`` alias (the aliased list was the same object, causing silent coupling) and drive the symlink-integrity test off ``PREDICT_LEAVES`` directly. Drop the ``@package _global_`` header assertion from the symlink test — that check belongs to composition, not to symlink integrity. Add a ``REPO_ROOT`` sanity assert so drift of the ``parents[3]`` magic count fails loudly. Tighten ``TestInjectExternalConfigs`` class docstring to a single line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the "in-package / repo checkout" shorthand in the config-groups table with concrete source paths, so readers see exactly where each group lives after the reorg: shared/eval/ for HPC-bound groups, leaf/ for the eval-leaf symlink tree, _configs/ for packaged schema. Tighten the "Running an evaluation" intro so it no longer implies all groups live under _configs/, and spell out the two hydra.searchpath roots dynacell.__main__ injects. Add a footgun entry for external users about Hydra's .yaml-only group resolution — saving a group as .yml makes the selector silently fail with MissingConfigException. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Keep the top level of virtual_staining/ biology-only: `er/`, `membrane/`, `mito/`, `nucleus/` plus README.md. Everything else — shared building blocks and the leaf/ Hydra symlink adapter — now lives under a single _internal/ directory whose leading underscore signals "implementation detail; don't browse here for science." Layout is now: virtual_staining/ <org>/<train_set>/<model>/{train.yml, predict/, eval/} _internal/ shared/ model/ # train/predict building blocks eval/ # target/, feature_extractor/dynaclr/ leaf/ # symlink tree aliasing canonical eval leaves dynacell.__main__ injects two hydra.searchpath roots under _internal/ (the tree root for leaf/, and shared/eval/ for target + dynaclr) — same cardinality as before, just under the hidden support tree. Train/predict leaves get `_internal/` prepended to every base: path. Model overlays gain one more `../` to reach configs/recipes/ (now 6 segments deep instead of 5). The 8 leaf/ symlinks are recreated with 6 relative-dot segments (was 5 at virtual_staining/leaf/, now 6 at virtual_staining/_internal/leaf/). Docs refreshed across BENCHMARK_CONFIG_SCHEMA.md, virtual_staining/ README.md, and evaluation/README.md: every layout tree, base: example, group-source column, and search-path reference points at _internal/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related bugs in predict-path tiling: P1. DynacellFlowMatching.predict_step routed predict_method='sliding_window' to CELLDiff3DVS.generate_sliding_window(), which is the non-overlapping tiler and does not consume predict_overlap. Any config that set both predict_method='sliding_window' and predict_overlap=[...] silently got non-overlapping tiles, producing different predictions than the user expected. Now raise ValueError when sliding_window is combined with a non-zero overlap, pointing users to predict_method='iterative' (the overlap-anchored velocity tiler) instead. Also refresh the docstring to spell out which methods consume predict_overlap. Fix the membrane predict leaf to match its peers (er, mito): predict_method=iterative. Example config at configs/examples/celldiff/predict.yml gets the same fix plus an inline comment clarifying which methods overlap. P2. DynacellUNet.predict_sliding_window allocated both accumulator buffers (prediction_sum, prediction_count) with zeros_like(source), so their channel dimension was always in_channels. For models whose out_channels != in_channels (common: 1 phase in -> 2 target out), the first `prediction_sum[slicer] += patch_out` raised a broadcast error. Allocate the accumulators lazily from the first patch output so they take the model's actual out_channels and dtype. Regression tests added for both paths: - sliding_window + non-zero overlap must raise ValueError - predict_sliding_window returns the correct shape when out_channels=2 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The earlier "always append" strategy broke argparse when users placed diagnostic flags (-c job, --info, --help, ...) after positional overrides: $ dynacell evaluate leaf=er/... -c job dynacell: error: unrecognized arguments: hydra.searchpath=[file://...] Hydra's argparse declares ``overrides`` as a single positional with nargs="*". Once that nargs run is interrupted by a flag, any later positional is reported unrecognized. Appending the injected token at the end scatters a positional across the flag in the override-first layout; prepending scatters it across the flag in the flag-first layout. Fix: insert the token adjacent to an existing positional override when one is present, so positionals stay contiguous whichever side the flags sit on. Fall back to append when the user passed no positional overrides (e.g. bare --help). Three regression tests cover the three layouts: flag-leading, flag-trailing, and no-positional. The former "appends always" test is replaced by the layout-specific pair. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tighten the "Running an evaluation" lead so readers see up front that repo-checkout groups live under virtual_staining/_internal/ (the hidden support tree), not loose under virtual_staining/. The detailed bullet list below already spelled this out; the intro now matches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 191 out of 201 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Lightning's SLURMEnvironment rejects --ntasks at runtime (slurm.py:_validate_srun_variables) and demands --ntasks-per-node. Yesterday's fix emitted the wrong directive; all four FCMAE jobs failed within 50 s on the cluster. Switch the sbatch generator, hardware profiles, pre-submit validator, and invariant test to use ntasks_per_node throughout. The enforced invariant is now devices == ntasks_per_node and gpus == nodes × devices. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Gated HF repos (DINOv3) need per-user access grants, but the downloaded weights are identical. Before this change every team member who ran `dynacell evaluate compute_feature_metrics=true` re-downloaded ~2 GB of DINOv3 weights into their own `~/.cache/huggingface/`. Even ignoring the network cost, only one team member has HF gate access right now — every other member's runs 403 at model fetch. Fix: point HF_HOME at a team-shared project-storage dir. First user with gated access downloads once; everyone else reads the shared copy. Two layers so the behavior is automatic regardless of invocation: 1. `dynacell.__main__._maybe_set_shared_hf_cache()` runs before Hydra subcommands (evaluate, precompute-gt, report). It defaults HF_HOME to `/hpc/projects/comp.micro/virtual_staining/models/dynacell/evaluation/hf_cache/` when three conditions hold: the user hasn't set HF_HOME themselves, we're on a repo checkout (same signal as external configs), and the shared dir exists on this machine. Wheel installs and non-HPC machines fall through to the normal ~/.cache/huggingface default. 2. `runtime_shared.yml` also sets HF_HOME for sbatch-rendered fit / predict jobs. Those jobs don't currently load HF models, but having the env var declared in the launcher profile future-proofs any later workflow that does and keeps the team convention visible in config. Four regression tests cover the auto-setter: user-set wins, wheel install skips, missing shared dir skips, repo checkout + existing dir sets. README documents the mechanism and first-time download contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three defaults were required on every eval / precompute-gt invocation even though they are effectively fixed for our benchmark set: feature_extractor/dinov3=lvd1689m (only DINOv3 we evaluate with) feature_extractor/dynaclr=default (team DynaCLR checkpoint) io.gt_cache_dir=/hpc/.../eval_cache/<TARGET> (one dir per target) Hoist them so the minimal command dynacell precompute-gt target=er_sec61b predict_set=ipsc_confocal composes everything needed for all four cache families, and dynacell evaluate leaf=<...> carries the same defaults without the redundant override directives. - eval.yaml defaults list switches optional feature_extractor/dinov3: null -> lvd1689m optional feature_extractor/dynaclr: null -> default The `optional` keyword means wheel installs that don't ship dynaclr=default (it lives in the repo's shared/eval/ tree) silently skip the default — wheel users supply their own via --config-dir. - Each target YAML now carries io.gt_cache_dir for its target's cache path (SEC61B, TOMM20, nucleus, membrane). - The 8 canonical eval leaves drop the redundant override /feature_extractor/dinov3: lvd1689m override /feature_extractor/dynaclr: default and the per-leaf io.gt_cache_dir — all now inherited. - README updated: "Enable feature metrics" and "Priming the cache" examples show the one-line minimal forms. To swap variants at call-time nothing changes: dynacell evaluate ... feature_extractor/dinov3=<other_variant> dynacell evaluate ... feature_extractor.dynaclr.checkpoint=/other.ckpt dynacell evaluate ... feature_extractor/dinov3=null # disable Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three issues from /simplify agents: - Hardcoded HPC path in _SHARED_HF_CACHE makes the module CZ Biohub- specific; other sites couldn't override without editing source. Read DYNACELL_SHARED_HF_CACHE at module load and fall back to the CZ Biohub default only when the env var is absent. Extension story is now: set DYNACELL_SHARED_HF_CACHE in your shell / sbatch env block and the rest of the mechanism just works. - Docstring on _maybe_set_shared_hf_cache named the sibling helper _external_configs_dirs; rephrase in semantic terms ("external Hydra searchpaths resolve") so the docstring doesn't rot if the detection mechanism ever changes. - Four test methods in TestMaybeSetSharedHfCache imported `os` inline instead of at module top (violates CLAUDE.md). Plus a redundant monkeypatch.delenv() cleanup at the end of the last test that pytest's monkeypatch fixture already handles on teardown. Lift `import os` to module top, drop the inline imports, drop the redundant cleanup. Skipped /simplify nitpicks: - functools.cache on _external_configs_dirs (6-8 stat calls saved per CLI invocation; not worth the cognitive cost of a cache that has to coexist with tests that patch the function reference) - Parametrize the 4 tests (4 is small enough to leave explicit) - Inline comment on opaque group names (marginal signal) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Setting HF_HOME to the team-shared project-storage dir inadvertently broke per-user gated-repo authentication: HF_HOME relocates the entire HF directory tree including the auth token file. The first precompute attempt with DINOv3 enabled came back 401 "Access to model ... is restricted. You must have access to it and be authenticated to access it" — not a missing grant (the user had been granted access, which would have returned 403), but a missing token. HF was looking under $HF_HOME/token, where no token file lives. Switch to HF_HUB_CACHE, which only relocates the weights/datasets cache and leaves auth state at its default ~/.cache/huggingface/token. Each user's personal token now enforces gate ACLs on their own account; the cache of downloaded weights is shared on project storage as intended. The _maybe_set_shared_hf_cache helper and its four tests flip from HF_HOME → HF_HUB_CACHE. The runtime_shared.yml launcher env block does the same. README gains a paragraph explaining *why* we chose HF_HUB_CACHE over HF_HOME, and clarifies that subsequent reads from the shared cache are plain disk reads that don't round-trip to HF and so don't need a token at all. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous 4-GPU FCMAE runs OOM-killed before first training step. On gpu-c-1 sstat showed MaxVMSize=103G/rank at startup — 4 ranks × 103G ≈ 412G of virtual memory, plus mmap_preload /dev/shm staging, pushed past the 512G cgroup cap. num_workers=8 per rank meant 32 dataloader workers inflating per-rank VM before the first step. Halving to 4 cuts startup VM roughly in half while keeping prefetch above the single-worker bottleneck. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Collapse the 5-kwarg ``(gt_plate_path, gt_channel_name, ..., pred_plate_path, pred_channel_name)`` signatures on ``check_cache_identity`` and ``seed_cache_identity`` into a single ``source: Literal["gt", "pred"] | None`` + ``plate_path`` + ``channel_name`` triple. The paired-None runtime guards collapse to one check and the public surface no longer encodes the GT/pred axis in parameter names. The new ``seed_cache_identity`` accepts only one source per call; the test that previously seeded both sides in a single invocation splits into two sequential calls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Collapse ``init_cache_context`` and ``init_pred_cache_context`` into one function with a required ``side: Literal["gt", "pred"]`` kwarg. The two near-duplicate ~100-LOC bodies become one ~80-LOC body; ``side`` selects which manifest key, which cache dir, and which identity fields to check. The cross-side contamination guard (``manifest.get(other_side) is not None``) now runs in both directions. Today only the pred path rejected GT-contaminated manifests; the GT path will now also reject pred-side contamination — a config error in either direction fails fast instead of silently returning a half-valid cache. ``side`` has no default — callers must spell out their intent. The two GT call sites (``pipeline.py:153`` and ``precompute_cli.py:82``) and the one pred call site (``pipeline.py:160``) all pass it explicitly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Drop six 1-line trampolines and ``_ensure_cache_side``. Export three side-agnostic helpers instead — ``fov_masks``, ``fov_cp_features``, ``fov_deep_features`` — each reading ``ctx.side`` for the per-side artifact label, force-recompute key, and manifest source tag. The shared private helpers (``_fov_masks``, ``_fov_deep_features``, ``_load_or_compute_feature_timepoints``) were already side-agnostic; this just removes the redundant public layer above them. The compute backends (``cp_target_regionprops``, ``deep_target_features``) are body-identical to their pred-side twins — only their parameter name differs. The new ``fov_cp_features`` and ``fov_deep_features`` call the target-side functions on both sides as a stop-gap; a follow-up commit drops the unused pred-side variants and renames the survivors to ``cp_regionprops`` / ``deep_features``. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The three ``*_target_*`` / ``*_pred_*`` metric pairs in ``metrics.py`` are body-identical — they only differ by parameter name. Drop the duplicates and rename the survivors: - ``cp_target_regionprops`` + ``cp_pred_regionprops`` → ``cp_regionprops`` - ``build_target_crops`` + ``build_pred_crops`` → ``build_crops`` - ``deep_target_features`` + ``deep_pred_features`` → ``deep_features`` The argument becomes ``image`` (was ``target`` / ``prediction``); the ``Shape mismatch:`` error now reads ``image`` in its message. The eval pipeline's cache-disabled fallback in ``pipeline.py`` and the ``fov_cp_features`` / ``fov_deep_features`` compute backends all switch to the unified names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Hoist the per-FOV pred-feature dispatch into ``_fov_pred_features_per_t``, which returns a ``{kind: list[T]}`` dict regardless of whether the pred cache is enabled. The per-timepoint reads inside the inner loop become a straight indexing pass with no ``if pred_cache_ctx.enabled`` branch. The cache-disabled branch keeps the crop-sharing optimization (one set of 2-D crops per timepoint, reused across DINOv3/DynaCLR/CellDINO) so the max-projection + cell-iteration + crop construction still runs once per (FOV, t) instead of once per backbone. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ties Four functions in ``pipeline_cache.py`` independently recomputed ``label_prefix = "pred_" if ctx.side == "pred" else ""`` and ``source = {"source": "prediction"} if ctx.side == "pred" else {}`` from ``ctx.side``. Lift both into ``_CacheContext.label_prefix`` and ``source_tag`` properties so the side-encoding lives next to ``side`` and the call sites just read it. Two more ternaries in ``init_cache_context`` (``cache_dir_key``, ``other_side``, and the ``io.{gt,pred}_path`` / ``io.{gt,pred}_channel_name`` selection) collapse into ``_SIDE_IO_KEYS`` and ``_OTHER_SIDE`` lookup tables. Net: same behavior, one place to edit per axis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…point The cache-disabled branch of ``_fov_pred_features_per_t`` previously materialized a ``t_count``-long ``crops`` list, then walked it three more times in separate comprehensions for DINOv3, DynaCLR, and CellDINO. All ``t_count`` 2-D crop tensors sat in RAM until the function returned. Fuse into one loop: build crops for one timepoint, fan out to every deep backbone, then drop them before the next timepoint. Same total work; peak crops memory drops from ``t_count`` copies to one. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Cross-FOV/cross-timepoint accumulator for deep features that batches DINOv3/DynaCLR/CellDINO forwards across the whole dataset, splits results by per-(pos, t) row counts, and writes through the existing zarr cache contract. Force flags are snapshotted at construction so mid-scan flushes don't flip semantics for later positions; ctx.force is only cleared in drain() for kinds actually flushed. Lock-tag construction for the per-(family, position) write lock is hoisted to _kind_lock_tag so _load_or_compute_feature_timepoints and _flush_kind cannot drift. No call sites yet; per-FOV path (fov_deep_features) keeps working for the cache-disabled dev path. The wiring lands in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three tests for the cross-FOV batched precompute path: - writes_all_slots: every (pos, t) lands in the cache with the expected (n_cells, feature_dim) shape. - matches_per_fov: precompute_deep_features and fov_deep_features produce bit-exact identical cache contents (golden equivalence with a deterministic stub extractor). - skips_cached_slots: pre-seeded (pos, t) slots are not re-extracted and the original cache contents survive. Helpers (_write_tiny_hcs_plate, _write_tiny_seg_plate, _BatchedConstantExtractor) are local to this test module per the project's no-conftest-import rule. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Run precompute_deep_features once over the whole position set before the per-FOV loop so DINOv3/DynaCLR/CellDINO forwards amortize across the dataset instead of paying Python/CUDA dispatch per (FOV, t). After precompute, the existing fov_deep_features / pred per-t calls hit the cache. Paired pos-name validation is hoisted so precompute cannot write to mismatched cache slots; the in-loop checks stay as defense in depth. Sides whose ctx.require_complete is true are skipped at precompute time — the main loop's _raise_if_require_complete keeps enforcing the fail-loud miss contract on those sides. Threshold is exposed via feature_metrics.deep_feature_batch_threshold (default 256 when the key is absent; explicit default lands in the config-knob commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the per-kind fov_deep_features calls in precompute-gt's per-FOV loop with a single streaming push to DeepFeatureBatcher. The existing loop already loaded the GT plate FOV once for masks + CP; deep features now reuse that read and batch forwards across FOVs/timepoints. build.{masks,cp} still go through the per-FOV helpers because they're not part of the batched pipeline. flush_manifest stays in the loop to preserve crash-recovery progress; batcher.drain() runs after the loop to land any pending crops + reset force flags. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Set feature_metrics.deep_feature_batch_threshold = 256 so the cross- FOV/cross-timepoint batched precompute path has a clear default that shows up in the resolved Hydra config. Both evaluate_predictions and precompute-gt read this key via OmegaConf.select with a fallback of 256 — explicit default lets downstream leaves (e.g. benchmark configs) override per-run without touching code. CellDINO's internal batch_size is 32, so threshold=256 amortizes 8 internal chunks per flush — comfortable on a single H100. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ecks _flush_kind never read its `side` arg (ctx.side was authoritative), hence the `del side` workaround. Remove the parameter and update both batcher call sites. The per-iteration position-name checks inside evaluate_predictions' main loop are subsumed by the upfront hoisted validation block that walks the same zip; nothing mutates pred_positions / gt_positions / seg_positions between the hoist and the loop. Underscore the unused position names in the unpacking instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous push API took a pre-stacked (n_cells, H, W) ndarray plus a (0, 0, 0) sentinel for zero-cell slots; _flush_kind then destacked it back into a per-cell list to feed features_from_crops. The round trip + the sentinel were pure ceremony — features_from_crops already handles an empty list and the per-cell list shape it needs. Take the raw list[np.ndarray] from build_crops end to end. Both callers (precompute_deep_features, precompute-gt CLI) drop their local stack-or-sentinel expression; _flush_kind drops the per-row Python loop and uses a flat comprehension over items. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

_flush_kind called flush_manifest(ctx) after every batched write, which under typical settings is 3 backbones × ceil(N_cells / flush_threshold) round trips per session — each one locks the global manifest, reloads YAML, merges, and writes. The "crash-resilience" case the inline flush nominally protected was already covered: zarr slot writes are durable per slot, so on resume the lockless prefetch finds them whether or not the manifest is current. Move the flush to the caller, after batcher.drain(). evaluate_predic- tions flushes every active side; precompute-gt flushes once after the loop. Manifest write cadence drops to O(active_sides) for eval and O(1) for precompute, matching the durability the caller wants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Torch metrics (PCC/SSIM/NRMSE/PSNR) ran on CUDA but then both tensors were force-moved to CPU numpy and fed to cubic's spectral_pcc / fsc_resolution — round-tripping through host memory every timepoint. cubic's spectral metrics are array-module-agnostic and accept cupy inputs directly, so the host bounce is pure overhead. Hand the tensors off via cubic.cuda.ascupy / asnumpy (zero-copy on CUDA via the CUDA Array Interface, per cubic AGENTS.md) so PCC/SSIM/ NRMSE/PSNR/Spectral_PCC/FSC all stay on one device. Also defer _require_cubic to the path that actually needs it, so torch-only runs no longer require the cubic/cucim/cupy extras to be installed. Smoke: on a 15x512x512 volume with spectral PCC enabled, per-call wall time drops from ~2.95 s (CPU bounce) to ~70 ms (GPU-resident). CPU and GPU paths agree to ~1e-7 relative on PCC/NRMSE/PSNR/Spectral_PCC and ~6e-5 on SSIM (float32 cuDNN noise). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…doff Three small cleanups in compute_pixel_metrics from the simplify pass: 1. fsc_resolution mean-centers internally before every FFT (see cubic/metrics/spectral/frc.py lines 171/251/696/877), so the outer ``target - target.mean()`` / ``prediction - prediction.mean()`` was wasted: two extra full-volume allocations + reductions per (FOV, t). 2. Hoist ``to_xp = ascupy if cuda else asnumpy`` so the two-line branch collapses to one converter call per tensor. 3. Drop the ``_xp`` suffix; reuse the ``prediction, target`` names after conversion. cubic's AGENTS.md explicitly steers away from importing the ``xp`` convention into callers. Docstring restructured to lead with the load-bearing WHY (host bounce dominated wall time on long timelapses) instead of restating the implementation. Numerics unchanged: CPU and GPU paths still agree to ~1e-7 on PCC/NRMSE/PSNR/Spectral_PCC and ~6e-5 on SSIM on a 15x256x256 smoke run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(viscy-models): add 3D PatchGAN discriminator and LSGAN losses Introduces a new gan subpackage with PatchGAN3D and MultiScalePatchGAN3D (pix2pixHD-style, 2-scale by default), plus lsgan_d_loss / lsgan_g_loss list-based helpers. These will pair with UNetViT3D to form a 3D pix2pix adversarial baseline alongside the existing regression and flow-matching DynaCell baselines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add DynacellGAN Lightning module for 3D pix2pix baseline DynacellGAN composes the existing UNetViT3D generator with the new MultiScalePatchGAN3D discriminator, training with LSGAN + L1 (λ=100) via manual optimization (automatic_optimization=False). A requires_grad toggle separates the D step from the G step so each backward only populates the active side's grads; D grads are also explicitly cleared between optimizers so the test invariant (no D grads after a G step) holds for Phase 2 DDP correctness. Validation logs the existing loss/validate weighted-mean alias, so ModelCheckpoint(monitor="loss/validate") keys off the same metric used across the other DynaCell baselines. configure_optimizers builds two AdamW + WarmupCosine pairs directly to fit the two-optimizer return shape Lightning's manual mode expects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add pix2pix3d_unetvit recipe, model overlays, and smoke leaf The recipe declares dynacell.engine.DynacellGAN with UNetViT3D's production generator config and a 2-scale spectral-normalized PatchGAN3D. fit / predict model overlays mirror the unetvit3d ones with lr_g/lr_d split (no scalar lr) and bf16-mixed precision. The single-GPU smoke leaf for er/pix2pix3d_unetvit/ipsc_confocal sets benchmark.dataset_ref: null to disable the compose-hook resolver so it can point data_path at SEC61B_test48.zarr directly, and uses batch_size=2 to match the inherited num_samples=2 (HCSDataModule divisibility rule). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add ER iPSC production train leaf for pix2pix3d_unetvit Mirrors er/unetvit3d/ipsc_confocal/train.yml with the model overlay and all model_name / experiment_id / run_root / job_name strings retargeted to pix2pix3d_unetvit. Keeps ModelCheckpoint.monitor: loss/validate unchanged — DynacellGAN logs that key via on_validation_epoch_end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add pix2pix3d_unetvit single-set train leaves across organelles Adds 7 production single-GPU training leaves for pix2pix3d_unetvit: - er/pix2pix3d_unetvit/a549_mantis - {membrane, mito, nucleus}/pix2pix3d_unetvit/{ipsc_confocal, a549_mantis} Each leaf mirrors the matching unetvit3d cell with model_overlay, model_name, experiment_id, logger name/save_dir, ckpt dirpath, launcher run_root/job_name retargeted to pix2pix3d_unetvit. ModelCheckpoint monitor=loss/validate preserved. Joint training leaves (joint_ipsc_confocal_a549_mantis) deferred — still gated on the open BatchedConcatDataModule DDP deadlock (handoff 2026-04-26). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add iPSC predict + eval leaves for er/pix2pix3d_unetvit Mirrors er/unetvit3d/ipsc_confocal/{predict,eval}__ipsc_confocal.* with the model overlay, model_name, experiment_id, ckpt_path, output_store filename, and eval pred_path / save_dir all retargeted to pix2pix3d_unetvit. predict_method stays full_image (the only mode DynacellGAN.predict_step supports for v1). ckpt_path uses an explicit REPLACE_ME_WITH_PRODUCTION_CHECKPOINT_PATH placeholder so torch.load raises FileNotFoundError with the placeholder visible if the user forgets to override it — running predict against an untrained generator would silently produce garbage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add a549_mantis predict + eval triplet for er/pix2pix3d_unetvit/ipsc Adds the 3 a549_mantis test splits (mock, denv, zikv) for the iPSC-trained pix2pix3d_unetvit ER cell — enables cross-dataset transfer evaluation. Mirrors the unetvit3d ipsc_confocal predict/eval triplet with the same retargeting rule already used for the iPSC pair. ckpt_path uses the REPLACE_ME_WITH_PRODUCTION_CHECKPOINT_PATH placeholder so any forgotten override fails loudly via FileNotFoundError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(dynacell): /simplify pass — reuse configure_adamw_scheduler, dedupe _log_samples DynacellGAN.configure_optimizers used to rebuild AdamW + WarmupCosineSchedule inline despite the class docstring claiming it called the shared helper. Now calls configure_adamw_scheduler twice and concatenates, dropping the redundant schedule guard and the unused monai/torch.optim imports. _log_samples was a byte-identical instance method on DynacellUNet, DynacellFlowMatching, and DynacellGAN. Promoted to a module-level free function and updated the five call sites across all three classes. Updated test_dynacell_gan_validate_logs_alias to monkeypatch the module attribute instead of the instance method. Also trimmed three narrating comments inside DynacellGAN (the plan/PR context comments belong in the PR description, not the source) and five per-layer banner comments in patchgan3d.py (the conv literals state the shape and the class docstring covers the layer table). Safe under the running production training's Slurm auto-requeue: no constructor signatures change, AdamW + WarmupCosine pairs are built with the same args, optimizer state shape is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): Phase 2 — DDP topology + 4-GPU smoke + production leaves Adds the DDP infrastructure for pix2pix3d_unetvit so the GAN baseline can train on 4 H200s in ~13 h instead of the ~50 h single-GPU pace measured on job 33132605 (now cancelled). - recipes/topology/ddp_4gpu_gan.yml — DDP variant that wraps the strategy in find_unused_parameters: true. Required because DynacellGAN runs manual_optimization with two optimizers, and the per-step requires_grad toggle leaves one side's params "unused" from the reducer's perspective on each backward. Plain ddp_4gpu.yml stays single-strategy for the other models. - model_overlays/pix2pix3d_unetvit_fit_ddp.yml — mirrors the single-GPU fit overlay but composes ddp_4gpu_gan.yml topology + 4-GPU hardware. - er/pix2pix3d_unetvit/ipsc_confocal/train_smoke_4gpu.yml — 4-GPU smoke variant. Same SEC61B_test48 store, batch_size=2 matching the inherited num_samples=2 divisibility rule. Validates DDP sharding + the find_unused_parameters DDP reducer behavior under the GAN's manual-opt pattern before committing the 4-GPU production run. - er/pix2pix3d_unetvit/ipsc_confocal/train_4gpu.yml — 4-GPU production leaf. Keeps per-rank batch_size=4 (effective 16 across 4 GPUs) and unchanged lr_g/lr_d so the DDP run is comparable to the single-GPU reference at the per-rank optimization settings. logger name and job_name suffixed with _4gpu to disambiguate in W&B/slurm; run_root and ckpt dirpath unchanged so Phase 4 predict leaves keep working. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): retarget membrane pix2pix3d_unetvit save paths The membrane single-set train leaves landed with placeholder paths under .../ipsc/memb_temp/ — diverging from the celldiff convention (memb, not memb_temp) and conflating the a549_mantis run dir with the iPSC tree. Both leaves now point at the canonical .../ipsc/memb/ and .../a549_mantis/memb/ trees, matching the celldiff layout so all organelles can be cross-referenced uniformly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): expand pix2pix3d_unetvit predict/eval matrix Mirrors the er/pix2pix3d_unetvit/ipsc_confocal/ predict+eval set across membrane, mito, and nucleus so every iPSC-trained pix2pix3d_unetvit run has an apples-to-apples inference + evaluation path against both the ipsc_confocal self-target and the three a549_mantis conditions (mock, denv, zikv). Predict leaves swap target-inherited normalizations for a source-only NormalizeSampled (Phase3D) and clear the RandWeightedCropd augmentation. Eval leaves enable compute_feature_metrics only on the ipsc_confocal self-target, matching the existing er convention. Ckpt paths are placeholders pending the corresponding production runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add pix2pix3d_unetvit joint train + predict leaves Adds the joint_ipsc_confocal_a549_mantis variant for all four organelles (er, membrane, mito, nucleus): one train.yml authoring BatchedConcatDataModule with two HCSDataModule children plus the four predict leaves (self iPSC + three a549_mantis conditions). The joint train differs from the celldiff joint by using NormalizeSampled (fov_statistics) instead of MinMaxSampled so joint and single-set pix2pix3d_unetvit runs share the same normalization contract for apples-to-apples ablations. batch_size=2 × num_samples=2 matches the single-set 4 GPU-samples/step (per the BatchedConcatDataModule rule in CLAUDE.md that joint does NOT divide by num_samples). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(viscy-models): drop speculative num_layers from PatchGAN3D PatchGAN3D and MultiScalePatchGAN3D accepted a num_layers constructor parameter that only worked with the value 5 — any other value raised ValueError. The argument was justified as future-proofing for ablations that don't exist yet, which is exactly the speculative flexibility CLAUDE.md prohibits: "No 'flexibility' or 'configurability' that wasn't requested. ... No error handling for impossible scenarios." Re-add it only if and when a non-5 variant lands. Recipe + test fixture updated in lockstep; uv run pytest packages/viscy-models/tests/test_gan/ and the DynacellGAN init/forward tests pass (11/11). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): apply TTUR — drop pix2pix3d_unetvit lr_d to lr_g/10 The 1:1 LR schedule collapsed to D-dominance on DDP run 33176091 (wandb 20260520-002629): loss/d_train fell from 1.3e-2 (ep1) to ~1e-4 (ep12+), loss/g_adv pinned at ~1.0 (saturated MSE(d_fake, 1)), loss/g_l1 flat at ~0.45 from epoch ~5 onward — the adversarial gradient was effectively zero, generator was training as pure L1. Slowing the discriminator 10x (lr_d = lr_g / 10 = 3e-5) is the canonical pix2pix-style fix for D-dominance. Applies to both the single-GPU and DDP fit overlays so the entire Phase 3 baseline (all 4 organelles × {iPSC, A549, joint}) inherits the same adversarial balance regime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add eval runtime module + thread budgeting (C1) Adds dynacell.evaluation.runtime with three layered thread-cap entry points: early_apply_env_caps() sets BLAS/OMP env vars before C-extension load (called from __main__.main_cli); apply_thread_budget(threads) is the in-process safety net (env + torch.set_num_threads + threadpool_limits); resolve_runtime(config) materializes the new runtime config block. The runtime block (eval.yaml) is opt-in for parallelism (default executor=serial, fov_workers=1) but unconditionally tightens BLAS/OMP threads to cpu_count by default. On SLURM nodes where --cpus-per-task < host CPU count, this corrects previously-loose BLAS threading. Set DYNACELL_THREADS_PER_WORKER=N to override at the env layer for cases where in-process caps come too late. precompute_gt_artifacts validates that runtime.executor='serial' and fov_workers=1 — FOV-level parallelism is restricted to evaluate_predictions in this commit; precompute parallelism is out of scope (DeepFeatureBatcher accumulates state across FOVs). resolve_runtime supports two-phase resolution for evaluate_predictions: top-of-function for parent BLAS cap, then post-position-list with freeze_threads_per_worker= to keep parent and worker BLAS budgets consistent. Worker primitives (_process_one_fov, ProcessPoolExecutor) land in C3+C4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): per-T memory hygiene + region timing instrumentation (C2) Adds always-on region timers (RegionTimer context manager) that record wall time of FOV-level and per-T regions to <save_dir>/eval_timing.csv. Region taxonomy matches actual execution placement at pipeline.py:341-505: FOV-level (t=None): mask_gt, mask_pred, cp_gt, deep_gt_dinov3, deep_gt_dynaclr, deep_gt_celldino, features_pred_per_t, microssim, seg_write Per-T: pixel_metrics, mask_metrics, feature_pairwise Timing overhead is ~0.1 ms per region per timepoint; ~120 ms total on a 9-h eval (~4e-6 of wall time) — no gating knob. Per-T memory hygiene knobs (runtime.cuda_empty_cache_every_n_timepoints, runtime.gc_collect_every_n_fovs) ship at 0 (no behavior change) in this commit. The plan calls for measuring and flipping defaults in C2 — that measurement requires a real eval run on the test48 zarr that exceeds this session's scope. Defaults stay at 0; observed late-T degradation in prior A549 ER evals (18 -> 61 s/T over 10 timepoints) is the documented motivation but not the measured fix in this commit. DYNACELL_FORCE_PER_T_HYGIENE=1 env var flips both knobs on at runtime regardless of YAML defaults — operator escape hatch for post-ship mitigation without a code change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(dynacell): extract per-FOV body into _process_one_fov + FovResult (C3) Pure refactor: extract the per-FOV body of evaluate_predictions into a module-level function _process_one_fov(...) -> FovResult and an _aggregate_fov_result(...) parent-side aggregator. Serial-path behavior is identical to the prior code (verified: 384/48 pass/fail parity with HEAD~2; the 48 failures are all pre-existing). FovResult carries everything one FOV contributes to the run as a picklable dataclass: - per_t_pixel_rows (with microssim already merged in per pipeline.py:471 semantics — no separate microssim_rows field), - per_t_mask_rows, per_t_feature_rows, - seg_array of shape (T, 2, D, H, W) bool (channel 0 = pred, channel 1 = GT), - 4 _BackboneLists for cp/dinov3/dynaclr/celldino — each holds the per-T pred_feats/gt_feats/pred_fovs/gt_fovs/pred_ts/gt_ts arrays. Workers preserve the existing `if size > 0: append` guard so empty per-T slots never enter the lists; aggregator extends parent lists unconditionally. - timings slice from the per-process region_timer collector. _aggregate_fov_result writes the seg array to the HCS plate (still in the parent — concurrent create_position from workers would race plate metadata), extends per-T row lists, and extends the 24 dataset-level accumulators. This refactor positions C4 to add ProcessPoolExecutor + worker primitives without further changes to the FOV body. flush_manifest stays in the parent's loop for now; under process mode (C4) each worker will flush its own ctx at FOV end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): wire ProcessPoolExecutor + GPU serialization lock (C4) Adds worker primitives in runtime.py (_worker_initializer, gpu_serialization_lock, make_fov_executor) and the worker entry point _worker_run_fov in pipeline.py. evaluate_predictions now branches on runtime.executor: serial (default) runs _process_one_fov inline identical to today; process spawns a ProcessPoolExecutor over FOVs. Worker design: - Spawn context avoids inheriting parent CUDA. _worker_initializer sets BLAS/OMP env caps BEFORE apply_thread_budget lazy-imports torch. - _worker_setup lazy-loads seg_model + extractors under gpu_serialization_lock(): N workers don't init CUDA + load weights concurrently. Plate handles cached per-worker. - GPU touch sites inside _process_one_fov are wrapped in gpu_serialization_lock(): mask_gt/pred, cp_gt, deep_gt_{dinov3,dynaclr, celldino}, features_pred_per_t, pixel_metrics (per-T, GPU-bound end-to-end including cubic spectral), and on-demand cellpose fallback inside mask_metrics. In serial mode the lock has no path set and yields immediately — bit-for-bit-identical behavior. - Lock file at /tmp/dynacell_gpu_<SLURM_JOB_ID-or-PID>.lock (node-local, not NFS). One lock per SLURM job, shared across workers. Phase-2 runtime resolution: after pred_positions is built, resolve_runtime is re-called with n_positions= to clamp "auto" fov_workers to min(provisional, n_positions) and freeze threads_per_worker at the Phase-1 value so worker initializers match the parent's already-applied BLAS cap. Order: results collected via as_completed (arrival order) but aggregated in original pred_positions order so per-FOV CSV + embedding NPZ row order is deterministic regardless of executor. Workers flush their own ctx manifest at FOV end (global manifest lock at _pos_write_lock(ctx, "manifest", "global") fcntl serializes flushes across workers); interrupted process-mode runs preserve cache progress the same way serial mode did. Tests: full dynacell suite 384/48 pass/fail (identical to baseline; all 48 failures pre-existing). No new spawn-mode test in this commit — covered by C5 CPU integration test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(dynacell): runtime + parallel tests, CI wiring, SLURM env caps, docs (C5) Adds test coverage for the C1-C4 runtime module and FOV-parallelism wiring: test_runtime.py (23 tests): apply_thread_budget no-crash semantics under Hydra's main; resolve_runtime auto/freeze/validate paths; region timing collector; DYNACELL_FORCE_PER_T_HYGIENE env-var hook; spawn-context ProcessPoolExecutor smoke via stdlib os.getpid. test_evaluation_pipeline_parallel.py (3 tests): FovResult pickle round-trip with realistic per-backbone shapes (the path that ProcessPoolExecutor takes for Future.result transport); _aggregate_fov_result extends all 24 dataset-level accumulator lists correctly. Re-imports the live pipeline module inside test bodies because test_lazy_init.py pops dynacell.* modules from sys.modules, which would invalidate pickle resolution of a module-level imported FovResult. CI wiring: - Root pyproject.toml registers the "slow" marker and appends -m "not slow" to existing addopts. - test-dynacell-configs runs test_runtime.py + test_evaluation_pipeline_parallel.py alongside existing benchmark-schema/submit-tool tests. No new heavy deps needed (these tests don't import iohub, cubic, transformers). SLURM env caps: - run_eval_jointtrained_nucleus.slurm and run_eval_unext2_jointtrained_membrane.slurm export DYNACELL_THREADS_PER_WORKER=\$SLURM_CPUS_PER_TASK before invoking eval. Docs: - applications/dynacell/CLAUDE.md gains "Eval runtime / parallelism" section documenting the config block, env-var hooks, timing instrumentation, and process-mode caveats. Full dynacell test suite: 410 passed / 48 failed / 65 skipped — the 48 failures are the same set that fails on dynacell-models@HEAD (verified parity across all 5 commits in this branch). End-to-end serial-vs-process parity test on a real iohub+cache fixture is a follow-up (see .claude/plans/eval-parallelism.md §C5). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): drop parent seg_model under executor=process to avoid duplicate residency Under executor=process, workers lazy-load their own seg_model copies (under the GPU lock) — but the parent had been holding a duplicate copy GPU-resident since the unconditional prepare_segmentation_model(config) call at the top of evaluate_predictions. On an 80 GB A100 with fov_workers=4, this meant (N+1) × model_weights resident instead of N. After Phase-2 runtime resolution (where the final executor is known), drop the parent's copy when executor=process. The pre-warm side effect (segmenter_model_zoo's validate_model populating the checkpoint dir to avoid N workers racing on a cold quilt download) is already done by then. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(dynacell): resolve config interpolations in-place before worker submit Belt-and-suspenders safety net suggested in PR #426 review: call OmegaConf.resolve(config) in-place after Phase-1 runtime resolution and before any worker pickle.dumps the DictConfig across the spawn boundary. OmegaConf supports lazy interpolation resolution natively, so workers *would* resolve ${...} refs on first attribute access — but doing it once in the parent eliminates a tiny per-worker hot-path cost and guards against the (unlikely) case where an interpolation depends on parent-only state. ??? MISSING fields are untouched: verified OmegaConf 2.3 `_resolve` walks only nodes where `_is_interpolation()` is true, so bare ??? values pass through. Confirmed: 63 tests pass across pipeline / compose / parallel / runtime suites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(dynacell): silence inner per-T tqdm in spawn workers Adds a runtime.is_worker() helper that returns True iff _worker_initializer has set _GPU_LOCK_PATH. Wire it into _process_one_fov's inner-T tqdm via disable=is_worker(), so under fov_workers>1 the N concurrent per-T bars don't interleave on the parent's stderr. Outer per-FOV tqdm (in the parent) stays visible because the parent's _GPU_LOCK_PATH is never set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): don't double-count timing rows under executor=serial Bug: _process_one_fov writes region timings to the global _TIMINGS collector via region_timer, and returns the slice in FovResult.timings. _aggregate_fov_result then calls extend_timings(result.timings) unconditionally. In serial mode this re-appends each row to the SAME parent collector — every timed region appears twice in eval_timing.csv. In process mode the second append is correct (workers have separate per-process _TIMINGS, parent's is empty until aggregated). Fix: add extend_worker_timings kwarg to _aggregate_fov_result. Pass True in process mode, False in serial. Verified via 63-test suite green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): context-manage worker plate handles per FOV _worker_setup no longer caches pred/gt/seg plate handles in _WORKER_STATE. Instead, _worker_run_fov opens them via `with open_ome_zarr(...) as ...` scoped to one FOV — iohub file descriptors close between FOVs and the worker has nothing to clean up on ProcessPoolExecutor shutdown. Models + cache contexts stay cached in _WORKER_STATE (their load cost ~5-10 s per worker is too high to repeat per FOV; ProcessPoolExecutor process exit reclaims them). Position lookup by name uses a linear scan of plate.positions() (iohub has no name->position index). Cheap for the ≤ tens of positions per typical eval — sub-ms per scan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): skip prepare_segmentation_model under require_complete_cache When io.require_complete_cache=true, the caller asserts that all needed masks live in the cache — so loading SuperModel (which hits a quilt download for nucleus/membrane on a cold node) is wasted work. Return None unconditionally in that mode. If a cache miss happens despite the assertion, the cache layer raises StaleCacheError before any segment() call, so the None seg_model never gets dereferenced. Enables the cache-only/no-model path that C5's integration test design relies on (run evaluate_predictions end-to-end against prebuilt mask caches without instantiating cellpose/SuperModel). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(dynacell): add CPU integration test for serial-vs-process parity Drives evaluate_predictions end-to-end on a tiny iohub fixture (3 positions × T=2 × D=4 × H=W=32) with prebuilt mask caches. Runs the pipeline twice — once with runtime.executor=serial, once with runtime.executor=process runtime.fov_workers=2 — and asserts: 1. pixel_metrics + mask_metrics rows are byte-equal after sort by FOV. 2. segmentation_results.zarr per-position content is byte-equal across both channels. Cache-only design (matches plan §C5): target_name=er + require_complete_cache=true + compute_feature_metrics=false + compute_microssim=false + pixel_metrics.fsc=null + pixel_metrics.spectral_pcc=null. The full pipeline runs without instantiating cellpose, SuperModel, DinoV3, DynaCLR, CellDino, or cubic. What this validates: spawn boot, DictConfig pickling, FovResult.timings + arrays round-trip through pickle, _aggregate_fov_result extends parent accumulators correctly, parent-side segmentation_results.zarr writes don't race, sort-by-pos_name aggregation, and the new extend_worker_timings semantics doesn't double-count timing rows. _live_pipeline_module() clears dynacell.evaluation.* from sys.modules before importing — necessary because test_evaluation_pipeline.py stubs metrics/utils via monkeypatch.setitem(sys.modules, ...), and without the clear my pipeline's compute_pixel_metrics resolves to the empty-returning stub. CI: runs ~16 s on the existing test-dynacell-configs job. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(dynacell): include CPU integration test in dynacell-configs CI job Adds tests/test_evaluation_pipeline_parallel_cpu.py to the dynacell pytest invocation. The test runs end-to-end evaluate_predictions in both serial and spawn-process modes against a tiny iohub fixture + prebuilt mask cache — no eval extras (cellpose, transformers, cubic) needed because target_name=er + io.require_complete_cache=true short-circuit segmenter and feature-extractor loads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): harden runtime resolution + gate GPU lock on use_gpu Four Copilot-review findings on the runtime layer: 1. apply_thread_budget threadpool_limits re-arm: the old handle becomes unreachable and is GC'd. The previous comment said "stays alive" which was misleading. Document the actual behavior: we WANT the new limit active, dropping the old handle is intentional. Also remove the dead if-branch since both arms did the same assignment. 2. resolve_runtime accepts literal fov_workers=0 / threads_per_worker=0 which produces divide-by-zero in the "auto" math or nonsensical thread counts. Add explicit >= 1 validation with a clear error. 3. _cpu_count() prefers os.sched_getaffinity(0) over os.cpu_count() so SLURM cgroup limits (cpuset etc.) are honored — otherwise auto- resolved fov_workers/threads_per_worker can oversize past the cgroup-visible CPU budget. Falls back to os.cpu_count() on non-Linux. 4. gpu_serialization_lock now takes a gate: bool = True kwarg. Callers in pipeline.py pass gate=use_gpu so under config.use_gpu=false the lock is a no-op even in process mode. Previously the lock fired unconditionally, defeating cross-worker parallelism for CPU-only eval runs. The lock still gates on _GPU_LOCK_PATH for the parent/serial no-op (unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(dynacell): stream aggregation under executor=process to bound peak memory Previously buffered every FovResult in results_by_pos until ALL N worker futures completed, then aggregated in pos_names_in_order. Peak parent memory: N × seg_array (~N × 440 MB for T=10 D=21 H=W=1024 evals) all live simultaneously. Switch to streaming aggregation: drain the buffer opportunistically in pos_names_in_order as futures complete. Only out-of-order FovResults stay buffered; once the next-expected position lands we aggregate (plate write + accumulator extend) and drop the ref. Peak buffer is bounded by the number of out-of-order FOVs in flight, not by N. Deterministic per-FOV CSV / embedding NPZ row order is preserved because aggregation still happens in pos_names_in_order — early completions just wait their turn in the buffer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): grouped multi-condition eval driver (#427) * feat(dynacell): grouped multi-condition eval driver Add ``dynacell evaluate-grouped``: run the same (model, organelle) across multiple I/O variants (the canonical case is the three A549 treatment plates mock/denv/zikv) in one Python process, paying the segmenter + DINOv3 + DynaCLR + CELL-DINO cold-start once instead of N times. On serial-mode interactive jobs and on per-condition SLURM jobs this is 30-90s of weight load + checkpoint warmup amortized across all conditions. Under runtime.executor=process, each condition still spawns its own pool whose workers re-load model copies, so only the parent-side load is amortized; the driver prints a note when it detects this. Mechanism: - ``dynacell.evaluation.model_loader.load_eval_models`` extracts the segmenter + extractor instantiation that previously lived inline in ``evaluate_predictions`` and ``_worker_setup``. The bundle carries identity tags (dinov3_model_name, dynaclr_ckpt source paths, encoder cfg, celldino weights) so ``init_cache_context`` callers get identical cache-identity values across conditions. - ``evaluate_predictions(config, *, models=None)`` accepts a pre-loaded bundle and skips its inline load when supplied. ``models=None`` preserves the historical single-condition behavior; existing callers and tests are unchanged. - ``evaluate_predictions_grouped(config)`` reads a top-level ``conditions:`` list, deep-merges each overlay into a copy of the base, and runs them sequentially with the shared models. Per-condition ``_final_metrics_cache_valid`` short-circuits are honored independently. - ``_check_grouped_field_invariants`` rejects overlays that touch ``target_name``, ``feature_extractor.*``, ``compute_feature_metrics``, or ``use_gpu`` -- those determine which models get loaded and must be identical across conditions. ``io.*``, ``save.*``, ``runtime.*``, ``limit_positions``, ``force_recompute.*`` are free to vary. - Hydra entry point ``evaluate_model_grouped`` with new ``_configs/eval_grouped.yaml`` (inherits ``eval.yaml`` via defaults). - New CLI subcommand registered in ``__main__._HYDRA_COMMANDS``. Tests: ``tests/test_evaluation_grouped.py`` builds a 2-condition cache-only fixture (separate pred plates + cache dirs) and asserts byte-equal parity against sequential per-condition runs, plus rejection cases for empty ``conditions`` and forbidden model-loading-field overrides. Cache-only path means no models actually load -- the test validates the loop structure, not the speedup. Reuses the iohub fixture builders from ``test_evaluation_pipeline_parallel_cpu.py``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(dynacell): simplify grouped eval driver + fold load + cache-context unification Followup polish on d69d970 (feat: grouped multi-condition eval driver). Addresses /simplify findings + the two related cleanups the user asked to fold in: Pipeline driver (pipeline.py): - Lazy model load — defer ``load_eval_models`` until the first condition that misses ``_final_metrics_cache_valid``. Restarts where every condition is cache-hit pay no 30–90 s cold start. - Invariant baseline is now the conditions-stripped input config, not the conditions[0]-merged config. A model-loading override smuggled into condition 0 is rejected symmetrically with conditions 1..N (previously it was silently re-baselined). New test guards the regression: ``test_grouped_rejects_condition0_model_field_override``. - ``_merge_condition`` drops the manual ``to_container`` round-trip deep-copy; ``OmegaConf.merge`` already returns a non-aliased config. - ``_check_grouped_field_invariants`` switched to ``OmegaConf.is_config`` in place of the ``type(OmegaConf.create([]))`` idiom; baseline snapshot is computed once and reused per iteration. - Extracted ``_load_cached_final_metrics`` so the three-NPY reload block is shared between ``evaluate_model`` and the grouped driver. - Removed redundant first-iteration merge + ``apply_dataset_ref`` for condition 0; the loop body now does both for every condition, including idx=0. Model loader (model_loader.py) and precompute-gt: - Added ``LoadFlags`` so ``load_eval_models`` can be called from both ``evaluate_predictions`` (all extractors gated by ``compute_feature_metrics``) and ``precompute_gt_artifacts`` (per-extractor ``build.masks`` / ``build.dinov3`` / ``build.dynaclr`` / ``build.celldino`` flags). ``LoadFlags.for_evaluate`` recovers the prior default. The 30-line inline load block in ``precompute_cli.py`` collapses to a single ``load_eval_models(config, flags=LoadFlags(...))`` call. Stricter celldino weights gate is preserved as an explicit pre-load raise. - Added ``init_cache_contexts(config, models)`` and ``init_gt_cache_context(config, models)`` so the four identical ``init_cache_context(..., side=..., **identity_tags)`` call sites in ``evaluate_predictions``, ``_worker_setup``, and ``precompute_gt_artifacts`` consolidate behind the EvalModels bundle. Test fixtures (tests/_eval_fixtures.py): - Promoted the cache-only fixture builders (``build_eval_config``, ``build_fixture``, ``make_hcs_plate``, ``make_mask_cache``, ``live_pipeline_module``, ``read_position_arrays``, ``N_POSITIONS``/``T``/``D``/``H``/``W``) out of ``test_evaluation_pipeline_parallel_cpu.py`` into a private ``_eval_fixtures.py`` module (underscore-prefixed so pytest skips collection). Both ``test_evaluation_pipeline_parallel_cpu.py`` and ``test_evaluation_grouped.py`` now import normally — no ``sys.path.insert`` (project's "no ``sys.path`` edits" rule). CI: - ``.github/workflows/test.yml`` adds ``test_evaluation_grouped.py`` to the ``test-dynacell-configs`` matrix so the grouped driver gets exercised on every push. All 104 tests in the affected suites pass. Pre-existing 48-test failure baseline (test_benchmark_config_composition + one test_evaluation_metrics case) is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): tier grouped-eval process-mode message by cache state Split the existing ``[grouped] note:`` print into two tiers so users get proportional signal for the configuration they actually picked: - ``executor=process`` + ``require_complete_cache=true`` (cache-only, n_conditions > 1) keeps the mild informational note — workers re-init per condition but skip ``prepare_segmentation_model`` (returns None) and don't instantiate the deep extractors, so the only overhead is N pool spawns. - ``executor=process`` + ``require_complete_cache=false`` (cache-miss, n_conditions > 1) gets a loud multi-line WARNING. This is the actual worst case: each condition's worker pool independently loads SuperModel + DINOv3 + DynaCLR + CELL-DINO, so total wasted cold-start scales as ~30-90 s × N_workers × N_conditions. The warning explicitly points to ``executor=serial`` as the fix, since the parent's pre-loaded ``EvalModels`` is shared across conditions in serial mode but lost across pool spawns in process mode. Two new tests in ``test_evaluation_grouped.py``: - ``test_grouped_warns_loudly_on_process_plus_cache_miss`` confirms the WARNING text fires (executor=process + require_complete_cache=false + 2 conditions) with all the expected keywords. - ``test_grouped_mild_note_on_process_plus_cache_only`` confirms the cache-only fixture (require_complete_cache=true) gets the mild note and does NOT trip the loud WARNING. CLAUDE.md "Grouped multi-condition eval" section updated with the tiered behavior so users searching for the WARNING text find context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): grouped eval struct-mode + require_complete_cache invariant Three fixes triggered by Copilot's review on PR #427 plus an independent struct-mode bug that would have crashed every real Hydra invocation of ``dynacell evaluate-grouped``: 1. ``_merge_condition`` was over-simplified during the /simplify pass (commit c271d3f). ``OmegaConf.merge`` propagates struct mode from its first argument, and Hydra composition always produces struct-mode configs. Result: merging an overlay carrying a ``name`` label (which isn't in the schema) raises ``ConfigKeyError``, and ``del merged["conditions"]`` raises ``ConfigTypeError`` even when the merge succeeds. Restore the ``to_container`` round-trip on the base — it escapes struct mode while still producing an independent config. Same fix applied to the ``models_base`` snapshot a few lines above. Existing tests didn't catch this because they construct configs via ``OmegaConf.create({...})`` directly (non-struct); real users going through ``@hydra.main`` would have hit the crash immediately. 2. Add ``io.require_complete_cache`` to ``_MODEL_LOADING_FIELDS`` so per-condition overlays cannot override it. The flag flips ``prepare_segmentation_model`` between "load SuperModel" and "return None" — letting a condition flip it would mean the shared model bundle is wrong for that condition (None when cache-miss expects a real segmenter, or loaded but never used). Treat as a grouped invariant. (Copilot thread on pipeline.py:1227.) 3. Add ``applications/dynacell/tests/__init__.py``. The new test files use ``from ._eval_fixtures import ...`` relative imports, which work under the project's ``--import-mode=importlib`` setting but are fragile under other pytest import modes. Adding ``__init__.py`` makes ``tests/`` a proper package and the relative imports work regardless of which import mode pytest is invoked with. (Copilot threads on test_evaluation_grouped.py:32 and test_evaluation_pipeline_parallel_cpu.py:42.) Two new regression tests in ``tests/test_evaluation_grouped.py``: - ``test_grouped_rejects_require_complete_cache_override`` covers the new invariant. - ``test_grouped_works_on_struct_mode_config`` wraps the parity fixture in ``OmegaConf.set_struct(grouped_cfg, True)`` to simulate a Hydra-composed config; would have failed before the ``_merge_condition`` fix. All 108 tests in affected suites pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(dynacell): fix two CLAUDE.md doc nits flagged in PR #427 self-review Two 75-confidence findings from the parallel review on PR #427 that scored below the 80-threshold for the inline review comment but are worth fixing now while context is fresh: 1. "Reality check for process mode" subsection referenced ``pipeline.py:171`` as the parent seg-model load site. After the eval-parallelism refactor that line is a dataclass field (``pred_ts: list[np.ndarray] = ...``); the actual load now lives in ``load_eval_models(config)`` called from the top of ``evaluate_predictions``. Switch the reference from a brittle line number to a function-name pointer. 2. "Grouped multi-condition eval" subsection said the per-condition cache-load short-circuit fires "with ``force_recompute.final_metrics =false``", omitting the second gate: ``_final_metrics_cache_valid`` also returns False (forcing recompute) when ``force_recompute.all =true``, regardless of the ``final_metrics`` flag. Make both gates explicit so users who set ``force_recompute.all=true`` for unrelated reasons aren't surprised by unexpectedly recomputed metrics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): keep SuperModel under require_complete_cache when pred cache off `prepare_segmentation_model` was returning None unconditionally whenever `io.require_complete_cache=true`, on the assumption that all masks would be served from cache. That assumption breaks for nucleus/membrane targets when `io.pred_cache_dir=None`: the per-T loop in `_process_one_fov` falls back to `segment(predict[t], seg_model=seg_model)`, which raises `ValueError` because `seg_model` is None. Narrow the early return to cases where no path can reach `segment()` without a cached mask — organelle targets (workflow-based, no `seg_model` needed) or nucleus/membrane runs that also have `io.pred_cache_dir` configured. Reported by Copilot on PR #426. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(dynacell): align gpu_serialization_lock docstring with implementation The docstring claimed "no CUDA available in the worker" was one of the no-op conditions, but the implementation only checks `gate` and `_GPU_LOCK_PATH`. On a CPU-only node with `use_gpu=true` the lock would still serialize across workers. Rather than add a runtime CUDA probe (callers already pass `gate=use_gpu` to opt out for CPU-only regions), correct the docstring to describe what the code actually does. Reported by Copilot on PR #426. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(dynacell): rewrite eval README — runtime/grouped/cache-only + trim bloat The eval README had grown stale relative to the runtime/grouped/cache work landed on this branch and PR #427: no mention of `io.require_complete_cache`'s cache-only fast path, no `runtime.*` block, no `dynacell evaluate-grouped`, no CELL-DINO. This rewrite folds those in and restructures: FOV parallelism + grouped eval move out of "Artifact caches" (where they were mis-classified) into a new top-level "Scaling" section; a new "Quick start" sits above Configuration; HF-cache operational details go to a new "HPC notes" section at the end. Bloat trim along the way: components table compressed (related modules combined, trivial files pushed to a note), HF-cache + external-users sections shortened, the 11-row force_recompute table collapsed to a YAML block plus 5 grouped descriptions, redundant Hydra group-discovery paragraphs removed. No flags, footguns, or CLI options dropped — every config knob and env var from the old version is still documented. 466 → 265 lines. Public-repo hygiene: replaced the hard-coded `/hpc/projects/comp.micro/...` default for the team-shared HF cache with a description that points at the `_DEFAULT_SHARED_HF_CACHE` constant in `dynacell/__main__.py` and the `DYNACELL_SHARED_HF_CACHE` env-var override. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(dynacell): address PR #426 code-review findings Five fixes from a code-review pass on the controlled FOV-parallel eval work: - Restore the seg_model invariant for grouped multi-condition runs. After 5829175, ``prepare_segmentation_model``'s load decision depends on ``io.pred_cache_dir`` in addition to ``io.require_complete_cache``, but ``_MODEL_LOADING_FIELDS`` only caught the latter. A per-condition overlay setting ``io.pred_cache_dir=None`` against a base that has it set (for nucleus/membrane + ``require_complete_cache=true``) would crash at runtime: the shared ``seg_model`` is None but the per-T loop falls back to ``segment(predict[t], seg_model=None)``. New ``_seg_model_required()`` helper mirrors the load decision and ``_check_grouped_field_invariants`` rejects the dangerous direction only, leaving canonical per-condition pred caches free to differ. - Fix the process-mode progress bar stalling. ``pbar.update(1)`` only fired inside the in-order drain loop, so if FOV 0 was slow and later FOVs completed first, the bar stayed at 0% until FOV 0 drained. Update on each worker completion instead. - Correct the misleading ``[grouped] note:`` for ``executor=process`` + ``require_complete_cache=true``. The message claimed "no models actually load" but deep extractors do load whenever ``compute_feature_metrics=true``, and the segmenter only short- circuits in a subset of cases. Rewrite to match reality. - Move ``runtime.*`` imports (12 symbols) from four inline per-function blocks to one top-of-file import. No spawn-ordering or circular-import constraint blocks them; the inline pattern just violated the project's "no inline imports without strong reason" rule and duplicated a ``flush_manifest`` import that already lives at module top. New regression test ``test_grouped_rejects_pred_cache_dir_flip_for_nucleus`` guards the dangerous direction; an existing test's assertion updates to the corrected note wording. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

All other dynacell-era models (FNet3D, FCMAE family) write to /hpc/projects/.../models/dynacell/...; only the legacy comparison- vs-viscy models (celldiff, unetvit3d) and my new pix2pix3d_unetvit baseline (which copied the celldiff/unetvit3d leaf templates) kept the legacy /models/cell_diff_vs_viscy/ prefix. This commit normalizes pix2pix3d_unetvit alone — 31 leaves under benchmarks/virtual_staining/<organelle>/pix2pix3d_unetvit/ — so future Phase 3/4 runs land in the canonical dynacell tree. The in-flight DDP run (33190465) is unaffected: it reads from its already-rendered resolved YAML and writes to the legacy path; that run's outputs will be migrated manually after it completes. Legacy celldiff + unetvit3d leaves are intentionally left alone in this commit (already-trained models, would need on-disk mv too). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ed models 16 new predict configs under */celldiff/a549_mantis/ (4 organelles × 4 predict sets: denv/mock/zikv/ipsc_confocal). A549 test sets use z_window_size=48 with a549trained zarr naming; iPSC confocal OOD eval uses z_window_size=40 with a549trained suffix in ipsc/predictions/. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The new metric suite — torch-fidelity FID/KID/PRC/MIND, real-vs-pred linear probes (CP/DINOv3/DynaCLR/CellDINO), and the CellDINO encoder baseline — is the intended default for the post-refactor re-eval campaign. Previously the base default was false, so any leaf that did not explicitly set true silently skipped DINOv3+DynaCLR+CellDINO loading and emitted an 11-column feature_metrics.csv instead of the 74-column full schema. Flip the base default in _configs/eval.yaml; add explicit compute_feature_metrics: true to the 72 leaf eval__*.yaml configs that previously relied on the default; flip the three fnet3d mix-trained nucleus shell scripts that hard-coded false. Leaves restate the flag so intent stays visible at the leaf without chasing the defaults graph. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

defaults: [eval, _self_] caused eval's expansion (which ends with the ``leaf`` group) to apply BEFORE eval_grouped.yaml's own ``_self_`` body, so ``conditions: []`` from _self_ silently clobbered any leaf-provided ``conditions:`` list. ``evaluate-grouped leaf=<path>`` would then exit immediately with "evaluate_predictions_grouped requires a non-empty top-level 'conditions' list." Swapping to defaults: [_self_, eval] runs the schema-declaring _self_ first, then lets eval (and its trailing ``optional leaf: null``) apply on top, so the leaf's conditions land last and survive composition. Verified on all 12 generated grouped leaves: ``uv run dynacell evaluate-grouped leaf=<path> --cfg job`` now reports the leaf's full conditions list (16-28 entries) with target_name set correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The old example ``uv run dynacell evaluate-grouped -c eval_grouped_a549_mantis_er`` was wrong on two counts: * ``-c`` is Hydra's ``--cfg`` flag, which only accepts ``job``, ``hydra``, or ``all`` for config display. Passing a config name there is a no-op (or argparse error, depending on placement). * The grouped leaves live under ``configs/benchmarks/virtual_staining/_internal/leaf/grouped/`` and are discovered via the ``leaf=`` group override the same way single-condition leaves work — that's the actual entry point baked into the dynacell CLI's _EXTERNAL_SEARCHPATHS. Replace with the correct ``leaf=grouped/<bucket>/eval_grouped`` form and add a one-line note explaining why ``-c`` doesn't work, so the next person who follows the docs gets a runnable command. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Lets one tool cover both serial chained predicts (existing behavior, default) and SLURM array submissions (new), so hand-written ``run_*_pred_*.slurm`` scripts can be retired in favor of a single launcher with topology validation + resolved-config artifacts. New flags (all opt-in, no change to existing serial path): * ``--array`` — render ``sbatch_template_array.sbatch`` with ``#SBATCH --array=0-(N-1)``, per-task logs at ``%A_%a.{out,err}``, and a bash ``CONFIGS=(...)`` block dispatched by ``$SLURM_ARRAY_TASK_ID``. * ``--max-array-concurrency N`` — append ``%N`` to the array spec to cap concurrent tasks. Requires ``--array``. * ``--allow-mixed-directives`` — bucket leaves by ``(SBATCH directive identity, run_root)`` and submit one sbatch array per bucket. Each bucket writes artifacts into its own ``run_root``; job names get a ``_g{N}`` suffix. Without this flag, ``--array`` keeps the strict uniform-hardware/uniform-run_root invariant of serial mode. Internals: * Factored out ``_compose_leaves``, ``_validate_uniform_sbatch``, ``_validate_devices``, ``_write_resolved_configs``, ``_render_serial_sbatch``, ``_render_array_sbatch``, and ``_sbatch_submit`` from the monolithic ``submit()`` so the serial/array dispatch reads as a thin top-level loop. * ``_render_sbatch_directives`` learned an ``array_spec`` arg that inserts ``#SBATCH --array=`` and switches the log filenames from ``%j`` to ``%A_%a``. * ``_directive_bucket_key`` produces a stable hashable identity for a leaf's SBATCH block (``nodes``, ``gpus``, ``mem``, ``constraint``, ...), used by ``--allow-mixed-directives`` to group leaves. The new ``sbatch_template_array.sbatch`` mirrors ``sbatch_template_batch.sbatch`` (umask, group-writable mkdirs, cleanup trap, ``cd @@repo_root``, ``ml uv``, ``nvidia-smi``) and ends with ``srun uv run python -m dynacell predict --config "$CFG"`` where ``$CFG`` is selected from ``CONFIGS`` by ``$SLURM_ARRAY_TASK_ID``. ``predict_batch.sh`` already forwards extra args via ``"$@"``; the docstring is updated to advertise the new flags as supported passthrough with worked examples. Tests cover: ``--max-array-concurrency`` / ``--allow-mixed-directives`` rejected without ``--array``; negative concurrency rejected; mixed ``run_root`` rejected without ``--allow-mixed-directives``; serial mode unchanged (no array directive, no ``CONFIGS=``); ``--array`` emits the array directive and ``CONFIGS`` block; ``--max-array-concurrency`` appends ``%N``; ``--allow-mixed-directives`` buckets 4 ER leaves (3 a549-test + 1 ipsc-test) into two sbatch arrays with distinct ``_g0``/``_g1`` job-name suffixes; ``_directive_bucket_key`` distinguishes by constraint and ignores ``time``. Verified end-to-end against the 16 celldiff_r2 a549trained leaves: one invocation auto-buckets into ``a549/predictions/`` (12 tasks) + ``ipsc/predictions/`` (4 tasks) sbatch arrays. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@Package

Adds the tooling that drives a 12-bucket re-eval campaign for the post-refactor metric suite (torch-fidelity FID/KID/PRC/MIND, real-vs-pred linear probes, CellDINO baseline, updated DynaCLR ckpt) across every (organelle × train_set) tuple that currently has a prediction zarr on disk. Generator (``tools/generate_grouped_eval_configs.py``) ====================================================== Walks ``{ipsc,a549}/{predictions,joint_predictions}/`` and parses every prediction zarr into a canonical identity ``(organelle, model, variant, train_set, test_set, condition)``. Seven filename patterns are recognized, including the legacy ``<org>_<model>__<gene>_<cond>`` form that pre-dates the modern A549 naming — these are still live on disk for sec61b + tomm20 deterministic and CellDiff_r2 iPSC-trained predictions and must be consumed. ``celldiff_r2`` in ``joint_predictions/`` without a variant suffix is recognized as the joint-default form. Skip allowlist for the known-stale May-5 ``memb_celldiff_*`` zarrs and the ``sec61b_fnet3d``/``sec61b_unext2`` legacy aliases. The original non-r2 CellDiff family is also skipped (only ``celldiff_r2`` is in scope per the 2026-05-21 instruction). Per condition, derives: * ``benchmark.dataset_ref: {dataset, target}`` as a dict (resolver silently returns None on string refs or partial dicts). Manifest slugs are hyphen-form (``aics-hipsc``, ``a549-mantis-<gene>-<cond>``); target keys are gene markers on A549 (``h2b``/``caax``) vs logical organelle names on iPSC. * ``io.pred_path`` (the walked zarr) and ``io.pred_cache_dir`` under ``{dataset}/eval_cache_pred/{train_set}/{model_variant}/{condition}/``. * ``save.save_dir`` at the canonical re-eval path (``{dataset}/{evaluations_with_embeddings,evaluations_a549trained_with_embeddings,evaluations_jointtrained_with_embeddings}/eval_<paper>[_<infix>]_<organelle>[_<cond>]``). * ``name`` — stable per-condition label used by the grouped driver. Pre-checks before emission: every pred_path exists; every gt_cache_dir contains both DynaCLR (sha12 ``e409a5a079aa``) and CellDINO (sha12 ``ef7c17ffb0aa``) feature dirs; no duplicate save_dirs within a bucket. ``--skip-fresh`` filters out conditions whose canonical save_dir already has a ≥70-col ``feature_metrics.csv`` (used after canceling + resubmitting the array to avoid redoing completed work). Emits 12 production leaves (one per organelle × train_set) plus a 13th ``_probe`` leaf used by the Step-5 sanity check (6 conditions × 2 positions, all save_dirs under ``/tmp/reeval_probe/`` so partial runs can't corrupt production), plus a README summarizing per-bucket condition counts. Leaves are pure ``# @Package _global_`` overlays (no ``defaults: [eval_grouped, _self_]`` — that creates a composition cycle since ``evaluate-grouped`` already starts from ``config_name="eval_grouped"``). Tests (``tools/generate_grouped_eval_configs_test.py``) ======================================================= * Grammar dispatch: 14 parametrized cases covering the seven canonical patterns + variants + dedupe-preference-for-joint-dir. * Failure cases raise ``ValueError`` with informative messages (unknown organelle, unknown model, unknown CellDiff variant). * Save_dir/pred_cache_dir/dataset_ref derivation matches the canonical layout (incl. the A549 ``h2b``/``caax`` vs iPSC ``nucleus``/``membrane`` target-key split). * Live-data tests (marked ``requires_data``) walk the actual training tree, assert every emitted ``pred_path`` exists, and compose + resolve every generated leaf via ``OmegaConf.merge`` + ``apply_dataset_ref`` + ``_check_grouped_field_invariants`` — catches what ``--cfg job`` does NOT (Hydra-only composition vs eval-time resolver splice + invariant checks). Campaign runners ================ ``tools/run_reeval_campaign.slurm`` — SLURM array (0-11%4) over the 12 grouped leaves. ``constraint=a100|h100|h200`` (eval-time GPU load is ~15 GB; ``pix2pix3d_needs_h200`` applies to predict-time DDP only). ``--time=48h`` (2× the probe-based projection for the largest 28-cond bucket). ``DYNACELL_THREADS_PER_WORKER=32`` set before any C-extension loads so BLAS caps bite at first thread-allocator load (per ``dynacell/CLAUDE.md`` "Eval runtime / parallelism"). ``tools/verify_reeval_outputs.py`` — post-run sanity walks every emitted save_dir + pred_cache_dir and runs the 5 plan-defined checks: ``feature_metrics.csv`` has ≥70 cols (target 74), 8 ``{gt,pred}_*`` embedding npz files, ``eval_timing.csv`` present, CellDINO/DynaCLR probe + fidelity columns present, ``pred_cache/features/celldino/`` populated. Prints a pass/fail table; exits non-zero on any fail. Grouped leaves ============== 12 leaves + ``_probe`` leaf + README under ``configs/benchmarks/virtual_staining/_internal/leaf/grouped/``, auto-generated by the script above. Committed as a campaign snapshot (generator can regenerate at any time given the prediction zarrs + GT caches on disk). Current state reflects the post-``--skip-fresh`` set after the first array's partial completion: 16/8/1/16/16/6/22/12/9/22/16/6 = 150 production conditions + 6 probe conditions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…style) (#428) * feat(viscy-models): add 3D PatchGAN discriminator and LSGAN losses Introduces a new gan subpackage with PatchGAN3D and MultiScalePatchGAN3D (pix2pixHD-style, 2-scale by default), plus lsgan_d_loss / lsgan_g_loss list-based helpers. These will pair with UNetViT3D to form a 3D pix2pix adversarial baseline alongside the existing regression and flow-matching DynaCell baselines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add DynacellGAN Lightning module for 3D pix2pix baseline DynacellGAN composes the existing UNetViT3D generator with the new MultiScalePatchGAN3D discriminator, training with LSGAN + L1 (λ=100) via manual optimization (automatic_optimization=False). A requires_grad toggle separates the D step from the G step so each backward only populates the active side's grads; D grads are also explicitly cleared between optimizers so the test invariant (no D grads after a G step) holds for Phase 2 DDP correctness. Validation logs the existing loss/validate weighted-mean alias, so ModelCheckpoint(monitor="loss/validate") keys off the same metric used across the other DynaCell baselines. configure_optimizers builds two AdamW + WarmupCosine pairs directly to fit the two-optimizer return shape Lightning's manual mode expects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add pix2pix3d_unetvit recipe, model overlays, and smoke leaf The recipe declares dynacell.engine.DynacellGAN with UNetViT3D's production generator config and a 2-scale spectral-normalized PatchGAN3D. fit / predict model overlays mirror the unetvit3d ones with lr_g/lr_d split (no scalar lr) and bf16-mixed precision. The single-GPU smoke leaf for er/pix2pix3d_unetvit/ipsc_confocal sets benchmark.dataset_ref: null to disable the compose-hook resolver so it can point data_path at SEC61B_test48.zarr directly, and uses batch_size=2 to match the inherited num_samples=2 (HCSDataModule divisibility rule). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add ER iPSC production train leaf for pix2pix3d_unetvit Mirrors er/unetvit3d/ipsc_confocal/train.yml with the model overlay and all model_name / experiment_id / run_root / job_name strings retargeted to pix2pix3d_unetvit. Keeps ModelCheckpoint.monitor: loss/validate unchanged — DynacellGAN logs that key via on_validation_epoch_end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add pix2pix3d_unetvit single-set train leaves across organelles Adds 7 production single-GPU training leaves for pix2pix3d_unetvit: - er/pix2pix3d_unetvit/a549_mantis - {membrane, mito, nucleus}/pix2pix3d_unetvit/{ipsc_confocal, a549_mantis} Each leaf mirrors the matching unetvit3d cell with model_overlay, model_name, experiment_id, logger name/save_dir, ckpt dirpath, launcher run_root/job_name retargeted to pix2pix3d_unetvit. ModelCheckpoint monitor=loss/validate preserved. Joint training leaves (joint_ipsc_confocal_a549_mantis) deferred — still gated on the open BatchedConcatDataModule DDP deadlock (handoff 2026-04-26). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add iPSC predict + eval leaves for er/pix2pix3d_unetvit Mirrors er/unetvit3d/ipsc_confocal/{predict,eval}__ipsc_confocal.* with the model overlay, model_name, experiment_id, ckpt_path, output_store filename, and eval pred_path / save_dir all retargeted to pix2pix3d_unetvit. predict_method stays full_image (the only mode DynacellGAN.predict_step supports for v1). ckpt_path uses an explicit REPLACE_ME_WITH_PRODUCTION_CHECKPOINT_PATH placeholder so torch.load raises FileNotFoundError with the placeholder visible if the user forgets to override it — running predict against an untrained generator would silently produce garbage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add a549_mantis predict + eval triplet for er/pix2pix3d_unetvit/ipsc Adds the 3 a549_mantis test splits (mock, denv, zikv) for the iPSC-trained pix2pix3d_unetvit ER cell — enables cross-dataset transfer evaluation. Mirrors the unetvit3d ipsc_confocal predict/eval triplet with the same retargeting rule already used for the iPSC pair. ckpt_path uses the REPLACE_ME_WITH_PRODUCTION_CHECKPOINT_PATH placeholder so any forgotten override fails loudly via FileNotFoundError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(dynacell): /simplify pass — reuse configure_adamw_scheduler, dedupe _log_samples DynacellGAN.configure_optimizers used to rebuild AdamW + WarmupCosineSchedule inline despite the class docstring claiming it called the shared helper. Now calls configure_adamw_scheduler twice and concatenates, dropping the redundant schedule guard and the unused monai/torch.optim imports. _log_samples was a byte-identical instance method on DynacellUNet, DynacellFlowMatching, and DynacellGAN. Promoted to a module-level free function and updated the five call sites across all three classes. Updated test_dynacell_gan_validate_logs_alias to monkeypatch the module attribute instead of the instance method. Also trimmed three narrating comments inside DynacellGAN (the plan/PR context comments belong in the PR description, not the source) and five per-layer banner comments in patchgan3d.py (the conv literals state the shape and the class docstring covers the layer table). Safe under the running production training's Slurm auto-requeue: no constructor signatures change, AdamW + WarmupCosine pairs are built with the same args, optimizer state shape is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): Phase 2 — DDP topology + 4-GPU smoke + production leaves Adds the DDP infrastructure for pix2pix3d_unetvit so the GAN baseline can train on 4 H200s in ~13 h instead of the ~50 h single-GPU pace measured on job 33132605 (now cancelled). - recipes/topology/ddp_4gpu_gan.yml — DDP variant that wraps the strategy in find_unused_parameters: true. Required because DynacellGAN runs manual_optimization with two optimizers, and the per-step requires_grad toggle leaves one side's params "unused" from the reducer's perspective on each backward. Plain ddp_4gpu.yml stays single-strategy for the other models. - model_overlays/pix2pix3d_unetvit_fit_ddp.yml — mirrors the single-GPU fit overlay but composes ddp_4gpu_gan.yml topology + 4-GPU hardware. - er/pix2pix3d_unetvit/ipsc_confocal/train_smoke_4gpu.yml — 4-GPU smoke variant. Same SEC61B_test48 store, batch_size=2 matching the inherited num_samples=2 divisibility rule. Validates DDP sharding + the find_unused_parameters DDP reducer behavior under the GAN's manual-opt pattern before committing the 4-GPU production run. - er/pix2pix3d_unetvit/ipsc_confocal/train_4gpu.yml — 4-GPU production leaf. Keeps per-rank batch_size=4 (effective 16 across 4 GPUs) and unchanged lr_g/lr_d so the DDP run is comparable to the single-GPU reference at the per-rank optimization settings. logger name and job_name suffixed with _4gpu to disambiguate in W&B/slurm; run_root and ckpt dirpath unchanged so Phase 4 predict leaves keep working. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): retarget membrane pix2pix3d_unetvit save paths The membrane single-set train leaves landed with placeholder paths under .../ipsc/memb_temp/ — diverging from the celldiff convention (memb, not memb_temp) and conflating the a549_mantis run dir with the iPSC tree. Both leaves now point at the canonical .../ipsc/memb/ and .../a549_mantis/memb/ trees, matching the celldiff layout so all organelles can be cross-referenced uniformly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): expand pix2pix3d_unetvit predict/eval matrix Mirrors the er/pix2pix3d_unetvit/ipsc_confocal/ predict+eval set across membrane, mito, and nucleus so every iPSC-trained pix2pix3d_unetvit run has an apples-to-apples inference + evaluation path against both the ipsc_confocal self-target and the three a549_mantis conditions (mock, denv, zikv). Predict leaves swap target-inherited normalizations for a source-only NormalizeSampled (Phase3D) and clear the RandWeightedCropd augmentation. Eval leaves enable compute_feature_metrics only on the ipsc_confocal self-target, matching the existing er convention. Ckpt paths are placeholders pending the corresponding production runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add pix2pix3d_unetvit joint train + predict leaves Adds the joint_ipsc_confocal_a549_mantis variant for all four organelles (er, membrane, mito, nucleus): one train.yml authoring BatchedConcatDataModule with two HCSDataModule children plus the four predict leaves (self iPSC + three a549_mantis conditions). The joint train differs from the celldiff joint by using NormalizeSampled (fov_statistics) instead of MinMaxSampled so joint and single-set pix2pix3d_unetvit runs share the same normalization contract for apples-to-apples ablations. batch_size=2 × num_samples=2 matches the single-set 4 GPU-samples/step (per the BatchedConcatDataModule rule in CLAUDE.md that joint does NOT divide by num_samples). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(viscy-models): drop speculative num_layers from PatchGAN3D PatchGAN3D and MultiScalePatchGAN3D accepted a num_layers constructor parameter that only worked with the value 5 — any other value raised ValueError. The argument was justified as future-proofing for ablations that don't exist yet, which is exactly the speculative flexibility CLAUDE.md prohibits: "No 'flexibility' or 'configurability' that wasn't requested. ... No error handling for impossible scenarios." Re-add it only if and when a non-5 variant lands. Recipe + test fixture updated in lockstep; uv run pytest packages/viscy-models/tests/test_gan/ and the DynacellGAN init/forward tests pass (11/11). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): apply TTUR — drop pix2pix3d_unetvit lr_d to lr_g/10 The 1:1 LR schedule collapsed to D-dominance on DDP run 33176091 (wandb 20260520-002629): loss/d_train fell from 1.3e-2 (ep1) to ~1e-4 (ep12+), loss/g_adv pinned at ~1.0 (saturated MSE(d_fake, 1)), loss/g_l1 flat at ~0.45 from epoch ~5 onward — the adversarial gradient was effectively zero, generator was training as pure L1. Slowing the discriminator 10x (lr_d = lr_g / 10 = 3e-5) is the canonical pix2pix-style fix for D-dominance. Applies to both the single-GPU and DDP fit overlays so the entire Phase 3 baseline (all 4 organelles × {iPSC, A549, joint}) inherits the same adversarial balance regime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(viscy-models): add nonsat / RpGAN losses + R1/R2 grad penalties Adds three loss families and two gradient penalties to viscy_models.gan, extending the existing multi-scale 3D PatchGAN convention to the modern StyleGAN2 / R3GAN recipe: - nonsat_d_loss / nonsat_g_loss: softplus non-saturating loss, verbatim from stylegan2-ada-pytorch/training/loss.py. - rpgan_d_loss / rpgan_g_loss: relativistic pairing loss, verbatim from brownvc/R3GAN/Trainer.py. - r1_penalty / r2_penalty: Mescheder zero-centered gradient penalty on real / fake input. Multi-scale aggregation = mean of per-scale ||grad||^2 (NOT ||grad of sum||^2, which would mix cross-scale derivatives through MultiScalePatchGAN3D's pool-derived second scale). All losses preserve the existing list-of-per-scale-Tensors convention. LSGAN losses are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): wire DynacellGAN modernization knobs (StyleGAN2/R3GAN recipe) Adds opt-in modernization to DynacellGAN with legacy-safe defaults (loss_type=lsgan, r1_gamma=0, ema_kimg=None, lecam_gamma=0). The 14 existing leaves composing pix2pix3d_unetvit_fit.yml are unaffected because none of them set the new knobs and the new defaults preserve pre-modernization behavior. New constructor knobs: - loss_type {lsgan, nonsat, rpgan} + lambda_adv - r1_gamma / r2_gamma / r1_every (lazy schedule with StyleGAN2-style * r1_every unbiased rescaling). RpGAN raises ValueError unless r1_gamma > 0 (R3GAN Theorem 3.1 convergence requirement). - ema_kimg + use_ema_at_predict for the generator EMA shadow. Decay derived per-step as 0.5 ** (global_batch / (ema_kimg * 1000)). - lecam_gamma + lecam_decay (decay=0.9 matches google/lecam-gan reference). Training-step structural changes: - D-step counter (_d_step_count buffer) drives the lazy R1 schedule; self.global_step would advance per opt.step() (2 per training_step with manual_optimization) so it can't be used directly. - R1 / R2 path is outer-guarded on (r1_gamma > 0 or r2_gamma > 0) so legacy leaves skip the grad-of-grad compute and the spectral-norm power-iteration advance. - R1 / R2 compute is wrapped in autocast(enabled=False); Lightning's bf16-mixed plugin would otherwise put bf16 into D forwards which is fragile under create_graph=True grad-of-grad. - RpGAN G step recomputes fresh d_real against the post-update discriminator (matches R3GAN Trainer.py convention). - LeCam EMA update uses sync_dist'd batch means via self.all_gather so all ranks update the buffers identically without explicit all_reduce. - Generator EMA update after opt_g.step() (params are DDP-synced at that point, so per-rank EMA stays in lockstep). Inference paths: - _inference_generator helper returns EMA when available AND use_ema_at_predict; used by forward and predict_step. - validation_step runs both raw and EMA generator forwards (when EMA exists); emits both loss/validate (back-compat) and loss/validate_ema aliases. Checkpoint loading at engine.py:802 changed to strict=False with filtered warning logs (expected_missing for generator_ema.* / _d_step_count / _lecam_ema_*). When loading a pre-modernization checkpoint with EMA enabled, the EMA shadow is seeded from the loaded generator weights so inference paths return the loaded weights instead of the random-init deepcopy. 16 DynacellGAN tests cover legacy-default safety, modernized smoke, the R1/R2 gamma guard, lazy-R1 step counter, RpGAN constructor validation, EMA seeding on legacy load, LeCam buffer registration gating, two-pass validation logging, and inference-helper EMA selection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(dynacell): add modernized pix2pix3d overlays + SEC61B Run A leaf New opt-in configs for the StyleGAN2/R3GAN-style modernized pix2pix3d recipe: - pix2pix3d_unetvit_fit_modernized.yml: single-GPU overlay with loss_type=nonsat, r1_gamma=10, r1_every=16, ema_kimg=10, lr_g=lr_d=2e-4. Existing pix2pix3d_unetvit_fit.yml stays untouched so the 14 legacy leaves continue running LSGAN unchanged. - pix2pix3d_unetvit_fit_ddp_modernized.yml: 4-GPU DDP sibling, same model knobs, composes ddp_4gpu_gan topology (find_unused_parameters=True). - er/pix2pix3d_unetvit/ipsc_confocal/train_4gpu_modernized.yml: Run A of the plan. SEC61B headline experiment. ModelCheckpoint monitors loss/validate_ema (the EMA generator's val loss = inference path). Checkpoints land in a distinct subtree from the LSGAN baseline so there is no collision risk with in-flight runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): harden DynacellGAN against silent failures (PR #428 review) Addresses findings from the PR review subagents: - ema_kimg=0 raises ValueError (silent EMA freeze prevention). - RpGAN now requires both r1_gamma > 0 AND r2_gamma > 0 (R3GAN Theorem 3.1 requires both penalties for sharp-distribution convergence; previously only r1_gamma was validated). - ckpt_path strict=False load: missing-but-not-expected keys now RAISE RuntimeError instead of warnings.warn (silent half-loaded models burn hours of training before users notice). Unexpected-keys path switched from warnings.warn to _logger.warning. - Partial-EMA checkpoints (some but not all generator_ema.* keys) now RAISE — silently half-seeding the EMA shadow would produce nonsense outputs in the missing layers. - on_predict_start logs which generator (EMA vs raw) will be used at inference, so users notice silent fallbacks (predict overlay forgot ema_kimg, use_ema_at_predict=False, etc.). New tests: - test_dynacell_gan_rpgan_requires_r2 - test_dynacell_gan_ema_kimg_zero_raises - test_dynacell_gan_rpgan_smoke (end-to-end trainer.fit with rpgan + R1+R2; exercises the fresh-d_real RpGAN G-step path) - test_dynacell_gan_ema_update_math (closed-form EMA decay verification) - test_dynacell_gan_use_ema_at_predict_false (predict_step returns raw output when EMA is gated off) - test_dynacell_gan_ckpt_unexpected_missing_raises (corrupted ckpt surfaces as RuntimeError, not a swallowed warning) Existing test_dynacell_gan_ema_seed_on_legacy_load updated to use use_spectral_norm=False on both sides — with the stricter unexpected-missing check, SN parametrization name mismatches would otherwise trip the raise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): validate r1_every >= 1 in DynacellGAN.__init__ r1_every is the modulo period for the lazy R1/R2 schedule and is also used as the unbiased rescaling factor on the penalty term. Without an init-time guard, r1_every=0 would ZeroDivisionError on the first D step when R1/R2 is enabled, and a negative value would silently fire the gate on the wrong cadence. Raise at construction so the bad config fails before any training time is burned. Surfaced by Copilot review on PR #428. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dynacell): gate val sample logging on use_ema_at_predict DynacellGAN.validation_step assigned pred_for_samples=pred_ema whenever the EMA submodule existed, regardless of use_ema_at_predict. This contradicted predict_step / forward (both routed through _inference_generator, which respects the flag) — the dashboard could show EMA-generator samples while real predictions were emitted by the raw generator. Now matches the inference path. Regression test stubs out both generators with distinguishable constant modules and reads back validation_step_outputs to confirm which one fed the dashboard for each value of use_ema_at_predict. Surfaced by Copilot review on PR #428. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(dynacell): keep _d_step_count as Python int to drop per-step sync Lazy-reg gate previously read `int(self._d_step_count)` each training step where the counter was a CUDA buffer. Each .item() call forced a device→host sync. Switch to a plain Python int (incremented in training_step, modulo-gated against r1_every) and persist it via on_save_checkpoint / on_load_checkpoint hooks. All DDP ranks start at 0 and advance deterministically, so the buffer-was-broadcast invariant is preserved without per-step sync overhead. Side benefit: `_d_step_count` is no longer a state_dict key, so the strict-load filter drops one expected-missing prefix. Legacy checkpoints without `_d_step_count` resume cleanly at 0 via the dict.get fallback. Surfaced by Copilot review on PR #428. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(dynacell): point modernized overlay header at PR #428, not local path The header comment referenced /home/alex.kalinin/.claude/plans/..., which won't exist for other contributors or in CI. Swap for the PR URL so the recipe rationale is reachable from the file. Surfaced by Copilot review on PR #428. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(dynacell): point SEC61B Run A leaf header at PR #428, not local path Same fix as the modernized overlay header — replace /home/alex.kalinin/.claude/plans/... reference with the PR URL so it resolves for everyone. Surfaced by Copilot review on PR #428. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…g/dynacell-dynadtw PR #432 base modular-viscy-staging diverged from staging/dynacell-dynadtw when #404 squashed all of dynacell-models history into a single commit (393abdf). dynacell-models has 30 newer commits since dynadtw's last sync at f1df36b — re-eval campaign generator, controlled FOV-parallel eval (#426), 3D pix2pix GAN baseline (#425), DeepFeatureBatcher / precompute deep features, cache identity refactors, pix2pix3d_unetvit modernization (#428), backbone bookkeeping refactor onto _BackboneLists. Conflict triage (92 conflicts): - 89 files where staging/dynacell-dynadtw carried stale copies of dynacell-models content (no post-branch edits on dynadtw side): 84 AA conflicts (eval YAML configs, dynacell evaluation sources/tests, CLAUDE.md, tools) + 5 UU conflicts (engine.py, __init__.py, __main__.py, test_engine.py, .github/workflows/test.yml). Resolved by taking the modular-viscy-staging post-#404 version. - 3 files with genuine concurrent edits, resolved per direction: * .gitignore — dynadtw side is a strict superset (added local evaluation_matrix.md exclusion). Take dynadtw. * packages/viscy-models/.../foundation/cell_dino.py — dynadtw added CellDinoModel wrapper + per-image normalize flag; modular-viscy-staging didn't touch post-branch. Take dynadtw. * pyproject.toml — modular-viscy-staging added pytest "slow" marker + addopts "-m not slow" (no-overlap, kept); both sides identically added aics* git sources, py312 bump, pythonpath. iohub: modular-viscy-staging kept a git override pin at afe11f6a (~0.3.3); dynadtw dropped it for iohub>=0.3.5 from PyPI. Took dynadtw's strategy (PyPI 0.3.5 is released; viscy-data dep pin already at >=0.3.5 on this side). - uv.lock regenerated from the merged pyproject.toml (iohub 0.3.5 from PyPI registry). Sanity check: uv sync --all-packages --all-extras + uv run pytest -m "not slow" applications/dynacell/tests/ shows 51 failed / 459 passed / 78 skipped. All 51 failures are pre-existing on dynacell-models HEAD (test_benchmark_config_composition strategy/leaf composition issues + test_evaluation_metrics gain-and-offset) — verified by running the same suite on the clean backbone-refactor worktree. The merge introduces no new regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves the 258 conflicts blocking PR #429 (252 AA from independent adds across both branches' dynacell scaffolding, plus 6 UU). AA resolution: - 250 files → take ours (backbone-refactor). The diffs are uniformly "ours is the newer evolution": cache_dir organelle-namespacing (fdf71bb), any-GPU launcher profile (7943915), Phase B preprocess_version (6c23471), _BackboneLists refactor (PR #429 scope), microssim → cubic (PR #429 scope), pixel-metric unification (PR #429 scope), tests targeting all of the above. Theirs is the older PR #404 squash + PR #432 absorption state. - 2 files → take theirs: * packages/viscy-models/src/viscy_models/gan/__init__.py * packages/viscy-models/src/viscy_models/gan/losses.py Base side has the pix2pix3d_unetvit GAN modernization (lsgan/nonsat/rpgan losses, R1/R2 penalties, EMA, LeCam). Job 33232131 depends on this — must not be regressed. UU resolution: - .gitignore → theirs (add evaluation_matrix.md path). - pyproject.toml (root) → theirs (drop temp cubic/iohub git overrides; both are released on PyPI now). - applications/dynacell/pyproject.toml → ours (cubic==0.7.0a5, no microssim — this IS PR #429's stated change). - applications/dynacell/src/dynacell/engine.py → theirs (DynacellGAN modernization, same reason as the GAN package files). - applications/dynacell/tests/test_engine.py → theirs (GAN modernization test suite). - uv.lock → regenerated via uv sync after fixing pyprojects. Sanity: cross-module imports (dynacell.evaluation + DynacellGAN + viscy_models.gan including new nonsat/rpgan/r1/r2 losses) succeed. pytest applications/dynacell/tests/test_pipeline_cache.py applications/dynacell/tests/test_engine.py → 83/83 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alxndrkalinin changed the title ~~Dynacell models~~ Refactor applications/DynaCell into a reusable benchmark package Apr 14, 2026

alxndrkalinin requested a review from Copilot April 14, 2026 22:35

Copilot started reviewing on behalf of alxndrkalinin April 14, 2026 22:35 View session

Copilot AI reviewed Apr 14, 2026

View reviewed changes

alxndrkalinin requested a review from Copilot April 15, 2026 19:47

Copilot started reviewing on behalf of alxndrkalinin April 15, 2026 19:48 View session

Copilot AI reviewed Apr 15, 2026

View reviewed changes

alxndrkalinin requested a review from Copilot April 17, 2026 03:52

Copilot started reviewing on behalf of alxndrkalinin April 17, 2026 03:53 View session

Copilot AI reviewed Apr 17, 2026

View reviewed changes

Comment thread pyproject.toml

Comment thread applications/dynacell/tools/submit_benchmark_job.py Outdated

Comment thread packages/viscy-utils/tests/test_compose.py Outdated

alxndrkalinin and others added 9 commits April 20, 2026 18:47

docs(configs): remove stale benchmark schema doc

94ed736

alxndrkalinin requested a review from Copilot April 21, 2026 05:47

Copilot started reviewing on behalf of alxndrkalinin April 21, 2026 05:48 View session

Copilot AI reviewed Apr 21, 2026

View reviewed changes

Comment thread applications/dynacell/src/dynacell/evaluation/segmentation.py

alxndrkalinin and others added 6 commits April 21, 2026 08:09

alxndrkalinin and others added 27 commits May 19, 2026 17:15

edyoshikun merged commit 393abdf into modular-viscy-staging May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor applications/DynaCell into a reusable benchmark package#404

Refactor applications/DynaCell into a reusable benchmark package#404
edyoshikun merged 387 commits into
modular-viscy-stagingfrom
dynacell-models

alxndrkalinin commented Apr 7, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

alxndrkalinin commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Config and package reorg

Absorb reusable benchmark modules from dynacell-paper

Package and CLI shape

Runtime, training, and transform fixes

Evaluation and reporting follow-up fixes

Benchmark schema + launcher (landed earlier on this branch)

Evaluation pipeline scale-up

Grouped multi-condition eval + re-eval campaigns

Feature metrics migration to torch-fidelity + linear probes

3D pix2pix GAN baseline + modernization

Topology / trainer-recipe ownership cleanup

Dependency and packaging updates

Why

Verification

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

alxndrkalinin commented Apr 7, 2026 •

edited

Loading

Absorb reusable benchmark modules from `dynacell-paper`

Feature metrics migration to `torch-fidelity` + linear probes