Skip to content

Refactor applications/DynaCell into a reusable benchmark package#404

Merged
edyoshikun merged 387 commits into
modular-viscy-stagingfrom
dynacell-models
May 22, 2026
Merged

Refactor applications/DynaCell into a reusable benchmark package#404
edyoshikun merged 387 commits into
modular-viscy-stagingfrom
dynacell-models

Conversation

@alxndrkalinin
Copy link
Copy Markdown
Collaborator

@alxndrkalinin alxndrkalinin commented Apr 7, 2026

Summary

Makes applications/dynacell into the reusable dynacell package in VisCy.

Splits config responsibilities more cleanly:

  • VisCy keeps reusable dynacell code plus generic model recipes/examples.
  • Paper-specific benchmark state such as SEC61B run configs is removed from VisCy and is expected to live in dynacell-paper.

This PR also adds opt-in validation loss support for CellDiff flow-matching so benchmark reruns can log a scalar validation curve without forcing extra validation work on every flow-matching job.

What changed

Config and package reorg

  • Move reusable configs from applications/dynacell/examples/configs/ into applications/dynacell/configs/.
  • Keep configs/recipes/ for reusable fragments and configs/examples/ for generic runnable examples.
  • Refresh generic fit/predict examples for CellDiff, FNet3D, and UNetViT3D, and add generic UNeXt2 fit/predict examples.
  • Remove committed lightning_logs/, SEC61B paper configs, and the hard-coded hcs_sec61b_3d.yml recipe.
  • Update the dynacell README, config-discovery tests, and local ignore rules to match the new layout.

Absorb reusable benchmark modules from dynacell-paper

  • Add dynacell.data with path-based manifest/collection/spec loaders and Pydantic schemas.
  • Add dynacell.reporting with Hydra config, tables, figures, and tests.
  • Add dynacell.evaluation with configs, spectral-PCC tooling, I/O helpers, segmentation, metrics, and tests.
  • Add dynacell.preprocess with the reusable config.py and zarr_utils.py utilities.

Package and CLI shape

  • Extend the dynacell CLI so evaluate and report route to Hydra entry points while fit/predict/test/validate still go through viscy_utils.
  • Ship Hydra _configs inside the Python package so installed dynacell evaluate/report commands resolve config from package data instead of the source tree.
  • Make dynacell.__init__ lazy so lightweight imports do not pull in the full training stack.
  • Add optional dependency extras for eval, report, and preprocess, plus clearer missing-extra install hints from the CLI router.

Runtime, training, and transform fixes

  • Add overwrite= support to HCSPredictionWriter, including explicit handling/logging when prediction channels already exist.
  • Adjust VisCyCLI checkpoint hparam precedence so predict/test/validate honor user config over stale checkpoint values, while fit resume still keeps checkpoint training hparams.
  • Add direct ckpt_path loading to DynacellFlowMatching so predict-time settings come from config rather than checkpoint metadata.
  • Add opt-in compute_validation_loss to DynacellFlowMatching. When enabled it logs loss/val/<idx> and aggregate loss/validate while preserving epoch-end ODE sample generation; when disabled it keeps the previous cheaper validation path.
  • Extract a shared _aggregate_validation_losses() helper so UNet and flow-matching use the same weighted aggregation logic.
  • Skip the strict HCS spatial-shape check when GPU augmentations intentionally handle cropping.
  • Extend BatchedRandAffined with configurable padding_mode, safe_crop_size, and safe_crop_coverage, plus warnings/tests around unsupported X/Y-rotation guarantees.

Evaluation and reporting follow-up fixes

  • Fix corr_coef/PCC computation and add regression tests.
  • Make cached evaluate_model() returns consistent with fresh evaluation output types.
  • Fix report barplots for mixed-result inputs where some models are missing requested metrics.
  • Respect use_gpu in evaluation instead of always taking the CUDA path when visible.
  • Raise FileNotFoundError for missing preprocess configs instead of failing less clearly.

Benchmark schema + launcher (landed earlier on this branch)

  • Commit the configs/benchmarks/virtual_staining/ layout with shared axes (train_sets/, targets/, model_overlays/, launcher_profiles/, predict_sets/) and per-leaf train/predict files composed via viscy_utils.compose.load_composed_config.
  • Migrate the CellDiff train + predict configs for ER/mito/nucleus/membrane onto schema leaves; add FNet3D paper, UNetViT3D, and UNeXt2 train leaves for ER.
  • Add applications/dynacell/tools/submit_benchmark_job.py with --dry-run, --print-script, --print-resolved-config, and --override. It composes a leaf, strips the launcher: + benchmark: reserved keys, freezes the resolved config to {run_root}/resolved/, and renders an sbatch script to {run_root}/slurm/.
  • Extend _maybe_compose_config in viscy_utils/cli.py so direct dynacell fit -c <leaf> also strips those reserved keys.
  • Add a ckpt_sha256_12 sidecar to the evaluation cache so repeated lookups skip full-file hashing.

Evaluation pipeline scale-up

  • DeepFeatureBatcher + upfront precompute_deep_features pass amortize deep-extractor forwards across FOVs and timepoints; per-side DeepFeatureBatcher.push streams raw crops with a single flush per side.
  • Prediction-side artifact cache (organelle masks + per-cell features for CP/DINOv3/DynaCLR/CELL-DINO) keyed alongside the existing GT cache; pred reruns hit cache after the first eval.
  • Controlled FOV-level parallelism (runtime.executor=process / fov_workers / threads_per_worker) via spawn-context ProcessPoolExecutor, fcntl GPU serialization lock, layered BLAS/OMP thread caps, and region timers dumped to eval_timing.csv (feat(dynacell): controlled FOV-parallel eval + thread budgeting #426).
  • Keep compute_pixel_metrics tensors device-resident (zero-copy ascupy/asnumpy handoff to cubic) so spectral-PCC and FSC stay on GPU on long timelapses.

Grouped multi-condition eval + re-eval campaigns

  • dynacell evaluate-grouped driver loads SuperModel + DINOv3 + DynaCLR + CELL-DINO once and loops over N I/O variants (typically A549 mock/denv/zikv) sharing the same target. Per-condition overlays may override io.* / save.* / runtime.* / cache / limit_positions / force_recompute; model-loading fields are guarded against override at runtime.
  • submit_benchmark_batch.py --array mode for cluster-scale re-eval submission; re-eval campaign generator emits grouped leaves under configs/benchmarks/virtual_staining/_internal/leaf/grouped/.
  • CELL-DINO GT cache warmable via dynacell precompute-gt; compute_feature_metrics=true is now the default in eval configs.

Feature metrics migration to torch-fidelity + linear probes

  • Replace the in-tree FID/KID/PRC math with torch-fidelity's eigvals-based composition; ship per-(FOV, t) FID/KID/median-cosine and dataset-level FID/KID/PRC/MIND/cosine (feat(dynacell): migrate eval metrics to torch-fidelity, add probes #423).
  • Add FOV-stratified linear-probe AUROC (paired [gt; pred] with GroupKFold and per-fold MADScaler fit, no leakage) + indistinguishability mapping; cross-condition probe scaffolding.
  • CP regionprops track gets variance + correlation pruning (vendored pycytominer math) before z-score and metric computation; per-(FOV, t) drop-target-zero + z-score for stability with few cells.
  • Short-circuit FID/KID to NaN at N < 2 cells; auto-shrink KID subset size for small cohorts.

3D pix2pix GAN baseline + modernization

Topology / trainer-recipe ownership cleanup

Four commits (9734d07 → f9b8f1e → 5b7eaae → 3fdb7cf) untangle three layers that were previously mingled in recipes/trainer/fit_*.yml:

  • Runtime profile renamed. runtime_single_gpu.yml → runtime_shared.yml: the file's content (srun + PYTHONUNBUFFERED/NCCL_DEBUG/PYTHONFAULTHANDLER env vars) has nothing single-GPU-specific, and every leaf — including the 4-GPU UNeXt2 leaf — composed it. Rename + 14 leaf-base updates + schema-doc references.
  • New topology layer (recipes/topology/single_gpu.yml + ddp_4gpu.yml) owns trainer.accelerator/devices/strategy/num_nodes. Duplicated independently under applications/dynacell/configs/recipes/topology/ and applications/cytoland/examples/configs/recipes/topology/ because CLAUDE.md forbids cross-application imports.
  • Unified trainer recipesrecipes/trainer/fit.yml + predict.yml per app, replacing fit_1gpu/fit_4gpu/fit_fm_4gpu/predict_gpu. Mode invariants only (seed, log cadence, callbacks, WandbLogger class pinned with per-app project). Topology and precision are owned by the topology recipes and model overlays respectively. Every benchmark model overlay and example leaf now composes [fit.yml + topology/*.yml] and sets precision + max_epochs explicitly.
  • Hardware profiles (hardware_h200_single, hardware_gpu_any_long, hardware_4gpu) drop their trainer: block; they now own only launcher.sbatch.*.
  • Intentional behavior deltas (spelled out, not masked): strategy: ddp → auto on every single-GPU leaf (Lightning-equivalent at devices=1, just intent-honest); dynacell examples fnet3d/unetvit3d/unext2/fit.yml and benchmark unext2.yml now pin WandbLogger project=dynacell instead of falling through to Lightning's default TensorBoard; fit_1gpu-derived paths move from save_top_k: 4 to save_top_k: 5.
  • Preserved via leaf-body overrides: FNet3D paper keeps precision: 32-true; examples/celldiff/fit.yml keeps FM-style epoch-based checkpointing (save_top_k: -1, every_n_epochs: 10, no monitor); cytoland FCMAE pretraining keeps strategy: ddp_find_unused_parameters_true; cytoland→dynacell bridge configs keep project: dynacell override over cytoland's default.
  • LEGACY deletion: applications/dynacell/tools/LEGACY/examples_configs/ (pre-schema CellDiff configs) is removed outright per CLAUDE.md's "avoid backwards-compatibility hacks" rule. The equivalence tests that composed LEGACY (test_*_leaf_matches_legacy, test_fnet3d_paper_leaf_matches_ran_config, test_byte_equivalence_sec61b_train_leaf) are replaced with forward-looking composition sanity tests.
  • Plan + review trail: plan is in ~/.claude/plans/vectorized-sleeping-clock.md; commits ran through /review-plan and /verify-plan subagent rounds before landing.

Dependency and packaging updates

  • Pin microssim to a compatible Git revision and refresh uv.lock so the new dynacell[eval] extra resolves with the current workspace.
  • Allow direct references in hatch metadata so the workspace can build/package those dependency pins cleanly.

Why

This is the VisCy side of the dynacell-paper split:

  • dynacell in VisCy becomes the reusable runtime.
  • dynacell-paper owns benchmark-instance configs, orchestration, and manuscript-specific assets.

That gives us a cleaner boundary for future work like multi-experiment collections and benchmark-specific runners without keeping paper state in the shared application package.

The flow-matching validation-loss change is intentionally opt-in because it adds extra validation compute and the resulting metric is still stochastic; default flow-matching checkpointing remains epoch-based.

The topology / trainer-recipe cleanup at the end removes a silent-drop-DDP trap: before this change, pairing a mismatched trainer recipe and benchmark hardware profile let the hardware profile's trainer.devices win (both layers redundantly set it), which could silently run a DDP training on 1 GPU. Topology is now the single source of truth.

Verification

  • uvx ruff check applications/dynacell/
  • uvx ruff format --check applications/dynacell/
  • .venv/bin/python -m pytest applications/dynacell/tests -q
  • .venv/bin/python -m pytest applications/dynacell/tests/test_data_manifests.py applications/dynacell/tests/test_reporting_tables.py applications/dynacell/tests/test_reporting_tables_extended.py applications/dynacell/tests/test_reporting_figures.py applications/dynacell/tests/test_evaluation_metrics.py applications/dynacell/tests/test_evaluation_pipeline.py applications/dynacell/tests/test_evaluation_io.py applications/dynacell/tests/test_preprocess_config.py applications/dynacell/tests/test_preprocess_zarr_utils.py applications/dynacell/tests/test_cli_routing.py -q
  • .venv/bin/python -m pytest applications/dynacell/tests/test_engine.py -q
  • .venv/bin/python -m pytest applications/dynacell/tests/test_training_integration.py::test_celldiff_fm_validation_loss_keeps_generation applications/dynacell/tests/test_training_integration.py::test_celldiff_fm_warmup_cosine_fast_dev_run applications/dynacell/tests/test_training_integration.py::test_celldiff_fm_constant_schedule_fast_dev_run -q
  • .venv/bin/dynacell fit --print_config -c applications/dynacell/configs/examples/unext2/fit.yml
  • uv run pytest applications/dynacell/tests/test_benchmark_config_composition.py applications/dynacell/tests/test_submit_benchmark_job.py -v (benchmark schema + submit-tool tests)
  • Topology/trainer cleanup end-to-end:
    • Composition snapshot of 35 touched leaves before vs after the cleanup; only expected intentional deltas (strategy ddp → auto at devices=1, WandbLogger added where previously default TB).
    • uv run dynacell fit -c <benchmark-unext2-leaf> --trainer.fast_dev_run=true --trainer.devices=1 --trainer.strategy=auto ran one training step cleanly on an A40 interactive node (loss 1.048).
    • submit_benchmark_job.py <benchmark-unext2-leaf> queued a real 4-GPU DDP job (31122607) via the new schema path.

@alxndrkalinin alxndrkalinin changed the title Dynacell models Refactor applications/DynaCell into a reusable benchmark package Apr 14, 2026
@alxndrkalinin alxndrkalinin requested a review from Copilot April 14, 2026 22:35
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors applications/dynacell from a thin training app into a reusable dynacell benchmark package, reorganizing configs and bringing evaluation/reporting/preprocess/data modules (with tests) into VisCy while updating shared runtime utilities to support the new package shape.

Changes:

  • Reorganize Dynacell configs into configs/recipes/ + configs/examples/, remove paper-/HPC-specific SEC61B artifacts, and update docs/tests for config discovery.
  • Add reusable dynacell.data, dynacell.evaluation, dynacell.reporting, and dynacell.preprocess modules plus substantial test coverage.
  • Update shared VisCy components (CLI checkpoint merging, prediction writer overwrite behavior, affine augmentation safety options, HCS training shape checks) to support the new workflows.

Reviewed changes

Copilot reviewed 74 out of 85 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
packages/viscy-utils/src/viscy_utils/cli.py Adjust checkpoint-hparam vs user-config precedence during LightningCLI parsing.
packages/viscy-utils/src/viscy_utils/callbacks/prediction_writer.py Add overwrite option to allow reusing existing output stores.
packages/viscy-transforms/src/viscy_transforms/_affine.py Add configurable padding_mode and safe-crop scale clamping for GPU affine aug.
packages/viscy-transforms/tests/test_affine.py Extend affine tests to cover scale-floor computation and safe-crop behavior.
packages/viscy-data/src/viscy_data/hcs.py Skip strict training shape check when GPU augmentations handle cropping.
applications/dynacell/src/dynacell/main.py Route evaluate/report to Hydra entry points; keep fit/predict via viscy_utils.
applications/dynacell/src/dynacell/engine.py Add ckpt_path init-time loading to decouple predict-time settings from ckpt hparams.
applications/dynacell/src/dynacell/data/_yaml.py Add shared OmegaConf→Pydantic YAML loader.
applications/dynacell/src/dynacell/data/manifests.py Add dataset manifest + split schemas and path-based loaders.
applications/dynacell/src/dynacell/data/collections.py Add frozen benchmark collection schemas and loader.
applications/dynacell/src/dynacell/data/specs.py Add benchmark-spec schema/loader for tying pipeline stages together.
applications/dynacell/src/dynacell/data/init.py Re-export dynacell.data public API.
applications/dynacell/src/dynacell/evaluation/init.py Initialize evaluation package.
applications/dynacell/src/dynacell/evaluation/pipeline.py Add Hydra-based evaluation pipeline with caching + output writing.
applications/dynacell/src/dynacell/evaluation/metrics.py Add pixel/mask/feature metric computations (optional deps guarded).
applications/dynacell/src/dynacell/evaluation/utils.py Add feature extractors + plotting and distribution-distance helpers.
applications/dynacell/src/dynacell/evaluation/io.py Add eval I/O dispatch helpers (Zarr vs TIFF-like) and preprocessing utilities.
applications/dynacell/src/dynacell/evaluation/segmentation.py Add segmentation workflows with optional dependency guards.
applications/dynacell/src/dynacell/evaluation/formatting.py Add DataFrame formatting utilities for evaluation outputs.
applications/dynacell/src/dynacell/evaluation/torch_ssim.py Add torch-native SSIM implementation used by metrics.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/init.py Initialize spectral_pcc subpackage.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/plot_shading_analysis.py Add analysis plotting script for shading artifact comparison.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/plot_combined.py Add combined metrics plotting + summary script.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/diagnostic_real.py Add Hydra diagnostic workflow for real-data spectra/metrics analysis.
applications/dynacell/src/dynacell/reporting/init.py Re-export reporting public API.
applications/dynacell/src/dynacell/reporting/tables.py Add table aggregation + LaTeX rendering utilities.
applications/dynacell/src/dynacell/reporting/figures.py Add matplotlib figure generation for aggregated metrics.
applications/dynacell/src/dynacell/reporting/cli.py Add Hydra reporting CLI entry point and config path wiring.
applications/dynacell/src/dynacell/preprocess/init.py Re-export preprocess utilities.
applications/dynacell/src/dynacell/preprocess/config.py Add OmegaConf-based preprocess config loader (with fallback).
applications/dynacell/src/dynacell/preprocess/zarr_utils.py Add OME-Zarr rewrite utility (rechunk/reshard).
applications/dynacell/configs/examples/unext2/fit.yml Add generic UNeXt2 fit example in new config layout.
applications/dynacell/configs/examples/unext2/predict.yml Add generic UNeXt2 predict example with init-time ckpt_path.
applications/dynacell/configs/examples/fnet3d/fit.yml Update FNet3D fit example paths to new layout.
applications/dynacell/configs/examples/fnet3d/predict.yml Add generic FNet3D predict example with init-time ckpt_path.
applications/dynacell/configs/examples/unetvit3d/fit.yml Update UNetViT3D fit example paths to new layout.
applications/dynacell/configs/examples/unetvit3d/predict.yml Update UNetViT3D predict example paths and use init-time ckpt_path.
applications/dynacell/configs/examples/celldiff/fit.yml Update CellDiff fit example paths to new layout.
applications/dynacell/configs/examples/celldiff/predict.yml Add CellDiff predict example with init-time ckpt_path and overlap config.
applications/dynacell/configs/recipes/trainer/fit_1gpu.yml Add reusable 1-GPU trainer recipe.
applications/dynacell/configs/recipes/trainer/fit_4gpu.yml Add reusable 4-GPU trainer recipe.
applications/dynacell/configs/recipes/trainer/fit_fm_4gpu.yml Add reusable 4-GPU flow-matching trainer recipe.
applications/dynacell/configs/recipes/trainer/predict_gpu.yml Remove ckpt_path from trainer recipe (moved to model init args).
applications/dynacell/configs/recipes/models/unext2_3d.yml Add reusable UNeXt2 model recipe (z=15).
applications/dynacell/configs/recipes/models/unext2_3d_z8.yml Add reusable UNeXt2 model recipe (z=8).
applications/dynacell/configs/recipes/models/fnet3d.yml Add reusable FNet3D model recipe.
applications/dynacell/configs/recipes/models/fnet3d_z8.yml Add reusable FNet3D model recipe (z=8).
applications/dynacell/configs/recipes/models/unetvit3d.yml Add reusable UNetViT3D model recipe.
applications/dynacell/configs/recipes/models/celldiff_fm.yml Add reusable CellDiff flow-matching model recipe.
applications/dynacell/configs/recipes/modes/spotlight.yml Add Spotlight mode recipe for foreground-aware loss.
applications/dynacell/configs/recipes/data/hcs_phase_fluor_3d.yml Adjust data recipe defaults (e.g., preload false).
applications/dynacell/configs/evaluate/eval.yaml Add Hydra eval base config.
applications/dynacell/configs/evaluate/spectral_pcc/base.yaml Add Hydra config for spectral PCC workflows.
applications/dynacell/configs/evaluate/spectral_pcc/simulate.yaml Add simulation config for metric validation.
applications/dynacell/configs/evaluate/spectral_pcc/diagnostic_real.yaml Add real-data diagnostic config.
applications/dynacell/configs/report/base.yaml Add Hydra report config.
applications/dynacell/tests/test_training_integration.py Update config discovery test to new configs/examples location.
applications/dynacell/tests/test_cli_routing.py Add tests for CLI routing between Lightning and Hydra subcommands.
applications/dynacell/tests/test_data_manifests.py Add tests for dataset manifest/collection/spec schemas and loaders.
applications/dynacell/tests/test_evaluation_io.py Add tests for eval I/O dispatching based on path type.
applications/dynacell/tests/test_evaluation_metrics.py Add regression tests for shared-scale pixel metrics behavior.
applications/dynacell/tests/test_evaluation_pipeline.py Add caching regression test for evaluation pipeline.
applications/dynacell/tests/test_reporting_tables.py Add tests for reporting tables loading/aggregation/LaTeX output.
applications/dynacell/tests/test_reporting_tables_extended.py Add extended reporting table tests (caption/label, lower-is-better).
applications/dynacell/tests/test_reporting_figures.py Add tests for figure creation/saving and empty-data behavior.
applications/dynacell/tests/test_preprocess_config.py Add tests for preprocess config loader fallback behavior.
applications/dynacell/tests/test_preprocess_zarr_utils.py Add tests for zarr rewrite correctness (data, chunks, metadata, shards).
applications/dynacell/README.md Update usage + config layout documentation; remove paper-specific SEC61B instructions.
applications/dynacell/.gitignore Ignore runtime artifacts (lightning_logs, outputs, pycache).
applications/dynacell/pyproject.toml Add optional extras for eval/report/preprocess and allow direct references.
applications/dynacell/examples/configs/sec61b/run_unext2_continue.slurm Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/run_unext2.slurm Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/run_fnet3d_paper.slurm Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/run_fnet3d.slurm Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/fit_unext2.yml Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/sec61b/fit_fnet3d_paper.yml Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/sec61b/fit_fnet3d.yml Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/sec61b/fit_celldiff.yml Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/recipes/data/hcs_sec61b_3d.yml Remove hard-coded SEC61B data recipe.
applications/dynacell/examples/configs/fnet3d/predict.yml Remove old predict example path (superseded by new configs layout).
applications/dynacell/examples/configs/celldiff/predict.yml Remove old predict example path (superseded by new configs layout).
CLAUDE.md Update repo/dev workflow documentation and conventions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread applications/dynacell/src/dynacell/__main__.py
Comment thread packages/viscy-transforms/src/viscy_transforms/_affine.py
Comment thread packages/viscy-transforms/src/viscy_transforms/_affine.py
Comment thread applications/dynacell/src/dynacell/reporting/tables.py
Comment thread applications/dynacell/src/dynacell/preprocess/zarr_utils.py
Comment thread applications/dynacell/pyproject.toml
Comment thread packages/viscy-utils/src/viscy_utils/cli.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 78 out of 91 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/viscy-utils/src/viscy_utils/callbacks/prediction_writer.py Outdated
Comment thread applications/dynacell/src/dynacell/preprocess/config.py Outdated
Comment thread applications/dynacell/src/dynacell/evaluation/pipeline.py Outdated
Comment thread applications/dynacell/src/dynacell/reporting/tables.py
Comment thread applications/dynacell/src/dynacell/reporting/figures.py Outdated
alxndrkalinin added a commit that referenced this pull request Apr 16, 2026
Addresses two PR #404 review findings:

1. `_CacheContext._manifest_dirty` was mutated directly from helper
   call sites, leaking implementation detail. Adds `mark_manifest_dirty`
   and `consume_manifest_dirty` methods and routes every external touch
   through them. Only the dataclass itself now references the private
   field.

2. `resolve_dynaclr_encoder_cfg` used `except Exception` to detect a
   missing nested config key — wider than needed and against CLAUDE.md
   guidance. Replaced with `OmegaConf.select(..., default=None)`, which
   handles missing keys natively without a try/except.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
alxndrkalinin added a commit that referenced this pull request Apr 16, 2026
Addresses a PR #404 review finding: the split GT/pred feature API had
structural tests (empty inputs, column-drop, shape mismatch) but no
pinned-value regression guard. Adds two tests that seed deterministic
synthetic inputs and assert exact output values for CP_FID / CP_KID /
CP_Median_Cosine_Similarity and the DINOv3 equivalents.

If anyone later changes the column-drop, per-side z-score, or
FID/KID/cosine pairing logic — or a dependency shifts numerics — these
tests will fail rather than silently drifting metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alxndrkalinin alxndrkalinin requested a review from Copilot April 17, 2026 03:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 131 out of 143 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pyproject.toml
Comment thread applications/dynacell/tools/submit_benchmark_job.py Outdated
Comment thread packages/viscy-utils/tests/test_compose.py Outdated
alxndrkalinin and others added 9 commits April 20, 2026 18:47
Restructure applications/dynacell/configs/benchmarks/virtual_staining/
so train, predict, and eval leaves for the same benchmark cell live
under one directory `<org>/<train_set>/<model>/`, with shared building
blocks split by consumer:

- `shared/model/` — train/predict building blocks (train_sets,
  predict_sets, targets, model_overlays, launcher_profiles)
- `shared/eval/` — HPC-bound Hydra eval groups (target, feature_extractor/dynaclr)
- `<org>/<train_set>/<model>/{train.yml, predict/<predset>.yml, eval/<predset>.yaml}`

Fold `configs/evaluation/` into the benchmark tree and delete the
sibling root. The eval selector renames from `benchmark=<path>` to
`leaf=<path>` to reflect the new filesystem model; leaves are addressed
through a committed symlink tree at `virtual_staining/leaf/` because
Hydra's group resolution requires a physical `leaf/` prefix. Eval
leaves keep `.yaml` (Hydra only discovers `.yaml` for groups); train
and predict leaves stay `.yml`.

`dynacell.__main__` now injects two `hydra.searchpath` roots — the
benchmark tree (for `leaf/`) and `shared/eval/` (for `target/` and
`feature_extractor/dynaclr/`). The `_EXTERNAL_CONFIGS_SUBPATH` constant
becomes `_EXTERNAL_SEARCHPATHS`, and the helper is renamed to
`_external_configs_dirs` (plural, returning a list).

Also fixes a pre-existing bug: the three unetvit3d train leaves under
mito/nucleus/membrane referenced a non-existent `runtime_single_gpu.yml`
launcher profile; they now use `runtime_shared.yml`. Model-overlay
`base:` entries targeting `recipes/` gain an extra `..` segment to
account for the new `shared/model/` depth.

Tests:
- test_benchmark_config_composition.py: path literals, three new
  (unetvit3d × {mito,nucleus,membrane}) pairs, new
  test_eval_leaf_symlink_resolves parametrised over all 8 eval leaves.
- test_cli_routing.py: three `_external_configs_dir` patches renamed
  (plural) and argv selector strings updated to `leaf=`.
- test_submit_benchmark_job.py: parametrised leaf paths.

Docs: virtual_staining/README.md (full rewrite to match new tree),
evaluation/README.md (groups table, external-users section, benchmark
eval leaves section), UNEXT2_VS_FCMAE_CLASSES.md (stale path), dynacell/
README.md (sbatch example path), BENCHMARK_CONFIG_SCHEMA.md (Hydra
search-path contract note about the `leaf/` symlink adapter).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l/ move

The preceding refactor commit renamed train and predict leaves to
<org>/<train_set>/<model>/{train.yml, predict/<predset>.yml} but the
content updates that prepend the shared/model/ segment to every base:
line were unstaged when that commit landed. This catches the 25 train
and predict leaves up with the new shared/model/ layout, and tightens a
prose comment in fnet3d_paper_fit.yml that still referenced the
pre-refactor shared/targets/ path.

Without this fix a fresh checkout of the reorg commit has train and
predict leaves pointing at shared/train_sets/, shared/targets/, etc. —
paths that no longer exist — so `load_composed_config` would raise
FileNotFoundError on every leaf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trim the WHAT-narration from docstrings and comments in
dynacell.__main__: drop the redundant "two roots are injected"
enumeration (the tuple is self-describing), fix the "repo root"
claim in _external_configs_dirs (the walk stops at the app's
pyproject.toml, not the workspace root), and drop the
"joined with commas" line from _inject_external_configs' docstring.
Bind ``p = parent / sub`` via walrus inside the existence filter
so the path is computed once instead of twice per subpath.

In the composition tests, delete the ``EVAL_LEAVES = PREDICT_LEAVES``
alias (the aliased list was the same object, causing silent coupling)
and drive the symlink-integrity test off ``PREDICT_LEAVES`` directly.
Drop the ``@package _global_`` header assertion from the symlink test
— that check belongs to composition, not to symlink integrity. Add a
``REPO_ROOT`` sanity assert so drift of the ``parents[3]`` magic
count fails loudly.

Tighten ``TestInjectExternalConfigs`` class docstring to a single
line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the "in-package / repo checkout" shorthand in the config-groups
table with concrete source paths, so readers see exactly where each
group lives after the reorg: shared/eval/ for HPC-bound groups, leaf/
for the eval-leaf symlink tree, _configs/ for packaged schema.

Tighten the "Running an evaluation" intro so it no longer implies all
groups live under _configs/, and spell out the two hydra.searchpath
roots dynacell.__main__ injects.

Add a footgun entry for external users about Hydra's .yaml-only group
resolution — saving a group as .yml makes the selector silently fail
with MissingConfigException.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep the top level of virtual_staining/ biology-only: `er/`, `membrane/`,
`mito/`, `nucleus/` plus README.md. Everything else — shared building
blocks and the leaf/ Hydra symlink adapter — now lives under a single
_internal/ directory whose leading underscore signals "implementation
detail; don't browse here for science."

Layout is now:

    virtual_staining/
      <org>/<train_set>/<model>/{train.yml, predict/, eval/}
      _internal/
        shared/
          model/         # train/predict building blocks
          eval/          # target/, feature_extractor/dynaclr/
        leaf/            # symlink tree aliasing canonical eval leaves

dynacell.__main__ injects two hydra.searchpath roots under _internal/
(the tree root for leaf/, and shared/eval/ for target + dynaclr) —
same cardinality as before, just under the hidden support tree.

Train/predict leaves get `_internal/` prepended to every base: path.
Model overlays gain one more `../` to reach configs/recipes/ (now 6
segments deep instead of 5). The 8 leaf/ symlinks are recreated with
6 relative-dot segments (was 5 at virtual_staining/leaf/, now 6 at
virtual_staining/_internal/leaf/).

Docs refreshed across BENCHMARK_CONFIG_SCHEMA.md, virtual_staining/
README.md, and evaluation/README.md: every layout tree, base: example,
group-source column, and search-path reference points at _internal/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related bugs in predict-path tiling:

P1. DynacellFlowMatching.predict_step routed predict_method='sliding_window'
to CELLDiff3DVS.generate_sliding_window(), which is the non-overlapping
tiler and does not consume predict_overlap. Any config that set both
predict_method='sliding_window' and predict_overlap=[...] silently got
non-overlapping tiles, producing different predictions than the user
expected. Now raise ValueError when sliding_window is combined with a
non-zero overlap, pointing users to predict_method='iterative' (the
overlap-anchored velocity tiler) instead. Also refresh the docstring
to spell out which methods consume predict_overlap. Fix the membrane
predict leaf to match its peers (er, mito): predict_method=iterative.
Example config at configs/examples/celldiff/predict.yml gets the same
fix plus an inline comment clarifying which methods overlap.

P2. DynacellUNet.predict_sliding_window allocated both accumulator
buffers (prediction_sum, prediction_count) with zeros_like(source),
so their channel dimension was always in_channels. For models whose
out_channels != in_channels (common: 1 phase in -> 2 target out), the
first `prediction_sum[slicer] += patch_out` raised a broadcast error.
Allocate the accumulators lazily from the first patch output so they
take the model's actual out_channels and dtype.

Regression tests added for both paths:
- sliding_window + non-zero overlap must raise ValueError
- predict_sliding_window returns the correct shape when out_channels=2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier "always append" strategy broke argparse when users placed
diagnostic flags (-c job, --info, --help, ...) after positional
overrides:

    $ dynacell evaluate leaf=er/... -c job
    dynacell: error: unrecognized arguments: hydra.searchpath=[file://...]

Hydra's argparse declares ``overrides`` as a single positional with
nargs="*". Once that nargs run is interrupted by a flag, any later
positional is reported unrecognized. Appending the injected token at
the end scatters a positional across the flag in the override-first
layout; prepending scatters it across the flag in the flag-first layout.

Fix: insert the token adjacent to an existing positional override when
one is present, so positionals stay contiguous whichever side the
flags sit on. Fall back to append when the user passed no positional
overrides (e.g. bare --help).

Three regression tests cover the three layouts: flag-leading,
flag-trailing, and no-positional. The former "appends always" test is
replaced by the layout-specific pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tighten the "Running an evaluation" lead so readers see up front that
repo-checkout groups live under virtual_staining/_internal/ (the
hidden support tree), not loose under virtual_staining/. The detailed
bullet list below already spelled this out; the intro now matches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 191 out of 201 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread applications/dynacell/src/dynacell/evaluation/segmentation.py
alxndrkalinin and others added 6 commits April 21, 2026 08:09
Lightning's SLURMEnvironment rejects --ntasks at runtime
(slurm.py:_validate_srun_variables) and demands --ntasks-per-node.
Yesterday's fix emitted the wrong directive; all four FCMAE jobs
failed within 50 s on the cluster. Switch the sbatch generator,
hardware profiles, pre-submit validator, and invariant test to use
ntasks_per_node throughout. The enforced invariant is now
devices == ntasks_per_node and gpus == nodes × devices.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gated HF repos (DINOv3) need per-user access grants, but the downloaded
weights are identical. Before this change every team member who ran
`dynacell evaluate compute_feature_metrics=true` re-downloaded ~2 GB
of DINOv3 weights into their own `~/.cache/huggingface/`. Even ignoring
the network cost, only one team member has HF gate access right now —
every other member's runs 403 at model fetch.

Fix: point HF_HOME at a team-shared project-storage dir. First user
with gated access downloads once; everyone else reads the shared copy.

Two layers so the behavior is automatic regardless of invocation:

1. `dynacell.__main__._maybe_set_shared_hf_cache()` runs before Hydra
   subcommands (evaluate, precompute-gt, report). It defaults HF_HOME
   to
   `/hpc/projects/comp.micro/virtual_staining/models/dynacell/evaluation/hf_cache/`
   when three conditions hold: the user hasn't set HF_HOME themselves,
   we're on a repo checkout (same signal as external configs), and the
   shared dir exists on this machine. Wheel installs and non-HPC
   machines fall through to the normal ~/.cache/huggingface default.

2. `runtime_shared.yml` also sets HF_HOME for sbatch-rendered fit /
   predict jobs. Those jobs don't currently load HF models, but having
   the env var declared in the launcher profile future-proofs any
   later workflow that does and keeps the team convention visible in
   config.

Four regression tests cover the auto-setter: user-set wins, wheel
install skips, missing shared dir skips, repo checkout + existing dir
sets. README documents the mechanism and first-time download contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three defaults were required on every eval / precompute-gt invocation
even though they are effectively fixed for our benchmark set:

  feature_extractor/dinov3=lvd1689m       (only DINOv3 we evaluate with)
  feature_extractor/dynaclr=default       (team DynaCLR checkpoint)
  io.gt_cache_dir=/hpc/.../eval_cache/<TARGET>   (one dir per target)

Hoist them so the minimal command

  dynacell precompute-gt target=er_sec61b predict_set=ipsc_confocal

composes everything needed for all four cache families, and

  dynacell evaluate leaf=<...>

carries the same defaults without the redundant override directives.

- eval.yaml defaults list switches
    optional feature_extractor/dinov3: null   -> lvd1689m
    optional feature_extractor/dynaclr: null  -> default
  The `optional` keyword means wheel installs that don't ship
  dynaclr=default (it lives in the repo's shared/eval/ tree) silently
  skip the default — wheel users supply their own via --config-dir.
- Each target YAML now carries io.gt_cache_dir for its target's cache
  path (SEC61B, TOMM20, nucleus, membrane).
- The 8 canonical eval leaves drop the redundant
    override /feature_extractor/dinov3: lvd1689m
    override /feature_extractor/dynaclr: default
  and the per-leaf io.gt_cache_dir — all now inherited.
- README updated: "Enable feature metrics" and "Priming the cache"
  examples show the one-line minimal forms.

To swap variants at call-time nothing changes:
  dynacell evaluate ... feature_extractor/dinov3=<other_variant>
  dynacell evaluate ... feature_extractor.dynaclr.checkpoint=/other.ckpt
  dynacell evaluate ... feature_extractor/dinov3=null   # disable

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues from /simplify agents:

- Hardcoded HPC path in _SHARED_HF_CACHE makes the module CZ Biohub-
  specific; other sites couldn't override without editing source. Read
  DYNACELL_SHARED_HF_CACHE at module load and fall back to the CZ
  Biohub default only when the env var is absent. Extension story is
  now: set DYNACELL_SHARED_HF_CACHE in your shell / sbatch env block
  and the rest of the mechanism just works.
- Docstring on _maybe_set_shared_hf_cache named the sibling helper
  _external_configs_dirs; rephrase in semantic terms ("external Hydra
  searchpaths resolve") so the docstring doesn't rot if the detection
  mechanism ever changes.
- Four test methods in TestMaybeSetSharedHfCache imported `os` inline
  instead of at module top (violates CLAUDE.md). Plus a redundant
  monkeypatch.delenv() cleanup at the end of the last test that
  pytest's monkeypatch fixture already handles on teardown. Lift
  `import os` to module top, drop the inline imports, drop the
  redundant cleanup.

Skipped /simplify nitpicks:
- functools.cache on _external_configs_dirs (6-8 stat calls saved
  per CLI invocation; not worth the cognitive cost of a cache that
  has to coexist with tests that patch the function reference)
- Parametrize the 4 tests (4 is small enough to leave explicit)
- Inline comment on opaque group names (marginal signal)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Setting HF_HOME to the team-shared project-storage dir inadvertently
broke per-user gated-repo authentication: HF_HOME relocates the entire
HF directory tree including the auth token file. The first precompute
attempt with DINOv3 enabled came back 401 "Access to model ... is
restricted. You must have access to it and be authenticated to access
it" — not a missing grant (the user had been granted access, which
would have returned 403), but a missing token. HF was looking under
$HF_HOME/token, where no token file lives.

Switch to HF_HUB_CACHE, which only relocates the weights/datasets
cache and leaves auth state at its default ~/.cache/huggingface/token.
Each user's personal token now enforces gate ACLs on their own
account; the cache of downloaded weights is shared on project
storage as intended.

The _maybe_set_shared_hf_cache helper and its four tests flip from
HF_HOME → HF_HUB_CACHE. The runtime_shared.yml launcher env block
does the same. README gains a paragraph explaining *why* we chose
HF_HUB_CACHE over HF_HOME, and clarifies that subsequent reads from
the shared cache are plain disk reads that don't round-trip to HF and
so don't need a token at all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous 4-GPU FCMAE runs OOM-killed before first training step. On
gpu-c-1 sstat showed MaxVMSize=103G/rank at startup — 4 ranks × 103G
≈ 412G of virtual memory, plus mmap_preload /dev/shm staging, pushed
past the 512G cgroup cap. num_workers=8 per rank meant 32 dataloader
workers inflating per-rank VM before the first step. Halving to 4
cuts startup VM roughly in half while keeping prefetch above the
single-worker bottleneck.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
alxndrkalinin and others added 27 commits May 19, 2026 17:15
Collapse the 5-kwarg ``(gt_plate_path, gt_channel_name, ...,
pred_plate_path, pred_channel_name)`` signatures on ``check_cache_identity``
and ``seed_cache_identity`` into a single ``source: Literal["gt", "pred"]
| None`` + ``plate_path`` + ``channel_name`` triple. The paired-None
runtime guards collapse to one check and the public surface no longer
encodes the GT/pred axis in parameter names.

The new ``seed_cache_identity`` accepts only one source per call; the
test that previously seeded both sides in a single invocation splits
into two sequential calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collapse ``init_cache_context`` and ``init_pred_cache_context`` into one
function with a required ``side: Literal["gt", "pred"]`` kwarg. The two
near-duplicate ~100-LOC bodies become one ~80-LOC body; ``side`` selects
which manifest key, which cache dir, and which identity fields to check.

The cross-side contamination guard (``manifest.get(other_side) is not
None``) now runs in both directions. Today only the pred path rejected
GT-contaminated manifests; the GT path will now also reject pred-side
contamination — a config error in either direction fails fast instead of
silently returning a half-valid cache.

``side`` has no default — callers must spell out their intent. The two
GT call sites (``pipeline.py:153`` and ``precompute_cli.py:82``) and
the one pred call site (``pipeline.py:160``) all pass it explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop six 1-line trampolines and ``_ensure_cache_side``. Export three
side-agnostic helpers instead — ``fov_masks``, ``fov_cp_features``,
``fov_deep_features`` — each reading ``ctx.side`` for the per-side
artifact label, force-recompute key, and manifest source tag.

The shared private helpers (``_fov_masks``, ``_fov_deep_features``,
``_load_or_compute_feature_timepoints``) were already side-agnostic;
this just removes the redundant public layer above them.

The compute backends (``cp_target_regionprops``,
``deep_target_features``) are body-identical to their pred-side twins
— only their parameter name differs. The new ``fov_cp_features`` and
``fov_deep_features`` call the target-side functions on both sides as
a stop-gap; a follow-up commit drops the unused pred-side variants and
renames the survivors to ``cp_regionprops`` / ``deep_features``.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The three ``*_target_*`` / ``*_pred_*`` metric pairs in ``metrics.py`` are
body-identical — they only differ by parameter name. Drop the duplicates
and rename the survivors:

- ``cp_target_regionprops`` + ``cp_pred_regionprops`` → ``cp_regionprops``
- ``build_target_crops`` + ``build_pred_crops`` → ``build_crops``
- ``deep_target_features`` + ``deep_pred_features`` → ``deep_features``

The argument becomes ``image`` (was ``target`` / ``prediction``); the
``Shape mismatch:`` error now reads ``image`` in its message.

The eval pipeline's cache-disabled fallback in ``pipeline.py`` and the
``fov_cp_features`` / ``fov_deep_features`` compute backends all switch
to the unified names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hoist the per-FOV pred-feature dispatch into ``_fov_pred_features_per_t``,
which returns a ``{kind: list[T]}`` dict regardless of whether the pred
cache is enabled. The per-timepoint reads inside the inner loop become a
straight indexing pass with no ``if pred_cache_ctx.enabled`` branch.

The cache-disabled branch keeps the crop-sharing optimization (one set
of 2-D crops per timepoint, reused across DINOv3/DynaCLR/CellDINO) so
the max-projection + cell-iteration + crop construction still runs once
per (FOV, t) instead of once per backbone.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ties

Four functions in ``pipeline_cache.py`` independently recomputed
``label_prefix = "pred_" if ctx.side == "pred" else ""`` and
``source = {"source": "prediction"} if ctx.side == "pred" else {}`` from
``ctx.side``. Lift both into ``_CacheContext.label_prefix`` and
``source_tag`` properties so the side-encoding lives next to ``side`` and
the call sites just read it.

Two more ternaries in ``init_cache_context`` (``cache_dir_key``,
``other_side``, and the ``io.{gt,pred}_path`` / ``io.{gt,pred}_channel_name``
selection) collapse into ``_SIDE_IO_KEYS`` and ``_OTHER_SIDE`` lookup
tables. Net: same behavior, one place to edit per axis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…point

The cache-disabled branch of ``_fov_pred_features_per_t`` previously
materialized a ``t_count``-long ``crops`` list, then walked it three
more times in separate comprehensions for DINOv3, DynaCLR, and CellDINO.
All ``t_count`` 2-D crop tensors sat in RAM until the function returned.

Fuse into one loop: build crops for one timepoint, fan out to every deep
backbone, then drop them before the next timepoint. Same total work;
peak crops memory drops from ``t_count`` copies to one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cross-FOV/cross-timepoint accumulator for deep features that batches
DINOv3/DynaCLR/CellDINO forwards across the whole dataset, splits
results by per-(pos, t) row counts, and writes through the existing
zarr cache contract. Force flags are snapshotted at construction so
mid-scan flushes don't flip semantics for later positions; ctx.force
is only cleared in drain() for kinds actually flushed.

Lock-tag construction for the per-(family, position) write lock is
hoisted to _kind_lock_tag so _load_or_compute_feature_timepoints and
_flush_kind cannot drift.

No call sites yet; per-FOV path (fov_deep_features) keeps working for
the cache-disabled dev path. The wiring lands in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three tests for the cross-FOV batched precompute path:
- writes_all_slots: every (pos, t) lands in the cache with the
  expected (n_cells, feature_dim) shape.
- matches_per_fov: precompute_deep_features and fov_deep_features
  produce bit-exact identical cache contents (golden equivalence
  with a deterministic stub extractor).
- skips_cached_slots: pre-seeded (pos, t) slots are not re-extracted
  and the original cache contents survive.

Helpers (_write_tiny_hcs_plate, _write_tiny_seg_plate,
_BatchedConstantExtractor) are local to this test module per the
project's no-conftest-import rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run precompute_deep_features once over the whole position set before
the per-FOV loop so DINOv3/DynaCLR/CellDINO forwards amortize across
the dataset instead of paying Python/CUDA dispatch per (FOV, t).
After precompute, the existing fov_deep_features / pred per-t calls
hit the cache.

Paired pos-name validation is hoisted so precompute cannot write to
mismatched cache slots; the in-loop checks stay as defense in depth.

Sides whose ctx.require_complete is true are skipped at precompute
time — the main loop's _raise_if_require_complete keeps enforcing the
fail-loud miss contract on those sides.

Threshold is exposed via feature_metrics.deep_feature_batch_threshold
(default 256 when the key is absent; explicit default lands in the
config-knob commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the per-kind fov_deep_features calls in precompute-gt's per-FOV
loop with a single streaming push to DeepFeatureBatcher. The existing
loop already loaded the GT plate FOV once for masks + CP; deep features
now reuse that read and batch forwards across FOVs/timepoints.

build.{masks,cp} still go through the per-FOV helpers because they're
not part of the batched pipeline. flush_manifest stays in the loop to
preserve crash-recovery progress; batcher.drain() runs after the loop
to land any pending crops + reset force flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Set feature_metrics.deep_feature_batch_threshold = 256 so the cross-
FOV/cross-timepoint batched precompute path has a clear default that
shows up in the resolved Hydra config. Both evaluate_predictions and
precompute-gt read this key via OmegaConf.select with a fallback of
256 — explicit default lets downstream leaves (e.g. benchmark configs)
override per-run without touching code.

CellDINO's internal batch_size is 32, so threshold=256 amortizes 8
internal chunks per flush — comfortable on a single H100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ecks

_flush_kind never read its `side` arg (ctx.side was authoritative),
hence the `del side` workaround. Remove the parameter and update both
batcher call sites.

The per-iteration position-name checks inside evaluate_predictions'
main loop are subsumed by the upfront hoisted validation block that
walks the same zip; nothing mutates pred_positions / gt_positions /
seg_positions between the hoist and the loop. Underscore the unused
position names in the unpacking instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous push API took a pre-stacked (n_cells, H, W) ndarray plus
a (0, 0, 0) sentinel for zero-cell slots; _flush_kind then destacked
it back into a per-cell list to feed features_from_crops. The round
trip + the sentinel were pure ceremony — features_from_crops already
handles an empty list and the per-cell list shape it needs.

Take the raw list[np.ndarray] from build_crops end to end. Both
callers (precompute_deep_features, precompute-gt CLI) drop their
local stack-or-sentinel expression; _flush_kind drops the per-row
Python loop and uses a flat comprehension over items.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
_flush_kind called flush_manifest(ctx) after every batched write,
which under typical settings is 3 backbones × ceil(N_cells /
flush_threshold) round trips per session — each one locks the global
manifest, reloads YAML, merges, and writes. The "crash-resilience"
case the inline flush nominally protected was already covered: zarr
slot writes are durable per slot, so on resume the lockless prefetch
finds them whether or not the manifest is current.

Move the flush to the caller, after batcher.drain(). evaluate_predic-
tions flushes every active side; precompute-gt flushes once after the
loop. Manifest write cadence drops to O(active_sides) for eval and
O(1) for precompute, matching the durability the caller wants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Torch metrics (PCC/SSIM/NRMSE/PSNR) ran on CUDA but then both tensors
were force-moved to CPU numpy and fed to cubic's spectral_pcc /
fsc_resolution — round-tripping through host memory every timepoint.
cubic's spectral metrics are array-module-agnostic and accept cupy
inputs directly, so the host bounce is pure overhead.

Hand the tensors off via cubic.cuda.ascupy / asnumpy (zero-copy on
CUDA via the CUDA Array Interface, per cubic AGENTS.md) so PCC/SSIM/
NRMSE/PSNR/Spectral_PCC/FSC all stay on one device. Also defer
_require_cubic to the path that actually needs it, so torch-only runs
no longer require the cubic/cucim/cupy extras to be installed.

Smoke: on a 15x512x512 volume with spectral PCC enabled, per-call wall
time drops from ~2.95 s (CPU bounce) to ~70 ms (GPU-resident). CPU and
GPU paths agree to ~1e-7 relative on PCC/NRMSE/PSNR/Spectral_PCC and
~6e-5 on SSIM (float32 cuDNN noise).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…doff

Three small cleanups in compute_pixel_metrics from the simplify pass:

1. fsc_resolution mean-centers internally before every FFT (see
   cubic/metrics/spectral/frc.py lines 171/251/696/877), so the outer
   ``target - target.mean()`` / ``prediction - prediction.mean()`` was
   wasted: two extra full-volume allocations + reductions per (FOV, t).
2. Hoist ``to_xp = ascupy if cuda else asnumpy`` so the two-line branch
   collapses to one converter call per tensor.
3. Drop the ``_xp`` suffix; reuse the ``prediction, target`` names after
   conversion. cubic's AGENTS.md explicitly steers away from importing
   the ``xp`` convention into callers.

Docstring restructured to lead with the load-bearing WHY (host bounce
dominated wall time on long timelapses) instead of restating the
implementation.

Numerics unchanged: CPU and GPU paths still agree to ~1e-7 on
PCC/NRMSE/PSNR/Spectral_PCC and ~6e-5 on SSIM on a 15x256x256 smoke run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(viscy-models): add 3D PatchGAN discriminator and LSGAN losses

Introduces a new gan subpackage with PatchGAN3D and MultiScalePatchGAN3D
(pix2pixHD-style, 2-scale by default), plus lsgan_d_loss / lsgan_g_loss
list-based helpers. These will pair with UNetViT3D to form a 3D pix2pix
adversarial baseline alongside the existing regression and flow-matching
DynaCell baselines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add DynacellGAN Lightning module for 3D pix2pix baseline

DynacellGAN composes the existing UNetViT3D generator with the new
MultiScalePatchGAN3D discriminator, training with LSGAN + L1 (λ=100) via
manual optimization (automatic_optimization=False). A requires_grad
toggle separates the D step from the G step so each backward only
populates the active side's grads; D grads are also explicitly cleared
between optimizers so the test invariant (no D grads after a G step)
holds for Phase 2 DDP correctness.

Validation logs the existing loss/validate weighted-mean alias, so
ModelCheckpoint(monitor="loss/validate") keys off the same metric used
across the other DynaCell baselines. configure_optimizers builds two
AdamW + WarmupCosine pairs directly to fit the two-optimizer return
shape Lightning's manual mode expects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add pix2pix3d_unetvit recipe, model overlays, and smoke leaf

The recipe declares dynacell.engine.DynacellGAN with UNetViT3D's
production generator config and a 2-scale spectral-normalized PatchGAN3D.
fit / predict model overlays mirror the unetvit3d ones with lr_g/lr_d
split (no scalar lr) and bf16-mixed precision.

The single-GPU smoke leaf for er/pix2pix3d_unetvit/ipsc_confocal sets
benchmark.dataset_ref: null to disable the compose-hook resolver so it
can point data_path at SEC61B_test48.zarr directly, and uses batch_size=2
to match the inherited num_samples=2 (HCSDataModule divisibility rule).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add ER iPSC production train leaf for pix2pix3d_unetvit

Mirrors er/unetvit3d/ipsc_confocal/train.yml with the model overlay and
all model_name / experiment_id / run_root / job_name strings retargeted
to pix2pix3d_unetvit. Keeps ModelCheckpoint.monitor: loss/validate
unchanged — DynacellGAN logs that key via on_validation_epoch_end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add pix2pix3d_unetvit single-set train leaves across organelles

Adds 7 production single-GPU training leaves for pix2pix3d_unetvit:
- er/pix2pix3d_unetvit/a549_mantis
- {membrane, mito, nucleus}/pix2pix3d_unetvit/{ipsc_confocal, a549_mantis}

Each leaf mirrors the matching unetvit3d cell with model_overlay,
model_name, experiment_id, logger name/save_dir, ckpt dirpath, launcher
run_root/job_name retargeted to pix2pix3d_unetvit. ModelCheckpoint
monitor=loss/validate preserved.

Joint training leaves (joint_ipsc_confocal_a549_mantis) deferred — still
gated on the open BatchedConcatDataModule DDP deadlock (handoff
2026-04-26).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add iPSC predict + eval leaves for er/pix2pix3d_unetvit

Mirrors er/unetvit3d/ipsc_confocal/{predict,eval}__ipsc_confocal.* with
the model overlay, model_name, experiment_id, ckpt_path, output_store
filename, and eval pred_path / save_dir all retargeted to
pix2pix3d_unetvit. predict_method stays full_image (the only mode
DynacellGAN.predict_step supports for v1).

ckpt_path uses an explicit REPLACE_ME_WITH_PRODUCTION_CHECKPOINT_PATH
placeholder so torch.load raises FileNotFoundError with the placeholder
visible if the user forgets to override it — running predict against an
untrained generator would silently produce garbage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add a549_mantis predict + eval triplet for er/pix2pix3d_unetvit/ipsc

Adds the 3 a549_mantis test splits (mock, denv, zikv) for the
iPSC-trained pix2pix3d_unetvit ER cell — enables cross-dataset transfer
evaluation. Mirrors the unetvit3d ipsc_confocal predict/eval triplet
with the same retargeting rule already used for the iPSC pair.

ckpt_path uses the REPLACE_ME_WITH_PRODUCTION_CHECKPOINT_PATH placeholder
so any forgotten override fails loudly via FileNotFoundError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(dynacell): /simplify pass — reuse configure_adamw_scheduler, dedupe _log_samples

DynacellGAN.configure_optimizers used to rebuild AdamW + WarmupCosineSchedule
inline despite the class docstring claiming it called the shared helper.
Now calls configure_adamw_scheduler twice and concatenates, dropping the
redundant schedule guard and the unused monai/torch.optim imports.

_log_samples was a byte-identical instance method on DynacellUNet,
DynacellFlowMatching, and DynacellGAN. Promoted to a module-level free
function and updated the five call sites across all three classes.
Updated test_dynacell_gan_validate_logs_alias to monkeypatch the module
attribute instead of the instance method.

Also trimmed three narrating comments inside DynacellGAN (the plan/PR
context comments belong in the PR description, not the source) and five
per-layer banner comments in patchgan3d.py (the conv literals state the
shape and the class docstring covers the layer table).

Safe under the running production training's Slurm auto-requeue: no
constructor signatures change, AdamW + WarmupCosine pairs are built with
the same args, optimizer state shape is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): Phase 2 — DDP topology + 4-GPU smoke + production leaves

Adds the DDP infrastructure for pix2pix3d_unetvit so the GAN baseline
can train on 4 H200s in ~13 h instead of the ~50 h single-GPU pace
measured on job 33132605 (now cancelled).

- recipes/topology/ddp_4gpu_gan.yml — DDP variant that wraps the strategy
  in find_unused_parameters: true. Required because DynacellGAN runs
  manual_optimization with two optimizers, and the per-step requires_grad
  toggle leaves one side's params "unused" from the reducer's perspective
  on each backward. Plain ddp_4gpu.yml stays single-strategy for the
  other models.

- model_overlays/pix2pix3d_unetvit_fit_ddp.yml — mirrors the single-GPU
  fit overlay but composes ddp_4gpu_gan.yml topology + 4-GPU hardware.

- er/pix2pix3d_unetvit/ipsc_confocal/train_smoke_4gpu.yml — 4-GPU smoke
  variant. Same SEC61B_test48 store, batch_size=2 matching the inherited
  num_samples=2 divisibility rule. Validates DDP sharding + the
  find_unused_parameters DDP reducer behavior under the GAN's manual-opt
  pattern before committing the 4-GPU production run.

- er/pix2pix3d_unetvit/ipsc_confocal/train_4gpu.yml — 4-GPU production
  leaf. Keeps per-rank batch_size=4 (effective 16 across 4 GPUs) and
  unchanged lr_g/lr_d so the DDP run is comparable to the single-GPU
  reference at the per-rank optimization settings. logger name and
  job_name suffixed with _4gpu to disambiguate in W&B/slurm; run_root
  and ckpt dirpath unchanged so Phase 4 predict leaves keep working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): retarget membrane pix2pix3d_unetvit save paths

The membrane single-set train leaves landed with placeholder paths under
.../ipsc/memb_temp/ — diverging from the celldiff convention (memb, not
memb_temp) and conflating the a549_mantis run dir with the iPSC tree.
Both leaves now point at the canonical .../ipsc/memb/ and
.../a549_mantis/memb/ trees, matching the celldiff layout so all
organelles can be cross-referenced uniformly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): expand pix2pix3d_unetvit predict/eval matrix

Mirrors the er/pix2pix3d_unetvit/ipsc_confocal/ predict+eval set across
membrane, mito, and nucleus so every iPSC-trained pix2pix3d_unetvit run
has an apples-to-apples inference + evaluation path against both the
ipsc_confocal self-target and the three a549_mantis conditions (mock,
denv, zikv). Predict leaves swap target-inherited normalizations for a
source-only NormalizeSampled (Phase3D) and clear the RandWeightedCropd
augmentation. Eval leaves enable compute_feature_metrics only on the
ipsc_confocal self-target, matching the existing er convention. Ckpt
paths are placeholders pending the corresponding production runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add pix2pix3d_unetvit joint train + predict leaves

Adds the joint_ipsc_confocal_a549_mantis variant for all four
organelles (er, membrane, mito, nucleus): one train.yml authoring
BatchedConcatDataModule with two HCSDataModule children plus the four
predict leaves (self iPSC + three a549_mantis conditions). The joint
train differs from the celldiff joint by using NormalizeSampled
(fov_statistics) instead of MinMaxSampled so joint and single-set
pix2pix3d_unetvit runs share the same normalization contract for
apples-to-apples ablations. batch_size=2 × num_samples=2 matches the
single-set 4 GPU-samples/step (per the BatchedConcatDataModule rule
in CLAUDE.md that joint does NOT divide by num_samples).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(viscy-models): drop speculative num_layers from PatchGAN3D

PatchGAN3D and MultiScalePatchGAN3D accepted a num_layers constructor
parameter that only worked with the value 5 — any other value raised
ValueError. The argument was justified as future-proofing for ablations
that don't exist yet, which is exactly the speculative flexibility
CLAUDE.md prohibits: "No 'flexibility' or 'configurability' that wasn't
requested. ... No error handling for impossible scenarios." Re-add it
only if and when a non-5 variant lands. Recipe + test fixture updated
in lockstep; uv run pytest packages/viscy-models/tests/test_gan/ and
the DynacellGAN init/forward tests pass (11/11).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): apply TTUR — drop pix2pix3d_unetvit lr_d to lr_g/10

The 1:1 LR schedule collapsed to D-dominance on DDP run 33176091
(wandb 20260520-002629): loss/d_train fell from 1.3e-2 (ep1) to
~1e-4 (ep12+), loss/g_adv pinned at ~1.0 (saturated MSE(d_fake, 1)),
loss/g_l1 flat at ~0.45 from epoch ~5 onward — the adversarial
gradient was effectively zero, generator was training as pure L1.

Slowing the discriminator 10x (lr_d = lr_g / 10 = 3e-5) is the
canonical pix2pix-style fix for D-dominance. Applies to both the
single-GPU and DDP fit overlays so the entire Phase 3 baseline
(all 4 organelles × {iPSC, A549, joint}) inherits the same
adversarial balance regime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dynacell): add eval runtime module + thread budgeting (C1)

Adds dynacell.evaluation.runtime with three layered thread-cap entry points:
early_apply_env_caps() sets BLAS/OMP env vars before C-extension load (called
from __main__.main_cli); apply_thread_budget(threads) is the in-process safety
net (env + torch.set_num_threads + threadpool_limits); resolve_runtime(config)
materializes the new runtime config block.

The runtime block (eval.yaml) is opt-in for parallelism (default executor=serial,
fov_workers=1) but unconditionally tightens BLAS/OMP threads to cpu_count by
default. On SLURM nodes where --cpus-per-task < host CPU count, this corrects
previously-loose BLAS threading. Set DYNACELL_THREADS_PER_WORKER=N to override
at the env layer for cases where in-process caps come too late.

precompute_gt_artifacts validates that runtime.executor='serial' and
fov_workers=1 — FOV-level parallelism is restricted to evaluate_predictions
in this commit; precompute parallelism is out of scope (DeepFeatureBatcher
accumulates state across FOVs).

resolve_runtime supports two-phase resolution for evaluate_predictions:
top-of-function for parent BLAS cap, then post-position-list with
freeze_threads_per_worker= to keep parent and worker BLAS budgets consistent.
Worker primitives (_process_one_fov, ProcessPoolExecutor) land in C3+C4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): per-T memory hygiene + region timing instrumentation (C2)

Adds always-on region timers (RegionTimer context manager) that record
wall time of FOV-level and per-T regions to <save_dir>/eval_timing.csv.
Region taxonomy matches actual execution placement at pipeline.py:341-505:

  FOV-level (t=None): mask_gt, mask_pred, cp_gt, deep_gt_dinov3,
  deep_gt_dynaclr, deep_gt_celldino, features_pred_per_t, microssim, seg_write
  Per-T: pixel_metrics, mask_metrics, feature_pairwise

Timing overhead is ~0.1 ms per region per timepoint; ~120 ms total on a
9-h eval (~4e-6 of wall time) — no gating knob.

Per-T memory hygiene knobs (runtime.cuda_empty_cache_every_n_timepoints,
runtime.gc_collect_every_n_fovs) ship at 0 (no behavior change) in this
commit. The plan calls for measuring and flipping defaults in C2 — that
measurement requires a real eval run on the test48 zarr that exceeds this
session's scope. Defaults stay at 0; observed late-T degradation in
prior A549 ER evals (18 -> 61 s/T over 10 timepoints) is the documented
motivation but not the measured fix in this commit.

DYNACELL_FORCE_PER_T_HYGIENE=1 env var flips both knobs on at runtime
regardless of YAML defaults — operator escape hatch for post-ship
mitigation without a code change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(dynacell): extract per-FOV body into _process_one_fov + FovResult (C3)

Pure refactor: extract the per-FOV body of evaluate_predictions into a
module-level function _process_one_fov(...) -> FovResult and an
_aggregate_fov_result(...) parent-side aggregator. Serial-path behavior
is identical to the prior code (verified: 384/48 pass/fail parity with
HEAD~2; the 48 failures are all pre-existing).

FovResult carries everything one FOV contributes to the run as a
picklable dataclass:
- per_t_pixel_rows (with microssim already merged in per pipeline.py:471
  semantics — no separate microssim_rows field),
- per_t_mask_rows, per_t_feature_rows,
- seg_array of shape (T, 2, D, H, W) bool (channel 0 = pred, channel 1 = GT),
- 4 _BackboneLists for cp/dinov3/dynaclr/celldino — each holds the per-T
  pred_feats/gt_feats/pred_fovs/gt_fovs/pred_ts/gt_ts arrays. Workers
  preserve the existing `if size > 0: append` guard so empty per-T slots
  never enter the lists; aggregator extends parent lists unconditionally.
- timings slice from the per-process region_timer collector.

_aggregate_fov_result writes the seg array to the HCS plate (still in the
parent — concurrent create_position from workers would race plate
metadata), extends per-T row lists, and extends the 24 dataset-level
accumulators.

This refactor positions C4 to add ProcessPoolExecutor + worker primitives
without further changes to the FOV body. flush_manifest stays in the
parent's loop for now; under process mode (C4) each worker will flush
its own ctx at FOV end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): wire ProcessPoolExecutor + GPU serialization lock (C4)

Adds worker primitives in runtime.py (_worker_initializer,
gpu_serialization_lock, make_fov_executor) and the worker entry point
_worker_run_fov in pipeline.py. evaluate_predictions now branches on
runtime.executor: serial (default) runs _process_one_fov inline
identical to today; process spawns a ProcessPoolExecutor over FOVs.

Worker design:
- Spawn context avoids inheriting parent CUDA. _worker_initializer
  sets BLAS/OMP env caps BEFORE apply_thread_budget lazy-imports torch.
- _worker_setup lazy-loads seg_model + extractors under
  gpu_serialization_lock(): N workers don't init CUDA + load weights
  concurrently. Plate handles cached per-worker.
- GPU touch sites inside _process_one_fov are wrapped in
  gpu_serialization_lock(): mask_gt/pred, cp_gt, deep_gt_{dinov3,dynaclr,
  celldino}, features_pred_per_t, pixel_metrics (per-T, GPU-bound
  end-to-end including cubic spectral), and on-demand cellpose fallback
  inside mask_metrics. In serial mode the lock has no path set and yields
  immediately — bit-for-bit-identical behavior.
- Lock file at /tmp/dynacell_gpu_<SLURM_JOB_ID-or-PID>.lock (node-local,
  not NFS). One lock per SLURM job, shared across workers.

Phase-2 runtime resolution: after pred_positions is built,
resolve_runtime is re-called with n_positions= to clamp "auto"
fov_workers to min(provisional, n_positions) and freeze
threads_per_worker at the Phase-1 value so worker initializers match
the parent's already-applied BLAS cap.

Order: results collected via as_completed (arrival order) but aggregated
in original pred_positions order so per-FOV CSV + embedding NPZ row
order is deterministic regardless of executor.

Workers flush their own ctx manifest at FOV end (global manifest lock at
_pos_write_lock(ctx, "manifest", "global") fcntl serializes flushes
across workers); interrupted process-mode runs preserve cache progress
the same way serial mode did.

Tests: full dynacell suite 384/48 pass/fail (identical to baseline; all
48 failures pre-existing). No new spawn-mode test in this commit —
covered by C5 CPU integration test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(dynacell): runtime + parallel tests, CI wiring, SLURM env caps, docs (C5)

Adds test coverage for the C1-C4 runtime module and FOV-parallelism wiring:

  test_runtime.py (23 tests): apply_thread_budget no-crash semantics under
  Hydra's main; resolve_runtime auto/freeze/validate paths; region timing
  collector; DYNACELL_FORCE_PER_T_HYGIENE env-var hook; spawn-context
  ProcessPoolExecutor smoke via stdlib os.getpid.

  test_evaluation_pipeline_parallel.py (3 tests): FovResult pickle round-trip
  with realistic per-backbone shapes (the path that ProcessPoolExecutor takes
  for Future.result transport); _aggregate_fov_result extends all 24
  dataset-level accumulator lists correctly. Re-imports the live pipeline
  module inside test bodies because test_lazy_init.py pops dynacell.* modules
  from sys.modules, which would invalidate pickle resolution of a
  module-level imported FovResult.

CI wiring:
  - Root pyproject.toml registers the "slow" marker and appends -m "not slow"
    to existing addopts.
  - test-dynacell-configs runs test_runtime.py + test_evaluation_pipeline_parallel.py
    alongside existing benchmark-schema/submit-tool tests. No new heavy deps
    needed (these tests don't import iohub, cubic, transformers).

SLURM env caps:
  - run_eval_jointtrained_nucleus.slurm and
    run_eval_unext2_jointtrained_membrane.slurm export
    DYNACELL_THREADS_PER_WORKER=\$SLURM_CPUS_PER_TASK before invoking eval.

Docs:
  - applications/dynacell/CLAUDE.md gains "Eval runtime / parallelism"
    section documenting the config block, env-var hooks, timing
    instrumentation, and process-mode caveats.

Full dynacell test suite: 410 passed / 48 failed / 65 skipped — the 48
failures are the same set that fails on dynacell-models@HEAD (verified
parity across all 5 commits in this branch).

End-to-end serial-vs-process parity test on a real iohub+cache fixture is
a follow-up (see .claude/plans/eval-parallelism.md §C5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): drop parent seg_model under executor=process to avoid duplicate residency

Under executor=process, workers lazy-load their own seg_model copies (under
the GPU lock) — but the parent had been holding a duplicate copy GPU-resident
since the unconditional prepare_segmentation_model(config) call at the top of
evaluate_predictions. On an 80 GB A100 with fov_workers=4, this meant
(N+1) × model_weights resident instead of N.

After Phase-2 runtime resolution (where the final executor is known), drop
the parent's copy when executor=process. The pre-warm side effect
(segmenter_model_zoo's validate_model populating the checkpoint dir to
avoid N workers racing on a cold quilt download) is already done by then.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(dynacell): resolve config interpolations in-place before worker submit

Belt-and-suspenders safety net suggested in PR #426 review: call
OmegaConf.resolve(config) in-place after Phase-1 runtime resolution and
before any worker pickle.dumps the DictConfig across the spawn boundary.

OmegaConf supports lazy interpolation resolution natively, so workers
*would* resolve ${...} refs on first attribute access — but doing it once
in the parent eliminates a tiny per-worker hot-path cost and guards against
the (unlikely) case where an interpolation depends on parent-only state.

??? MISSING fields are untouched: verified OmegaConf 2.3 `_resolve` walks
only nodes where `_is_interpolation()` is true, so bare ??? values pass
through. Confirmed: 63 tests pass across pipeline / compose / parallel /
runtime suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(dynacell): silence inner per-T tqdm in spawn workers

Adds a runtime.is_worker() helper that returns True iff
_worker_initializer has set _GPU_LOCK_PATH. Wire it into
_process_one_fov's inner-T tqdm via disable=is_worker(), so under
fov_workers>1 the N concurrent per-T bars don't interleave on the
parent's stderr. Outer per-FOV tqdm (in the parent) stays visible
because the parent's _GPU_LOCK_PATH is never set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): don't double-count timing rows under executor=serial

Bug: _process_one_fov writes region timings to the global _TIMINGS
collector via region_timer, and returns the slice in FovResult.timings.
_aggregate_fov_result then calls extend_timings(result.timings)
unconditionally. In serial mode this re-appends each row to the SAME
parent collector — every timed region appears twice in eval_timing.csv.

In process mode the second append is correct (workers have separate
per-process _TIMINGS, parent's is empty until aggregated).

Fix: add extend_worker_timings kwarg to _aggregate_fov_result. Pass
True in process mode, False in serial. Verified via 63-test suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): context-manage worker plate handles per FOV

_worker_setup no longer caches pred/gt/seg plate handles in _WORKER_STATE.
Instead, _worker_run_fov opens them via `with open_ome_zarr(...) as ...`
scoped to one FOV — iohub file descriptors close between FOVs and the
worker has nothing to clean up on ProcessPoolExecutor shutdown.

Models + cache contexts stay cached in _WORKER_STATE (their load cost
~5-10 s per worker is too high to repeat per FOV; ProcessPoolExecutor
process exit reclaims them).

Position lookup by name uses a linear scan of plate.positions() (iohub
has no name->position index). Cheap for the ≤ tens of positions per
typical eval — sub-ms per scan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): skip prepare_segmentation_model under require_complete_cache

When io.require_complete_cache=true, the caller asserts that all needed
masks live in the cache — so loading SuperModel (which hits a quilt
download for nucleus/membrane on a cold node) is wasted work. Return
None unconditionally in that mode. If a cache miss happens despite the
assertion, the cache layer raises StaleCacheError before any segment()
call, so the None seg_model never gets dereferenced.

Enables the cache-only/no-model path that C5's integration test design
relies on (run evaluate_predictions end-to-end against prebuilt mask
caches without instantiating cellpose/SuperModel).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(dynacell): add CPU integration test for serial-vs-process parity

Drives evaluate_predictions end-to-end on a tiny iohub fixture (3 positions
× T=2 × D=4 × H=W=32) with prebuilt mask caches. Runs the pipeline twice —
once with runtime.executor=serial, once with runtime.executor=process
runtime.fov_workers=2 — and asserts:

  1. pixel_metrics + mask_metrics rows are byte-equal after sort by FOV.
  2. segmentation_results.zarr per-position content is byte-equal across
     both channels.

Cache-only design (matches plan §C5): target_name=er +
require_complete_cache=true + compute_feature_metrics=false +
compute_microssim=false + pixel_metrics.fsc=null +
pixel_metrics.spectral_pcc=null. The full pipeline runs without
instantiating cellpose, SuperModel, DinoV3, DynaCLR, CellDino, or cubic.

What this validates: spawn boot, DictConfig pickling, FovResult.timings +
arrays round-trip through pickle, _aggregate_fov_result extends parent
accumulators correctly, parent-side segmentation_results.zarr writes
don't race, sort-by-pos_name aggregation, and the new
extend_worker_timings semantics doesn't double-count timing rows.

_live_pipeline_module() clears dynacell.evaluation.* from sys.modules
before importing — necessary because test_evaluation_pipeline.py stubs
metrics/utils via monkeypatch.setitem(sys.modules, ...), and without the
clear my pipeline's compute_pixel_metrics resolves to the empty-returning
stub.

CI: runs ~16 s on the existing test-dynacell-configs job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(dynacell): include CPU integration test in dynacell-configs CI job

Adds tests/test_evaluation_pipeline_parallel_cpu.py to the dynacell
pytest invocation. The test runs end-to-end evaluate_predictions in
both serial and spawn-process modes against a tiny iohub fixture +
prebuilt mask cache — no eval extras (cellpose, transformers, cubic)
needed because target_name=er + io.require_complete_cache=true
short-circuit segmenter and feature-extractor loads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): harden runtime resolution + gate GPU lock on use_gpu

Four Copilot-review findings on the runtime layer:

1. apply_thread_budget threadpool_limits re-arm: the old handle becomes
   unreachable and is GC'd. The previous comment said "stays alive"
   which was misleading. Document the actual behavior: we WANT the new
   limit active, dropping the old handle is intentional. Also remove
   the dead if-branch since both arms did the same assignment.

2. resolve_runtime accepts literal fov_workers=0 / threads_per_worker=0
   which produces divide-by-zero in the "auto" math or nonsensical
   thread counts. Add explicit >= 1 validation with a clear error.

3. _cpu_count() prefers os.sched_getaffinity(0) over os.cpu_count() so
   SLURM cgroup limits (cpuset etc.) are honored — otherwise auto-
   resolved fov_workers/threads_per_worker can oversize past the
   cgroup-visible CPU budget. Falls back to os.cpu_count() on non-Linux.

4. gpu_serialization_lock now takes a gate: bool = True kwarg. Callers
   in pipeline.py pass gate=use_gpu so under config.use_gpu=false the
   lock is a no-op even in process mode. Previously the lock fired
   unconditionally, defeating cross-worker parallelism for CPU-only
   eval runs. The lock still gates on _GPU_LOCK_PATH for the
   parent/serial no-op (unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(dynacell): stream aggregation under executor=process to bound peak memory

Previously buffered every FovResult in results_by_pos until ALL N
worker futures completed, then aggregated in pos_names_in_order. Peak
parent memory: N × seg_array (~N × 440 MB for T=10 D=21 H=W=1024 evals)
all live simultaneously.

Switch to streaming aggregation: drain the buffer opportunistically in
pos_names_in_order as futures complete. Only out-of-order FovResults
stay buffered; once the next-expected position lands we aggregate
(plate write + accumulator extend) and drop the ref. Peak buffer is
bounded by the number of out-of-order FOVs in flight, not by N.

Deterministic per-FOV CSV / embedding NPZ row order is preserved
because aggregation still happens in pos_names_in_order — early
completions just wait their turn in the buffer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): grouped multi-condition eval driver (#427)

* feat(dynacell): grouped multi-condition eval driver

Add ``dynacell evaluate-grouped``: run the same (model, organelle) across
multiple I/O variants (the canonical case is the three A549 treatment
plates mock/denv/zikv) in one Python process, paying the segmenter +
DINOv3 + DynaCLR + CELL-DINO cold-start once instead of N times. On
serial-mode interactive jobs and on per-condition SLURM jobs this is
30-90s of weight load + checkpoint warmup amortized across all
conditions. Under runtime.executor=process, each condition still spawns
its own pool whose workers re-load model copies, so only the parent-side
load is amortized; the driver prints a note when it detects this.

Mechanism:
- ``dynacell.evaluation.model_loader.load_eval_models`` extracts the
  segmenter + extractor instantiation that previously lived inline in
  ``evaluate_predictions`` and ``_worker_setup``. The bundle carries
  identity tags (dinov3_model_name, dynaclr_ckpt source paths, encoder
  cfg, celldino weights) so ``init_cache_context`` callers get identical
  cache-identity values across conditions.
- ``evaluate_predictions(config, *, models=None)`` accepts a pre-loaded
  bundle and skips its inline load when supplied. ``models=None``
  preserves the historical single-condition behavior; existing callers
  and tests are unchanged.
- ``evaluate_predictions_grouped(config)`` reads a top-level
  ``conditions:`` list, deep-merges each overlay into a copy of the
  base, and runs them sequentially with the shared models. Per-condition
  ``_final_metrics_cache_valid`` short-circuits are honored independently.
- ``_check_grouped_field_invariants`` rejects overlays that touch
  ``target_name``, ``feature_extractor.*``, ``compute_feature_metrics``,
  or ``use_gpu`` -- those determine which models get loaded and must be
  identical across conditions. ``io.*``, ``save.*``, ``runtime.*``,
  ``limit_positions``, ``force_recompute.*`` are free to vary.
- Hydra entry point ``evaluate_model_grouped`` with new
  ``_configs/eval_grouped.yaml`` (inherits ``eval.yaml`` via defaults).
- New CLI subcommand registered in ``__main__._HYDRA_COMMANDS``.

Tests: ``tests/test_evaluation_grouped.py`` builds a 2-condition
cache-only fixture (separate pred plates + cache dirs) and asserts
byte-equal parity against sequential per-condition runs, plus rejection
cases for empty ``conditions`` and forbidden model-loading-field
overrides. Cache-only path means no models actually load -- the test
validates the loop structure, not the speedup. Reuses the iohub fixture
builders from ``test_evaluation_pipeline_parallel_cpu.py``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(dynacell): simplify grouped eval driver + fold load + cache-context unification

Followup polish on d69d970 (feat: grouped multi-condition eval driver).
Addresses /simplify findings + the two related cleanups the user asked to
fold in:

Pipeline driver (pipeline.py):
- Lazy model load — defer ``load_eval_models`` until the first condition
  that misses ``_final_metrics_cache_valid``. Restarts where every
  condition is cache-hit pay no 30–90 s cold start.
- Invariant baseline is now the conditions-stripped input config, not
  the conditions[0]-merged config. A model-loading override smuggled
  into condition 0 is rejected symmetrically with conditions 1..N
  (previously it was silently re-baselined). New test guards the
  regression: ``test_grouped_rejects_condition0_model_field_override``.
- ``_merge_condition`` drops the manual ``to_container`` round-trip
  deep-copy; ``OmegaConf.merge`` already returns a non-aliased config.
- ``_check_grouped_field_invariants`` switched to ``OmegaConf.is_config``
  in place of the ``type(OmegaConf.create([]))`` idiom; baseline
  snapshot is computed once and reused per iteration.
- Extracted ``_load_cached_final_metrics`` so the three-NPY reload block
  is shared between ``evaluate_model`` and the grouped driver.
- Removed redundant first-iteration merge + ``apply_dataset_ref`` for
  condition 0; the loop body now does both for every condition,
  including idx=0.

Model loader (model_loader.py) and precompute-gt:
- Added ``LoadFlags`` so ``load_eval_models`` can be called from both
  ``evaluate_predictions`` (all extractors gated by
  ``compute_feature_metrics``) and ``precompute_gt_artifacts``
  (per-extractor ``build.masks`` / ``build.dinov3`` / ``build.dynaclr``
  / ``build.celldino`` flags). ``LoadFlags.for_evaluate`` recovers the
  prior default. The 30-line inline load block in ``precompute_cli.py``
  collapses to a single ``load_eval_models(config, flags=LoadFlags(...))``
  call. Stricter celldino weights gate is preserved as an explicit
  pre-load raise.
- Added ``init_cache_contexts(config, models)`` and
  ``init_gt_cache_context(config, models)`` so the four identical
  ``init_cache_context(..., side=..., **identity_tags)`` call sites in
  ``evaluate_predictions``, ``_worker_setup``, and
  ``precompute_gt_artifacts`` consolidate behind the EvalModels bundle.

Test fixtures (tests/_eval_fixtures.py):
- Promoted the cache-only fixture builders (``build_eval_config``,
  ``build_fixture``, ``make_hcs_plate``, ``make_mask_cache``,
  ``live_pipeline_module``, ``read_position_arrays``,
  ``N_POSITIONS``/``T``/``D``/``H``/``W``) out of
  ``test_evaluation_pipeline_parallel_cpu.py`` into a private
  ``_eval_fixtures.py`` module (underscore-prefixed so pytest skips
  collection). Both ``test_evaluation_pipeline_parallel_cpu.py`` and
  ``test_evaluation_grouped.py`` now import normally — no
  ``sys.path.insert`` (project's "no ``sys.path`` edits" rule).

CI:
- ``.github/workflows/test.yml`` adds ``test_evaluation_grouped.py`` to
  the ``test-dynacell-configs`` matrix so the grouped driver gets
  exercised on every push.

All 104 tests in the affected suites pass. Pre-existing 48-test failure
baseline (test_benchmark_config_composition + one test_evaluation_metrics
case) is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): tier grouped-eval process-mode message by cache state

Split the existing ``[grouped] note:`` print into two tiers so users get
proportional signal for the configuration they actually picked:

- ``executor=process`` + ``require_complete_cache=true`` (cache-only,
  n_conditions > 1) keeps the mild informational note — workers re-init
  per condition but skip ``prepare_segmentation_model`` (returns None)
  and don't instantiate the deep extractors, so the only overhead is
  N pool spawns.
- ``executor=process`` + ``require_complete_cache=false`` (cache-miss,
  n_conditions > 1) gets a loud multi-line WARNING. This is the actual
  worst case: each condition's worker pool independently loads
  SuperModel + DINOv3 + DynaCLR + CELL-DINO, so total wasted cold-start
  scales as ~30-90 s × N_workers × N_conditions. The warning explicitly
  points to ``executor=serial`` as the fix, since the parent's
  pre-loaded ``EvalModels`` is shared across conditions in serial mode
  but lost across pool spawns in process mode.

Two new tests in ``test_evaluation_grouped.py``:
- ``test_grouped_warns_loudly_on_process_plus_cache_miss`` confirms the
  WARNING text fires (executor=process + require_complete_cache=false +
  2 conditions) with all the expected keywords.
- ``test_grouped_mild_note_on_process_plus_cache_only`` confirms the
  cache-only fixture (require_complete_cache=true) gets the mild note
  and does NOT trip the loud WARNING.

CLAUDE.md "Grouped multi-condition eval" section updated with the
tiered behavior so users searching for the WARNING text find context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): grouped eval struct-mode + require_complete_cache invariant

Three fixes triggered by Copilot's review on PR #427 plus an independent
struct-mode bug that would have crashed every real Hydra invocation of
``dynacell evaluate-grouped``:

1. ``_merge_condition`` was over-simplified during the /simplify pass
   (commit c271d3f). ``OmegaConf.merge`` propagates struct mode from
   its first argument, and Hydra composition always produces struct-mode
   configs. Result: merging an overlay carrying a ``name`` label (which
   isn't in the schema) raises ``ConfigKeyError``, and ``del
   merged["conditions"]`` raises ``ConfigTypeError`` even when the merge
   succeeds. Restore the ``to_container`` round-trip on the base — it
   escapes struct mode while still producing an independent config. Same
   fix applied to the ``models_base`` snapshot a few lines above.
   Existing tests didn't catch this because they construct configs via
   ``OmegaConf.create({...})`` directly (non-struct); real users going
   through ``@hydra.main`` would have hit the crash immediately.

2. Add ``io.require_complete_cache`` to ``_MODEL_LOADING_FIELDS`` so
   per-condition overlays cannot override it. The flag flips
   ``prepare_segmentation_model`` between "load SuperModel" and "return
   None" — letting a condition flip it would mean the shared model
   bundle is wrong for that condition (None when cache-miss expects a
   real segmenter, or loaded but never used). Treat as a grouped
   invariant. (Copilot thread on pipeline.py:1227.)

3. Add ``applications/dynacell/tests/__init__.py``. The new test files
   use ``from ._eval_fixtures import ...`` relative imports, which work
   under the project's ``--import-mode=importlib`` setting but are
   fragile under other pytest import modes. Adding ``__init__.py``
   makes ``tests/`` a proper package and the relative imports work
   regardless of which import mode pytest is invoked with. (Copilot
   threads on test_evaluation_grouped.py:32 and
   test_evaluation_pipeline_parallel_cpu.py:42.)

Two new regression tests in ``tests/test_evaluation_grouped.py``:
- ``test_grouped_rejects_require_complete_cache_override`` covers the
  new invariant.
- ``test_grouped_works_on_struct_mode_config`` wraps the parity fixture
  in ``OmegaConf.set_struct(grouped_cfg, True)`` to simulate a
  Hydra-composed config; would have failed before the
  ``_merge_condition`` fix.

All 108 tests in affected suites pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(dynacell): fix two CLAUDE.md doc nits flagged in PR #427 self-review

Two 75-confidence findings from the parallel review on PR #427 that
scored below the 80-threshold for the inline review comment but are
worth fixing now while context is fresh:

1. "Reality check for process mode" subsection referenced
   ``pipeline.py:171`` as the parent seg-model load site. After the
   eval-parallelism refactor that line is a dataclass field
   (``pred_ts: list[np.ndarray] = ...``); the actual load now lives in
   ``load_eval_models(config)`` called from the top of
   ``evaluate_predictions``. Switch the reference from a brittle line
   number to a function-name pointer.

2. "Grouped multi-condition eval" subsection said the per-condition
   cache-load short-circuit fires "with ``force_recompute.final_metrics
   =false``", omitting the second gate: ``_final_metrics_cache_valid``
   also returns False (forcing recompute) when ``force_recompute.all
   =true``, regardless of the ``final_metrics`` flag. Make both gates
   explicit so users who set ``force_recompute.all=true`` for unrelated
   reasons aren't surprised by unexpectedly recomputed metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): keep SuperModel under require_complete_cache when pred cache off

`prepare_segmentation_model` was returning None unconditionally
whenever `io.require_complete_cache=true`, on the assumption that all
masks would be served from cache. That assumption breaks for
nucleus/membrane targets when `io.pred_cache_dir=None`: the per-T loop
in `_process_one_fov` falls back to `segment(predict[t],
seg_model=seg_model)`, which raises `ValueError` because `seg_model` is
None.

Narrow the early return to cases where no path can reach `segment()`
without a cached mask — organelle targets (workflow-based, no
`seg_model` needed) or nucleus/membrane runs that also have
`io.pred_cache_dir` configured. Reported by Copilot on PR #426.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(dynacell): align gpu_serialization_lock docstring with implementation

The docstring claimed "no CUDA available in the worker" was one of the
no-op conditions, but the implementation only checks `gate` and
`_GPU_LOCK_PATH`. On a CPU-only node with `use_gpu=true` the lock would
still serialize across workers. Rather than add a runtime CUDA probe
(callers already pass `gate=use_gpu` to opt out for CPU-only regions),
correct the docstring to describe what the code actually does.
Reported by Copilot on PR #426.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(dynacell): rewrite eval README — runtime/grouped/cache-only + trim bloat

The eval README had grown stale relative to the runtime/grouped/cache
work landed on this branch and PR #427: no mention of
`io.require_complete_cache`'s cache-only fast path, no `runtime.*`
block, no `dynacell evaluate-grouped`, no CELL-DINO.

This rewrite folds those in and restructures: FOV parallelism + grouped
eval move out of "Artifact caches" (where they were mis-classified)
into a new top-level "Scaling" section; a new "Quick start" sits above
Configuration; HF-cache operational details go to a new "HPC notes"
section at the end.

Bloat trim along the way: components table compressed (related modules
combined, trivial files pushed to a note), HF-cache + external-users
sections shortened, the 11-row force_recompute table collapsed to a
YAML block plus 5 grouped descriptions, redundant Hydra
group-discovery paragraphs removed. No flags, footguns, or CLI
options dropped — every config knob and env var from the old version
is still documented. 466 → 265 lines.

Public-repo hygiene: replaced the hard-coded
`/hpc/projects/comp.micro/...` default for the team-shared HF cache
with a description that points at the `_DEFAULT_SHARED_HF_CACHE`
constant in `dynacell/__main__.py` and the `DYNACELL_SHARED_HF_CACHE`
env-var override.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(dynacell): address PR #426 code-review findings

Five fixes from a code-review pass on the controlled FOV-parallel eval
work:

- Restore the seg_model invariant for grouped multi-condition runs.
  After 5829175, ``prepare_segmentation_model``'s load decision
  depends on ``io.pred_cache_dir`` in addition to
  ``io.require_complete_cache``, but ``_MODEL_LOADING_FIELDS`` only
  caught the latter. A per-condition overlay setting
  ``io.pred_cache_dir=None`` against a base that has it set (for
  nucleus/membrane + ``require_complete_cache=true``) would crash
  at runtime: the shared ``seg_model`` is None but the per-T loop
  falls back to ``segment(predict[t], seg_model=None)``. New
  ``_seg_model_required()`` helper mirrors the load decision and
  ``_check_grouped_field_invariants`` rejects the dangerous direction
  only, leaving canonical per-condition pred caches free to differ.

- Fix the process-mode progress bar stalling. ``pbar.update(1)`` only
  fired inside the in-order drain loop, so if FOV 0 was slow and later
  FOVs completed first, the bar stayed at 0% until FOV 0 drained.
  Update on each worker completion instead.

- Correct the misleading ``[grouped] note:`` for ``executor=process`` +
  ``require_complete_cache=true``. The message claimed "no models
  actually load" but deep extractors do load whenever
  ``compute_feature_metrics=true``, and the segmenter only short-
  circuits in a subset of cases. Rewrite to match reality.

- Move ``runtime.*`` imports (12 symbols) from four inline
  per-function blocks to one top-of-file import. No spawn-ordering or
  circular-import constraint blocks them; the inline pattern just
  violated the project's "no inline imports without strong reason"
  rule and duplicated a ``flush_manifest`` import that already lives
  at module top.

New regression test ``test_grouped_rejects_pred_cache_dir_flip_for_nucleus``
guards the dangerous direction; an existing test's assertion updates to
the corrected note wording.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All other dynacell-era models (FNet3D, FCMAE family) write to
/hpc/projects/.../models/dynacell/...; only the legacy comparison-
vs-viscy models (celldiff, unetvit3d) and my new pix2pix3d_unetvit
baseline (which copied the celldiff/unetvit3d leaf templates) kept
the legacy /models/cell_diff_vs_viscy/ prefix.

This commit normalizes pix2pix3d_unetvit alone — 31 leaves under
benchmarks/virtual_staining/<organelle>/pix2pix3d_unetvit/ — so
future Phase 3/4 runs land in the canonical dynacell tree. The
in-flight DDP run (33190465) is unaffected: it reads from its
already-rendered resolved YAML and writes to the legacy path; that
run's outputs will be migrated manually after it completes.

Legacy celldiff + unetvit3d leaves are intentionally left alone in
this commit (already-trained models, would need on-disk mv too).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed models

16 new predict configs under */celldiff/a549_mantis/ (4 organelles × 4 predict
sets: denv/mock/zikv/ipsc_confocal). A549 test sets use z_window_size=48 with
a549trained zarr naming; iPSC confocal OOD eval uses z_window_size=40 with
a549trained suffix in ipsc/predictions/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The new metric suite — torch-fidelity FID/KID/PRC/MIND, real-vs-pred
linear probes (CP/DINOv3/DynaCLR/CellDINO), and the CellDINO encoder
baseline — is the intended default for the post-refactor re-eval
campaign. Previously the base default was false, so any leaf that did
not explicitly set true silently skipped DINOv3+DynaCLR+CellDINO loading
and emitted an 11-column feature_metrics.csv instead of the 74-column
full schema.

Flip the base default in _configs/eval.yaml; add explicit
compute_feature_metrics: true to the 72 leaf eval__*.yaml configs that
previously relied on the default; flip the three fnet3d mix-trained
nucleus shell scripts that hard-coded false. Leaves restate the flag so
intent stays visible at the leaf without chasing the defaults graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
defaults: [eval, _self_] caused eval's expansion (which ends with the
``leaf`` group) to apply BEFORE eval_grouped.yaml's own ``_self_`` body,
so ``conditions: []`` from _self_ silently clobbered any leaf-provided
``conditions:`` list. ``evaluate-grouped leaf=<path>`` would then exit
immediately with "evaluate_predictions_grouped requires a non-empty
top-level 'conditions' list."

Swapping to defaults: [_self_, eval] runs the schema-declaring _self_
first, then lets eval (and its trailing ``optional leaf: null``) apply on
top, so the leaf's conditions land last and survive composition.

Verified on all 12 generated grouped leaves: ``uv run dynacell
evaluate-grouped leaf=<path> --cfg job`` now reports the leaf's full
conditions list (16-28 entries) with target_name set correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The old example ``uv run dynacell evaluate-grouped -c eval_grouped_a549_mantis_er``
was wrong on two counts:

* ``-c`` is Hydra's ``--cfg`` flag, which only accepts ``job``,
  ``hydra``, or ``all`` for config display. Passing a config name
  there is a no-op (or argparse error, depending on placement).
* The grouped leaves live under
  ``configs/benchmarks/virtual_staining/_internal/leaf/grouped/`` and
  are discovered via the ``leaf=`` group override the same way
  single-condition leaves work — that's the actual entry point baked
  into the dynacell CLI's _EXTERNAL_SEARCHPATHS.

Replace with the correct ``leaf=grouped/<bucket>/eval_grouped`` form
and add a one-line note explaining why ``-c`` doesn't work, so the
next person who follows the docs gets a runnable command.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lets one tool cover both serial chained predicts (existing behavior,
default) and SLURM array submissions (new), so hand-written
``run_*_pred_*.slurm`` scripts can be retired in favor of a single
launcher with topology validation + resolved-config artifacts.

New flags (all opt-in, no change to existing serial path):

* ``--array`` — render ``sbatch_template_array.sbatch`` with
  ``#SBATCH --array=0-(N-1)``, per-task logs at ``%A_%a.{out,err}``,
  and a bash ``CONFIGS=(...)`` block dispatched by
  ``$SLURM_ARRAY_TASK_ID``.
* ``--max-array-concurrency N`` — append ``%N`` to the array spec
  to cap concurrent tasks. Requires ``--array``.
* ``--allow-mixed-directives`` — bucket leaves by ``(SBATCH directive
  identity, run_root)`` and submit one sbatch array per bucket. Each
  bucket writes artifacts into its own ``run_root``; job names get a
  ``_g{N}`` suffix. Without this flag, ``--array`` keeps the strict
  uniform-hardware/uniform-run_root invariant of serial mode.

Internals:

* Factored out ``_compose_leaves``, ``_validate_uniform_sbatch``,
  ``_validate_devices``, ``_write_resolved_configs``,
  ``_render_serial_sbatch``, ``_render_array_sbatch``, and
  ``_sbatch_submit`` from the monolithic ``submit()`` so the
  serial/array dispatch reads as a thin top-level loop.
* ``_render_sbatch_directives`` learned an ``array_spec`` arg that
  inserts ``#SBATCH --array=`` and switches the log filenames from
  ``%j`` to ``%A_%a``.
* ``_directive_bucket_key`` produces a stable hashable identity for
  a leaf's SBATCH block (``nodes``, ``gpus``, ``mem``, ``constraint``,
  ...), used by ``--allow-mixed-directives`` to group leaves.

The new ``sbatch_template_array.sbatch`` mirrors
``sbatch_template_batch.sbatch`` (umask, group-writable mkdirs,
cleanup trap, ``cd @@repo_root``, ``ml uv``, ``nvidia-smi``) and ends
with ``srun uv run python -m dynacell predict --config "$CFG"`` where
``$CFG`` is selected from ``CONFIGS`` by ``$SLURM_ARRAY_TASK_ID``.

``predict_batch.sh`` already forwards extra args via ``"$@"``; the
docstring is updated to advertise the new flags as supported
passthrough with worked examples.

Tests cover: ``--max-array-concurrency`` / ``--allow-mixed-directives``
rejected without ``--array``; negative concurrency rejected; mixed
``run_root`` rejected without ``--allow-mixed-directives``; serial
mode unchanged (no array directive, no ``CONFIGS=``); ``--array``
emits the array directive and ``CONFIGS`` block; ``--max-array-concurrency``
appends ``%N``; ``--allow-mixed-directives`` buckets 4 ER leaves
(3 a549-test + 1 ipsc-test) into two sbatch arrays with distinct
``_g0``/``_g1`` job-name suffixes; ``_directive_bucket_key``
distinguishes by constraint and ignores ``time``.

Verified end-to-end against the 16 celldiff_r2 a549trained leaves:
one invocation auto-buckets into ``a549/predictions/`` (12 tasks)
+ ``ipsc/predictions/`` (4 tasks) sbatch arrays.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the tooling that drives a 12-bucket re-eval campaign for the
post-refactor metric suite (torch-fidelity FID/KID/PRC/MIND, real-vs-pred
linear probes, CellDINO baseline, updated DynaCLR ckpt) across every
(organelle × train_set) tuple that currently has a prediction zarr on
disk.

Generator (``tools/generate_grouped_eval_configs.py``)
======================================================

Walks ``{ipsc,a549}/{predictions,joint_predictions}/`` and parses every
prediction zarr into a canonical identity ``(organelle, model, variant,
train_set, test_set, condition)``. Seven filename patterns are
recognized, including the legacy ``<org>_<model>__<gene>_<cond>`` form
that pre-dates the modern A549 naming — these are still live on disk for
sec61b + tomm20 deterministic and CellDiff_r2 iPSC-trained predictions
and must be consumed. ``celldiff_r2`` in ``joint_predictions/`` without a
variant suffix is recognized as the joint-default form. Skip allowlist
for the known-stale May-5 ``memb_celldiff_*`` zarrs and the
``sec61b_fnet3d``/``sec61b_unext2`` legacy aliases. The original
non-r2 CellDiff family is also skipped (only ``celldiff_r2`` is in
scope per the 2026-05-21 instruction).

Per condition, derives:

* ``benchmark.dataset_ref: {dataset, target}`` as a dict (resolver
  silently returns None on string refs or partial dicts). Manifest
  slugs are hyphen-form (``aics-hipsc``, ``a549-mantis-<gene>-<cond>``);
  target keys are gene markers on A549 (``h2b``/``caax``) vs logical
  organelle names on iPSC.
* ``io.pred_path`` (the walked zarr) and
  ``io.pred_cache_dir`` under
  ``{dataset}/eval_cache_pred/{train_set}/{model_variant}/{condition}/``.
* ``save.save_dir`` at the canonical re-eval path
  (``{dataset}/{evaluations_with_embeddings,evaluations_a549trained_with_embeddings,evaluations_jointtrained_with_embeddings}/eval_<paper>[_<infix>]_<organelle>[_<cond>]``).
* ``name`` — stable per-condition label used by the grouped driver.

Pre-checks before emission: every pred_path exists; every gt_cache_dir
contains both DynaCLR (sha12 ``e409a5a079aa``) and CellDINO (sha12
``ef7c17ffb0aa``) feature dirs; no duplicate save_dirs within a bucket.
``--skip-fresh`` filters out conditions whose canonical save_dir
already has a ≥70-col ``feature_metrics.csv`` (used after canceling
+ resubmitting the array to avoid redoing completed work).

Emits 12 production leaves (one per organelle × train_set) plus a 13th
``_probe`` leaf used by the Step-5 sanity check (6 conditions × 2
positions, all save_dirs under ``/tmp/reeval_probe/`` so partial runs
can't corrupt production), plus a README summarizing per-bucket
condition counts. Leaves are pure ``# @Package _global_`` overlays
(no ``defaults: [eval_grouped, _self_]`` — that creates a composition
cycle since ``evaluate-grouped`` already starts from
``config_name="eval_grouped"``).

Tests (``tools/generate_grouped_eval_configs_test.py``)
=======================================================

* Grammar dispatch: 14 parametrized cases covering the seven canonical
  patterns + variants + dedupe-preference-for-joint-dir.
* Failure cases raise ``ValueError`` with informative messages
  (unknown organelle, unknown model, unknown CellDiff variant).
* Save_dir/pred_cache_dir/dataset_ref derivation matches the canonical
  layout (incl. the A549 ``h2b``/``caax`` vs iPSC ``nucleus``/``membrane``
  target-key split).
* Live-data tests (marked ``requires_data``) walk the actual training
  tree, assert every emitted ``pred_path`` exists, and compose +
  resolve every generated leaf via ``OmegaConf.merge`` +
  ``apply_dataset_ref`` + ``_check_grouped_field_invariants`` — catches
  what ``--cfg job`` does NOT (Hydra-only composition vs eval-time
  resolver splice + invariant checks).

Campaign runners
================

``tools/run_reeval_campaign.slurm`` — SLURM array (0-11%4) over the 12
grouped leaves. ``constraint=a100|h100|h200`` (eval-time GPU load is
~15 GB; ``pix2pix3d_needs_h200`` applies to predict-time DDP only).
``--time=48h`` (2× the probe-based projection for the largest 28-cond
bucket). ``DYNACELL_THREADS_PER_WORKER=32`` set before any C-extension
loads so BLAS caps bite at first thread-allocator load (per
``dynacell/CLAUDE.md`` "Eval runtime / parallelism").

``tools/verify_reeval_outputs.py`` — post-run sanity walks every
emitted save_dir + pred_cache_dir and runs the 5 plan-defined checks:
``feature_metrics.csv`` has ≥70 cols (target 74), 8 ``{gt,pred}_*``
embedding npz files, ``eval_timing.csv`` present, CellDINO/DynaCLR
probe + fidelity columns present, ``pred_cache/features/celldino/``
populated. Prints a pass/fail table; exits non-zero on any fail.

Grouped leaves
==============

12 leaves + ``_probe`` leaf + README under
``configs/benchmarks/virtual_staining/_internal/leaf/grouped/``,
auto-generated by the script above. Committed as a campaign snapshot
(generator can regenerate at any time given the prediction zarrs + GT
caches on disk). Current state reflects the post-``--skip-fresh`` set
after the first array's partial completion: 16/8/1/16/16/6/22/12/9/22/16/6
= 150 production conditions + 6 probe conditions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…style) (#428)

* feat(viscy-models): add 3D PatchGAN discriminator and LSGAN losses

Introduces a new gan subpackage with PatchGAN3D and MultiScalePatchGAN3D
(pix2pixHD-style, 2-scale by default), plus lsgan_d_loss / lsgan_g_loss
list-based helpers. These will pair with UNetViT3D to form a 3D pix2pix
adversarial baseline alongside the existing regression and flow-matching
DynaCell baselines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add DynacellGAN Lightning module for 3D pix2pix baseline

DynacellGAN composes the existing UNetViT3D generator with the new
MultiScalePatchGAN3D discriminator, training with LSGAN + L1 (λ=100) via
manual optimization (automatic_optimization=False). A requires_grad
toggle separates the D step from the G step so each backward only
populates the active side's grads; D grads are also explicitly cleared
between optimizers so the test invariant (no D grads after a G step)
holds for Phase 2 DDP correctness.

Validation logs the existing loss/validate weighted-mean alias, so
ModelCheckpoint(monitor="loss/validate") keys off the same metric used
across the other DynaCell baselines. configure_optimizers builds two
AdamW + WarmupCosine pairs directly to fit the two-optimizer return
shape Lightning's manual mode expects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add pix2pix3d_unetvit recipe, model overlays, and smoke leaf

The recipe declares dynacell.engine.DynacellGAN with UNetViT3D's
production generator config and a 2-scale spectral-normalized PatchGAN3D.
fit / predict model overlays mirror the unetvit3d ones with lr_g/lr_d
split (no scalar lr) and bf16-mixed precision.

The single-GPU smoke leaf for er/pix2pix3d_unetvit/ipsc_confocal sets
benchmark.dataset_ref: null to disable the compose-hook resolver so it
can point data_path at SEC61B_test48.zarr directly, and uses batch_size=2
to match the inherited num_samples=2 (HCSDataModule divisibility rule).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add ER iPSC production train leaf for pix2pix3d_unetvit

Mirrors er/unetvit3d/ipsc_confocal/train.yml with the model overlay and
all model_name / experiment_id / run_root / job_name strings retargeted
to pix2pix3d_unetvit. Keeps ModelCheckpoint.monitor: loss/validate
unchanged — DynacellGAN logs that key via on_validation_epoch_end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add pix2pix3d_unetvit single-set train leaves across organelles

Adds 7 production single-GPU training leaves for pix2pix3d_unetvit:
- er/pix2pix3d_unetvit/a549_mantis
- {membrane, mito, nucleus}/pix2pix3d_unetvit/{ipsc_confocal, a549_mantis}

Each leaf mirrors the matching unetvit3d cell with model_overlay,
model_name, experiment_id, logger name/save_dir, ckpt dirpath, launcher
run_root/job_name retargeted to pix2pix3d_unetvit. ModelCheckpoint
monitor=loss/validate preserved.

Joint training leaves (joint_ipsc_confocal_a549_mantis) deferred — still
gated on the open BatchedConcatDataModule DDP deadlock (handoff
2026-04-26).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add iPSC predict + eval leaves for er/pix2pix3d_unetvit

Mirrors er/unetvit3d/ipsc_confocal/{predict,eval}__ipsc_confocal.* with
the model overlay, model_name, experiment_id, ckpt_path, output_store
filename, and eval pred_path / save_dir all retargeted to
pix2pix3d_unetvit. predict_method stays full_image (the only mode
DynacellGAN.predict_step supports for v1).

ckpt_path uses an explicit REPLACE_ME_WITH_PRODUCTION_CHECKPOINT_PATH
placeholder so torch.load raises FileNotFoundError with the placeholder
visible if the user forgets to override it — running predict against an
untrained generator would silently produce garbage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add a549_mantis predict + eval triplet for er/pix2pix3d_unetvit/ipsc

Adds the 3 a549_mantis test splits (mock, denv, zikv) for the
iPSC-trained pix2pix3d_unetvit ER cell — enables cross-dataset transfer
evaluation. Mirrors the unetvit3d ipsc_confocal predict/eval triplet
with the same retargeting rule already used for the iPSC pair.

ckpt_path uses the REPLACE_ME_WITH_PRODUCTION_CHECKPOINT_PATH placeholder
so any forgotten override fails loudly via FileNotFoundError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(dynacell): /simplify pass — reuse configure_adamw_scheduler, dedupe _log_samples

DynacellGAN.configure_optimizers used to rebuild AdamW + WarmupCosineSchedule
inline despite the class docstring claiming it called the shared helper.
Now calls configure_adamw_scheduler twice and concatenates, dropping the
redundant schedule guard and the unused monai/torch.optim imports.

_log_samples was a byte-identical instance method on DynacellUNet,
DynacellFlowMatching, and DynacellGAN. Promoted to a module-level free
function and updated the five call sites across all three classes.
Updated test_dynacell_gan_validate_logs_alias to monkeypatch the module
attribute instead of the instance method.

Also trimmed three narrating comments inside DynacellGAN (the plan/PR
context comments belong in the PR description, not the source) and five
per-layer banner comments in patchgan3d.py (the conv literals state the
shape and the class docstring covers the layer table).

Safe under the running production training's Slurm auto-requeue: no
constructor signatures change, AdamW + WarmupCosine pairs are built with
the same args, optimizer state shape is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): Phase 2 — DDP topology + 4-GPU smoke + production leaves

Adds the DDP infrastructure for pix2pix3d_unetvit so the GAN baseline
can train on 4 H200s in ~13 h instead of the ~50 h single-GPU pace
measured on job 33132605 (now cancelled).

- recipes/topology/ddp_4gpu_gan.yml — DDP variant that wraps the strategy
  in find_unused_parameters: true. Required because DynacellGAN runs
  manual_optimization with two optimizers, and the per-step requires_grad
  toggle leaves one side's params "unused" from the reducer's perspective
  on each backward. Plain ddp_4gpu.yml stays single-strategy for the
  other models.

- model_overlays/pix2pix3d_unetvit_fit_ddp.yml — mirrors the single-GPU
  fit overlay but composes ddp_4gpu_gan.yml topology + 4-GPU hardware.

- er/pix2pix3d_unetvit/ipsc_confocal/train_smoke_4gpu.yml — 4-GPU smoke
  variant. Same SEC61B_test48 store, batch_size=2 matching the inherited
  num_samples=2 divisibility rule. Validates DDP sharding + the
  find_unused_parameters DDP reducer behavior under the GAN's manual-opt
  pattern before committing the 4-GPU production run.

- er/pix2pix3d_unetvit/ipsc_confocal/train_4gpu.yml — 4-GPU production
  leaf. Keeps per-rank batch_size=4 (effective 16 across 4 GPUs) and
  unchanged lr_g/lr_d so the DDP run is comparable to the single-GPU
  reference at the per-rank optimization settings. logger name and
  job_name suffixed with _4gpu to disambiguate in W&B/slurm; run_root
  and ckpt dirpath unchanged so Phase 4 predict leaves keep working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): retarget membrane pix2pix3d_unetvit save paths

The membrane single-set train leaves landed with placeholder paths under
.../ipsc/memb_temp/ — diverging from the celldiff convention (memb, not
memb_temp) and conflating the a549_mantis run dir with the iPSC tree.
Both leaves now point at the canonical .../ipsc/memb/ and
.../a549_mantis/memb/ trees, matching the celldiff layout so all
organelles can be cross-referenced uniformly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): expand pix2pix3d_unetvit predict/eval matrix

Mirrors the er/pix2pix3d_unetvit/ipsc_confocal/ predict+eval set across
membrane, mito, and nucleus so every iPSC-trained pix2pix3d_unetvit run
has an apples-to-apples inference + evaluation path against both the
ipsc_confocal self-target and the three a549_mantis conditions (mock,
denv, zikv). Predict leaves swap target-inherited normalizations for a
source-only NormalizeSampled (Phase3D) and clear the RandWeightedCropd
augmentation. Eval leaves enable compute_feature_metrics only on the
ipsc_confocal self-target, matching the existing er convention. Ckpt
paths are placeholders pending the corresponding production runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add pix2pix3d_unetvit joint train + predict leaves

Adds the joint_ipsc_confocal_a549_mantis variant for all four
organelles (er, membrane, mito, nucleus): one train.yml authoring
BatchedConcatDataModule with two HCSDataModule children plus the four
predict leaves (self iPSC + three a549_mantis conditions). The joint
train differs from the celldiff joint by using NormalizeSampled
(fov_statistics) instead of MinMaxSampled so joint and single-set
pix2pix3d_unetvit runs share the same normalization contract for
apples-to-apples ablations. batch_size=2 × num_samples=2 matches the
single-set 4 GPU-samples/step (per the BatchedConcatDataModule rule
in CLAUDE.md that joint does NOT divide by num_samples).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(viscy-models): drop speculative num_layers from PatchGAN3D

PatchGAN3D and MultiScalePatchGAN3D accepted a num_layers constructor
parameter that only worked with the value 5 — any other value raised
ValueError. The argument was justified as future-proofing for ablations
that don't exist yet, which is exactly the speculative flexibility
CLAUDE.md prohibits: "No 'flexibility' or 'configurability' that wasn't
requested. ... No error handling for impossible scenarios." Re-add it
only if and when a non-5 variant lands. Recipe + test fixture updated
in lockstep; uv run pytest packages/viscy-models/tests/test_gan/ and
the DynacellGAN init/forward tests pass (11/11).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): apply TTUR — drop pix2pix3d_unetvit lr_d to lr_g/10

The 1:1 LR schedule collapsed to D-dominance on DDP run 33176091
(wandb 20260520-002629): loss/d_train fell from 1.3e-2 (ep1) to
~1e-4 (ep12+), loss/g_adv pinned at ~1.0 (saturated MSE(d_fake, 1)),
loss/g_l1 flat at ~0.45 from epoch ~5 onward — the adversarial
gradient was effectively zero, generator was training as pure L1.

Slowing the discriminator 10x (lr_d = lr_g / 10 = 3e-5) is the
canonical pix2pix-style fix for D-dominance. Applies to both the
single-GPU and DDP fit overlays so the entire Phase 3 baseline
(all 4 organelles × {iPSC, A549, joint}) inherits the same
adversarial balance regime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(viscy-models): add nonsat / RpGAN losses + R1/R2 grad penalties

Adds three loss families and two gradient penalties to viscy_models.gan,
extending the existing multi-scale 3D PatchGAN convention to the modern
StyleGAN2 / R3GAN recipe:

- nonsat_d_loss / nonsat_g_loss: softplus non-saturating loss, verbatim
  from stylegan2-ada-pytorch/training/loss.py.
- rpgan_d_loss / rpgan_g_loss: relativistic pairing loss, verbatim from
  brownvc/R3GAN/Trainer.py.
- r1_penalty / r2_penalty: Mescheder zero-centered gradient penalty on
  real / fake input. Multi-scale aggregation = mean of per-scale
  ||grad||^2 (NOT ||grad of sum||^2, which would mix cross-scale
  derivatives through MultiScalePatchGAN3D's pool-derived second scale).

All losses preserve the existing list-of-per-scale-Tensors convention.
LSGAN losses are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): wire DynacellGAN modernization knobs (StyleGAN2/R3GAN recipe)

Adds opt-in modernization to DynacellGAN with legacy-safe defaults
(loss_type=lsgan, r1_gamma=0, ema_kimg=None, lecam_gamma=0). The 14
existing leaves composing pix2pix3d_unetvit_fit.yml are unaffected
because none of them set the new knobs and the new defaults preserve
pre-modernization behavior.

New constructor knobs:
- loss_type {lsgan, nonsat, rpgan} + lambda_adv
- r1_gamma / r2_gamma / r1_every (lazy schedule with StyleGAN2-style
  * r1_every unbiased rescaling). RpGAN raises ValueError unless
  r1_gamma > 0 (R3GAN Theorem 3.1 convergence requirement).
- ema_kimg + use_ema_at_predict for the generator EMA shadow. Decay
  derived per-step as 0.5 ** (global_batch / (ema_kimg * 1000)).
- lecam_gamma + lecam_decay (decay=0.9 matches google/lecam-gan reference).

Training-step structural changes:
- D-step counter (_d_step_count buffer) drives the lazy R1 schedule;
  self.global_step would advance per opt.step() (2 per training_step
  with manual_optimization) so it can't be used directly.
- R1 / R2 path is outer-guarded on (r1_gamma > 0 or r2_gamma > 0) so
  legacy leaves skip the grad-of-grad compute and the spectral-norm
  power-iteration advance.
- R1 / R2 compute is wrapped in autocast(enabled=False); Lightning's
  bf16-mixed plugin would otherwise put bf16 into D forwards which
  is fragile under create_graph=True grad-of-grad.
- RpGAN G step recomputes fresh d_real against the post-update
  discriminator (matches R3GAN Trainer.py convention).
- LeCam EMA update uses sync_dist'd batch means via self.all_gather so
  all ranks update the buffers identically without explicit all_reduce.
- Generator EMA update after opt_g.step() (params are DDP-synced at
  that point, so per-rank EMA stays in lockstep).

Inference paths:
- _inference_generator helper returns EMA when available AND
  use_ema_at_predict; used by forward and predict_step.
- validation_step runs both raw and EMA generator forwards (when EMA
  exists); emits both loss/validate (back-compat) and loss/validate_ema
  aliases.

Checkpoint loading at engine.py:802 changed to strict=False with
filtered warning logs (expected_missing for generator_ema.* /
_d_step_count / _lecam_ema_*). When loading a pre-modernization
checkpoint with EMA enabled, the EMA shadow is seeded from the loaded
generator weights so inference paths return the loaded weights instead
of the random-init deepcopy.

16 DynacellGAN tests cover legacy-default safety, modernized smoke, the
R1/R2 gamma guard, lazy-R1 step counter, RpGAN constructor validation,
EMA seeding on legacy load, LeCam buffer registration gating, two-pass
validation logging, and inference-helper EMA selection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dynacell): add modernized pix2pix3d overlays + SEC61B Run A leaf

New opt-in configs for the StyleGAN2/R3GAN-style modernized pix2pix3d
recipe:

- pix2pix3d_unetvit_fit_modernized.yml: single-GPU overlay with
  loss_type=nonsat, r1_gamma=10, r1_every=16, ema_kimg=10,
  lr_g=lr_d=2e-4. Existing pix2pix3d_unetvit_fit.yml stays untouched
  so the 14 legacy leaves continue running LSGAN unchanged.
- pix2pix3d_unetvit_fit_ddp_modernized.yml: 4-GPU DDP sibling, same
  model knobs, composes ddp_4gpu_gan topology
  (find_unused_parameters=True).
- er/pix2pix3d_unetvit/ipsc_confocal/train_4gpu_modernized.yml: Run A
  of the plan. SEC61B headline experiment. ModelCheckpoint monitors
  loss/validate_ema (the EMA generator's val loss = inference path).
  Checkpoints land in a distinct subtree from the LSGAN baseline so
  there is no collision risk with in-flight runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): harden DynacellGAN against silent failures (PR #428 review)

Addresses findings from the PR review subagents:

- ema_kimg=0 raises ValueError (silent EMA freeze prevention).
- RpGAN now requires both r1_gamma > 0 AND r2_gamma > 0 (R3GAN Theorem 3.1
  requires both penalties for sharp-distribution convergence; previously
  only r1_gamma was validated).
- ckpt_path strict=False load: missing-but-not-expected keys now RAISE
  RuntimeError instead of warnings.warn (silent half-loaded models burn
  hours of training before users notice). Unexpected-keys path switched
  from warnings.warn to _logger.warning.
- Partial-EMA checkpoints (some but not all generator_ema.* keys) now
  RAISE — silently half-seeding the EMA shadow would produce nonsense
  outputs in the missing layers.
- on_predict_start logs which generator (EMA vs raw) will be used at
  inference, so users notice silent fallbacks (predict overlay forgot
  ema_kimg, use_ema_at_predict=False, etc.).

New tests:
- test_dynacell_gan_rpgan_requires_r2
- test_dynacell_gan_ema_kimg_zero_raises
- test_dynacell_gan_rpgan_smoke (end-to-end trainer.fit with rpgan +
  R1+R2; exercises the fresh-d_real RpGAN G-step path)
- test_dynacell_gan_ema_update_math (closed-form EMA decay verification)
- test_dynacell_gan_use_ema_at_predict_false (predict_step returns raw
  output when EMA is gated off)
- test_dynacell_gan_ckpt_unexpected_missing_raises (corrupted ckpt
  surfaces as RuntimeError, not a swallowed warning)

Existing test_dynacell_gan_ema_seed_on_legacy_load updated to use
use_spectral_norm=False on both sides — with the stricter
unexpected-missing check, SN parametrization name mismatches would
otherwise trip the raise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): validate r1_every >= 1 in DynacellGAN.__init__

r1_every is the modulo period for the lazy R1/R2 schedule and is also
used as the unbiased rescaling factor on the penalty term. Without an
init-time guard, r1_every=0 would ZeroDivisionError on the first D step
when R1/R2 is enabled, and a negative value would silently fire the gate
on the wrong cadence. Raise at construction so the bad config fails
before any training time is burned.

Surfaced by Copilot review on PR #428.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): gate val sample logging on use_ema_at_predict

DynacellGAN.validation_step assigned pred_for_samples=pred_ema whenever
the EMA submodule existed, regardless of use_ema_at_predict. This
contradicted predict_step / forward (both routed through
_inference_generator, which respects the flag) — the dashboard could
show EMA-generator samples while real predictions were emitted by the
raw generator. Now matches the inference path.

Regression test stubs out both generators with distinguishable constant
modules and reads back validation_step_outputs to confirm which one
fed the dashboard for each value of use_ema_at_predict.

Surfaced by Copilot review on PR #428.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(dynacell): keep _d_step_count as Python int to drop per-step sync

Lazy-reg gate previously read `int(self._d_step_count)` each training
step where the counter was a CUDA buffer. Each .item() call forced a
device→host sync. Switch to a plain Python int (incremented in
training_step, modulo-gated against r1_every) and persist it via
on_save_checkpoint / on_load_checkpoint hooks. All DDP ranks start at 0
and advance deterministically, so the buffer-was-broadcast invariant is
preserved without per-step sync overhead.

Side benefit: `_d_step_count` is no longer a state_dict key, so the
strict-load filter drops one expected-missing prefix. Legacy checkpoints
without `_d_step_count` resume cleanly at 0 via the dict.get fallback.

Surfaced by Copilot review on PR #428.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(dynacell): point modernized overlay header at PR #428, not local path

The header comment referenced /home/alex.kalinin/.claude/plans/..., which
won't exist for other contributors or in CI. Swap for the PR URL so the
recipe rationale is reachable from the file.

Surfaced by Copilot review on PR #428.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(dynacell): point SEC61B Run A leaf header at PR #428, not local path

Same fix as the modernized overlay header — replace
/home/alex.kalinin/.claude/plans/... reference with the PR URL so it
resolves for everyone.

Surfaced by Copilot review on PR #428.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@edyoshikun edyoshikun merged commit 393abdf into modular-viscy-staging May 22, 2026
alxndrkalinin added a commit that referenced this pull request May 22, 2026
…g/dynacell-dynadtw

PR #432 base modular-viscy-staging diverged from staging/dynacell-dynadtw
when #404 squashed all of dynacell-models history into a single commit
(393abdf). dynacell-models has 30 newer commits since dynadtw's last sync
at f1df36b — re-eval campaign generator, controlled FOV-parallel eval
(#426), 3D pix2pix GAN baseline (#425), DeepFeatureBatcher / precompute
deep features, cache identity refactors, pix2pix3d_unetvit modernization
(#428), backbone bookkeeping refactor onto _BackboneLists.

Conflict triage (92 conflicts):

- 89 files where staging/dynacell-dynadtw carried stale copies of
  dynacell-models content (no post-branch edits on dynadtw side):
  84 AA conflicts (eval YAML configs, dynacell evaluation sources/tests,
  CLAUDE.md, tools) + 5 UU conflicts (engine.py, __init__.py, __main__.py,
  test_engine.py, .github/workflows/test.yml). Resolved by taking the
  modular-viscy-staging post-#404 version.

- 3 files with genuine concurrent edits, resolved per direction:
  * .gitignore — dynadtw side is a strict superset (added local
    evaluation_matrix.md exclusion). Take dynadtw.
  * packages/viscy-models/.../foundation/cell_dino.py — dynadtw added
    CellDinoModel wrapper + per-image normalize flag; modular-viscy-staging
    didn't touch post-branch. Take dynadtw.
  * pyproject.toml — modular-viscy-staging added pytest "slow" marker +
    addopts "-m not slow" (no-overlap, kept); both sides identically added
    aics* git sources, py312 bump, pythonpath. iohub: modular-viscy-staging
    kept a git override pin at afe11f6a (~0.3.3); dynadtw dropped it for
    iohub>=0.3.5 from PyPI. Took dynadtw's strategy (PyPI 0.3.5 is
    released; viscy-data dep pin already at >=0.3.5 on this side).

- uv.lock regenerated from the merged pyproject.toml (iohub 0.3.5 from
  PyPI registry).

Sanity check: uv sync --all-packages --all-extras + uv run pytest -m
"not slow" applications/dynacell/tests/ shows 51 failed / 459 passed /
78 skipped. All 51 failures are pre-existing on dynacell-models HEAD
(test_benchmark_config_composition strategy/leaf composition issues +
test_evaluation_metrics gain-and-offset) — verified by running the same
suite on the clean backbone-refactor worktree. The merge introduces no
new regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alxndrkalinin added a commit that referenced this pull request May 23, 2026
Resolves the 258 conflicts blocking PR #429 (252 AA from independent
adds across both branches' dynacell scaffolding, plus 6 UU).

AA resolution:
- 250 files → take ours (backbone-refactor). The diffs are uniformly
  "ours is the newer evolution": cache_dir organelle-namespacing
  (fdf71bb), any-GPU launcher profile (7943915), Phase B
  preprocess_version (6c23471), _BackboneLists refactor (PR #429
  scope), microssim → cubic (PR #429 scope), pixel-metric
  unification (PR #429 scope), tests targeting all of the above.
  Theirs is the older PR #404 squash + PR #432 absorption state.
- 2 files → take theirs:
  * packages/viscy-models/src/viscy_models/gan/__init__.py
  * packages/viscy-models/src/viscy_models/gan/losses.py
  Base side has the pix2pix3d_unetvit GAN modernization
  (lsgan/nonsat/rpgan losses, R1/R2 penalties, EMA, LeCam).
  Job 33232131 depends on this — must not be regressed.

UU resolution:
- .gitignore → theirs (add evaluation_matrix.md path).
- pyproject.toml (root) → theirs (drop temp cubic/iohub git overrides;
  both are released on PyPI now).
- applications/dynacell/pyproject.toml → ours (cubic==0.7.0a5, no
  microssim — this IS PR #429's stated change).
- applications/dynacell/src/dynacell/engine.py → theirs (DynacellGAN
  modernization, same reason as the GAN package files).
- applications/dynacell/tests/test_engine.py → theirs (GAN
  modernization test suite).
- uv.lock → regenerated via uv sync after fixing pyprojects.

Sanity: cross-module imports (dynacell.evaluation + DynacellGAN +
viscy_models.gan including new nonsat/rpgan/r1/r2 losses) succeed.
pytest applications/dynacell/tests/test_pipeline_cache.py
applications/dynacell/tests/test_engine.py → 83/83 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants