Staging VisCy Monorepo #373
Open
edyoshikun wants to merge 25 commits into
* refactor: restructure viscy into uv workspace monorepo with viscy-transforms subpackage
BREAKING CHANGE:
- Import path changed: from viscy.transforms import X → from viscy_transforms import X
- Removed legacy modules: viscy.utils, viscy.cli, viscy.unet, viscy.evaluation
- Removed applications/, examples/, and docs/ directories
Signed-off-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
* docs: updated readme with citations / examples from main
Signed-off-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
* docs: added symlinking uv cache on hpc systems
Signed-off-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
* add the jupyter and ipykernel to an optional visual group
* reorganize the transforms
* add jupyter notebook back
* build: updated some dep groups
Signed-off-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
* build: added matplotlib back in
Signed-off-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
* build: add ruff and prek to the dev dep group
* docs: correct the docstring for the transform that doesn't exist lol
Signed-off-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
---------
Signed-off-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
Co-authored-by: Sricharan Reddy Varra <sricharan.varra@biohub.org>
Co-authored-by: Eduardo Hirata-Miyasaki <edhiratam@gmail.com>
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
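The import-path change listed under BREAKING CHANGE above means downstream code has to be updated. As a rough sketch only, the rewrite could be automated with a small helper like the one below. `rewrite_transform_imports` is a hypothetical name, not something this PR ships, and it handles only the exact `from viscy.transforms import X` form quoted in the commit message.

```python
import re

# Hypothetical helper, not part of VisCy. It rewrites only the exact form
# named in the commit message: `from viscy.transforms import X`.
_LEGACY_IMPORT = re.compile(
    r"^([ \t]*)from[ \t]+viscy\.transforms[ \t]+import\b",
    re.MULTILINE,
)


def rewrite_transform_imports(source: str) -> str:
    """Return `source` with legacy transform imports pointed at viscy_transforms."""
    # \1 keeps the original indentation of the import line.
    return _LEGACY_IMPORT.sub(r"\1from viscy_transforms import", source)


legacy = "from viscy.transforms import X\nimport os\n"
print(rewrite_transform_imports(legacy))
```

Other import forms (for example `import viscy.transforms as vt`) are deliberately left untouched and would need manual review.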
* add planning roadmap * docs: start milestone v1.1 Models * docs: define milestone v1.1 requirements * docs: create milestone v1.1 roadmap (5 phases) * docs(package-scaffold-shared-components): research phase domain * docs(06): create phase plan - package scaffold and shared components * feat(06-01): create viscy-models package scaffold - Add pyproject.toml with hatchling build, torch/timm/monai/numpy deps - Create src layout with _components, unet, contrastive, vae subpackages - Add PEP 561 py.typed marker - Add test scaffolding with device fixture Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(06-01): register viscy-models in workspace - Add viscy-models to root dependencies and uv sources - Update lockfile with timm and viscy-models dependencies - Verified: uv sync, import, and pytest collection all succeed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(06-01): complete package scaffold plan - Add 06-01-SUMMARY.md with execution results - Update STATE.md with position, metrics, and decisions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(06-02): extract shared components into _components/ module - stems.py: UNeXt2Stem, StemDepthtoChannels from v0.3.3 unext2.py - heads.py: PixelToVoxelHead, UnsqueezeHead, PixelToVoxelShuffleHead - blocks.py: icnr_init, _get_convnext_stage, UNeXt2UpStage, UNeXt2Decoder - __init__.py: re-exports all 8 public components - Zero imports from unet/, vae/, or contrastive/ - All attribute names preserved for state dict compatibility * feat(06-03): migrate ConvBlock2D and ConvBlock3D to unet/_layers/ - Copy ConvBlock2D from v0.3.3 source to snake_case file - Copy ConvBlock3D from v0.3.3 source to snake_case file - Preserve register_modules/add_module pattern for state dict key compatibility - Update _layers/__init__.py with public re-exports - Fix docstring formatting for ruff D-series compliance * test(06-03): add tests for ConvBlock2D and ConvBlock3D 
- 6 tests for ConvBlock2D: forward pass, state dict keys, residual, filter steps, instance norm - 4 tests for ConvBlock3D: forward pass, state dict keys, dropout registration, layer order - All 10 tests verify shape, naming patterns, and module registration * test(06-02): add forward-pass tests for all _components - test_stems.py: UNeXt2Stem shape, StemDepthtoChannels shape + mismatch error - test_heads.py: PixelToVoxelHead, UnsqueezeHead, PixelToVoxelShuffleHead shapes - test_blocks.py: icnr_init, _get_convnext_stage, UNeXt2UpStage, UNeXt2Decoder - 10 tests total, all passing on CPU * docs(06-03): complete UNet ConvBlock layers plan - SUMMARY.md with migration details and self-check - STATE.md updated: phase 6 plan 3/3, decisions recorded * docs(06-02): complete shared components extraction plan - SUMMARY.md documents 8 extracted components with 10 tests - STATE.md updated with decisions from 06-02 execution * docs(phase-6): complete phase execution and verification Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(07-core-unet-models): research phase domain * docs(07): create phase plan for core UNet models * feat(07-01): migrate UNeXt2 model class to viscy-models - Copy UNeXt2 class (~70 lines) from monolithic unext2.py - Update imports to use viscy_models._components (stems, heads, blocks) - Preserve all attribute names for state dict compatibility - Export UNeXt2 from viscy_models.unet public API Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(07-01): add 6 UNeXt2 forward-pass tests and fix deconv tuple bug - Add tests: default, small backbone, multichannel, diff stack depths, deconv, stem validation - Fix deconv decoder tuple bug in UNeXt2UpStage (trailing comma created tuple not module) - Mark deconv test xfail: original code has channel mismatch in deconv forward path - All 26 tests pass (25 passed, 1 xfailed) with no regressions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * 
docs(07-01): complete UNeXt2 migration plan - Add 07-01-SUMMARY.md with execution results and deviation documentation - Update STATE.md: phase 7, plan 1/2 complete, new decisions logged Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(07-02): migrate FullyConvolutionalMAE to viscy-models - Copy FCMAE and all helper classes/functions to unet/fcmae.py - Replace old viscy imports with viscy_models._components imports - Remove duplicated PixelToVoxelShuffleHead (import from _components.heads) - Fix mutable list defaults to tuples (encoder_blocks, dims) - Export both UNeXt2 and FullyConvolutionalMAE from unet/__init__.py * test(07-02): migrate 11 FCMAE tests to viscy-models - Copy all 11 test functions with zero logic changes - Update imports from viscy.unet.networks.fcmae to viscy_models.unet.fcmae - Import PixelToVoxelShuffleHead from viscy_models._components.heads - All 37 tests pass across full suite (no regressions) * docs(07-02): complete FCMAE migration plan (Phase 7 complete) - Add 07-02-SUMMARY.md with execution results - Update STATE.md: Phase 7 complete, 12 plans total, decisions logged * docs(phase-7): complete phase execution and verification Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(phase-8): research representation models migration * docs(08): create phase plan for representation models * feat(08-02): migrate BetaVae25D and BetaVaeMonai to viscy-models - Add BetaVae25D with VaeUpStage, VaeEncoder, VaeDecoder helpers - Add BetaVaeMonai wrapping MONAI VarAutoEncoder - Fix VaeDecoder mutable list defaults to tuples (COMPAT-02) - Change VaeEncoder pretrained default to False - Preserve all attribute names for state dict compatibility Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(08-01): migrate ContrastiveEncoder and ResNet3dEncoder to viscy-models - Add ContrastiveEncoder with convnext/resnet50 backbone support via timm - Add ResNet3dEncoder with MONAI ResNetFeatures 
backend - Fix ResNet50 bug: use encoder.num_features instead of encoder.head.fc.in_features - Add pretrained parameter (default False) for pure nn.Module semantics - Preserve state dict attribute names (stem, encoder, projection) - Share projection_mlp utility between both encoder classes * test(08-01): add 5 forward-pass tests for contrastive models - 3 tests for ContrastiveEncoder: convnext_tiny, resnet50, custom stem - 2 tests for ResNet3dEncoder: resnet18, resnet10 - Verify embedding and projection output shapes - ResNet50 test uses in_stack_depth=10 for valid stem channel alignment * test(08-02): add forward-pass tests for BetaVae25D and BetaVaeMonai - 2 BetaVae25D tests: resnet50 and convnext_tiny backbones - 2 BetaVaeMonai tests: 2D and 3D spatial configurations - Verify SimpleNamespace output with recon_x, mean, logvar, z - Fix ResNet50 expected spatial dims (64x64 not 128x128) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(08-01): complete contrastive model migration plan - Add 08-01-SUMMARY.md with execution results - Update STATE.md to Phase 8, plan 1/2 * docs(08-02): complete VAE migration plan (Phase 8 complete) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(phase-8): complete phase execution and verification Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(09): research legacy UNet models migration * docs(09): create phase plan for legacy UNet models * feat(09-01): migrate Unet2d and Unet25d to viscy-models - Copy Unet2d from v0.3.3 with import path update to viscy_models.unet._layers - Copy Unet25d from v0.3.3 with import path update to viscy_models.unet._layers - Fix mutable default num_filters=[] to num_filters=() in both models - Add module docstrings and __all__ exports - Update unet/__init__.py to export all 4 models (UNeXt2, FCMAE, Unet2d, Unet25d) - Preserve register_modules/add_module pattern for state dict compatibility - All 45 existing tests still pass 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(09-01): add pytest tests for Unet2d and Unet25d - 12 tests for Unet2d: default forward, variable depth, multichannel, residual, task mode, dropout, state dict keys, custom num_filters - 11 tests for Unet25d: default Z-compression, preserved depth, variable depth, multichannel, residual, task mode, state dict keys with skip_conv_layer, custom filters - Fix list(num_filters) conversion in both models for tuple default compatibility - Total test suite: 68 passed, 1 xfailed, 0 failures Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(09-01): complete legacy UNet migration plan - SUMMARY.md with task commits, deviations, and self-check - STATE.md updated to Phase 9 complete (15 plans total) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(phase-9): complete phase execution and verification Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(10): create phase plan for public API and CI integration * feat(10-01): add top-level re-exports for all 8 model classes - Import UNeXt2, FullyConvolutionalMAE, Unet2d, Unet25d from unet subpackage - Import ContrastiveEncoder, ResNet3dEncoder from contrastive subpackage - Import BetaVae25D, BetaVaeMonai from vae subpackage - Update __all__ with all 8 classes in alphabetical order Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(10-01): add state dict key compatibility regression tests - 24 tests covering all 8 migrated model architectures - Each model tested for parameter count, top-level prefixes, and sentinel keys - Guards COMPAT-01: state dict keys must match for checkpoint loading - Tests import from top-level viscy_models package (validates public API) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore(10-01): add viscy-models to CI test matrix - Add package dimension to test matrix (viscy-transforms, viscy-models) - Use 
cross-platform --cov=src/ instead of named package coverage
- Matrix now produces 18 jobs (3 OS x 3 Python x 2 packages)
- check job automatically aggregates all test results via alls-green
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(10-01): complete public API & CI integration plan (v1.1 milestone complete)
- Add 10-01-SUMMARY.md with execution results
- Update STATE.md: phase 10 complete, v1.1 milestone done, 100% progress
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(phase-10): complete phase execution and verification (v1.1 milestone complete)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: consolidate ConvBlock2D/3D into _components
Move conv_block_2d.py and conv_block_3d.py from unet/_layers/ to _components/ alongside all other shared building blocks. All reusable layers now live in one place. unet/_layers/ retained as backward-compatible re-export shim.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* remove the _layers
* update the readme
* update the main readme
* fix description in toml
* changing ruff formatting, docstrings and imports
* renaming folder to components
* update planning docs
* update to components
* numpy docstring
* add claude.md and contributing.md
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
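The test(07-01) commit in this squash mentions a deconv bug where a trailing comma created a tuple instead of a module. The actual UNeXt2UpStage code is not shown here, so the following is only an illustration of that class of bug with stand-in classes (`Upsample`, `BuggyStage`, and `FixedStage` are hypothetical names).

```python
class Upsample:
    """Hypothetical stand-in for a callable layer such as an nn.Module."""

    def __call__(self, x):
        return x * 2


class BuggyStage:
    def __init__(self):
        # The trailing comma makes this a 1-tuple, not an Upsample instance.
        self.upsample = Upsample(),


class FixedStage:
    def __init__(self):
        self.upsample = Upsample()


buggy, fixed = BuggyStage(), FixedStage()
print(type(buggy.upsample).__name__)  # the attribute is a tuple
print(fixed.upsample(3))              # the fixed attribute is callable

try:
    buggy.upsample(3)
except TypeError as err:
    # Calling a tuple fails only at forward time, which is why a
    # forward-pass test is what surfaces this kind of bug.
    print(f"TypeError: {err}")
```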
* add planning roadmap * docs: start milestone v1.1 Extract viscy-data * docs: complete viscy-data project research * docs: define milestone v1.1 requirements * docs: create milestone v1.1 roadmap (4 phases) * docs(06-package-scaffolding-and-foundation): create phase plan * feat(06-01): create viscy-data package directory structure with pyproject.toml - Add pyproject.toml with hatchling build, uv-dynamic-versioning, all base deps - Declare optional dependency groups: triplet, livecell, mmap, all - Add PEP 561 py.typed marker and tests/__init__.py - Configure pattern-prefix for independent versioning Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(06-01): add type definitions and package init with re-exports - Copy all type definitions from viscy/data/typing.py into _typing.py - Add INDEX_COLUMNS from viscy/data/triplet.py for shared access - Update typing_extensions.NotRequired to typing.NotRequired (Python >=3.11) - Create __init__.py with full re-export of all public types - Add README.md required by hatchling build Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(06-01): integrate viscy-data as workspace dependency in root pyproject.toml - Add viscy-data to root dependencies list - Register viscy-data as workspace source in [tool.uv.sources] - Verified editable install and full import chain works Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(06-01): complete package scaffolding plan with summary and state update - Add 06-01-SUMMARY.md documenting viscy-data package creation - Update STATE.md with plan position, metrics, and decisions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(06-02): extract shared utility functions into _utils.py - Extract _ensure_channel_list, _search_int_in_str, _collate_samples, _read_norm_meta from hcs.py - Extract _scatter_channels, _gather_channels, _transform_channel_wise from triplet.py - Update imports to use 
viscy_data._typing instead of viscy.data.typing - Add __all__ listing all 7 utility functions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(06-02): complete utility module extraction plan - Add 06-02-SUMMARY.md documenting utility extraction - Update STATE.md: Phase 6 complete, progress 80% Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(phase-6): complete phase execution * docs(07-code-migration): create phase plan * feat(07-01): migrate select.py, distributed.py, segmentation.py to viscy-data - Copy select.py with well/FOV filtering utilities (no internal viscy imports) - Copy distributed.py with ShardedDistributedSampler (no internal viscy imports) - Copy segmentation.py with viscy.data.typing -> viscy_data._typing import update - Add missing docstrings to satisfy ruff D rules Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(07-01): migrate hcs.py to viscy-data with utility import rewiring - Copy HCSDataModule, SlidingWindowDataset, MaskTestDataset from main - Replace viscy.data.typing imports with viscy_data._typing - Remove 4 utility function definitions (now in _utils.py) - Add import from viscy_data._utils for shared utilities - Remove unused re and collate_meta_tensor imports - Add missing docstrings for ruff D compliance Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(07-01): migrate gpu_aug.py to viscy-data with dependency rewiring - Copy GPUTransformDataModule, CachedOmeZarrDataset, CachedOmeZarrDataModule - Rewire viscy.data.distributed -> viscy_data.distributed - Rewire viscy.data.hcs utility imports -> viscy_data._utils - Rewire viscy.data.select -> viscy_data.select - Rewire viscy.data.typing -> viscy_data._typing - Add missing docstrings for ruff D compliance Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(07-01): complete core data module migration plan - Add 07-01-SUMMARY.md documenting migration of 5 core 
modules - Update STATE.md: phase 7 plan 1 of 4, decisions, metrics Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(07-03): migrate mmap_cache.py and ctmc_v1.py to viscy-data - Rewire all imports from viscy.data to viscy_data prefix - Add lazy import for tensordict with clear error message - Add docstrings for ruff D compliance Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(07-03): migrate livecell.py with lazy optional dependency imports - Rewire imports from viscy.data to viscy_data prefix - Add lazy imports for pycocotools, tifffile, torchvision - Add import guards in LiveCellDataset and LiveCellTestDataset __init__ - Add docstrings for ruff D compliance Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(07-03): migrate combined.py as-is with import rewiring - Rewire viscy.data.distributed to viscy_data.distributed - Rewire viscy.data.hcs._collate_samples to viscy_data._utils._collate_samples - Preserve all 6 public classes without structural changes - Add docstrings for ruff D compliance Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(07-02): migrate cell_classification.py and cell_division_triplet.py - Rewire imports from viscy.data to viscy_data prefix - Add lazy import for pandas in cell_classification.py with clear error message - Import _transform_channel_wise from viscy_data._utils (not triplet.py) - Import INDEX_COLUMNS and AnnotationColumns from viscy_data._typing - Add docstrings for ruff D compliance * docs(07-03): complete optional dependency module migration plan Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(07-02): complete specialized module migration plan - Add 07-02-SUMMARY.md documenting triplet, classification, and cell division module migration - Update STATE.md with position, decisions, and metrics Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(07-04): add complete public API 
exports to viscy_data __init__.py - Export all 45 public names (17 types, 2 utilities, 26 DataModules/Datasets/enums) - Eager imports from all 13 modules (lazy guards handled internally by each module) - Comprehensive __all__ list for IDE autocompletion and star-import support - Ruff-sorted import ordering passes all lint checks Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(07-04): complete public API exports plan - phase 7 fully done - 07-04-SUMMARY.md documenting 45 public exports and full package verification - STATE.md updated: phase 7 complete (4/4 plans), 12 total plans done Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(phase-7): complete code migration execution * docs(08-test-migration-and-validation): create phase plan * test(08-01): add conftest.py with HCS OME-Zarr fixtures for viscy-data - Copy all 6 fixtures and _build_hcs helper from main branch conftest - Replace legacy np.random.rand with np.random.default_rng (NPY002) - No viscy import changes needed (only uses third-party libs) - Provides preprocessed_hcs_dataset, small_hcs_dataset, tracks fixtures Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(08-02): complete smoke tests plan - phase 8 test migration done - Created 08-02-SUMMARY.md documenting 52 smoke tests for viscy_data - Updated STATE.md: phase 8 complete, 14 total plans executed - DATA-TST-02 satisfied: import, __all__, optional dep messages, no legacy namespace Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(08-01): migrate test_hcs, test_triplet, test_select to viscy-data package - Update imports from viscy.data.X to viscy_data - Add BatchedCenterSpatialCropd to _utils.py (fixes batch dim handling) - Fix triplet.py to use BatchedCenterSpatialCropd instead of CenterSpatialCropd - Add tensorstore to test dependency group for triplet tests - All 19 tests pass across 3 test files Co-Authored-By: Claude Opus 4.6 (1M context) 
<noreply@anthropic.com> * docs(08-01): complete data test migration plan summary - Create 08-01-SUMMARY.md documenting test migration and bug fixes - Update STATE.md with BatchedCenterSpatialCropd decision revision Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(phase-8): complete test migration and validation * docs(09-ci-integration): create phase plan * feat(09-01): add viscy-data CI test jobs to GitHub Actions workflow - Add test-data job with 3x3 matrix (3 OS x 3 Python) for viscy-data - Add test-data-extras job (ubuntu-latest, Python 3.13) for extras validation - Update check job needs to aggregate all test jobs: test, test-data, test-data-extras Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(09-01): complete CI integration plan - Add 09-01-SUMMARY.md documenting viscy-data CI jobs - Update STATE.md: phase 9 complete, v1.0 milestone complete Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(phase-9): complete CI integration - milestone v1.1 done * chore: complete v1.1 milestone — Extract viscy-data Delivered: viscy-data package with 15 modules, 45 public exports, optional dependency groups, 71 tests, and tiered CI. 
Archives:
- milestones/v1.1-ROADMAP.md
- milestones/v1.1-REQUIREMENTS.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add missing pandas guards, restore conftest fixture, remove no-op CI filter
- Add `if pd is None` guard in ClassificationDataModule.setup() and TripletDataModule._align_tracks_tables_with_positions() to raise helpful ImportError instead of AttributeError when pandas is absent
- Fix ClassificationDataset error message to suggest `pip install pandas` instead of `pip install 'viscy-data[triplet]'` (classification doesn't need tensorstore)
- Restore `num_timepoints` parameter on `_build_hcs()` and add `temporal_hcs_dataset` fixture from upstream commit 44b25b9
- Remove no-op `-m "not slow"` from test-data-extras CI job (no tests use @pytest.mark.slow)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add CLAUDE.md and update CONTRIBUTING.md
Add CLAUDE.md with project-specific instructions for Claude Code sessions. Update CONTRIBUTING.md with ruff config centralization warning and numpy docstring convention note. Synced from 71009b5.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: update uv.lock after rebase onto modular-viscy-staging
Regenerate lockfile to include viscy-data workspace dependencies alongside viscy-models from the updated base branch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* port changes from tests-zarrv3 branch
* ruff
* add additional test for the main dataloaders
* redundant test 3. I think test 2 already takes care of this.
* rename INDEX_COLUMNS
* remove unused LABEL classes
* fix(livecell): assign transform result and avoid mutable defaults
LiveCellTestDataset.__getitem__ discarded the return value of self.transform(sample), so MONAI transforms had no effect. Also replace mutable default lists in LiveCellDataModule.__init__ with None to prevent cross-instance state sharing.
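The livecell fix above describes a transform whose return value was discarded. A minimal sketch of why that silently does nothing, assuming the transform returns a new object rather than mutating its input (`scale`, `getitem_buggy`, and `getitem_fixed` are hypothetical names, not the LiveCellTestDataset code):

```python
def scale(sample: dict) -> dict:
    """Hypothetical non-mutating transform: builds and returns a new dict."""
    return {key: value * 2 for key, value in sample.items()}


def getitem_buggy(sample: dict, transform) -> dict:
    transform(sample)  # result is thrown away, so `sample` is unchanged
    return sample


def getitem_fixed(sample: dict, transform) -> dict:
    sample = transform(sample)  # keep the transformed result
    return sample


print(getitem_buggy({"img": 1}, scale))
print(getitem_fixed({"img": 1}, scale))
```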
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(cell_classification): raise ValueError, tighten val_fovs type, fix mutable default - `raise (f"Unknown stage: {stage}")` raised a string instead of an exception — use `ValueError`. - `val_fovs: list[str] | None` was unconditionally indexed in `setup()` — remove the `None` option since it's always required. - `_subset(..., exclude_timepoints=[])` used a mutable default — replace with `None`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(gpu_aug): remove _filter_fit_fovs override so exclude_fovs is applied CachedOmeZarrDataModule accepted exclude_fovs but its local _filter_fit_fovs override only filtered wells, silently ignoring excluded FOVs. Remove the override so the SelectWell mixin's implementation (which filters both wells and FOVs) is used. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: rename select.py to _select.py (private module) The module contains mostly private helpers (_filter_wells, _filter_fovs) and a mixin dataclass (SelectWell). Renaming to _select.py signals it is internal implementation detail. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update package descriptions to "AI x Imaging" and CLAUDE.md Update viscy-data description from "virtual staining microscopy" to "AI x Imaging tasks" in README, pyproject.toml, and __init__.py. Add viscy-models test example to CLAUDE.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ctmc_v1): add missing prefetch_factor attribute CTMCv1DataModule.__init__ did not set self.prefetch_factor, causing an AttributeError when train_dataloader() or val_dataloader() was called (inherited from GPUTransformDataModule). Set it to None. 
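The cell_classification fix above notes that `raise (f"Unknown stage: {stage}")` does not raise the intended error. In Python 3 that statement fails with a TypeError, because a string is not an exception, so the helpful message never reaches the caller as written. A small sketch with hypothetical function and stage names:

```python
VALID_STAGES = ("fit", "validate")  # assumed values, for illustration only


def setup_buggy(stage: str) -> None:
    if stage not in VALID_STAGES:
        # Raising a str is itself an error: "exceptions must derive from BaseException".
        raise (f"Unknown stage: {stage}")


def setup_fixed(stage: str) -> None:
    if stage not in VALID_STAGES:
        raise ValueError(f"Unknown stage: {stage}")


for func in (setup_buggy, setup_fixed):
    try:
        func("predict")
    except (TypeError, ValueError) as err:
        print(f"{func.__name__}: {type(err).__name__}: {err}")
```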
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add functional tests for ctmc, livecell, segmentation, classification Add unit tests for the four data modules that previously only had import smoke tests: - test_ctmc_v1.py: setup, val subsample ratio, batch shape - test_livecell.py: dataset/datamodule with mock TIFF + COCO data - test_segmentation.py: paired pred/target datasets, z-slice - test_cell_classification.py: annotation CSV, FOV split, timepoint exclusion Also adds shared fixtures to conftest.py (single_channel_hcs_pair, segmentation_hcs_pair, classification_hcs_dataset). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(livecell): add missing prefetch_factor; add dataloader iteration tests Add prefetch_factor attribute to LiveCellDataModule (same fix as 97455eb for CTMC). Add batch iteration + shape validation to classification and livecell datamodule tests to match coverage patterns in test_hcs/test_triplet. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: delete leftover select.py after rename to _select.py Commit 5b132f9 renamed select.py to _select.py but did not remove the original file. Nothing imports from it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: use explicit positions[0] instead of leaked loop variable CachedOmeZarrDataset and MmappedDataset both built self.channels using the loop variable `position` after iterating, implicitly depending on the last element. Use positions[0] to be explicit and avoid UnboundLocalError on empty input. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(livecell): handle empty annotations and correct return type Guard torch.stack against empty annotation lists in LiveCellTestDataset.__getitem__ when load_labels=True, returning properly shaped empty tensors instead of crashing. Fix _parse_image_names return type annotation: list[Path] -> list[str]. 
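The positions[0] fix above concerns a loop variable read after the loop has finished. A minimal sketch of the difference on empty input, using hypothetical functions and plain dicts in place of the real position objects:

```python
def channels_buggy(positions: list[dict]) -> list[str]:
    for position in positions:
        pass  # per-position work would happen here
    # Implicitly depends on the last loop iteration; `position` is never
    # bound if `positions` is empty.
    return position["channels"]


def channels_fixed(positions: list[dict]) -> list[str]:
    for position in positions:
        pass
    # Explicit about which element is used; empty input now fails with IndexError.
    return positions[0]["channels"]


data = [{"channels": ["Phase", "DAPI"]}]
print(channels_fixed(data))

for func in (channels_buggy, channels_fixed):
    try:
        func([])
    except (UnboundLocalError, IndexError) as err:
        print(f"{func.__name__}: {type(err).__name__}")
```

Note that with several positions the two versions read different elements (last versus first), so the explicit form is equivalent only when all positions share the same channel names.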
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(segmentation): defer open_ome_zarr from __init__ to setup() Store only paths in __init__ and open OME-Zarr stores in setup("test"), consistent with other DataModules in the package and Lightning conventions for resource lifecycle. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * print to logger * refactor(triplet): remove BatchedCenterSpatialCropd from viscy-data The transform already exists in viscy-transforms and viscy-data should not depend on it. Replace the final crop with a shape validation check in on_after_batch_transfer() and require initial_yx_patch_size to match the desired output size. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(docs): convert Sphinx-style docstrings to numpy style in _utils.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(init): lazy-load submodules via PEP 562 __getattr__ Replace eager imports of all DataModule/Dataset submodules with on-demand loading. Modules with optional dependencies (triplet, livecell, mmap_cache) are no longer imported at `import viscy_data` time. Add __init__.pyi stub for type-checker/IDE support. Also split CI test-data job to run without --all-extras so the base package is validated independently from optional dependencies. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: guard None dereferences and replace mutable default arguments - cell_classification.py: guard _read_norm_meta() returning None - hcs.py: guard MaskTestDataset with ground_truth_masks=None, add missing array_key parameter, replace mutable default [] with None - triplet.py, cell_division_triplet.py: replace mutable default [] with None for normalizations/augmentations parameters Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: correct return type annotations in _utils.py _gather_channels and _transform_channel_wise return Tensor, not list[Tensor]. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(combined): check stage before calling dm.setup() Move the unsupported-stage guard to the top of setup() in ConcatDataModule and CachedConcatDataModule so constituent data modules are not set up for stages that will be rejected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add tests for cell_division_triplet, mmap_cache, and HCS test stage - test_cell_division_triplet.py: 11 smoke tests for dataset and datamodule - test_mmap_cache.py: 5 smoke tests (skipped when tensordict missing) - test_hcs.py: add setup("test") coverage for MaskTestDataset Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Eduardo Hirata-Miyasaki <edhiratam@gmail.com> Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
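Several fixes in this squash replace mutable default arguments (`[]`) with `None` or tuples. The sketch below shows the underlying pitfall with hypothetical functions; it is not the `_subset` or DataModule code itself.

```python
def collect_buggy(item, exclude_timepoints=[]):
    # The default list is created once, at function definition time,
    # and shared by every call that omits the argument.
    exclude_timepoints.append(item)
    return exclude_timepoints


def collect_fixed(item, exclude_timepoints=None):
    # A fresh list is created per call when the argument is omitted.
    if exclude_timepoints is None:
        exclude_timepoints = []
    exclude_timepoints.append(item)
    return exclude_timepoints


first = collect_buggy("t0")
second = collect_buggy("t1")
print(first is second)       # both calls returned the same shared list
print(collect_fixed("t0"), collect_fixed("t1"))
```

Using a tuple default, as some of the model commits do, avoids the same problem because tuples cannot be mutated in place.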
* test(10-01): add state dict key compatibility regression tests
- 24 tests covering all 8 migrated model architectures
- Each model tested for parameter count, top-level prefixes, and sentinel keys
- Guards COMPAT-01: state dict keys must match for checkpoint loading
- Tests import from top-level viscy_models package (validates public API)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
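The regression tests described above check three things per model: parameter count, top-level prefixes, and sentinel keys. The real tests are not shown, so the following is a plain-Python sketch of that idea operating on a list of key names, with "parameter count" read as the number of state dict entries. `check_state_dict_keys` and the example keys are hypothetical; with PyTorch the keys would come from `model.state_dict().keys()`.

```python
def top_level_prefixes(keys):
    """Return the set of first path components, e.g. 'stem' for 'stem.conv.weight'."""
    return {key.split(".", 1)[0] for key in keys}


def check_state_dict_keys(keys, expected_count, expected_prefixes, sentinel_keys):
    """Return a list of human-readable problems; an empty list means compatible."""
    keys = list(keys)
    problems = []
    if len(keys) != expected_count:
        problems.append(f"expected {expected_count} entries, found {len(keys)}")
    prefixes = top_level_prefixes(keys)
    if prefixes != set(expected_prefixes):
        problems.append(f"unexpected top-level prefixes: {sorted(prefixes)}")
    missing = [key for key in sentinel_keys if key not in keys]
    if missing:
        problems.append(f"missing sentinel keys: {missing}")
    return problems


# Hypothetical key names for illustration only.
keys = ["stem.conv.weight", "encoder.0.weight", "projection.0.weight"]
print(check_state_dict_keys(
    keys,
    expected_count=3,
    expected_prefixes={"stem", "encoder", "projection"},
    sentinel_keys=["stem.conv.weight"],
))
```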
* chore(10-01): add viscy-models to CI test matrix
- Add package dimension to test matrix (viscy-transforms, viscy-models)
- Use cross-platform --cov=src/ instead of named package coverage
- Matrix now produces 18 jobs (3 OS x 3 Python x 2 packages)
- check job automatically aggregates all test results via alls-green
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
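The job count quoted above (3 OS x 3 Python x 2 packages = 18) is a Cartesian product of the matrix dimensions. A quick sketch of that arithmetic follows; the two package names come from the commit message, while the specific OS labels and Python versions are assumptions for illustration.

```python
from itertools import product

OPERATING_SYSTEMS = ("ubuntu-latest", "macos-latest", "windows-latest")  # assumed labels
PYTHON_VERSIONS = ("3.11", "3.12", "3.13")                               # assumed versions
PACKAGES = ("viscy-transforms", "viscy-models")                          # from the commit message

# One CI job per combination of the three dimensions.
jobs = list(product(OPERATING_SYSTEMS, PYTHON_VERSIONS, PACKAGES))
print(len(jobs))
print(jobs[0])
```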
* docs(10-01): complete public API & CI integration plan (v1.1 milestone complete)
- Add 10-01-SUMMARY.md with execution results
- Update STATE.md: phase 10 complete, v1.1 milestone done, 100% progress
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(phase-10): complete phase execution and verification (v1.1 milestone complete)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: consolidate ConvBlock2D/3D into _components
Move conv_block_2d.py and conv_block_3d.py from unet/_layers/ to
_components/ alongside all other shared building blocks. All reusable
layers now live in one place. unet/_layers/ retained as backward-
compatible re-export shim.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* remove the _layers
* update the readme
* docs: start milestone v1.1 Extract viscy-data
* update the main readme
* docs: complete viscy-data project research
* docs: define milestone v1.1 requirements
* docs: create milestone v1.1 roadmap (4 phases)
* docs(06-package-scaffolding-and-foundation): create phase plan
* feat(06-01): create viscy-data package directory structure with pyproject.toml
- Add pyproject.toml with hatchling build, uv-dynamic-versioning, all base deps
- Declare optional dependency groups: triplet, livecell, mmap, all
- Add PEP 561 py.typed marker and tests/__init__.py
- Configure pattern-prefix for independent versioning
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(06-01): add type definitions and package init with re-exports
- Copy all type definitions from viscy/data/typing.py into _typing.py
- Add INDEX_COLUMNS from viscy/data/triplet.py for shared access
- Update typing_extensions.NotRequired to typing.NotRequired (Python >=3.11)
- Create __init__.py with full re-export of all public types
- Add README.md required by hatchling build
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(06-01): integrate viscy-data as workspace dependency in root pyproject.toml
- Add viscy-data to root dependencies list
- Register viscy-data as workspace source in [tool.uv.sources]
- Verified editable install and full import chain works
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(06-01): complete package scaffolding plan with summary and state update
- Add 06-01-SUMMARY.md documenting viscy-data package creation
- Update STATE.md with plan position, metrics, and decisions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(06-02): extract shared utility functions into _utils.py
- Extract _ensure_channel_list, _search_int_in_str, _collate_samples, _read_norm_meta from hcs.py
- Extract _scatter_channels, _gather_channels, _transform_channel_wise from triplet.py
- Update imports to use viscy_data._typing instead of viscy.data.typing
- Add __all__ listing all 7 utility functions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(06-02): complete utility module extraction plan
- Add 06-02-SUMMARY.md documenting utility extraction
- Update STATE.md: Phase 6 complete, progress 80%
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(phase-6): complete phase execution
* docs(07-code-migration): create phase plan
* feat(07-01): migrate select.py, distributed.py, segmentation.py to viscy-data
- Copy select.py with well/FOV filtering utilities (no internal viscy imports)
- Copy distributed.py with ShardedDistributedSampler (no internal viscy imports)
- Copy segmentation.py with viscy.data.typing -> viscy_data._typing import update
- Add missing docstrings to satisfy ruff D rules
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(07-01): migrate hcs.py to viscy-data with utility import rewiring
- Copy HCSDataModule, SlidingWindowDataset, MaskTestDataset from main
- Replace viscy.data.typing imports with viscy_data._typing
- Remove 4 utility function definitions (now in _utils.py)
- Add import from viscy_data._utils for shared utilities
- Remove unused re and collate_meta_tensor imports
- Add missing docstrings for ruff D compliance
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(07-01): migrate gpu_aug.py to viscy-data with dependency rewiring
- Copy GPUTransformDataModule, CachedOmeZarrDataset, CachedOmeZarrDataModule
- Rewire viscy.data.distributed -> viscy_data.distributed
- Rewire viscy.data.hcs utility imports -> viscy_data._utils
- Rewire viscy.data.select -> viscy_data.select
- Rewire viscy.data.typing -> viscy_data._typing
- Add missing docstrings for ruff D compliance
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(07-01): complete core data module migration plan
- Add 07-01-SUMMARY.md documenting migration of 5 core modules
- Update STATE.md: phase 7 plan 1 of 4, decisions, metrics
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(07-03): migrate mmap_cache.py and ctmc_v1.py to viscy-data
- Rewire all imports from viscy.data to viscy_data prefix
- Add lazy import for tensordict with clear error message
- Add docstrings for ruff D compliance
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(07-03): migrate livecell.py with lazy optional dependency imports
- Rewire imports from viscy.data to viscy_data prefix
- Add lazy imports for pycocotools, tifffile, torchvision
- Add import guards in LiveCellDataset and LiveCellTestDataset __init__
- Add docstrings for ruff D compliance
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
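The lazy optional-import pattern used in the `07-03` commits above could look roughly like the following. This is a sketch under assumptions: `require_optional` and `LazyDataset` are hypothetical names, the error wording is invented, and the stdlib `json` module stands in for an optional dependency such as `pycocotools`.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency or raise an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Hypothetical message; the real wording in viscy-data may differ.
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with: pip install 'viscy-data[{extra}]'"
        ) from exc


class LazyDataset:
    """Hypothetical dataset that defers the optional import to __init__."""

    def __init__(self, module_name: str = "json", extra: str = "livecell"):
        # The guard runs at construction time, not at package import time,
        # so importing the package never fails on a missing extra.
        self._backend = require_optional(module_name, extra)
```

Deferring the check to `__init__` is what lets the package `__init__.py` import every module eagerly while still working without the extras installed.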
* feat(07-03): migrate combined.py as-is with import rewiring
- Rewire viscy.data.distributed to viscy_data.distributed
- Rewire viscy.data.hcs._collate_samples to viscy_data._utils._collate_samples
- Preserve all 6 public classes without structural changes
- Add docstrings for ruff D compliance
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(07-02): migrate cell_classification.py and cell_division_triplet.py
- Rewire imports from viscy.data to viscy_data prefix
- Add lazy import for pandas in cell_classification.py with clear error message
- Import _transform_channel_wise from viscy_data._utils (not triplet.py)
- Import INDEX_COLUMNS and AnnotationColumns from viscy_data._typing
- Add docstrings for ruff D compliance
* docs(07-03): complete optional dependency module migration plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(07-02): complete specialized module migration plan
- Add 07-02-SUMMARY.md documenting triplet, classification, and cell division module migration
- Update STATE.md with position, decisions, and metrics
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(07-04): add complete public API exports to viscy_data __init__.py
- Export all 45 public names (17 types, 2 utilities, 26 DataModules/Datasets/enums)
- Eager imports from all 13 modules (lazy guards handled internally by each module)
- Comprehensive __all__ list for IDE autocompletion and star-import support
- Ruff-sorted import ordering passes all lint checks
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(07-04): complete public API exports plan - phase 7 fully done
- 07-04-SUMMARY.md documenting 45 public exports and full package verification
- STATE.md updated: phase 7 complete (4/4 plans), 12 total plans done
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(phase-7): complete code migration execution
* docs(08-test-migration-and-validation): create phase plan
* test(08-01): add conftest.py with HCS OME-Zarr fixtures for viscy-data
- Copy all 6 fixtures and _build_hcs helper from main branch conftest
- Replace legacy np.random.rand with np.random.default_rng (NPY002)
- No viscy import changes needed (only uses third-party libs)
- Provides preprocessed_hcs_dataset, small_hcs_dataset, tracks fixtures
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(08-02): complete smoke tests plan - phase 8 test migration done
- Created 08-02-SUMMARY.md documenting 52 smoke tests for viscy_data
- Updated STATE.md: phase 8 complete, 14 total plans executed
- DATA-TST-02 satisfied: import, __all__, optional dep messages, no legacy namespace
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(08-01): migrate test_hcs, test_triplet, test_select to viscy-data package
- Update imports from viscy.data.X to viscy_data
- Add BatchedCenterSpatialCropd to _utils.py (fixes batch dim handling)
- Fix triplet.py to use BatchedCenterSpatialCropd instead of CenterSpatialCropd
- Add tensorstore to test dependency group for triplet tests
- All 19 tests pass across 3 test files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(08-01): complete data test migration plan summary
- Create 08-01-SUMMARY.md documenting test migration and bug fixes
- Update STATE.md with BatchedCenterSpatialCropd decision revision
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(phase-8): complete test migration and validation
* docs(09-ci-integration): create phase plan
* feat(09-01): add viscy-data CI test jobs to GitHub Actions workflow
- Add test-data job with 3x3 matrix (3 OS x 3 Python) for viscy-data
- Add test-data-extras job (ubuntu-latest, Python 3.13) for extras validation
- Update check job needs to aggregate all test jobs: test, test-data, test-data-extras
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(09-01): complete CI integration plan
- Add 09-01-SUMMARY.md documenting viscy-data CI jobs
- Update STATE.md: phase 9 complete, v1.0 milestone complete
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(phase-9): complete CI integration - milestone v1.1 done
* chore: complete v1.1 milestone — Extract viscy-data
Delivered: viscy-data package with 15 modules, 45 public exports,
optional dependency groups, 71 tests, and tiered CI.
Archives:
- milestones/v1.1-ROADMAP.md
- milestones/v1.1-REQUIREMENTS.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* harmonize the planning between the modular-data and modular-models
* viscy-utils package
* add applications/dynaclr
* update the monorepo uv
* moving files around
* update planning
* docs: start milestone v2.1 DynaCLR Integration Validation
* docs: define milestone v2.1 requirements
* docs: create milestone v2.1 roadmap (2 phases)
* docs(18-training-validation): create phase plan
* feat(18-01): add training integration tests for ContrastiveModule
- Add fast_dev_run tests for TripletMarginLoss and NTXentLoss code paths
- Add parametrized config class_path resolution tests for fit.yml and predict.yml
- Add tensorboard as test dependency for TensorBoardLogger in integration tests
- Fix workspace exclude to skip non-package application directories
- Use 2D-compatible synthetic data shapes (1,1,4,4) for render_images compatibility
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(18-01): complete training integration tests plan
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(phase-18): complete phase execution
* docs(19-inference-reproducibility): create phase plan
* chore(19-01): add anndata test dependency and HPC conftest fixtures
- Add anndata to dynacrl test dependency group
- Create conftest.py with HPC path constants, skip markers, and fixtures
- Update uv.lock
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(19-01): add inference reproducibility integration tests
- Create test_inference_reproducibility.py with 2 HPC integration tests
- test_checkpoint_loads_into_modular_contrastive_module (INFER-01)
- test_predict_embeddings_and_exact_match (INFER-02 + INFER-03)
- Fix lazy imports in EmbeddingWriter to avoid unconditional umap import
- Fix anndata nullable string compatibility in write_embedding_dataset
- Tests skip gracefully when HPC paths or GPU unavailable
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(19-01): complete inference reproducibility plan
- Add 19-01-SUMMARY.md with execution results and deviation documentation
- Update STATE.md: Phase 19 complete, v2.1 milestone finished
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add seed_everything(42) to all integration tests
Ensures reproducibility by seeding all tests consistently.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(phase-19): complete phase execution
* restructure the examples folder and ruff
* update readme.md hallucination
* update the readmes
* - Add `viscy` console script in viscy-utils pointing to viscy_utils.cli:main
- Add jsonargparse[signatures] dependency for LightningCLI
- Add 4 CLI smoke tests (help, subcommands, fit --help, predict --help)
- Replace conda/anaconda with uv in SLURM scripts
- Update SLURM scripts to use `viscy fit/predict` instead of old monolith
* add the CLI for running training and prediction
* default embedding writer to None
* import within the function
* ruff
* dynaclr typo
* rename folder to dynaclr
* add the classifiers here
* docs: start milestone v2.2 Composable Sampling Framework
* docs: define milestone v2.2 requirements
* docs: create milestone v2.2 roadmap (6 phases)
* docs(20): capture phase context
* docs(20): create phase plan for experiment configuration
* test(20-01): add failing tests for ExperimentConfig and ExperimentRegistry
- 19 test cases covering config creation, defaults, channel maps,
validation errors, YAML loading, tau-range conversion, and lookups
- All tests fail with ModuleNotFoundError (module not yet implemented)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(20-01): implement ExperimentConfig and ExperimentRegistry
- ExperimentConfig dataclass with all fields and defaults
- ExperimentRegistry with fail-fast validation at __post_init__:
empty check, duplicate names, source_channel membership,
channel count consistency, interval_minutes positivity,
condition_wells non-empty, data_path existence, zarr channel match
- channel_maps: per-experiment source position -> zarr index mapping
- from_yaml classmethod for YAML config loading
- tau_range_frames for hours-to-frames conversion with warning
- get_experiment lookup by name with KeyError
- All 19 tests pass
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
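As a rough illustration of the fail-fast validation and hours-to-frames conversion described in the `feat(20-01)` commit above, consider the sketch below. The `*Sketch` class names, field types, error messages, and the condition that triggers the warning are all assumptions; the path and zarr checks are omitted because they need real data.

```python
import warnings
from dataclasses import dataclass, field


@dataclass
class ExperimentConfigSketch:
    name: str
    source_channel: list
    interval_minutes: float
    condition_wells: dict = field(default_factory=dict)


@dataclass
class ExperimentRegistrySketch:
    experiments: list

    def __post_init__(self):
        # Fail fast: reject an invalid registry before any data is read.
        if not self.experiments:
            raise ValueError("At least one experiment is required.")
        names = [e.name for e in self.experiments]
        if len(names) != len(set(names)):
            raise ValueError("Experiment names must be unique.")
        if len({len(e.source_channel) for e in self.experiments}) != 1:
            raise ValueError("All experiments need the same channel count.")
        for e in self.experiments:
            if e.interval_minutes <= 0:
                raise ValueError(f"{e.name}: interval_minutes must be positive.")
            if not e.condition_wells:
                raise ValueError(f"{e.name}: condition_wells must not be empty.")

    def get_experiment(self, name):
        for e in self.experiments:
            if e.name == name:
                return e
        raise KeyError(name)

    def tau_range_frames(self, name, tau_range_hours):
        # Convert hours to frames using the experiment's own frame interval.
        interval = self.get_experiment(name).interval_minutes
        frames = []
        for hours in tau_range_hours:
            exact = hours * 60 / interval
            nearest = round(exact)
            if nearest != exact:
                # Assumed warning condition: the bound is not a whole frame.
                warnings.warn(f"{name}: {hours} h is not a whole number of frames.")
            frames.append(nearest)
        return tuple(frames)
```

With a 30-minute and a 15-minute experiment, the same range in hours maps to different frame offsets, which is why the conversion is per experiment.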
* refactor(20-01): clean up imports and exclude stale dynacrl workspace member
- Fix ruff I001 (import sorting) and F401 (unused import) in test file
- Exclude applications/dynacrl (typo) from uv workspace to unblock builds
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(20-01): complete ExperimentConfig/ExperimentRegistry plan
- SUMMARY.md with TDD execution results, self-check passed
- STATE.md updated with position, decisions, session continuity
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(20-02): add explicit deps and top-level experiment API exports
- Add iohub>=0.3a2 and pyyaml as explicit dependencies in dynaclr pyproject.toml
- Re-export ExperimentConfig and ExperimentRegistry from dynaclr __init__.py
- Both classes now importable via `from dynaclr import ExperimentConfig, ExperimentRegistry`
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(20-02): add example multi-experiment YAML configuration
- Demonstrate positional channel alignment across 2 experiments
- SEC61 (30min interval, ER) and TOMM20 (15min interval, mito)
- Show condition_wells with infected/uninfected/mock conditions
- Include comments explaining channel alignment and tau_range conversion
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(20-02): complete package wiring and example config plan
- SUMMARY.md with execution results and self-check
- STATE.md updated: Phase 20 complete, 20/25 phases (80%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(phase-20): complete phase execution
Phase 20 Experiment Configuration verified (11/11 must-haves).
ExperimentConfig + ExperimentRegistry with TDD, package wiring, example YAML.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(21): create phase plan for Cell Index & Lineage
* test(21-01): add failing tests for MultiExperimentIndex
- 17 test cases covering CELL-01 (unified tracks), CELL-02 (lineage), CELL-03 (border clamping)
- All fail with ModuleNotFoundError (dynaclr.index not yet implemented)
- Test fixtures create mini OME-Zarr stores with tracking CSVs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(21-01): implement MultiExperimentIndex with lineage and border clamping
- Unified tracks DataFrame from all experiments with enriched columns
- Lineage reconstruction linking daughters to root ancestor via parent_track_id
- Border clamping: retains border cells with shifted patch origins instead of exclusion
- All 23 tests pass
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
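The two ideas in the `feat(21-01)` commit above, lineage reconstruction through `parent_track_id` and border clamping, can be sketched in plain Python as follows. Both helper names and the one-dimensional treatment of the patch are assumptions made for illustration; the actual implementation works on a tracks DataFrame.

```python
from typing import Dict, Optional


def root_ancestor(track_id: int, parent_of: Dict[int, Optional[int]]) -> int:
    """Follow parent links until a track with no parent is reached."""
    seen = set()
    current = track_id
    while parent_of.get(current) is not None:
        if current in seen:
            raise ValueError(f"Cycle in lineage at track {current}")
        seen.add(current)
        current = parent_of[current]
    return current


def clamp_patch_origin(center: int, patch_size: int, image_size: int) -> int:
    """Return a patch start index shifted so the patch stays in bounds."""
    if patch_size > image_size:
        raise ValueError("Patch is larger than the image.")
    origin = center - patch_size // 2
    # Shift the origin instead of discarding cells near the border.
    return max(0, min(origin, image_size - patch_size))
```

A cell centered at pixel 2 with a 32-pixel patch gets origin 0 rather than being excluded, so the cell is kept but is no longer centered in its patch.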
* refactor(21-01): fix lint issues and export MultiExperimentIndex
- Remove unused variable (F841) in test_global_track_id_unique_across_experiments
- Use .to_numpy() instead of .values (PD011) in test_exclude_fovs_filter
- Export MultiExperimentIndex from dynaclr __init__.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(21-01): complete MultiExperimentIndex plan summary and state update
- 21-01-SUMMARY.md with full execution documentation
- STATE.md updated for 21-01 completion, decisions, session continuity
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* test(21-02): add failing tests for valid anchors, properties, and summary
- 8 tests for valid_anchors: basic validity, subset check, end-of-track exclusion,
lineage continuity, different tau ranges, empty tracks, gap handling, self-exclusion
- 9 tests for properties/summary: experiment_groups, condition_groups, summary()
- All 17 new tests fail with TypeError (tau_range_hours not yet accepted)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(21-02): implement valid_anchors, experiment_groups, condition_groups, summary
- Add tau_range_hours parameter to MultiExperimentIndex.__init__
- _compute_valid_anchors: per-experiment tau conversion, lineage-based lookup
- experiment_groups/condition_groups properties returning index arrays
- summary() with experiment counts, observation counts, per-experiment breakdowns
- All 40 tests pass (23 existing + 17 new)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(21-02): complete valid anchors plan
- SUMMARY.md with self-check passed
- STATE.md updated: Phase 21 complete, ready for Phase 22
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(22): research batch sampling phase domain
* docs(22): create phase plan for batch sampling
* test(22-01): add failing tests for FlexibleBatchSampler
- Experiment-aware batching: single-experiment restriction, all experiments appear
- Condition balancing: 2-condition and 3-condition proportional tests
- Leaky mixing: zero leak, 20% leak injection, no-effect when not experiment-aware
- Small group fallback: no crash, warning emission
- Determinism: same seed/epoch reproduces, set_epoch changes sequence
- Sampler protocol: yields list[int], correct __len__
- DDP partitioning: disjoint interleaved batches across ranks
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(22-01): implement FlexibleBatchSampler with experiment-aware, condition-balanced, leaky mixing
- FlexibleBatchSampler(Sampler[list[int]]) with cascade batch construction
- experiment_aware=True restricts each batch to a single experiment
- condition_balanced=True balances condition representation per batch
- leaky > 0.0 injects cross-experiment samples into restricted batches
- Deterministic via np.random.default_rng(seed + epoch)
- DDP support via interleaved batch partitioning across ranks
- Small group fallback to replacement sampling with logged warning
- Pre-computed group indices at __init__ for O(1) lookup
- Fix lint issues in test file (import sorting, .values -> .to_numpy(), nunique -> len(unique))
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
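A simplified sketch of three properties listed in the `feat(22-01)` commit above: single-experiment batches, determinism from `seed + epoch`, and interleaved DDP partitioning. It is not the FlexibleBatchSampler implementation: `epoch_batches` is a hypothetical function, the stdlib `random.Random` stands in for `np.random.default_rng`, and condition balancing, leaky mixing, and the small-group fallback are left out.

```python
import random


def epoch_batches(groups, batch_size, seed, epoch, rank=0, world_size=1):
    """Return single-group batches for one rank, reproducibly per epoch."""
    rng = random.Random(seed + epoch)  # same seed and epoch give the same order
    batches = []
    for name in sorted(groups):
        indices = list(groups[name])
        rng.shuffle(indices)
        # Each batch draws from one group only (experiment-aware restriction).
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            batches.append(indices[start:start + batch_size])
    rng.shuffle(batches)
    # Interleaved partitioning: rank r takes batches r, r + world_size, ...
    return batches[rank::world_size]
```

Because every rank builds the same global batch list from the same seed, slicing by rank yields disjoint batches without any communication between processes.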
* refactor(22-01): export FlexibleBatchSampler from viscy_data package
- Add FlexibleBatchSampler to viscy_data.__init__.py public API
- Place import in alphabetically correct position for ruff isort compliance
- Add to __all__ exports under Utilities section
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(22-01): complete FlexibleBatchSampler core plan
- Create 22-01-SUMMARY.md with TDD execution results
- Update STATE.md: plan 01/02 complete, decisions, session continuity
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* test(22-02): add failing tests for temporal enrichment, DDP coverage, validation
- 6 temporal enrichment tests (focal concentration, global_fraction edge cases, validation)
- 5 DDP disjoint coverage tests (interleaving, coverage, epoch reproducibility)
- 3 validation guard tests (missing experiment/condition/hpi columns)
- 2 package import tests (import, __all__)
- All 9 new feature tests fail as expected (RED)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(22-02): implement temporal enrichment, validation guards, DDP coverage
- Add temporal_enrichment, temporal_window_hours, temporal_global_fraction params
- Implement _enrich_temporal: focal/global sampling from experiment pool
- Add column validation guards for experiment/condition/hpi columns
- Conditional precomputation: only groupby columns when feature enabled
- Fix stale smoke test __all__ count (45 -> 46) from Plan 01
- All 35 sampler tests pass, 107 total viscy-data tests pass
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(22-02): complete temporal enrichment + DDP plan
- 22-02-SUMMARY.md with all metrics, decisions, deviations
- STATE.md advanced to Phase 23, progress 22/25 (88%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(phase-22): complete batch sampling phase execution
Phase 22 verified: FlexibleBatchSampler with all 5 SAMP requirements.
5/5 must-haves passed. 35 tests, 107 full suite pass. No regressions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(23): create phase plan for Loss & Augmentation
* test(23-01): add failing tests for NTXentHCL
- 12 test cases covering subclass, beta=0 equivalence, hard negatives,
gradients, temperature effect, edge cases, defaults, and CUDA
- All fail with ModuleNotFoundError (dynaclr.loss not yet created)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* test(23-02): add failing tests for ChannelDropout and variable tau sampling
- 11 tests for ChannelDropout: zeros, probability bounds, eval mode, per-sample, dtype, input safety, multi-channel, CUDA
- 7 tests for sample_tau: range, exponential decay, uniform, single value, determinism, return type
* feat(23-02): implement ChannelDropout and variable tau sampling
- ChannelDropout nn.Module: per-sample channel zeroing on (B,C,Z,Y,X) tensors
- sample_tau: exponential decay weighted sampling for temporal offsets
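To make the exponential-decay weighting mentioned for `sample_tau` concrete, one possible form is sketched below. The signature, the inclusive range, and the `decay_rate` parameter are assumptions, not the function's documented API.

```python
import math
import random


def sample_tau_sketch(tau_min, tau_max, decay_rate=1.0, rng=None):
    """Draw an integer frame offset, favoring offsets close to tau_min."""
    if rng is None:
        rng = random.Random()
    taus = list(range(tau_min, tau_max + 1))  # assumed inclusive bounds
    if decay_rate == 0:
        return rng.choice(taus)  # zero decay reduces to uniform sampling
    # Weight each offset by exp(-decay_rate * distance from tau_min).
    weights = [math.exp(-decay_rate * (t - tau_min)) for t in taus]
    return rng.choices(taus, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` makes the draw reproducible, which mirrors the determinism case in the test list above.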
* feat(23-01): implement NTXentHCL with hard-negative concentration
- NTXentHCL subclasses NTXentLoss from pytorch_metric_learning
- beta=0.0 delegates to parent for exact numerical equivalence
- beta>0 applies exp(beta*sim) reweighting on negatives in denominator
- Normalized weights preserve loss magnitude across beta values
- All 11 tests pass (1 CUDA test skipped on macOS)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
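The reweighting described in the `feat(23-01)` commit above can be illustrated for a single anchor with scalar similarities. This is a sketch of one plausible reading, not the NTXentHCL code: it assumes the weights `exp(beta * sim)` are rescaled to mean 1, which is one way to keep the loss magnitude comparable across beta values and to recover plain NT-Xent at `beta = 0`.

```python
import math


def ntxent_hcl_sketch(pos_sim, neg_sims, temperature=0.1, beta=0.0):
    """Single-anchor NT-Xent loss with hard-negative reweighting."""
    # Weights proportional to exp(beta * sim), rescaled to have mean 1 so that
    # beta = 0 gives uniform weights and the plain NT-Xent denominator.
    raw = [math.exp(beta * s) for s in neg_sims]
    mean_raw = sum(raw) / len(raw)
    weights = [r / mean_raw for r in raw]
    pos = math.exp(pos_sim / temperature)
    neg = sum(w * math.exp(s / temperature) for w, s in zip(weights, neg_sims))
    return -math.log(pos / (pos + neg))
```

With `beta > 0`, negatives that are more similar to the anchor receive larger weights, so they dominate the denominator and the loss increases relative to the unweighted case.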
* refactor(23-02): add ChannelDropout and sample_tau to package exports
- Export ChannelDropout from viscy_data top-level
- Export sample_tau from dynaclr top-level
- Include NTXentHCL export added by linter
* docs(23-02): complete ChannelDropout and tau sampling plan
- Summary with TDD metrics, decisions, self-check
- STATE.md updated for Phase 23 completion
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(23-01): complete NTXentHCL loss plan
- Created 23-01-SUMMARY.md with TDD execution results
- Updated STATE.md with HCL implementation decisions
- Self-check passed: all artifacts and commits verified
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(phase-23): complete loss & augmentation phase execution
Phase 23 verified: NTXentHCL (3/3 LOSS reqs), ChannelDropout (AUG-01),
sample_tau (AUG-03). AUG-02 wiring deferred to Phase 24 by design.
30 tests pass across loss, channel_dropout, tau_sampling.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(24): create phase plan
* test(24-01): add failing tests for MultiExperimentTripletDataset
- 7 test cases covering __getitems__ return format, norm_meta, lineage-aware
positive sampling, division event traversal, channel remapping, predict mode,
and dataset length
- All tests fail with ModuleNotFoundError (RED phase)
* feat(24-01): implement MultiExperimentTripletDataset with lineage-aware sampling
- __getitems__ returns batch dicts with anchor/positive Tensors (B,C,Z,Y,X)
- Lineage-aware positive sampling via pre-built (experiment, lineage_id) lookup
- Division events traversed naturally via shared lineage_id
- Per-experiment channel remapping using registry.channel_maps
- Tensorstore I/O with SLURM-aware context and per-FOV caching
- Predict mode returns anchor + TrackingIndex dicts
- Exponential decay tau sampling with fallback to full range scan
* refactor(24-01): add MultiExperimentTripletDataset to package exports
- Export from dynaclr.__init__ for public API access
* docs(24-01): complete MultiExperimentTripletDataset plan
- SUMMARY.md with TDD commits, decisions, self-check
- STATE.md updated: position 24-01, decisions, session continuity
* update uv
* test(24-02): add failing tests for MultiExperimentDataModule
- 6 test cases covering hyperparameter exposure, experiment-level split,
FlexibleBatchSampler wiring, val dataloader, transforms, ChannelDropout
- RED phase: all tests fail with ModuleNotFoundError
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(24-02): implement MultiExperimentDataModule with experiment-level split
- MultiExperimentDataModule composes FlexibleBatchSampler + Dataset +
ChannelDropout + ThreadDataLoader with collate_fn=lambda x: x
- Train/val split by whole experiments via val_experiments parameter
- All sampling, augmentation, and loss hyperparameters exposed as __init__ params
- on_after_batch_transfer applies normalizations + augmentations + final crop
+ ChannelDropout with proper norm_meta handling for all-None case
- 6 TDD tests passing
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* refactor(24-02): add MultiExperimentDataModule to dynaclr package exports
- Import MultiExperimentDataModule from dynaclr.datamodule
- Add to __all__ for top-level importability
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(24-02): complete MultiExperimentDataModule plan
- Summary with TDD commits, decisions, and deviation documentation
- STATE.md updated: Phase 24 complete, 96% progress, ready for Phase 25
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(phase-24): complete dataset & datamodule phase execution
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(25): create phase plan
* docs(phase-25): complete integration phase plan
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(25-01): add end-to-end multi-experiment integration tests
- Create test_multi_experiment_fast_dev_run: 2 experiments with different
channel sets (GFP vs RFP), fast_dev_run with NTXentHCL loss
- Create test_multi_experiment_fast_dev_run_with_all_sampling_axes:
experiment_aware + condition_balanced + temporal_enrichment enabled
- Synthetic data helpers for multi-channel HCS OME-Zarr creation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(25-01): add multi-experiment YAML config and class_path validation test
- Create multi_experiment_fit.yml with MultiExperimentDataModule,
NTXentHCL loss, all sampling axes, generic channel names (ch_0/ch_1)
- Add test_multi_experiment_config_class_paths_resolve validating all
class_path entries in the config resolve to importable Python classes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
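A `class_path` resolution check like the one described in the `feat(25-01)` commit above might be structured as in this sketch. The helper names are hypothetical, and the example paths used to exercise it would be stdlib classes rather than the entries of `multi_experiment_fit.yml`.

```python
import importlib


def resolve_class_path(class_path: str):
    """Import the object named by a dotted `module.ClassName` string."""
    module_name, _, attr = class_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Not a dotted path: {class_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def iter_class_paths(config):
    """Yield every `class_path` value found in a nested config mapping."""
    if isinstance(config, dict):
        for key, value in config.items():
            if key == "class_path" and isinstance(value, str):
                yield value
            else:
                yield from iter_class_paths(value)
    elif isinstance(config, list):
        for item in config:
            yield from iter_class_paths(item)
```

A test can then load the YAML into a dict, iterate `iter_class_paths`, and call `resolve_class_path` on each entry. A renamed or moved class then surfaces as an `ImportError` or `AttributeError` in the test run rather than at training time.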
* docs(25-01): complete integration plan - milestone v2.2 complete
- Add 25-01-SUMMARY.md documenting end-to-end integration validation
- Update STATE.md: phase 25/25 complete, progress 100%, milestone v2.2 done
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs(phase-25): complete integration phase execution — v2.2 milestone shipped
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* add the smoothness and dynamic range comparison
* add the applications/qc
* bug qc metrics exposing the device
* add batch predict
* adding cli for reduce dimensionality composable
* add example configs for model comparison and smoothness
* add the biological annotations to the zattrs
* adding airtable logic
* harmonize and remove duplication between airtable and qc. moving most things to airtable
* cleanup readme for airtable
* add callback to store embeddings every n epochs and store metadata to the anndata.uns
* fix the apply-linear classifiers to make sure we use the model and version.
* Exclude untracked applications/dynacell from uv workspace
Local debris directory (hydra outputs, pycache) has no pyproject.toml
and breaks uv lock when matched by applications/* glob.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(26): capture phase context
* docs(state): record phase 26 context session
* docs(26): create phase plans
* feat(26-01): extract HCSPredictionWriter to viscy-utils callbacks
- Create prediction_writer.py with HCSPredictionWriter, _pad_shape, _resize_image, _blend_in
- Use TYPE_CHECKING guard for viscy_data imports (HCSDataModule, Sample)
- Add numpy-style docstrings to all functions and class
- Re-export HCSPredictionWriter from callbacks __init__.py
- Fix pre-existing INDEX_COLUMNS -> ULTRACK_INDEX_COLUMNS in embedding_snapshot.py and embedding_writer.py
- Add missing docstrings to embedding_snapshot.py public methods
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(26-01): extract MixedLoss to viscy-utils losses submodule
- Create losses/ submodule with mixed_loss.py containing MixedLoss class
- Uses ms_ssim_25d from viscy_utils.evaluation.metrics internally
- Convert docstrings to numpy-style
- Re-export MixedLoss from losses __init__.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(26-01): create translation application scaffold with workspace registration
- Create applications/translation/ with src layout following dynaclr pattern
- Add pyproject.toml with hatchling build, uv-dynamic-versioning, and workspace deps
- Add README.md required by hatchling readme field
- Add __main__.py delegating to viscy_utils.cli.main for LightningCLI entry point
- Add example YAML configs (fit.yml, predict.yml) with HCSPredictionWriter callback
- Create empty tests/__init__.py
- Register viscy-translation in root pyproject.toml workspace sources
- Update uv.lock with new workspace member
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(26-01): complete shared infra extraction + app scaffold plan
- Create 26-01-SUMMARY.md with execution results
- Update STATE.md with plan progress, decisions, session info
- Update ROADMAP.md marking 26-01 as complete
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(26-02): migrate translation engine and evaluation modules
- Copy engine.py with VSUNet, FcmaeUNet, AugmentedPredictionVSUNet, MaskedMSELoss
- Copy evaluation.py with SegmentationMetrics2D
- Update all imports to new package paths (viscy_data, viscy_models, viscy_utils)
- Remove MixedLoss class from engine.py (now imported from viscy_utils.losses)
- Update __init__.py with top-level re-exports
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test(26-02): add translation engine test suite
- Import tests for all public exports (VSUNet, FcmaeUNet, etc.)
- VSUNet init and forward pass smoke tests with synthetic data
- State dict key regression test for checkpoint compatibility
- MixedLoss integration test (from viscy_utils.losses)
- FcmaeUNet init test
- Grep-based test that no old import paths remain
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(26-02): complete engine migration plan
- SUMMARY.md with 2 task commits, 2 auto-fixed deviations
- STATE.md updated: phase 26 complete, 20/25 phases (80%)
- ROADMAP.md updated with plan progress
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test(26): complete UAT - 8 passed, 0 issues
* add the pseudotime evals
* re-structure pseudotime folder
* add the linear classifier evals and restructure folder path
* add evaluations to dynaclr package
* cli and linear classifier init
* fix(translation): address engine and evaluation bugs
- Fix operator precedence bug in evaluation.py boolean condition
- Fix source variable overwritten in AugmentedPredictionVSUNet loop
- Fix unbound return_target in FcmaeUNet.forward_fit_task
- Add weights_only=True to torch.load for security
- Fix mutable default argument model_config: dict = {}
- Simplify redundant Union[nn.Module, MixedLoss] to nn.Module
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
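The mutable default argument fix above can be illustrated with a minimal sketch. The function names and the `calls` key are hypothetical and are not taken from engine.py; only the `model_config: dict = {}` pattern comes from the commit message.

```python
from typing import Optional


def build_model_buggy(model_config: dict = {}) -> dict:
    # The default dict is created once, when the function is defined,
    # so every call that omits the argument mutates the same object.
    model_config.setdefault("calls", 0)
    model_config["calls"] += 1
    return model_config


def build_model_fixed(model_config: Optional[dict] = None) -> dict:
    # Use None as the sentinel and build a fresh dict on every call.
    model_config = {} if model_config is None else dict(model_config)
    model_config.setdefault("calls", 0)
    model_config["calls"] += 1
    return model_config


buggy_first = build_model_buggy()["calls"]
buggy_second = build_model_buggy()["calls"]  # state leaked from the first call
fixed_first = build_model_fixed()["calls"]
fixed_second = build_model_fixed()["calls"]
```

In the buggy variant the counter keeps growing across calls, while the fixed variant starts from a clean dict each time.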
* fix(translation): correct example config keys and dependencies
- Fix monitor key loss/val -> loss/validate in fit.yml
- Fix output_path -> output_store param name in predict.yml
- Add lightning>=2.3 as direct dependency
- Remove unused torchvision, pin torchmetrics>=1
- Remove unused SYNTH_OUT_C from conftest
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(viscy-utils): fix prediction_writer indexing and embedding_writer overwrite flag
- Fix numpy indexing in prediction_writer _blend_in broadcast shape
- Fix tuple[int] -> tuple[int, ...] type annotation in _create_image
- Respect overwrite flag in EmbeddingWriter.on_predict_start
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(viscy-utils): correct annotation.py docstring examples
- Fix function name in example: convert_xarray_annotation_to_anndata -> convert
- Fix obsm keys in example: X_PCA/X_UMAP/X_PHATE -> X_pca/X_umap/X_phate
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(viscy-models): prevent Sequence mutation and ZeroDivisionError
- Avoid mutating Sequence input in UNeXt2Decoder.forward
- Add input validation in UNeXt2Stem to prevent ZeroDivisionError
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address PR review blockers (airtable security, tests, qc cleanup)
- Remove api_key/base_id params from AirtableDatasets.__init__; read
credentials exclusively from env vars with clear ValueError on missing
- Add 59 tests for airtable_utils (database + schemas) with full mocking
- Wrap open_ome_zarr in context manager in qc/annotation.py to prevent
file handle leaks on exceptions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
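A minimal sketch of reading credentials only from the environment and failing with a clear ValueError. The helper name and the variable names `AIRTABLE_API_KEY` and `AIRTABLE_BASE_ID` are assumptions for illustration; the commit message does not name them.

```python
import os
from typing import Mapping


def read_airtable_credentials(environ: Mapping[str, str] = os.environ) -> tuple[str, str]:
    # AIRTABLE_API_KEY and AIRTABLE_BASE_ID are assumed names for this sketch.
    names = ("AIRTABLE_API_KEY", "AIRTABLE_BASE_ID")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        # Name every missing variable so the failure is actionable.
        raise ValueError(
            f"Missing required environment variable(s): {', '.join(missing)}"
        )
    return environ[names[0]], environ[names[1]]
```

Keeping the credentials out of the constructor signature means they cannot end up in saved configs or logged call arguments.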
* fix tests
* add track timing backup, save the AUROC as a metric for the linear classifier, and fix the overwriting by the linear classifiers.
* fix(viscy-utils): validate every_n_epochs >= 1 in EmbeddingSnapshotCallback
Prevent ZeroDivisionError in _should_collect when every_n_epochs=0.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
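The validation described above can be sketched as follows. `SnapshotSchedule` is a hypothetical stand-in, not the actual EmbeddingSnapshotCallback, and it assumes the gating is a modulo check on the epoch index.

```python
class SnapshotSchedule:
    """Hypothetical stand-in for the callback's epoch gating."""

    def __init__(self, every_n_epochs: int = 1) -> None:
        # Reject 0 and negative values up front instead of failing later
        # with ZeroDivisionError inside the modulo check.
        if every_n_epochs < 1:
            raise ValueError(f"every_n_epochs must be >= 1, got {every_n_epochs}")
        self.every_n_epochs = every_n_epochs

    def should_collect(self, epoch: int) -> bool:
        # Safe: the constructor guarantees a non-zero divisor.
        return epoch % self.every_n_epochs == 0
```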
* consolidate appending column to anndata functionality from reduce dimension and linear classifier
* add DINOv3 to viscy-data
* refactor dynaclr app folder structure
* move losses to viscy-models
* data folder
* porting #360
* rename dino to foundation and support openphenom
* move shell scripts to the configs folder
* fix the import for openphenom
* generalize the qc class
* fix(viscy-data): skip None norm_meta in SlidingWindowDataset collation
Pre-existing bug: when a zarr has no normalization metadata,
sample["norm_meta"] = None was added to the batch dict, causing
default_collate to crash on NoneType. Only add norm_meta when
it is not None.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
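A minimal sketch of the guard, written without PyTorch so that it runs anywhere. `build_sample` is a hypothetical helper; the point is only that the `norm_meta` key is omitted entirely instead of being set to None, so a collate function never sees a NoneType value.

```python
from typing import Any, Optional


def build_sample(image: list, norm_meta: Optional[dict]) -> dict[str, Any]:
    sample: dict[str, Any] = {"image": image}
    # Only add the key when metadata exists; a None value in the batch
    # dict is what the collate step could not handle.
    if norm_meta is not None:
        sample["norm_meta"] = norm_meta
    return sample


with_meta = build_sample([1.0, 2.0], {"mean": 1.5, "std": 0.5})
without_meta = build_sample([1.0, 2.0], None)
```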
* test(translation): add inference reproducibility test for vscyto3d
Validates refactored FcmaeUNet matches old monolithic code predictions.
Reference generated from main branch on a 512x512 crop of the mehta-lab
VSCyto3D test dataset with fov_statistics normalization.
- test_checkpoint_loads: 0 missing/unexpected state dict keys
- test_predict_and_match_reference: full pipeline (HCSDataModule +
HCSPredictionWriter + VisCyTrainer), Pearson r > 0.999, atol=0.02
HPC-gated: skips when checkpoint/data/reference paths unavailable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(translation): use self.log_dict instead of self.logger.log_metrics
self.logger.log_metrics() crashes when no logger is attached and
bypasses Lightning's built-in aggregation/sync. Use self.log_dict()
which handles missing loggers and DDP sync correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(dynaclr): avoid CPU pos_weight device mismatch in ClassificationModule
pos_weight=torch.tensor(1.0) as a default argument is created on CPU
at class definition time. When the module is moved to GPU,
BCEWithLogitsLoss errors due to device mismatch. Move the default
into __init__ body so the tensor is created at instantiation time.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(dynaclr): fix typo in create_pseudo_tracks docstring
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* add tqdm as default instead of Rich. Rich doesn't show up on the stdout of slurm jobs until it's done.
* de-parallelize and default to cosine distance for msd and knn for PHATE.
* make a cli for anndata
* add append to obs cli
* add a cell index that standardizes and writes out parquet
* add example cell index
* update multiexperiment datamodule
* recipes
* add parquet integration test for multi-experiment training
Adds test_multi_experiment_fast_dev_run_with_parquet which verifies the
full Lightning training loop works when loading from a pre-built cell
index parquet via MultiExperimentDataModule.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* demo slurm code
* obs should use fov_name and track_id and id as UUID
* add marker, task and channels to compute the crossval.
* fix prediction remodeling analysis
* test(translation): add training integration tests for VSUNet and FcmaeUNet
Add 6 fast_dev_run tests validating the forward+backward pass:
- VSUNet with MSELoss and MixedLoss (synthetic data)
- FcmaeUNet pretraining (MaskedMSELoss) and fine-tuning (synthetic data)
- VSUNet and FcmaeUNet with real HCSDataModule/CachedOmeZarrDataModule
Also remove architecture-specific language from package description.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test(translation): add config class_path resolution tests
Validate that all class_path entries in fit.yml and predict.yml resolve
to importable classes, matching the DynaCLR test pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Raw logits are now passed through sigmoid before computing binary_accuracy and
binary_f1_score
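Why the sigmoid matters can be shown with a small pure-Python sketch. It assumes a metric that thresholds its input at 0.5 and does not apply a sigmoid itself; the `binary_accuracy` below is a hypothetical re-implementation for illustration, not the library function used in the code.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def binary_accuracy(scores: list, targets: list, threshold: float = 0.5) -> float:
    # Assumed behaviour: threshold the scores as-is, no implicit sigmoid.
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


logits = [-2.0, 0.2, 0.4, 3.0]
targets = [0, 1, 1, 1]

# Thresholding raw logits at 0.5 misclassifies small positive logits.
acc_from_logits = binary_accuracy(logits, targets)
# Thresholding probabilities at 0.5 is equivalent to thresholding logits at 0.
acc_from_probs = binary_accuracy([sigmoid(z) for z in logits], targets)
```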
* fix the dataset statistics
* add contributing and uv to claude.md
* pydantic >2.0
* - Add `schemas.py` with shared Pydantic data models (FOVRecord, etc.)
- Add `collection.py` for ML training collection definitions
- Rename `hours_post_infection` → `hours_post_perturbation` in `_typing.py`
- Update `cell_index.py` to use `collection_path` and renamed field
- Add corresponding tests for new modules and updated cell index
* uvx precommit
* refactor(viscy-data): replace condition_balanced with stratify_by in sampler
* add norm_meta batching helpers and triplet None guard
* - Apply _radians_to_degrees to shear_range (Kornia expects degrees, calls deg2rad internally)
- Fix docstrings to reflect ZYX input order, facet naming, and unit conventions
- Remove stale timepoint_statistics lookup in _normalize.py
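The unit conversion in the first bullet can be sketched as below. `radians_to_degrees` is a hypothetical stand-in for the private `_radians_to_degrees` helper, whose real signature is not shown in this log.

```python
import math
from typing import Sequence, Union


def radians_to_degrees(value: Union[float, Sequence[float]]):
    # Accept a scalar or a range such as (min, max); return the same
    # shape in degrees, which is the unit the downstream library expects.
    if isinstance(value, (int, float)):
        return math.degrees(value)
    return tuple(math.degrees(v) for v in value)


shear_range_deg = radians_to_degrees((-math.pi / 12, math.pi / 12))
```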
* extend DatasetRecord from FOVRecord in viscy-data schemas; add viscy-data dep
* - ExperimentRegistry backed by Collection for per-experiment channel/norm maps
- MultiExperimentIndex with parallel FOV loading and lineage-aware anchors
- MultiExperimentDataModule with stratify_by, num_workers_index, FOV-level split
- Dataset updates for new pipeline; remove ExperimentConfig from __init__
- Consolidate shared test helpers and constants into conftest.py
* pass batch_size to Lightning metric logging for correct step counts
* update recipes and add sampling-strategies guide; add CLAUDE.md
- Update build-cell-index, train-multi-experiment, troubleshooting recipes
- Add sampling-strategies.md documenting FlexibleBatchSampler axes
- Update README table
- Add CLAUDE.md with data pipeline design principles
* update pseudotime and linear classifier scripts for renamed fields
* add configs
* physical normalization and fix to MLP projection layer for adapter.
* code review fixes: remove _components/, assert→ValueError, docstrings, logging
- Remove duplicate viscy_models/_components/ (all imports already use components/)
- Replace assert with raise ValueError in conv_block_2d/3d.py
- Fix numpy docstrings in viscy_data/_utils.py (_ensure_channel_list, _collate_samples)
- Narrow broad except Exception to (OSError, ValueError) in hcs.py
- Use direct row["parent_track_id"] access in cell_index.py (column guarded above)
- Fix frame interval mode().iloc[0] IndexError risk in pseudotime/metrics.py
- Remove backwards-compat re-exports from qc/config.py
- Add logging + warn on silent AUROC ValueError in linear_classifier.py
- Add exc_info=True to PHATE/PCA warning in embedding_writer.py
- Remove resolved TODO in feature.py
- Add _logger and replace print() with logging in dynaclr/utils.py and qc/qc_metrics.py
- Warn when logger_base path does not exist in dynaclr/utils.py
- Move inline imports to top in test_sampler.py
- Add type hints to stems.py compute_stem_channels()
- Remove redundant num_channels reassignment in beta_vae_25d.py
- Fix prek reference in CLAUDE.md (was pre-commit)
- Fix input_channel comment/value: organelle→marker in example config
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* fix tests: remove test_smoke, update config fields, fix parquet dtype
- Remove packages/viscy-data/tests/test_smoke.py: not a real integration
test (magic __all__ count, source string matching, no functional coverage)
- Update TestLinearClassifierTrainConfig: embedding_model+wandb_project
replaced by embedding_model_name+embedding_model_version in schema
- Fix test_parquet_lineage_preserved: add check_index_type=False to handle
object vs StringDtype difference between legacy and parquet paths
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* rename organelle→marker channel type; fix ArrowString/schema test failures
- Rename "organelle" → "marker" in VALID_CHANNELS, pseudotime plotting,
and test fixtures to match the unified channel type naming
- Fix tests: set pd.options.future.infer_string=False in conftest to prevent
pandas 2.x ArrowStringArray from breaking anndata zarr writer
- Fix test_loss: torch.Generator(device=device) for CUDA compatibility
- Update TestLinearClassifierInferenceConfigOrganelle to new schema
(embedding_model_name/version + models list)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* add experiment configs and SLURM scripts
- Update example_cell_index.yaml with real dataset paths and rename
hours_post_infection → hours_post_perturbation
- Add collection YAMLs: A549_ZIKV_multiorganelle, A549_bag_of_channels, example
- Add training fit configs for A549_ZIKV_multiorganelle and A549_bag_of_channels
- Add smoothness evaluation SLURM scripts
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* add linear classifier and pseudotime analysis scripts
- generate_classifier_inference.py: generate inference configs + SLURM
script for a model predictions folder
- generate_train_config_from_folder.py: generate training configs from
prediction folders, supports multi-dataset combine
- label_offset_sweep.py: sweep temporal label offsets for infection
classifier to find optimal onset labeling
- infection_death_remodeling.py: correlate infection, death, and
organelle remodeling event timings across tracks
- infection_onset_distribution.py: compute and plot infection onset
distributions from classifier predictions
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* fix test_inference_reproducibility: remove stale reference zarr dependency
Replace comparison against a pre-computed reference zarr (39170 cells,
now stale) with a self-contained determinism test: run inference twice
with the same seed and assert the outputs match within GPU tolerance.
This removes the brittle hardcoded cell count and the reference zarr
that needs to be regenerated whenever the data or pipeline changes.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
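The shape of such a self-contained determinism test can be sketched with the standard library. `run_inference` here is a hypothetical stand-in that draws seeded random numbers; the real test runs the model, and the tolerance value is an assumption.

```python
import random


def run_inference(seed: int, n: int = 8) -> list:
    # Stand-in for a seeded inference run: same seed, same outputs.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def outputs_match(a: list, b: list, atol: float = 1e-6) -> bool:
    # Compare element-wise within an absolute tolerance, as one would
    # to absorb small GPU non-determinism.
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))


first = run_inference(seed=42)
second = run_inference(seed=42)
different = run_inference(seed=43)
```

Comparing two fresh runs removes the need for a stored reference that must be regenerated when data or pipeline change.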
* feat(translation): add sliding window volume prediction (PR #280 port)
Port PR #280 ("Predict volume") functionality to the modular architecture.
- Enhance _blend_in in prediction_writer.py to support both torch.Tensor
(5D: B,C,Z,Y,X) and np.ndarray (4D: C,Z,Y,X) with unified blending
- Add predict_sliding_windows() to AugmentedPredictionVSUNet for
in-memory Z-sliding inference with linear feathering blending
- Extract _predict_with_tta() helper from predict_step()
- Make forward_transforms/inverse_transforms optional (default identity)
- Add getattr guard for out_stack_depth (clear error for 2D models)
- Add 7 tests: blend_in consistency/edge cases, sliding window shape,
invalid input, missing attribute, optional transforms
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
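The idea of Z-sliding inference with feathered blending can be sketched in one dimension. This is not the code of `predict_sliding_windows()`: the function names, the triangular weight profile, and the tail-window handling are all assumptions made for illustration.

```python
from typing import Callable, List


def feather_weights(depth: int) -> List[float]:
    # Triangular, strictly positive weights, e.g. depth 4 -> [1, 2, 2, 1].
    return [float(min(i + 1, depth - i)) for i in range(depth)]


def predict_sliding_windows_1d(
    volume: List[float],
    depth: int,
    step: int,
    model: Callable[[List[float]], List[float]],
) -> List[float]:
    n = len(volume)
    if depth > n:
        raise ValueError("window depth exceeds volume depth")
    if not 1 <= step <= depth:
        raise ValueError("step must be in [1, depth] so windows leave no gaps")
    starts = list(range(0, n - depth + 1, step))
    if starts[-1] != n - depth:
        starts.append(n - depth)  # extra window so the tail is covered
    weights = feather_weights(depth)
    acc = [0.0] * n   # weighted sum of predictions per slice
    norm = [0.0] * n  # sum of weights per slice
    for z0 in starts:
        pred = model(volume[z0:z0 + depth])
        for i, (p, w) in enumerate(zip(pred, weights)):
            acc[z0 + i] += w * p
            norm[z0 + i] += w
    # Normalise so overlapping windows average instead of adding up.
    return [a / w for a, w in zip(acc, norm)]
```

With an identity model the blended output reproduces the input, which is a convenient sanity check for the weight normalisation.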
* fix examples and configs: rename organelle→marker channel type, add overwrite=True to EmbeddingWriter
- quickstart.py/ipynb: add overwrite=True to EmbeddingWriter to prevent
FileExistsError on notebook re-run
- cross_validate_example.yaml: channels [phase, sensor, organelle] → [phase, sensor, marker]
- example_linear_classifier_inference.yaml: update W&B artifact names
organelle_state-organelle-* → organelle_state-marker-*
- example_linear_classifier_train.yaml: update comment and example
embedding paths to use marker instead of organelle
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* fix(viscy-models): port UNeXt2Stem validation guards to components/stems.py
Copy the in_stack_depth and out_channels divisibility checks from
_components/stems.py to components/stems.py before the upcoming merge
deletes _components/. Without these guards, invalid parameters cause
a silent ZeroDivisionError.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* remove ed notes
* missing pyarrow in dynaclr install. depend on viscy-data triplet optional dependency
* fix troubleshooting.md for MultiExperiment setup
* add wandb as core dependency
* remove the data.log
* fix(monorepo): declare missing deps, fix test infrastructure for root pytest
Dependency fixes (all were missing from pyproject.toml):
- viscy-utils: add wandb, anndata, tensorboard as core deps; remove
redundant optional-dependencies.anndata group
- viscy-data: promote pandas and pyarrow from optional/test to core deps
- viscy-models: add pytorch-metric-learning as core dep
- dynaclr: add statsmodels to eval and test dep groups
Test infrastructure:
- Extract shared test helpers/constants from conftest.py into
helpers.py so tests can import them under --import-mode=importlib
(from conftest import broke when running pytest from repo root)
- Remove pythonpath=["tests"] from dynaclr pyproject.toml (was a
workaround for running from the app dir; root config now handles it)
- Add pythonpath=["applications/dynaclr/tests"] to root pytest config
- Move pd.options.future.infer_string=False into pytest_configure hook
so imports stay at top level (fixes ruff E402)
Bug fixes:
- hcs.py: scope propagate=False to viscy_data.hcs.cache logger instead
of viscy_data parent, which was breaking caplog in downstream tests
- Remove test_small_group_emits_warning: tested log wording not behavior
- Remove test_nan_gene_name_to_ntc: tested pandas fillna not module code
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* docs(dynaclr): add tracking note for anndata ArrowString zarr bug
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* chore: remove tests/__init__.py files (importlib mode doesn't need them)
--import-mode=importlib does not use package-style imports for test
files, so __init__.py in test directories is unnecessary and can cause
import collisions.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* fix(dynaclr): resource leak, None propagation, CSV ambiguity
- Wrap open_ome_zarr() in context managers in index.py and cell_index.py
- Raise RuntimeError in _sample_positives() when no positive found instead of silent self-positive fallback
- Raise FileNotFoundError when tracking CSV is missing (fail fast before training)
- Raise ValueError when multiple CSVs exist in a tracks dir
- Update test_empty_tracks_empty_anchors → test_missing_tracking_csv_raises
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
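The fail-fast CSV lookup described above can be sketched as follows. `find_tracking_csv` is a hypothetical helper name, and the `*.csv` glob is an assumption about how the tracks directory is searched.

```python
import tempfile
from pathlib import Path


def find_tracking_csv(tracks_dir: Path) -> Path:
    csvs = sorted(tracks_dir.glob("*.csv"))
    if not csvs:
        # Fail before training starts instead of yielding empty tracks.
        raise FileNotFoundError(f"No tracking CSV found in {tracks_dir}")
    if len(csvs) > 1:
        # Refuse to guess which file is the right one.
        names = ", ".join(p.name for p in csvs)
        raise ValueError(f"Multiple CSVs in {tracks_dir}: {names}")
    return csvs[0]


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tracks.csv").write_text("track_id,t\n0,0\n")
    found_name = find_tracking_csv(Path(tmp)).name
```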
* chore(dynaclr): minor cleanup in inspect_dataloader and test formatting
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* add package and applications folder rules for CLAUDE.md
* context managers rule
* remove __version.py
* delete gsd markdowns
* chore: remove stale applications/dynacell directory
Remove old Hydra output logs and cache files from the defunct dynacell
application. This directory was already excluded from the uv workspace.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Eduardo Hirata-Miyasaki <edhiratam@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Rename the virtual staining application to match project branding.
Follows Pattern A naming convention (like dynaclr, qc): no viscy- prefix.
- Directory: applications/translation → applications/cytoland
- Package: viscy-translation → cytoland
- Imports: from viscy_translation.engine → from cytoland.engine
- Config class_paths: cytoland.engine.VSUNet
- CLI: python -m cytoland fit/predict
- README rewritten to match dynaclr style with paper reference
- Root README updated with Cytoland in Applications table
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…equired,
matches runtime ultrack columns)
- Add log_embeddings_every_n_epochs to ContrastiveModule; logs UMAP colored
by condition/experiment/HPI to WandB on validation epochs
- Fix detach_sample to split channels as extra columns (landscape-friendly)
- Guard MultiExperimentTripletDataset.__getitem__ with NotImplementedError
- Add demo scripts for WandB image and UMAP logging
- Add engine tests for embedding accumulation and epoch-gating logic
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com> Co-authored-by: Eduardo Hirata-Miyasaki <edhiratam@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* refactor: clean up viscy-utils optional dependencies
* refactor: clean up dynaclr test imports
* fix: lazy import annotation dependencies
* feat: add fnet model for cytoland benchmarking
* remove deprecated files that were left over from the monorepo restructuring
* CI to test applications/: each test-applications matrix job cds into the app directory before running pytest.
* fix: resolve root pytest conftest collision and lazy-import linear_classifier deps
Remove applications/*/tests from root testpaths — multiple app conftest.py
files collide as the same plugin name when collected together. Apps are now
tested per-directory via CI matrix (added in prior commit).
Lazy-import wandb and anndata in linear_classifier.py (same pattern as
annotation.py) so the module is importable without optional extras.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add e2e FNet3D test and move unit tests to test_unet/
- Add test_fnet3d_real_datamodule_fast_dev_run: end-to-end test
exercising HCSDataModule → VSUNet(FNet3D) → Trainer
- Move test_unet3d.py from tests/ root to tests/test_unet/
matching the location of other UNet test files
- Add test_state_dict_keys verifying recursive key structure
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: extract _make_divisible_pad helper and simplify Unet3d
- Extract duplicated DivisiblePad construction into _make_divisible_pad()
- Cache 2**depth as self._divisor in Unet3d.__init__
- Make downsamples_z a class attribute instead of a @property
- Remove trivial docstring on _DoubleConv3d.forward
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
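The arithmetic behind padding to a multiple of 2**depth can be sketched in plain Python. `padded_size` is a hypothetical helper that only shows the size computation; the actual `_make_divisible_pad()` constructs a DivisiblePad transform and is not reproduced here.

```python
def padded_size(size: int, depth: int) -> int:
    # A network with `depth` 2x downsampling stages needs each spatial
    # dimension to be divisible by 2**depth.
    divisor = 2 ** depth
    # Ceiling division, then scale back up to the next multiple.
    return -(-size // divisor) * divisor


def pad_amount(size: int, depth: int) -> int:
    return padded_size(size, depth) - size
```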
* refactor: enable root pytest to collect all packages and applications
Remove pythonpath = ["tests"] from app configs and __init__.py from app
test dirs — both caused conftest plugin collisions when root pytest
collected multiple applications together. Replace all `from conftest
import` / `from .conftest import` with either inlined constants or
fixture factories so test files no longer need conftest on sys.path.
Root testpaths now includes applications/*/tests, so `uv run pytest`
runs all 836 tests (packages + applications) in a single invocation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: consolidate duplicated test constants into conftest fixtures
Replace inlined SYNTH_C/D/H/W, IMG_H/W, N_T/Z/TRACKS, FCMAE_H/W, and
MIXED_LOSS_H/W constants with synth_dims and hcs_dims fixtures returning
dicts. Single source of truth in each app's conftest.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: restore empty src/viscy/__init__.py for wheel build
Hatchling requires the package directory referenced in
[tool.hatch.build.targets.wheel] to exist. Without it, `uv build`
fails and `import viscy` breaks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add Spotlight foreground-aware loss with precomputed masks (#389)
* feat: add SpotlightLoss for foreground-aware virtual staining
Implement Spotlight (Kalinin et al. 2025, arXiv:2507.05383), a
model-agnostic loss function that focuses supervision on biologically
relevant foreground regions via:
- Masked MSE using per-sample Otsu thresholding on targets
- Dice loss on soft-thresholded predictions (tunable sigmoid)
Works as a drop-in loss_function for any VSUNet architecture.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: document Spotlight paper deviations and implementation choices
- Add config comments listing all deviations from the paper: optimizer,
normalization, patch selection, target normalization, training length,
isotropic voxels
- Add code comments explaining tunable sigmoid clamping rationale and
zero-foreground fallback behavior
- Fix _otsu_threshold_batch docstring return shape
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address Spotlight paper deviations
- Normalize both source and target channels in config (paper A.4)
- Use max_steps: 50000 instead of max_epochs (paper A.3)
- Add per-sample min_foreground_fraction filtering to SpotlightLoss
(paper A.4: only patches with ≥0.1% FG voxels used for training)
- FG fraction computed per-sample, not batch-wide
- Add min_foreground_fraction: 0.001 to example config
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: simplify SpotlightLoss foreground filtering logic
- Co-locate soft_pred weighting with mask weighting in same block
- Remove misleading sample_weight is not None guard
- Use (pred * 0).sum() for zero-gradient return (maintains graph)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: validate eps, n_bins, min_foreground_fraction in SpotlightLoss
Add constructor validation for remaining parameters to catch
misconfiguration early, consistent with existing sigmoid_k/lambda_mse
checks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: fix misleading z-score normalization comment in SpotlightLoss
The docstring incorrectly implied z-score normalization was what the
paper uses. Rewritten to acknowledge this is a deviation: the paper
subtracts the Otsu threshold, while the implementation recomputes
Otsu inside forward() on already-normalized data.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: precomputed FOV-level Otsu thresholds for Spotlight
- Add opt-in compute_otsu to generate_normalization_metadata() and
viscy preprocess CLI — stores per-FOV Otsu threshold in norm_meta
- Add min_foreground_fraction to SlidingWindowDataset/HCSDataModule
with bounded retry loop (max 10 random retries) for patch filtering
- Refactor SpotlightLoss: replace n_bins + min_foreground_fraction
with fg_threshold (None=Otsu fallback, float=fixed threshold)
- Update config: use subtrahend=otsu_threshold, fg_threshold=0.0,
min_foreground_fraction=0.001
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: generalize foreground filtering to nonzero fraction check
- Rename min_foreground_fraction → min_nonzero_fraction in
SlidingWindowDataset and HCSDataModule
- Add nonzero_threshold (default 0.0) and nonzero_channel (default
None → first target) parameters for method-agnostic filtering
- Remove coupling to Otsu thresholds in norm_meta — check is now a
simple (patch >= threshold) comparison on the configured channel
- Validate nonzero_channel against channel map in __init__
- Fix MaskTestDataset to forward **kwargs to super().__init__()
- Hoist check_key before retry loop, fix attribute ordering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: configurable max_nonzero_retries with exhaustion warning
- Add max_nonzero_retries parameter (default 100) to
SlidingWindowDataset and HCSDataModule
- Log warning when retries are exhausted instead of silently
returning a below-threshold sample
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: remove papers directory
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: use local-mean downsampling for robust Otsu thresholding
Point-sampling at stride 32 loses spatial structure needed for clean
bimodal histograms. Replace with downscale_local_mean (default factor
4) that averages local neighborhoods, preserving the FG/BG separation
that Otsu needs. Only affects the compute_otsu path — existing
grid sampling for mean/std is unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use denser grid + median filter for Otsu instead of full-volume read
Replace _downsample_local_mean (which loaded the entire FOV at full
resolution) with a denser _grid_sample at otsu_grid_spacing=8 followed
by a median filter (size=3). This is fast (sparse tensorstore reads)
while providing enough spatial density to capture inter-cell gaps.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: hoist Otsu imports and restrict median filter to spatial dims
- Move scipy/skimage imports before the position loop (avoid repeated
import lookups per FOV)
- Fix median_filter size from (3,3,3,3) to (1,1,3,3) — only smooth
Y/X, not across timepoints or Z-slices
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add preprocessing tests for compute_otsu path
- test_compute_otsu_stores_threshold: verifies otsu_threshold key
is written to fov_statistics when compute_otsu=True
- test_compute_otsu_threshold_separates_bimodal: verifies threshold
falls between BG and FG modes on bimodal fluorescence data
- test_compute_otsu_false_omits_threshold: verifies otsu_threshold
is NOT present when compute_otsu=False (default)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: precompute foreground masks for Spotlight training
Add `generate_fg_masks()` to precompute binary FG masks during
preprocessing (smooth + Otsu threshold), store as zarr arrays, and
thread through the data pipeline to SpotlightLoss.
This eliminates the mismatch where Otsu thresholds were computed on
smoothed/subsampled data but applied to raw patches at training time,
producing noisy FG/BG masks at boundaries.
Following pytorch_fnet's WeightedMSE weight_map_batch pattern:
- Preprocessing: `generate_fg_masks()` smooths full-res data per FOV
per timepoint and stores binary masks as "fg_mask" zarr arrays
- Data loading: `SlidingWindowDataset` loads masks alongside images,
threads through MONAI transforms via temp keys for spatial co-alignment
- Loss: `SpotlightLoss.forward()` accepts optional `fg_mask` argument
(priority: precomputed > fixed threshold > runtime Otsu)
- Engine: `VSUNet._compute_loss()` helper passes mask from batch to loss
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address PR review feedback on nonzero filtering
- Add parameter validation for min_nonzero_fraction and max_nonzero_retries
- Fix retry loop: check fraction on every attempt (including last),
warn only when all attempts fail the criterion
- Scope nonzero filtering to training only via _train_filter_settings
property (val/test/predict are deterministic, no random resampling)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
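The corrected retry logic (check every attempt including the last, warn only when all fail) can be sketched as follows. The function names are hypothetical and patches are flat lists here; only the `patch >= threshold` comparison and the parameter names follow the commit messages.

```python
import logging
from typing import Callable, List, Tuple

_logger = logging.getLogger(__name__)


def nonzero_fraction(patch: List[float], threshold: float = 0.0) -> float:
    return sum(v >= threshold for v in patch) / len(patch)


def sample_with_retries(
    draw_patch: Callable[[], List[float]],
    min_nonzero_fraction: float,
    max_nonzero_retries: int,
    threshold: float = 0.0,
) -> Tuple[List[float], bool]:
    if not 0.0 <= min_nonzero_fraction <= 1.0:
        raise ValueError("min_nonzero_fraction must be in [0, 1]")
    if max_nonzero_retries < 0:
        raise ValueError("max_nonzero_retries must be >= 0")
    # One initial draw plus up to max_nonzero_retries redraws;
    # the criterion is evaluated on every draw, including the last.
    for _ in range(max_nonzero_retries + 1):
        patch = draw_patch()
        if nonzero_fraction(patch, threshold) >= min_nonzero_fraction:
            return patch, True
    _logger.warning(
        "No patch met min_nonzero_fraction=%s after %d retries",
        min_nonzero_fraction,
        max_nonzero_retries,
    )
    return patch, False
```

Returning the flag alongside the patch is a choice made for this sketch so that callers and tests can tell an accepted sample from an exhausted one.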
* refactor: address code review — hoist imports, use context managers
- Move scipy/skimage imports to top of meta_utils.py (CLAUDE.md rule)
- Move generate_fg_masks and SlidingWindowDataset imports to top of
test files
- Use context managers for zarr stores in test fixtures
- Replace inline `import pytest` with top-level `raises` import
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: per-channel SpotlightLoss for multi-target training
Compute loss per (batch, channel) pair instead of globally:
- Masked MSE: channels with FG mask use masked MSE, channels without
fall back to regular MSE
- Dice: only channels with FG mask data contribute; channels without
are excluded from the Dice average
- Otsu: compute per-(sample, channel) thresholds instead of per-sample
This correctly handles multi-channel targets where masks exist for
only some channels (e.g., Nuclei mask but not Membrane mask).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add fg_mask_channels param to preprocess CLI
Allow specifying which channels to compute FG masks for, independently
from which channels get normalization/Otsu. Defaults to all channels
with Otsu thresholds. Enables e.g. masking only Nuclei when training
with both Nuclei and Membrane targets.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use all-ones default for non-target mask channels
Channels without explicit FG masks should get full supervision (all
voxels contribute to loss), not zero supervision. Changed from
np.zeros to np.ones for the mask array initialization.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: exclude all-ones mask channels from Dice loss
All-ones masks (placeholder for "no real FG/BG mask") should not
contribute to Dice — they would penalize the model for not predicting
everything as foreground. Now only channels with real masks (both
0s and 1s) contribute to Dice. If no channel has a real mask, Dice
is zero and a warning is logged.
Also simplified MSE path: all-ones masks naturally give regular MSE
via masked_sum / fg_per_ch = sq_err.sum() / n_spatial, eliminating
the need for a torch.where fallback.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
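The claim that an all-ones mask reduces masked MSE to regular MSE can be checked with a small sketch. It is written in plain Python on a flattened (sample, channel) volume; the function name `masked_mse` and the list-based layout are illustrative assumptions, not the SpotlightLoss code.

```python
def masked_mse(pred, target, mask):
    """Masked MSE over one flattened (sample, channel) volume.

    Hypothetical helper: sums squared error where mask == 1 and divides
    by the number of foreground voxels.
    """
    sq_err = [(p - t) ** 2 for p, t in zip(pred, target)]
    masked_sum = sum(e * m for e, m in zip(sq_err, mask))
    fg_count = sum(mask)
    return masked_sum / fg_count


pred = [0.0, 1.0, 2.0, 3.0]
target = [0.0, 0.0, 0.0, 0.0]
plain_mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# With an all-ones mask the denominator equals the voxel count,
# so the masked form matches the regular MSE.
all_ones = [1.0] * len(pred)
assert masked_mse(pred, target, all_ones) == plain_mse
```

The sketch divides by zero for an all-zero mask; the later "all-zero mask MSE" fix in this log addresses that case with a fallback.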
* refactor: use warnings.warn for no-mask Dice, fix test coverage
- Replace _logger.warning with warnings.warn (built-in dedup prevents
log spam on every forward call)
- Fix test_partial_mask_ignores_placeholder_channel_in_dice to use a
real FG/BG mask on channel 0 (not all-ones which has_real_mask
correctly excludes)
- Remove redundant "what" comment
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: use _logger.warning with one-shot flag for no-mask Dice
Replace warnings.warn with _logger.warning guarded by
_warned_no_real_mask flag — fires once per SpotlightLoss instance.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
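A one-shot warning of this kind might look like the sketch below. The class name `OneShotWarner`, the method name, and the message text are assumptions for illustration; only the `_warned_no_real_mask` flag name comes from the commit message.

```python
import logging

_logger = logging.getLogger(__name__)


class OneShotWarner:
    """Hypothetical stand-in for a loss module that warns once per instance."""

    def __init__(self):
        self._warned_no_real_mask = False

    def warn_no_real_mask(self):
        """Log the warning on the first call only; return True if it fired."""
        if self._warned_no_real_mask:
            return False
        _logger.warning("No channel has a real FG/BG mask; Dice term is zero.")
        self._warned_no_real_mask = True
        return True
```

Because the flag lives on the instance, each new loss object warns once, which is the behavior the commit describes.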
* fix: address Copilot review comments on PR #389
- Remove unused num_workers param from generate_fg_masks
- Fix all-zero mask MSE: fall back to unmasked MSE via torch.where
- Use keyword arg for fg_mask in _compute_loss (avoids TypeError
with non-Spotlight losses)
- Fix check_key operator precedence (nonzero_channel was bypassing
min_nonzero_fraction=0 guard)
- Guard mask-based nonzero check to target channels only (source
channels fall back to raw threshold)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: adapt FNet3D tests to fixture pattern after merge
FNet3D tests use hardcoded in_stack_depth=4 (not synth_dims["d"]=5)
because FNet3D requires Z divisible by 2^depth. Also fixed attribute
name (_predict_pad, not predict_pad) and removed synthetic_batch
dependency since FNet3D needs its own tensor shape.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add scipy dependency and ndim validation for Unet3d
- Add scipy as explicit dependency in viscy-utils (directly imported
in meta_utils.py and evaluation modules, not just transitive)
- Add ndim != 5 check in Unet3d.forward() to catch 4D input with a
clear error instead of a misleading spatial dimension mismatch
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: stream fg_mask writes to avoid OOM on large FOVs
Replace in-memory full-FOV allocation with streaming writes:
use pos.create_zeros() to allocate the zarr array on disk, then
write per-timepoint per-channel slices. For a typical FOV
(80×6×50×2048×2048), this reduces memory from ~96GB to ~16MB
(one Z×Y×X slice at a time).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: migrate cytoland logging from TensorBoard to W&B
Replace TensorBoard-specific _log_samples with the existing
log_image_grid() helper from log_images.py which dispatches to both
TensorBoard and W&B. Add DDP guard (is_global_zero) to prevent
duplicate image logging in distributed training.
Update example configs from TensorBoardLogger to WandbLogger.
No hard wandb dependency needed — import is lazy in log_images.py.
Tests keep TensorBoardLogger (log_image_grid handles both backends).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add SLURM submission template for cytoland training
Based on dynaclr's fit_slurm.sh pattern. Uses Lightning CLI with
config overrides for run name, save dir, and checkpoint path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add FCMAE pretrain/finetune configs and encoder-only checkpoint loading
Add encoder_only parameter to FcmaeUNet for loading only encoder weights
from a pretrained checkpoint, enabling fine-tuning with different output
channels (e.g., 1→2). Create example configs for self-supervised FCMAE
pretraining (pretrain_fcmae.yml) and supervised fine-tuning
(finetune_fcmae.yml) matching the patterns from pretrain_3d.py and
finetune_3d.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: review feedback — worker-safe RNG, clear fg_mask TypeError, Otsu comment
- Replace stdlib random.randint with torch.randint in DataLoader retry
loop (torch seeds each worker independently, avoiding correlated retries)
- Wrap fg_mask keyword dispatch in _compute_loss with TypeError catch
that names the misconfiguration (loss vs data config mismatch)
- Add derivation comment showing Otsu inter_class_var formula is correct
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: migrate FCMAE pipeline to batched GPU transforms
Move augmentation transforms from per-sample (Decollated → MONAI wrappers
→ StackChannelsd → collate) to batched GPU execution via
on_after_batch_transfer, matching the DynaCLR pattern.
Changes across 3 packages + cytoland:
viscy-transforms:
- Add BatchedStackChannelsd (inherits StackChannelsd, dim=1 cat)
- Add BatchedRandInvertIntensityd (per-sample randomization on batched tensors)
viscy-data:
- Add on_after_batch_transfer to GPUTransformDataModule base class
(dispatches train_gpu_transforms vs val_gpu_transforms)
- Add on_after_batch_transfer dispatcher to CombinedDataModule
(handles list batches for training, single dict for validation)
- Add gpu_augmentations param + on_after_batch_transfer to HCSDataModule
cytoland:
- Remove FcmaeUNet.train_transform_and_collate / val_transform_and_collate
- Add _merge_batches to concatenate per-dataset batches from CombinedLoader
- Update FCMAE configs to use Batched* transforms (no more Decollated)
- Update test fixtures to use BatchedStackChannelsd
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address PR review issues and Copilot comments
- Replace try/except TypeError in _compute_loss with inspect.signature
check that handles both explicit fg_mask param and **kwargs
- Add freeze_encoder to FcmaeUNet Parameters docstring
- Move inline imports to top of test_training_integration.py
- Restore weight channel limitation comment in hcs.py
- Replace O(N log N) _find_window with bisect.bisect_right (O(log N))
- Optimize generate_fg_masks: bulk write non-target channels, tile
zarr chunks at 256 for large FOVs
- Align BatchedRandInvertIntensityd with MONAI RandomizableTransform
pattern (randomize() + per-sample _do_transform tensor)
- Fix Airtable test_dataframe_columns expected columns
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
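The `bisect.bisect_right` lookup mentioned above can be sketched as follows. This assumes the dataset keeps an ascending list of cumulative window counts per position; the function name `find_window` and the argument layout are hypothetical, not the actual `_find_window` signature.

```python
from bisect import bisect_right


def find_window(cumulative_counts, index):
    """Map a flat sample index to (position_number, offset_in_position).

    cumulative_counts[i] is the total number of windows in positions 0..i
    (an assumed layout for this sketch).
    """
    if not 0 <= index < cumulative_counts[-1]:
        raise IndexError(f"index {index} out of range")
    # First position whose cumulative count exceeds the index: O(log N).
    pos = bisect_right(cumulative_counts, index)
    start = cumulative_counts[pos - 1] if pos > 0 else 0
    return pos, index - start
```

For counts `[3, 5, 9]`, index 4 falls in position 1 at offset 1, since position 0 covers indices 0 to 2.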
* fix: make cytoland tta padding metadata-independent
* fix: propagate fg_mask through spatial augmentations
Spatial transforms (RandAffined, RandFlipd, etc.) now automatically
include fg_mask keys when fg_mask_key is set, ensuring masks stay
pixel-aligned with source/target after augmentation.
Uses an explicit _SPATIAL_TRANSFORMS allowlist — intensity transforms
(contrast, noise, etc.) are excluded to avoid corrupting binary masks.
Injection is idempotent (safe across repeated setup() calls).
Covers both CPU augmentations (_fit_transform) and GPU augmentations
(gpu_augmentations in __init__).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
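A rough sketch of allowlist-based, idempotent key injection is shown below. The transform classes here are hypothetical stand-ins with a mutable `keys` list, and `inject_mask_keys` is an illustrative name; the real transforms and the `_SPATIAL_TRANSFORMS` tuple live in the packages named above.

```python
class RandFlip:
    """Hypothetical spatial transform with a list of dict keys."""

    def __init__(self, keys):
        self.keys = list(keys)


class RandNoise:
    """Hypothetical intensity transform; must never touch binary masks."""

    def __init__(self, keys):
        self.keys = list(keys)


_SPATIAL = (RandFlip,)  # explicit allowlist, assumed for this sketch


def inject_mask_keys(transforms, mask_keys):
    """Append mask keys to allowlisted transforms, skipping keys already present."""
    for transform in transforms:
        if isinstance(transform, _SPATIAL):
            for key in mask_keys:
                if key not in transform.keys:
                    transform.keys.append(key)
    return transforms
```

The membership check is what makes repeated calls safe: a second call finds the keys already present and changes nothing.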
* refactor: restructure cytoland configs by model with composable recipes
Reorganize example configs from flat files (fit.yml, predict.yml) into
model-specific directories (vscyto2d/, vscyto3d/, vsneuromast/, fnet3d/)
with composable recipe fragments.
New infrastructure:
- viscy_utils.compose: PyYAML-based config composition via `base:` key
(recursive deep merge, lists replace, zero new deps)
- CLI integration: detect `base:` in --config/-c and auto-compose
before LightningCLI (no-op for configs without base:)
Model configs:
- vscyto2d: pretrain, finetune, predict (FcmaeUNet, in_stack_depth=1)
- vscyto3d: pretrain, finetune, train_spotlight, predict (UNeXt2, z=5)
- vsneuromast: fit, predict (UNeXt2, z=21, no pretraining)
- fnet3d: fit, predict (FNet3D, depth=4)
Recipes (reusable fragments):
- trainer/: fit_4gpu, predict_gpu
- models/: fcmae_2d, fcmae_3d, unext2_3d, unext2_neuromast, fnet3d
- data/: hcs_nuc_mem_{2d,3d,neuromast}, cached_pretrain
- modes/: spotlight (loss + fg_mask + Otsu normalization)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
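The merge semantics described for `base:` composition (recursive merge of mappings, lists replaced wholesale) could be sketched as below. This is a minimal illustration that operates on already-loaded dictionaries; the function name `deep_merge` and the example keys are assumptions, and the YAML loading and `base:` resolution steps are omitted.

```python
import copy


def deep_merge(base, override):
    """Return a new dict: override wins, nested dicts merge, lists replace."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the override replace the base value.
            merged[key] = copy.deepcopy(value)
    return merged


recipe = {"trainer": {"devices": 4, "callbacks": ["a", "b"]}, "seed": 1}
leaf = {"trainer": {"devices": 1, "callbacks": ["c"]}}
composed = deep_merge(recipe, leaf)
```

Copying on both sides keeps the inputs unmodified, so one recipe fragment can be reused by several leaf configs.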
* feat: port CellDiff models to viscy-models as optional dependency (#394)
* feat: port CellDiff models to viscy-models as optional dependency
Add UNetViT3D (deterministic) and CELLDiffNet (flow-matching backbone)
to viscy-models behind an optional `celldiff` extra ["diffusers", "einops"].
CELLDiff3DVS training wrapper and transport module deferred to Stage 3
(Dynacell application layer). Dead code from the fork dropped: CondConvNet,
BertPredictionHeadTransform, MLMHead, PixelShuffle3d, Upsample, Downsample,
init_weights, and unused 1D/2D positional embedding functions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: validate patch divisibility in CellDiff model constructors
Reject spatial sizes where latent dimensions are not exactly divisible
by patch_size after encoder downsampling. Previously, integer division
silently truncated remainders, causing the decoder to crash on torch.cat
with mismatched skip-connection shapes (e.g. input_spatial_size=[10,64,64]
with patch_size=4).
Also: fix positional embedding return type annotations (float32 -> float64),
add cond channel validation in CELLDiffNet.forward(), and add validation
tests for both models.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address PR #388 review findings
- SLURM script: fix CLI command from `cytoland fit` to `viscy fit`
(cytoland has no [project.scripts] entrypoint)
- FcmaeUNet: add `ckpt_path` to save_hyperparameters(ignore=...) so
load_from_checkpoint works with encoder_only=True
- test_training_integration: move inline imports to module level
- celldiff models: convert constructor assert to if/raise ValueError
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: only allow missing keys in final crop when fg_mask is configured
Keep strict key validation (allow_missing_keys=False) when no fg_mask_key
is set, so mis-specified channel names fail fast at crop time instead of
silently skipping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: unify 3D U-Net model API behind UNet3DBase (#396)
* feat: add generalized 3D conv blocks for unified U-Net base
Move Block, ResnetBlock from celldiff/modules/simple_diffusion.py to
unet/blocks.py with configurable norm (group/batch), activation
(silu/relu), and residual flag. Add TimestepEmbedder and
ConvBottleneck3D. Replace einops with plain PyTorch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add UNet3DBase iterative encoder-decoder with injected bottleneck
Parametrized 3D U-Net base with configurable norm, activation, residual,
downsample_z, time conditioning, and conditioning input. Exposes
num_blocks property and downsamples_z attribute for engine compatibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add ViTBottleneck3D encapsulating CellDiff transformer bottleneck
Extracts PatchEmbed3D, sinusoidal positional embedding, TransformerBlock
stack, FinalLayer, and unpatchify into a single module with the unified
bottleneck interface forward(x, time_embeds=None).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: rewrite Unet3d (FNet) as thin wrapper of UNet3DBase
Replace recursive _FNetRecurse with iterative UNet3DBase configured for
BatchNorm+ReLU, non-residual blocks, all-dim downsampling, and conv
bottleneck. FNet weight init preserved via self.apply(). Add positive
FNet sliding-window test to cytoland engine tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: rewrite UNetViT3D + CELLDiffNet as thin wrappers, remove einops
Rewrite both CellDiff models as thin UNet3DBase wrappers with injected
ViTBottleneck3D. Delete simple_diffusion.py (Block/ResnetBlock moved to
unet/blocks.py). Remove TimestepEmbedder from celldiff/modules (moved
to unet/blocks.py). Remove einops from celldiff optional deps and test
importorskips.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add unified parametrized tests for all 3D U-Net variants
Shared assertions for num_blocks, downsamples_z, forward pass shape
preservation, and UNet3DBase lineage across Unet3d, UNetViT3D, and
CELLDiffNet.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use consistent Tensor type annotation in CellDiff wrappers
Replace mixed torch.Tensor / Tensor usage with Tensor throughout
forward signatures and docstrings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Dynaclr-dino (#387)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Eduardo Hirata-Miyasaki <edhiratam@gmail.com>
* fix: remove duplicate fields from auto-merged test_database.py
Auto-merge kept both branches' additions of channel_names, marker,
and tracks_path. Remove the duplicates.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: cache loss function fg_mask compatibility check at init time
Move inspect.signature() call from _compute_loss() (hot path, every
batch) to __init__() (one-time). Stores result as _loss_accepts_fg_mask
boolean flag.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
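Caching the signature check at construction time might look like this sketch. The `Engine` class, `accepts_fg_mask` helper, and the toy loss functions are hypothetical; only the `_loss_accepts_fg_mask` flag name and the use of `inspect.signature` come from the commit message.

```python
import inspect


def accepts_fg_mask(loss_fn):
    """True if loss_fn takes an fg_mask parameter or arbitrary **kwargs."""
    params = inspect.signature(loss_fn).parameters.values()
    return any(
        p.name == "fg_mask" or p.kind is inspect.Parameter.VAR_KEYWORD
        for p in params
    )


class Engine:
    """Hypothetical training wrapper that inspects the loss once."""

    def __init__(self, loss_fn):
        self.loss_fn = loss_fn
        self._loss_accepts_fg_mask = accepts_fg_mask(loss_fn)  # one-time cost

    def compute_loss(self, pred, target, fg_mask=None):
        # Hot path: only a boolean check per batch.
        if self._loss_accepts_fg_mask and fg_mask is not None:
            return self.loss_fn(pred, target, fg_mask=fg_mask)
        return self.loss_fn(pred, target)


def plain_loss(pred, target):
    return abs(pred - target)


def masked_loss(pred, target, fg_mask=None):
    scale = 1 if fg_mask is None else fg_mask
    return abs(pred - target) * scale
```

A loss without `fg_mask` support is then called with two arguments even when a mask is present in the batch.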
* fix: skip fg_mask in predict stage and handle --config= form in CLI
P1: _setup_predict now strips fg_mask_key from dataset settings so
prediction works on datasets without precomputed masks.
P2: _maybe_compose_config now handles --config=path.yml and -c=path.yml
in addition to the space-separated form.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
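Handling both the space-separated and the `=` forms of the config flag can be sketched as below. The function name `find_config_paths` is an assumption and this is not the body of `_maybe_compose_config`; it only shows the argument scanning.

```python
def find_config_paths(argv):
    """Collect config paths from '--config X', '-c X', '--config=X', '-c=X'."""
    paths = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in ("--config", "-c") and i + 1 < len(argv):
            paths.append(argv[i + 1])
            i += 2
            continue
        for prefix in ("--config=", "-c="):
            if arg.startswith(prefix):
                paths.append(arg[len(prefix):])
                break
        i += 1
    return paths
```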
* fix: handle base:null in config composition and constant-channel Otsu
compose.py: normalize base:null to empty list instead of crashing
with TypeError on iteration.
meta_utils.py: catch ValueError from threshold_otsu on constant-value
channels (e.g. all-zero) and default to 0.0.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: add ForegroundMaskSupport collaborator for fg_mask logic
Extract _SPATIAL_TRANSFORMS tuple and ForegroundMaskSupport class into
a new foreground_masks.py module. This collaborator will encapsulate all
fg_mask/Spotlight logic (validate, read, inject, extract, patch transforms)
so SlidingWindowDataset and HCSDataModule can delegate to it instead of
scattering mask conditionals across 21 touch points.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: extract SlidingWindowDataset and MaskTestDataset into sliding_window.py
Move both dataset classes from hcs.py to a new sliding_window.py module.
Code is moved verbatim with inline fg_mask logic preserved — the collaborator
wiring happens in the next commit. Update import paths in __init__.py and
test_hcs.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: wire ForegroundMaskSupport collaborator into SlidingWindowDataset
SlidingWindowDataset now delegates all fg_mask logic to a
ForegroundMaskSupport collaborator: validate_and_store (per-position),
read_window (inside retry loop), inject_into_sample (before transform),
extract_from_sample (after transform). The fg_mask_key constructor param
is preserved for backward compat — it creates the collaborator internally.
HCSDataModule._inject_mask_keys becomes a 3-line delegate to
ForegroundMaskSupport.patch_spatial_transforms. _SPATIAL_TRANSFORMS and
its viscy_transforms imports are removed from hcs.py (now live in
foreground_masks.py).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: centralize mask temp-key naming, remove extract_from_sample
Add ForegroundMaskSupport.mask_temp_keys() as single source of truth for
the __fg_mask_{ch} naming convention. Use it in __init__, _fit_transform,
and _final_crop instead of inlining the f-string pattern in 4 places.
Simplify inject_into_sample to use precomputed _mask_keys instead of
recomputing f-strings per call. Remove extract_from_sample and its
callback indirection — the caller now stacks directly via _stack_channels.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: ignore encoder_only in FcmaeUNet.save_hyperparameters
load_from_checkpoint on an encoder-only fine-tuned checkpoint would crash
with ValueError because encoder_only=True was saved but ckpt_path was not.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: generalize mask channel indexing for target-only mask arrays
ForegroundMaskSupport now auto-detects whether the mask array uses the
full image channel layout or a compact target-only layout. On the first
position, validate_and_store compares mask vs image channel counts and
sets _mask_ch_idx accordingly. read_window uses internal indices instead
of the caller-provided target_ch_idx.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use context managers for open_ome_zarr in HCSDataModule
_setup_fit, _setup_test, and _positions_maybe_single opened zarr stores
without `with` statements, leaking file handles. Zarr v2 DirectoryStore
positions survive store closure so this is safe.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add map_location="cpu" to VSUNet checkpoint loading
Prevents device mismatch when a GPU-saved checkpoint is loaded during
CPU-based config parsing. Matches FcmaeUNet._load_encoder_weights which
already uses map_location="cpu".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: reject non-divisible input sizes in ViTBottleneck3D
Floor division silently accepted odd spatial sizes (e.g. 514) that cause
encoder/decoder shape mismatches at concat time. Now validates that each
downsampled dimension is exactly divisible by 2^num_downsamples before
computing the latent size.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
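The difference between silent floor division and explicit validation can be illustrated with a short sketch. It assumes every spatial dimension is downsampled the same number of times, which may not hold when Z downsampling is disabled; `latent_size` is a hypothetical name.

```python
def latent_size(spatial_size, num_downsamples):
    """Return the downsampled size, rejecting sizes that would be truncated."""
    divisor = 2 ** num_downsamples
    bad = [s for s in spatial_size if s % divisor != 0]
    if bad:
        raise ValueError(
            f"spatial sizes {bad} are not divisible by {divisor} "
            f"(2^{num_downsamples})"
        )
    return [s // divisor for s in spatial_size]
```

With two downsampling stages, 514 is rejected because 514 // 2 = 257 is odd; plain floor division would have returned 128 and hidden the mismatch until concat time.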
* fix: store mask channel indices per position, not globally
_mask_ch_idx was cached from the first FOV and reused for all positions.
This breaks datasets that mix full-channel and target-only mask layouts
across positions. Now stores per-position indices in _mask_ch_indices
so each position's mask array is read with the correct channel mapping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use constant value as Otsu threshold for uniform channels
A constant nonzero channel got otsu_threshold=0.0, causing
generate_fg_masks to mark everything as foreground and Spotlight
normalization to produce huge values. Using the constant value
itself means nothing is marked foreground, which is correct —
a uniform channel has no meaningful foreground structure.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
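The effect of the two threshold choices on a uniform channel can be seen in a tiny sketch. It assumes the foreground mask is computed with a strict `>` comparison, which is implied by the commit message but not shown; `foreground_mask` is a hypothetical helper.

```python
def foreground_mask(values, threshold):
    """Mark voxels strictly above the threshold as foreground (assumed rule)."""
    return [v > threshold for v in values]


constant_channel = [7.0, 7.0, 7.0, 7.0]

# Old default of 0.0: every voxel of a constant nonzero channel is foreground.
assert all(foreground_mask(constant_channel, 0.0))

# Threshold equal to the constant value: nothing is foreground.
assert not any(foreground_mask(constant_channel, 7.0))
```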
* fix: replace assert with ValueError in unpatchify
Asserts are stripped with python -O, turning shape mismatches into
cryptic reshape errors. An explicit ValueError with expected/actual
token counts fails deterministically in all runtime modes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add config composition tests for load_composed_config
The compose module had zero test coverage. These 12 tests cover
_deep_merge (flat, nested, list-replace, immutability) and
load_composed_config (no base, base:null, single/multiple/nested
bases, base-as-string, circular detection, ordering precedence).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: guard freeze_encoder against non-FCMAE architectures
freeze_encoder=True accessed self.model.encoder, which only exists on
FullyConvolutionalMAE. Using it with UNet3DBase or UNeXt2 would crash
with AttributeError. Now raises a clear ValueError instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add FNet3D and VSCyto3D training configs for dynacell SEC61B benchmark (#399)
* feat: add MinMaxSampled normalization transform
Port MinMaxSampled from the cell_diff_vs_viscy repo for percentile-
based min-max normalization to [-1, 1]. Supports p1_p99, p5_p95,
and min_max data ranges. Also extends LevelNormStats TypedDict in
both viscy-data and viscy-transforms with percentile fields (p1, p5,
p95, p99, min, max) and changes to total=False since not all zarr
stores have all stats.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use self.keys[0] in Batched* transforms for gpu_augmentations
Four Batched* dict transforms used next(iter(sample.keys())) or
next(iter(sample.values())) to get a reference tensor for
randomization. This fails when used in gpu_augmentations because
the batch dict has non-tensor keys like 'index' before the image
keys. Use self.keys[0] to always reference the first declared
transform key instead.
Also simplify DTypeLike -> type annotation in _noise.py to fix
jsonargparse introspection failure with numpy's complex DTypeLike
union type.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
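Why iteration order matters here can be shown with a plain dictionary. The batch contents and the helper names below are illustrative assumptions; tensors are replaced with nested lists so the sketch runs without extra dependencies.

```python
def reference_by_iteration(sample):
    """Old pattern: take whatever value comes first in the dict."""
    return next(iter(sample.values()))


def reference_by_declared_key(sample, keys):
    """New pattern: take the value under the first declared transform key."""
    return sample[keys[0]]


# A batch dict where a non-tensor entry precedes the image entries.
batch = {"index": (0, 1), "source": [[0.1, 0.2]], "target": [[0.3, 0.4]]}
declared_keys = ["source", "target"]
```

Here `reference_by_iteration(batch)` returns the `index` tuple, while `reference_by_declared_key(batch, declared_keys)` returns the `source` data regardless of insertion order.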
* feat: log VSUNet hyperparameters to WandB Config tab
Add self.save_hyperparameters(ignore=["loss_function"]) in
VSUNet.__init__ so architecture, model_config, lr, and schedule
are logged to WandB's Config section. loss_function is excluded
because Lightning already saves nn.Module state in checkpoints.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add FNet3D and VSCyto3D training configs for SEC61B benchmark
Composable config recipes and leaf training configs for FNet3D and
VSCyto3D (UNeXt2) on AICS iPSC SEC61B (ER) dataset, targeting
architecture comparison with CellDiff UNetViT3D.
New recipes: fit_1gpu trainer, hcs_sec61b_3d data (MinMax p1_p99
normalization + GPU augmentations), fnet3d_z8 and unext2_3d_z8
model presets. Leaf configs compose these with model-appropriate
hyperparameters (FNet3D: lr=1e-3, batch=32; VSCyto3D: lr=2e-4,
batch=16). SLURM scripts target 1x H200.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: extract shared _match_image and precompute data_range keys
Deduplicate _match_image from NormalizeSampled and MinMaxSampled
into a module-level function. Precompute data_range -> (low_key,
high_key) mapping in MinMaxSampled.__init__ to avoid per-sample
string dispatch. Also fix DTypeLike docstrings in _noise.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add to_numpy helper for mixed-precision tensor conversion
NumPy does not support bfloat16, so bf16 tensors from AMP/autocast
crash on .numpy(). Add a shared to_numpy() helper in viscy_utils
that casts floating tensors to float32 before conversion. Integer
and boolean tensors preserve their dtype.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: replace .numpy() with to_numpy() at all external boundaries
Mixed-precision training produces bf16 tensors that crash on
.numpy() when passed to NumPy, CellPose, sklearn, wandb, zarr,
or AnnData. Replace all .detach().cpu().numpy() patterns with the
new to_numpy() helper across logging, callbacks, evaluation, and
application engines.
Also removes redundant .cpu() calls in embedding_writer and
apply_mlp_embedder since to_numpy() handles device transfer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: narrow to_numpy to only cast bfloat16, preserve fp64
The blanket float→fp32 cast silently discarded float64 precision
in evaluation code like pairwise_distance_matrix, which explicitly
uses .double() for numerical accuracy. Only bfloat16 is unsupported
by NumPy; fp16/fp32/fp64 all have native equivalents.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: verify NormalizeSampled output values, not just shape
The test only asserted shape and metadata presence, missing actual
normalization correctness. Now computes (x - mean) / (std + eps)
on known inputs and asserts the result matches.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: restore per-batch CPU offload in MLP embedder
Removing .cpu() from the accumulation loop kept all encoded batches
on GPU until final concatenation, causing memory to grow with
dataset size. Restore immediate CPU offload so GPU memory stays
flat during large embedding exports.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: create slurm_log directory before job starts
SLURM opens --output/--error files before the script body runs.
Without the directory, jobs fail immediately on a clean checkout.
Create with mode 775 for group-write access on shared HPC storage.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: avoid GPU concatenation in embedding writer
Convert each prediction to numpy individually and concatenate on
CPU with np.concatenate instead of torch.cat on GPU. Prevents
transient GPU memory spikes during large prediction runs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add public Dynacell benchmark app and Stage 2 training fixes (#397)
* feat: add public Dynacell benchmark application (Stage 2)
Create applications/dynacell/ as a thin supervised virtual staining
benchmark app consuming UNetViT3D and FNet3D from viscy-models.
Engine: DynacellUNet(LightningModule) with model-aware
example_input_array, fg_mask/Spotlight support, and explicit
NotImplementedError for predict_step (Stage 3 scope).
Configs: base-composition recipes for UNetViT3D and FNet3D fit,
including data, trainer, and Spotlight mode overlays.
Tests: 17 tests covering init, forward, spatial rejection,
fast_dev_run (synthetic + real OME-Zarr), Spotlight+fg_mask,
and config class_path resolution.
Workspace: remove dynacell from uv exclude, add to sources.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add else-branch for invalid schedule and use torch.stack
configure_optimizers silently fell through for unrecognized schedule
strings, causing a NameError at training time. Also replace
torch.tensor(losses) with torch.stack(losses) since elements are
already tensors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use uv run in usage examples and clarify unsupported subcommands
CLAUDE.md requires uv run for all commands. Also distinguish predict
(explicit NotImplementedError) from test (no test_step override,
Lightning default fails on batch dict) in README.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: rename test config dicts to VIT_TEST_CONFIG/FNET_TEST_CONFIG
Clearer naming distinguishes test-size configs from production defaults.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: correct test subcommand error description in README
Lightning raises MisconfigurationException when test_step is missing,
not a batch-dict failure from the default implementation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: simplify training_step and remove logging noise
- Drop multi-batch loop in training_step: Dynacell uses a single
HCSDataModule with no CombinedDataModule, so the loop always ran
once and was speculative abstraction per CLAUDE.md
- Remove .to(self.device) on loss tensors: loss is already on the
correct device; the no-op calls are misleading
- Remove unused _logger module-level variable
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add return type annotation and fix logger comment in engine
- training_step was missing -> Tensor return annotation
- example_input_array comment said TensorBoard specifically;
W&B and other loggers also consume it
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: save hyperparameters, weight validation loss, fix scheduler, guard log_samples
- Add save_hyperparameters(ignore=[loss_function, ckpt_path]) so
load_from_checkpoint works and hparams are logged to experiment trackers
- Weight loss/validate by batch size per dataloader and across dataloaders;
unweighted mean biases toward smaller tail batches when val set is not
divisible by batch_size
- Fix WarmupCosineSchedule t_total: use estimated_stepping_batches instead
of max_epochs — the scheduler is step-based, so passing epoch count caused
LR collapse after ~1 epoch (200 steps vs 50k steps for fnet3d)
- Guard _log_samples against empty input: log_image_grid crashes via
np.concatenate([]) when log_batches_per_epoch=0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
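The bias from an unweighted mean over batches is easy to see with numbers. This is a minimal sketch with plain floats; `weighted_mean` is a hypothetical helper and the loss values are made up for illustration.

```python
def weighted_mean(losses, batch_sizes):
    """Average per-batch mean losses, weighting each by its batch size."""
    total = sum(batch_sizes)
    return sum(loss * n for loss, n in zip(losses, batch_sizes)) / total


# Two full batches of 4 and one tail batch of 1 with a high loss.
losses = [1.0, 1.0, 4.0]
batch_sizes = [4, 4, 1]

unweighted = sum(losses) / len(losses)        # tail batch counts as 1/3
weighted = weighted_mean(losses, batch_sizes)  # tail batch counts as 1/9
assert weighted < unweighted
```

The weighted result equals the mean over all nine samples, which is what a per-sample validation loss should report.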
* fix: fix GPU device mismatch and WarmupCosine step interval
sizes_t was created on CPU while losses are on the model device in
GPU/DDP training, causing a device-mismatch crash at the end of the
first validation epoch. Fix by passing device=losses[0].device.
WarmupCosineSchedule is parameterized with estimated_stepping_batches
(a step count) but was returned as a bare scheduler, which Lightning
steps once per epoch by default. Return it as a config dict with
interval="step" so the LR decays at the correct granularity.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
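The LR collapse from passing an epoch count as `t_total` can be reproduced with a generic warmup-plus-cosine multiplier. This is a sketch of the general schedule shape under stated assumptions, not the WarmupCosineSchedule implementation; the function name and the clamping of progress to 1.0 are choices made for this example.

```python
import math


def warmup_cosine_factor(step, warmup_steps, t_total):
    """LR multiplier at a given optimizer step (illustrative formula)."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, t_total - warmup_steps)
    progress = min(1.0, progress)  # assumption: hold at the floor after t_total
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# t_total mistakenly set to an epoch count of 200: decay finishes at step 200.
collapsed = warmup_cosine_factor(200, 0, 200)

# t_total set to the real step budget of 50k: step 200 is barely decayed.
healthy = warmup_cosine_factor(200, 0, 50_000)
assert collapsed < 1e-12 and healthy > 0.99
```

The same reasoning applies to the stepping interval: a schedule parameterized in steps has to be advanced once per optimizer step, not once per epoch.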
* docs: document data_path requirement in README usage example
The configs ship with data_path unset (null). Without a note,
users following the README command hit a jsonargparse error before
training starts. Show the CLI override pattern explicitly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: correct 3D positional embedding token ordering and add validation
The meshgrid used default xy indexing which produces depth-varies-
fastest ordering, but PatchEmbed3D flattens (B,C,D,H,W) in C-order
where depth varies slowest. Switch to ij indexing so depth tokens
are contiguous. Also replace assert statements with ValueError in
positional embedding functions, and add spatial divisibility
validation to UNet3DBase.forward.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: cache _divisor in UNet3DBase, remove redundant Unet3d.forward
UNet3DBase.forward recomputed 2**num_blocks on every call. Cache it
at init as _divisor. Since the base class now validates spatial
divisibility, Unet3d no longer needs its own forward override or
separate _divisor — remove both to eliminate the duplication.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf: cache channel lists and use stdlib randint in SlidingWindowDataset
__getitem__ rebuilt combined channel name/index lists via .copy() +
.extend() on every call. Cache them at init since they never change.
Also replace torch.randint (allocates a tensor) with random.randint
for the nonzero-retry index — ~10x faster for scalar int sampling.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use torch RNG for retry index to respect global seed
random.randint uses a separate RNG from torch.manual_seed, making
retry sampling non-reproducible when seed_everything is set. Revert
to torch.randint but use a 0-dim tensor to minimize allocation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add deterministic predict support to Dynacell (Stage 2.5)
Replace predict_step NotImplementedError with DivisiblePad-based
tiled inference, matching Cytoland's proven pattern. FNet3D pads
all spatial dims; UNetViT3D tiles must match input_spatial_size.
Adds predict configs, integration tests for both architectures,
and HCSPredictionWriter pipeline verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add git workflow rules to CLAUDE.md
Codify no-amend, no-force-push, atomic commits, and explicit
staging as project-level instructions for Claude Code sessions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: correct key-sharing and axis bugs in batched GPU transforms (#400)
* fix: correct key-sharing and axis bugs in BatchedRandAffined and BatchedRand3DElasticd
BatchedRandAffined generated independent random params per key, causing
source/target misalignment during training. Also, scale_range was
incorrectly axis-inverted and Kornia sampled per-axis independently.
BatchedRand3DElasticd had the same per-key bug plus a displacement axis
swap (D↔W) when mapping to grid_sample, and a double probability gate.
Fixes:
- Generate affine params once via forward_parameters(), reuse for all keys
- Support flat (min,max) and per-axis ZYX scale_range with isotropic option
- Generate elastic displacement field once, reuse for all keys
- Correct displacement D,H,W → grid X,Y,Z axis mapping
- Remove double probability gate and unused spatial_size in elastic
- Add 12 tests covering key consistency, axis ordering, and scale behavior
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
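The key-sharing fix can be sketched with plain lists: random parameters are drawn once per call and reused for every key, so source and target stay aligned. The functions below are hypothetical illustrations, with a circular shift standing in for the affine or elastic transform. They are not the BatchedRandAffined code.

```python
import random


def roll(values: list[int], shift: int) -> list[int]:
    """Circularly shift a list; a stand-in for applying a spatial transform."""
    shift %= len(values)
    return values[-shift:] + values[:-shift]


def augment_per_key(sample: dict, keys: list[str], rng: random.Random) -> dict:
    """Buggy pattern: a fresh random shift is drawn for every key."""
    return {k: roll(sample[k], rng.randint(0, 3)) for k in keys}


def augment_shared(sample: dict, keys: list[str], rng: random.Random) -> dict:
    """Fixed pattern: draw the shift once and reuse it for every key."""
    shift = rng.randint(0, 3)
    return {k: roll(sample[k], shift) for k in keys}


sample = {"source": [0, 1, 2, 3, 4], "target": [0, 1, 2, 3, 4]}
shared = augment_shared(sample, ["source", "target"], random.Random(0))
print(shared["source"] == shared["target"])
```

With `augment_per_key`, the two keys agree only when the two independent draws happen to coincide, which is the misalignment described above.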
* feat: update SEC61B training configs for H200 and external artifact storage
- Increase batch_size to 64 (FNet3D and VSCyto3D) and max_epochs to 100
- Move artifacts to /hpc/projects/comp.micro/virtual_staining/models/
- Set explicit checkpoint dirpath outside the repo
- Fix scale_range from [0.5, 1.5] to per-axis [[0.8, 1.2], [0.7, 1.3],
[0.7, 1.3]] matching original VSCyto3D augmentation ranges
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address code review — tests, imports, docs, and allow_missing_keys
- Strengthen elastic axis ordering test to actually catch D↔W swap
- Strengthen isotropic scale test to verify params directly
- Import from public viscy_transforms API in tests, not private modules
- Replace unused loop variable `b` with `_` in elastic field generation
- Document batch-level probability semantics in elastic docstring
- Raise ValueError when isotropic_scale=True combined with per-axis ranges
- Guard against KeyError when allow_missing_keys=True and no keys present
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: validate scale_range length to catch malformed 3-scalar inputs
_parse_scale_range now raises ValueError for inputs like [0.2, 0.3, 0.3]
(3 bare floats) instead of passing them to Kornia which crashes with an
unhelpful shape error.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
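A sketch of the validation described above, assuming only the two accepted shapes named in this PR: a flat (min, max) pair and three per-axis (min, max) pairs. `parse_scale_range_sketch` is a hypothetical function; the real `_parse_scale_range` signature and return type are not shown in this log.

```python
from collections.abc import Sequence
from numbers import Real


def parse_scale_range_sketch(scale_range: Sequence) -> list[tuple[float, float]]:
    """Normalize scale_range to three per-axis (min, max) pairs (hypothetical)."""
    items = list(scale_range)
    # Flat (min, max): the same range is used for all three axes.
    if len(items) == 2 and all(isinstance(v, Real) for v in items):
        return [(float(items[0]), float(items[1]))] * 3
    # Per-axis: exactly three (min, max) pairs.
    if len(items) == 3 and all(
        isinstance(pair, Sequence) and len(pair) == 2 for pair in items
    ):
        return [(float(lo), float(hi)) for lo, hi in items]
    # Anything else, including three bare floats, is rejected early.
    raise ValueError(
        f"scale_range must be (min, max) or three (min, max) pairs, got {items!r}"
    )


print(parse_scale_range_sketch([0.5, 1.5]))
print(parse_scale_range_sketch([[0.8, 1.2], [0.7, 1.3], [0.7, 1.3]]))
```

Rejecting the malformed input at parse time gives the user a message about `scale_range` itself instead of a downstream shape error.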
* fix: use 2/(size-1) for align_corners=True displacement normalization
With align_corners=True, grid_sample maps [-1, 1] to pixel [0, N-1],
so 1 voxel displacement = 2/(N-1). The old 2/N formula under-scaled
displacements. Pre-existing bug, fixed while in the area.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
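The normalization can be checked with a few lines of arithmetic. The helper `voxels_to_normalized` below is hypothetical. It converts a displacement in voxels to the [-1, 1] coordinate units used by grid_sample, for both `align_corners` conventions.

```python
def voxels_to_normalized(
    displacement_voxels: float, size: int, align_corners: bool = True
) -> float:
    """Convert a voxel displacement to normalized grid units (hypothetical)."""
    if align_corners:
        if size < 2:
            raise ValueError("align_corners=True needs at least 2 voxels per axis.")
        # [-1, 1] spans voxel centers 0 .. size-1, i.e. size-1 voxel steps.
        return displacement_voxels * 2.0 / (size - 1)
    # [-1, 1] spans the outer voxel edges, i.e. size voxel widths.
    return displacement_voxels * 2.0 / size


size = 5
one_voxel_new = voxels_to_normalized(1.0, size, align_corners=True)
one_voxel_old = 1.0 * 2.0 / size  # the previous 2/N formula
print(one_voxel_new, one_voxel_old)
```

For a 5-voxel axis, the old formula yields 0.4 normalized units per voxel where 0.5 is needed, so every displacement was under-scaled by a factor of (N-1)/N.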
---------
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add shear support with Z-proportional scaling and oversized crop pipeline
- Fix shear_range: remove incorrect _radians_to_degrees and _invert_per_axis
(shear is in degrees, not radians; facets are not ZYX-ordered axes)
- Add _parse_shear_range: supports (min,max) isotropic, 3-value MONAI
shorthand [s_zy, s_zx, s_yz], and 6 per-facet (min,max) pairs
- Add scale_z_shear option (default True): scales Z-related shear facets
by z_depth/yx_size to prevent destructive shear on thin Z volumes
- Update config: yx_patch_size 512, center crop to 256 after affine,
re-enable shear [0.0, 3.0, 3.0] matching original VSCyto3D
- Use mul_ in-place for shear scaling, remove unnecessary .contiguous()
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: match original VSCyto3D normalization and noise params
Switch normalization from MinMax p1_p99 to mean/std (source) and
median/iqr (target) to match the original VSCyto3D pipeline.
Reduce Gaussian noise std from 5.0 to 1.0 to match original.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: standardize W&B run naming and grouping
* feat: add BatchedRandWeightedCropd GPU-batched weighted spatial crop
Samples crop positions proportional to a spatial importance map using
avg_pool2d for per-window weights and torch.multinomial for sampling.
Replaces the fixed center-crop approach so training can focus on
signal-rich regions within each FOV.
Pipeline: CPU 512x512 → weighted crop 384x384 → affine → center 256x256.
Registered in _SPATIAL_TRANSFORMS for fg_mask co-alignment.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
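The sampling idea can be sketched in pure Python: sum the importance map over every candidate crop window, then draw a window with probability proportional to that sum. This is a simplified 2D stand-in with hypothetical function names. The real transform is described as using avg_pool2d and torch.multinomial on batched tensors, which this sketch does not reproduce.

```python
import random


def window_weights(weight_map: list[list[float]], crop: int) -> dict:
    """Total weight inside every crop-sized window (stride 1)."""
    height, width = len(weight_map), len(weight_map[0])
    weights = {}
    for top in range(height - crop + 1):
        for left in range(width - crop + 1):
            weights[(top, left)] = sum(
                weight_map[y][x]
                for y in range(top, top + crop)
                for x in range(left, left + crop)
            )
    return weights


def sample_crop_origin(
    weight_map: list[list[float]], crop: int, rng: random.Random
) -> tuple[int, int]:
    """Draw a (top, left) origin with probability proportional to its weight."""
    weights = window_weights(weight_map, crop)
    origins = list(weights)
    # random.choices raises ValueError if all weights are zero.
    return rng.choices(origins, weights=[weights[o] for o in origins], k=1)[0]


# A 4x4 map with all importance concentrated in the bottom-right voxel.
weight_map = [[0.0] * 4 for _ in range(4)]
weight_map[3][3] = 1.0
print(sample_crop_origin(weight_map, crop=2, rng=random.Random(0)))
```

An all-zero importance map needs a fallback (for example uniform sampling); how the real transform handles that case is not stated in this log.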
* feat: add UNeXt2 architecture support to DynacellUNet
Register UNeXt2 in the _ARCHITECTURE dict so Dynacell can own
SEC61B benchmarks that use the UNeXt2 (VSCyto3D) backbone.
Includes unit tests (init, forward, predict_step), a fast_dev_run
integration test with YX=64 fixtures, and a predict-to-OME-Zarr
integration test.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add SEC61B benchmark configs and launch scripts to dynacell
Dynacell becomes the canonical launch owner for SEC61B benchmarks
(FNet3D + UNeXt2). Includes data/model/trainer recipes, leaf
configs with MixedLoss and WarmupCosine, and H200 SLURM scripts.
Cytoland copies will be marked as transitional legacy in a
follow-up commit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: mark cytoland SEC61B configs as transitional legacy paths
Dynacell is now the canonical launch owner for SEC61B and FNet3D
benchmarks. Add legacy markers to cytoland configs pointing to
their dynacell replacements, and note the change in the cytoland
README. No files deleted — cleanup deferred until dynacell runs
are validated on H200.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use valid per-axis scale_range format in cached_pretrain config
After PR #400 tightened _parse_scale_range validation, bare 3-float
sequences like [0.2, 0.3, 0.3] raise ValueError. Convert to explicit
per-axis (min, max) tuples matching the intended scale ranges.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add competing-locations multichannel weight test
The original multichannel test only verified sum-over-C at one spatial
location. Add a test with deltas at opposite corners (each in exactly
one crop window) to verify that higher total weight across channels
biases sampling toward the stronger location.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add paper-baseline FNet SEC61B config
* feat: finalize SEC61B FNet configs
* refactor: move spatial cropping from CPU _final_crop to GPU transforms
Remove the hardcoded CenterSpatialCropd from HCSDataModule._final_crop()
which coupled the data module to cropping logic and eliminated spatial
diversity during training (always center-cropped). Cropping now happens
on GPU in on_after_batch_transfer:
- Training with gpu_augmentations: user-specified transforms handle crop
- Training without: default BatchedRandSpatialCropd to yx_patch_size
- Validation: deterministic BatchedCenterSpatialCropd to yx_patch_size
- Test/predict: pass through unchanged
Also moves target_2d Z slicing from on_before_batch_transfer to after
the GPU crop in on_after_batch_transfer to avoid Z dimension mismatch.
Adds BatchedRandSpatialCropd to the FNet paper config's gpu_augmentations
for random spatial sampling across the full FOV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: extract _pad_forward_crop and fix predict unpadding
Replace _predict_pad.inverse() with explicit center-crop to the
pre-pad shape for more reliable unpadding. Extract the repeated
pad→forward→crop pattern into _pad_forward_crop helper.
Also add tuple-merging support in FcmaeUNet._merge_batches so
heterogeneous batch entries (e.g. index tuples with mixed Tensor
and list elements) are correctly concatenated across sub-batches
instead of silently dropping all but the first.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
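The pad, forward, crop pattern can be sketched on 1D lists. The helpers below are hypothetical and assume symmetric zero padding and a shape-preserving forward function. They are not the `_pad_forward_crop` implementation, which operates on tensors.

```python
from collections.abc import Callable


def divisible_pad_1d(values: list[float], k: int) -> list[float]:
    """Zero-pad symmetrically so that the length is divisible by k."""
    total = (-len(values)) % k
    before = total // 2
    return [0.0] * before + list(values) + [0.0] * (total - before)


def center_crop_1d(values: list[float], length: int) -> list[float]:
    """Crop the centered window of the requested length."""
    start = (len(values) - length) // 2
    return values[start : start + length]


def pad_forward_crop(
    values: list[float], forward: Callable[[list[float]], list[float]], k: int
) -> list[float]:
    """Pad to a multiple of k, run forward, then crop back to the input length."""
    padded = divisible_pad_1d(values, k)
    output = forward(padded)
    # Explicit crop to the pre-pad length; no transform metadata is needed.
    return center_crop_1d(output, len(values))


result = pad_forward_crop(
    [1.0, 2.0, 3.0, 4.0, 5.0], lambda xs: [2 * x for x in xs], k=4
)
print(result)
```

The crop start equals the number of voxels padded in front, so the output lines up with the input even when the total padding is odd.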
* refactor: decouple HCSDataModule from viscy_transforms
Remove BatchedCenterSpatialCropd/BatchedRandSpatialCropd import and
fallback crop construction from HCSDataModule. The data module should
not depend on the transforms package — cropping is the user's
responsibility via gpu_augmentations config.
Replace fallback crops with a shape validation check that raises a
clear error when source spatial dims don't match yx_patch_size during
training/validation. Test/predict pass through unchanged (full FOV).
Also delete the no-op on_before_batch_transfer and gate the shape
check on training/validating only.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use context manager for open_ome_zarr in CachedOmeZarrDataModule
Same pattern fixed in HCSDataModule by commit 462f653 — zarr store
opened without `with` statement leaks file handles if an exception
occurs before close.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: restrict spatial shape check to training only
Validation runs at full FOV without gpu_augmentations, so the shape
check against yx_patch_size would always fail. Per-pixel MSE is
averaged over pixels and does not depend on image size, so full-FOV
validation is valid. Training still validates that gpu_augmentations
produce the expected patch size.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf: add per-worker FOV cache to SlidingWindowDataset
Consecutive sliding window samples often come from the same FOV,
causing redundant zarr chunk decompression across calls. Cache the
decompressed FOV array per worker using lru_cache so subsequent
Z windows from the same FOV are pure numpy slices (no I/O).
Each DataLoader worker fork gets its own cache. Default 5 FOVs
per worker (~1 GB at 200 MB/FOV for 2-channel SEC61B data).
Configurable via fov_cache_maxsize parameter (0 to disable).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: cache per-timepoint in FOV cache, not entire T dimension
The FOV cache was reading all timepoints into memory but only using
one per sample. For multi-timepoint datasets this wastes T× memory.
Key the cache on (arr_idx, t, ch_idx) and read a single timepoint
per entry: (1, C, Z, Y, X) instead of (T, C, Z, Y, X).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
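A minimal sketch of the caching scheme, assuming a cache keyed on (arr_idx, t, ch_idx) as described above. The reader, the nested-list "volume", and the function names are hypothetical; the real dataset reads zarr arrays.

```python
from functools import lru_cache

LOAD_CALLS = []  # records every simulated disk read


def read_timepoint(arr_idx: int, t: int, ch_idx: tuple[int, ...]) -> list:
    """Pretend to decompress one timepoint of one FOV (hypothetical reader)."""
    LOAD_CALLS.append((arr_idx, t, ch_idx))
    # Nested list standing in for a (C, Z) volume with Z=4.
    return [[f"fov{arr_idx}-t{t}-c{c}-z{z}" for z in range(4)] for c in ch_idx]


def make_cached_reader(maxsize: int):
    """Wrap the reader in an LRU cache; maxsize=0 disables caching."""
    if maxsize == 0:
        return read_timepoint
    return lru_cache(maxsize=maxsize)(read_timepoint)


cached_read = make_cached_reader(maxsize=5)


def get_window(arr_idx: int, t: int, ch_idx: tuple[int, ...], z0: int, depth: int) -> list:
    """Return a Z window; repeated windows from one timepoint hit the cache."""
    volume = cached_read(arr_idx, t, ch_idx)
    return [channel[z0 : z0 + depth] for channel in volume]


get_window(0, 0, (0, 1), 0, 2)  # first access: one read
get_window(0, 0, (0, 1), 1, 2)  # same FOV and timepoint: served from cache
get_window(0, 1, (0, 1), 0, 2)  # new timepoint: one more read
print(LOAD_CALLS)
```

The channel indices are passed as a tuple because `lru_cache` requires hashable arguments. Each worker process that imports this module would hold its own cache.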
* fix: replace predict_pad.inverse with center-crop in dynacell engine
Same fix applied to cytoland in e4171aa — DivisiblePad.inverse relies
on MONAI metadata tracking which may not be active, producing wrong
output shapes. Use explicit _center_crop_to_shape instead.
Also update dynacell test yx_patch_size from (16,16) to (32,32) to
match the test fixture FOV size, since _final_crop no longer exists
to bridge the mismatch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: fix stale docstrings in cytoland engine
on_predict_start: remove incorrect claim about DivisiblePad.inverse
(replaced by _center_crop_to_shape in e4171aa).
_compute_loss: change "configuration time" to "training time" since
the TypeError is raised inside _compute_loss, not at init.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: move inline imports to top of file in test_hcs.py
CLAUDE.md requires top-level imports unless there is a strong reason
for inline. These monai/viscy_transforms imports have no circular
dependency or conditional import justification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: decouple viscy-data from viscy-transforms via is_spatial
Each transform now self-declares `is_spatial = True` (spatial) or
`is_spatial = False` (intensity). foreground_masks.py checks this
attribute instead of importing concrete classes from viscy-transforms,
eliminating the undeclared runtime dependency.
An MRO-based fallback detects raw MONAI spatial transforms (from
monai.transforms.spatial or monai.transforms.croppad) without imports.
Also fixes a pre-existing gap: BatchedRand3DElasticd, BatchedZoomd,
and BatchedRandZStackShiftd are now correctly identified as spatial.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
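The attribute check and the MRO-based fallback can be sketched as follows. `is_spatial_transform` and the fake classes are hypothetical. The sketch assumes the fallback matches on module-name prefixes, which is one way to detect MONAI classes without importing them.

```python
_SPATIAL_MODULE_PREFIXES = ("monai.transforms.spatial", "monai.transforms.croppad")


def is_spatial_transform(transform: object) -> bool:
    """Decide whether a transform moves pixels (hypothetical helper)."""
    declared = getattr(transform, "is_spatial", None)
    if declared is not None:
        return bool(declared)
    # Fallback: inspect module names along the MRO, so nothing is imported.
    return any(
        cls.__module__.startswith(_SPATIAL_MODULE_PREFIXES)
        for cls in type(transform).__mro__
    )


class FakeAffine:
    is_spatial = True  # self-declared spatial transform


class FakeNoise:
    is_spatial = False  # self-declared intensity transform


class FakeMonaiCrop:
    # Pretend this class was defined inside MONAI's croppad package.
    __module__ = "monai.transforms.croppad.dictionary"


class UserCrop(FakeMonaiCrop):
    """User subclass without an is_spatial attribute."""


for transform in (FakeAffine(), FakeNoise(), UserCrop(), object()):
    print(type(transform).__name__, is_spatial_transform(transform))
```

Because the decision relies on an attribute and on strings, the consuming package needs no import of the transforms package at all.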
* Add flow matching celldiff (#402)
* feat: port flow-matching transport module to viscy-models
Port ODE/SDE solvers, coupling plans (ICPlan, VPCPlan, GVPCPlan), and
Transport/Sampler classes from the CellDiff fork into
viscy_models.celldiff.modules.transport. Add torchdiffeq to the celldiff
optional dependency group. Include 7 unit tests.
Cleanup vs fork: th→torch, type hints, numpy docstrings, dead code
removed (EasyDict, log_state), bare except fixed, class names
capitalized (ODESolver/SDESolver), eval→is_eval parameter rename,
eager diffusion evaluation replaced with lazy if/elif dispatch,
CPU-to-GPU allocations fixed (randn_like/new_ones), SDE solver
returns only final state instead of all intermediates, silent .get()
fallbacks replaced with explicit ValueError on invalid config strings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add DynacellFlowMatching LightningModule and CellDiff configs
Add CELLDiff3DVS flow-matching wrapper (celldiff_wrapper.py) and
DynacellFlowMatching LightningModule to the dynacell application.
Includes fit/predict configs, flow-matching trainer recipe (no
validation loss monitor), and celldiff_fm model recipe.
Key fixes vs fork: save_hyperparameters added, WarmupCosine uses
estimated_stepping_batches with step interval, logger-agnostic image
logging via log_image_grid, validation_step captures batch for ODE
generation logging, configure_optimizers extracted to shared helper.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add flow-matching tests and update dynacell README
Add 4 engine smoke tests (instantiation, forward loss, generate shape,
predict pad/crop) and 3 training integration tests (WarmupCosine,
Constant schedule, predict-to-zarr) for DynacellFlowMatching.
Config class_path resolution auto-discovers new celldiff/*.yml configs.
Update README with flow-matching usage and architecture docs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: validate path_type in create_transport
Raises ValueError with a descriptive message for invalid path_type
instead of a raw KeyError, matching the validation already done for
the prediction and loss_weight parameters.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
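A sketch of the validation pattern, assuming a lookup table from string keys to the coupling-plan classes named earlier in this PR. The keys `"linear"`, `"vp"`, and `"gvp"`, the empty placeholder classes, and `resolve_path_type` are hypothetical; the actual accepted strings are not listed in this log.

```python
class ICPlan:
    """Placeholder for the linear coupling plan."""


class VPCPlan:
    """Placeholder for the variance-preserving coupling plan."""


class GVPCPlan:
    """Placeholder for the generalized variance-preserving coupling plan."""


# Hypothetical keys, chosen only for this illustration.
_PATH_TYPES = {"linear": ICPlan, "vp": VPCPlan, "gvp": GVPCPlan}


def resolve_path_type(path_type: str) -> type:
    """Return the plan class for path_type, or raise a descriptive ValueError."""
    try:
        return _PATH_TYPES[path_type]
    except KeyError:
        valid = ", ".join(sorted(_PATH_TYPES))
        raise ValueError(
            f"Invalid path_type {path_type!r}; expected one of: {valid}."
        ) from None


print(resolve_path_type("linear").__name__)
```

Using `from None` hides the internal KeyError from the traceback, so the user sees only the message that lists the valid options.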
* fix: use Callable type hint instead of lowercase callable
Replace built-in callable with collections.abc.Callable in type
annotations throughout integrators.py and transport.py. The lowercase
callable is the built-in function, not a valid type annotation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
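For reference, a small example of the annotation style this commit adopts. `euler_step` is a hypothetical function written for this illustration and is not taken from integrators.py.

```python
from collections.abc import Callable


def euler_step(
    drift: Callable[[float, float], float], x: float, t: float, dt: float
) -> float:
    """Advance x by one explicit Euler step of dx/dt = drift(x, t)."""
    return x + dt * drift(x, t)


# The built-in callable() remains useful as a runtime check, not as a type.
print(callable(euler_step))
print(euler_step(lambda x, t: -x, 1.0, 0.0, 0.1))
```

`Callable[[float, float], float]` also documents the expected argument and return types, which the bare built-in cannot express.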
* fix: slice validation batch at capture time in DynacellFlowMatching
Only clone log_samples_per_batch samples instead of the full batch
when capturing validation data for epoch-end ODE generation logging.
Avoids holding unused GPU memory for the entire validation epoch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: remove dead test config constants from conftest
CELLDIFF_TEST_NET_CONFIG and CELLDIFF_TEST_TRANSPORT_CONFIG in
conftest.py were never used by any fixture. The test files define
their own copies because --import-mode=importlib prevents importing
from conftest directly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: move SDE solver time grid to input device
SDESolver created self.t and self.dt on CPU in __init__, causing
device mismatch when sample() operates on GPU tensors. Move to
input device at sample() start and pass dt explicitly to step
functions, matching the pattern already used by ODESolver.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address numerical and correctness bugs in …
edyoshikun
commented
Apr 21, 2026
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
* Exempt examples/ from docstring and line-length lint rules Tutorial-style scripts under applications/*/examples use jupytext-style percent cells, markdown docstrings, and long URL references that trip D205/D400/D100/D103 and E501. Treat them like notebooks and tests. * Port VS_model_inference demos from main into applications/cytoland Bring back the four VSCyto inference demos (VSCyto2D, VSCyto3D, VSNeuromast, and TTA-augmented) plus plot.py helper from examples/virtual_staining/VS_model_inference on main. Imports are updated to the modular package layout: - viscy.data.hcs -> viscy_data.hcs - viscy.transforms -> viscy_transforms - viscy.trainer -> viscy_utils.trainer - viscy.translation.engine -> cytoland.engine - viscy.translation.predict_writer -> viscy_utils.callbacks * Port dlmbl_exercise tutorial from main into applications/cytoland Bring back the DL@MBL 2024 image-translation exercise (solution.py, README, setup.sh, prepare-exercise.sh) from examples/virtual_staining/dlmbl_exercise on main. Notebooks are not ported; regenerate from solution.py via jupytext if needed. Imports are updated to the modular package layout: - viscy.data.hcs -> viscy_data.hcs - viscy.transforms -> viscy_transforms - viscy.trainer -> viscy_utils.trainer - viscy.translation.engine.VSUNet -> cytoland.engine.VSUNet - viscy.translation.engine.MixedLoss -> viscy_utils.losses.MixedLoss - viscy.translation.evaluation_metrics.mean_average_precision -> viscy_utils.evaluation.metrics.mean_average_precision setup.sh now installs applications/cytoland[metrics] editable instead of the legacy top-level viscy[metrics,visual,examples]>=0.2 wheel, plus cellpose and torchview as extra tutorial dependencies. * Extend examples/ lint exemption to cover E402 and F821 Tutorial-style scripts routinely import late inside % cells (E402) and reference notebook builtins like get_ipython without importing them (F821). 
Extend the per-file-ignores so the vcp tutorial scripts lint cleanly alongside the existing D and E501 exemptions. * Port vcp_tutorials from main into applications/cytoland Bring back the Virtual Cell Platform tutorials (quick_start.py, hek293t.py, neuromast.py, README.md) from examples/virtual_staining/vcp_tutorials on main. Notebooks are not ported; regenerate from .py via jupytext if needed. quick_start.py imports are updated to the modular package layout: - viscy.data.hcs -> viscy_data.hcs - viscy.transforms -> viscy_transforms - viscy.trainer -> viscy_utils.trainer - viscy.translation.engine.FcmaeUNet -> cytoland.engine.FcmaeUNet - viscy.translation.predict_writer -> viscy_utils.callbacks hek293t.py and neuromast.py have no viscy Python imports (they demonstrate the viscy preprocess / viscy predict CLIs), so no code changes are required. The commented-out pip install "viscy[...]" hints are left as-is for historical reference. * Document ported examples in cytoland README Add a Tutorials and demos section listing VS_model_inference, vcp_tutorials, dlmbl_exercise, and configs, so the newly ported examples are discoverable from the cytoland landing page. * Port phase_contrast tutorial from main into applications/cytoland Bring back the phase-contrast virtual staining tutorial (solution.py, README, setup.sh, prepare-exercise.sh) from examples/virtual_staining/phase_contrast on main. Notebooks are not ported; regenerate from solution.py via jupytext if needed. Imports are updated to the modular package layout: - viscy.data.hcs -> viscy_data.hcs - viscy.transforms -> viscy_transforms - viscy.trainer -> viscy_utils.trainer - viscy.translation.engine.VSUNet -> cytoland.engine.VSUNet setup.sh now installs applications/cytoland[metrics] editable instead of the legacy top-level viscy[metrics,visual,examples]>=0.2 wheel.
* Teach PyTorch Lightning in the dlmbl exercise Add a concise Lightning primer before Part 1 (three-object mental model: LightningDataModule, LightningModule, Trainer; what trainer.fit replaces) so learners unfamiliar with Lightning can follow the rest of the exercise. Expand inline explanations at the three points where Lightning concepts first appear in code: - HCSDataModule section: describe the DataModule role and the exact dict shape (source, target, index) yielded to training_step. - VSUNet instantiation: frame the class as the LightningModule that bundles network, loss, and per-batch logic; gloss lr, schedule, freeze_encoder, log_batches_per_epoch. - Trainer constructors: annotate fast_dev_run, accelerator, devices, precision="16-mixed", max_epochs, log_every_n_steps, and TensorBoardLogger; spell out what trainer.fit does internally so learners see it replacing a hand-written training loop. Markdown only, no code changes. * Sharpen dlmbl exercise teaching material Six concise additions for learners new to deep learning and Lightning: 1. Typos: fluoresecence -> fluorescence, componets -> components, Person Correlation -> Pearson Correlation. 2. Tighten the OME-Zarr / HCS section: one-paragraph primer on the row/col/field/level/T/C/Z/Y/X hierarchy before open_ome_zarr is called. 3. Add an augmentation motivation table mapping each MONAI transform to the real-world microscope variation it simulates. 4. Add a pre-model markdown cell explaining the UNeXt2 config (encoder_blocks, dims, decoder_conv_blocks, stem_kernel_size, in_stack_depth) in U-Net terms. 5. Same cell explains MixedLoss (L1 + MS-SSIM tradeoff) and the WarmupCosine LR schedule. 6. Part 2 opener distinguishes regression vs segmentation metric families; fill in the Task 2.1 TODO with real Pearson/SSIM definitions tied to the image-translation setting. Markdown only, no code changes. 
* Apply 2025 DL@MBL feedback to dlmbl exercise Issue dl-janelia/image_translation#16: - Fix the nested duplicate for-loop after Task 1.1 that shadows the loop variable (single loop now). - Replace the broken viscy.transforms source link with authoritative MONAI docs links for RandAffined and RandGaussianNoised. - Add explicit path + public download URL for the VSCyto2D pretrained checkpoint next to Task 2.2 so students don't guess where it went. - Switch spatial dimension labels from (D, H, W) to (Z, Y, X) in Part 3 Gaussian-blur tasks to match the convention used everywhere else in the exercise. - Note MicroSSIM as a microscopy-appropriate SSIM variant in the metrics primer. - Warn students to restart the kernel before re-running training to release GPU memory (CUDA OOM prevention). * Migrate dlmbl setup from conda to uv Rewrite setup.sh to use uv instead of conda so students can provision the exercise without a working conda install. The script now: - Installs uv via the official installer if missing. - Creates a Python 3.11 venv at .venv/ inside the exercise folder. - Installs cytoland editable from the monorepo plus the tutorial extras (cellpose, torchview, jupyter, ipykernel, ipywidgets, jupytext, nbformat, nbconvert) into that venv. - Registers the venv as a Jupyter kernel named 06_image_translation so VSCode and JupyterLab both surface it by default. - Downloads the training / test OME-Zarr stores and the VSCyto2D pretrained checkpoint into ~/data/06_image_translation/. Script is bash strict-mode (set -euo pipefail) and resolves the monorepo root relative to the script path so it works regardless of the caller's cwd. README updated to match the new flow and document the VSCode / Jupyter kernel selection. * Replace SSIM with microSSIM in dlmbl exercise Classic skimage SSIM assumes natural-image dynamic range and scores collapse into a narrow band on sparse, dim, noisy fluorescence microscopy predictions — so the metric can barely rank good vs bad outputs. 
microSSIM (Ashesh et al. 2024, arXiv:2408.08747) subtracts background and fits a per-image rescale before running SSIM, which restores sensitivity over the intensity range microscopy predictions actually live in. - Replace skimage.metrics.structural_similarity with microssim.micro_structural_similarity at all 6 call sites (two metrics blocks: single-model and phase2fluor vs pretrained). - Drop the now-unused from skimage import metrics import. - Rename DataFrame columns SSIM_nuc / SSIM_mem to microSSIM_nuc / microSSIM_mem so plots and saved CSVs name the metric correctly. - Rewrite the Part 2 Task 2.1 metric definition to explain WHY microSSIM instead of SSIM for microscopy. - Add microssim to setup.sh install list and to the README install summary. * Point dlmbl exercise to zarrv3 datasets * Rename dlmbl_exercise to dl-course-exercise * Split dl-course-exercise setup into TA and student scripts setup_TA.sh stages data + checkpoint to a shared DATA_ROOT (no env). setup_student.sh creates the per-user venv + kernel and skips the download when DATA_ROOT already has the data. * Bump dl-course-exercise to Python 3.13 and viscy>=0.5.0a0 setup_student.sh now defaults to python 3.13 and installs cytoland + viscy from PyPI when run outside a VisCy monorepo clone. When the monorepo is detected (and root pyproject is the viscy umbrella), it falls back to the existing editable workspace install. README updated to match. * Polish dl-course-exercise solution.py - Replace stale torch.no_grad() with torch.inference_mode() in the prediction-visualization block to match the rest of the file. - Renumber the first 'Task 1.5' (model instantiation) to 'Task 1.4'. The exercise was missing a Task 1.4, leaving two Task 1.5s. - Update the source-code reference under that task to point at VSUNet and the fcmae network (the Unet2D link was obsolete). 
* Polish prose in dl-course-exercise solution.py Three small wording fixes ported from the DL@Janelia version of this exercise where it had clearer phrasing: - Task 1.1 hint: 'what are your options' -> 'what your options are'. - Task 2.4: add the missing alert heading and move the section header outside the alert box (was structurally mismatched). - Task 2.4 question: typo fix 'How do yout model' -> 'How does your model'. * Port Task 2.5 (fluorescence -> phase) from DL@Janelia exercise Adds the Task 2.5 section that walks students through the inverse translation: predicting QPI from nuclei + membrane fluorescence using a pretrained model and Test-Time Augmentation. Inserted between Task 2.4 and Part 3. setup_TA.sh and setup_student.sh now also stage the fluor2phase_step668.ckpt checkpoint (previously commented out in TA, absent in student) since Task 2.5 loads it. The hardcoded /mnt/efs/... paths from the original Janelia version are replaced with the top_dir-based DATA_ROOT layout used elsewhere in this exercise. 
* Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * demos: remove invalid `architecture` kwarg from HCSDataModule calls HCSDataModule's __init__ (packages/viscy-data/src/viscy_data/hcs.py:104) does not accept an `architecture` parameter; the demos were raising TypeError before prediction. Architecture selection is on VSUNet only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * demo_vscyto_w_ttas: apply precomputed FOV stats normalization The in-memory sliding-window path was feeding the raw phase volume to the model while the CLI prediction path normalizes Phase3D with the precomputed FOV median / IQR (NormalizeSampled). Read those stats from .zattrs and apply them inline so this demo produces the same results as `viscy predict`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * phase_contrast setup: resolve cytoland path from script location setup.sh used "applications/cytoland[metrics]" which only resolved correctly when invoked from the monorepo root. The README instructs users to `source setup.sh` from the example directory, where that path doesn't exist.
Compute the cytoland directory from ${BASH_SOURCE[0]} so the install
works regardless of cwd.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* phase_contrast solution: drop unused trainer, make GPU placement explicit
VisCyTrainer was imported but never instantiated. Remove the dead
import and chain `.eval()` after `.to(inference_device)` for both
checkpoints so it's obvious the models run on GPU. Add a print of the
resolved device for visibility.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* dl-course solution: fix DATA_ROOT semantics, TTA ordering, GPU loads
- top_dir now points directly at the data root populated by
setup_student.sh / setup_TA.sh
($DATA_ROOT/{training,test,pretrained_models}), defaulting to
~/data/06_image_translation. Drop the dangling "06_image_translation/"
subpath everywhere top_dir is used.
- Move the TTA implementation into the same solution cell as the TTA
metrics so the generated solution.ipynb runs top-to-bottom (task cells
are stripped, so the later TTA solution cell was unreachable before
tta_pred_nuc/mem were referenced).
- Add the first-use imports (skimage.metrics, rescale_intensity) to the
fluor->phase metrics cell so it no longer raises NameError.
- VSUNet.load_from_checkpoint does not accept `accelerator=` (it is
forwarded to VSUNet.__init__ and raises TypeError). Drop the kwarg and
move the loaded module to GPU with .to(...) instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* dl-course README: recommend solution.ipynb path first, warn about task cells
Reorder the "Run the exercise" section to lead with the notebook
workflow (which strips task-tagged placeholder cells) and demote the
raw solution.py path to an advanced/VSCode option with an explicit
warning that task-tagged cells contain TODO/... placeholders and must
be skipped manually.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Eduardo Hirata-Miyasaki <edhiratam@gmail.com>
Co-authored-by: dihan.zheng <dihan.zheng@ucsf.edu>
fix(viscy-data): drop use_thread_workers to fix DDP deadlock (#413)
fix(viscy-data): port to iohub 0.3.2 ImageArray API (#407)
fix shipped via PR #412
fixes triggered by Copilot's review on PR #427 plus an independent fix
two CLAUDE.md doc nits flagged in PR #427 self-review
fix(dynacell): address PR #426 code-review findings
fix(dynacell): harden DynacellGAN against silent failures (PR #428 review)
* 2D MIP single-marker 192: 384->256->192 patch variant
Larger final crop (192 vs 160) for more subcellular detail at ~2x I/O cost.
Affine corner safety holds: 384 / sqrt(2) * 0.8 = 217 > 192 final crop.
Layered config (base + single-marker + single-marker-192):
- yx_patch_size: 256 -> 384
- final_yx_patch_size: 160 -> 192
- BatchedRandSpatialCropd roi_size: [10, 192, 192] -> [10, 256, 256]
- example_input_array_shape: [1,1,1,160,160] -> [1,1,1,192,192]
- All other augmentation knobs (scale_range, rotate, contrast, smooth,
noise, ChannelWiseZReduction) preserved verbatim from the 160 recipe.
Warm-start uses the 160 single-marker epoch-0 checkpoint (0rhpwh77).
ConvNeXt-Tiny stem is fully convolutional (1x4x4 kernel, stride 1x4x4)
so the state_dict loads cleanly at the new input size.
Run name: 2d-mip-ntxent-t0p2-lr2e5-bs256-384to192-zext16-single-marker-fix-shuffler
sbatch job-name: dynaclr_2d_sm192
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Lineage-aware PHATE subsampling in combined-dim-reduction
REDUCE_COMBINED on the infectomics-annotated matrix (350k cells x 768
features) was running PHATE on a 50k random subsample, fragmenting
tracks across the diffusion graph and producing slow + biologically
incoherent embeddings. The viscy_utils compute_phate already supports
lineage-aware subsampling but only when the caller passes lineage_ids;
nothing was constructing them.
This commit derives lineage_ids in reduce_combined.py from the
(fov_name, track_id) columns already present in obs (prefixed by store
index to keep namespaces disjoint across stores), then passes the
combined array to PHATE. compute_phate now picks N whole lineages and
fits on all their timepoints, then transforms the full 350k.
Recipe (recipes/reduce.yml) updates the `subsample` field with a
comment explaining the unit semantics flip when lineage cols are
present (subsample = N lineages, not N cells) and lowers the value to
1500 — at mean track length 17 across ~17k lineages, that yields
~25k fitting cells. Comparable wall time to the previous random 20k
but with coherent trajectories.
Falls back to random-cell subsampling when neither `lineage_id` nor
`fov_name + track_id` is in obs.
Validation log in evaluation_matrix.md §9 will be updated after the
resubmitted Wave-1 finishes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
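Note: the lineage-aware subsampling described above can be sketched roughly as follows. This is a minimal stdlib illustration, not the code in reduce_combined.py or compute_phate; the helper names `make_lineage_ids` and `subsample_whole_lineages`, the `:` key separator, and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict


def make_lineage_ids(store_idx, fov_names, track_ids):
    """Build one lineage key per cell, prefixed by the store index.

    The prefix keeps (fov_name, track_id) pairs from different stores disjoint.
    """
    return [f"{s}:{f}:{t}" for s, f, t in zip(store_idx, fov_names, track_ids)]


def subsample_whole_lineages(lineage_ids, n_lineages, seed=0):
    """Return row indices covering every timepoint of n_lineages random lineages."""
    rows_by_lineage = defaultdict(list)
    for row, lineage in enumerate(lineage_ids):
        rows_by_lineage[lineage].append(row)
    rng = random.Random(seed)
    lineages = sorted(rows_by_lineage)
    chosen = rng.sample(lineages, min(n_lineages, len(lineages)))
    return sorted(row for lineage in chosen for row in rows_by_lineage[lineage])


# Toy obs table (hypothetical): two stores reuse the same fov_name/track_id pair.
store_idx = [0, 0, 0, 1, 1, 1, 1]
fov_names = ["A/1/0", "A/1/0", "A/1/0", "A/1/0", "A/1/0", "B/2/0", "B/2/0"]
track_ids = [1, 1, 2, 1, 1, 7, 7]

lineage_ids = make_lineage_ids(store_idx, fov_names, track_ids)
fit_rows = subsample_whole_lineages(lineage_ids, n_lineages=2)

# Unit check from the commit message: 1500 lineages x mean track length 17.
expected_fit_cells = 1500 * 17  # 25500, i.e. roughly 25k fitting cells
```

The fit would then run on `fit_rows` only, and the full matrix would be transformed afterwards; tracks are never split because selection happens per lineage rather than per cell.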
* feat(dynacell): add hardware_h200_single_smoke launcher profile
Adds a smoke-sized variant of hardware_h200_single that bounds
launcher.sbatch.time to 30 min. Identical otherwise (single H200,
ntasks_per_node=1, 256G mem). Re-points train_smoke.yml at the new
profile so a smoke job cannot sit on a multi-day allocation by
default. Pair with --override trainer.fast_dev_run=true (or
trainer.max_steps=N) so the run actually exits inside the wall.
test_joint_train_smoke_leaf_composes now asserts sbatch.time matches
the smoke value, locking the invariant into the leaf.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(2D MIP 192): tighten random crop to 216 + drop warm-start
Job 31442612 hit a 30-min NCCL all-reduce timeout in optimizer.step.
Two suspected causes addressed:
- BatchedRandSpatialCropd roi_size 256 -> 216 to fit fully inside the
affine-safe inscribed region (384/sqrt(2)*0.8 = 217 px).
- Warm-start commented out to remove the 160-trained encoder loaded
into a 192-input model as a possible source of rank divergence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
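Note: the affine-safe bound quoted in both 192 commits (384/sqrt(2)*0.8 = 217 px) can be checked with a few lines of arithmetic. This is a sketch only; `affine_safe_side` is a hypothetical helper, and reading the 0.8 factor as the smallest zoom applied by the affine augmentation is an assumption, since the commit messages give the number without naming its source.

```python
import math


def affine_safe_side(patch_side: int, min_scale: float = 0.8) -> float:
    """Side of the largest centered square that stays inside the patch
    under any rotation and a zoom of at least min_scale."""
    # Worst-case rotation is 45 degrees, where the inscribed axis-aligned
    # square has side patch_side / sqrt(2); the zoom factor shrinks it further.
    return patch_side / math.sqrt(2) * min_scale


limit = affine_safe_side(384)
for roi in (192, 216, 256):
    status = "inside" if roi <= limit else "outside"
    print(f"roi {roi}: {status} the {limit:.1f} px safe region")
```

Under these assumptions the final 192 crop and the tightened 216 random crop fit inside the bound, while the earlier 256 random crop does not.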
* refactor(dynacell): /simplify smoke profile via wall-only overlay
Replaces the duplicated `hardware_h200_single_smoke.yml` (which copied
8 of 9 fields from `hardware_h200_single.yml`) with a 4-line
`wall_smoke.yml` that only overrides `launcher.sbatch.time`. The smoke
leaf now stacks `hardware_h200_single + wall_smoke + runtime_shared`,
matching the existing compose-by-overlay idiom. Future hardware tweaks
(cpus, mem, constraint) propagate to smoke automatically; future
hardware variants reuse `wall_smoke.yml` for free.
Also trims the smoke leaf header: drops the numbered "Differences from
sibling train.yml" enumeration (would rot the moment either leaf
changes; the README's "Joint smoke sibling" section already documents
the contract) and shortens the anchor-mechanism comment to a one-line
cross-ref into train.yml.
No behavior change — composed `launcher.sbatch` is byte-identical to
the previous form (verified by `test_joint_train_smoke_leaf_composes`,
which still asserts `time == "00:30:00"` and `constraint == "h200"`).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(dynacell): disable logger at the smoke leaf level
Consumers of train_smoke.yml shouldn't have to remember
`--override trainer.logger=false` at submit time, and smokes don't
need a logger by definition. Sets `trainer.logger: false` directly in
the leaf so future smokes for other models/organelles inherit the
same disable when the recipe defaults a WandbLogger.
Recipe `fit.yml` also defaults a `LearningRateMonitor` callback that
raises `MisconfigurationException` when `trainer.loggers` is empty
(`lr_monitor.py:121-122`), so the smoke leaf's `callbacks` list is
trimmed to just `ModelCheckpoint`. Lists replace wholesale under
deep_merge, so this cleanly drops LRMonitor without affecting prod.
`test_joint_train_smoke_leaf_composes` now asserts both invariants:
`trainer.logger is False` and LRMonitor is not in the callbacks list.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
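Note: the "lists replace wholesale under deep_merge" behaviour that this commit relies on can be illustrated as below. This is a minimal sketch of a typical deep-merge, not the repository's implementation; the `deep_merge` body, the recipe contents, and the `max_epochs` value are assumptions for the example.

```python
import copy


def deep_merge(base: dict, overlay: dict) -> dict:
    """Recursively merge overlay into a copy of base."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, never concatenated.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical recipe defaults and smoke-leaf overrides.
recipe = {
    "trainer": {
        "logger": {"class_path": "WandbLogger"},
        "callbacks": ["LearningRateMonitor", "ModelCheckpoint"],
        "max_epochs": 100,
    }
}
smoke_leaf = {"trainer": {"logger": False, "callbacks": ["ModelCheckpoint"]}}

composed = deep_merge(recipe, smoke_leaf)
```

Because the leaf's `callbacks` list replaces the recipe's list, LearningRateMonitor disappears from the composed config without the leaf having to name it, and untouched keys such as `max_epochs` carry through.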
* fix(2D MIP 192 sbatch): mem-per-cpu 8G -> 14G to avoid host RAM OOM
* fix(viscy-data): drop non-tensor metadata in BatchedConcatDataModule combine
`BatchedConcatDataModule.on_after_batch_transfer` combined per-key
values across micro-batches by assuming each value was either a list
(extend) or a tensor (`torch.cat`). HCSDataModule emits `norm_meta`
(dict of per-channel normalization stats, when
`load_normalization_metadata=True`, the default) and `index` (tuple
of `(img_name, t_idx, z_idx)`) per sample. After collation those keys
hit the cat branch and `torch.cat([dict_a, dict_b])` raised
`TypeError: expected Tensor as element 0 in argument 0, but got dict`,
killing joint training during Lightning's `_run_sanity_check`.
Joint training across heterogeneous children has no well-defined
combined semantic for these keys — channels need not align between
zarrs and FOV identifiers are dataset-specific — and Lightning's
training/validation only reads source/target tensors. Skip the key
during combine instead.
Existing tests (`test_batched_concat_datamodule_with_hcs_children`
and the DDP variants) iterate `next(iter(loader))` to inspect the
micro-batch contract but never invoke `on_after_batch_transfer`, so
the bug never surfaced. New regression test calls
`on_after_batch_transfer` against the real HCS fixture (which does
write per-FOV normalization metadata) and asserts tensor keys
concatenate cleanly while dict-valued metadata is dropped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Move pseudotime package out of evaluation/ and add io.py
The pseudotime tree was duplicated under both
applications/dynaclr/src/dynaclr/evaluation/pseudotime/ and
.../pseudotime/. Make .../pseudotime/ the canonical location:
- Add io.py: template-zarr layout helpers + dataset-routing helpers
(save_template_zarr, load_template_flavor, find_embedding_zarr,
date_prefix_from_dataset_id, get_dynaclr_versions, ...).
- Populate __init__.py with the curated public API and re-export
io helpers.
- Rename POSITIVE_CLASSES -> DEFAULT_POSITIVE_CLASSES and
build_infection_template -> build_template for clarity.
- Migrate .values -> .to_numpy() in pandas usage (pyarrow string
arrays compatibility).
- Delete the legacy evaluation/pseudotime/ tree (no remaining
imports anywhere in the repo).
- Update test_pseudotime.py and scripts/pseudotime/3-organelle-remodeling/
to import from dynaclr.pseudotime, replacing the
sys.path-based shim into 1-build_template/evaluate_template.py.
- Refresh docs/DAGs/pseudotime.md with the new package layout and
the alignment-funnel drop_log.json sidecar note.
Verified: uv run pytest applications/dynaclr/tests/test_pseudotime.py
(29/29 pass), ruff check + format clean.
* fix(OnlineEvalCallback): all_gather features under DDP
Per-rank validation shards were being averaged via sync_dist=True,
but effective_rank, kNN-CV/holdout, and Spearman rho are non-linear
in the sample set — averaging shard scalars yields the wrong value.
The val DataLoader is sharded (datamodule passes num_replicas/rank
to FlexibleBatchSampler with use_distributed_sampler: false), so
this affected every multi-GPU run.
Fix: gather features + per-sample arrays (labels, track_ids,
timepoints) across ranks before computing metrics. All ranks see
the full set, produce identical scalars, and sync_dist=True
becomes a no-op average that still satisfies Lightning's epoch-end
all-reduce (avoiding the original rank-0-only deadlock).
- Add OnlineEvalCallback._gather_across_ranks: passthrough on
world_size=1; pl_module.all_gather + concat on world_size>1.
Equalize per-rank shard sizes by truncating to the minimum length
before gather to keep all_gather happy with fixed shapes.
- Add an availability all-reduce so optional metadata arrays
(track_ids, timepoints) return None consistently across ranks
when missing, instead of stalling on a no-op gather.
- Update class + method docstrings to reflect the new contract.
- Test the multi-rank gather path with a fake pl_module exposing
trainer.world_size and an all_gather that block-stacks. Single
consolidated test covers the world_size=2 concat plus the
missing-optional-array passthrough.
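Note: a small simulation of why averaging per-rank scalars gives the wrong value for metrics that are non-linear in the sample set. This sketch uses a hand-written Pearson correlation as a stand-in for the callback's metrics (effective rank, kNN, Spearman rho) and plain lists in place of `pl_module.all_gather`; `gather_across_ranks` here is a hypothetical imitation of the truncate-then-concat step, not the callback's method.

```python
def pearson(xs, ys):
    """Plain Pearson correlation on two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def gather_across_ranks(shards):
    """Simulate all_gather + concat: truncate to the shortest shard, then pool."""
    min_len = min(len(s) for s in shards)
    return [item for s in shards for item in s[:min_len]]


# Two simulated ranks. Within each shard x and y are anti-correlated,
# but over the pooled set they are strongly positively correlated.
rank0 = [(0.0, 1.0), (1.0, 0.0)]
rank1 = [(10.0, 11.0), (11.0, 10.0)]

per_rank = [pearson(*zip(*shard)) for shard in (rank0, rank1)]
averaged = sum(per_rank) / len(per_rank)  # what averaging shard scalars reports

pooled = gather_across_ranks([rank0, rank1])
full_set = pearson(*zip(*pooled))  # what every rank computes after the gather
```

Here the shard average is -1 while the full-set value is 99/101, so the two disagree even in sign. Once every rank computes on the pooled set, the subsequent average across ranks is over identical numbers.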
* refactor(cli_utils): hoist load_composed_config import to module top
Per CLAUDE.md "Import at the top of the file." Audited all 17 in-repo
callers of load_config across applications/dynaclr/, applications/
airtable/, applications/qc/, and packages/viscy-utils/: every existing
top-level base: key in their YAMLs is a recipe-style list of relative
paths (no data-shaped collisions), so load_composed_config is a strict
superset of yaml.safe_load on legacy configs. No opt-in flag needed.
- Move 'from viscy_utils.compose import load_composed_config' from
inside load_config() to the module-level imports.
- Drop the redundant existence check; load_composed_config already
raises FileNotFoundError when opening the file.
- Update the docstring to describe the base: composition behavior
rather than the prior raw-yaml.safe_load contract; remove the
yaml.YAMLError raises stanza (yaml is no longer imported here).
* fix(prepare): subclass Dumper instead of mutating yaml.Dumper
`save_yaml_no_aliases` was assigning a lambda to
`yaml.Dumper.ignore_aliases`, which mutates the class attribute
globally. Every other yaml.dump() in the same Python process inherited
the alias-disabled behavior — silent action-at-a-distance for any
caller that happens to import this module first.
Define a local _NoAliasDumper subclass and pass it as Dumper=...
instead. Behavior unchanged for this call site; the rest of the process
gets stock yaml.Dumper back.
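Note: the subclass pattern described above looks roughly like this with PyYAML. A sketch under the assumption that `save_yaml_no_aliases` does something equivalent; the sample document is made up for the example.

```python
import yaml


class _NoAliasDumper(yaml.Dumper):
    """Dumper that never emits anchors/aliases, leaving yaml.Dumper untouched."""

    def ignore_aliases(self, data):
        return True


shared = ["a", "b"]
doc = {"first": shared, "second": shared}  # same list object referenced twice

no_alias_text = yaml.dump(doc, Dumper=_NoAliasDumper)
stock_text = yaml.dump(doc)  # stock Dumper still emits an anchor and an alias
```

Only calls that pass `Dumper=_NoAliasDumper` lose aliases; every other `yaml.dump()` in the process keeps the default behaviour.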
* refactor: hoist inline imports to module top
Per CLAUDE.md "Import at the top of the file." Three trivial inline
imports of mandatory dependencies become module-level imports:
- preprocess_cell_index.py: viscy_data.cell_index.preprocess_cell_index
- prepare.py: iohub.open_ome_zarr (was inlined in two functions)
- dimensionality_reduction.py: numpy as np
Inline imports of optional/heavy dependencies (phate, sklearn in
dimensionality_reduction.py; copairs in embedding_map.py with explicit
ImportError handling) are intentional lazy-load patterns and left
alone.
* fix(mmd): cap pooled-kernel size and drop dead helper
mmd_permutation_test allocates an (N, N) float32 kernel where
N = len(X) + len(Y). Default callers pass max_cells=2000 per group
(N=4000 ≈ 64 MB), but a user passing max_cells=None or a too-large
value would silently OOM (50k → 10 GB; 100k → 40 GB).
- Raise ValueError up front when N exceeds 20000 (≈1.6 GB), with
a message that names the would-be allocation and tells the caller
to subsample. Threshold matches the practical statistical regime
for permutation MMD; anyone needing larger samples should switch
to a linear-time MMD estimator.
- Delete _mmd2_from_kernel (~28 lines) — it was never called from
any code path. The live permutation test uses an inner
_mmd2_from_labels closure instead.
23/23 test_mmd.py tests still pass.
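Note: the allocation figures quoted above follow from N * N * 4 bytes for a float32 kernel. The guard below is a sketch of the up-front check, not the code in mmd_permutation_test; `check_pooled_kernel_size` and its error text are hypothetical.

```python
MAX_POOLED_CELLS = 20_000  # threshold named in the commit message


def check_pooled_kernel_size(n_x: int, n_y: int,
                             max_pooled: int = MAX_POOLED_CELLS) -> int:
    """Return the (N, N) float32 kernel size in bytes, or raise if N is too large."""
    n = n_x + n_y
    kernel_bytes = n * n * 4  # 4 bytes per float32 entry
    if n > max_pooled:
        raise ValueError(
            f"Pooled kernel for N={n} would allocate {kernel_bytes / 1e9:.1f} GB; "
            f"subsample each group so that N <= {max_pooled}."
        )
    return kernel_bytes


default_bytes = check_pooled_kernel_size(2000, 2000)  # 64_000_000 bytes, about 64 MB
```

With the same formula, N=20000 gives 1.6e9 bytes, N=50000 gives 1e10 bytes, and N=100000 gives 4e10 bytes, matching the 1.6 GB, 10 GB, and 40 GB figures in the message.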
* Move LC registry to per-task-domain sub-registries
Restructure the central LC registry from {model}/v{N}/ to
{model}/{task_domain}/v{N}/ so different annotation domains can each
have their own versioned bundle. Wave-2 leaves choose which
sub-registry to fetch based on what their data's markers need.
Layout change:
/linear_classifiers/{model}/v1/ -> /linear_classifiers/{model}/infectomics/v1/
/linear_classifiers/{model}/latest -> /linear_classifiers/{model}/infectomics/latest
Leaves updated:
- DynaCLR-2D-MIP-BagOfChannels/infectomics-annotated.yaml: publish_dir
now ends in /infectomics/ (Wave-1 trainer for infectomics task domain)
- DynaCLR-2D-MIP-BagOfChannels/{alfi,microglia}.yaml: pipelines_dir
ends in /infectomics/latest. alfi.yaml has a comment noting the
marker mismatch (DIC vs G3BP1/SEC61B/Phase3D/viral_sensor) means
predictions will be NaN until an alfi-annotated trainer publishes
to /alfi/. This is a smoke test for the reader path.
Matrix doc (evaluation_matrix.md) §1 + §2 reflect the sub-registry
design and the addition of alfi-annotated as a Wave-1 column.
Existing on-disk registry was relocated by hand:
cd /hpc/.../linear_classifiers/DynaCLR-2D-MIP-BagOfChannels/
mkdir infectomics
mv v1 infectomics/v1
rm latest
ln -s v1 infectomics/latest
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
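Note: the new layout can be expressed as a small path builder. This is an illustrative sketch; `registry_dir` is a hypothetical helper and the root is shortened to the `/linear_classifiers` prefix used in the layout listing above.

```python
from pathlib import PurePosixPath


def registry_dir(root: str, model: str, task_domain: str, version="latest") -> PurePosixPath:
    """Build {root}/{model}/{task_domain}/v{N}, or .../latest when no version is given."""
    tag = "latest" if version == "latest" else f"v{version}"
    return PurePosixPath(root) / model / task_domain / tag


model = "DynaCLR-2D-MIP-BagOfChannels"
publish = registry_dir("/linear_classifiers", model, "infectomics", 1)
reader = registry_dir("/linear_classifiers", model, "infectomics")
```

A leaf for a different annotation domain would only change the `task_domain` argument, which is what lets each domain version its bundle independently.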
* fix(slurm): make training/debug shell scripts location-portable
Several training launchers hardcoded my home checkout path
(/hpc/mydata/eduardo.hirata/repos/viscy/...) in their `source` line
or in `cd` calls, so they broke when run from a worktree or by
another user. Two scripts (DynaCLR-2D-BagOfChannels-v3.sh,
already-untracked OPS-373genes.sh) used `$(dirname "$0")/slurm/...`
which silently looked for a sibling 'slurm/' that doesn't exist —
those launchers are broken in HEAD today.
Switch every affected launcher to `$(dirname "$0")/../slurm/train.sh`
(one level up to the shared `training/slurm/`) and the three debug
scripts to `cd "$(dirname "$0")/../../../../.."` (5 levels up to the
repo root). Now any clone or worktree runs without editing.
Scripts already using ${WORKSPACE_DIR} (OPS-1000genes-{allmarkers,
lite,multimarker-BoC}.sh, Phase-contrastive-timeaware.sh) are left
alone — that pattern lets the user override the location via env
and works correctly from worktrees as long as WORKSPACE_DIR is set.
* fix(dynacell): cut smoke leaf batch_size to 1 to fit a single H200
train.yml's batch_size=4 across 4 GPUs OOM'd when run on a single H200
(per-step memory is batch_size * num_samples patches at [8, 512, 512]).
Drops the smoke's child batch_size to 1 so per-step VRAM stays inside
one GPU's budget while the patch shape stays identical to train.yml —
the validation remains apples-to-apples.
`RandWeightedCropd.num_samples` also drops from 2 to 1: HCSDataModule
requires `batch_size % num_samples == 0` and rejects the leaf at
schema-validation time otherwise. num_samples=1 is the largest value
that satisfies the constraint with batch_size=1.
`test_joint_train_smoke_leaf_composes` follows along (asserts
batch_size=1 instead of 4).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
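Note: the divisibility constraint that forces num_samples down to 1 can be sketched as below. `largest_valid_num_samples` is a hypothetical helper for illustration; HCSDataModule itself only validates the condition, as described above.

```python
def largest_valid_num_samples(batch_size: int, requested: int) -> int:
    """Largest num_samples <= requested with batch_size % num_samples == 0."""
    if batch_size < 1 or requested < 1:
        raise ValueError("batch_size and requested must be positive")
    for n in range(min(requested, batch_size), 0, -1):
        if batch_size % n == 0:
            return n
    return 1  # unreachable: 1 divides every positive batch_size


smoke = largest_valid_num_samples(batch_size=1, requested=2)  # smoke leaf
prod = largest_valid_num_samples(batch_size=4, requested=2)   # train.yml values
```

With batch_size=1 the only admissible value is 1, while the train.yml setting of 2 remains valid for batch_size=4.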
* deps: cap anndata <0.12.9 in viscy-utils
anndata 0.12.9 hard-requires pandas<3, but the rest of the stack runs
on pandas 3 (we explicitly downcast ArrowStringArray columns in
embedding_writer.py to bridge the two). Locking anndata to 0.12.6 by
default works today, but a future `uv sync` could pick up 0.12.9+ and
silently break embedding writes.
Cap both the optional `[anndata]` extra and the test dep-group at
<0.12.9 so the resolver fails fast instead of installing a known-broken
combo. The TODO at embedding_writer.py:159 already points to anndata
0.13 as the lift-the-cap signal.
uv.lock unchanged — 0.12.6 already satisfies the new constraint, so a
lock refresh is cosmetic and will land separately if/when the lockfile
is otherwise touched.
* refactor(cell_index): parquet-only preprocess + logger over print
- preprocess_cell_index now returns None. The function's job is to
write the preprocessed parquet; callers that need the dataframe can
read_cell_index afterwards. Drops the dual return + write contract
flagged in review. Only caller is the click CLI, which ignored the
return.
- Convert all five print(...) calls in cell_index.py to
_logger.info(...). Logger was already wired at module top — these
calls were vestigial. Lets users silence the chatter via standard
logging config and routes through the same handler as the existing
_logger.warning / _logger.info elsewhere in the file.
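Note: a minimal illustration of what routing the messages through a module logger buys over print. The logger name, the `report_progress` function, and the list-collecting handler are assumptions for the example and do not mirror cell_index.py.

```python
import logging

_logger = logging.getLogger("cell_index_demo")  # hypothetical logger name


def report_progress(n_kept: int, n_total: int) -> None:
    """Emit a progress message through the logger instead of print()."""
    _logger.info("Kept %d of %d rows", n_kept, n_total)


class _ListHandler(logging.Handler):
    """Collect formatted messages in memory so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = _ListHandler()
_logger.addHandler(handler)

_logger.setLevel(logging.INFO)
report_progress(8, 10)   # recorded
_logger.setLevel(logging.WARNING)
report_progress(9, 10)   # silenced by configuration alone, no code change
```

The second call produces no output because the level was raised, which is the kind of control a bare print cannot offer.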
* chore: rank-zero warn in effective_rank, pytest.raises in z_reduction test
- online_eval.effective_rank's NaN-row warning fires once per call,
but with the recent all_gather fix every rank computes on the same
full-set features matrix, so the warning would emit world_size times
with identical content. Switch to lightning_utilities' rank_zero_warn
so DDP runs see one log line, not N.
- test_z_reduction.test_invalid_strategy used try/assert False/except
ValueError. Replace with `with pytest.raises(ValueError)` (and add
the missing pytest import) — clearer intent and a better failure
mode if the error type ever drifts.
* feat(dynacell): add 4-GPU DDP smoke leaf for joint celldiff SEC61B
Sibling of train_smoke.yml that swaps hardware_h200_single -> hardware_4gpu
and adds trainer.strategy=ddp, devices=4, max_steps=5. Validates that
BatchedConcatDataModule + ShardedDistributedSampler integrate correctly
on the production DDP topology - single-GPU smoke already proved the
joint loader and training/val loops work; this isolates sharding.
batch_size=1 / num_samples=1 kept identical to train_smoke.yml so the
per-rank memory profile is the proven one - sharding is the only new
variable. max_steps=5 baked in (not override) after train_smoke.yml
TIMEOUT'd at 30 min from a missed --override at submit time.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Untrack evaluation_matrix.md (local-only planning doc)
The model x dataset evaluation matrix doc is internal planning that
shouldn't ship upstream. It carries running notes (work in progress
status, validation logs, open questions) that aren't useful in the
public docs and would constantly need rebasing.
- git rm --cached the file (preserved on disk)
- .gitignore now excludes it
Prior history at:
ff04365a Move LC registry to per-task-domain sub-registries
f44b8ffb Lineage-aware PHATE subsampling in combined-dim-reduction
8a7245ef Add evaluation matrix DAG doc with central LC registry design
Those commits remain in history but the doc is no longer tracked.
The pipeline-level DAG (evaluation.md) stays tracked since it
documents the actual production pipeline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(viscy-utils): bf16-precision SSIM helper for Hopper FCMAE training (#412)
* feat(viscy-utils): add bf16-precision SSIM helper
Replaces monai.metrics.regression.compute_ssim_and_cs (which
unconditionally casts both inputs to fp32 internally) with a precision-
aware variant in viscy_utils.evaluation.metrics. The helper runs the 5
Gaussian-mean convolutions in bf16 and promotes only the variance
subtractions and C1/C2-guarded divisions to fp32. Squared products are
computed in fp32 first, then cast to bf16 for the conv input, which
preserves precision on the precision-sensitive squaring step.
Why: FCMAE VSCyto3D training on Hopper (H100/H200) runs at ~45 s/step
vs ~3-5 s/step on Ampere/Ada because monai's 25 fp32 conv ops per loss
invocation can't use Hopper's bf16 tensor cores. The bf16 helper
restores tensor-core path while keeping the precision-sensitive math in
fp32 to match monai's NaN-resistance properties.
Numerical contract validated by 7 tests in test_metrics.py:
- per-pixel random: rtol=5e-2, atol=1e-1 (>=2x margin over measured
0.0418 max abs drift)
- aggregate random: rtol=1e-2, atol=1e-2 (>=25% margin over measured
0.00776)
- aggregate correlated-pair (pred=target+0.05*randn): rtol=2e-3,
atol=5e-3
- gradient flow: cosine-similarity >= 0.99 between helper and monai-ref
flat gradients; sign-flip fraction < 1% on |grad_ref| > 1e-3 voxels
- dtype invariance: helper returns fp32 regardless of fp32/bf16/fp16
input dtype
The decorators on MixedLoss/SpotlightLoss are addressed in a follow-up
commit; this commit only changes the SSIM compute path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
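Note: for reference, the standard SSIM decomposition that the precision split above maps onto, written with a window kernel $w$ and $*$ denoting convolution. The symbols ($m_{xx}$, $L$ for data_range, $k_1$, $k_2$) are notation chosen for this note, not names taken from the helper.

```latex
\begin{aligned}
\mu_x &= w * x, \quad \mu_y = w * y, \quad
m_{xx} = w * x^2, \quad m_{yy} = w * y^2, \quad m_{xy} = w * (x y) \\
\sigma_x^2 &= m_{xx} - \mu_x^2, \quad
\sigma_y^2 = m_{yy} - \mu_y^2, \quad
\sigma_{xy} = m_{xy} - \mu_x \mu_y \\
\mathrm{CS} &= \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad
\mathrm{SSIM} = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}\,\mathrm{CS},
\qquad C_i = (k_i L)^2
\end{aligned}
```

The first line holds the five windowed-mean convolutions (the bf16 part), the second the variance and covariance subtractions, and the third the $C_1$/$C_2$-guarded divisions (the parts promoted to fp32). The subtractions take differences of nearly equal quantities, which is why they are the precision-sensitive step.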
* refactor(viscy-utils): drop redundant fp32 cast on MixedLoss/SpotlightLoss
Removes @torch.amp.custom_fwd(device_type="cuda",
cast_inputs=torch.float32) from MixedLoss.forward and SpotlightLoss.forward.
For MixedLoss: the decorator was redundant because monai's
compute_ssim_and_cs already cast both SSIM inputs to fp32 internally
(monai/metrics/regression.py:402-403), so the outer @custom_fwd only
affected F.l1_loss / F.mse_loss / outer F.avg_pool3d — none of which
are the slow path. Now that the SSIM compute path goes through
_compute_ssim_and_cs_bf16 (preceding commit) the fp32 island the
decorator created actively defeats the purpose. The L1 and MSE branches
remain numerically identical because PyTorch's autocast policy already
promotes them to fp32 (cf. test_mixed_loss_l1_only_matches_torch_l1).
For SpotlightLoss: the decorator's fp32 island was layered onto a
function with no conv-heavy ops — only squared error, masking, sums,
and divisions, all of which are in autocast's promote-to-fp32 list
already. Removal is a no-op for behaviour but unblocks autocast
plumbing for future precision experiments.
New tests:
- test_mixed_loss.py (4): forward outside autocast, forward under bf16
autocast (drift vs explicit-fp32 baseline within rtol=1e-2, atol=1e-2),
gradient flow under autocast, L1-only bit-exact F.l1_loss.
- test_spotlight_loss.py (+2): autocast bf16 forward+backward finite,
autocast result matches explicit-fp32 baseline within rtol=1e-2,
atol=1e-2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(dynacell): record Hopper FCMAE slowdown investigation + bf16 fix
Captures the full diagnostic chain from "Hopper trains 10-15x slower
than Ampere/Ada" to "fp32 MS-DSSIM is the dominant slow term" to the
landed bf16 SSIM helper, with measured numbers for every claim.
Notable artefacts the doc records (so future readers don't redo the
work):
- T2 probe (job 31452082) showed compute_ms collapses from 45.5 s/step
to ~0.9 s/step on identical N=4 H100 hardware when MS-DSSIM is
removed.
- monai's compute_ssim_and_cs unconditionally casts both inputs to fp32
at regression.py:402-403; the @torch.amp.custom_fwd decorator on
MixedLoss/SpotlightLoss was layered redundantly on top.
- T6 probe (job 31453564) with the bf16 helper measured 4151 ms/step
steady-state mean (6 consecutive STEP_TIMER lines) on identical
N=4 H100 — 10.97x speedup, fix landed in commits e42c49a and
3a7fa05.
Falsified hypotheses that were time-consuming to evaluate are also
recorded (cuDNN [240,960,1,1] stride mismatch, DDP bucket-view
mismatch, Lightning vs synthetic data, sync_batchnorm, MS-DSSIM-only
symmetric probe T4 cancelled due to 8h queue slip) so the next person
investigating Hopper performance regressions doesn't re-run them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(viscy-utils): tighten gradient sign-flip + spotlight autocast bounds
Two findings from /verify-plan:
1. test_ssim_helper_gradient_flow used |grad_ref| > 1e-3 to mask
non-tiny voxels for the sign-flip-fraction assertion, but the
measured |grad_ref| max is ~1.7e-6 — so the predicate selected zero
voxels and the assertion was vacuous. Switch to a relative threshold
(10% of the reference grad max) so the assertion is scale-invariant
and meaningful regardless of the loss scale.
2. test_spotlight_loss_autocast_matches_fp32_baseline used rtol=1e-2,
atol=1e-2 but measured drift is 0.0 — SpotlightLoss has no
conv-heavy ops and autocast policy promotes the precision-sensitive
pieces to fp32 anyway. Tighten to rtol=1e-3, atol=1e-3 to match the
plan's spec and catch any future drift.
Cosine-similarity check (>= 0.99 between flat helper-grad and flat
reference-grad) was already meaningful and unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
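Note: the vacuous-assertion problem in finding 1 is easy to reproduce with plain lists. This is a sketch of the idea, not the test in test_metrics.py; `sign_flip_fraction` and the gradient values are made up, with the reference maximum set to the ~1.7e-6 scale quoted above.

```python
def sign_flip_fraction(grad, grad_ref, rel_threshold=0.10):
    """Fraction of sign disagreements among entries whose |grad_ref| exceeds
    rel_threshold * max|grad_ref|. Returns None if nothing is selected."""
    cutoff = rel_threshold * max(abs(r) for r in grad_ref)
    selected = [(g, r) for g, r in zip(grad, grad_ref) if abs(r) > cutoff]
    if not selected:
        return None
    flips = sum(1 for g, r in selected if g * r < 0)
    return flips / len(selected)


grad_ref = [1.7e-6, -1.2e-6, 3.0e-8, -9.0e-7]
grad = [1.6e-6, -1.3e-6, -2.0e-8, -8.5e-7]

# Old predicate: absolute threshold of 1e-3 selects nothing at this scale,
# so any assertion over the selection passes trivially.
absolute_mask = [abs(r) > 1e-3 for r in grad_ref]

# New predicate: 10% of the reference max keeps the three non-tiny entries.
fraction = sign_flip_fraction(grad, grad_ref)
```

The tiny third entry does flip sign, but it falls below the relative cutoff and is excluded, so the fraction is computed over the three entries that matter.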
* refactor(viscy-utils): specialize bf16 SSIM helper to 3D, drop redundant tensors
Two simplify passes consolidated:
1. Drop the ``spatial_dims`` parameter and the ``conv_fn = getattr(F,
f"conv{spatial_dims}d")`` indirection. The helper is already
specialized to "uniform kernel only" per the original plan; the only
caller (``ssim_25d``) only ever passes ``spatial_dims=3``. Hardcoding
``F.conv3d`` directly removes a parameter from the public surface and
removes a getattr lookup per conv call. Docstring updated to spell out
the 3D specialization.
2. Drop the explicit ``y_pred_bf`` and ``y_bf`` intermediate tensors. The
bf16 versions of the simple (non-squared) inputs are only consumed
once each (in ``mu_x`` / ``mu_y`` convs), so caching them as named
variables is pure memory duplication — peak working-set tensors drop
from 7 to 5. Inline the cast at the conv site so the temporary lives
only during the conv. Squared-product bf16 tensors stay named because
the squaring must happen in fp32 first to preserve precision and the
result is genuinely a different tensor.
Test wrapper ``_bf16(...)`` updated to drop the ``spatial_dims=3``
kwarg; ``_SPATIAL_DIMS`` module constant removed; ``_ref(...)`` keeps
``spatial_dims=3`` since monai's signature still requires it.
All 34 tests still pass; numerical contract unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(viscy-utils): apply PR review nits to bf16 SSIM helper
Three small improvements from PR #412 review:
1. metrics.py:178 — switch ``data_range: Union[float, torch.Tensor]``
to ``data_range: float | torch.Tensor`` (modern typing per global
CLAUDE.md). The pre-existing ``Union[...]`` at line 270 in
``ssim_25d`` is left alone per surgical-changes rule; the import
stays since it's still used.
2. metrics.py:233 — add a brief comment explaining why the kernel is
rebuilt per call (small tensor, negligible per-step cost relative
to the conv) and what the cache-key would be if profiling ever
shows it matters. Pre-empts the "should we cache this?" question
future readers will have.
3. test_metrics.py:108 — update the gradient-flow test docstring to
reference the relative threshold (10% of grad max) that the code
actually uses; it previously said ``|grad_ref| > 1e-3`` which was
the absolute threshold from before commit a498d59.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(viscy-utils): address Copilot review on PR #412
Four of Copilot's six findings applied; one rejected.
Applied:
- ``test_metrics.py``: monai is not a hard dep of viscy-utils (only
transitively pulled via viscy-models / optimizers.py). Use
``pytest.importorskip("monai.metrics.regression")`` so the suite
degrades gracefully if monai is ever decoupled.
- All three test files: skip on ``not torch.cuda.is_bf16_supported()``
in addition to ``not torch.cuda.is_available()``. bf16 conv on
pre-Ampere CUDA falls back to software emulation rather than failing,
but the equivalence-vs-monai-fp32 tolerances were measured on Hopper
— exclude older hardware where emulated bf16 could push drift past
the configured rtol/atol.
- ``metrics.py`` docstring: "five Gaussian-mean convolutions" →
"five uniform-window mean convolutions". Kernel is uniform
(ones / prod(kernel_size)), not Gaussian.
- ``metrics.py`` helper: keep named ``y_pred_bf`` / ``y_bf`` views of
the simple inputs going straight from the caller's dtype, instead of
the fp32 round-trip ``y_pred_fp32.to(bf16)``. Saves a cast per simple
input on the autocast path; under autograd the conv inputs are
retained for backward either way, so this doesn't change peak
memory.
Rejected:
- ``metrics.py`` capability gate / fp32 fallback. The plan deliberately
chose unconditional bf16 convs (explicit casts, not autocast policy
detection). bf16 conv works on every CUDA generation we ship to
(sm_80+); on older CUDA it would emulate, not hard-fail. Falling back
to fp32 for "any non-bf16 path" would reintroduce the Hopper
bottleneck this PR exists to fix.
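The skip conditions from the first two applied items can be sketched with the standard library alone. Both helper names are hypothetical. In the real suite the inputs would come from ``pytest.importorskip``, ``torch.cuda.is_available()``, and ``torch.cuda.is_bf16_supported()``.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` is importable; a missing parent package counts as False."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # find_spec imports parent packages and raises if one is absent.
        return False


def should_skip(cuda_available: bool, bf16_supported: bool, monai_available: bool) -> bool:
    """Skip unless CUDA, bf16 support, and the optional reference library are all present."""
    return not (cuda_available and bf16_supported and monai_available)


print(module_available("json.decoder"), module_available("no_such_pkg_xyz.sub"))
print(should_skip(cuda_available=True, bf16_supported=False, monai_available=True))
```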
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(viscy-data): drop use_thread_workers to fix DDP deadlock (#413)
* fix(viscy-data): drop use_thread_workers to fix DDP deadlock
BatchedConcatDataModule used ThreadDataLoader(use_thread_workers=True),
which substitutes a thread-shim for multiprocessing.Process inside
PyTorch's worker iterator and silently forces persistent_workers=False.
That combination × real init_process_group hangs Lightning's
barrier("train_dataloader()") on a 4-GPU H200 node (SLURM 31453225).
Single-GPU and CPU+gloo DDP both worked, so the deadlock is
GPU/CUDA-specific (pin-memory thread × thread-shim worker context ×
per-rank CUDA context).
Drop use_thread_workers=True from train_dataloader and val_dataloader,
move the lambda collate to a module-level _identity_collate (spawn-safe),
and document why workers are now real subprocesses. The new
test_combined_ddp.py uses mp.start_processes(start_method="fork") to
spawn 2 real DDP ranks under pytest (mp.spawn requires start_method
"spawn", which can't re-resolve pytest's --import-mode=importlib path)
and parameterizes (num_workers, mmap_preload) ∈ {(0,F), (2,F), (2,T)}
— the last cell is the regression guard. Uses a deadline-loop join
because ctx.join returns False as soon as any rank exits, so a single
ctx.join(timeout=...) call is unsafe.
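A deadline-loop join of the kind described above might look like the following sketch. ``FakeContext`` and ``StuckContext`` are hypothetical stand-ins for the process context, used so the example runs without spawning processes. The only behaviour assumed is that ``join(timeout=...)`` returns ``True`` once every rank has been joined and ``False`` otherwise.

```python
import time


def join_with_deadline(ctx, deadline_s: float, poll_s: float = 0.05) -> bool:
    """Call ctx.join repeatedly until all ranks are joined or the deadline passes."""
    deadline = time.monotonic() + deadline_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # caller should terminate the survivors
        if ctx.join(timeout=min(poll_s, remaining)):
            return True


class FakeContext:
    """Pretends one rank exits per join() call."""

    def __init__(self, ranks: int):
        self.remaining = ranks
        self.calls = 0

    def join(self, timeout=None):
        self.calls += 1
        self.remaining -= 1
        return self.remaining <= 0


class StuckContext:
    """Pretends a rank is deadlocked: join() always times out."""

    def join(self, timeout=None):
        time.sleep(timeout or 0)
        return False


ctx = FakeContext(ranks=2)
finished = join_with_deadline(ctx, deadline_s=5.0)
timed_out = not join_with_deadline(StuckContext(), deadline_s=0.1)
```

A single ``join`` call on ``FakeContext(ranks=2)`` would return ``False`` after the first rank exits, which is the case the loop guards against.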
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(viscy-data): /simplify cleanup of DDP fix + regression test
Simplify pass on combined.py + test_combined_ddp.py:
- Drop the homegrown ``_identity_collate`` helper; reuse
``monai.data.utils.no_collation``, which is character-identical and
already available via the same ``monai.data`` namespace the file
imports ``ThreadDataLoader`` from.
- Trim the change-narration tail on the ``BatchedConcatDataModule``
docstring; the WHY belongs in the commit message, not the long-lived
class doc.
- In the new test: drop the SLURM job number reference (transient task
artifact), drop the hard-coded ``data_connector.py:93`` line number
(rots across Lightning versions), drop the contributor-private
``/home/...`` path from the deadline-loop comment, and delete two
WHAT-narration comments. Extract ``_kill_survivors`` so the
terminate→join→kill sequence isn't duplicated between the deadline
loop and the ``finally`` block. Replace the ``out_file.exists()``
pre-check with a direct ``read_text()`` (per repo style: prefer
raising errors over TOCTOU stat checks). Only ``mkdir`` the
``scratch_dir`` when ``mmap_preload=True`` actually needs it.
No behavior change. ``packages/viscy-data/tests/test_combined_ddp.py``
still passes all three parameter cells and ``test_combined.py`` is
unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(viscy-data): skip DDP test on platforms without fork
Copilot review on PR #413 flagged that the new
``test_combined_ddp.py`` hard-codes ``start_method="fork"`` and would
fail at runtime on the ``windows-latest`` matrix in
``.github/workflows/test.yml`` (Windows has no fork). Add a
``pytest.skip`` when ``"fork"`` is absent from
``mp.get_all_start_methods()``. Linux (CI default) keeps running all
three parameter cells; Windows now skips cleanly.
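The platform check reduces to a membership test. The sketch below uses the standard-library ``multiprocessing`` module and a hypothetical helper name; the test itself would wrap the same condition in ``pytest.skip``.

```python
import multiprocessing as mp


def fork_supported() -> bool:
    """True when the 'fork' start method exists on this platform."""
    return "fork" in mp.get_all_start_methods()


print(fork_supported())
```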
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(viscy-data): scope test docstring honestly to CPU+gloo coverage
Code-review feedback flagged that the test claims to "guard the
regression that broke production" but the deadlock is GPU/CUDA-specific
(pin-memory thread × thread-shim worker context under NCCL) per the
fix commit's own diagnosis — gloo/CPU passes even with the bug present.
The test still has value (locks down collate, sampler-attachment, rank-0
prepare_data ordering, mmap_preload) but the docstring oversold its
coverage. Reword to state the scope honestly and note that catching
a direct revert of ``use_thread_workers=True`` needs a GPU runner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(viscy-data): guard against re-introducing use_thread_workers=True
CPU+gloo regression test in test_combined_ddp.py cannot reproduce the
GPU/NCCL-specific deadlock that PR #413 fixed (pin-memory thread ×
thread-shim worker context under CUDA), so a future revert of
``use_thread_workers=True`` would slip past CI.
This source-level check runs without DDP on every CI matrix cell.
It builds the joint loader with ``num_workers=2`` (the threshold MONAI
uses for the substitution) and asserts the resulting DataLoader's
``multiprocessing_context`` is not the
``monai.data.thread_buffer._ProcessThreadContext`` shim that
``use_thread_workers=True`` would install
(``monai/data/thread_buffer.py:189-191``).
Validated by temporarily re-adding ``use_thread_workers=True`` to
``BatchedConcatDataModule.train_dataloader`` and confirming the guard
fails on both zarr_v2 and zarr_v3 fixture parameterizations, then
removing the temp change before commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dynacell): nucleus + membrane FCMAE_VSCyto3D scratch + pretrained leaves
Adds the four FCMAE_VSCyto3D leaves needed to extend the existing
ER (sec61b) + Mito (tomm20) benchmark matrix to all four iPSC
organelles (ER / Mito / Nucleus / Membrane), each with a
{scratch, pretrained-encoder} pair so we have a paper-adjacent
init-ablation baseline per organelle.
Layout matches the existing ER/Mito leaves under
``benchmarks/virtual_staining/{er,mito}/fcmae_vscyto3d_{pretrained,scratch}/
ipsc_confocal/train.yml`` — same shared overlays
(``train_sets/ipsc_confocal``, ``data_overlays/fcmae_vscyto3d_fit``,
``model_overlays/fcmae_vscyto3d_fit``, ``hardware_4gpu``) and the same
``benchmark.dataset_ref`` pattern. Pretrained leaves load encoder-only
weights from ``/hpc/projects/virtual_staining/models/mehta-lab/VSCyto3D/
fcmae.ckpt`` (the canonical 400 ep VSCyto3D ckpt); scratch leaves are
random-init.
Both nucleus and membrane targets resolve to the multi-marker
``cell.zarr`` (Nuclei / Membrane channels respectively) per the
aics-hipsc manifest at ``configs/datasets/aics-hipsc/manifest.yaml``,
so no new zarrs needed.
Resolved configs validated end-to-end via
``submit_benchmark_job.py --print-resolved-config``; all four compose
cleanly with the bf16-SSIM-helper-equipped ``viscy_utils`` at HEAD.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(dynacell): override Structure aug-keys for nucleus + membrane FCMAE leaves
Job 31474018 (FCMAE_VSCyto3D_Pretrained_Nucleus) crashed with
``KeyError: 'Structure'`` from ``RandWeightedCropd`` 47 minutes in.
Root cause: ``_internal/shared/model/data_overlays/fcmae_vscyto3d_fit.yml``
hardcodes ``keys: [Phase3D, Structure]`` and ``w_key: Structure`` — that
overlay was authored for ER (sec61b) and Mito (tomm20), which both
have ``target_channel == "Structure"``. Compose's list-replace semantics
mean the data overlay's augmentations list completely overrides the
target overlay's augmentations, so the per-organelle channel name from
``targets/{nucleus,membrane}.yml`` (Nuclei / Membrane) never reaches
``RandWeightedCropd``.
Fix at the leaf level: each of the 4 nucleus + membrane FCMAE leaves
now declares its own ``data.init_args.augmentations`` with the right
channel name. The leaf is composed last, so its augmentations list wins
over the data overlay. The augmentation policy (spatial_size [20, 600,
600], num_samples 4, RandWeightedCropd as the only CPU aug) is kept
identical to the FCMAE overlay so behaviour matches ER/Mito.
A more robust fix would be to make the FCMAE data overlay's CPU
augmentation block parameterized by ``target_channel`` (or to drop it
from the overlay entirely and rely on the per-target overlay), but
that's broader infra surgery — leaving for a separate refactor.
ER (sec61b) and Mito (tomm20) FCMAE leaves are unchanged because they
already work — their target_channel is "Structure".
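The list-replace behaviour behind this crash can be reproduced with a small merge function. This is a sketch under assumptions: ``compose`` is not the project's composer, and the overlay dictionaries are simplified stand-ins for the YAML files named above.

```python
def compose(base, override):
    """Merge two config trees: dicts merge key by key, everything else is replaced."""
    if isinstance(base, dict) and isinstance(override, dict):
        merged = dict(base)
        for key, value in override.items():
            merged[key] = compose(base[key], value) if key in base else value
        return merged
    return override  # lists and scalars are replaced wholesale, never concatenated


def crop(channel):
    # Simplified stand-in for one RandWeightedCropd entry.
    return {"keys": ["Phase3D", channel], "w_key": channel}


target_overlay = {"data": {"init_args": {"target_channel": "Nuclei",
                                         "augmentations": [crop("Nuclei")]}}}
data_overlay = {"data": {"init_args": {"augmentations": [crop("Structure")]}}}
leaf = {"data": {"init_args": {"augmentations": [crop("Nuclei")]}}}

without_leaf = compose(target_overlay, data_overlay)  # the failing configuration
with_leaf = compose(without_leaf, leaf)               # leaf composed last wins

print(without_leaf["data"]["init_args"]["augmentations"][0]["w_key"])
print(with_leaf["data"]["init_args"]["augmentations"][0]["w_key"])
```

Without the leaf override the crop still points at ``Structure`` even though ``target_channel`` is ``Nuclei``. That mismatch is what produces the ``KeyError``.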
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dynacell): add eval script for FNet3D paper predictions on iPSC confocal
Evaluates SEC61B, membrane, TOMM20, and nucleus FNet3D predictions against
ground truth using pixel and DynaCLR feature metrics.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Disable PHATE; bump wrapper RAM; add microglia Wave-2 SLURM script
PHATE in reduce_combined silently deadlocks under scipy 1.17.1 +
sklearn 1.8.0 (graphtools -> sklearn PCA -> scipy.lu, ~0% CPU forever
even with BLAS thread caps). Set phate: null in the recipe; PCA-only
reductions are sufficient for Table 1 + matrix comparison plots.
Re-enable when scipy < 1.17, X_pca_combined is wired through PHATE,
or an upstream fix lands.
Wave-1 wrapper bumped to 32G/4cpu (was 8G/2cpu). Per-experiment PLOT
runs locally on the wrapper node and 19 sequential plots on 350k cells
were OOM-killing the 8G allocation.
run_microglia.sh: new Wave-2 SLURM submission for the microglia leaf.
Same pattern as run_infectomics_annotated.sh; reads pipelines from
.../linear_classifiers/{model}/infectomics/latest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dynacell): unblock joint training + add A549 cross-eval (#415)
* chore(deps): pin iohub to PR #408 (RFC-9 zipped OME-Zarr); bump Python to >=3.12
Pin iohub to czbiohub-sf/iohub PR #408 head SHA
(53b10acb7a30a2c7e8dfd9b04258dea073e14088 — "feat: Added RFC-9 Zipped
OME-Zarr"). The PR adds zipped OME-Zarr store support; this pin
unblocks downstream consumers before the PR merges upstream.
The PR's iohub head requires Python >=3.12, so bump every workspace
pyproject.toml's `requires-python` from >=3.11 to >=3.12, drop the
3.11 classifier, and update ruff `target-version` to py312. Also
folds in the previously-pending dynacell[wandb] optional-extra
registration that the lockfile was already lagging on.
SHA pin (not branch) survives force-pushes; bump deliberately when
the PR moves.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dynacell): bake A100 exclusion into hardware_4gpu launcher profile
After repeated NCCL BROADCAST/ALLREDUCE coordination hangs at first-batch
on this cluster's A100 partition (FCMAE jobs 31474030 + 31474038 timed
out at 1h13m / 1h21m on watchdog; joint smoke 31480607 hung
indefinitely at `Loading 'train_dataloader'` on gpu-a-3), every 4-GPU
train leaf needed the same `--override
launcher.sbatch.constraint='h100|h200|a40|a6000|l40s'` workaround.
Job 31481032 with that override scheduled to gpu-h-3 (H100) and pushed
past the same hang point, confirming the workaround.
Bake the exclusion into the shared profile so it applies by default to
all 11 4-GPU consumers (FCMAE pretrained × 4 + scratch × 4 + er/unext2
+ joint celldiff train.yml + train_smoke_4gpu.yml). Leaves that
genuinely need A100 must opt out via
`--override launcher.sbatch.constraint=null`.
Lock-in test: `test_4gpu_train_leaves_inherit_a100_exclude` walks
every train leaf under benchmarks/virtual_staining/ and asserts the
constraint on 4-GPU leaves; new leaves picking up the profile are
covered automatically.
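The core of such a lock-in check is small. The sketch below is not the actual test: the ``leaves`` mapping and its field names are hypothetical stand-ins for resolved train leaves, and the constraint is assumed to be a ``|``-separated list of GPU features.

```python
def excludes_a100(constraint) -> bool:
    """True if the sbatch constraint is an OR-list of GPU features that omits a100."""
    if not constraint:
        return False  # no constraint means any node, including A100
    features = {feature.strip().lower() for feature in constraint.split("|")}
    return "a100" not in features


# Hypothetical resolved leaves: name -> GPU count and constraint.
leaves = {
    "er/train": {"gpus": 4, "constraint": "h100|h200|a40|a6000|l40s"},
    "mito/train": {"gpus": 4, "constraint": None},
    "er/train_smoke": {"gpus": 1, "constraint": None},
}

offenders = [name for name, cfg in leaves.items()
             if cfg["gpus"] == 4 and not excludes_a100(cfg["constraint"])]
print(offenders)
```

Only 4-GPU leaves are checked, so the single-GPU smoke leaf passes even without a constraint.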
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(viscy-data): port to iohub 0.3.2 ImageArray API (#407)
The iohub PR #408 pin (Commit 1) ships a refactored ``ImageArray`` API
that removed ``ImageArray.name`` and other legacy attributes used
across viscy-data, viscy-utils, dynaclr, and qc. Without this port,
the existing test suite fails immediately on dataset iteration
(``AttributeError: 'ImageArray' object has no attribute 'name'`` at
``sliding_window.py``).
Cherry-picks the migration shipped in PR #407 (squash 737cedf on
``modular-viscy-staging``), preserving local extensions:
- ``sliding_window.py``: take ``f"/{img.path}"`` from #407 (replaces
``img.name``) but keep dynacell-models's ``to(torch.float32, copy=True)``
cast on the mmap_preload path (post-dates #407).
- ``viscy-data/pyproject.toml``: keep ``iohub>=0.3.2`` lower bound from
#407; the ``[tool.uv.sources]`` SHA pin in the workspace root
pyproject.toml (Commit 1) is what actually controls resolution.
- ``uv.lock``: regenerate.
All 25 ``packages/viscy-data/tests/test_combined.py`` tests pass post-
migration, including the previously-failing
``test_combined_datamodule_fit``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(viscy-data): propagate trainer to children in ConcatDataModule.setup
Under DDP, ``Trainer.fit`` calls ``prepare_data`` only on rank 0 of each
node when ``prepare_data_per_node=True``, but ``setup`` runs on every
rank. ``ConcatDataModule.prepare_data`` forwarded ``self.trainer`` to
each child datamodule; ``setup`` did not. As a result, non-rank-0
children kept ``self.trainer = None``, and trainer-gated paths in the
children's ``on_after_batch_transfer`` silently skipped — most visibly
``HCSDataModule``'s
``if self.trainer and self.trainer.training`` guard at ``hcs.py:628``,
which gates ``gpu_augmentations``.
Production failure: SLURM 31481032 (joint celldiff smoke on 4× H100,
post-PR #413 + post-A100-exclude). Ranks 1, 2, 3 raised
``ValueError: x spatial size [13, 624, 624] does not match expected
[8, 512, 512]`` at the first training step — rank 0 ran fine because
its child had received the trainer via ``prepare_data``. Rank
asymmetry was the smoking gun.
Fix: propagate ``dm.trainer = self.trainer`` at the top of each loop
iteration in ``setup``, mirroring ``CombinedDataModule.setup`` (which
was already correct). Same fix applied to ``CachedConcatDataModule``,
which has the identical latent bug. ``BatchedConcatDataModule``
inherits from ``ConcatDataModule`` so the fix flows through.
Single-process regression test exercises the bug by skipping
``prepare_data`` entirely (mimicking the non-rank-0 lifecycle) and
asserting that ``setup`` alone propagates the trainer and that
``gpu_augmentations`` actually run through ``on_after_batch_transfer``.
The existing ``test_combined_ddp.py`` (gloo, ``pin_memory=False``, no
``gpu_augmentations``) cannot catch this — it doesn't exercise the
gpu-aug branch.
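The bug and its fix can be modelled with toy classes. Everything below is a simplified stand-in with hypothetical names, not the real datamodules. It keeps only the trainer gate and the propagation line, and mimics a non-rank-0 process by never calling ``prepare_data``.

```python
from types import SimpleNamespace


class ToyChild:
    """GPU augmentations run only when a trainer in training mode is attached."""

    def __init__(self):
        self.trainer = None

    def setup(self, stage):
        pass

    def on_after_batch_transfer(self, batch):
        if self.trainer and self.trainer.training:
            return f"augmented({batch})"
        return batch  # silently skipped when trainer is None


class ToyConcat:
    def __init__(self, data_modules):
        self.data_modules = data_modules
        self.trainer = None

    def setup(self, stage):
        for dm in self.data_modules:
            dm.trainer = self.trainer  # the fix: propagate before the child's setup
            dm.setup(stage)


children = [ToyChild(), ToyChild()]
parent = ToyConcat(children)
parent.trainer = SimpleNamespace(training=True)
parent.setup("fit")  # prepare_data deliberately not called

outputs = [dm.on_after_batch_transfer("batch") for dm in children]
print(outputs)
```

Removing the propagation line leaves every child with ``trainer = None``, and the batches come back unaugmented. That is the rank-asymmetric symptom described above.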
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(dynacell): mirror a549-mantis manifest fixtures (3 plates)
Add three local fixture manifests so VisCy's dataset_ref resolver
tests can resolve a549-mantis datasets without requiring a
``dynacell-paper`` install. Mirrors the canonical manifests at
``dynacell-paper/_configs/datasets/a549-mantis/<date>/manifest.yaml``
landed via dynacell-paper PRs A1.1 (#14) and A2.1 (#15) plus the
prior aeef64c registration.
- ``a549-mantis-2024_10_29`` — TOMM20 plate (mito cross-eval).
- ``a549-mantis-2024_11_07`` — SEC61B plate (er cross-eval; also
feeds the joint celldiff train leaf).
- ``a549-mantis-2026_03_26`` — h2b/caax mantis_v2 plate
(nucleus + membrane cross-eval).
Resolver discovery: ``DYNACELL_MANIFEST_ROOTS`` is set to
``tests/fixtures/manifests/`` by the autouse
``_dynacell_manifest_root_env`` fixture in ``tests/conftest.py``.
Layout matches existing aics-hipsc fixture
(``<root>/<dataset_name>/manifest.yaml`` per
``resolver.py:131-140``); ``splits/`` subdirs are not mirrored
because the resolver doesn't read them.
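A lookup over this layout might be sketched as follows. ``find_manifest`` is a hypothetical helper, not the resolver's API. It assumes multiple roots are separated by ``os.pathsep``, and the manifest contents written here are a placeholder.

```python
import os
import tempfile
from pathlib import Path


def find_manifest(dataset_name, roots_env="DYNACELL_MANIFEST_ROOTS"):
    """Return the first <root>/<dataset_name>/manifest.yaml found, or None."""
    raw = os.environ.get(roots_env, "")
    # Assumption: several roots may be listed, separated by os.pathsep.
    for root in filter(None, raw.split(os.pathsep)):
        candidate = Path(root) / dataset_name / "manifest.yaml"
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as root:
    manifest = Path(root) / "a549-mantis-2024_11_07" / "manifest.yaml"
    manifest.parent.mkdir(parents=True)
    manifest.write_text("placeholder: true\n")  # placeholder content only
    os.environ["DYNACELL_MANIFEST_ROOTS"] = root
    found = find_manifest("a549-mantis-2024_11_07")
    missing = find_manifest("not-a-dataset")

print(found is not None, missing)
```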
Closes Stage 5 of A549_EXPANSION_ROADMAP.md on the VisCy side and is
the prerequisite for Stage 6 cross-eval leaves.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dynacell): add a549_mantis predict_set fragments + Stage 6 cross-eval leaves
Closes Stage 6 of A549_EXPANSION_ROADMAP.md: every existing iPSC-trained
``<organelle>/<model>/ipsc_confocal/`` cell now has a sibling
``predict__a549_mantis.yml`` + ``eval__a549_mantis.yaml`` so iPSC-trained
models can be evaluated on the a549 test split.
8 cells × 2 leaves + 8 eval symlinks + 6 predict_set fragments
= 30 file additions.
Predict_set fragments are per-plate because the a549-mantis registry is
split by acquisition date:
- ``2024_11_07`` — SEC61B (er cross-eval)
- ``2024_10_29`` — TOMM20 (mito cross-eval)
- ``2026_03_26`` — H2B + CAAX (nucleus + membrane cross-eval)
Each fragment exists on both the model side
(``_internal/shared/model/predict_sets/``) and the Hydra side
(``dynacell/evaluation/_configs/predict_set/``), matching the existing
ipsc_confocal pattern.
Stage 6 leaves keep the iPSC-trained checkpoint and inherit the cell's
per-model predict configuration; only the predict_set base, output
store, and ``benchmark.experiment_id`` differ. CellDiff outputs use
``_iterative`` suffix; UNetViT3D outputs do not.
Nucleus + membrane leaves additionally set
``benchmark.dataset_ref.target: h2b`` (or ``caax``) at the leaf level —
the iPSC manifest keys those targets by `nucleus`/`membrane` while a549
keys them by gene. Eval leaves also override
``target_name: h2b`` (or ``caax``) so the segmentation / cache layers
(``mask_plate(target_name)`` → ``{target_name}.zarr``) remain consistent
if ``compute_feature_metrics`` is enabled later.
A549 manifests don't currently carry ``cell_segmentation`` or
``gt_cache_dir`` paths (no segmentation pipeline yet), so eval leaves
set ``compute_feature_metrics: false``. Pixel metrics (PCC, SSIM, NRMSE,
PSNR, FSC, spectral PCC) work without segmentation. Flip the flag in a
follow-up once segmentation lands in dynacell-paper.
Composition tests:
- ``test_a549_predict_leaf_composes`` parametrizes over the 8-cell
matrix and asserts ``data.init_args.{data_path, source_channel,
target_channel}``, ``benchmark.dataset_ref.{dataset, target}``,
``experiment_id``, and ``launcher.sbatch.constraint == 'h200'``
(single-GPU predict topology).
- ``test_a549_eval_leaf_composes_and_splices`` composes each Hydra eval
leaf, calls ``apply_dataset_ref(cfg)``, and asserts ``io.{gt_path,
gt_channel_name, pred_channel_name}``, ``cell_segmentation_path is
None``, ``gt_cache_dir is None``, manifest spacing (mantis_v1 0.1494
µm vs mantis_v2 0.116 µm), and ``compute_feature_metrics is False``.
16 new tests, all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor: /simplify cleanup of joint-training + a549 cross-eval branch
Apply post-implementation simplification per /simplify review:
- ``combined.py``: drop SLURM job ID from ``ConcatDataModule`` Notes block.
Symptom (rank-asymmetric un-cropped batches) is non-obvious and stays;
the specific failing job belongs in PR description, not a docstring
that ages out.
- ``test_combined.py``: switch ``SimpleNamespace`` fake trainer to
``MagicMock`` to match the existing pattern in ``test_hcs.py``. Move
imports to module level (per CLAUDE.md "Import at the top of the
file"). Drop redundant inline narration comments.
- ``hardware_4gpu.yml``: trim 4 SLURM job IDs from the rationale block
(task narration; belongs in PR/commit messages) and drop the rotting
"11 4-GPU consumers (FCMAE × 8 + ...)" count — the data-driven
``test_4gpu_train_leaves_inherit_a100_exclude`` enforces the invariant
without needing a maintenance-prone count in a YAML comment.
No behavior change. All previously-green tests remain green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(dynacell): clarify _A549_EVAL_EXPECTATIONS shape comment
The comment listed four fields, but the dict values are 3-tuples (the
organelle is the dict key, not part of the value). Rewrite as
``{organelle: (target_group, gt_channel, gt_suffix)}`` so the shape is
unambiguous when the matrix is extended.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(dynacell): final findings + 8-job FCMAE benchmark, open items
Adds a "Final findings (2026-04-26) — bf16 fix shipped, 4-organelle
benchmark in flight" section that:
- States the solution in one paragraph (bf16 SSIM helper +
redundant-decorator removal, with reference to the merged PR #412
squash commit 48f4878).
- Records the live 4-organelle x {scratch, pretrained} = 8-job FCMAE
matrix with median + p10 s/step, sourced directly from wandb
loss/train_step step x timestamp deltas (dt < 60s filter excludes
ckpt/val/epoch boundaries — directly measured per the
no-walltime-estimates rule).
- Calls out three claims we can support today:
* pre-fix Hopper at 45-75 s/step -> post-fix at 4.77-5.76 s/step,
9-14x recovery
* Hopper now competitive with L40S but not visibly faster
* A40 (gpu-c-1) still wins at 2.40 s/step steady-state for
reasons not yet measured (node-local I/O, /dev/shm topology,
etc.); same shared-FS data path as Hopper, so the difference is
on the compute-node side.
- Lists what's still open: warmup-vs-steady-state on Hopper (current
numbers are gstep < 1.7k vs A40 baseline at gstep 35k), an unprofiled
A40-vs-Hopper compute-path gap, the 20h host-RAM leak (separate
thread), and the A100 NCCL BROADCAST hang (mitigation in place via
--constraint='h100|h200|a40|a6000|l40s' but root cause not
investigated).
Adds Recommendation #4 (the A100 exclude). Numbers explicitly noted as
"will be re-pulled once new Hopper jobs reach gstep >=10k" so the
follow-up is on the page, not a chat-only TODO.
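The s/step numbers come from timestamp deltas with a 60 s cutoff. A minimal version of that calculation is sketched below. The function name and the timestamps are made up, and ``statistics.quantiles`` is used as one possible p10 estimator, which may differ from the one used for the reported figures.

```python
import statistics


def step_time_stats(timestamps, max_dt=60.0):
    """Median and 10th-percentile seconds per step from consecutive log timestamps."""
    deltas = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    kept = [dt for dt in deltas if 0 < dt < max_dt]  # drop ckpt/val/epoch gaps
    median = statistics.median(kept)
    p10 = statistics.quantiles(kept, n=10)[0]  # first decile cut point
    return median, p10


# Made-up timestamps (seconds): steady 5 s steps with one 100 s checkpoint gap.
timestamps = [0.0, 5.0, 10.0, 15.0, 115.0, 120.0, 125.0]
median, p10 = step_time_stats(timestamps)
print(median, p10)
```

The 100 s gap is filtered out, so it does not inflate the median.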
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(dynacell): clean up resolved planning docs; refresh A549 roadmap
Remove three temporary planning docs whose work has shipped, and
refresh the overall A549 expansion roadmap to reflect current
progress, gaps, and next steps.
Removed:
- DATASET_REF_RESOLVER_SPEC.md — Stages 1-3 spec; all merged
(38d47b3, 4bb9f09, 11836c8, 326b2d0, 6273439, 8924ab2, f5a6e56,
a984384).
- fcmae_hopper_slowdown.md — bf16 SSIM kernel fix shipped via PR #412
(squash 48f4878); 4-organelle FCMAE benchmark complete.
- submit-reliability-plan.md — Gap 1 (--exclude SBATCH directive)
and Gap 3 (NCCL preflight smoke test) merged into
submit_benchmark_job.py + sbatch_template.sbatch +
tools/nccl_smoke_test.py; Gap 2 (auto-requeue) explicitly deferred
upstream of the doc.
A549_EXPANSION_ROADMAP.md refresh:
- Status snapshot table: Stages 1-3 done with commit refs; Stage 5
partial (canonical manifests in dynacell-paper @ aeef64c, VisCy
fixture mirror still missing); Stage 6 not started; Stage 7 (joint
training) in flight with first leaf shipped.
- Reconcile commit-message "Stage 7" with the roadmap by promoting
joint-training leaves to a first-class stage (was implicit under
the old Stage 6 description).
- Drop the now-stale DATASET_REF_RESOLVER_SPEC.md cross-reference.
- Trim ~234 → ~130 lines: collapse done stages, keep open work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(dynacell): rename ER+Mito FCMAE pretrained outputs to _ws8500
The original ER and Mito FCMAE pretrained leaves predated the
warmup_steps=8500 model overlay (commit d296c7f, 2026-04-23) and
were submitted with the engine default warmup_steps=3, while their
scratch counterparts ran with warmup_steps=8500. To make the
pretrained vs scratch pairs directly comparable on training
protocol, rename outputs to a sibling _ws8500 directory so both
the abandoned ws=3 baselines and the new ws=8500 runs coexist on
disk. README.md files in the parent dirs document the dual-dir
layout.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore(dynacell): pin compute-job CWD to repo_root in sbatch template
Lets the dynacell launcher work from a git worktree: uv resolves
project context (pyproject + .venv) from CWD, so without an
explicit cd the compute-node srun would inherit whatever CWD sbatch
was invoked from and may pick up the wrong worktree's environment.
Pinning to @@repo_root (the worktree containing the launcher
script) makes the env deterministic regardless of submit-side CWD.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(dynacell): refresh A549 roadmap with 2026-04-24 status
Stage 5 (a549 manifest): A549 zarr normalization-stats backfill
closed 2026-04-24 in dynacell-paper (f4120e0 + 17-zarr backfill);
no VisCy-side action remaining for that subtask.
Stage 7 (joint training leaves): mark "in flight - blocked". PR #413
(0b04b24) closed one DDP deadlock surface but the 4-GPU smoke
(234819a) still hangs at the same milestone; a second deadlock
surface remains. Joint leaf expansion is paused until that resolves.
Cross-reference the followup handoff and the _test48 debug-zarr
convention for short-wall smoke runs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(dynacell): repoint mito A549 cross-eval to 2024_11_21 plate
The 2024_10_29 plate the Stage 6 mito leaves were wired to is
train-only by design — its splits author 0 test FOVs. Repoint the
fixture mirror, predict_set fragments (model + Hydra side), the four
mito predict/eval leaves, and the two test matrices to
a549-mantis-2024_11_21, which has 11 authored test FOVs.
This unblocks all 8 Stage 6 cross-eval cells; ER, mito, nucleus, and
membrane × {celldiff, unetvit3d} now point at plates with real test
splits on disk.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dynacell): add fnet3d_paper Stage 6 A549 predict leaves
Author 4 a549_mantis predict leaves for fnet3d_paper × {er, mito,
nucleus, membrane}, mirroring the iPSC fnet predict leaves. Plates
follow the same wiring used by celldiff/unetvit3d a549 leaves: ER/
sec61b → 2024_11_07, mito/tomm20 → 2024_11_21, nucleus and membrane
→ 2026_03_26 (with benchmark.dataset_ref.target overrides h2b and
caax for the gene-keyed a549 targets). Each leaf reuses the same
iPSC best-val checkpoint as predict__ipsc_confocal.yml — Stage 6 is
cross-eval, not retrain.
This rounds out the third and final model row for Stage 6's
"full-but-predictable-only" sub-scope (celldiff + unetvit3d already
landed in #415; fnet3d_paper completes the trio).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dynacell): add fcmae_vscyto3d Stage 6 A549 predict scaffolding
Add a shared `fcmae_vscyto3d_predict.yml` overlay (mirrors the model
block from `fcmae_vscyto3d_fit.yml` so checkpoints load with the
matching architecture, plus predict-time hparams) and 8 a549_mantis
predict leaves: 4 organelles × {pretrained, scratch}. Plates follow
the same wiring as the celldiff/unetvit3d/fnet3d_paper a549 leaves
(ER/sec61b → 2024_11_07, mito/tomm20 → 2024_11_21, nucleus/h2b and
membrane/caax → 2026_03_26 with `dataset_ref.target` overrides).
`ckpt_path` is intentionally a `/TODO_FILL_BEFORE_SUBMIT/...` path
on every leaf — iPSC FCMAE training is still in flight (jobs
31475094-31523064). Submitting any of these as-is will fail loudly
with FileNotFoundError on torch.load. Each leaf header documents the
expected post-training path so the swap is mechanical.
Note: ER + Mito pretrained outputs use a `_ws8500` warmup-steps
suffix (per commit 9af8bdf); nucleus + membrane pretrained do not.
Suggested paths in the headers reflect that asymmetry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dynacell): add FNet3D ER joint single-GPU smoke leaf
Author the FNet3D counterpart to the existing celldiff
joint_ipsc_confocal_a549_mantis/train_smoke.yml. Pairs the iPSC
SEC61B_test12 zarr (12 FOVs, 2.4 GB) with the 4-FOV a549_mantis
SEC61B store so a single A40 / H200 can iterate the joint loader
end-to-end without the full ~250 GB iPSC SEC61B mmap_preload wait.
Verified end-to-end on a local A40: 10 fwd+bwd+optim steps in 1:19,
26 GB peak GPU mem, loss/train_step 0.680 → 1.187 epoch loss, three
checkpoints written. Joint sharding on FNet3D's 35.3 M-param model
with z=32 / yx=64 / bs=8 fp32 works at single-GPU.
Note: leaf bakes `num_workers: 0` + `pin_memory: false` because the
default `num_workers > 0` + `pin_memory: True` reliably hangs on
forking the dataloader workers from a CUDA-initialized parent on
this cluster's interactive nodes (same pattern as the 4-GPU joint
smoke deadlock). Header documents the workaround so consumers tune
up only when they need throughput.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(cytoland): enable mmap+persistent+bf16 for A549 infected finetune
Three flips to maximize throughput on the 4xH200 prod run now that PR #411
lifted the mmap_preload + exclude_fov_names restriction:
- persistent_workers: true on the 3 sub-DMs avoids the worker re-spawn
cost that dominated dynacell's per-epoch overhead.
- mmap_preload: true caches each plate to local /tmp scratch (28 TB on
H200 nodes), eliminating per-batch NFS reads. D3 now mmaps cleanly
even with its 27 exclude_fov_names entries thanks to PR #411.
- precision: bf16-mixed dodges the Hopper fp16 cuDNN slowdown
documented in applications/dynacell/configs/examples/fcmae_hopper_slowdown.md.
D2 smoke explicitly overrides both DataModule flags to false so
single-GPU smokes on the full prod plate stay under their walltime;
the bf16 flip is kept on the smoke to validate the AMP path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore(cytoland): drop h200 constraint on A549 infected sbatch
The 4xH200 slot was scheduling 5 days out. bf16-mixed (landed in
ad1df84) avoids the Hopper fp16 slowdown that was the original
reason to pin to H200, so any 4-GPU node (H100/A100/A6000) is fine
and starts much sooner.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(dynacell): bundle manifest registry as first-class package data
Promote the test fixture manifest mirror to a real registry shipped with
the dynacell wheel. The resolver auto-discovers it via the
``dynacell.manifest_roots`` entry point declared in pyproject.toml, so a
fresh clone resolves ``benchmark.dataset_ref`` lookups out of the box —
no ``DYNACELL_MANIFEST_ROOTS`` env var needed for predict/eval/test runs.
Replaces the previous workaround where users had to either set the env
var manually (multiple sessions hit ``ManifestNotFoundError`` because of
this) or rely on the autouse pytest fixture that injected it for tests
only. The env var stays as an explicit override path for testing
discovery precedence.
The ``splits/`` subdirs were missing in the old fixture mirror but present in
canonical; copy them alongside their manifests so consumers that
resolve ``target.splits`` (currently latent in VisCy, used by
dynacell-paper) work correctly.
Drift between this bundled copy and dynacell-paper canonical is guarded
by ``test_manifest_sync.py`` — parametrized over every shipped dataset,
parses both YAML files and asserts canonical is a subset of VisCy
(VisCy may carry additive fields like ``gt_cache_dir``). Skipped unless
``DYNACELL_PAPER_PATH`` is set, so CI behaves predictably while local
dev catches drift early.
This is the cheapest fix for the viscy <-> dynacell-paper logical
cycle: single PR, no new package. Heavier options (extracting a
``dynacell-manifests`` leaf package + ``DYNACELL_DATA_ROOT`` for path
portability) deferred per architecture discussion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(cytoland): require >=80 GB VRAM for A549 infected sbatch
Adds --constraint='h200|h100_80|a100_80' so the 4xGPU prod run is
guaranteed at least 80 GB per device. The 3-DM CombinedDataModule in
MAX_SIZE_CYCLE produces 16x4x3 = 192 crops/step/GPU — measured
against a live dynacell FCMAE H200 reference (~45 GB at single-DM
128 crops/step), the projected per-GPU footprint is ~67 GB. That
fits comfortably on H200, ~84% of an 80 GB H100/A100, and would OOM
on a 40 GB A100 or 48 GB A40/A6000.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
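The VRAM projection in the commit above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the footprint scales linearly with crops per step. That assumption is inferred from the quoted numbers, not stated in the commit, and the function name is hypothetical.

```python
def projected_vram_gb(ref_gb: float, ref_crops: int, target_crops: int) -> float:
    """Scale a measured footprint linearly with crops per step (assumption)."""
    return ref_gb * target_crops / ref_crops


# 16 x 4 x 3 crops/step/GPU, as quoted in the commit message.
crops_per_step = 16 * 4 * 3

# Reference point from the commit: ~45 GB at 128 crops/step (single DM).
projected = projected_vram_gb(ref_gb=45.0, ref_crops=128, target_crops=crops_per_step)

print(f"crops/step/GPU: {crops_per_step}")            # 192
print(f"projected footprint: {projected:.1f} GB")     # 67.5 GB
print(f"share of an 80 GB device: {projected / 80:.0%}")  # 84%

# Devices at or below 48 GB cannot hold the projected footprint.
for vram_gb in (40, 48, 80):
    verdict = "fits" if projected < vram_gb else "would OOM"
    print(f"{vram_gb} GB device: {verdict}")
```

The ~67 GB figure in the commit matches this linear estimate (67.5 GB before rounding). Real usage also depends on allocator overhead and activation checkpointing, which the sketch ignores.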
* docs(dynacell): note manifest registry drift policy in README
Direct follow-up to 2395e1d (bundled manifest registry). Explains the
two-source-of-truth split: VisCy ships the manifest content
(``src/dynacell/_manifests/``, auto-discovered via the
``dynacell.manifest_roots`` entry point), and ``dynacell-paper`` remains
the source of truth for manifest authoring. When a new plate lands in
``dynacell-paper``, mirror it back into VisCy; ``test_manifest_sync.py``
catches drift when ``DYNACELL_PAPER_PATH`` is set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
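The drift check described above (canonical must be a subset of the bundled copy, which may carry additive fields) can be sketched as a recursive comparison. This is an illustration under stated assumptions, not the contents of ``test_manifest_sync.py``: the helper name `is_subset` is hypothetical, the manifests are shown as already-parsed dicts with made-up keys other than ``gt_cache_dir``, and lists are compared by plain equality.

```python
def is_subset(canonical, bundled):
    """True if every key/value in `canonical` also appears in `bundled`.

    Nested dicts are checked recursively; any other value (including lists)
    must compare equal. Extra keys in `bundled` are allowed.
    """
    if isinstance(canonical, dict):
        return isinstance(bundled, dict) and all(
            key in bundled and is_subset(value, bundled[key])
            for key, value in canonical.items()
        )
    return canonical == bundled


# Made-up manifest contents for illustration only.
canonical = {"name": "plate_a", "target": {"splits": "splits/plate_a.yml"}}
bundled_ok = {
    "name": "plate_a",
    "target": {"splits": "splits/plate_a.yml"},
    "gt_cache_dir": "/tmp/gt_cache",  # additive field, allowed
}
bundled_drifted = {"name": "plate_a", "target": {"splits": "splits/old.yml"}}

print(is_subset(canonical, bundled_ok))       # True
print(is_subset(canonical, bundled_drifted))  # False
```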
* feat(dynacell): submit_benchmark_job.py supports optional --dependency, --parsable (#416)
* feat(dynacell): submit_benchmark_job.py supports optional --dependency, --parsable
Adds two opt-in flags to enable orchestration that needs to chain
sbatch jobs by ID:
- --dependency afterok:<job_id> appends --dependency=<value> to the
sbatch invocation. Default off.
- --parsable invokes sbatch with --parsable (which makes sbatch print
just the numeric job ID instead of "Submitted batch job <id>"),
captures that ID, and forwards it to stdout. Default off.
Both flags default off. Manual submit_benchmark_job.py <leaf>
invocations build the same ["sbatch", str(sbatch_path)] command as
before, with stdout untouched — every existing workflow (smoke jobs,
ad-hoc submissions, model-iteration scripts) sees no behavior change.
Why: dynacell-paper's upcoming benchmarks orchestrator (Phase 5F)
needs to capture train job IDs so it can submit predict with
--dependency=afterok:<train_id>, chaining the train → predict pair
via SLURM. Without these flags, the orchestrator would have to parse
"Submitted batch job <id>" prose from a non-captured stdout — brittle.
Tests: 5 new monkeypatched cases covering the default shape (backward
compat) plus the three new flag combinations. No live sbatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
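The opt-in behavior described above can be sketched as a small command builder. This is a hedged illustration, not the code in submit_benchmark_job.py: the function name `build_sbatch_command` and the argparse wiring are assumptions, while ``--dependency`` and ``--parsable`` are existing sbatch options.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_sbatch_command(
    sbatch_path: Path,
    dependency: Optional[str] = None,
    parsable: bool = False,
) -> List[str]:
    """Build the sbatch argv; with both flags off it is ["sbatch", <path>]."""
    cmd = ["sbatch"]
    if parsable:
        cmd.append("--parsable")
    if dependency:
        cmd.append(f"--dependency={dependency}")
    cmd.append(str(sbatch_path))
    return cmd


def parse_args(argv=None) -> argparse.Namespace:
    # Hypothetical CLI surface mirroring the flags named in the commit.
    parser = argparse.ArgumentParser()
    parser.add_argument("sbatch_path", type=Path)
    parser.add_argument("--dependency", default=None)
    parser.add_argument("--parsable", action="store_true")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["train.sbatch", "--dependency", "afterok:12345", "--parsable"])
    print(build_sbatch_command(args.sbatch_path, args.dependency, args.parsable))
```

Keeping the builder a pure function makes the backward-compatible default easy to assert without invoking a live sbatch.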
* fix(dynacell): keep stderr attached + clarify --parsable wording
Two Copilot findings on PR #416:
1. ``--parsable`` previously used ``capture_output=True``, which
captures both stdout and stderr — sbatch warnings and diagnostics
were silently swallowed on success. Switch to
``stdout=subprocess.PIPE`` so stderr stays attached to the parent
and only the parsable line is captured.
2. ``--parsable`` help text claimed to "forward the parsed job ID"
but the implementation forwards sbatch's parsable stdout verbatim,
which can be ``job_id;cluster`` on multi-cluster setups. Update
wording to describe the forwarding accurately.
Test asserts the new kwarg shape (``stdout=PIPE``, no ``stderr``,
no ``capture_output``).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
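The difference between the two capture modes in the fix above can be shown with a stand-in process. This is a sketch under assumptions: `FAKE_SBATCH` is a made-up substitute for ``sbatch --parsable`` so the example runs without SLURM, and `run_parsable` is a hypothetical helper name. The ``job_id;cluster`` shape is the one the commit mentions for multi-cluster setups.

```python
import subprocess
import sys

# Stand-in for `sbatch --parsable`: one parsable line on stdout,
# one diagnostic on stderr.
FAKE_SBATCH = [
    sys.executable,
    "-c",
    "import sys; print('12345;clusterA'); print('warning: demo', file=sys.stderr)",
]


def run_parsable(cmd):
    """Capture stdout only; stderr stays attached to the parent process."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, text=True, check=True)
    # result.stderr is None here because stderr was not redirected.
    return result.stdout.strip()


raw = run_parsable(FAKE_SBATCH)
# A caller that needs only the numeric ID can split off the cluster suffix.
job_id = raw.split(";", 1)[0]
print(raw)
print(job_id)
```

With ``capture_output=True`` the warning line would end up in `result.stderr` and never reach the terminal unless the caller re-emitted it.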
---------
Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dynacell): bundle 4 missing A549 manifests + predict_set fragments
Adds bundled manifests for a549-mantis plates 2024_10_31, 2024_11_05,
2025_07_24, 2025_08_26 — the per-plate predict_set fragments below need
their dataset_ref to resolve at compose time, so the manifests are
mirrored into dynacell._manifests in parity with the canonical
dynacell-paper configs (enforced by test_manifest_sync.py).
Each predict_set fragment is a one-line dataset_ref redirect for one
plate, so existing per-plate predict leaves can switch plates without
touching the data block.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(dynacell): per-plate A549 predict leaves for ER + MITO across 5 models
Today's predict__a549_mantis.yml only covered ONE plate (ER: 2024_11_07,
MITO: 2024_11_21). The A549 SEC61B test budget spans 4 plates and TOMM20
spans 4 plates, so a single-plate predict leaf cannot cover the full
test set. Plan-time gap: per-plate iteration was deferred under the
assumption a single demo plate was e…
Member
Author
The 5 commits on main (waveorder>=3.0, QC metrics, zarr v2/v3 tests, predict_volume, LC CLI, normalize-by-time bugfix) were shipped as the 0.4.1 release and remain reachable via that tag. 0.5 is a full modular refactor with breaking changes and intentionally supersedes the 0.4.x API surface. All conflicts and main-side additions are resolved in favor of modular-viscy-staging; the merged tree is byte-identical to origin/modular-viscy-staging. Closes the conflict state on PR #373.