DeepCORO_CLIP is a multi-view foundation model for coronary angiography video-text analysis. It uses video-text contrastive learning to learn study-level representations from coronary angiography videos and associated reports, with downstream support for diagnostic, prognostic, and disease progression tasks.
- Title: DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation
- arXiv: https://arxiv.org/abs/2603.17675
- DOI: https://doi.org/10.48550/arXiv.2603.17675
- Submitted: March 18, 2026
- Trained with video-text contrastive learning on 203,808 angiography videos from 28,117 patients across 32,473 studies at the Montreal Heart Institute.
- Externally validated on 4,249 studies from the University of California, San Francisco.
- Uses multi-view aggregation with attention-based pooling for study-level assessment across multiple angiographic projections.
- Reported AUROC 0.888 for significant stenosis detection internally and 0.89 on external validation.
- Reported 13.6% mean absolute error against core-lab quantitative coronary angiography for stenosis estimation, compared with 19.0% for clinical reports.
- Also reports strong performance for chronic total occlusion, intracoronary thrombus, and coronary calcification detection.
- Transfer learning results in the paper include one-year MACE prediction with AUROC 0.79 and left ventricular ejection fraction estimation with 7.3% mean absolute error.
- The paper reports a mean deployment inference time of 4.2 seconds.
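The attention-based pooling mentioned above can be illustrated with a small, dependency-free sketch. It is not the model's implementation: the `query` vector stands in for a learned parameter, and the scaled dot-product scoring is an assumption chosen for illustration.

```python
import math

def attention_pool(view_embeddings, query):
    """Pool per-view embeddings into one study-level vector.

    view_embeddings: N vectors (lists of D floats), one per video/projection.
    query: hypothetical learned vector of D floats used to score each view.
    """
    dim = len(query)
    # Scaled dot-product score for each view.
    scores = [sum(q * v for q, v in zip(query, emb)) / math.sqrt(dim)
              for emb in view_embeddings]
    # Numerically stable softmax over the views.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the view embeddings gives the study-level vector.
    pooled = [sum(w * emb[d] for w, emb in zip(weights, view_embeddings))
              for d in range(dim)]
    return pooled, weights

views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
study_vector, weights = attention_pool(views, query=[2.0, 0.0])
```

Views that align with the query receive larger weights, so a study with many projections is summarized mostly by the ones the scoring considers informative.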
- Contrastive Learning: Train on video-report pairs using CLIP-style contrastive learning
- Single video mode: Process one video per study
- Multi-video mode: Process multiple videos per study with aggregation
- Linear Probing: Fine-tune the model for specific tasks using linear probing
- Multi-GPU Training: Support for distributed training across multiple GPUs
- Hyperparameter Optimization: Built-in support for Weights & Biases sweeps
- Automatic Mixed Precision: Optimized training with AMP
- Distributed Data Parallel: Efficient multi-GPU training
- Patch- vs. Video-level Reasoning: Expose all patch tokens, a single token per video, or a single token per study with two simple flags (aggregate and per_video_pool) in the VideoEncoder.
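As a rough illustration of CLIP-style contrastive learning on video-report pairs, the sketch below computes a symmetric softmax (InfoNCE) loss in plain Python. The temperature value and the toy embeddings are assumptions, and the repository's actual loss (including its SigLIP variant) may differ.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired, L2-normalized embeddings."""
    n = len(video_emb)
    # logits[i][j] = <video_i, text_j> / temperature
    logits = [[sum(a * b for a, b in zip(v, t)) / temperature for t in text_emb]
              for v in video_emb]
    # Video -> text direction: the matching report sits on the diagonal.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> video direction: same computation on the transposed matrix.
    transposed = [list(col) for col in zip(*logits)]
    t2v = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (v2t + t2v)

aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = clip_contrastive_loss(aligned, aligned)
loss_swapped = clip_contrastive_loss(aligned, swapped)
```

The loss is near zero when each video is most similar to its own report and grows when the pairs are mismatched.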
- CUDA-capable GPU
- Python 3.11+
- Clone the Repository:
  git clone https://github.com/HeartWise-AI/DeepCORO_CLIP.git
  cd DeepCORO_CLIP
- Set up Virtual Environment:
  pip install uv
  uv sync
- Activate Virtual Environment:
  source .venv/bin/activate
- Install yq, required to run scripts/run_sweep.sh:
  wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/bin/yq && \
  chmod +x /usr/bin/yq
- Log into Weights & Biases, required for sweeps:
  wandb login
- Make sure FFmpeg 4.4.x is installed, required for sweeps:
  which ffmpeg
  conda remove ffmpeg # remove if /opt/conda/bin/ffmpeg exists
  sudo apt update
  sudo apt install ffmpeg
  sudo apt install libavcodec-extra
  ffmpeg -version
The project uses configuration files located in the config/ directory:
- CLIP Training (config/clip/base_config.yaml):
  - Training parameters (epochs, batch size, learning rate)
  - Model architecture settings
  - Data loading parameters
  - Optimization settings
  - Video mode settings (single/multi)
  - Video aggregation parameters
- Linear Probing (config/linear_probing/base_config.yaml):
  - Task-specific parameters
  - Head structure configuration
  - Loss function settings
  - Backbone freezing options
- CLIP Training (config/clip/sweep_config_*.yaml, config/clip/sweep_siglip_output_dataset_*.yaml):
  - Hyperparameter search space for CLIP training
  - Supports both single and multi-video training
  - sweep_siglip_output_dataset_config.yaml surfaces conservative learning rate, grad-clipping, and SigLIP weighting knobs for unstable SigLIP runs
- Linear Probing (config/linear_probing/sweep_config.yaml):
  - Hyperparameter optimization for linear probing tasks
  - Task-specific parameter ranges
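One way to picture how a sweep sample relates to a base config is as a recursive override of a nested dictionary. The sketch below is illustrative only: the keys (`epochs`, `optimizer`, `lr`, `weight_decay`) are hypothetical and the repository's scripts may combine the YAML files differently.

```python
import copy

def apply_overrides(base, overrides):
    """Return a copy of `base` with `overrides` merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections instead of replacing them wholesale.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical keys, for illustration only; the real YAML schema may differ.
base_config = {"epochs": 50, "optimizer": {"lr": 1e-4, "weight_decay": 0.01}}
sweep_sample = {"optimizer": {"lr": 3e-5}}
run_config = apply_overrides(base_config, sweep_sample)
```

Only the values named in the sweep sample change; everything else is inherited from the base config, and the base dictionary itself is left untouched.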
Process multiple videos per study with aggregation:
# Single GPU training without logging results to wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/clip/base_config.yaml --selected_gpus 0 --use_wandb false --run_mode train
# Multi-GPU training with results logging on wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/clip/base_config.yaml --selected_gpus 0,1 --use_wandb true --run_mode train
# Multi-GPU hyperparameters fine-tuning - RunMode and UseWandb are forced to train and true respectively (see scripts/run_sweep.sh)
bash scripts/run_sweep.sh --base_config config/clip/base_config.yaml --sweep_config config/clip/sweep_config_single_video.yaml --selected_gpus 3 --count 5
# SigLIP output dataset stability sweep (tunes grad clipping, AMP, and SigLIP weighting)
bash scripts/run_sweep.sh --base_config config/clip/siglip_output_dataset_config.yaml --sweep_config config/clip/sweep_siglip_output_dataset_config.yaml --selected_gpus 0,1 --count 10
Not supported
Process inference data from the input CSV (rows where Split == 'inference'); works on a single GPU only
bash scripts/runner.sh --selected_gpus 0 --base_config config/clip/base_config.yaml --run_mode inference --use_wandb false
Not supported
Fine-tune the model for specific tasks using linear probing. A few example combinations:
# Single GPU training without logging results to wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/linear_probing/base_config.yaml --selected_gpus 0 --use_wandb false --run_mode train
# Multi-GPU training with results logging on wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/linear_probing/base_config.yaml --selected_gpus 0,1 --use_wandb true --run_mode train
# Multi-GPU hyperparameters fine-tuning - RunMode and UseWandb are forced to train and true respectively (see scripts/run_sweep.sh)
bash scripts/run_sweep.sh --base_config config/linear_probing/base_config.yaml --sweep_config config/linear_probing/sweep_config.yaml --selected_gpus 0,1 --count 5
Process validation data from input CSV (rows where Split == 'val') and compute CI for each head
bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode val --selected_gpus 1,2,3
Process test data from input CSV (rows where Split == 'test') and compute CI for each head
bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode test --selected_gpus 1,2,3
Process inference data from input CSV (rows where Split == 'inference')
bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode inference --selected_gpus 1,2,3
Optionally, you can build a Docker container to validate and run the inference pipeline. The current Docker workflow is centered on scripts/external_validation.py, which expects a regular comma-separated CSV input and will internally:
- convert DICOM cine files to AVI
- generate intermediate α-separated CSVs under /app/tmp
- run VasoVision preprocessing
- run DeepCORO linear-probing validation via scripts/runner.sh
Input CSV guidance:
- DICOMPath is required by scripts/external_validation.py
- for the full downstream DeepCORO validation step, the CSV must also contain the target columns expected by config/linear_probing/stenosis/docker_base_config.yaml
- use scripts/preprocess_dataset_template.csv as the input template for a custom dataset
- use scripts/prepare_input_for_preprocess.py to normalize a custom CSV into the expected format
The container must be able to see the files referenced in DICOMPath. If your CSV contains absolute host paths, mount that host path into the container at the exact same container path.
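The same-path mounting rule above can be sketched as a small helper that derives a `-v` argument from the DICOMPath column. The helper name `mounts_for_manifest` and the example paths are hypothetical; this is a convenience sketch, not part of the repository.

```python
import csv
import io
import posixpath

def mounts_for_manifest(csv_text, column="DICOMPath"):
    """Return docker `-v` arguments that mount the DICOM root at the same path."""
    rows = csv.DictReader(io.StringIO(csv_text))
    paths = [row[column] for row in rows if row.get(column)]
    if not paths:
        raise ValueError(f"no values found in column {column!r}")
    if not all(posixpath.isabs(p) for p in paths):
        raise ValueError("expected absolute paths in the manifest")
    # Deepest directory shared by every referenced file.
    root = posixpath.commonpath([posixpath.dirname(p) for p in paths])
    # Identical host and container path, so the CSV entries resolve unchanged.
    return ["-v", f"{root}:{root}"]

manifest = "DICOMPath\n/data/cath/study1/a.dcm\n/data/cath/study2/b.dcm\n"
mount_args = mounts_for_manifest(manifest)
```

If the files live on unrelated volumes, the shared root can collapse to `/`; in that case it is safer to mount each top-level folder separately.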
Model weights (VasoVision and DeepCORO-CLIP) are downloaded at build time using a Docker BuildKit secret. Place your api_key.json (containing HUGGING_FACE_API_KEY) in the project root, then build:
DOCKER_BUILDKIT=1 docker build \
--secret id=api_key,src=api_key.json \
-t deepcoro_clip-docker .
The API key is only used during the build and is not persisted in the final image.
Requirements:
- Provide a comma-separated CSV with a DICOMPath column.
- If you want the full DeepCORO validation stage to run, include the task/label columns expected by config/linear_probing/stenosis/docker_base_config.yaml.
- Mount the host folder containing the DICOM files at the same absolute path used in DICOMPath.
- Set EXTERNAL_VALIDATION_DATA_PATH to the container path of your CSV.
- Mount a host folder to /workspace/results if you want to keep outputs after the container exits.
- Optionally mount a host folder to /app/tmp if you also want the converted AVI files and other temporary/intermediate artifacts.
Example:
docker run --rm --gpus all --ipc=host \
-e EXTERNAL_VALIDATION_DATA_PATH=/app/data/input.csv \
-v /path/to/your/dicoms:/path/to/your/dicoms \
-v /path/to/your/input.csv:/app/data/input.csv \
-v /path/to/your/results:/workspace/results \
-v /path/to/your/tmp:/app/tmp \
deepcoro_clip-docker \
python scripts/external_validation.py
Outputs and intermediate files:
- temporary converted AVIs and intermediate CSVs are written under /app/tmp and are persisted to the host if you mount /path/to/your/tmp:/app/tmp
- final results are written under the configured output_folder (default: results/, which maps to /workspace/results in the example above)
- all pipeline CSV files are also copied into /workspace/results/csv_artifacts/ for easier host-side access
- typical CSV artifacts include: tmp__df_preprocessed.csv for the post-DICOM-conversion manifest, tmp__df_vaso_info.csv for VasoVision/Orion predictions, tmp__df_preprocessed_filtered.csv for the merged and filtered dataframe passed to DeepCORO, and results__...prediction...csv files for final VasoVision and DeepCORO predictions
Notes:
- scripts/external_validation.py reads EXTERNAL_VALIDATION_DATA_PATH from the environment.
- the downstream DeepCORO stage defaults to inference mode in the Docker workflow; set DEEPCORO_RUN_MODE=auto to restore schema-based selection or DEEPCORO_RUN_MODE=val to force validation
- set EXTERNAL_VALIDATION_SKIP_VASOVISION=true to bypass VasoVision entirely and run DeepCORO on all converted videos without the diagnostic/contrast/main-structure filtering step
- The script expects the input manifest to be comma-separated, not α-separated.
- A DICOMPath-only CSV is enough for the DICOM-to-AVI and VasoVision stages, but not for the full DeepCORO validation stage unless the downstream config is adjusted.
- scripts/runner.sh now uses the active virtual environment inside the container, so no .venv symlink workaround is needed.
- --ipc=host is recommended for this inference pipeline because PyTorch video loading can exhaust Docker's default shared memory allocation. If you do not want to share the host IPC namespace, use a sufficiently large --shm-size instead.
- if you use --rm without mounting /workspace/results, the generated result files are removed with the container.
- it is better to create the host output folders yourself first, for example mkdir -p /path/to/your/results /path/to/your/tmp, to avoid Docker creating them as root.
- files written through the mounted results folder are typically owned by root because the container runs as root by default; if you want host-owned outputs, you can try adding --user $(id -u):$(id -g) to docker run if your mounted folders have compatible permissions.
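Since the script expects a comma-separated manifest with a DICOMPath column, a quick pre-flight check on the header line can catch the most common mistakes before starting the container. The function below is a hypothetical helper, and `StudyInstanceUID` is only an example of an extra column.

```python
import csv
import io

def check_manifest_header(first_line, required=("DICOMPath",)):
    """Check that a manifest header is comma-separated and has the required columns."""
    # An α-separated file has no commas in its header but does contain "α".
    if "α" in first_line and "," not in first_line:
        raise ValueError("manifest looks α-separated; expected a comma-separated CSV")
    header = next(csv.reader(io.StringIO(first_line)))
    columns = [name.strip() for name in header]
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    return columns

columns = check_manifest_header("DICOMPath,StudyInstanceUID\n")
```

For the full validation stage, the `required` tuple could be extended with the label columns named in the Docker base config.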
- Multiscale Vision Transformer (mVIT) backbone
- Configurable number of heads and layers
- Support for pretrained weights
- Optional backbone freezing
- New flags for fine-grained control over the output:
  - aggregate=True (default) → returns one study-level vector [B, D].
  - aggregate=False, per_video_pool=True → returns one token per video [B, N, D], ready for MIL / linear probing heads.
  - aggregate=False, per_video_pool=False → returns all patch tokens [B, N·L, D]; this is the only setting that preserves all tokens, for the most detailed downstream reasoning.
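To make the three output shapes concrete, here is a toy sketch for a single study (the batch dimension B is omitted). Mean pooling is used purely as a stand-in assumption; the real encoder's pooling and aggregation are learned and are not reproduced here.

```python
def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]

def shape_outputs(patch_tokens, aggregate=True, per_video_pool=False):
    """Mimic the three output modes for one study.

    patch_tokens: N videos, each a list of L patch tokens, each a list of D floats.
    """
    per_video = [mean_vectors(video) for video in patch_tokens]  # [N, D]
    if aggregate:
        return mean_vectors(per_video)                           # [D]
    if per_video_pool:
        return per_video                                         # [N, D]
    # Keep every patch token, flattened across videos.
    return [tok for video in patch_tokens for tok in video]      # [N*L, D]

# One study with N=2 videos, L=3 patch tokens per video, D=2 features.
study = [
    [[1.0, 0.0], [3.0, 0.0], [2.0, 0.0]],
    [[0.0, 4.0], [0.0, 2.0], [0.0, 0.0]],
]
study_vec = shape_outputs(study, aggregate=True)
video_tokens = shape_outputs(study, aggregate=False, per_video_pool=True)
all_tokens = shape_outputs(study, aggregate=False, per_video_pool=False)
```

Each step down the list keeps more tokens: one vector per study, then one per video, then all N·L patch tokens.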
Example (video-level MIL):
from models.video_encoder import VideoEncoder
from models.multi_instance_linear_probing import MultiInstanceLinearProbing
encoder = VideoEncoder(
backbone="mvit",
aggregate=False, # skip internal aggregator
aggregate_videos_tokens=True, # one token per video
)
probe = MultiInstanceLinearProbing(
embedding_dim=encoder.embedding_dim,
head_structure={"severity": 4},
pooling_mode="attention",
)
video_batch = ... # [B, N, T, H, W, C]
feats = encoder(video_batch) # [B, N, D]
logits = probe(feats) # dict with head outputs
- BioMedBERT for medical text encoding
- Configurable freezing ratio
- Contrastive learning with video features
- Task-specific classification heads
- Configurable dropout and architecture
- Support for multiple output classes per head
We use pre-commit hooks to ensure code quality and consistency:
# Install pre-commit
uv pip install pre-commit
pre-commit install
# Run hooks manually
pre-commit run --all-files

| GPU Memory | Recommended Batch Size | Command |
|---|---|---|
| 8GB | 4-8 | --batch-size 8 |
| 12GB | 8-16 | --batch-size 16 |
| 16GB | 16-24 | --batch-size 24 |
| 24GB+ | 24-32 | --batch-size 32 |
- Batch Size Selection:
  - Start with smaller batch sizes and increase if memory allows
  - Larger batch sizes generally allow faster training
  - Reduce if you get OOM errors
- Number of Workers:
  - Rule of thumb: num_workers = 4 * num_gpus
  - Reduce if you get memory or file handle errors
  - Example: --num-workers 2 for slower storage systems
- Learning Rate:
  - Default (1e-4) works well for most cases
  - For larger batch sizes: lr = 1e-4 * (batch_size/32)
  - Example: --lr 2e-4 for batch size 64
- Number of Epochs:
  - Default (50) is good for most cases
  - Increase for better performance: --epochs 100
  - Decrease for quick experiments: --epochs 10
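The two rules of thumb above (linear learning-rate scaling and four workers per GPU) can be written as small helpers. The function names are illustrative and not part of the repository.

```python
def scaled_lr(batch_size, base_lr=1e-4, base_batch_size=32):
    """Linear scaling rule: lr = base_lr * (batch_size / base_batch_size)."""
    return base_lr * (batch_size / base_batch_size)

def suggested_num_workers(num_gpus, per_gpu=4):
    """Rule of thumb: num_workers = 4 * num_gpus."""
    return per_gpu * num_gpus

lr_for_64 = scaled_lr(64)               # matches the --lr 2e-4 example
workers_for_2_gpus = suggested_num_workers(2)
```

These values are starting points; reduce the worker count on slow storage and treat the scaled learning rate as an initial guess to validate.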
- Out of Memory (OOM):
  - Reduce batch size
  - Use gradient accumulation
  - Force single GPU mode
- GPU Selection:
  - Use CUDA_VISIBLE_DEVICES to select specific GPUs
  - Monitor GPU usage with nvidia-smi
- Training Speed:
  - Multi-GPU isn't always faster due to overhead
  - Start with single GPU and scale up if needed
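Gradient accumulation, listed above as an OOM remedy, trades memory for extra forward/backward passes: gradients from several small micro-batches are averaged before one optimizer step. The toy example below uses a one-parameter least-squares model (an assumption for illustration, unrelated to the actual training loop) to show that the averaged gradient equals the large-batch gradient.

```python
def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w for (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, micro_batch_size):
    """Average micro-batch gradients to emulate one large-batch gradient."""
    chunks = [batch[i:i + micro_batch_size]
              for i in range(0, len(batch), micro_batch_size)]
    # Dividing each micro-batch gradient by the number of chunks keeps the
    # result equal to the full-batch mean when all chunks have the same size.
    return sum(grad_mse(w, chunk) / len(chunks) for chunk in chunks)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full = grad_mse(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
```

With equally sized micro-batches the two gradients match, so a batch of 32 can be emulated by, for example, four micro-batches of 8 at roughly a quarter of the activation memory.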
- GPU Memory Usage:
  nvidia-smi -l 1 # Monitor GPU usage every second
- Training Progress:
  - Progress bar shows current epoch and batch
  - Loss values are printed every 10 batches
  - Checkpoints are saved every 5 epochs
- WandB Logging:
  - Training metrics are logged to Weights & Biases
  - Includes loss, learning rate, batch size
  - Access via WandB dashboard
heartwise-ai-deepcoro_clip/
├── config/ # Configuration files
│   ├── clip/ # CLIP training configs
│   └── linear_probing/ # Linear probing configs
├── dataloaders/ # Data loading modules
├── dataset_creation/ # How MHI dataset was built
├── docs/ # Documentation on CLS-Token implementation
├── models/ # Neural network models
├── projects/ # Project implementations
├── runners/ # Training runners
├── scripts/ # Training scripts
├── utils/ # Utility functions
└── tests/ # Unit test pipeline
Contributions to the DeepCORO_CLIP repository are welcome! Please follow these steps to contribute:
- Fork the repository
- Create a new branch for your feature or bug fix
- Make your changes and commit them with clear, descriptive messages
- Push your changes to your fork
- Submit a pull request to the main repository
If you find this repository useful, please cite our work:
@article{harrabi2026deepcoro_clip,
title={DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation},
author={Harrabi, Sarra and Wu, Yichen and Tison, Geoffrey H. and Ansari, Minhaj and Vukadinovic, Milos and Ouyang, David and Barrios, Joshua P. and Delfrate, Jacques and Avram, Robert},
journal={arXiv preprint arXiv:2603.17675},
year={2026},
doi={10.48550/arXiv.2603.17675}
}