DeepCORO_CLIP

DeepCORO_CLIP is a multi-view foundation model for coronary angiography video-text analysis. It uses video-text contrastive learning to learn study-level representations from coronary angiography videos and associated reports, with downstream support for diagnostic, prognostic, and disease progression tasks.

📘 Paper

Title: DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation
arXiv: https://arxiv.org/abs/2603.17675
DOI: https://doi.org/10.48550/arXiv.2603.17675
Submitted: March 18, 2026

Highlights From the Paper

Trained with video-text contrastive learning on 203,808 angiography videos from 28,117 patients across 32,473 studies at the Montreal Heart Institute.
Externally validated on 4,249 studies from the University of California, San Francisco.
Uses multi-view aggregation with attention-based pooling for study-level assessment across multiple angiographic projections.
Reported AUROC 0.888 for significant stenosis detection internally and 0.89 on external validation.
Reported 13.6% mean absolute error against core-lab quantitative coronary angiography for stenosis estimation, compared with 19.0% for clinical reports.
Also reports strong performance for chronic total occlusion, intracoronary thrombus, and coronary calcification detection.
Transfer learning results in the paper include one-year MACE prediction with AUROC 0.79 and left ventricular ejection fraction estimation with 7.3% mean absolute error.
The paper reports a mean deployment inference time of 4.2 seconds.
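
The attention-based pooling mentioned above can be pictured as a softmax-weighted average of per-video embeddings. The following is a minimal sketch, not the repository's implementation: the function name `attention_pool` and the fixed `query` vector are assumptions made for illustration, whereas a trained model would learn its attention parameters.

```python
import math

def attention_pool(video_embeddings, query):
    """Pool N per-video vectors into one study-level vector.

    Each video is scored by its dot product with `query`, the scores are
    converted to weights with a softmax, and the weighted sum is returned.
    """
    scores = [sum(q * x for q, x in zip(query, emb)) for emb in video_embeddings]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(video_embeddings[0])
    pooled = [
        sum(w * emb[d] for w, emb in zip(weights, video_embeddings))
        for d in range(dim)
    ]
    return pooled, weights

# Three hypothetical 2-D embeddings, one per angiographic view of a study
study = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(study, query=[1.0, 1.0])
print(weights)
print(pooled)
```

Views that score higher against the query contribute more to the study-level vector, which is the intuition behind aggregating several projections into one representation.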

🚀 Features

Contrastive Learning: Train on video-report pairs using CLIP-style contrastive learning
- Single video mode: Process one video per study
- Multi-video mode: Process multiple videos per study with aggregation
Linear Probing: Fine-tune the model for specific tasks using linear probing
Multi-GPU Training: Support for distributed training across multiple GPUs
Hyperparameter Optimization: Built-in support for Weights & Biases sweeps
Automatic Mixed Precision: Optimized training with AMP
Distributed Data Parallel: Efficient multi-GPU training
Patch- vs. Video-level Reasoning: Expose all patch tokens, a single token per video, or a single token per study with two simple flags (aggregate and per_video_pool) in the VideoEncoder.
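
As a rough illustration of the CLIP-style objective listed above, the sketch below computes a symmetric contrastive (InfoNCE) loss over a batch of matched video and report embeddings in plain Python. It is a simplified stand-in rather than the repository's training code; the function names and the temperature of 0.07 are assumptions.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def clip_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; pair i of videos matches pair i of texts."""
    v = [normalize(e) for e in video_emb]
    t = [normalize(e) for e in text_emb]
    n = len(v)
    # logits[i][j] is the cosine similarity of video i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # video -> text direction: the correct text for video i is on the diagonal
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # text -> video direction: same computation on the transposed matrix
    logits_t = [list(col) for col in zip(*logits)]
    t2v = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)

videos = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each video is most similar to its own report and large when the pairs are mismatched, which is what drives the two encoders toward a shared embedding space.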

🛠️ Environment Setup

Prerequisites

CUDA-capable GPU
Python 3.11+

Steps

📥 Clone the Repository:

```
git clone https://github.com/HeartWise-AI/DeepCORO_CLIP.git
cd DeepCORO_CLIP
```

Set up Virtual Environment:
```
pip install uv
uv sync
```
Activate Virtual Environment:
```
source .venv/bin/activate
```

Install yq (required to run scripts/run_sweep.sh):
```
wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/bin/yq && \
chmod +x /usr/bin/yq
```

Log in to Weights & Biases (required for sweeps):
```
wandb login
```

Make sure FFmpeg 4.4.x is installed (required for sweeps):
```
which ffmpeg
conda remove ffmpeg # remove if /opt/conda/bin/ffmpeg exists
sudo apt update
sudo apt install ffmpeg
sudo apt install libavcodec-extra
ffmpeg -version
```

📄 Configuration Files

The project uses configuration files located in the config/ directory:

Base Configurations

CLIP Training (config/clip/base_config.yaml):
- Training parameters (epochs, batch size, learning rate)
- Model architecture settings
- Data loading parameters
- Optimization settings
- Video mode settings (single/multi)
- Video aggregation parameters
Linear Probing (config/linear_probing/base_config.yaml):
- Task-specific parameters
- Head structure configuration
- Loss function settings
- Backbone freezing options

Sweep Configurations

CLIP Training (config/clip/sweep_config_*.yaml, config/clip/sweep_siglip_output_dataset_*.yaml):
- Hyperparameter search space for CLIP training
- Supports both single and multi-video training
- sweep_siglip_output_dataset_config.yaml surfaces conservative learning rate, grad-clipping, and SigLIP weighting knobs for unstable SigLIP runs
Linear Probing (config/linear_probing/sweep_config.yaml):
- Hyperparameter optimization for linear probing tasks
- Task-specific parameter ranges
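
For orientation, a Weights & Biases sweep definition has three main top-level keys: `method`, `metric`, and `parameters`. The sketch below writes such a definition as a Python dictionary, which mirrors the structure of a YAML sweep file. The metric name, parameter names, and ranges are placeholders chosen for illustration and are not taken from the files in config/.

```python
# Hypothetical search space; names and ranges are placeholders, not the repo's values.
sweep_config = {
    "method": "bayes",  # search strategy: "grid", "random", or "bayes"
    "metric": {"name": "val/loss", "goal": "minimize"},
    "parameters": {
        # continuous range sampled on a log scale
        "lr": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
        # discrete choices
        "batch_size": {"values": [8, 16, 32]},
    },
}

def missing_sweep_keys(config):
    """Return the top-level keys a sweep definition still lacks."""
    required = ("method", "metric", "parameters")
    return [key for key in required if key not in config]

print(missing_sweep_keys(sweep_config))
```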

💻 Run Modes

1. Contrastive Learning (CLIP)

Train the model on video-report pairs using contrastive learning

Process multiple videos per study with aggregation:

```
# Single GPU training without logging results to wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/clip/base_config.yaml --selected_gpus 0 --use_wandb false --run_mode train

# Multi-GPU training with results logging on wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/clip/base_config.yaml --selected_gpus 0,1 --use_wandb true --run_mode train

# Hyperparameter sweep: --run_mode and --use_wandb are forced to train and true, respectively (see scripts/run_sweep.sh)
bash scripts/run_sweep.sh --base_config config/clip/base_config.yaml --sweep_config config/clip/sweep_config_single_video.yaml --selected_gpus 3 --count 5

# SigLIP output dataset stability sweep (tunes grad clipping, AMP, and SigLIP weighting)
bash scripts/run_sweep.sh --base_config config/clip/siglip_output_dataset_config.yaml --sweep_config config/clip/sweep_siglip_output_dataset_config.yaml --selected_gpus 0,1 --count 10
```

Run validation

Not supported

Run inference

Process inference data from the input CSV (rows where Split == 'inference'). This mode works on a single GPU only.

```
bash scripts/runner.sh --selected_gpus 0 --base_config config/clip/base_config.yaml --run_mode inference --use_wandb false
```

Run test

Not supported

2. Linear Probing

Fine-tune the model for specific tasks using linear probing. A few example combinations:

```
# Single GPU training without logging results to wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/linear_probing/base_config.yaml --selected_gpus 0 --use_wandb false --run_mode train

# Multi-GPU training with results logging on wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/linear_probing/base_config.yaml --selected_gpus 0,1 --use_wandb true --run_mode train

# Multi-GPU hyperparameter sweep: --run_mode and --use_wandb are forced to train and true, respectively (see scripts/run_sweep.sh)
bash scripts/run_sweep.sh --base_config config/linear_probing/base_config.yaml --sweep_config config/linear_probing/sweep_config.yaml --selected_gpus 0,1 --count 5
```

Run validation

Process validation data from the input CSV (rows where Split == 'val') and compute a confidence interval (CI) for each head

```
bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode val --selected_gpus 1,2,3
```

Run test

Process test data from the input CSV (rows where Split == 'test') and compute a confidence interval (CI) for each head

```
bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode test --selected_gpus 1,2,3
```
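
This README does not state how the per-head confidence intervals are computed. One common choice is the percentile bootstrap, sketched below under that assumption; the function name, the resample count, and the example data are all hypothetical.

```python
import random

def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # resample n items with replacement and record their mean
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical per-study correctness (1 = correct prediction) for one head
correct = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]
point = sum(correct) / len(correct)
low, high = bootstrap_ci(correct)
print(f"accuracy={point:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

The same resampling loop applies to any per-head metric: replace the mean with the metric of interest computed on each resample.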

Run inference

Process inference data from the input CSV (rows where Split == 'inference')

```
bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode inference --selected_gpus 1,2,3
```
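
Each run mode above selects rows by the value of the Split column. The sketch below shows that selection with the standard library. Only the column name `Split` and its values come from this README; the `FileName` column, the file names, and the comma delimiter are assumptions, and the repository's own manifests may use a different delimiter.

```python
import csv
import io

# Hypothetical manifest standing in for the input CSV
manifest = io.StringIO(
    "FileName,Split\n"
    "a.avi,train\n"
    "b.avi,val\n"
    "c.avi,test\n"
    "d.avi,inference\n"
)

def rows_for_mode(handle, run_mode):
    """Return the manifest rows whose Split value equals the run mode."""
    return [row for row in csv.DictReader(handle) if row["Split"] == run_mode]

val_files = [row["FileName"] for row in rows_for_mode(manifest, "val")]
print(val_files)
```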

🐳 Docker Setup

Optionally, you can build a Docker container to validate and run the inference pipeline. The current Docker workflow is centered on scripts/external_validation.py, which expects a regular comma-separated CSV input and will internally:

convert DICOM cine files to AVI
generate intermediate α-separated CSVs under /app/tmp
run VasoVision preprocessing
run DeepCORO linear-probing validation via scripts/runner.sh

Input CSV guidance:

DICOMPath is required by scripts/external_validation.py
for the full downstream DeepCORO validation step, the CSV must also contain the target columns expected by config/linear_probing/stenosis/docker_base_config.yaml
use scripts/preprocess_dataset_template.csv as the input template for a custom dataset
use scripts/prepare_input_for_preprocess.py to normalize a custom CSV into the expected format

The container must be able to see the files referenced in DICOMPath. If your CSV contains absolute host paths, mount that host path into the container at the exact same container path.
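
To find a single host directory to mount, one option is to compute the deepest directory shared by every DICOMPath entry. The sketch below does this for a small in-memory CSV; the helper name `mount_root` and the example paths are hypothetical, and only the `DICOMPath` column name comes from this README.

```python
import csv
import io
import posixpath

def mount_root(csv_handle):
    """Return the deepest directory that contains every DICOMPath entry."""
    # Compare parent directories so a single-row CSV still yields a directory
    parents = [
        posixpath.dirname(row["DICOMPath"]) for row in csv.DictReader(csv_handle)
    ]
    return posixpath.commonpath(parents)

example = io.StringIO(
    "DICOMPath\n"
    "/data/angio/study1/run1.dcm\n"
    "/data/angio/study2/run4.dcm\n"
)
root = mount_root(example)
# Mounting the directory at the same path keeps the CSV entries valid in the container
print(f"-v {root}:{root}")
```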

Build Docker Image

Model weights (VasoVision and DeepCORO-CLIP) are downloaded at build time using a Docker BuildKit secret. Place your api_key.json (containing HUGGING_FACE_API_KEY) in the project root, then build:

```
DOCKER_BUILDKIT=1 docker build \
  --secret id=api_key,src=api_key.json \
  -t deepcoro_clip-docker .
```

The API key is only used during the build and is not persisted in the final image.
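
Before starting a long build, it can help to confirm that api_key.json parses and holds a non-empty key. The sketch below assumes the file is a flat JSON object with a `HUGGING_FACE_API_KEY` string field; that layout is a guess based on the description above, and the token value is a placeholder.

```python
import json
import os
import tempfile

def has_hugging_face_key(path):
    """Return True if the JSON file holds a non-empty HUGGING_FACE_API_KEY string."""
    with open(path) as f:
        data = json.load(f)
    token = data.get("HUGGING_FACE_API_KEY", "")
    return isinstance(token, str) and token.strip() != ""

# Demonstrate on a throwaway file so the sketch does not touch a real key
with tempfile.TemporaryDirectory() as tmp:
    key_path = os.path.join(tmp, "api_key.json")
    with open(key_path, "w") as f:
        json.dump({"HUGGING_FACE_API_KEY": "hf_placeholder"}, f)
    key_ok = has_hugging_face_key(key_path)

print(key_ok)
```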

Run Docker

Requirements:

Provide a comma-separated CSV with a DICOMPath column.
If you want the full DeepCORO validation stage to run, include the task/label columns expected by config/linear_probing/stenosis/docker_base_config.yaml.
Mount the host folder containing the DICOM files at the same absolute path used in DICOMPath.
Set EXTERNAL_VALIDATION_DATA_PATH to the container path of your CSV.
Mount a host folder to /workspace/results if you want to keep outputs after the container exits.
Optionally mount a host folder to /app/tmp if you also want the converted AVI files and other temporary/intermediate artifacts.

Example:

```
docker run --rm --gpus all --ipc=host \
  -e EXTERNAL_VALIDATION_DATA_PATH=/app/data/input.csv \
  -v /path/to/your/dicoms:/path/to/your/dicoms \
  -v /path/to/your/input.csv:/app/data/input.csv \
  -v /path/to/your/results:/workspace/results \
  -v /path/to/your/tmp:/app/tmp \
  deepcoro_clip-docker \
  python scripts/external_validation.py
```

Outputs and intermediate files:

temporary converted AVIs and intermediate CSVs are written under /app/tmp and are persisted to the host if you mount /path/to/your/tmp:/app/tmp
final results are written under the configured output_folder (default: results/, which maps to /workspace/results in the example above)
all pipeline CSV files are also copied into /workspace/results/csv_artifacts/ for easier host-side access
typical CSV artifacts include: tmp__df_preprocessed.csv for the post-DICOM-conversion manifest, tmp__df_vaso_info.csv for VasoVision/Orion predictions, tmp__df_preprocessed_filtered.csv for the merged and filtered dataframe passed to DeepCORO, and results__...prediction...csv files for final VasoVision and DeepCORO predictions

Notes:

scripts/external_validation.py reads EXTERNAL_VALIDATION_DATA_PATH from the environment.
the downstream DeepCORO stage defaults to inference mode in the Docker workflow; set DEEPCORO_RUN_MODE=auto to restore schema-based selection or DEEPCORO_RUN_MODE=val to force validation
set EXTERNAL_VALIDATION_SKIP_VASOVISION=true to bypass VasoVision entirely and run DeepCORO on all converted videos without the diagnostic/contrast/main-structure filtering step
The script expects the input manifest to be comma-separated, not α-separated.
A DICOMPath-only CSV is enough for the DICOM-to-AVI and VasoVision stages, but not for the full DeepCORO validation stage unless the downstream config is adjusted.
scripts/runner.sh now uses the active virtual environment inside the container, so no .venv symlink workaround is needed.
--ipc=host is recommended for this inference pipeline because PyTorch video loading can exhaust Docker's default shared memory allocation. If you do not want to share the host IPC namespace, use a sufficiently large --shm-size instead.
if you use --rm without mounting /workspace/results, the generated result files are removed with the container.
it is better to create the host output folders yourself first, for example mkdir -p /path/to/your/results /path/to/your/tmp, to avoid Docker creating them as root.
files written through the mounted results folder are typically owned by root because the container runs as root by default; if you want host-owned outputs, you can try adding --user $(id -u):$(id -g) to docker run if your mounted folders have compatible permissions.
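
The notes above describe three values for DEEPCORO_RUN_MODE. The sketch below shows one way such a setting could be resolved. It is an illustration only: the function name is hypothetical, and the interpretation of "schema-based selection" (validate when all label columns are present, otherwise infer) is an assumption, not the behavior of scripts/external_validation.py.

```python
def resolve_run_mode(env, csv_columns, label_columns):
    """Pick the DeepCORO run mode from an environment mapping and the CSV schema."""
    mode = env.get("DEEPCORO_RUN_MODE", "inference")  # documented default
    if mode == "auto":
        # Assumed rule: labels present -> validation, otherwise inference
        has_labels = all(col in csv_columns for col in label_columns)
        return "val" if has_labels else "inference"
    return mode

# "stenosis_label" is a hypothetical label column name
columns = ["DICOMPath", "stenosis_label"]
print(resolve_run_mode({}, columns, ["stenosis_label"]))
print(resolve_run_mode({"DEEPCORO_RUN_MODE": "auto"}, columns, ["stenosis_label"]))
print(resolve_run_mode({"DEEPCORO_RUN_MODE": "auto"}, ["DICOMPath"], ["stenosis_label"]))
```

In real use the first argument would be `os.environ`; a plain dictionary is passed here so the sketch runs without changing the process environment.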

Model Architecture

Video Encoder

Multiscale Vision Transformer (MViT) backbone
Configurable number of heads and layers
Support for pretrained weights
Optional backbone freezing
New flags for fine-grained control over the output:
- aggregate=True (default) – returns one study-level vector [B, D].
- aggregate=False, per_video_pool=True – returns one token per video [B, N, D], ready for MIL / linear probing heads.
- aggregate=False, per_video_pool=False – returns all patch tokens [B, N·L, D]. This is the only setting that preserves every token, for the most detailed downstream reasoning.
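
The three flag combinations can be summarized as a small shape function. This sketch only restates the shapes listed above and does not call the real VideoEncoder; here B is the batch size, N the number of videos per study, L the number of patch tokens per video, and D the embedding dimension, and the example sizes are arbitrary.

```python
def encoder_output_shape(B, N, L, D, aggregate=True, per_video_pool=False):
    """Return the output shape described for each flag combination."""
    if aggregate:
        return (B, D)          # one study-level vector
    if per_video_pool:
        return (B, N, D)       # one token per video
    return (B, N * L, D)       # all patch tokens

# Arbitrary example sizes: 2 studies, 4 videos each, 392 tokens per video, 512-D
print(encoder_output_shape(2, 4, 392, 512))
print(encoder_output_shape(2, 4, 392, 512, aggregate=False, per_video_pool=True))
print(encoder_output_shape(2, 4, 392, 512, aggregate=False, per_video_pool=False))
```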

Example (video-level MIL):

```python
from models.video_encoder import VideoEncoder
from models.multi_instance_linear_probing import MultiInstanceLinearProbing

encoder = VideoEncoder(
    backbone="mvit",
    aggregate=False,        # skip internal aggregator
    aggregate_videos_tokens=True,    # one token per video
)

probe = MultiInstanceLinearProbing(
    embedding_dim=encoder.embedding_dim,
    head_structure={"severity": 4},
    pooling_mode="attention",
)

video_batch = ...                  # [B, N, T, H, W, C]
feats = encoder(video_batch)       # [B, N, D]
logits = probe(feats)              # dict with head output
```

Text Encoder

BioMedBERT for medical text encoding
Configurable freezing ratio
Contrastive learning with video features

Linear Probing Heads

Task-specific classification heads
Configurable dropout and architecture
Support for multiple output classes per head

Development Setup

We use pre-commit hooks to ensure code quality and consistency:

```
# Install pre-commit
uv pip install pre-commit
pre-commit install

# Run hooks manually
pre-commit run --all-files
```

Performance Guidelines

Recommended Batch Sizes by GPU Memory

| GPU Memory | Recommended Batch Size | Command |
|---|---|---|
| 8GB | 4-8 | `--batch-size 8` |
| 12GB | 8-16 | `--batch-size 16` |
| 16GB | 16-24 | `--batch-size 24` |
| 24GB+ | 24-32 | `--batch-size 32` |

Training Tips

Batch Size Selection:
- Start with smaller batch sizes and increase if memory allows
- Larger batch sizes generally allow faster training
- Reduce if you get OOM errors
Number of Workers:
- Rule of thumb: num_workers = 4 * num_gpus
- Reduce if you get memory or file handle errors
- Example: --num-workers 2 for slower storage systems
Learning Rate:
- Default (1e-4) works well for most cases
- For larger batch sizes: lr = 1e-4 * (batch_size/32)
- Example: --lr 2e-4 for batch size 64
Number of Epochs:
- Default (50) is good for most cases
- Increase for better performance: --epochs 100
- Decrease for quick experiments: --epochs 10
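
The two rules of thumb above translate directly into arithmetic. The helper names below are hypothetical; the formulas and default values are the ones stated in the tips.

```python
def scaled_lr(batch_size, base_lr=1e-4, base_batch_size=32):
    """Linear learning-rate scaling: lr = base_lr * (batch_size / base_batch_size)."""
    return base_lr * batch_size / base_batch_size

def suggested_workers(num_gpus, per_gpu=4):
    """Data-loader workers: num_workers = 4 * num_gpus."""
    return per_gpu * num_gpus

print(scaled_lr(64))          # learning rate for batch size 64
print(suggested_workers(2))   # workers for a 2-GPU run
```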

Common Issues

Out of Memory (OOM):
- Reduce batch size
- Use gradient accumulation
- Force single GPU mode
GPU Selection:
- Use CUDA_VISIBLE_DEVICES to select specific GPUs
- Monitor GPU usage with nvidia-smi
Training Speed:
- Multi-GPU isn't always faster due to overhead
- Start with single GPU and scale up if needed
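
Gradient accumulation helps with OOM errors because averaging gradients over several small micro-batches reproduces the gradient of one large batch, while only one micro-batch is in memory at a time. The sketch below checks this on a toy one-parameter model in plain Python; it is independent of the repository's training loop, and the data are made up.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Made-up (x, y) pairs and an arbitrary starting weight
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# Gradient computed on the full batch of four samples
full = grad_mse(w, data)

# Same gradient accumulated over two equal-sized micro-batches of two samples
micro_batches = [data[:2], data[2:]]
accumulated = sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)

print(abs(full - accumulated) < 1e-9)
```

The equality holds because the micro-batches have equal size; with unequal sizes the per-batch gradients would need to be weighted by their sample counts.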

Monitoring Training

GPU Memory Usage:

```
nvidia-smi -l 1  # Monitor GPU usage every second
```

Training Progress:

Progress bar shows current epoch and batch
Loss values are printed every 10 batches
Checkpoints are saved every 5 epochs

WandB Logging:

Training metrics are logged to Weights & Biases
Includes loss, learning rate, batch size
Access via WandB dashboard

Project Structure

```
heartwise-ai-deepcoro_clip/
├── config/                        # Configuration files
│   ├── clip/                     # CLIP training configs
│   └── linear_probing/           # Linear probing configs
├── dataloaders/                  # Data loading modules
├── dataset_creation/             # How MHI dataset was built
├── docs/                         # Documentation on CLS-Token implementation
├── models/                       # Neural network models
├── projects/                     # Project implementations
├── runners/                      # Training runners
├── scripts/                      # Training scripts
├── utils/                        # Utility functions
└── tests/                        # Unit test pipeline
```

🤝 Contributing

Contributions to the DeepCORO_CLIP repository are welcome! Please follow these steps to contribute:

Fork the repository
Create a new branch for your feature or bug fix
Make your changes and commit them with clear, descriptive messages
Push your changes to your fork
Submit a pull request to the main repository

📚 Citation

If you find this repository useful, please cite our work:

```bibtex
@article{harrabi2026deepcoro_clip,
  title={DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation},
  author={Harrabi, Sarra and Wu, Yichen and Tison, Geoffrey H. and Ansari, Minhaj and Vukadinovic, Milos and Ouyang, David and Barrios, Joshua P. and Delfrate, Jacques and Avram, Robert},
  journal={arXiv preprint arXiv:2603.17675},
  year={2026},
  doi={10.48550/arXiv.2603.17675}
}
```
