HeartWise-AI/DeepCORO_CLIP

DeepCORO_CLIP

DeepCORO_CLIP is a multi-view foundation model for coronary angiography video-text analysis. It uses video-text contrastive learning to learn study-level representations from coronary angiography videos and associated reports, with downstream support for diagnostic, prognostic, and disease progression tasks.

📘 Paper

Highlights From the Paper

  • Trained with video-text contrastive learning on 203,808 angiography videos from 28,117 patients across 32,473 studies at the Montreal Heart Institute.
  • Externally validated on 4,249 studies from the University of California, San Francisco.
  • Uses multi-view aggregation with attention-based pooling for study-level assessment across multiple angiographic projections.
  • Reported AUROC 0.888 for significant stenosis detection internally and 0.89 on external validation.
  • Reported 13.6% mean absolute error against core-lab quantitative coronary angiography for stenosis estimation, compared with 19.0% for clinical reports.
  • Also reports strong performance for chronic total occlusion, intracoronary thrombus, and coronary calcification detection.
  • Transfer learning results in the paper include one-year MACE prediction with AUROC 0.79 and left ventricular ejection fraction estimation with 7.3% mean absolute error.
  • The paper reports a mean deployment inference time of 4.2 seconds.

🚀 Features

  • Contrastive Learning: Train on video-report pairs using CLIP-style contrastive learning
    • Single video mode: Process one video per study
    • Multi-video mode: Process multiple videos per study with aggregation
  • Linear Probing: Fine-tune the model for specific tasks using linear probing
  • Multi-GPU Training: Support for distributed training across multiple GPUs
  • Hyperparameter Optimization: Built-in support for Weights & Biases sweeps
  • Automatic Mixed Precision: Optimized training with AMP
  • Distributed Data Parallel: Efficient multi-GPU training
  • Patch- vs. Video-level Reasoning: Expose all patch tokens, a single token per video, or a single token per study with two simple flags (aggregate and per_video_pool) in the VideoEncoder.

🛠️ Environment Setup

Prerequisites

  • CUDA-capable GPU
  • Python 3.11+

Steps

  1. 📥 Clone the Repository:

    git clone https://github.com/HeartWise-AI/DeepCORO_CLIP.git
    cd DeepCORO_CLIP
  2. Set up Virtual Environment:

    pip install uv
    uv sync
  3. Activate Virtual Environment:

    source .venv/bin/activate
  4. Install yq (required to run scripts/run_sweep.sh):

    wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/bin/yq && \
    chmod +x /usr/bin/yq
  5. Log into Weights & Biases (required for sweeps):

    wandb login
  6. Make sure FFmpeg 4.4.x is installed (required for sweeps):

    which ffmpeg
    conda remove ffmpeg # remove if /opt/conda/bin/ffmpeg exists
    sudo apt update
    sudo apt install ffmpeg
    sudo apt install libavcodec-extra
    ffmpeg -version
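Before launching a sweep, it can help to confirm the prerequisites above from Python. The following is a minimal sketch using only the standard library. The helper names (`parse_ffmpeg_version`, `check_environment`) are hypothetical and not part of the repository, and the parser assumes the usual `ffmpeg version X.Y.Z ...` first line printed by `ffmpeg -version`.

```python
import re
import shutil
import subprocess
import sys
from typing import Dict, Optional, Tuple


def parse_ffmpeg_version(first_line: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, ...) from the first line of `ffmpeg -version`."""
    match = re.search(r"ffmpeg version n?(\d+(?:\.\d+)*)", first_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def check_environment() -> Dict[str, bool]:
    """Report which prerequisites from the steps above appear to be satisfied."""
    status = {
        "python>=3.11": sys.version_info >= (3, 11),
        "yq on PATH": shutil.which("yq") is not None,
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
        "ffmpeg is 4.4.x": False,
    }
    if status["ffmpeg on PATH"]:
        result = subprocess.run(
            ["ffmpeg", "-version"], capture_output=True, text=True, check=False
        )
        lines = result.stdout.splitlines()
        version = parse_ffmpeg_version(lines[0]) if lines else None
        status["ffmpeg is 4.4.x"] = version is not None and version[:2] == (4, 4)
    return status


# Example: print one line per prerequisite.
# for name, ok in check_environment().items():
#     print("OK     " if ok else "MISSING", name)
```

The check only inspects what is on `PATH`; it does not verify that a GPU is usable or that `wandb login` has been run.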

📄 Configuration Files

The project uses configuration files located in the config/ directory:

Base Configurations

  1. CLIP Training (config/clip/base_config.yaml):

    • Training parameters (epochs, batch size, learning rate)
    • Model architecture settings
    • Data loading parameters
    • Optimization settings
    • Video mode settings (single/multi)
    • Video aggregation parameters
  2. Linear Probing (config/linear_probing/base_config.yaml):

    • Task-specific parameters
    • Head structure configuration
    • Loss function settings
    • Backbone freezing options

Sweep Configurations

  1. CLIP Training (config/clip/sweep_config_*.yaml, config/clip/sweep_siglip_output_dataset_*.yaml):

    • Hyperparameter search space for CLIP training
    • Supports both single and multi-video training
    • sweep_siglip_output_dataset_config.yaml surfaces conservative learning rate, grad-clipping, and SigLIP weighting knobs for unstable SigLIP runs
  2. Linear Probing (config/linear_probing/sweep_config.yaml):

    • Hyperparameter optimization for linear probing tasks
    • Task-specific parameter ranges
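Conceptually, each sweep run overrides a handful of base-configuration values with values sampled from the sweep's search space. The sketch below illustrates that override step on plain dictionaries. The keys (`lr`, `batch_size`, `max_grad_norm`) and the helper `apply_overrides` are hypothetical; they are not taken from the repository's YAML files or from scripts/run_sweep.sh.

```python
from typing import Any, Dict


def apply_overrides(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with sweep-sampled values layered on top."""
    unknown = set(overrides) - set(base)
    if unknown:
        # Fail early: a misspelled sweep key would otherwise add an unused entry.
        raise KeyError(f"Sweep parameters missing from base config: {sorted(unknown)}")
    merged = dict(base)
    merged.update(overrides)
    return merged


# Hypothetical values, for illustration only.
base_config = {"lr": 1e-4, "batch_size": 32, "max_grad_norm": 1.0}
sampled = {"lr": 3e-5, "max_grad_norm": 0.5}

run_config = apply_overrides(base_config, sampled)
print(run_config)
```

Rejecting unknown keys is a design choice of this sketch: it surfaces typos in a sweep file before any GPU time is spent.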

💻 Run Modes

1. Contrastive Learning (CLIP)

Train the model on video-report pairs using contrastive learning

Process multiple videos per study with aggregation:

# Single GPU training without logging results to wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/clip/base_config.yaml --selected_gpus 0 --use_wandb false --run_mode train

# Multi-GPU training with results logging on wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/clip/base_config.yaml --selected_gpus 0,1 --use_wandb true --run_mode train

# Hyperparameter sweep (single GPU in this example) - RunMode and UseWandb are forced to train and true respectively (see scripts/run_sweep.sh)
bash scripts/run_sweep.sh --base_config config/clip/base_config.yaml --sweep_config config/clip/sweep_config_single_video.yaml --selected_gpus 3 --count 5

# SigLIP output dataset stability sweep (tunes grad clipping, AMP, and SigLIP weighting)
bash scripts/run_sweep.sh --base_config config/clip/siglip_output_dataset_config.yaml --sweep_config config/clip/sweep_siglip_output_dataset_config.yaml --selected_gpus 0,1 --count 10
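When launching many runs from Python (for example from a notebook or a job scheduler), the training commands above can be assembled programmatically. This is a minimal sketch: `build_runner_command` is a hypothetical helper, and it only uses the scripts/runner.sh flags shown above.

```python
import shlex
from typing import List, Sequence


def build_runner_command(
    base_config: str,
    gpus: Sequence[int],
    run_mode: str = "train",
    use_wandb: bool = False,
) -> List[str]:
    """Build the argument list for scripts/runner.sh."""
    if not gpus:
        raise ValueError("At least one GPU index is required")
    return [
        "bash", "scripts/runner.sh",
        "--base_config", base_config,
        "--selected_gpus", ",".join(str(g) for g in gpus),
        "--use_wandb", "true" if use_wandb else "false",
        "--run_mode", run_mode,
    ]


cmd = build_runner_command("config/clip/base_config.yaml", gpus=[0, 1], use_wandb=True)
print(shlex.join(cmd))
# To execute from the repository root: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell-quoting problems if a config path contains spaces.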

Run validation

Not supported

Run inference

Process inference data from the input CSV (rows where Split == 'inference'). Inference currently works on a single GPU only.

bash scripts/runner.sh --selected_gpus 0 --base_config config/clip/base_config.yaml --run_mode inference --use_wandb false

Run test

Not supported

2. Linear Probing

Fine-tune the model for specific tasks using linear probing. A few example combinations:

# Single GPU training without logging results to wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/linear_probing/base_config.yaml --selected_gpus 0 --use_wandb false --run_mode train

# Multi-GPU training with results logging on wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/linear_probing/base_config.yaml --selected_gpus 0,1 --use_wandb true --run_mode train

# Multi-GPU hyperparameter sweep - RunMode and UseWandb are forced to train and true respectively (see scripts/run_sweep.sh)
bash scripts/run_sweep.sh --base_config config/linear_probing/base_config.yaml --sweep_config config/linear_probing/sweep_config.yaml --selected_gpus 0,1 --count 5

Run validation

Process validation data from the input CSV (rows where Split == 'val') and compute a confidence interval (CI) for each head

bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode val --selected_gpus 1,2,3

Run test

Process test data from the input CSV (rows where Split == 'test') and compute a confidence interval (CI) for each head

bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode test --selected_gpus 1,2,3

Run inference

Process inference data from the input CSV (rows where Split == 'inference')

bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode inference --selected_gpus 1,2,3
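Each run mode reads the rows of the input CSV whose `Split` column matches the mode (`val`, `test`, or `inference`, as noted above). The sketch below shows that selection with the standard `csv` module. The `FileName` column, the sample contents, and the `delimiter` default are assumptions made for illustration; this is not the repository's data loader.

```python
import csv
import io
from typing import Dict, List


def rows_for_split(csv_text: str, split: str, delimiter: str = ",") -> List[Dict[str, str]]:
    """Return the rows whose Split column equals `split`."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    if reader.fieldnames is None or "Split" not in reader.fieldnames:
        raise ValueError("Input CSV must contain a 'Split' column")
    return [row for row in reader if row["Split"] == split]


# Hypothetical manifest; FileName is a placeholder column name.
manifest = "FileName,Split\na.avi,train\nb.avi,val\nc.avi,test\nd.avi,inference\n"

print([row["FileName"] for row in rows_for_split(manifest, "val")])
```

A quick check like this before a long run can confirm that the chosen split is not empty.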

🐳 Docker Setup

Optionally, you can build a Docker container to validate and run the inference pipeline. The current Docker workflow is centered on scripts/external_validation.py, which expects a regular comma-separated CSV input and will internally:

  • convert DICOM cine files to AVI
  • generate intermediate α-separated CSVs under /app/tmp
  • run VasoVision preprocessing
  • run DeepCORO linear-probing validation via scripts/runner.sh
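For readers unfamiliar with the intermediate format, an α-separated CSV is an ordinary delimited text file that uses the Greek letter α as the field separator instead of a comma. The conversion can be sketched with the standard `csv` module as follows. This illustrates the file format only and is not the code used by scripts/external_validation.py; the `Note` column and the sample path are placeholders.

```python
import csv
import io


def comma_to_alpha(csv_text: str, separator: str = "α") -> str:
    """Re-serialize comma-separated text using `separator` between fields."""
    rows = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter=separator, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


# Placeholder path; "Note" is a hypothetical extra column.
comma_csv = 'DICOMPath,Note\n/data/study1/a.dcm,"LAD, proximal"\n'
print(comma_to_alpha(comma_csv))
```

One practical effect of the unusual separator is that free-text fields containing commas no longer need quoting.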

Input CSV guidance:

  • DICOMPath is required by scripts/external_validation.py
  • for the full downstream DeepCORO validation step, the CSV must also contain the target columns expected by config/linear_probing/stenosis/docker_base_config.yaml
  • use scripts/preprocess_dataset_template.csv as the input template for a custom dataset
  • use scripts/prepare_input_for_preprocess.py to normalize a custom CSV into the expected format

The container must be able to see the files referenced in DICOMPath. If your CSV contains absolute host paths, mount that host path into the container at the exact same container path.
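Since every `DICOMPath` entry has to resolve inside the container, one way to derive the required bind mounts is to collect the parent directories of all paths in the manifest and mount each at the same location. A minimal sketch, assuming absolute POSIX paths; `mount_flags_for_manifest` is a hypothetical helper, not part of the repository.

```python
import csv
import io
import posixpath
from typing import List


def mount_flags_for_manifest(csv_text: str) -> List[str]:
    """Return `-v host:container` flags covering every DICOMPath in the CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "DICOMPath" not in reader.fieldnames:
        raise ValueError("Input CSV must contain a DICOMPath column")
    # Sorting places every ancestor directory before its descendants.
    parents = sorted({posixpath.dirname(row["DICOMPath"]) for row in reader})
    kept: List[str] = []
    for path in parents:
        # Skip directories already covered by a mounted ancestor.
        if not any(path == k or path.startswith(k.rstrip("/") + "/") for k in kept):
            kept.append(path)
    flags: List[str] = []
    for path in kept:
        flags += ["-v", f"{path}:{path}"]
    return flags


# Placeholder paths, for illustration only.
sample = "DICOMPath\n/data/a.dcm\n/data/sub/b.dcm\n/other/c.dcm\n"
print(mount_flags_for_manifest(sample))
```

In practice a single mount of a common root directory is often simpler; the sketch only shows how to work out the minimal set when the files are scattered.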

Build Docker Image

Model weights (VasoVision and DeepCORO-CLIP) are downloaded at build time using a Docker BuildKit secret. Place your api_key.json (containing HUGGING_FACE_API_KEY) in the project root, then build:

DOCKER_BUILDKIT=1 docker build \
  --secret id=api_key,src=api_key.json \
  -t deepcoro_clip-docker .

The API key is only used during the build and is not persisted in the final image.
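The exact layout of api_key.json is not spelled out here; a reasonable guess is a flat JSON object with a `HUGGING_FACE_API_KEY` field. The sketch below writes such a file so the token does not have to be pasted into an editor by hand. Both the layout and the helper `write_api_key_file` are assumptions and should be checked against the Dockerfile; the `HF_TOKEN` environment variable in the usage comment is likewise an assumed name.

```python
import json
from pathlib import Path


def write_api_key_file(token: str, path: Path) -> None:
    """Write {"HUGGING_FACE_API_KEY": token} to `path`, readable by the owner only."""
    if not token:
        raise ValueError("Empty Hugging Face token")
    payload = {"HUGGING_FACE_API_KEY": token}  # assumed layout
    path.write_text(json.dumps(payload) + "\n", encoding="utf-8")
    path.chmod(0o600)  # keep the secret away from other local users


# Example usage from the project root (HF_TOKEN is an assumed variable name):
# import os
# write_api_key_file(os.environ["HF_TOKEN"], Path("api_key.json"))
```

Keeping api_key.json out of version control (for example through .gitignore) is advisable regardless of how the file is created.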

Run Docker

Requirements:

  • Provide a comma-separated CSV with a DICOMPath column.
  • If you want the full DeepCORO validation stage to run, include the task/label columns expected by config/linear_probing/stenosis/docker_base_config.yaml.
  • Mount the host folder containing the DICOM files at the same absolute path used in DICOMPath.
  • Set EXTERNAL_VALIDATION_DATA_PATH to the container path of your CSV.
  • Mount a host folder to /workspace/results if you want to keep outputs after the container exits.
  • Optionally mount a host folder to /app/tmp if you also want the converted AVI files and other temporary/intermediate artifacts.

Example:

docker run --rm --gpus all --ipc=host \
  -e EXTERNAL_VALIDATION_DATA_PATH=/app/data/input.csv \
  -v /path/to/your/dicoms:/path/to/your/dicoms \
  -v /path/to/your/input.csv:/app/data/input.csv \
  -v /path/to/your/results:/workspace/results \
  -v /path/to/your/tmp:/app/tmp \
  deepcoro_clip-docker \
  python scripts/external_validation.py

Outputs and intermediate files:

  • temporary converted AVIs and intermediate CSVs are written under /app/tmp and are persisted to the host if you mount /path/to/your/tmp:/app/tmp
  • final results are written under the configured output_folder (default: results/, which maps to /workspace/results in the example above)
  • all pipeline CSV files are also copied into /workspace/results/csv_artifacts/ for easier host-side access
  • typical CSV artifacts include: tmp__df_preprocessed.csv for the post-DICOM-conversion manifest, tmp__df_vaso_info.csv for VasoVision/Orion predictions, tmp__df_preprocessed_filtered.csv for the merged and filtered dataframe passed to DeepCORO, and results__...prediction...csv files for final VasoVision and DeepCORO predictions

Notes:

  • scripts/external_validation.py reads EXTERNAL_VALIDATION_DATA_PATH from the environment.
  • the downstream DeepCORO stage defaults to inference mode in the Docker workflow; set DEEPCORO_RUN_MODE=auto to restore schema-based selection or DEEPCORO_RUN_MODE=val to force validation
  • set EXTERNAL_VALIDATION_SKIP_VASOVISION=true to bypass VasoVision entirely and run DeepCORO on all converted videos without the diagnostic/contrast/main-structure filtering step
  • The script expects the input manifest to be comma-separated, not α-separated.
  • A DICOMPath-only CSV is enough for the DICOM-to-AVI and VasoVision stages, but not for the full DeepCORO validation stage unless the downstream config is adjusted.
  • scripts/runner.sh now uses the active virtual environment inside the container, so no .venv symlink workaround is needed.
  • --ipc=host is recommended for this inference pipeline because PyTorch video loading can exhaust Docker's default shared memory allocation. If you do not want to share the host IPC namespace, use a sufficiently large --shm-size instead.
  • if you use --rm without mounting /workspace/results, the generated result files are removed with the container.
  • it is better to create the host output folders yourself first, for example mkdir -p /path/to/your/results /path/to/your/tmp, to avoid Docker creating them as root.
  • files written through the mounted results folder are typically owned by root because the container runs as root by default; if you want host-owned outputs, you can try adding --user $(id -u):$(id -g) to docker run if your mounted folders have compatible permissions.
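The environment variables mentioned in these notes can be thought of as a small settings object read at startup. The sketch below shows one way such parsing might look. It is not the code in scripts/external_validation.py: the `Settings` class is hypothetical, and the accepted spellings of boolean values and the set of run modes (`inference`, `auto`, `val`) are inferred from the notes above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_TRUE_VALUES = {"1", "true", "yes"}          # assumed spellings of "true"
_RUN_MODES = {"inference", "auto", "val"}    # modes named in the notes above


@dataclass(frozen=True)
class Settings:
    data_path: str
    run_mode: str = "inference"
    skip_vasovision: bool = False

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        data_path = env.get("EXTERNAL_VALIDATION_DATA_PATH", "")
        if not data_path:
            raise ValueError("EXTERNAL_VALIDATION_DATA_PATH is not set")
        run_mode = env.get("DEEPCORO_RUN_MODE", "inference").lower()
        if run_mode not in _RUN_MODES:
            raise ValueError(f"Unsupported DEEPCORO_RUN_MODE: {run_mode!r}")
        skip = env.get("EXTERNAL_VALIDATION_SKIP_VASOVISION", "").lower() in _TRUE_VALUES
        return cls(data_path=data_path, run_mode=run_mode, skip_vasovision=skip)


print(Settings.from_env({"EXTERNAL_VALIDATION_DATA_PATH": "/app/data/input.csv"}))
```

With `docker run`, each of these variables is passed with a separate `-e NAME=value` flag.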

Model Architecture

Video Encoder

  • Multiscale Vision Transformer (mVIT) backbone
  • Configurable number of heads and layers
  • Support for pretrained weights
  • Optional backbone freezing
  • New flags for fine-grained control over the output:
    • aggregate=True (default) – returns one study-level vector [B, D].
    • aggregate=False, per_video_pool=True – returns one token per video [B, N, D], ready for MIL / linear probing heads.
    • aggregate=False, per_video_pool=False – returns all patch tokens [B, N·L, D]; this is the only setting that preserves every token, for the most detailed downstream reasoning.
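The three output modes listed above can be summarized as a small shape calculator. This is only a bookkeeping sketch that restates the list: `encoder_output_shape` is a hypothetical helper, the sizes are made up, and the behavior when both flags are true is an assumption (the list does not cover that case).

```python
from typing import Tuple


def encoder_output_shape(
    batch: int,
    videos: int,
    tokens_per_video: int,
    dim: int,
    aggregate: bool = True,
    per_video_pool: bool = False,
) -> Tuple[int, ...]:
    """Output shape for each flag combination, following the list above."""
    if aggregate:
        # Assumption: aggregate=True takes precedence over per_video_pool.
        return (batch, dim)                           # one vector per study
    if per_video_pool:
        return (batch, videos, dim)                   # one token per video
    return (batch, videos * tokens_per_video, dim)    # every patch token


B, N, L, D = 2, 4, 392, 512  # illustrative sizes only
for flags in [(True, False), (False, True), (False, False)]:
    print(flags, encoder_output_shape(B, N, L, D, *flags))
```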

Example (video-level MIL):

from models.video_encoder import VideoEncoder
from models.multi_instance_linear_probing import MultiInstanceLinearProbing

encoder = VideoEncoder(
    backbone="mvit",
    aggregate=False,        # skip internal aggregator
    aggregate_videos_tokens=True,    # one token per video
)

probe = MultiInstanceLinearProbing(
    embedding_dim=encoder.embedding_dim,
    head_structure={"severity": 4},
    pooling_mode="attention",
)

video_batch = ...                  # [B, N, T, H, W, C]
feats = encoder(video_batch)       # [B, N, D]
logits = probe(feats)              # dict with head output
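For intuition about `pooling_mode="attention"`: attention-based MIL pooling scores each video embedding, turns the scores into weights with a softmax, and returns the weighted sum as the study-level embedding. The pure-Python sketch below illustrates that idea with a fixed scoring vector. It is not the repository's `MultiInstanceLinearProbing` implementation, whose scoring network is learned and may differ in detail.

```python
import math
from typing import List, Sequence


def attention_pool(videos: Sequence[Sequence[float]], query: Sequence[float]) -> List[float]:
    """Pool N video embeddings [N, D] into one study embedding [D]."""
    # 1) Score each video by its dot product with the query vector.
    scores = [sum(v_i * q_i for v_i, q_i in zip(video, query)) for video in videos]
    # 2) Softmax over videos (shifted by the max score for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # 3) Weighted sum of the video embeddings.
    dim = len(videos[0])
    return [sum(w * video[d] for w, video in zip(weights, videos)) for d in range(dim)]


# A zero query gives uniform weights, so the result is the mean of the videos.
study = attention_pool(videos=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], query=[0.0, 0.0])
print(study)
```

Unlike plain mean pooling, a learned query lets the model weight informative projections more heavily than uninformative ones.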

Text Encoder

  • BioMedBERT for medical text encoding
  • Configurable freezing ratio
  • Contrastive learning with video features

Linear Probing Heads

  • Task-specific classification heads
  • Configurable dropout and architecture
  • Support for multiple output classes per head

Development Setup

We use pre-commit hooks to ensure code quality and consistency:

# Install pre-commit
uv pip install pre-commit
pre-commit install

# Run hooks manually
pre-commit run --all-files

Performance Guidelines

Recommended Batch Sizes by GPU Memory

| GPU Memory | Recommended Batch Size | Command         |
|------------|------------------------|-----------------|
| 8GB        | 4-8                    | --batch-size 8  |
| 12GB       | 8-16                   | --batch-size 16 |
| 16GB       | 16-24                  | --batch-size 24 |
| 24GB+      | 24-32                  | --batch-size 32 |
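The table can also be expressed as a lookup, which is convenient when a launch script has to choose a starting batch size automatically. This is a minimal sketch: `recommended_batch_size` is a hypothetical helper, and mapping memory sizes that fall between two rows to the lower row is an assumption of this example.

```python
def recommended_batch_size(gpu_memory_gb: float) -> int:
    """Upper end of the recommended range for a GPU with this much memory."""
    # (minimum memory in GB, batch size), largest tier first.
    tiers = [(24, 32), (16, 24), (12, 16), (8, 8)]
    for min_gb, batch_size in tiers:
        if gpu_memory_gb >= min_gb:
            return batch_size
    raise ValueError("The table above starts at 8GB of GPU memory")


for memory in (8, 12, 16, 24, 80):
    print(f"{memory}GB -> --batch-size {recommended_batch_size(memory)}")
```

These values are starting points; if an out-of-memory error occurs, stepping down toward the lower end of the range is the usual remedy.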

Training Tips

  1. Batch Size Selection:

    • Start with smaller batch sizes and increase if memory allows
    • Larger batch sizes generally allow faster training
    • Reduce if you get OOM errors
  2. Number of Workers:

    • Rule of thumb: num_workers = 4 * num_gpus
    • Reduce if you get memory or file handle errors
    • Example: --num-workers 2 for slower storage systems
  3. Learning Rate:

    • Default (1e-4) works well for most cases
    • For larger batch sizes: lr = 1e-4 * (batch_size/32)
    • Example: --lr 2e-4 for batch size 64
  4. Number of Epochs:

    • Default (50) is good for most cases
    • Increase for better performance: --epochs 100
    • Decrease for quick experiments: --epochs 10
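The learning-rate and worker-count rules of thumb above amount to two one-line formulas. The sketch below writes them out; the function names are hypothetical, and applying the scaling rule to batch sizes below 32 goes beyond what the tip states.

```python
def scaled_lr(batch_size: int, base_lr: float = 1e-4, base_batch_size: int = 32) -> float:
    """Linear scaling rule from the tips above: lr = 1e-4 * (batch_size / 32)."""
    return base_lr * batch_size / base_batch_size


def default_num_workers(num_gpus: int, workers_per_gpu: int = 4) -> int:
    """Rule of thumb from the tips above: num_workers = 4 * num_gpus."""
    return workers_per_gpu * max(num_gpus, 1)


print(scaled_lr(64))           # matches the `--lr 2e-4` example for batch size 64
print(default_num_workers(2))  # 4 workers for each of 2 GPUs
```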

Common Issues

  1. Out of Memory (OOM):

    • Reduce batch size
    • Use gradient accumulation
    • Force single GPU mode
  2. GPU Selection:

    • Use CUDA_VISIBLE_DEVICES to select specific GPUs
    • Monitor GPU usage with nvidia-smi
  3. Training Speed:

    • Multi-GPU isn't always faster due to overhead
    • Start with single GPU and scale up if needed
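The gradient accumulation suggested under OOM works because the mean gradient over a large batch equals the average of the mean gradients over equal-sized micro-batches. The pure-Python sketch below checks this on a toy one-parameter regression. It is a conceptual illustration only and makes no claim about how (or whether) the repository's training loop exposes accumulation.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y)


def mse_grad(w: float, batch: Sequence[Sample]) -> float:
    """d/dw of mean((w*x - y)^2) over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, batch: Sequence[Sample], micro_batch_size: int) -> float:
    """Process `batch` in equal micro-batches, accumulating scaled gradients."""
    if len(batch) % micro_batch_size != 0:
        raise ValueError("This sketch assumes equal-sized micro-batches")
    steps = len(batch) // micro_batch_size
    grad = 0.0
    for i in range(steps):
        micro = batch[i * micro_batch_size:(i + 1) * micro_batch_size]
        grad += mse_grad(w, micro) / steps  # scale so the sum is a batch mean
    return grad


data: List[Sample] = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
full = mse_grad(0.5, data)
accum = accumulated_grad(0.5, data, micro_batch_size=2)
print(full, accum)  # both are the same full-batch gradient
```

In PyTorch the same pattern is usually written by dividing each micro-batch loss by the number of accumulation steps and calling `optimizer.step()` only after the last micro-batch.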

Monitoring Training

  1. GPU Memory Usage:

    nvidia-smi -l 1  # Monitor GPU usage every second
  2. Training Progress:

    • Progress bar shows current epoch and batch
    • Loss values are printed every 10 batches
    • Checkpoints are saved every 5 epochs
  3. WandB Logging:

    • Training metrics are logged to Weights & Biases
    • Includes loss, learning rate, batch size
    • Access via WandB dashboard
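GPU memory can also be polled from Python, for example to log it alongside training metrics. The sketch below uses nvidia-smi's CSV query mode and parses the result. The helper names are hypothetical, and the sample values are made up.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_memory(csv_output: str) -> List[Tuple[int, int]]:
    """Parse `memory.used, memory.total` rows (in MiB) into integer pairs."""
    usage = []
    for line in csv_output.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Ask nvidia-smi for per-GPU memory usage; requires an NVIDIA driver."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


# Sample output for two GPUs (values are made up):
print(parse_gpu_memory("1024, 24576\n20480, 24576\n"))
```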

Project Structure

heartwise-ai-deepcoro_clip/
├── config/                       # Configuration files
│   ├── clip/                     # CLIP training configs
│   └── linear_probing/           # Linear probing configs
├── dataloaders/                  # Data loading modules
├── dataset_creation/             # How MHI dataset was built
├── docs/                         # Documentation on CLS-Token implementation
├── models/                       # Neural network models
├── projects/                     # Project implementations
├── runners/                      # Training runners
├── scripts/                      # Training scripts
├── utils/                        # Utility functions
└── tests/                        # Unit test pipeline

🀝 Contributing

Contributions to the DeepCORO_CLIP repository are welcome! Please follow these steps to contribute:

  1. Fork the repository
  2. Create a new branch for your feature or bug fix
  3. Make your changes and commit them with clear, descriptive messages
  4. Push your changes to your fork
  5. Submit a pull request to the main repository

📚 Citation

If you find this repository useful, please cite our work:

@article{harrabi2026deepcoro_clip,
  title={DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation},
  author={Harrabi, Sarra and Wu, Yichen and Tison, Geoffrey H. and Ansari, Minhaj and Vukadinovic, Milos and Ouyang, David and Barrios, Joshua P. and Delfrate, Jacques and Avram, Robert},
  journal={arXiv preprint arXiv:2603.17675},
  year={2026},
  doi={10.48550/arXiv.2603.17675}
}

About

Text & video embeddings for report generation
