ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology
Code and data for the manuscript: [ICCV 2025] ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology.
Linked preprint: https://arxiv.org/abs/2503.17564
Huge datasets in digital pathology have paved the way for slide-level foundation models (SLFMs) that can improve performance on prediction tasks in low-data regimes. However, these foundation models typically only receive image inputs, under-utilizing shared information between tasks and modalities. Medical institutions often only have small but varied annotated datasets which can be multi-modal (clinical records, genomics, reports, etc.), multi-task (subtype, stage, prognosis, etc.) and from many cancer sites.
We propose ModalTune, a novel fine-tuning framework to interface with existing SLFMs to integrate multi-modal, multi-task, and pan-cancer features to maximally leverage existing datasets.
Figure 1: Overview of the ModalTune framework compared to standard fine-tuning framework.
Figure 2: Overview of ModalTune. Built on a frozen pre-trained slide encoder (a), our Modal Adapter (b) integrates extra-modal features
via the Multi-Modal Feature Injector (c) and Extractor (d). A text embedding module (e) unifies multiple tasks and cancer sites, while the
enriched image, task, and modal embeddings are fused for training, ensuring a robust multi-modal, multi-task representation.
- Modal Adapter: a plug-and-play module that works with any Transformer-based SLFM, does not alter pretrained weights, and adapts to an arbitrary number of input modalities
- Text-Based Multitask and Pan-Cancer Learning: combines arbitrary tasks and cancer sites by embedding them as text, which also introduces language-based semantic information
We evaluate ModalTune and baselines on cancer subtype prediction and survival prediction for TCGA.
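The text above does not name the survival metric. As a minimal sketch, assuming survival models are scored with the concordance index (a common choice, and `lifelines` is listed in the requirements), the pure-Python function below shows how such a score is computed. The function name and the toy data are illustrative and are not taken from this repository.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    (events[i] is truthy) and a shorter follow-up time than patient j.
    The pair is concordant when the model assigns i the higher risk;
    tied risks count as half. This is an illustrative implementation,
    not the evaluation code used by the authors.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparison
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: follow-up time, event indicator (1 = event, 0 = censored), predicted risk.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.5, 0.1]
print(concordance_index(toy_times, toy_events, toy_risks))
```

Under this definition a model that outputs the same risk for every patient scores 0.5, which is the chance level to keep in mind when reading survival results.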
Figure 3: Qualitative analysis of breast cancer cases, highlighting a high-risk (a) and a low-risk case (b). (1) Attention Maps (c,d) depict
cross-modal and cross-task interactions, with heatmap colors indicating importance (red: high, blue: low). (2) Integrated Gradients
(e,f) show the top 10 pathways influencing risk, with orange bars indicating pathways that increase risk and blue bars indicating pathways
that decrease risk.
We compare ModalTune against linear probed (LP) and fine-tuned (Tuned) SLFMs, single-modal baselines, and multimodal baselines -- all trained on tasks separately. "cat" indicates late fusion via simple concatenation of features from each modality.
| Modality | Method | BRCA | GBMLGG | NSCLC | RCC | Overall |
|---|---|---|---|---|---|---|
| WSI-only | Prov-Gigapath LP | 0.612 | 0.900 | 0.821 | 0.851 | 0.796 |
| | ABMIL | 0.853 ± 0.015 | 0.931 ± 0.040 | 0.920 ± 0.012 | 0.921 ± 0.017 | 0.906 |
| | TransMIL | 0.828 ± 0.011 | 0.978 ± 0.012 | 0.934 ± 0.007 | 0.918 ± 0.016 | 0.915 |
| | Prov-Gigapath (Tuned) | 0.860 ± 0.013 | 0.931 ± 0.029 | 0.916 ± 0.012 | 0.939 ± 0.016 | 0.912 |
| Genomics | MLP | 0.752 ± 0.032 | 0.998 ± 0.002 | 0.926 ± 0.007 | 0.883 ± 0.016 | 0.890 |
| | S-MLP | 0.839 ± 0.015 | 1.000 ± 0.000 | 0.941 ± 0.005 | 0.890 ± 0.003 | 0.917 |
| | Gene Mixer | 0.840 ± 0.014 | 1.000 ± 0.000 | 0.932 ± 0.005 | 0.898 ± 0.036 | 0.917 |
| Multi-modal | MCAT | 0.875 ± 0.014 | 0.973 ± 0.005 | 0.921 ± 0.006 | 0.933 ± 0.028 | 0.925 |
| | SurvPath | 0.858 ± 0.028 | 0.995 ± 0.007 | 0.932 ± 0.007 | 0.936 ± 0.030 | 0.930 |
| | ABMIL (cat) | 0.861 ± 0.010 | 1.000 ± 0.000 | 0.951 ± 0.003 | 0.937 ± 0.028 | 0.937 |
| | TransMIL (cat) | 0.847 ± 0.027 | 1.000 ± 0.000 | 0.950 ± 0.004 | 0.957 ± 0.020 | 0.939 |
| | Prov-Gigapath (cat) | 0.850 ± 0.017 | 0.998 ± 0.002 | 0.924 ± 0.011 | 0.926 ± 0.034 | 0.925 |
| | ModalTune (Ours) | 0.899 ± 0.026 | 1.000 ± 0.000 | 0.956 ± 0.010 | 0.959 ± 0.003 | 0.954 |
| | ModalTune Pan-Cancer (Ours) | 0.858 ± 0.001 | 0.990 ± 0.009 | 0.958 ± 0.004 | 0.902 ± 0.033 | 0.927 |
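The Overall column is not defined in the text. The reported values are consistent with an unweighted mean over the four cancer sites; this is an inference from the numbers, not a statement by the authors. The short check below recomputes that mean for three rows copied from the table above (the helper name `overall_gap` is illustrative).

```python
from statistics import mean

# Per-site scores (BRCA, GBMLGG, NSCLC, RCC) and the reported Overall value,
# copied from three rows of the table above.
reported = {
    "Prov-Gigapath LP": ([0.612, 0.900, 0.821, 0.851], 0.796),
    "ABMIL": ([0.853, 0.931, 0.920, 0.921], 0.906),
    "ModalTune (Ours)": ([0.899, 1.000, 0.956, 0.959], 0.954),
}


def overall_gap(per_site, overall):
    """Absolute difference between the unweighted site mean and the reported Overall."""
    return abs(mean(per_site) - overall)


for method, (per_site, overall) in reported.items():
    print(f"{method}: mean={mean(per_site):.4f}, reported={overall:.3f}, "
          f"gap={overall_gap(per_site, overall):.4f}")
```

Small gaps are expected because the per-site values shown in the table are already rounded to three decimals.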
| Modality | Method | BRCA | GBMLGG | NSCLC | RCC | Overall |
|---|---|---|---|---|---|---|
| WSI | Prov-Gigapath LP | 0.647 | 0.795 | 0.562 | 0.679 | 0.671 |
| | ABMIL | 0.712 ± 0.004 | 0.854 ± 0.004 | 0.582 ± 0.002 | 0.670 ± 0.004 | 0.704 |
| | TransMIL | 0.742 ± 0.015 | 0.868 ± 0.009 | 0.586 ± 0.011 | 0.676 ± 0.003 | 0.718 |
| | Prov-Gigapath (Tuned) | 0.680 ± 0.024 | 0.824 ± 0.017 | 0.546 ± 0.005 | 0.685 ± 0.012 | 0.684 |
| Genomics | MLP | 0.629 ± 0.039 | 0.884 ± 0.004 | 0.542 ± 0.013 | 0.720 ± 0.002 | 0.694 |
| | S-MLP | 0.749 ± 0.030 | 0.887 ± 0.002 | 0.571 ± 0.011 | 0.735 ± 0.003 | 0.736 |
| | Gene Mixer | 0.762 ± 0.049 | 0.870 ± 0.007 | 0.556 ± 0.033 | 0.690 ± 0.009 | 0.719 |
| Multi-modal | MCAT | 0.673 ± 0.044 | 0.880 ± 0.006 | 0.592 ± 0.002 | 0.697 ± 0.004 | 0.710 |
| | SurvPath | 0.741 ± 0.031 | 0.895 ± 0.003 | 0.613 ± 0.010 | 0.677 ± 0.011 | 0.732 |
| | ABMIL (cat) | 0.736 ± 0.007 | 0.896 ± 0.004 | 0.605 ± 0.011 | 0.690 ± 0.004 | 0.732 |
| | TransMIL (cat) | 0.666 ± 0.018 | 0.872 ± 0.030 | 0.595 ± 0.003 | 0.689 ± 0.004 | 0.705 |
| | Prov-Gigapath (cat) | 0.678 ± 0.004 | 0.861 ± 0.023 | 0.573 ± 0.003 | 0.693 ± 0.016 | 0.701 |
| | ModalTune (Ours) | 0.772 ± 0.008 | 0.879 ± 0.004 | 0.608 ± 0.023 | 0.743 ± 0.004 | 0.750 |
| | ModalTune Pan-Cancer (Ours) | 0.757 ± 0.039 | 0.860 ± 0.006 | 0.586 ± 0.020 | 0.705 ± 0.007 | 0.727 |
We perform OOD evaluation of ModalTune on unseen cancer sites from TCGA versus baseline single-modal and multimodal tuning methods. "Sup." indicates that the method was fine-tuned in a supervised manner on the OOD cancer site. "Cls." and "Surv." indicate that the method was fine-tuned using only classification or survival objectives, respectively.
| Method | COADREAD | BLCA |
|---|---|---|
| Prov-Gigapath Sup. (cat) | 0.581 ± 0.006 | 0.703 ± 0.018 |
| Prov-Gigapath LP | 0.510 | 0.569 |
| Prov-Gigapath Cls. (cat) | 0.504 ± 0.030 | 0.497 ± 0.005 |
| Prov-Gigapath Surv. (cat) | 0.500 ± 0.000 | 0.497 ± 0.005 |
| ModalTune | 0.574 ± 0.024 | 0.664 ± 0.025 |
| ModalTune Pan-Cancer | 0.564 ± 0.034 | 0.689 ± 0.035 |
| Method | COADREAD | BLCA |
|---|---|---|
| Prov-Gigapath Sup. (cat) | 0.528 ± 0.023 | 0.673 ± 0.020 |
| Prov-Gigapath LP | 0.482 | 0.603 |
| Prov-Gigapath Cls. (cat) | 0.479 ± 0.042 | 0.610 ± 0.063 |
| Prov-Gigapath Surv. (cat) | 0.512 ± 0.061 | 0.552 ± 0.052 |
| ModalTune | 0.539 ± 0.068 | 0.629 ± 0.041 |
| ModalTune Pan-Cancer | 0.543 ± 0.020 | 0.672 ± 0.046 |
ModalTune/
├── train_modaltune.py # Main training script for single cancer types
├── train_modaltune_pancancer.py # Pan-cancer multi-task training script
│
├── data_utils/ # Data processing and preparation
│
├── dataset/ # Processed datasets and splits
│
├── models/ # Model architectures and components
│ ├── aggregators/ # MIL aggregation modules including ModalAdapters
│ ├── genomic_utils/ # Genomic data processing
│ ├── vitadapter/ # Vision adapter utils
│ └── prov_gigapath/ # Prov-GigaPath model components
│
├── model_configs/ # Model configuration files
│
├── utils/ # Utility functions and helpers
│ ├── constants.py # Constants and paths
│ ├── test_utils_modaltune.py # Evaluation utilities
│ └── test_utils_pancancer.py # Pan-cancer evaluation utilities
│
└── scripts/ # Execution scripts
    ├── deploy_modaltune.sh # Deployment script
    ├── submit_extract_patches.sh # Patch extraction pipeline
    ├── submit_get_dataset.sh # Dataset creation pipeline
    └── submit_modaltune.sh # Training pipeline
# Core ML Libraries
torch==2.0.0
torchvision==0.15.0
# Deep Learning & Vision
timm==1.0.7
transformers==4.36.2
# Additional
warmup_scheduler
gene_thesaurus #gene processing
dplabtools #patch extraction
conch #conch and conch related packages (https://github.com/mahmoodlab/CONCH)
prov_gigapath #For running longnetvit with flash attention and other dependencies (https://github.com/prov-gigapath/prov-gigapath)
wandb #for experiment tracking
lifelines #for survival analysis
sklearn #for metrics
Since this repository works with multiple datasets and modalities, we follow a consistent directory structure throughout the project. You are free to modify this structure if needed, as long as the corresponding paths are updated accordingly.
- Raw image data: PATHTODATABASE/TCGA/TCGA-{ONCO_CODE}/
- Raw genomics data: PATHTODATABASE/TCGA/TCGA-genomics/raw/
- Processed image features: PATHTODATABASE/TCGA/TCGA-extractedfeatures/
- Processed genomics features: PATHTODATABASE/TCGA/TCGA-genomics/processed/
- Processed text data: PATHTODATABASE/TCGA/TCGA-extractedtexts/
Please update the paths in dataset/json_splits, scripts/, and utils/constants.py accordingly.
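The directory layout described above can be sketched with `pathlib`, which makes it easy to check that every expected directory exists before launching the scripts. This is a minimal sketch: the function names `tcga_layout` and `missing_dirs`, the dictionary keys, and the example root `/data` are hypothetical and are not part of this repository.

```python
from pathlib import Path


def tcga_layout(root, onco_code):
    """Build the directories listed above for one TCGA cohort.

    `root` plays the role of PATHTODATABASE and `onco_code` the role of
    {ONCO_CODE}; both are supplied by the user.
    """
    tcga = Path(root) / "TCGA"
    return {
        "raw_images": tcga / f"TCGA-{onco_code}",
        "raw_genomics": tcga / "TCGA-genomics" / "raw",
        "image_features": tcga / "TCGA-extractedfeatures",
        "genomic_features": tcga / "TCGA-genomics" / "processed",
        "text_features": tcga / "TCGA-extractedtexts",
    }


def missing_dirs(layout):
    """Return the names of layout entries that do not exist on disk."""
    return [name for name, path in layout.items() if not path.is_dir()]


layout = tcga_layout("/data", "BRCA")  # "/data" is a placeholder root
for name, path in layout.items():
    print(f"{name}: {path}")
print("missing:", missing_dirs(layout))
```

Running a check like this once after editing the paths can catch a mistyped directory before a long feature-extraction or training job starts.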
Download TCGA whole slide images along with clinical data from the GDC Data Portal (https://portal.gdc.cancer.gov/).
Download genomic data from the UCSC Xena Browser.
The pan-cancer dataset was downloaded from the Xena Browser, under the Gene expression RNAseq section.
RNA-seq data for individual cancer types can be downloaded by navigating to the TCGA Hub in the Xena database (https://xenabrowser.net/datapages/).
We used the following foundation models in our experiments.
- CONCH (for text embeddings): https://github.com/mahmoodlab/CONCH
- TITAN (for ModalTune TITAN): https://huggingface.co/MahmoodLab/TITAN
- Prov-GigaPath (for ModalTune Gigapath): https://github.com/prov-gigapath/prov-gigapath
Finally, ensure that utils/constants.py has the correct paths set for your data directories.
Patch features can be extracted using either TITAN or Prov-GigaPath models. Use the respective line in scripts/submit_extract_patches.sh to extract features.
bash scripts/submit_extract_patches.sh
Use the scripts in scripts/submit_get_dataset.sh to process genomic data and clinical text and to create dataset splits.
bash scripts/submit_get_dataset.sh
Ensure that the paths in scripts/submit_modaltune.sh are correctly set for your dataset and model (prov-gigapath/titan) configuration. By default, we only consider imaging and genomic modalities. To also include clinical information, ensure the corresponding paths are added and set $TYPE=gene_clinical in scripts/submit_modaltune.sh. Then run for each cancer type:
bash scripts/submit_modaltune.sh
For pan-cancer training, ensure you have created the pan-cancer dataset splits and modify scripts/submit_modaltune.sh to point to the pan-cancer config and dataset. Then run:
bash scripts/submit_modaltune.sh
For out-of-distribution evaluation or testing on a new test set, modify the paths, model weights and cancer types in scripts/deploy_modaltune.sh and run:
bash scripts/deploy_modaltune.sh
We would like to express our gratitude to the following projects and resources that have significantly contributed to the development of ModalTune.
- SurvPath: https://github.com/mahmoodlab/SurvPath/
- CONCH: https://github.com/mahmoodlab/CONCH
- TITAN: https://huggingface.co/MahmoodLab/TITAN
- Prov-GigaPath: https://github.com/prov-gigapath/prov-gigapath
- ViT-Adapter: https://github.com/czczup/ViT-Adapter
- Mask2Former: https://github.com/facebookresearch/Mask2Former
- MLP-Mixer: https://github.com/lucidrains/mlp-mixer-pytorch
- PromptKD: https://github.com/zhengli97/PromptKD
- timm: https://github.com/rwightman/pytorch-image-models
- TCGA: The Cancer Genome Atlas
- UCSC Xena: UCSC Xena Browser
If you use ModalTune in your research, please cite:
@InProceedings{Ramanathan_2025_ICCV,
author = {Ramanathan, Vishwesh and Xu, Tony and Pati, Pushpak and Ahmed, Faruk and Goubran, Maged and Martel, Anne L.},
title = {ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {23912-23923}
}
You can reach the authors by raising an issue in this repo or by emailing them at vishwesh.ramanathan@mail.utoronto.ca / tonylt.xu@mail.utoronto.ca / a.martel@mail.utoronto.ca