CliniBench is the first comprehensive benchmark for comparing encoder-based classifiers and generative large language models (LLMs) for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset.
- Python 3.11
- CUDA-capable GPU (recommended for encoder training and LLM inference)
- Access to MIMIC-IV dataset (requires PhysioNet credentialing)
```bash
# Clone the repository
git clone https://github.com/your-org/clinibench.git
cd clinibench

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate
```

```bash
# Build the Docker image
docker build -t clinibench:latest .

# Run container with GPU support
docker run --gpus all -it -v $(pwd):/clinibench clinibench:latest {script_name.py}
```

- Complete the CITI "Data or Specimens Only Research" course
- Request access to MIMIC-IV on PhysioNet
- Download MIMIC-IV dataset (version 2.2)
- Download MIMIC-IV-Note dataset (version 2.2)
- Follow https://github.com/MIT-LCP/mimic-code/tree/main/mimic-iv/buildmimic/postgres to load the core MIMIC-IV tables into PostgreSQL
- Follow https://github.com/MIT-LCP/mimic-code/blob/main/mimic-iv-note/buildmimic/postgres/README.md to load the MIMIC-IV notes
Adjust the PostgreSQL connection string (username, password, database name, hostname, and port) in the create_admission_note_dataset.sh script and execute it.
You may adjust the paths where the dataset is stored within the script; all following steps and default parameters assume the paths defined there.
The script will:
1. create the admission-note dataset from the discharge notes,
2. create the train/dev/test splits from the full dataset, and
3. create a file containing the notes from all train splits, which is used for the few-shot experiments.
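The split step (2 above) can be illustrated with a minimal sketch. This is not the script's actual logic: the column names, seed, and 80/10/10 ratios here are placeholders for illustration only.

```python
import pandas as pd

def make_splits(df, seed=42, train=0.8, dev=0.1):
    """Randomly split a notes DataFrame into train/dev/test (illustrative only)."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n = len(shuffled)
    n_train, n_dev = int(n * train), int(n * dev)
    return (
        shuffled.iloc[:n_train],                  # train split
        shuffled.iloc[n_train:n_train + n_dev],   # dev split
        shuffled.iloc[n_train + n_dev:],          # test split
    )

# Tiny synthetic example with hypothetical column names
notes = pd.DataFrame({"note_id": range(10), "text": ["..."] * 10})
train_df, dev_df, test_df = make_splits(notes)
print(len(train_df), len(dev_df), len(test_df))  # 8 1 1
```

Shuffling once and slicing guarantees the three splits are disjoint, which matters here because the train split also feeds the few-shot candidate file.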
Step 1: Start the vLLM server
Launch a vLLM server that serves your model. By default, the prediction script assumes the server is running on localhost; if it runs on a different host, adjust the --vllm_ip parameter in Step 2.
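For example, a server for a Qwen2.5 3B instruct model could be started like this (the model name and port are placeholders; substitute whichever model you evaluate):

```bash
# Example only: expose an OpenAI-compatible endpoint on port 8000
vllm serve Qwen/Qwen2.5-3B-Instruct --port 8000
```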
Step 2a: Generate diagnosis predictions (zero-shot)
Run the prediction script with your model and data:
```bash
python src/generative_models/predict_diagnoses.py \
  --vllm_ip=http://localhost:8000 \
  --test_data=data/mimic-iv/icd-10/icu/test_10_icu.parquet \
  --icd10_code_file=data/icd10_codes.csv \
  --icd9_code_file=data/icd9_codes.csv \
  --results_file=qwen2.5-3B-mimic-iv-icu-icd10.parquet
```

Step 2b: Generate diagnosis predictions (few-shot)
Run the prediction script including few-shot examples:
```bash
python src/generative_models/predict_diagnoses.py \
  --vllm_ip=http://localhost:8000 \
  --test_data=data/mimic-iv/icd-10/icu/test_10_icu.parquet \
  --icd10_code_file=data/icd10_codes.csv \
  --icd9_code_file=data/icd9_codes.csv \
  --results_file=qwen2.5-3B-mimic-iv-icu-icd10.parquet \
  --few_shot \
  --few_shot_note_data=./data/fewshot-candidates/notes.parquet \
  --few_shot_ids=./data/fewshot-candidates/gold_shots/test_few_shots.pq \
  --few_shot_column=icd10_icu \
  --num_few_shot_candidates=5
```

Step 3: Evaluate the predictions
Map the generated diagnosis descriptions to ICD codes and calculate evaluation metrics:
```bash
python src/generative_models/map_and_evaluate.py \
  --predictions_file_path=qwen2.5-3B-mimic-iv-icu-icd10.parquet \
  --results_save_path=results.json \
  --icd_version=10 \
  --icd10_code_file=data/icd10_codes.csv \
  --icd9_code_file=data/icd9_codes.csv
```

Step 4: View results
All evaluation metrics will be saved to the file specified in --results_save_path (e.g., results.json).
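To give an intuition for what the mapping and evaluation involve, here is a minimal sketch: each generated description is matched to the closest ICD-10 description, and micro-averaged precision/recall/F1 are computed over per-admission code sets. The miniature code table, the difflib matcher, and the 0.6 cutoff are illustrative assumptions; map_and_evaluate.py's actual mapping strategy and metric set may differ.

```python
from difflib import get_close_matches

# Hypothetical miniature ICD-10 table (the real one comes from data/icd10_codes.csv)
icd10 = {
    "sepsis, unspecified organism": "A41.9",
    "acute kidney failure, unspecified": "N17.9",
    "essential (primary) hypertension": "I10",
}

def map_to_code(description, code_table):
    """Map a generated diagnosis description to the closest ICD description (illustrative)."""
    match = get_close_matches(description.lower(), code_table.keys(), n=1, cutoff=0.6)
    return code_table[match[0]] if match else None

def micro_prf(predicted, gold):
    """Micro-averaged precision/recall/F1 over per-admission code sets."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    pred_total = sum(len(p) for p in predicted)
    gold_total = sum(len(g) for g in gold)
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / gold_total if gold_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One synthetic admission: a free-text prediction plus an already-coded one
preds = [{map_to_code("sepsis due to unspecified organism", icd10), "I10"}]
golds = [{"A41.9", "I10"}]
print(micro_prf(preds, golds))
```

Micro-averaging pools true positives across all admissions before dividing, so admissions with many gold codes weigh more than those with few.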
If you use CliniBench in your research, please cite:
```bibtex
@misc{grundmann2025clinibenchclinicaloutcomeprediction,
  title={CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models},
  author={Paul Grundmann and Dennis Fast and Jan Frick and Thomas Steffek and Felix Gers and Wolfgang Nejdl and Alexander Löser},
  year={2025},
  eprint={2509.26136},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.26136},
}
```

This project is licensed under the MIT License - see the LICENSE file for details.