LLM evaluation of ENT clinical cases for surgical recommendation

Overview

ent-llm evaluates otolaryngology (ENT) clinical cases using large language models (LLMs). It processes chronic sinusitis patient data from Stanford's medical records (STARR, accessed through BigQuery) and generates a Yes/No surgical recommendation with a 1-10 confidence score for each case.

Installation

Create Virtual Environment

python -m venv .venv
source .venv/bin/activate

Install Dependencies

pip install -e .

Required environment variables:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/gcp_credentials.json"  # BigQuery access
export VAULT_SECRET_KEY="your_private_key"                             # SecureLLM API access
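
Before running either CLI, it can help to confirm that both variables are set. The following is a minimal sketch of such a check; `missing_env_vars` is a hypothetical helper written for illustration and is not part of ent-llm.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = ("GOOGLE_APPLICATION_CREDENTIALS", "VAULT_SECRET_KEY")


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper, not part of the ent-llm package.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```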

Quick Start

Full Pipeline

# Step 1: Extract data from BigQuery
ent-llm-extract --output cases.csv

# Step 2: Run LLM analysis
ent-llm --model apim:gpt-4.1 --input cases.csv --output results.csv

Testing with Limited Data

# Extract only 100 patients for testing
python cli_extract.py --output test_cases.csv --limit 100

# Run analysis
python cli.py --model apim:claude-3.7 --input test_cases.csv --output test_results.csv

CLI Reference

`ent-llm-extract` - Data Extraction

Extracts and preprocesses clinical data from BigQuery.

ent-llm-extract [OPTIONS]

Option	Short	Description
`--output`	`-o`	Output CSV file (default: `llm_cases.csv`)
`--batch-size`	`-b`	Patients per batch (default: 100)
`--limit`	`-l`	Max patients to process (default: all)
`--save-processed`		Also save full processed dataframe
`--processed-output`		Path for processed data CSV
`--checkpoint-dir`		Directory for checkpoint files
`--count-only`		Show patient count and exit
`--verbose`	`-v`	Enable verbose logging

Examples:

# Count total patients
ent-llm-extract --count-only

# Extract all data
ent-llm-extract --output cases.csv

# Extract with checkpoints (recommended for large datasets)
ent-llm-extract --output cases.csv --checkpoint-dir ./checkpoints

# Extract both LLM-ready and full processed data
ent-llm-extract --output cases.csv --save-processed --processed-output full_data.csv
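
To illustrate why `--batch-size` and `--checkpoint-dir` matter for large extractions, here is a rough sketch of batched processing with per-batch checkpoints. It is an assumption about the general technique, not the actual implementation in `cli_extract.py`: the function names, the `batch_00000.json` file naming, and the JSON checkpoint format are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def iter_batches(patient_ids, batch_size=100):
    """Yield (batch_index, ids) with at most batch_size IDs per batch."""
    for start in range(0, len(patient_ids), batch_size):
        yield start // batch_size, patient_ids[start:start + batch_size]


def run_with_checkpoints(patient_ids, checkpoint_dir, process, batch_size=100):
    """Process each batch once, reusing any batch whose checkpoint file exists."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for index, ids in iter_batches(patient_ids, batch_size):
        path = checkpoint_dir / f"batch_{index:05d}.json"  # hypothetical file name
        if path.exists():
            # A previous run already finished this batch: load it instead of redoing it.
            batch_result = json.loads(path.read_text())
        else:
            batch_result = process(ids)
            path.write_text(json.dumps(batch_result))
        results.extend(batch_result)
    return results


if __name__ == "__main__":
    ids = list(range(250))
    with tempfile.TemporaryDirectory() as tmp:
        rows = run_with_checkpoints(ids, tmp, lambda batch: [i * 2 for i in batch])
        print(len(rows), "rows from", len(list(Path(tmp).glob("*.json"))), "batches")
```

With this pattern, an interrupted run repeats only the batch that was in progress when it stopped.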

`ent-llm` - LLM Analysis

Runs surgical recommendation analysis using various LLM backends.

ent-llm [OPTIONS]

Option	Short	Description
`--model`	`-m`	LLM model to use (default: `apim:gpt-4.1`)
`--input`	`-i`	Input CSV file with case data
`--output`	`-o`	Output CSV file for results
`--delay`	`-d`	Delay between API calls (default: 0.2s)
`--interactive`	`-I`	Interactive query mode
`--list-models`	`-l`	List available models and exit
`--verbose`	`-v`	Enable verbose logging

Available models:

apim:gpt-4.1
apim:claude-3.7
apim:llama-3.3-70b
apim:gemini-2.5-pro-preview-05-06

Examples:

# List available models
ent-llm --list-models

# Run analysis with specific model
ent-llm --model apim:claude-3.7 --input cases.csv --output results.csv

# Interactive query mode
ent-llm --model apim:gpt-4.1 --interactive

# Demo mode (no input file)
ent-llm --model apim:gpt-4.1
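
The `--delay` option spaces out API calls. A minimal sketch of that idea follows, assuming a fixed pause between consecutive calls; `run_with_delay` and its `query` callback are hypothetical names, and whether the real CLI also pauses before the first call is not stated here.

```python
import time


def run_with_delay(cases, query, delay=0.2, sleep=time.sleep):
    """Call query(case) for each case, pausing `delay` seconds between calls.

    `sleep` is injectable so the loop can be tested without waiting.
    """
    results = []
    for position, case in enumerate(cases):
        if position > 0:
            sleep(delay)  # pause between calls, not before the first one
        results.append(query(case))
    return results


if __name__ == "__main__":
    # Stand-in for a real LLM call: echo the case ID.
    print(run_with_delay(["case-1", "case-2"], lambda case: case.upper(), delay=0.2))
```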

`ent-llm-ablation` - Demographic Ablation Analysis

Measures how demographic variables influence LLM surgical recommendations by selectively excluding demographics from prompts.

ent-llm-ablation [OPTIONS]

Option	Short	Description
`--model`	`-m`	LLM model to use (default: `apim:gpt-4.1`)
`--input`	`-i`	Input CSV file (clinical text + demographics)
`--output-dir`	`-o`	Output directory for result CSVs (default: `./ablation_results`)
`--baseline`	`-b`	Path to pre-computed baseline CSV (skip baseline run)
`--experiments`	`-e`	Which to run: `all`, `individual`, `grouped`, `baseline-only`
`--sample-size`	`-n`	Stratified sample size
`--max-tokens`		Filter out cases exceeding estimated token count
`--ground-truth`	`-g`	Ground truth column name (default: `had_surgery`)
`--delay`	`-d`	Delay between API calls (default: 0.2s)
`--flush-interval`	`-f`	Incremental save interval (default: 10)
`--no-resume`		Start fresh instead of resuming
`--list-experiments`		List all experiments and exit
`--verbose`	`-v`	Enable verbose logging
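
As a sketch of what filtering by estimated token count (`--max-tokens`) could look like: the estimator below assumes roughly four characters per token, a common rule of thumb rather than the method ent-llm actually uses, and both function names are hypothetical. The text columns are the ones listed in the input CSV description.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate. ASSUMPTION: about 4 characters per token."""
    return (len(text) + chars_per_token - 1) // chars_per_token  # ceiling division


def filter_by_tokens(cases, max_tokens):
    """Keep cases whose combined clinical and radiology text fits within max_tokens."""
    kept = []
    for case in cases:
        text = case.get("formatted_progress_text", "") + case.get("formatted_radiology_text", "")
        if estimate_tokens(text) <= max_tokens:
            kept.append(case)
    return kept


if __name__ == "__main__":
    cases = [
        {"llm_caseID": "1", "formatted_progress_text": "short note", "formatted_radiology_text": ""},
        {"llm_caseID": "2", "formatted_progress_text": "x" * 40000, "formatted_radiology_text": ""},
    ]
    print([case["llm_caseID"] for case in filter_by_tokens(cases, 5000)])
```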

Input CSV: a single file containing both clinical text and demographic columns. At least one demographic column must be present.

Column	Required	Description
`llm_caseID`	Yes	Unique case identifier
`formatted_progress_text`	Yes	Concatenated ENT clinical notes
`formatted_radiology_text`	Yes	Concatenated radiology reports
`legal_sex`	Demographic	Patient sex
`age`	Demographic	Patient age
`race`	Demographic	Patient race
`ethnicity`	Demographic	Patient ethnicity
`recent_bmi`	Demographic	Most recent BMI
`smoking_hx`	Demographic	Smoking history
`alcohol_use`	Demographic	Alcohol use
`zipcode`	Demographic	Patient zipcode
`insurance_type`	Demographic	Insurance type
`occupation`	Demographic	Patient occupation
`had_surgery`	Optional	Ground truth for accuracy analysis

The prepared input file data/sample_with_demo_patient_id.csv contains all required columns.
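
For a custom input file, the column rules above can be checked before starting a run. This is a minimal sketch using only the standard library; `check_ablation_header` and `read_header` are hypothetical helpers, not functions exported by ent-llm.

```python
import csv
import io

# Column names copied from the input CSV table above.
REQUIRED_COLUMNS = ["llm_caseID", "formatted_progress_text", "formatted_radiology_text"]
DEMOGRAPHIC_COLUMNS = [
    "legal_sex", "age", "race", "ethnicity", "recent_bmi",
    "smoking_hx", "alcohol_use", "zipcode", "insurance_type", "occupation",
]


def check_ablation_header(header):
    """Return a list of problems with the CSV header; an empty list means it looks usable."""
    problems = [
        f"missing required column: {name}"
        for name in REQUIRED_COLUMNS
        if name not in header
    ]
    if not any(name in header for name in DEMOGRAPHIC_COLUMNS):
        problems.append("no demographic column present")
    return problems


def read_header(csv_file):
    """Read only the first row of an open CSV file."""
    return next(csv.reader(csv_file), [])


if __name__ == "__main__":
    sample = io.StringIO("llm_caseID,formatted_progress_text,age\n1,note text,54\n")
    print(check_ablation_header(read_header(sample)))
```

For a file on disk, open it with `open(path, newline="")` and pass the handle to `read_header`.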

Experiments (16 total):

Baseline — all demographics included
10 individual ablations — exclude one variable at a time (no_legal_sex, no_age, etc.)
5 grouped ablations — exclude variable groups (no_protected_attributes, no_socioeconomic, no_health_behaviors, no_physical_attributes, no_all_demographics)
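
The experiment list above can be sketched as a mapping from experiment name to the demographic columns kept in the prompt. The individual experiment names follow the `no_<column>` pattern shown above; the membership of each group is NOT documented here, so the `GROUPS` assignments below are guesses for illustration only.

```python
DEMOGRAPHICS = [
    "legal_sex", "age", "race", "ethnicity", "recent_bmi",
    "smoking_hx", "alcohol_use", "zipcode", "insurance_type", "occupation",
]

# ASSUMPTION: which columns belong to each group is guessed, not taken from ent-llm.
GROUPS = {
    "no_protected_attributes": ["legal_sex", "age", "race", "ethnicity"],
    "no_socioeconomic": ["zipcode", "insurance_type", "occupation"],
    "no_health_behaviors": ["smoking_hx", "alcohol_use"],
    "no_physical_attributes": ["recent_bmi"],
    "no_all_demographics": list(DEMOGRAPHICS),
}


def build_experiments():
    """Map experiment name -> demographic columns that stay in the prompt."""
    experiments = {"baseline": list(DEMOGRAPHICS)}
    for column in DEMOGRAPHICS:
        # Individual ablation: drop exactly one column.
        experiments[f"no_{column}"] = [c for c in DEMOGRAPHICS if c != column]
    for name, excluded in GROUPS.items():
        # Grouped ablation: drop every column in the group.
        experiments[name] = [c for c in DEMOGRAPHICS if c not in excluded]
    return experiments


if __name__ == "__main__":
    for name, kept in build_experiments().items():
        print(f"{name}: keeps {len(kept)} of {len(DEMOGRAPHICS)} demographic columns")
```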

Examples:

# List all experiments
ent-llm-ablation --list-experiments

# Run full ablation study
ent-llm-ablation -m apim:gpt-4.1 -i data/sample_with_demo_patient_id.csv

# Run on a stratified sample of 500 cases
ent-llm-ablation -m apim:gpt-4.1 -i data/sample_with_demo_patient_id.csv -n 500

# Filter long cases and run only individual ablations
ent-llm-ablation -m apim:claude-3.7 -i data/sample_with_demo_patient_id.csv --max-tokens 5000 -e individual

# Resume with a pre-computed baseline
ent-llm-ablation -m apim:gpt-4.1 -i data/sample_with_demo_patient_id.csv -b ./ablation_results/baseline_results.csv

Output: Each experiment saves its results to `{output_dir}/{experiment_name}_results.csv`. A summary comparing every experiment to the baseline is saved to `{output_dir}/ablation_summary.csv` with flip rates, confidence changes, and (if ground truth is provided) accuracy metrics.
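
As a rough illustration of the summary metrics, the sketch below treats the flip rate as the fraction of shared cases whose decision differs from the baseline, and the confidence change as the mean difference in confidence. These definitions are assumptions about what the summary reports, and `compare_to_baseline` is a hypothetical function.

```python
def compare_to_baseline(baseline, experiment):
    """Compare two {case_id: (decision, confidence)} mappings on their shared cases."""
    shared = set(baseline) & set(experiment)
    if not shared:
        return {"n_cases": 0, "flip_rate": None, "mean_confidence_change": None}
    # A "flip" is a case whose Yes/No decision differs from the baseline run.
    flips = sum(baseline[c][0] != experiment[c][0] for c in shared)
    delta = sum(experiment[c][1] - baseline[c][1] for c in shared)
    return {
        "n_cases": len(shared),
        "flip_rate": flips / len(shared),
        "mean_confidence_change": delta / len(shared),
    }


if __name__ == "__main__":
    baseline = {"a": ("Yes", 8), "b": ("No", 6), "c": ("Yes", 7), "d": ("No", 5)}
    no_age = {"a": ("Yes", 7), "b": ("Yes", 6), "c": ("Yes", 9), "d": ("No", 5)}
    print(compare_to_baseline(baseline, no_age))
```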

Data Pipeline

┌─────────────────────────────────────────────────────────────────────────────┐
│                              DATA EXTRACTION                                 │
│                           (ent-llm-extract CLI)                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   BigQuery (Stanford STARR)                                                  │
│         │                                                                    │
│         ├── clinical_note      → Filter by ENT authors                       │
│         ├── radiology_report   → Filter CT sinus reports                     │
│         └── procedures         → Extract surgery CPT codes                   │
│                   │                                                          │
│                   ▼                                                          │
│         Build patient records                                                │
│                   │                                                          │
│                   ▼                                                          │
│         Censor surgical planning text                                        │
│                   │                                                          │
│                   ▼                                                          │
│         Format for LLM input → cases.csv                                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              LLM ANALYSIS                                    │
│                             (ent-llm CLI)                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   cases.csv                                                                  │
│         │                                                                    │
│         ▼                                                                    │
│   SecureLLM API (GPT-4, Claude, Llama, Gemini)                               │
│         │                                                                    │
│         ▼                                                                    │
│   Parse JSON responses                                                       │
│         │                                                                    │
│         ▼                                                                    │
│   results.csv (decision, confidence, reasoning)                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Data Source

Google BigQuery - Stanford STARR

Setting	Value
Project	`som-nero-phi-roxanad-entllm`
Datasets	Chronic sinusitis cohorts (2016-2025)

Tables:

Table	Description
`clinical_note`	ENT clinical notes (progress notes, consults, H&P)
`radiology_report`	CT sinus scan reports
`procedures`	CPT codes for surgeries/endoscopies

Input/Output Formats

Input CSV (from extraction)

Column	Description
`llm_caseID`	Unique case identifier
`formatted_progress_text`	Concatenated ENT clinical notes
`formatted_radiology_text`	Concatenated radiology reports

Output CSV (from analysis)

Column	Description
`llm_caseID`	Case identifier
`decision`	`Yes` or `No` for surgery recommendation
`confidence`	1-10 confidence score
`reasoning`	2-4 sentence explanation
`api_response`	Raw LLM response
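
The structured columns are derived from the raw `api_response`. A minimal sketch of that parsing step follows, assuming the model replies with a JSON object whose keys match the column names above (`decision`, `confidence`, `reasoning`), possibly surrounded by extra text. `parse_response` is a hypothetical function, not the parser in `LLM_analysis.py`.

```python
import json
import re


def parse_response(api_response):
    """Extract decision, confidence, and reasoning from a raw LLM reply.

    Returns None when no JSON object with valid fields is found.
    ASSUMPTION: the JSON keys match the output CSV column names.
    """
    # Take everything from the first "{" to the last "}" so surrounding text is ignored.
    match = re.search(r"\{.*\}", api_response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    decision = str(data.get("decision", "")).strip().capitalize()
    confidence = data.get("confidence")
    if decision not in ("Yes", "No"):
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, int):
        return None
    if not 1 <= confidence <= 10:
        return None
    return {
        "decision": decision,
        "confidence": confidence,
        "reasoning": str(data.get("reasoning", "")),
    }


if __name__ == "__main__":
    raw = 'Answer: {"decision": "yes", "confidence": 8, "reasoning": "CT shows disease."}'
    print(parse_response(raw))
```

Returning `None` for malformed replies lets the caller keep the raw `api_response` for inspection instead of writing a guessed decision.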

Project Structure

ent-llm/
├── cli.py                    # LLM analysis CLI
├── cli_extract.py            # Data extraction CLI
├── cli_ablation.py           # Demographic ablation CLI
├── data_extraction/          # BigQuery data processing
│   ├── config.py             # Project settings, CPT codes
│   ├── raw_data_parsing.py   # Data extraction functions
│   └── note_extraction.py    # Note filtering and censoring
├── llm_query/                # LLM integration
│   ├── securellm_adapter.py  # SecureLLM client wrapper
│   ├── LLM_analysis.py       # Analysis pipeline
│   ├── ablation_analysis.py  # Ablation experiment logic
│   └── llm_input.py          # Data formatting
├── batch_query/              # Batch processing
├── evaluation/               # Results evaluation
└── training/                 # Training workflows

License

MIT License - See LICENSE file for details.
