This project generates multi-turn medical conversations and uses LLM-as-judge evaluation to detect self-preference bias.
med-self-preference/
├── config/ # config.yaml
├── src/
│ ├── generation/ # generate_conversations.py, generate_single_turn*.py
│ ├── evaluation/ # pairwise_*.py, individual_*.py, generate_eval_report.py
│ └── test_generation.py
├── visualizer/ # Next.js conversation viewer
├── example_conversations/ # Multi-turn outputs
├── meddialog_output/ # Single-turn MedDialog outputs
├── covid_dialogue_output/ # Single-turn COVID outputs
└── evals/ # Pairwise evaluation results
- Install dependencies:
  pip install -r requirements.txt
- Set up API keys (create a .env file in the project root):
  OPENAI_API_KEY=sk-...
  ANTHROPIC_API_KEY=sk-ant-...
  GOOGLE_API_KEY=...
- Verify setup:
  python src/test_generation.py # or: make test
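Before running the verification step, it can help to confirm that the keys are actually visible to Python. The following is a minimal sketch, not part of the repository: `missing_api_keys` is a hypothetical helper, and it assumes the values from `.env` have already been loaded into the process environment. Whether all three keys are needed likely depends on which models you use.

```python
import os

# Key names are taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None):
    """Return the required key names that are unset or blank in `env`."""
    # Assumption: .env values were already exported into the environment.
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

Passing a plain dict instead of relying on `os.environ` makes the check easy to exercise without touching real credentials.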
Create multi-turn medical dialogues using different LLM models.
python src/generation/generate_conversations.py \
--num_scenarios 100 \
--turns 2 \
--models gpt-4 \
--patient_model gpt-4 \
--output_dir ./example_conversations
Key options:
- --num_scenarios: Number of conversations to generate (default: 100)
- --turns: Turns per conversation (default: 8)
- --models: List of physician models (e.g., gpt-4, claude-sonnet-4-5-20250929)
- --patient_model: Which model simulates the patient (default: gpt-4)
- --repair: Auto-fix turns that violate constraints
- --output_dir: Where to save conversations
Output: Saves conversations as {model}_{turns}t_conversations.json (e.g., gpt-4_2t_conversations.json)
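The naming pattern above is what the pairwise evaluation step later expects as input, so it can be convenient to build the path in code. This is a small sketch assuming only the documented `{model}_{turns}t_conversations.json` pattern; `conversations_path` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path


def conversations_path(output_dir, model, turns):
    """Build the path of one model's conversation file.

    Follows the documented pattern {model}_{turns}t_conversations.json.
    """
    return Path(output_dir) / f"{model}_{turns}t_conversations.json"


if __name__ == "__main__":
    path = conversations_path("./example_conversations", "gpt-4", 2)
    print(path.name)      # file name only
    print(path.exists())  # True once generation has been run
```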
Create single-turn physician responses from a local Covid dialogue source file.
python src/generation/generate_single_turn_covid.py \
--source_file ./COVID-Dialogue-Dataset-English.txt \
--num_scenarios 100 \
--models gpt-4o \
--output_dir ./covid_dialogue_output
# or: make covid-gen
Useful options:
- --source_file: Path to the local COVID-Dialogue-Dataset-English.txt file
- --parse_only: Parse the source file and write scenarios.json, skipping model generation
- --shuffle, --seed: Randomized scenario sampling controls
Output: scenarios.json, {model}_responses.json, and all_responses.json in your output directory.
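The `--shuffle` and `--seed` options suggest reproducible random sampling of scenarios. How the script combines them is not documented here, so the following is only a sketch of one plausible behavior: `sample_scenarios` is a hypothetical function, and the assumption is that shuffling happens before the first `num_scenarios` items are taken.

```python
import random


def sample_scenarios(scenarios, num_scenarios, shuffle=False, seed=None):
    """Pick up to `num_scenarios` items, optionally in a seeded random order."""
    items = list(scenarios)  # copy so the caller's list is not reordered
    if shuffle:
        # A local RNG keeps the global random state untouched;
        # the same seed always yields the same order.
        random.Random(seed).shuffle(items)
    return items[:num_scenarios]


if __name__ == "__main__":
    scenarios = [f"scenario_{i}" for i in range(10)]  # placeholder data
    first = sample_scenarios(scenarios, 3, shuffle=True, seed=42)
    second = sample_scenarios(scenarios, 3, shuffle=True, seed=42)
    print(first == second)  # same seed, same selection
```

Fixing the seed matters for this kind of study because every physician model should answer the same set of scenarios.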
Compare two models' responses to detect self-preference bias.
python src/evaluation/pairwise_evaluation.py \
--model_a_file example_conversations/gpt-4_2t_conversations.json \
--model_b_file example_conversations/claude-sonnet-4-5-20250929_2t_conversations.json \
--judge_model claude-3-5-sonnet-20241022 \
--output evals/pairwise_claude_judge.json
Key options:
- --model_a_file, --model_b_file: Conversation files to compare
- --judge_model: Which model evaluates (default: Claude). Use gpt-4 to test whether GPT-4 shows self-preference.
- --output: Where to save results
Output: JSON with preference scores, win rates, and detailed metric breakdown.
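One way to read the win rates is to compare how often a model wins when it judges itself against how often it wins under a different judge. The sketch below illustrates that comparison only; it does not reproduce the script's output schema or its actual metric. The verdict labels (`"A"`, `"B"`, `"tie"`), both function names, and the toy verdict lists are assumptions made for illustration.

```python
from collections import Counter


def win_rates(verdicts):
    """Map each label ("A", "B", "tie") to its share of all verdicts."""
    if not verdicts:
        raise ValueError("no verdicts to summarize")
    counts = Counter(verdicts)
    return {label: counts[label] / len(verdicts) for label in ("A", "B", "tie")}


def self_preference_gap(rate_when_self_judged, rate_when_other_judged):
    """Positive values mean the model wins more often as its own judge."""
    return rate_when_self_judged - rate_when_other_judged


if __name__ == "__main__":
    # Toy verdicts, NOT real results: model A is gpt-4 in both runs.
    judged_by_gpt4 = ["A", "A", "B", "A"]
    judged_by_claude = ["A", "B", "B", "tie"]
    gap = self_preference_gap(
        win_rates(judged_by_gpt4)["A"],
        win_rates(judged_by_claude)["A"],
    )
    print(gap)
```

A gap near zero would suggest the judge's identity makes little difference; interpreting a nonzero gap still requires enough conversations to rule out noise.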
Each response is scored 0-5 on:
- Faithfulness - Medical accuracy and appropriateness
- Completeness - Covers diagnosis, medications, follow-up, warning signs
- Safety - Red flag detection, emergency guidance
- Clarity - Communication quality for patient understanding
- Conciseness - Appropriate length, no excessive repetition
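The rubric above can be checked and summarized with a few lines of Python. This is a sketch under stated assumptions: the lowercase metric keys, the dict layout, and the equal weighting in `mean_score` are all hypothetical and may differ from what the evaluation scripts actually emit.

```python
# Metric names follow the rubric above; lowercase keys are an assumption.
METRICS = ("faithfulness", "completeness", "safety", "clarity", "conciseness")


def validate_scores(scores):
    """Check that every rubric metric is present and scored within 0-5."""
    for metric in METRICS:
        if metric not in scores:
            raise ValueError(f"missing metric: {metric}")
        if not 0 <= scores[metric] <= 5:
            raise ValueError(f"{metric} out of range: {scores[metric]}")
    return scores


def mean_score(scores):
    """Unweighted mean over the five metrics (equal weights are an assumption)."""
    validate_scores(scores)
    return sum(scores[m] for m in METRICS) / len(METRICS)


if __name__ == "__main__":
    example = {  # illustrative values only
        "faithfulness": 4, "completeness": 3, "safety": 5,
        "clarity": 4, "conciseness": 4,
    }
    print(mean_score(example))
```

Validating the range before averaging catches malformed judge output early, which is useful when the judge is itself an LLM.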
Optional: Use config/config.yaml to set defaults instead of command-line args:
data:
  num_scenarios: 100
generation:
  turns_per_conversation: 8
  physician_models:
    - gpt-4
    - claude-sonnet-4-5-20250929
  patient_simulator: gpt-4
  physician_temperature: 0.3
  patient_temperature: 0.8
  max_tokens_per_turn: 500

Browse conversations in a web UI:
cd visualizer
npm run dev
Then visit: http://localhost:3000?file=../example_conversations/gpt-4_2t_conversations.json