Code release for TreeDDx: Benchmarking Differential Diagnostic Reasoning in Large Language Models Using Structured Clinical Decision Trees.
The original case data are based on the JAMA Network Clinical Challenge collection:
https://jamanetwork.com/collections/44038/clinical-challenge
The underlying Clinical Challenge content is not included in this repository. Users must obtain appropriate access, permission, or licensing from the data source before preparing and using the input data.
Prepare a local file named original_data.json. It should be a JSON list, where each item is one clinical case.
Required fields:
[
{
"question": "Clinical case question text",
"opa": "Option A",
"opb": "Option B",
"opc": "Option C",
"opd": "Option D",
"answer_idx": "A",
"dicussion": "Ground-truth diagnostic discussion text",
"model_answer": "The answer and diagnostic reasoning generated by another LLM"
}
]Notes:
model_answershould be produced before running the LLM-tree generation step.
The scripts use an OpenAI-compatible chat-completions API. The default model is:
gpt-5.4-mini
For security, do not hard-code your API key in the scripts. Either pass credentials with command-line arguments:
python gt_decisiontree_generation.py \
--api-key YOUR_API_KEY \
--base-url YOUR_BASE_URLRun the three scripts in order.
Default command:
python gt_decisiontree_generation.py \
--api-key YOUR_API_KEY \
--base-url YOUR_BASE_URLDefault input:
original_data.json
Default output:
gt_decisiontree_output.json
This step appends a new field to each item:
gt_tree
Default command:
python llm_decisiontree_generation.py \
--api-key YOUR_API_KEY \
--base-url YOUR_BASE_URLDefault input:
gt_decisiontree_output.json
Default output:
gtandllm_decisiontree_output.json
This step appends a new field to each item:
llm_tree
Default command:
python evaluation.py \
--api-key YOUR_API_KEY \
--base-url YOUR_BASE_URLDefault input:
gtandllm_decisiontree_output.json
Default output:
eva_output.json
This step appends a new field to each item:
evaluation
The output file has this top-level structure:
{
"items": [],
"overall_mean_scores": {},
"field_mean_scores": {}
}If you use this code, please cite the TreeDDx paper:
TreeDDx: Benchmarking Differential Diagnostic Reasoning in Large Language Models Using Structured Clinical Decision Trees

