A framework for evaluating hallucinations in multi-turn conversations across challenging domains.
Check out our website: https://halluhard.com/
We use pixi to make sure our experiments are reproducible across all environments.
Install pixi package manager:
Linux/macOS:

```bash
curl -fsSL https://pixi.sh/install.sh | sh
```

Windows: Download the installer from pixi.sh and run it.
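To confirm the installation and, optionally, pre-build the project environment that the `pixi run` commands below rely on, you can run the following (a small optional sketch; run it from the repository root):

```bash
pixi --version   # verify pixi is on your PATH
pixi install     # optional: resolve and install the project environment up front
```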
Running models from hosted providers typically requires API credentials. Export the relevant environment variables for the providers/models you plan to use.
```bash
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
# Add others as needed (e.g., Google/DeepSeek/Moonshot), depending on what you run.
```
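Before launching a large run, you can quickly check which keys are exported in the current shell. A small sketch; the variable names for providers other than OpenAI/Anthropic are assumptions, so adjust them to whatever your provider client expects:

```bash
# Print which provider API keys are currently exported (names other than
# OPENAI_API_KEY / ANTHROPIC_API_KEY are assumptions; adjust as needed).
for key in OPENAI_API_KEY ANTHROPIC_API_KEY GOOGLE_API_KEY DEEPSEEK_API_KEY MOONSHOT_API_KEY; do
  if [ -n "${!key}" ]; then
    echo "$key is set"
  else
    echo "$key is NOT set"
  fi
done
```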
Generate multi-turn conversations with different models:

```bash
# Example: Research Questions task
pixi run python -m research_questions.generate_responses \
    --data research_questions/data/research_questions_all.jsonl \
    --model gpt-5 \
    --max-follow-ups 2 \
    --max-concurrent 100
```

Tips:
- You may need to change `--max-concurrent` to a smaller value if a rate limit is reached.
- Start with a tiny run first (the example below generates 3 conversations, each with 2 follow-up questions):
```bash
# Example: Research Questions task
pixi run python -m research_questions.generate_responses \
    --data research_questions/data/research_questions_all.jsonl \
    --model gpt-5 \
    --max-follow-ups 2 \
    --max-concurrent 100 \
    --n 3
```

HalluHard supports two judging modes:
A) Claim-based verification (`--type webscraper`)
Extracts atomic claims (citation + supported content) per turn, searches the web, and judges claims against retrieved evidence. This is intended for tasks that require citation grounding.
```bash
# Evaluate using web scraper method
pixi run python -m judging_pipeline.run_pipeline \
    --input "research_questions/results/conversations_gpt-5_250convs.jsonl" \
    --type webscraper \
    --seed 42 \
    --base_path "research_questions" \
    --task research_questions \
    --max_claims_per_turn 5 \
    --n 100
```

B) Response-based verification (`--type coding_direct`)
Directly evaluates coding-task responses using a coding-specific judge (e.g., checking package installation/importing and function calling behaviors). This mode is intended for the coding task.
```bash
# Evaluate coding task using direct coding judge
pixi run python -m judging_pipeline.run_pipeline \
    --input "coding/results/conversations_gpt-5_200convs.jsonl" \
    --type coding_direct \
    --task coding
```
Generate an HTML report from an evaluation output file:

```bash
pixi run report \
    --task research_questions \
    --input "research_questions/results/conversations_gpt-5_250convs_eval_webscraper.jsonl"
```

HalluHard covers four tasks:

- `research_questions` - Academic research question claims
- `legal_cases` - Legal case citations and facts
- `medical_guidelines` - Medical guideline claims
- `coding` - Code implementation claims
Each task follows the same workflow:
data preparation → response generation → judging → reporting
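For example, a full loop for the Research Questions task can be chained as below. This is a sketch: it assumes the generation step writes its output under the naming convention shown in the directory layout further down (`conversations_<model>_<n>convs.jsonl`), so adjust the paths if your run produces different filenames.

```bash
# 1. Generate a small batch of conversations.
pixi run python -m research_questions.generate_responses \
    --data research_questions/data/research_questions_all.jsonl \
    --model gpt-5 \
    --max-follow-ups 2 \
    --max-concurrent 20 \
    --n 3

# 2. Judge the generated conversations with the web-scraper pipeline
#    (input path assumes the conversations_<model>_<n>convs.jsonl convention).
pixi run python -m judging_pipeline.run_pipeline \
    --input "research_questions/results/conversations_gpt-5_3convs.jsonl" \
    --type webscraper \
    --seed 42 \
    --base_path "research_questions" \
    --task research_questions \
    --max_claims_per_turn 5

# 3. Render an HTML report from the evaluation output.
pixi run report \
    --task research_questions \
    --input "research_questions/results/conversations_gpt-5_3convs_eval_webscraper.jsonl"
```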
The framework supports multiple LLM providers and models:
- OpenAI: `gpt-5`, `gpt-5-mini`, `gpt-5-nano`, `gpt-5-medium`, `gpt-5.2`, `gpt-5.2-medium-websearch`
- Anthropic: `claude-opus-4-5`, `claude-sonnet-4-5`, `claude-haiku-4-5`, `claude-opus-4-5-websearch`
- DeepSeek: `deepseek-reasoner`, `deepseek-chat`
- Google: `gemini-3-pro`, `gemini-3-flash`
- Moonshot: `kimi-k2-thinking`
- Z.ai: `GLM-4.7-thinking`
Each task directory follows the same layout:

```
<task>/
├── data/                      # Input data
│   └── *.jsonl                # Task-specific question datasets
├── results/                   # Generated conversations and evaluations
│   ├── conversations_<model>_<n>convs.jsonl
│   ├── conversations_<model>_<n>convs_eval_<type>.jsonl
│   └── reports/               # HTML reports
├── prompts/                   # Task-specific prompts
└── generate_responses.py      # Response generation script
```
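Because all outputs are JSONL, you can get a quick look at a results file from the shell. A small sketch; the filename is illustrative, and the assumption that each line is one conversation follows from the `_<n>convs` naming:

```bash
# How many conversations does a results file contain? (assumes one record per line)
wc -l research_questions/results/conversations_gpt-5_250convs.jsonl

# Pretty-print the first record to inspect its structure (truncated to 40 lines).
head -n 1 research_questions/results/conversations_gpt-5_250convs.jsonl \
  | pixi run python -m json.tool | head -n 40
```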
Response generation parameters:

- `--data`: Path to input data file
- `--model`: Model name to use
- `--max-follow-ups`: Number of follow-up questions per conversation (typically 2)
- `--max-concurrent`: Number of concurrent API requests (varies by model rate limits)
- `--n`: Number of conversations to generate (optional, defaults to all)
- `--output`: Custom output path (optional)
Judging parameters:

- `--input`: Path to conversations file to evaluate
- `--type`: Evaluation method (`webscraper` or `coding_direct`)
- `--seed`: Random seed for reproducibility
- `--base_path`: Base directory for the task
- `--task`: Task name
- `--max_claims_per_turn`: Maximum claims per turn (typically 5)
- `--n`: Number of conversations to evaluate (optional)
- Worker parameters: `--searchers`, `--fetchers`, `--filters`, `--judges` (see the sketch below)
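A sketch of a judging run that sets the worker parameters explicitly. The assumption here, suggested by the flag names but not spelled out above, is that each flag takes an integer number of concurrent workers for that pipeline stage; the values are arbitrary placeholders.

```bash
# Assumption: each worker flag takes an integer worker count (values are placeholders).
pixi run python -m judging_pipeline.run_pipeline \
    --input "research_questions/results/conversations_gpt-5_250convs.jsonl" \
    --type webscraper \
    --seed 42 \
    --base_path "research_questions" \
    --task research_questions \
    --max_claims_per_turn 5 \
    --searchers 8 \
    --fetchers 8 \
    --filters 4 \
    --judges 4
```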
Reporting parameters:

- `--task`: Task name
- `--input`: Path to evaluation results file
See `final_run.sh` for the final launch commands used in the paper.
For full transparency, we provide our data generation pipelines under `<task_name>/data_fetcher.py`. Readers are welcome to reuse these scripts to generate additional questions for their own experiments.
If you find our code useful, please cite our work:
```bibtex
@misc{fan2026halluhard,
  title={HalluHard: A Hard Multi-Turn Hallucination Benchmark},
  author={Fan, Dongyang and Delsad, Sebastien and Flammarion, Nicolas and Andriushchenko, Maksym},
  year={2026}
}
```