This repository contains the implementation of the experiments from the paper:
"Less is More: Improved Fact-Checking on AVeriTeC with Iterative Question Generation"
The project investigates the isolated impact of iterative question generation in Retrieval-Augmented Generation (RAG) pipelines for fact-checking.
This system follows a standard open-domain fact-checking pipeline:
- Receive a claim
- Generate verification questions
- Use questions as search queries
- Retrieve evidence from the web
- Rank and aggregate evidence
- Predict a final label:
SUPPORTEDREFUTEDNOT ENOUGH EVIDENCE (NEI)CONFLICTING EVIDENCE/CHERRYPICKING
The key contribution is a controlled comparison between:
- Single-step question generation
- Iterative question generation
- Generates up to 5 questions in a single pass
- No access to intermediate evidence
- All queries are independent
prompt = f"""
You are a fact-checking assistant.
Task: Generate up to {num_questions} questions to verify the claim.
Claim:
{context.claim}
Claim date:
{context.claim_date}
Speaker:
{context.speaker}
Instructions:
- Generate as many useful questions as needed (from 1 up to {num_questions})
- Do not force unnecessary questions
- Avoid redundancy
- Questions must be factual and verifiable
- If claim involves time, check chronology
- If claim involves attribution, verify speaker/source
- If claim involves numbers, verify quantities
- Do not provide explanations
Output:
Return a numbered list of questions only.
"""
- Generates one question at a time
- Each question is conditioned on:
- previously generated questions
- previously retrieved evidence
- Stops early if enough evidence is gathered
prompt = f"""
Task: Generate ONE new question to help verify the claim.
Claim:
{context.claim}
Claim date:
{context.claim_date}
Speaker:
{context.speaker}
---
Previous questions:
{previous_questions}
---
Known evidence (partial):
{evidence_snippets}
---
Current stance signals:
{stance_summary}
---
Instructions:
- Generate ONLY ONE question
- Do NOT repeat previous questions
- Focus on missing or uncertain aspects
- Do not force unnecessary questions
- Avoid redundancy
- Questions must be factual and verifiable
- If claim involves time, check chronology
- If claim involves attribution, verify speaker/source
- If claim involves numbers, verify quantities
- Do not provide explanations
Output:
Provide only the question.
"""
👉 The only difference between methods is how questions are generated
python -m venv venv
source venv/bin/activate # Linux / Mac
venv\Scripts\activate # Windowspip install -r requirements.txtCreate a .env file in the root directory:
# ===== LLM =====
OLLAMA_MODEL=llama3.1:8b # Local model served via Ollama
LLM_TEMPERATURE=0.0 # Set to `0.0` for deterministic outputs
LANGUAGE=en # Language used in prompts (default: `en`)
# ===== QUESTION GENERATION =====
MAX_QUESTIONS=5 # Maximum number of generated questions (N = 5)
# ===== SEARCH ENGINE =====
SEARCH_ENGINE=duckduckgo
SEARCH_MAX_RESULTS=10 # Number of results per query
SEARCH_MAX_URLS=30 # Maximum URLs retrieved per query
SEARCH_MAX_WORKERS=10 # Parallel workers for retrieval
# ===== RETRIEVAL =====
BM25_TOP_K=10 # Number of documents selected via BM25
USE_QUESTION_FOR_RETRIEVAL=true
# ===== RERANKING =====
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2 # Cross-encoder model for reranking
RERANKER_TOP_K=5 # Final number of selected passages
RERANKER_THRESHOLD=0.3 # Score threshold for filtering
USE_RERANKER=true # Enable/disable reranking stage
# ===== PASSAGES =====
CHUNK_SIZE=300 # Number of tokens per passage chunkpython experiments/run_averitec_original.pypython experiments/run_averitec_iterative.pyResults are saved in:
outputs/
Includes:
- Predictions per claim
- Aggregated metrics:
- Accuracy
- Macro-F1
- F1 per class
- Number of queries
- Token usage
- Iterative generation:
- Improves accuracy (+1.7 to +17.5)
- Reduces number of queries (~50%)
- Reduces abstention (NEI)
- Increases token usage (7–17%)
👉 Key insight: Iterative generation shifts computation from retrieval to reasoning
- AVeriTeC (development set)
- 500 real-world claims
- Open-domain (web retrieval)
Labels:
SUPPORTEDREFUTEDNOT ENOUGH EVIDENCECONFLICTING EVIDENCE/CHERRYPICKING