Trustworthy question answering on the ACL Anthology using VerbatimRAG.
All reproducible artifacts are on Hugging Face:
- Corpus: KRLabsOrg/acl-anthology-md
  - metadata config: paper metadata
  - fulltext config: docling-converted markdown
- ACL span dataset: KRLabsOrg/acl-verbatim-spans
  - canonical config: silver train/dev rows plus the gold test benchmark
  - encoder config: pretokenized ModernBERT rows for direct token-classification training
- Multi-domain span dataset: KRLabsOrg/verbatim-spans
  - ACL silver + RAGBench + Squeez mix used to train the generic v2 model
- ACL-specialized model: KRLabsOrg/acl-verbatim-modernbert
  - Best on ACL Anthology paper chunks; loads via AutoModel.from_pretrained(..., trust_remote_code=True) with a .process() API
- Generic model: KRLabsOrg/verbatim-rag-modern-bert-v2
  - v2 of verbatim-rag-modern-bert-v1; multi-domain training, same .process() API. Strongest off-domain (RAGBench, tool outputs); see docs/GENERIC_EVAL.md for the per-domain comparison
- acl_verbatim/eval/: retrieval and span-extraction evaluation entrypoints.
- acl_verbatim/qa_generation/: synthetic query generation and silver span annotation.
- acl_verbatim/span_training/: token-classification data prep, training, and evaluation.
- scripts/corpus/: corpus materialization, metadata extraction, PDF conversion, and indexing.
- scripts/publish/: Hugging Face dataset/model publishing utilities.
- scripts/experiments/: adapters for RAGBench, Squeez, MultiSpanQA, and exploratory evals.
- scripts/maintenance/: operational/admin utilities for external services.
- scripts/README.md: complete script inventory grouped by workflow.
- docs/PIPELINE.md: full ACL silver-data and ACL model training pipeline.
- docs/GENERIC_EVAL.md: generic multi-domain model and baseline evaluation.
KRLabsOrg/acl-verbatim-modernbert
is a 150M-parameter token classifier that highlights supporting evidence spans in a paper chunk
given a query. On the gold benchmark it scores 0.536 word-level F1 and is the
best committed run in artifacts/eval/gold_extraction/summary.csv.
from transformers import AutoModel

model = AutoModel.from_pretrained("KRLabsOrg/acl-verbatim-modernbert", trust_remote_code=True)
result = model.process(
    question="What is ModernBERT?",
    context="ModernBERT is a long-context encoder for NLP. It supports 8192 tokens.",
    threshold=0.2,
)
for span in result["spans"]:
    print(f"[{span['score']:.2f}] {span['text']}")

See the model card for the full .process() parameter reference (threshold, min-span length,
merge-gap, sentence-level scoring) and for training details.
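The model card is the reference for how .process() applies its threshold, min-span length, and merge-gap settings. As a rough illustration of what such post-processing can look like, the sketch below filters, merges, and prunes character spans. The `start`/`end` keys, the function name, and the order of the steps are assumptions made for this example, not the model's actual implementation.

```python
from typing import Dict, List


def postprocess_spans(
    spans: List[Dict],
    threshold: float = 0.2,
    min_span_chars: int = 10,
    merge_gap_chars: int = 20,
) -> List[Dict]:
    """Filter spans by score, merge near-adjacent ones, drop very short ones.

    Assumes each span is a dict with hypothetical keys 'start' and 'end'
    (character offsets, end exclusive) plus a 'score'.
    """
    # 1. Keep only spans at or above the score threshold, in document order.
    kept = sorted(
        (s for s in spans if s["score"] >= threshold), key=lambda s: s["start"]
    )

    # 2. Merge a span into the previous one when the gap between them is small.
    merged: List[Dict] = []
    for span in kept:
        if merged and span["start"] - merged[-1]["end"] <= merge_gap_chars:
            last = merged[-1]
            last["end"] = max(last["end"], span["end"])
            last["score"] = max(last["score"], span["score"])
        else:
            merged.append(dict(span))  # copy so the input is not mutated

    # 3. Drop spans that are still shorter than the minimum length.
    return [s for s in merged if s["end"] - s["start"] >= min_span_chars]
```

Merging before the length filter lets two short neighbouring fragments survive as one span; filtering first would discard both.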
pip install -e .

For the silver-label generation and token-classifier training:

pip install -e ".[training]"

For Hugging Face dataset tooling and semantic-highlighting baselines:

pip install -e ".[hf]"

For extracting paper metadata:

pip install -e ".[papermeta]"

The gold extraction benchmark is kept in the repo as
333_20260206_dense_top5_20260305.json for local
reproducibility. It is the same split as canonical/test in
KRLabsOrg/acl-verbatim-spans.
For instructions on generating more synthetic queries, see the section on
Generating synthetic evaluation data.
The markdown version of all papers is published as
KRLabsOrg/acl-anthology-md.
Materialize it as the local file layout the scripts expect:
python scripts/corpus/export_hf_corpus.py --output-metadata-file paper_data.jsonl --output-md-dir acl_md

This gives you paper_data.jsonl (JSONL metadata) and acl_md/*.md (markdown fulltext), sufficient
for indexing and all downstream experiments. For instructions on rebuilding this data locally, see
the sections on Extracting paper metadata and
Obtaining and preprocessing PDFs.
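After materializing the corpus, a quick sanity check can confirm that both outputs exist and are non-empty. The sketch below only counts JSONL records and markdown files; the function names are hypothetical, and it makes no assumption about which fields the metadata records contain.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def corpus_summary(metadata_file: Path, md_dir: Path) -> Dict[str, int]:
    """Count metadata records and markdown files in the materialized corpus."""
    n_records = sum(1 for _ in read_jsonl(metadata_file))
    n_markdown = sum(1 for _ in md_dir.glob("*.md"))
    return {"metadata_records": n_records, "markdown_files": n_markdown}


if __name__ == "__main__":
    # Paths match the export command above.
    print(corpus_summary(Path("paper_data.jsonl"), Path("acl_md")))
```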
Chunk markdown files and build a Milvus vector index. DEVICE can be set to cuda for GPU and
cpu for CPU. Indexing to a file using LocalMilvusStore is also possible, but we recommend
CloudMilvusStore; for a locally hosted Milvus instance, CLOUD_URI can be set to e.g.
http://localhost:19530.
Indexing to CloudMilvusStore:

python scripts/corpus/index_acl.py --input-dir acl_md --metadata-file paper_data.jsonl --collection-name acl --device cuda --use-cloud --cloud-uri CLOUD_URI

Note: For non-localhost Milvus instances (e.g. Zilliz Cloud), authentication is typically required. Pass --milvus-token YOUR_TOKEN to authenticate.

Indexing to file using LocalMilvusStore:

python scripts/corpus/index_acl.py --input-dir acl_md --index-file acl.db --metadata-file paper_data.jsonl --collection-name acl --device DEVICE

Load a local index and try some queries interactively:

python acl_verbatim/eval/test_index.py --index-file acl.db --device DEVICE --collection-name acl

Using a cloud Milvus instance:

python acl_verbatim/eval/test_index.py --collection-name acl --device DEVICE --use-cloud --cloud-uri CLOUD_URI --milvus-token MILVUS_TOKEN

Note: LLM answer generation requires the following environment variables:

export OPENAI_API_BASE=https://api.openai.com/v1 # or any compatible endpoint
export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-4o-mini

Use the -r flag to skip LLM calls and do retrieval only.
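Before running without -r, it can help to confirm that the three variables are actually set. The following is a minimal sketch of such a check; the helper name is hypothetical and the repository scripts may validate their configuration differently.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_VARS = ("OPENAI_API_BASE", "OPENAI_API_KEY", "OPENAI_MODEL")


def missing_llm_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_llm_vars()
    if missing:
        print("Missing variables:", ", ".join(missing))
    else:
        print("LLM configuration looks complete.")
```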
Run queries against the index, store results, and print retrieval metrics. SEARCH_TYPE can be
dense, sparse, hybrid, full_text, or auto (default).
CAUTION: without the -r option the script will also run extraction on all retrieved chunks,
which is slow and LLM-costly.
python acl_verbatim/eval/test_index.py --collection-name acl --device cpu --use-cloud --cloud-uri CLOUD_URI --milvus-token $MILVUS_API_KEY -r -k 500 --questions-dir QUERIES_PATH --query-field query --output-file SEARCH_RESULTS_FILE -s SEARCH_TYPE | tee METRICS_FILE

Compare two sets of search results at the chunk level for the top-k results:

python acl_verbatim/eval/compare_results.py SEARCH_RESULTS_FILE_1 SEARCH_RESULTS_FILE_2 -k TOP_K_TO_COMPARE

The gold benchmark is 20 queries × 5 retrieved chunks: 100 rows total, with 47 relevant rows containing 78 gold spans and 53 irrelevant negative rows. See acl_verbatim/eval/README.md for supported prediction formats. Committed per-row predictions for every reported system live under artifacts/eval/gold_extraction/ and can be rescored at any time.
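For readers unfamiliar with word-level span scoring, the sketch below shows one common way to compute precision, recall, and F1 for a single benchmark row. It is an illustration under stated assumptions: whitespace tokenization, lowercasing, multiset overlap, and a score of 1.0 when both prediction and gold are empty (as on a correctly rejected irrelevant row) are all choices made for this example. The scoring actually used for the reported numbers is the one in acl_verbatim/eval/ and may differ.

```python
from collections import Counter
from typing import List, Tuple


def word_prf(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F1 between predicted and gold spans.

    Words are lowercased whitespace tokens; overlap is counted as a multiset.
    """
    pred_words = Counter(w for span in predicted for w in span.lower().split())
    gold_words = Counter(w for span in gold for w in span.lower().split())

    # Assumption: nothing predicted for a row with no gold spans is a perfect score.
    if not pred_words and not gold_words:
        return 1.0, 1.0, 1.0

    overlap = sum((pred_words & gold_words).values())
    precision = overlap / sum(pred_words.values()) if pred_words else 0.0
    recall = overlap / sum(gold_words.values()) if gold_words else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```

How per-row scores are aggregated over the 100 rows (micro vs. macro averaging) is a separate design choice that this sketch does not cover.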
The LLM extractor evaluation uses any OpenAI-compatible endpoint, configured through the standard trio of environment variables:
export OPENAI_API_BASE=https://api.openai.com/v1 # or any compatible endpoint
export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-4o-mini
python acl_verbatim/eval/evaluate_extractor.py --gold-file 333_20260206_dense_top5_20260305.json --output-file <model>.gold_eval.json

Add --extraction-prompt-file acl_verbatim/prompts/extraction_paragraph.txt to reproduce the
paragraph-prompt variant used as our silver teacher.
Provence:

python acl_verbatim/eval/run_provence.py --gold-file 333_20260206_dense_top5_20260305.json --output-file provence.jsonl

Zilliz Semantic Highlight (token-span mode at threshold 0.3, our best configuration):

python acl_verbatim/eval/run_semantic_highlight.py --gold-file 333_20260206_dense_top5_20260305.json --output-file zilliz_spans_03.jsonl --output-mode spans --threshold 0.3

See Using the trained model for the programmatic API. To reproduce the paper's headline gold-benchmark numbers end-to-end:

python acl_verbatim/span_training/evaluate_token_cls.py --hf-dataset KRLabsOrg/acl-verbatim-spans --hf-config canonical --gold-split test --model-dir KRLabsOrg/acl-verbatim-modernbert --threshold 0.2 --min-span-chars 10 --merge-gap-chars 20 --output-file acl-verbatim-modernbert.gold_eval.json

Score a pre-computed predictions file:

python acl_verbatim/eval/evaluate_predictions.py --gold-file 333_20260206_dense_top5_20260305.json --pred-file <system>.jsonl --output-file <system>.gold_eval.json

Print a one-row-per-system comparison across any number of runs:
python acl_verbatim/eval/compare_span_runs.py \
--gold-file 333_20260206_dense_top5_20260305.json \
--run modernbert=artifacts/eval/gold_extraction/gte-reranker.thr_0.2_merged.json \
--run glm=artifacts/eval/gold_extraction/glm-5.json \
--run qwen=artifacts/eval/gold_extraction/qwen_paragraph_gold.json \
--run mistral=artifacts/eval/gold_extraction/mistral-small-2603.json \
--run provence=artifacts/eval/gold_extraction/provence_gold.jsonl \
--run zilliz=artifacts/eval/gold_extraction/zilliz_spans_03_gold.jsonl

The published model was fine-tuned from Alibaba-NLP/gte-reranker-modernbert-base on the
pretokenized encoder split of KRLabsOrg/acl-verbatim-spans:

python acl_verbatim/span_training/train_token_cls.py --hf-dataset KRLabsOrg/acl-verbatim-spans --hf-config encoder --train-split train --eval-split validation --model Alibaba-NLP/gte-reranker-modernbert-base --output-dir acl-verbatim-modernbert --batch-size 8 --lr 2e-5 --epochs 5 --label-scheme binary

Publish a fine-tune of your own with the same AutoModel.from_pretrained(..., trust_remote_code=True)
API as our release:

python scripts/publish/push_model.py --model-dir acl-verbatim-modernbert --repo-id <your-namespace>/acl-verbatim-modernbert

An up-to-date clone of
acl-anthology is necessary to obtain paper metadata.
Since get_anthology_metadata.py relies on Python code from acl-anthology that cannot be
installed as part of a package, the path to the repository must be passed via --anthology-path:
python scripts/corpus/get_anthology_metadata.py --anthology-path /path/to/acl-anthology --output-file paper_data.jsonl

PDFs of ACL Anthology papers can be downloaded via the acl-anthology repository, which provides detailed instructions.
NOTE: the markdown version of all papers is sufficient to perform most experiments; downloading PDFs is only required if you need to rerun the conversion step. We kindly ask that you observe the ACL Anthology's request not to download large amounts of data unnecessarily. Without their permissive policies this project would not have been possible.
PDFs can be converted to markdown via docling (batch sizes tested on a single A100 GPU):
python scripts/corpus/preprocess_acl.py --input-dir ../acl-anthology/build/anthology-files/pdf --output-dir acl_md --metadata-file paper_data.jsonl --doc-batch-size 512 --page-batch-size 1024

Step 1. Sample random papers:

python acl_verbatim/qa_generation/sample_papers.py --input-file paper_data.jsonl --output-file SAMPLE_PAPERS_FILE --n NO_OF_PAPERS_TO_SAMPLE --seed RANDOM_SEED

Step 2. Chunk papers and choose one random chunk per paper, classified by question type:

python acl_verbatim/qa_generation/chunk_and_classify.py --input-dir acl_md --output-dir CHUNKS_DIR --papers-file SAMPLE_PAPERS_FILE --n 1

Step 3. Generate questions for these chunks:

python acl_verbatim/qa_generation/gen_qa.py --input-dir CHUNKS_DIR --output-dir QUESTIONS_PATH

Step 4. Generate retrieval queries from questions:

python acl_verbatim/qa_generation/question_to_query.py --input-dir QUESTIONS_PATH --output-dir QUERIES_PATH

The file generated by Step 4 contains both questions and queries as well as chunk contents and can therefore be used directly as input for all downstream evaluation and silver-generation steps.
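The four steps chain together, with each step reading what the previous one wrote. A small driver can make that chaining explicit. The sketch below only builds the command lines from the flags documented above; the default file and directory names are hypothetical placeholders, and nothing is executed.

```python
import shlex
from typing import List


def qa_generation_commands(
    n_papers: int,
    seed: int,
    sample_file: str = "sample_papers.jsonl",  # hypothetical file name
    chunks_dir: str = "chunks",                # hypothetical directory names
    questions_dir: str = "questions",
    queries_dir: str = "queries",
) -> List[List[str]]:
    """Build the argv lists for Steps 1-4, feeding each output into the next step."""
    base = "acl_verbatim/qa_generation"
    return [
        ["python", f"{base}/sample_papers.py", "--input-file", "paper_data.jsonl",
         "--output-file", sample_file, "--n", str(n_papers), "--seed", str(seed)],
        ["python", f"{base}/chunk_and_classify.py", "--input-dir", "acl_md",
         "--output-dir", chunks_dir, "--papers-file", sample_file, "--n", "1"],
        ["python", f"{base}/gen_qa.py", "--input-dir", chunks_dir,
         "--output-dir", questions_dir],
        ["python", f"{base}/question_to_query.py", "--input-dir", questions_dir,
         "--output-dir", queries_dir],
    ]


if __name__ == "__main__":
    # Print the commands for review; run them yourself once they look right.
    for command in qa_generation_commands(n_papers=100, seed=42):
        print(shlex.join(command))
```

To execute the steps instead of printing them, each list can be passed to subprocess.run(command, check=True), which stops the chain at the first failing step.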
The end-to-end silver pipeline (sample → retrieve → LLM-annotate → filter → prepare → train → evaluate → publish) is documented separately in docs/PIPELINE.md.
Apache 2.0. See LICENSE.
Please cite our paper, ACL-Verbatim: hallucination-free question answering for research:
@misc{Recski:2026,
  title={ACL-Verbatim: hallucination-free question answering for research},
  author={Gábor Recski and Szilveszter Tóth and Nadia Verdha and István Boros and Ádám Kovács},
  year={2026},
  eprint={2605.21102},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.21102},
}
ACL-Verbatim was built in collaboration by KR Labs and the TU Wien Data Science Research Unit. Work partially supported by the CLEAR project, funded within the Cybersecurity Programme Kybernet-Pass of the Austrian Federal Ministry of Finance.