Benchmarking of retrieval approaches for the scientific citation retrieval challenge.
Given a query paper, retrieve the most relevant documents from a corpus of 20,000 scientific papers based on citation relationships (100 queries, 19 domains, Semantic Scholar).
Hard Domain Filter + 7-Signal Ensemble:
Run
python scripts/run_benchmark.pyto reproduce.
| # | Approach | Training NDCG@10 |
|---|---|---|
| 1 | BM25 on Title+Abstract | 0.4663 |
| 2 | TF-IDF with bigrams | 0.4724 |
| 3 | MiniLM semantic search | 0.5073 |
| 4 | BM25 on body chunks | 0.5197 |
| 5 | RRF fusion (BM25 + MiniLM) | 0.5534 |
| 6 | Hard domain filter + 5 signals | 0.6398 |
| 7 | Hard domain + 6 signals | 0.6572 |
| 8 | 6 signals + CV-tuned weights | 0.6937 |
| 9 | Fine-tuned MiniLM + 7 signals | 0.7018 |
| 10 | DomainFilter+7-signalensemble | 0.7495 |
| 11 | 10signals+Optuna | 0.7541 |
| 12 | Granite-r2+1000trials | 0.7572 |
| 13 | ModernBERTcMaxSim+1500trials | 0.760 |
Fill in the table by running the benchmark and pasting your results.
.
├── src/
│ ├── data.py # Data loading utilities
│ ├── evaluate.py # Evaluation metrics (NDCG, Recall, MRR, MAP, ...)
│ └── retrievers/
│ ├── base.py # Abstract BaseRetriever interface
│ ├── tfidf.py # TF-IDF sparse retrieval
│ ├── bm25.py # BM25 sparse retrieval
│ ├── dense.py # Dense retrieval (sentence-transformers)
│ ├── hybrid.py # Hybrid: Reciprocal Rank Fusion (RRF)
│ └── reranker.py # Cross-encoder two-stage reranking
├── scripts/
│ ├── run_benchmark.py # Main benchmark runner + leaderboard printer
│ └── embed.py # Pre-compute embeddings for any model
├── data/ # (not committed) corpus, queries, qrels, embeddings
├── results/ # Saved ranked outputs (JSON)
└── notebooks/
└── challenge.ipynb # Exploratory analysis + baseline notebooks
pip install -r requirements.txtData files (corpus.parquet, queries.parquet, qrels.json) and pre-computed embeddings are not included in this repo. Place them under data/ following the structure above.
python scripts/run_benchmark.pypython scripts/run_benchmark.py --retrievers tfidf bm25 dense_minilmpython scripts/run_benchmark.py --save --domains| Key | Description |
|---|---|
tfidf |
TF-IDF on title + abstract |
bm25 |
BM25 on title + abstract |
dense_minilm |
Dense: all-MiniLM-L6-v2 (pre-computed) |
dense_bge |
Dense: BAAI/bge-small-en-v1.5 |
hybrid_bm25 |
BM25 + Dense MiniLM via RRF |
hybrid_tfidf |
TF-IDF + Dense MiniLM via RRF |
rerank_bm25 |
BM25 → CrossEncoder reranking |
rerank_dense |
Dense MiniLM → CrossEncoder reranking |
python scripts/embed.py --model BAAI/bge-small-en-v1.5- Create
src/retrievers/my_retriever.pyinheritingBaseRetriever:
from .base import BaseRetriever
class MyRetriever(BaseRetriever):
name = "My Retriever"
def fit(self, corpus):
... # index the corpus
def retrieve(self, queries, top_k=100):
... # return {query_id: [ranked doc_ids]}
return results- Register it in
scripts/run_benchmark.pyinsidebuild_registry():
"my_retriever": MyRetriever(),- Run:
python scripts/run_benchmark.py --retrievers my_retrieverAll metrics are computed over 100 public queries with citation-based ground truth.
| Metric | Description |
|---|---|
| NDCG@10 / @100 | Normalized Discounted Cumulative Gain |
| Recall@10 / @100 | Fraction of gold docs retrieved in top-k |
| Precision@10 | Precision at rank 10 |
| MRR@10 | Mean Reciprocal Rank |
| MAP | Mean Average Precision |
Pull requests are welcome! To contribute a new retriever:
- Follow the
BaseRetrieverinterface - Add it to the registry in
run_benchmark.py - Fill in the leaderboard table with your results
