IR Challenge Retrieval Benchmark

Benchmarking of retrieval approaches for the scientific citation retrieval challenge.

Given a query paper, retrieve the most relevant documents from a corpus of 20,000 scientific papers based on citation relationships (100 queries, 19 domains, Semantic Scholar).

Hard Domain Filter + 7-Signal Ensemble:

Experimente :

Leaderboard:

Run python scripts/run_benchmark.py to reproduce.

Summary Table

#	Approach	Training NDCG@10
1	BM25 on Title+Abstract	0.4663
2	TF-IDF with bigrams	0.4724
3	MiniLM semantic search	0.5073
4	BM25 on body chunks	0.5197
5	RRF fusion (BM25 + MiniLM)	0.5534
6	Hard domain filter + 5 signals	0.6398
7	Hard domain + 6 signals	0.6572
8	6 signals + CV-tuned weights	0.6937
9	Fine-tuned MiniLM + 7 signals	0.7018
10	DomainFilter+7-signalensemble	0.7495
11	10signals+Optuna	0.7541
12	Granite-r2+1000trials	0.7572
13	ModernBERTcMaxSim+1500trials	0.760

Fill in the table by running the benchmark and pasting your results.

Project Structure

.
├── src/
│   ├── data.py            # Data loading utilities
│   ├── evaluate.py        # Evaluation metrics (NDCG, Recall, MRR, MAP, ...)
│   └── retrievers/
│       ├── base.py        # Abstract BaseRetriever interface
│       ├── tfidf.py       # TF-IDF sparse retrieval
│       ├── bm25.py        # BM25 sparse retrieval
│       ├── dense.py       # Dense retrieval (sentence-transformers)
│       ├── hybrid.py      # Hybrid: Reciprocal Rank Fusion (RRF)
│       └── reranker.py    # Cross-encoder two-stage reranking
├── scripts/
│   ├── run_benchmark.py   # Main benchmark runner + leaderboard printer
│   └── embed.py           # Pre-compute embeddings for any model
├── data/                  # (not committed) corpus, queries, qrels, embeddings
├── results/               # Saved ranked outputs (JSON)
└── notebooks/
    └── challenge.ipynb    # Exploratory analysis + baseline notebooks

Setup

pip install -r requirements.txt

Data files (corpus.parquet, queries.parquet, qrels.json) and pre-computed embeddings are not included in this repo. Place them under data/ following the structure above.

Usage

Run all retrievers

python scripts/run_benchmark.py

Run specific retrievers

python scripts/run_benchmark.py --retrievers tfidf bm25 dense_minilm

Save results + per-domain breakdown

python scripts/run_benchmark.py --save --domains

Available retriever keys

Key	Description
`tfidf`	TF-IDF on title + abstract
`bm25`	BM25 on title + abstract
`dense_minilm`	Dense: `all-MiniLM-L6-v2` (pre-computed)
`dense_bge`	Dense: `BAAI/bge-small-en-v1.5`
`hybrid_bm25`	BM25 + Dense MiniLM via RRF
`hybrid_tfidf`	TF-IDF + Dense MiniLM via RRF
`rerank_bm25`	BM25 → CrossEncoder reranking
`rerank_dense`	Dense MiniLM → CrossEncoder reranking

Pre-compute embeddings for a new model

python scripts/embed.py --model BAAI/bge-small-en-v1.5

Adding a New Retriever

Create src/retrievers/my_retriever.py inheriting BaseRetriever:

from .base import BaseRetriever

class MyRetriever(BaseRetriever):
    name = "My Retriever"

    def fit(self, corpus):
        ...  # index the corpus

    def retrieve(self, queries, top_k=100):
        ...  # return {query_id: [ranked doc_ids]}
        return results

Register it in scripts/run_benchmark.py inside build_registry():

"my_retriever": MyRetriever(),

Run:

python scripts/run_benchmark.py --retrievers my_retriever

Evaluation Metrics

All metrics are computed over 100 public queries with citation-based ground truth.

Metric	Description
NDCG@10 / @100	Normalized Discounted Cumulative Gain
Recall@10 / @100	Fraction of gold docs retrieved in top-k
Precision@10	Precision at rank 10
MRR@10	Mean Reciprocal Rank
MAP	Mean Average Precision

Contributing

Pull requests are welcome! To contribute a new retriever:

Follow the BaseRetriever interface
Add it to the registry in run_benchmark.py
Fill in the leaderboard table with your results

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
imgs		imgs
notebooks		notebooks
scripts		scripts
src		src
.gitignore		.gitignore
Presentation.pdf		Presentation.pdf
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

IR Challenge Retrieval Benchmark

Leaderboard:

Summary Table

Project Structure

Setup

Usage

Run all retrievers

Run specific retrievers

Save results + per-domain breakdown

Available retriever keys

Pre-compute embeddings for a new model

Adding a New Retriever

Evaluation Metrics

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

IR Challenge Retrieval Benchmark

Leaderboard:

Summary Table

Project Structure

Setup

Usage

Run all retrievers

Run specific retrievers

Save results + per-domain breakdown

Available retriever keys

Pre-compute embeddings for a new model

Adding a New Retriever

Evaluation Metrics

Contributing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages