RAG evaluation framework for comparing chunking strategies.
ContextRAG is a reproducible benchmarking harness built to answer a specific question: does routing documents to different chunk sizes based on length improve retrieval quality? It grew out of a 2022 chatbot project and evolved into a focused evaluation tool with statistical comparison infrastructure. The answer, on the benchmarks tested, is no -- adaptive chunking does not outperform uniform chunking.
```bash
# Install
git clone https://github.com/seanbrar/ContextRAG.git
cd ContextRAG
uv sync --all-extras

# Run offline demo (no API keys needed)
uv run contextrag demo
```

Output: `runs/demo_eval.json` with precision/recall metrics.
ContextRAG loads a dataset, chunks documents using a configurable strategy, embeds them with chromaroute, indexes into ChromaDB, and scores retrieval against ground-truth queries:
- Metrics: precision@k, recall@k, nDCG@k, MRR@k, hit@k (see the sketch after this list)
- Statistical comparison: bootstrap confidence intervals, randomization tests, paired TOST equivalence testing, Cohen's d effect sizes
- Experiment matrix: sweep baselines and k values in one command, get per-cell and aggregate reports
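The rank metrics above reduce to a few lines each; the following is a simplified per-query sketch, not ContextRAG's internal API (function names are illustrative):

```python
# Rank-based retrieval metrics for a single query (illustrative sketch).
# `retrieved` is the ranked list of document IDs returned by the index;
# `relevant` is the ground-truth set from queries.jsonl.
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / max(len(relevant), 1)

def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return 1.0 if any(d in relevant for d in retrieved[:k]) else 0.0

def mrr_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    for rank, d in enumerate(retrieved[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```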
| Command | Description |
|---|---|
| `contextrag eval` | Run a single evaluation (supports YAML configs) |
| `contextrag demo` | Offline evaluation with local embeddings |
| `contextrag matrix` | Run baseline-by-k experiment matrix |
| `contextrag compare` | Compare two runs with per-query deltas |
| `contextrag validate-dataset` | Validate dataset/query schema |
| `contextrag doctor` | Check configuration health |
| `contextrag db index` | Build vector index from documents |
| `contextrag db query` | Query the vector index |
The adaptive router classifies documents by token count and assigns chunk sizes accordingly:
| Category | Token Range | Chunking |
|---|---|---|
| Short | <=3,500 | None (full document) |
| Medium | 3,500-15,000 | 2,000-token chunks |
| Long | >15,000 | 1,000-token chunks |
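The routing rule itself is a two-threshold lookup; a minimal sketch of the table above (thresholds and sizes from the table, names illustrative rather than the project's API):

```python
# Length-based chunk-size router described in the table above (illustrative).
def route_chunk_size(token_count: int) -> int | None:
    """Return a chunk size in tokens, or None to keep the full document."""
    if token_count <= 3_500:
        return None      # short: index the full document
    if token_count <= 15_000:
        return 2_000     # medium: 2,000-token chunks
    return 1_000         # long: 1,000-token chunks
```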
Finding: no benefit. Across three datasets and multiple k values, the router never outperforms uniform 1,000-token chunking -- and sometimes underperforms it. Modern embedding models handle chunk-size variation well enough that length-based routing adds complexity without improving retrieval.
To reproduce: `make reproduce` runs the uniform-vs-router matrix on `data/eval-expanded` with local embeddings.
See docs/results.md for the full matrix and discussion.
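That conclusion rests on paired comparisons of per-query metric deltas between two runs. A minimal sketch of one piece of that machinery, a paired bootstrap confidence interval (illustrative only, not ContextRAG's implementation):

```python
# Paired bootstrap CI over per-query deltas (e.g. router recall@k minus
# uniform recall@k for the same query). Illustrative sketch only.
import random

def bootstrap_ci(deltas: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(deltas) for _ in deltas) / len(deltas)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# A confidence interval that straddles zero is consistent with "no benefit".
```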
```
dataset/
├── documents/      # one text file per document
└── queries.jsonl   # {"query": "...", "relevant_ids": ["doc1", "doc2"]}
```
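A dataset in this layout can be read in a few lines; the sketch below assumes document IDs are the file stems under documents/ and is not the project's actual loader:

```python
# Minimal loader for the dataset layout above (illustrative, not ContextRAG's code).
import json
from pathlib import Path

def load_dataset(root: str) -> tuple[dict[str, str], list[dict]]:
    root_path = Path(root)
    documents = {
        path.stem: path.read_text(encoding="utf-8")   # assumption: file stem == document ID
        for path in sorted((root_path / "documents").iterdir())
        if path.is_file()
    }
    queries = [
        json.loads(line)
        for line in (root_path / "queries.jsonl").read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    return documents, queries

docs, queries = load_dataset("data/demo")
```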
Five datasets are included:
- `data/demo` -- minimal 3-document set for smoke testing
- `data/eval-mixed` -- mixed-domain corpus with varied document lengths
- `data/eval-expanded` -- larger multi-domain corpus used for the primary comparison
- `data/eval-external` -- external documents not seen during development
- `data/eval-scifact-mini` -- subset of the SciFact benchmark for external validation
YAML configs drive reproducible experiments:

```bash
uv run contextrag eval --config experiments/eval_expanded_uniform_local.yaml
```

Environment variables (or `.env`):

```bash
OPENROUTER_API_KEY=sk-or-...   # For hosted embeddings
EMBED_PROVIDER=auto            # auto | openrouter | local
LOCAL_EMBEDDINGS_MODEL=sentence-transformers/all-MiniLM-L6-v2
```

```mermaid
flowchart LR
    A[Documents] --> B[Index]
    B --> C[chromaroute]
    C --> D{Provider}
    D -->|OpenRouter| E[text-embedding-3]
    D -->|Local| F[MiniLM]
    E & F --> G[(ChromaDB)]
    G --> H[Query]
    H --> I[Evaluate]
```
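The provider split in the diagram can be illustrated directly with ChromaDB's built-in embedding functions; the sketch below does not reproduce chromaroute's actual API, and the hosted model name is an assumption:

```python
# Illustrative provider selection for the flow above (not chromaroute's API).
import os
import chromadb
from chromadb.utils import embedding_functions

def make_embedding_function(provider: str | None = None):
    """Choose a hosted (OpenAI-compatible via OpenRouter) or local embedding backend."""
    provider = provider or os.getenv("EMBED_PROVIDER", "auto")
    if provider in ("auto", "openrouter") and os.getenv("OPENROUTER_API_KEY"):
        return embedding_functions.OpenAIEmbeddingFunction(
            api_key=os.environ["OPENROUTER_API_KEY"],
            api_base="https://openrouter.ai/api/v1",
            model_name="text-embedding-3-small",  # assumption, for illustration
        )
    return embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=os.getenv("LOCAL_EMBEDDINGS_MODEL",
                             "sentence-transformers/all-MiniLM-L6-v2")
    )

client = chromadb.Client()
collection = client.get_or_create_collection(
    "docs", embedding_function=make_embedding_function()
)
```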
Built on chromaroute, a provider-agnostic embedding library for ChromaDB. For the project's evolution from a 2022 chatbot to this evaluation framework, see docs/evolution.md.
```bash
make install   # uv sync --all-extras
make all       # lint + typecheck + tests
make test-cov  # pytest with coverage (90% gate)
```

- docs/results.md - Evaluation results and discussion
- docs/reproducibility.md - How to reproduce evaluations
- docs/design-decisions.md - Architecture rationale
- chromaroute - Provider-agnostic embeddings for ChromaDB (extracted from this project)
- ChromaDB - Vector database
- OpenRouter - Multi-provider API gateway