Question answering over PDFs whose answers live in figures, charts, and tables, where text-only search comes up short. Two retrievers, a per-query router, and an eval behind every change.
▶ Live demo: https://spectrarag-web-997818771923.europe-west1.run.app
A PDF's text layer is only half the document. Ask an ordinary RAG system "in Figure 5, what colour is the line with no intersections?" and it will tell you the answer isn't in the context. It's right: the colour lives in the chart's pixels, not the text. The same blind spot covers plot geometry, screenshots, and image-only diagrams.
SpectraRAG runs a text retriever and a visual retriever over rendered page
images, and a per-query classifier decides which to use. When a question is
visual, the page image is sent to a vision model at answer time. The corpus
here is scientific PDFs and MMLongBench-Doc, but nothing is domain-specific:
point the ingester at any folder of .pdf files.
Page-level retrieval on MMLongBench-Doc
(20 documents, 149 queries, 107 in-corpus). The metric is recall@10 over
retrieved pages, scored paper-aware (a page counts only if it is the gold
paper's), so it is independent of any generator. The router uses an
LLM classifier (gemma3:4b via Ollama) to send figure and table queries to
the visual leg.
| retrieval | recall@10 | figures only |
|---|---|---|
| text-only | 0.55 | 0.51 |
| text + visual router | 0.75 | 0.76 |
| relative lift | +35 % | +48 % |
The exact values (text-only 0.5545, router 0.7461; figure subset 0.5111 →
0.7578) and the per-query records are committed under
data/eval/ as baseline-mmlongbench-text.json and
baseline-mmlongbench-router.json.
The gain is mechanical, not a metric artefact: on every figure query that
improved, the router retrieved a page the text leg never returned, while
text-routed (factual) queries scored identically across both runs.
MMLongBench-Doc answers are ~93 % visual, which rewards routing
aggressively to the visual leg. On a text-heavy corpus the lift is smaller. Full methodology and failure modes are in
docs/results.md.
flowchart LR
PDF[PDFs] -->|Docling ingest| TXTIDX[(Qdrant + BM25<br/>text/figure/table chunks)]
PDF -->|build page index| VISIDX[(Qdrant<br/>ColQwen2<br/>page multi-vectors)]
Q[User query] --> CLF{Classifier}
CLF --> TXT[Text leg<br/>BM25 + BGE-M3 + rerank]
CLF -->|hybrid route| VIS[Visual leg<br/>ColQwen2 MaxSim]
TXTIDX --> TXT
VISIDX --> VIS
TXT --> LLM[Vision LLM]
VIS --> LLM
LLM --> A[Answer + citations]
- Ingest. Each PDF goes through Docling for layout-aware, section-attributed text chunks plus figure and table extraction, with a figure-role classifier separating real figures from page decoration. Text, figure, and table chunks are indexed twice: BGE-M3 dense vectors in Qdrant and a BM25 sparse index in process. Pages are rendered to PNG, and ColQwen2 embeds each page into a multi-vector page index persisted in Qdrant (built offline; ADR 0028).
- Classify. A per-query classifier routes to text-only or text+visual.
The default is an LLM zero-shot classifier (
gemma3:4bover Ollama, no API key); a regex classifier is the fallback. - Retrieve. The text leg (BM25 + BGE-M3 dense + reciprocal-rank fusion + BGE-reranker-v2-m3) always runs. On hybrid routes the visual leg (ColQwen2 late-interaction MaxSim over page images) also runs, and the two fuse at page granularity.
- Generate. A vision-capable model reads the retrieved chunks and their page images and returns an answer with chunk-level citations.
Fastest path: serve the bundled demo corpus self-contained, with no Docker or Ollama (in-process bge-m3 + the committed Qdrant snapshot). The first run downloads the bge-m3 weights.
git clone https://github.com/NorthernLightx/spectrarag
cd spectrarag
uv sync --extra dev
uv run spectrarag serveOpen http://localhost:8000/ and query the 20-paper demo corpus (text
retrieval; set RAG_ENABLE_MULTIMODAL=true to also run the visual router over
the committed page index). For your own PDFs, see
Bring your own PDFs.
For the full local stack (Docker Qdrant + Ollama, plus ingesting your own PDFs):
cp .env.example .env
docker compose up -d qdrant ollama
docker exec rag-ollama ollama pull bge-m3
# fetch the demo corpus (20 arXiv papers from the committed manifest), then ingest
uv run python -m scripts.fetch_papers --manifest data/curated_demo/papers.txt
uv run python -m scripts.bootstrap_corpus --pdf-dir data/papers
uv run uvicorn src.api.main:app --reload --port 8000Then open http://localhost:8000/. It's a single-page app with five tabs:
- Chat re-retrieves on every turn (with a condense step on follow-ups). The panel beside the answer shows the route the server picked, the ranked chunks, and the page images it read.
- Inspection traces one query through routing, retrieval, and rerank.
- Papers and Figures browse the indexed corpus. Figures are bbox-cropped thumbnails with caption search.
- Why multimodal? walks through real MMLongBench questions where text-only retrieval misses the page the answer is on and the router finds it.
Chat and Inspection both carry an Advanced panel to force the route, switch intent vs cascade routing, set top-K, and filter by paper. Chat's routing also has an agentic option (DCI): an LLM agent greps the corpus with terminal-style tools instead of vector search. It's off by default, text-only, and slower, so treat it as a demo of the approach, not the default path.
Generation has two paths. By default the demo answers through a keyless server
route (/demo/chat) backed by a caged,
free-tier-only OpenRouter key, so a visitor with no key still gets an answer.
Paste your own key into the UI to upgrade: the chat call then goes
browser-direct to OpenRouter for stronger models, and the server never sees,
logs, or stores the key. The one exception is the opt-in DCI mode, whose agent
runs server-side: it holds your key in memory for that request only, never
stored or logged. Vision-capable
models (gpt-4o, claude-sonnet-4.x, qwen3-vl) receive the retrieved page
PNGs as image blocks when RAG_PAGES_DIR is set; populate it with
python -m scripts.render_pages --pdf-dir data/papers.
API surface:
/health— component-wiring check (status, version, env,pages_available)/query— retrieval only, no generation/demo/chat— keyless generation through the server's caged free-tier key (ADR 0027); the default answer path when no visitor key is set/answer— full server-side generation with a configured key; returns 503 on the demo, which carries no shared full key
The hosted demo runs a fixed 20-paper corpus baked into the image and has no upload. Locally, point the ingester at any directory:
mkdir mydocs # drop your .pdf files here
uv run python -m scripts.bootstrap_corpus \
--pdf-dir ./mydocs --collection my_corpusSet RAG_CORPUS_COLLECTION=my_corpus in .env, restart uvicorn, and the
corpus is queryable through /query and the UI. The eval harness works
against any collection; write a golden set at data/golden/<name>.yaml.
For a single document, set RAG_ENABLE_UPLOAD=true and use the Papers tab's
Add PDF button (or POST /ingest): the PDF is ingested into the live corpus
and text-retrievable on the next query, no restart. The flag stays off on the
hosted demo; enable it only
on a local or API-key-gated deploy — the route carries no auth or rate limit of
its own.
The visual leg needs a CUDA GPU to build the page index (ColQwen2-v1.0 fits an 8 GB card); serving it then runs on CPU. Build the persisted index and point the app at it:
uv run python -m scripts.build_visual_index --pdf-dir ./mydocs \
--qdrant http://localhost:6333 \
--corpus-collection my_corpus --collection my_corpus_visualRAG_ENABLE_MULTIMODAL=true
RAG_VISUAL_COLLECTION=my_corpus_visual
RAG_PAGES_DIR=data/pages
scripts/eval_run.py replays retrieval (and optionally generation + an LLM
judge) against a golden YAML and writes a run JSON. scripts/check_regression.py
is the gate: it compares a run against a committed baseline and fails on any
metric that drops more than 5 %. MMLongBench scoring is page-level, so a run
JSON is post-processed by scripts/rescore_mmlb_pages.py before it becomes a
baseline.
Every retrieval knob (chunk size, fusion weights, rerank cutoff, router
classifier) is measured in isolation, so a recall change traces to one knob
rather than a framework default. See docs/evals.md for the golden schema and
metric definitions.
Beyond retrieval, the project pins down where end-to-end answer accuracy actually tops out, and part of the apparent ceiling turned out to be the scorer rather than the model.
- It's a RAG ↔ long-context tradeoff. Where a document fits the model's context, feeding the whole document beats a top-5 retrieval cut by ~0.12 (tables +0.18); past context, retrieval is required. We measured both directions and shipped route-by-fit as an opt-in eval policy (ADR 0024). It is deliberately not wired into the corpus-wide demo, which would first have to identify the target document.
- The strict scorer understated accuracy by ~0.11, and we caught it. The standard extract-then-match step marks terse-but-correct answers as "Not answerable" (even GPT-4o does this). A strictness-checked re-grade lifts the oracle read from ~0.45 to ~0.55. The honest ceiling is ~0.55; the published SOTA is ~0.62 (whole document, full 1082-query set).
- Scaling the model doesn't move the reading. A 31B, a 235B, and frontier gemini-2.5-pro read the gold pages within a point of each other; the bottleneck is fine-grained figure and table reading, not model size.
- Changing what the reader sees helps where a bigger model doesn't. The
bottleneck is reading figures and tables, so this lever works on the input rather
than the model: transcribe a page's tables and charts to text offline and feed it
to the reader alongside the page image. On the post-retrieval failure set that
adds about 0.12, but on too few cases to call significant yet. The extractor can
be a local 1.2B model (MinerU2.5) instead of a cloud one: it matches
qwen3-vl-235b on extraction recall, a tie rather than a win. The backend selector
is in place (
RAG_EXTRACTOR_BACKEND, default off); the ingest-time path that would feed it to the reader waits until the result holds up (ADR 0025). - Negatives are measured, not assumed. GraphRAG lost to plain RAG (ADR 0018, 5–1 on global synthesis); agentic query-decomposition did not transfer and hurt retrieval on this corpus (ADR 0019); text rerankers were a wash (ADR 0012); and direct-corpus-interaction (a grep-tool agent) is off the reading bottleneck here, so it ships as an experimental opt-in, not a default (ADR 0026).
Full methodology in docs/results.md. For how SpectraRAG
compares to other document-RAG tools, see
docs/comparison.md.
- The demo corpus is text-heavy. Visual routing is on, but the baked 20-paper arXiv set has few figure or table answers, so the visual lift you see here is small. The +35 % above is the MMLongBench result, not what these papers will show.
- Demo answers are free-tier unless you bring a key. Out of the box the demo generates with free models. Paste an OpenRouter key for stronger ones.
- The LLM judge under-rates pixel answers. When the answer is in the image (e.g. "the line is red") and the judge sees only text, faithfulness is scored low. For generation quality, trust gold-answer match, not the judge.
uv run ruff check . && uv run ruff format --check .
uv run mypy src tests scripts # strict
uv run pytest -v # unit + integrationCI runs the same set on every push and PR. To run it locally before each push
(plus a gitleaks scan), enable the in-tree hook once: git config core.hooksPath .githooks.
Local setup, commit conventions, and the leakage rules are in
CONTRIBUTING.md.
Common setup issues: model 'bge-m3' not found means Ollama hasn't pulled it
(docker exec rag-ollama ollama pull bge-m3); expected 1024, got 768 means
the collection was built with a different embedder (re-ingest with --force);
a ColQwen2 OutOfMemoryError means the GPU is below ~8 GB, so disable the
visual leg with RAG_ENABLE_MULTIMODAL=false.
src/ FastAPI app, retrievers, ingestion, eval, observability
scripts/ CLI entry points (bootstrap, render, eval, regression)
web/ BYOK frontend — React via in-browser Babel, no build step, baked into the image
data/ gitignored except curated_demo/papers.txt, eval baselines,
golden sets, and the committed demo page renders
docs/ ADRs, eval methodology, results
tests/ unit + integration suites, mirrors src/
- Retrieval: Qdrant, BGE-M3, BGE-reranker-v2-m3, rank-bm25
- Visual retrieval: ColQwen2 (vidore)
- Document parsing: Docling (layout, tables, figure classification), PyMuPDF (page rendering)
- Models: OpenRouter for cloud generation (caged demo key and browser-side BYOK), Ollama for local embeddings and the routing classifier
- API: FastAPI, Pydantic v2, uv
- Observability: OpenTelemetry, Sentry, Langfuse
- Deploy: Cloud Run via GitHub Actions with Workload Identity Federation
- Eval benchmark: MMLongBench-Doc
Papers and benchmarks this project builds on or measures against:
- ColPali: Efficient Document Retrieval with Vision Language Models (Faysse et al., arXiv:2407.01449). The late-interaction visual-retrieval architecture; the deployed visual leg runs ColQwen2 from this line.
- BGE M3-Embedding (Chen et al., arXiv:2402.03216). The dense and sparse text embeddings behind the text leg.
- Docling Technical Report (Auer et al., IBM, arXiv:2408.09869). Layout-aware PDF parsing for the structure-attributed chunker.
- MMLongBench-Doc (arXiv:2407.01523). The long-document multimodal benchmark behind the headline retrieval result.
- BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval (Su et al., arXiv:2407.12883). Retrieval that needs reasoning rather than surface similarity; the benchmark the Agentic search experiment is scored on.
- Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction (arXiv:2605.05242). The grep-the-raw-corpus agent behind the experimental Agentic search toggle (ADR 0026).
Why not LlamaIndex or LangChain? Both would have shipped faster. The cost is opacity: every retrieval choice becomes a knob inside someone else's abstraction, and a +2 % recall change is hard to attribute. This repo measures each choice against a committed baseline instead. The retrievers conform to a small protocol if you later want to wrap them in a framework.
Why visual retrieval instead of OCR-ing the figures?
OCR recovers figure-internal text and captions, which PyMuPDF often already
extracts from modern PDFs. It cannot recover what isn't text: chart colours,
geometric layout, screenshot contents, axis positions relative to data. Visual
retrieval over rendered pages keeps all of that. The canonical example is
mmlb_0008 — "what colour is the line with no intersections?", gold answer
red, a fact that exists only in the pixels.
Why MMLongBench-Doc? The in-repo golden set is too easy to separate text from visual retrieval. MMLongBench-Doc is the harder regime: long documents, ~22 % unanswerable queries (useful for the refusal gate), and it isn't saturated (GPT-4o tops out near 45 % F1). Being published, its numbers can be cross-referenced.
MIT. See LICENSE.