- Two-stage retrieval architecture — bi-encoder (FAISS, ~5–20ms) for candidate retrieval, cross-encoder re-ranking for precision. P@5 improves from ~0.60 to ~0.80 at +500ms cost. Each stage has a measurable cost/benefit tradeoff.
-
- Source diversity enforcement — a custom O(n) post-ranking step guaranteeing ≥1 passage per fetched Wikipedia article in the final top-k. Without it, the cross-encoder fills all slots from the single highest-scoring article, silently dropping context from other sources. Not in any RAG tutorial.
-
- Faithfulness scoring (LLM-as-judge) — after generation, the same LLM is asked: "Given only these passages, is this answer supported?" Detects hallucination before it reaches the user.
-
- SSE token streaming —
/ask/streamendpoint emits status events then word-by-word tokens, identical pattern to production LLM APIs. -
-
Full observability stack — Prometheus
/metricsendpoint + auto-provisioned Grafana dashboard (docker-compose). Tracks latency histograms, request counts, in-flight requests. -
- Multi-LLM, multi-vector-store — Groq/OpenAI/Anthropic/local behind a
LLM_PROVIDERflag; FAISS/Pinecone behindVECTOR_STORE. Both are drop-in swaps, not rewrites.
┌─────────────────────────────┐ SSE stream ┌──────────────────────────────────────────────┐ │ React 18 Frontend │ ──────────────► │ FastAPI Backend │ │ Vite · Tailwind · :3000 │ │ 1. Fetch 1–5 Wikipedia articles │ │ Streaming chat UI │ ◄────────────── │ 2. Sentence-boundary chunking (NLTK) │ │ History · Benchmark · Auth │ token / done │ 3. Bi-encoder → FAISS IndexFlatIP │ └─────────────────────────────┘ │ 4. Cross-encoder re-ranking │ │ 5. Source-diversity enforcement │ │ 6. LLM generation (streaming) │ │ 7. TTL cache · Prometheus metrics │ └──────────────────────────────────────────────┘ Observability: Prometheus scrapes /metrics/prometheus every 15s → Grafana auto-dashboard
Category What's Included RAG Pipeline Bi-encoder → cross-encoder re-ranking → source diversity → LLM generation Multi-LLM Groq (llama-3.3-70b, free) · OpenAI · Anthropic · local (flan-t5, no key) Vector Store FAISS (local, ~5ms) or Pinecone (serverless) — swap with VECTOR_STOREflagToken Streaming SSE endpoint — status events + word-by-word tokens (ChatGPT pattern) Faithfulness LLM-as-judge hallucination detection after every generation Observability Prometheus metrics + Grafana dashboard (auto-provisioned via docker-compose) Evaluation In-app Precision@k, MRR, faithfulness, latency across benchmark queries Auth JWT register/login, per-user query history in SQLite Deployment Docker Compose, GitHub Actions CI/CD, EC2 free-tier optimised
Python 3.11FastAPIFAISSsentence-transformersReact 18ViteTailwind CSSPrometheusGrafanaDockerGroqOpenAIAnthropicPineconeSQLiteGitHub Actions
Why two retrieval stages? The bi-encoder is fast (sub-20ms) but scores query and passage independently. The cross-encoder scores them jointly, catching semantic matches the bi-encoder misses. The +500ms cost is worth it for accuracy-critical queries; for latency-sensitive workloads,
ENABLE_RERANKER=falsedrops it instantly.Source diversity is non-obvious. When you fetch 3 Wikipedia articles, the cross-encoder can legitimately rank all 5 top-k slots from one article. That maximises per-slot relevance but destroys breadth. The
enforce_source_diversity()post-processing step reserves one slot per unique source before filling remaining slots by score — O(n), zero latency impact.Faithfulness != retrieval quality. Precision@k measures whether the right passages were retrieved. It says nothing about whether the LLM stayed faithful to them. The LLM-as-judge step closes this gap, using the same model to verify its own output against the retrieved passages.
Content-hash FAISS persistence — the FAISS index is keyed by MD5 of passage content. Re-running the same query skips re-encoding (saves ~300ms) and re-fetching (saves Wikipedia API roundtrip).
Metric Value FAISS retrieval (top-20) ~5–20ms Cross-encoder re-ranking +~500ms Full pipeline (Groq) ~800ms–1.2s Precision@5 (with reranker) ~0.80 Precision@5 (without reranker) ~0.60 MRR (with reranker) ~0.85
# Clone and configure git clone https://github.com/UTKARSH698/wikiqa-rag-system cd wikiqa-rag-system cp .env.example .env # Set GROQ_API_KEY=gsk_... (free at console.groq.com) # Docker (recommended) — starts backend + React + Prometheus + Grafana docker compose up --build # Backend → http://localhost:8000 # React UI → http://localhost:3000 # Grafana → http://localhost:3001 (admin/admin) # Or run manually pip install -r requirements.txt uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000 cd frontend_react && npm install && npm run dev # Run tests pytest tests/ -v
- Multi-hop questions require chained retrieval (not implemented — would need query decomposition before the first FAISS call)
-
- Post-2024 knowledge gaps: Wikipedia coverage is finite; recent events return stale or empty results
-
-
NLTK sentence chunking breaks on code snippets, tables, and mathematical notation
-
- No per-user retrieval personalisation — all users get identical results for identical queries (shared cache)
→ TECHNICAL.md — retrieval system design decisions, reranker tradeoffs, faithfulness evaluation methodology, vector store comparison, scaling architecture (Pinecone/GPU/ECS), and known failure modes.
MIT License · Utkarsh Batham
-
- Multi-LLM, multi-vector-store — Groq/OpenAI/Anthropic/local behind a
-
- SSE token streaming —
UTKARSH698/wikiqa-rag-system
Folders and files
| Name | Name | Last commit date | ||
|---|---|---|---|---|