A local-first visual document analysis workbench. Upload PDFs and images, ask questions, and get answers grounded in your documents — all running on your own hardware.
RAG Lab uses ColPali visual embeddings to understand document layout and content visually, paired with any Ollama-compatible LLM for inference. Run models locally or connect to Ollama Cloud for larger models.
- Visual RAG — ColPali embeddings capture page layout, tables, and figures natively without text extraction. Optional OCR mode extracts text from retrieved pages for LLMs that work better with text input.
- Hybrid retrieval — Combines ColPali visual search with BM25 keyword search. Visual embeddings find charts and layouts; keyword matching anchors on specific terms. Matching text snippets appear as expandable citations below each retrieved page.
- LLM reranking (opt-in) — Second-stage relevance scoring of the top-K retrieved pages by a small LLM. Improves precision for ambiguous queries. Toggle in Settings → Retrieval.
- HyDE query expansion (opt-in) — Generates a hypothetical answer passage and retrieves against it, bridging the lexical gap between questions and documents. Toggle in Settings → Retrieval.
- Adaptive retrieval — Score-slope analysis dynamically adjusts how many pages are retrieved per query.
- Batch processing — Process multiple documents with per-document streaming responses.
- Full-document summarization — Generic queries ("summarize") automatically process all pages in sequential chunks.
- Any Ollama model — Works with any local model, plus Ollama Cloud models via API key (Settings → Advanced). Vision, text, and reasoning models supported.
- Prompt templates — Create custom extraction templates (e.g., K-1 line item extractor) for structured data output.
- Conversation memory — Mem0 automatically extracts and recalls context across chat sessions (disabled during RAG to keep answers document-grounded).
- Multi-session — Create, switch, and manage independent analysis sessions. Chat history is paginated (50 messages at a time) with a "Load earlier" pill for long conversations.
- Stream retry — If an LLM stream is interrupted, the partial response stays visible and a one-click Retry button reruns the same request.
- Dark mode — Theme toggle persists via localStorage; respects `prefers-color-scheme` on first load.
- Keyboard shortcuts — `?` opens the overlay. `Esc` closes modals. `Ctrl+N` new session, `Ctrl+,` settings, `Ctrl+B` sidebar.
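
The hybrid and adaptive retrieval features listed above can be pictured with a small sketch. This is not RAG Lab's actual code: the README does not say how the visual and BM25 rankings are merged or how the score slope is evaluated. The sketch assumes reciprocal rank fusion (RRF) for merging and a largest-drop cutoff for choosing how many pages to keep. All names and parameters (`rrf_fuse`, `adaptive_cutoff`, `k`, `min_pages`, `max_pages`) are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge several ranked lists of page IDs with reciprocal rank fusion.

    Each page scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is the conventional default, not a
    value taken from RAG Lab.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by page ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def adaptive_cutoff(scores: list[float], min_pages: int = 1, max_pages: int = 10) -> int:
    """Pick how many results to keep by cutting at the steepest score drop.

    `scores` must be sorted in descending order. The cut is placed just
    before the largest gap between neighbours, within [min_pages, max_pages].
    If the scores are flat, everything up to max_pages is kept.
    """
    limit = min(len(scores), max_pages)
    if limit <= min_pages:
        return limit
    best_cut, best_drop = limit, 0.0
    for i in range(min_pages, limit):
        drop = scores[i - 1] - scores[i]
        if drop > best_drop:
            best_cut, best_drop = i, drop
    return best_cut


# Hypothetical page IDs ranked by each retriever, best first.
visual = ["p3", "p1", "p7"]
keyword = ["p1", "p9", "p3"]

fused = rrf_fuse([visual, keyword])
keep = adaptive_cutoff([score for _, score in fused])
print([page for page, _ in fused[:keep]])
```

Pages found by both retrievers ("p1" and "p3") accumulate score from each list, so they rise above pages found by only one, and the cutoff then lands at the gap between the two groups.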
Defaults are safe; most integrations are opt-in, so a local install sends nothing to external services.
- Auth — FastAPI-Users with username + argon2, JWT cookie (`SameSite=lax`, `HttpOnly`). Password policy: 8–128 chars, one letter + one digit. Login is constant-time to defeat account enumeration.
- Rate limiting — Per-IP limits on `/auth/login`, `/auth/register`, and `/documents/upload` (env-overridable).
- CSRF protection — Origin-header validation on cookie-authenticated state-changing requests.
- Security headers — `X-Frame-Options: DENY`, `X-Content-Type-Options: nosniff`, `Referrer-Policy`, `Permissions-Policy` always on. Content-Security-Policy opt-in via `CSP_ENABLE=true`.
- Path traversal + upload guards — Filenames validated against an extension allowlist, resolved paths must stay within the session's upload dir, size-capped at `MAX_UPLOAD_SIZE`.
- No pickle — Caches use numpy/JSON; inter-process IPC uses `torch.save` with `torch.load(weights_only=True)` + base64-in-JSON. Pickle deserialization is an RCE vector and this project refuses to rely on it.
- Log redaction — `JWT_SECRET`, `OLLAMA_API_KEY`, `Authorization: Bearer …`, `hf_…`, `sk-…` are scrubbed from every log line before it reaches a handler.
- Structured logs (opt-in) — `LOG_FORMAT=json` switches console + file output to one JSON record per line for Loki/Elastic/Datadog.
- Error tracking (opt-in) — `SENTRY_DSN` activates Sentry on both backend (`sentry-sdk[fastapi]`) and frontend (`@sentry/sveltekit`).
- Metrics — Prometheus scrape at `/metrics` (admin-only) with HTTP volume + latency, cache hit rates, LLM inference + retrieval durations.
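
Two of the guards above, upload path validation and log redaction, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: the extension allowlist, the size cap value, the regular expressions, and every name here (`safe_upload_path`, `redact`, `RedactingFilter`) are hypothetical. The redaction part covers only the token-shaped patterns, not the `JWT_SECRET` or `OLLAMA_API_KEY` values.

```python
import logging
import re
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}  # assumed allowlist
MAX_UPLOAD_SIZE = 50 * 1024 * 1024  # assumed cap; the real value is configurable


def safe_upload_path(upload_dir: Path, filename: str, size: int) -> Path:
    """Return the destination path for an upload, or raise ValueError."""
    if size > MAX_UPLOAD_SIZE:
        raise ValueError("file too large")
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError("extension not allowed")
    base = upload_dir.resolve()
    target = (base / filename).resolve()
    # Reject anything that resolves outside the upload directory,
    # such as "../evil.pdf" or an absolute path.
    if not target.is_relative_to(base):
        raise ValueError("path escapes upload directory")
    return target


# Assumed token shapes: bearer credentials and hf_/sk- prefixed keys.
BEARER_RE = re.compile(r"(Authorization:\s*Bearer\s+)\S+", re.IGNORECASE)
TOKEN_RE = re.compile(r"\b(?:hf_|sk-)[A-Za-z0-9_-]{8,}")


def redact(text: str) -> str:
    """Replace secret-looking substrings with a fixed marker."""
    text = BEARER_RE.sub(r"\1[REDACTED]", text)
    return TOKEN_RE.sub("[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Scrub the fully formatted message before a handler emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # message is already formatted
        return True


# Attach the filter to the handler so every record passing through is scrubbed.
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
```

Resolving both the base directory and the candidate path before comparing them is what defeats `..` segments and absolute filenames; a plain string prefix check would not.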
- Python 3.10+ with CUDA-capable GPU
- Ollama installed and running
- Node.js 18+ for the frontend
git clone https://github.com/inkind79/rag-lab.git
cd rag-lab
# Python environment
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Frontend
cd frontend
npm install
cd ..

# Any Ollama model for chat — vision models recommended for document analysis
ollama pull <your-preferred-model> # e.g., gemma4, qwen3-vl, llama3.2-vision, phi4
# Required by Mem0 for conversation memory
ollama pull nomic-embed-text # embeddings
ollama pull gemma3:4b # memory extraction (small text model)

RAG Lab auto-detects your installed Ollama models on startup — no configuration needed.
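
To check which models a running Ollama server reports, one option is Ollama's model-listing endpoint, `GET /api/tags`. The sketch below is a minimal illustration and may differ from how RAG Lab performs its own detection; the function names are hypothetical, and the default URL assumes Ollama is listening on its standard local port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return sorted(model["name"] for model in payload.get("models", []))


def list_installed_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Ask a running Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))


# Offline demonstration with a hand-written payload in the /api/tags shape.
sample = {"models": [{"name": "nomic-embed-text:latest"}, {"name": "gemma3:4b"}]}
print(parse_model_names(sample))  # ['gemma3:4b', 'nomic-embed-text:latest']
```

Calling `list_installed_models()` requires the Ollama server to be running; the parsing step is split out so it can be exercised without a network connection.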
cp .env.example .env
# Optionally edit .env to set a persistent JWT_SECRET (a random one is generated if not set)

# Terminal 1: Backend
uvicorn fastapi_app:app --host 127.0.0.1 --port 8000
# Terminal 2: Frontend
cd frontend && npm run dev

Open http://localhost:5173 — follow the "Create one" link on the login page to register, then sign in.
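
If you want sessions to survive backend restarts, the optional persistent `JWT_SECRET` needs a long random value. One way to produce it, using only the standard library, is sketched here. The helper name and the 48-byte length are assumptions, not project requirements.

```python
import secrets


def make_jwt_secret(num_bytes: int = 48) -> str:
    """Return a URL-safe random string suitable for a signing secret."""
    return secrets.token_urlsafe(num_bytes)


# Prints a line that can be pasted into .env
print(f"JWT_SECRET={make_jwt_secret()}")
```

`secrets` draws from the operating system's cryptographic random source, which makes it a better fit for signing keys than the `random` module.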
rag-lab/
├── fastapi_app.py           # Application entry point
├── src/
│   ├── api/                 # FastAPI routes (auth, chat, sessions, documents)
│   ├── models/              # ColPali adapter, Ollama/HF handlers, LanceDB, RAG retriever
│   ├── services/            # Document processor, response generator, batch processor
│   └── utils/               # Memory management, model configs, logging
├── frontend/
│   └── src/
│       ├── routes/          # SvelteKit pages (+page.svelte, +layout.svelte)
│       └── lib/
│           ├── components/  # Markdown, DocumentPanel, TemplatePanel, Settings
│           ├── stores/      # Svelte stores (chat, session, toast)
│           └── api/         # API client (streamChat, sessions, documents)
└── config/                  # Model configs, global settings
| Layer | Technology |
|---|---|
| Frontend | SvelteKit 2, Svelte 5, TypeScript |
| Backend | FastAPI, Python 3.10+ |
| Embeddings | ColQwen3.5 (ColPali visual embeddings) |
| Vector store | LanceDB (multi-vector, local) |
| LLM inference | Ollama (any local model — vision, text, or reasoning) |
| Memory | Mem0 (automatic, local Ollama + ChromaDB) |
| Auth | FastAPI-Users (SQLite, registration, argon2 hashing, JWT cookies) |
- GPU: NVIDIA with 8GB+ VRAM (16GB+ recommended for larger models)
- RAM: 16GB+
- Storage: 20GB+ (models + dependencies)
- OS: Linux / WSL2 (requires NVIDIA CUDA support)
# Run tests (CI-minimal install — no torch/colpali/lancedb needed)
pip install -r requirements-test.txt
pytest # ~5s full suite
# Lint
ruff check .
cd frontend && npm run check # svelte-check
# Regenerate TS types after changing a Pydantic request/response model
python scripts/gen_types.py
# Benchmark retrieval against a golden set
python -m src.eval.cli --golden tests/fixtures/eval/sample_golden_set.json
# Browser E2E (requires backend + Vite running)
cd frontend && npm run test:e2e # Playwright smoke suite

Continuous integration runs test / lint / security (gitleaks + bandit) on every PR — see .github/workflows/.
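
For readers unfamiliar with golden-set benchmarking of retrieval, as in the benchmark command above, the usual metrics are easy to sketch. The code below does not reflect `src.eval.cli` or the format of `sample_golden_set.json`, neither of which is described here. It assumes each golden entry pairs a query with a set of relevant page IDs, and all names and sample IDs are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant pages that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, page_id in enumerate(retrieved, start=1):
        if page_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical retriever output and golden labels for one query.
retrieved = ["doc1:p4", "doc1:p2", "doc2:p1"]
relevant = {"doc1:p2", "doc1:p9"}

print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(reciprocal_rank(retrieved, relevant))   # 0.5
```

Averaging these values over every query in the golden set gives recall@k and mean reciprocal rank, which can be compared before and after a retrieval change.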
- ARCHITECTURE.md — subsystem-by-subsystem tour (auth, sessions, retrieval pipeline, observability) with file pointers.
- CONTRIBUTING.md — dev setup, security expectations for contributors.
- .env.example — every env var grouped by concern.
Contributions are welcome. Please open an issue first to discuss what you'd like to change.
- Fork the repo
- Create a feature branch (`git checkout -b feature/my-feature`)
- Commit your changes
- Push to the branch and open a Pull Request
MIT — see LICENSE for details.

