A production-ready Retrieval-Augmented Generation (RAG) platform featuring local LLM inference, hybrid retrieval, multi-agent orchestration, semantic caching, and full observability.
- **Local LLM Inference** - Run models locally with vLLM (no API costs for development)
- **Cloud LLM Fallback** - Route complex queries to OpenRouter (Claude, GPT-4, etc.)
- **Hybrid Retrieval** - Dense + sparse vector search with Qdrant
- **Full Observability** - Langfuse tracing with session & user tracking
- **Semantic Caching** - Instant responses for similar queries
- **Multi-Format Ingestion** - PDF, DOCX, HTML, Markdown (+ OCR for images)
- **OpenAI-Compatible API** - Drop-in replacement for the OpenAI API
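Because the backend is OpenAI-compatible, any OpenAI-style client can talk to it. The sketch below only builds the request (URL and JSON body) so it runs without a live server; the base URL matches the backend port used in this project, while the model id is an assumption taken from the default vLLM model:

```python
def build_chat_request(base_url="http://localhost:5001/v1",
                       model="Qwen/Qwen2.5-0.5B-Instruct"):
    """Build the URL and JSON payload for an OpenAI-style chat completion."""
    url = f"{base_url}/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "What is in the knowledge base?"}],
        "stream": False,
    }
    return url, payload

url, payload = build_chat_request()
print(url)  # http://localhost:5001/v1/chat/completions
# Send with any HTTP client once the stack is up, e.g.:
#   import requests; requests.post(url, json=payload).json()
```

The same payload works against the official `openai` Python client by pointing its `base_url` at the backend instead of api.openai.com.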
```
┌──────────────────────────────────────────────────────────────────────────┐
│                                Open WebUI                                │
│                             (localhost:3000)                             │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                          RAG Backend (FastAPI)                           │
│                             (localhost:5001)                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │  Semantic   │  │    Query    │  │  Re-Ranker  │  │    Model    │      │
│  │    Cache    │  │  Rewriting  │  │ (Cross-Enc) │  │   Router    │      │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘      │
└──────────────────────────────────────────────────────────────────────────┘
          │                      │                       │
          ▼                      ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────────┐
│     Qdrant      │    │      vLLM       │    │   OpenRouter API    │
│   (Vector DB)   │    │   (Local LLM)   │    │  (Cloud Fallback)   │
│  localhost:6333 │    │  localhost:9999 │    │                     │
└─────────────────┘    └─────────────────┘    └─────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│                           Observability Stack                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │  Langfuse   │  │ ClickHouse  │  │    MinIO    │  │    Redis    │      │
│  │  (UI:3001)  │  │   (OLAP)    │  │    (S3)     │  │   (Queue)   │      │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘      │
└──────────────────────────────────────────────────────────────────────────┘
```
| Container | Image | Port | Purpose |
|---|---|---|---|
| `rag-open-webui` | `ghcr.io/open-webui/open-webui` | 3000 | Chat UI (like ChatGPT) |
| `rag-backend` | Custom (Dockerfile) | 5001 | FastAPI RAG orchestrator |
| `rag-vllm` | `vllm/vllm-openai` | 9999 | Local LLM inference |
| `rag-qdrant` | `qdrant/qdrant` | 6333 | Vector database |
| `rag-langfuse` | `langfuse/langfuse:3` | 3001 | Observability UI |
| `rag-langfuse-worker` | `langfuse/langfuse-worker:3` | 3030 | Trace processing |
| `rag-clickhouse` | `clickhouse/clickhouse-server` | 18123 | Trace storage (OLAP) |
| `rag-minio` | `minio/minio` | 9000/9001 | S3-compatible blob storage |
| `rag-redis` | `redis:7.2` | 6379 | Queue & cache |
| `rag-langfuse-db` | `postgres:16` | - | Langfuse metadata DB |
- Docker Desktop (with GPU support for vLLM)
- NVIDIA GPU (Recommended, 8GB+ VRAM)
- Git
```bash
git clone https://github.com/yourusername/Advanced-RAG.git
cd Advanced-RAG

# Copy environment template
cp .env.example .env

# Edit .env with your API keys (OpenRouter, etc.)

docker compose up -d
```

- Chat UI: http://localhost:3000 (Open WebUI)
- Langfuse Dashboard: http://localhost:3001
- API Docs: http://localhost:5001/docs

Default login:
- Email: admin@rag.local
- Password: ragadmin123
| Variable | Description | Default |
|---|---|---|
| `OPENROUTER_API_KEY` | API key for cloud LLM fallback | Required for cloud models |
| `LOCAL_MODEL_NAME` | Model to run with vLLM | `Qwen/Qwen2.5-0.5B-Instruct` |
| `ENABLE_OCR` | Enable OCR for image files (GPU intensive) | `false` |
| `LANGFUSE_DEBUG` | Enable Langfuse debug logging | `false` |
| `WEBUI_SECRET_KEY` | Secret for Open WebUI sessions | Set in compose |
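A minimal `.env` for local development might look like the following (the key is a placeholder; the values shown simply restate the defaults above):

```bash
# Required only if you route queries to cloud models
OPENROUTER_API_KEY=sk-or-your-key-here

# Small instruct model that fits in 8GB+ VRAM
LOCAL_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct

# OCR is GPU intensive; leave off unless you ingest images
ENABLE_OCR=false
LANGFUSE_DEBUG=false
```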
See .env.example for the full list.
```
Advanced-RAG/
├── src/
│   ├── main.py               # FastAPI app & endpoints
│   ├── config.py             # Model & provider configuration
│   ├── ingestion/            # Document processing pipeline
│   │   ├── router.py         # Ingestion orchestrator
│   │   ├── docling_parser.py # PDF/DOCX parser
│   │   ├── deepseek_ocr.py   # OCR for images (optional)
│   │   ├── metadata.py       # LLM-based metadata extraction
│   │   └── chunking.py       # Hierarchical chunking
│   ├── retrieval/            # Search & retrieval
│   │   ├── engine.py         # Query rewriting, HyDE
│   │   ├── qdrant_client.py  # Vector DB operations
│   │   └── reranker.py       # Cross-encoder reranking
│   ├── generation/           # Response generation
│   │   ├── agents.py         # Multi-agent orchestration
│   │   ├── router.py         # Model routing (local/cloud)
│   │   └── semantic_cache.py # Query caching
│   └── observability/        # Monitoring
│       └── config.py         # Langfuse setup
├── docker-compose.yml        # All services
├── Dockerfile                # RAG backend image
├── pyproject.toml            # Python dependencies
└── requirements.txt          # Pip dependencies
```
- File Detection → Route to Docling (PDF/DOCX) or OCR (images)
- Text Extraction → Preserve structure (tables, headers)
- Metadata Enrichment → LLM extracts department, date, summary
- Hierarchical Chunking → Parent (1024 tok) + Child (256 tok) chunks
- Vector Upsert → Dense + sparse embeddings to Qdrant
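The hierarchical chunking step above can be sketched as follows. This is a simplified illustration: it splits on whitespace tokens rather than the tokenizer the pipeline actually uses, and the function names are hypothetical:

```python
def split_tokens(tokens, size):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def hierarchical_chunks(text, parent_size=1024, child_size=256):
    """Build (parent, [children]) pairs.

    Child chunks are what gets embedded and searched; the enclosing
    parent chunk is what gets handed to the LLM for wider context.
    """
    tokens = text.split()  # crude stand-in for a real tokenizer
    result = []
    for parent in split_tokens(tokens, parent_size):
        children = [" ".join(c) for c in split_tokens(parent, child_size)]
        result.append((" ".join(parent), children))
    return result

chunks = hierarchical_chunks("word " * 2100)
print(len(chunks))        # 3 parents: 1024 + 1024 + 52 tokens
print(len(chunks[0][1]))  # 4 children of up to 256 tokens each
```

Retrieving small child chunks but generating from their parents is the usual trade-off: precise matching at search time, richer context at generation time.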
- Semantic Cache Check → Return cached answer if similarity > 0.95
- Query Rewriting → Expand ambiguous queries
- Hybrid Search → Dense (semantic) + Sparse (keyword) in Qdrant
- Re-ranking → Cross-encoder scores top 50 → keep top 5
- Model Routing → Simple → Local vLLM, Complex → OpenRouter
- Response Generation → Stream answer with context
- Cache Update → Store Q&A for future queries
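The semantic-cache check at the head of this pipeline can be sketched as below. The 0.95 cutoff mirrors the threshold stated above, but the in-memory store and the toy embedding in the usage example are hypothetical stand-ins for the real implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Tiny in-memory semantic cache keyed by query embeddings."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: str -> list[float]
        self.threshold = threshold
        self.entries = []           # list of (embedding, answer)

    def get(self, query):
        """Return a cached answer if a stored query is similar enough."""
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) > self.threshold:
            return best[1]
        return None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))

# Toy embedding (character frequencies) just to exercise the cache:
cache = SemanticCache(lambda s: [s.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"])
cache.put("what is qdrant", "A vector database.")
print(cache.get("what is qdrant"))  # A vector database.
```

A production cache would embed with the same model as retrieval and persist entries (e.g. in Redis), but the hit/miss logic is the same.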
Access the Langfuse dashboard at http://localhost:3001
- Traces - Full execution path for each request
- Sessions - Group traces by conversation (chat thread)
- Users - Track usage per user
- Costs - Token usage and cost breakdown
- Scores - User feedback (thumbs up/down)
Open WebUI automatically sends session headers when `ENABLE_OPENWEBUI_USER_HEADERS=true`:
- `X-OpenWebUI-Chat-Id` → Groups all messages in a conversation
- `X-OpenWebUI-User-Id` → Links traces to users
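On the backend, these headers can be read from the incoming request and attached to the trace as session and user attributes. The function below is a hypothetical sketch of that mapping, not the project's actual code:

```python
def trace_attrs_from_headers(headers):
    """Map Open WebUI headers to Langfuse session/user attributes.

    `headers` is any mapping of header name -> value, as a web
    framework would expose it; lookup is case-insensitive.
    """
    normalized = {k.lower(): v for k, v in headers.items()}
    return {
        "session_id": normalized.get("x-openwebui-chat-id"),
        "user_id": normalized.get("x-openwebui-user-id"),
    }

attrs = trace_attrs_from_headers({
    "X-OpenWebUI-Chat-Id": "chat-42",
    "X-OpenWebUI-User-Id": "alice",
})
print(attrs)  # {'session_id': 'chat-42', 'user_id': 'alice'}
```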
```bash
# Install dependencies
pip install poetry
poetry install

# Start backend
poetry run uvicorn src.main:app --reload --port 8000
```

Edit src/config.py to add new models:
```python
ModelConfig(
    id="your-model-id",
    name="Display Name",
    provider=Provider.OPENROUTER,  # or Provider.VLLM
    context_window=8192,
)
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- LlamaIndex - RAG framework
- vLLM - Fast LLM inference
- Qdrant - Vector database
- Langfuse - LLM observability
- Open WebUI - Chat interface
- Docling - Document parsing