🚀 Advanced RAG System

A production-ready Retrieval-Augmented Generation (RAG) platform featuring local LLM inference, hybrid retrieval, multi-agent orchestration, semantic caching, and full observability.



✨ Features

  • 🤖 Local LLM Inference - Run models locally with vLLM (no API costs for development)
  • 🌐 Cloud LLM Fallback - Route complex queries to OpenRouter (Claude, GPT-4, etc.)
  • 🔍 Hybrid Retrieval - Dense + Sparse vector search with Qdrant
  • 📊 Full Observability - Langfuse tracing with session & user tracking
  • 💾 Semantic Caching - Instant responses for similar queries
  • 📄 Multi-Format Ingestion - PDF, DOCX, HTML, Markdown (+ OCR for images)
  • 🎯 OpenAI-Compatible API - Drop-in replacement for the OpenAI API (see the client sketch below)
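
Because the backend speaks the OpenAI API, any standard OpenAI client can point at it. A minimal sketch with the official openai Python package (the /v1 base path, placeholder key, and model id are assumptions; adjust them to your deployment):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5001/v1",  # RAG backend (see Docker Services below)
    api_key="not-needed-locally",         # placeholder; the local backend needs no key
)

response = client.chat.completions.create(
    model="your-model-id",
    messages=[{"role": "user", "content": "Summarize the onboarding document."}],
)
print(response.choices[0].message.content)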

πŸ—οΈ Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                              Open WebUI                                │
│                           (localhost:3000)                             │
└────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌────────────────────────────────────────────────────────────────────────┐
│                         RAG Backend (FastAPI)                          │
│                           (localhost:5001)                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │
│  │  Semantic   │  │   Query     │  │  Re-Ranker  │  │   Model     │    │
│  │   Cache     │  │  Rewriting  │  │  (Cross-Enc)│  │   Router    │    │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘    │
└────────────────────────────────────────────────────────────────────────┘
           │                   │                              │
           ▼                   ▼                              ▼
┌─────────────────┐  ┌─────────────────┐           ┌─────────────────────┐
│     Qdrant      │  │      vLLM       │           │    OpenRouter API   │
│  (Vector DB)    │  │  (Local LLM)    │           │   (Cloud Fallback)  │
│  localhost:6333 │  │  localhost:9999 │           │                     │
└─────────────────┘  └─────────────────┘           └─────────────────────┘

┌────────────────────────────────────────────────────────────────────────┐
│                         Observability Stack                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │
│  │  Langfuse   │  │ ClickHouse  │  │    MinIO    │  │    Redis    │    │
│  │  (UI:3001)  │  │   (OLAP)    │  │    (S3)     │  │   (Queue)   │    │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘    │
└────────────────────────────────────────────────────────────────────────┘

🐳 Docker Services

Container            Image                           Port       Purpose
rag-open-webui       ghcr.io/open-webui/open-webui   3000       Chat UI (like ChatGPT)
rag-backend          Custom (Dockerfile)             5001       FastAPI RAG orchestrator
rag-vllm             vllm/vllm-openai                9999       Local LLM inference
rag-qdrant           qdrant/qdrant                   6333       Vector database
rag-langfuse         langfuse/langfuse:3             3001       Observability UI
rag-langfuse-worker  langfuse/langfuse-worker:3      3030       Trace processing
rag-clickhouse       clickhouse/clickhouse-server    18123      Trace storage (OLAP)
rag-minio            minio/minio                     9000/9001  S3-compatible blob storage
rag-redis            redis:7.2                       6379       Queue & cache
rag-langfuse-db      postgres:16                     -          Langfuse metadata DB

🚀 Quick Start

Prerequisites

  • Docker Desktop (with GPU support for vLLM)
  • NVIDIA GPU (Recommended, 8GB+ VRAM)
  • Git

1. Clone & Configure

git clone https://github.com/MERakram/Advanced-RAG-monorepo.git
cd Advanced-RAG-monorepo

# Copy environment template
cp .env.example .env
# Edit .env with your API keys (OpenRouter, etc.)

2. Start All Services

docker compose up -d

3. Access the UI

  • Open WebUI (chat): http://localhost:3000
  • Langfuse (observability): http://localhost:3001
  • RAG backend API: http://localhost:5001

Default Langfuse Credentials

  • Email: admin@rag.local
  • Password: ragadmin123

βš™οΈ Environment Variables

Variable            Description                                  Default
OPENROUTER_API_KEY  API key for cloud LLM fallback               Required for cloud models
LOCAL_MODEL_NAME    Model to run with vLLM                       Qwen/Qwen2.5-0.5B-Instruct
ENABLE_OCR          Enable OCR for image files (GPU intensive)   false
LANGFUSE_DEBUG      Enable Langfuse debug logging                false
WEBUI_SECRET_KEY    Secret for Open WebUI sessions               Set in compose

See .env.example for the full list.


πŸ“ Project Structure

Advanced-RAG/
├── src/
│   ├── main.py                 # FastAPI app & endpoints
│   ├── config.py               # Model & provider configuration
│   ├── ingestion/              # Document processing pipeline
│   │   ├── router.py           # Ingestion orchestrator
│   │   ├── docling_parser.py   # PDF/DOCX parser
│   │   ├── deepseek_ocr.py     # OCR for images (optional)
│   │   ├── metadata.py         # LLM-based metadata extraction
│   │   └── chunking.py         # Hierarchical chunking
│   ├── retrieval/              # Search & retrieval
│   │   ├── engine.py           # Query rewriting, HyDE
│   │   ├── qdrant_client.py    # Vector DB operations
│   │   └── reranker.py         # Cross-encoder reranking
│   ├── generation/             # Response generation
│   │   ├── agents.py           # Multi-agent orchestration
│   │   ├── router.py           # Model routing (local/cloud)
│   │   └── semantic_cache.py   # Query caching
│   └── observability/          # Monitoring
│       └── config.py           # Langfuse setup
├── docker-compose.yml          # All services
├── Dockerfile                  # RAG backend image
├── pyproject.toml              # Python dependencies
└── requirements.txt            # Pip dependencies

🔄 How It Works

Ingestion Pipeline (Upload a Document)

  1. File Detection → Route to Docling (PDF/DOCX) or OCR (images)
  2. Text Extraction → Preserve structure (tables, headers)
  3. Metadata Enrichment → LLM extracts department, date, summary
  4. Hierarchical Chunking → Parent (1024 tok) + Child (256 tok) chunks (sketched below)
  5. Vector Upsert → Dense + Sparse embeddings to Qdrant
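
A minimal sketch of step 4's parent/child split, approximating token counts with whitespace-separated words for brevity (the actual implementation is in src/ingestion/chunking.py and may tokenize differently):

def chunk_hierarchically(text: str, parent_size: int = 1024, child_size: int = 256) -> list[dict]:
    words = text.split()
    chunks = []
    for p_start in range(0, len(words), parent_size):
        parent = words[p_start : p_start + parent_size]
        parent_id = f"parent-{p_start // parent_size}"
        # Children are indexed for search; each keeps a pointer to its parent
        # so retrieval can match on small chunks but return wider context
        for c_start in range(0, len(parent), child_size):
            chunks.append({
                "parent_id": parent_id,
                "child_text": " ".join(parent[c_start : c_start + child_size]),
            })
    return chunks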

Query Pipeline (Ask a Question)

  1. Semantic Cache Check → Return cached answer if similarity > 0.95 (see the sketches after this list)
  2. Query Rewriting → Expand ambiguous queries
  3. Hybrid Search → Dense (semantic) + Sparse (keyword) in Qdrant
  4. Re-ranking → Cross-encoder scores top 50 → keep top 5
  5. Model Routing → Simple → Local vLLM, Complex → OpenRouter
  6. Response Generation → Stream answer with context
  7. Cache Update → Store Q&A for future queries
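
To make the less obvious steps concrete, here are two minimal sketches. First, the cache check (step 1): embed the incoming query, compare it against cached query embeddings, and short-circuit on a close match. This assumes unit-normalized embeddings, so a dot product is cosine similarity; the real implementation lives in src/generation/semantic_cache.py and may differ:

import numpy as np

# (query embedding, cached answer) pairs; a real cache would use a persistent store
CACHE: list[tuple[np.ndarray, str]] = []

def check_cache(query_vec: np.ndarray, threshold: float = 0.95) -> str | None:
    for cached_vec, answer in CACHE:
        # Dot product equals cosine similarity for unit-norm vectors
        if float(np.dot(query_vec, cached_vec)) > threshold:
            return answer  # cache hit: skip retrieval and generation entirely
    return None

Second, hybrid search plus re-ranking (steps 3-4), sketched with the qdrant-client and sentence-transformers libraries. The collection name, vector names, payload field, and cross-encoder model are illustrative assumptions, not the project's actual configuration (see src/retrieval/):

from qdrant_client import QdrantClient, models
from sentence_transformers import CrossEncoder

client = QdrantClient("http://localhost:6333")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, dense_vec: list[float], sparse_vec: models.SparseVector) -> list[str]:
    # Hybrid search: fuse dense (semantic) and sparse (keyword) hits with
    # reciprocal rank fusion, keeping the top 50 candidates
    hits = client.query_points(
        collection_name="docs",
        prefetch=[
            models.Prefetch(query=dense_vec, using="dense", limit=50),
            models.Prefetch(query=sparse_vec, using="sparse", limit=50),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=50,
    ).points

    # Cross-encoder re-ranking: score each (query, chunk) pair, keep the top 5
    texts = [hit.payload["text"] for hit in hits]
    scores = reranker.predict([(query, text) for text in texts])
    ranked = sorted(zip(scores, texts), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:5]]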

📊 Observability (Langfuse)

Access the Langfuse dashboard at http://localhost:3001

Features

  • Traces - Full execution path for each request
  • Sessions - Group traces by conversation (chat thread)
  • Users - Track usage per user
  • Costs - Token usage and cost breakdown
  • Scores - User feedback (thumbs up/down)

Session Tracking

Open WebUI automatically sends session headers when ENABLE_OPENWEBUI_USER_HEADERS=true:

  • X-OpenWebUI-Chat-Id → Groups all messages in a conversation
  • X-OpenWebUI-User-Id → Links traces to users
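
On the backend, picking these headers up and attaching them to a trace might look like the following minimal sketch (FastAPI header extraction plus the v2-style Langfuse Python SDK; the project's actual wiring lives in src/observability/config.py and may differ):

from fastapi import FastAPI, Header
from langfuse import Langfuse

app = FastAPI()
langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment

@app.post("/v1/chat/completions")
async def chat(
    body: dict,
    chat_id: str | None = Header(None, alias="X-OpenWebUI-Chat-Id"),
    user_id: str | None = Header(None, alias="X-OpenWebUI-User-Id"),
):
    # The chat id becomes the Langfuse session, grouping every turn of a
    # conversation together; the user id links traces to a user
    trace = langfuse.trace(name="chat", session_id=chat_id, user_id=user_id)
    ...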

πŸ› οΈ Development

Running Locally (without Docker)

# Install dependencies
pip install poetry
poetry install

# Start backend
poetry run uvicorn src.main:app --reload --port 8000

Adding New Models

Edit src/config.py to add new models:

ModelConfig(
    id="your-model-id",
    name="Display Name",
    provider=Provider.OPENROUTER,  # or Provider.VLLM
    context_window=8192,
)

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments
