A full-featured Retrieval-Augmented Generation system built with FastAPI, Milvus, and LM Studio. Implements advanced RAG techniques including Hybrid Search, HyDE, Cross-Encoder Reranking, Sub-Query Decomposition, Prompt Optimization, and Self-RAG.
**Learning Project** — This project is built purely for learning purposes. The goal is to mimic production-level architecture and components to understand how RAG systems are implemented behind the scenes in real-world applications — from the layered backend architecture (API → Controller → Services) to the vector search pipeline, cross-encoder reranking, and self-reflection loops.
**Local LLM** — The LLM runs entirely locally via LM Studio on a machine with 128GB RAM, so there are no external API calls or cloud dependencies for inference.
## Demo

Streaming chat with RAG responses and source attribution:

demo.mp4
- Demo
- Architecture Overview
- RAG Pipeline — Step by Step
- Folder Structure
- Tech Stack
- Getting Started
- API Reference
- Configuration
- Understanding the Code — Jupyter Notebook
## Architecture Overview

```
┌─────────────┐      ┌──────────────────────────────────────────┐      ┌───────────┐
│   Client    │─────▶│           FastAPI Application            │─────▶│  Milvus   │
│ (Swagger UI)│      │  ┌────────────────────────────────────┐  │      │ Vector DB │
└─────────────┘      │  │           RAG Controller           │  │      └───────────┘
                     │  │                                    │  │      ┌───────────┐
                     │  │  Query ──▶ HyDE ──▶ Hybrid Search  │  │─────▶│   Redis   │
                     │  │                 │                  │  │      │ (Celery)  │
                     │  │                 ▼                  │  │      └───────────┘
                     │  │  Rerank ──▶ Optimize ──▶ LLM Gen   │  │      ┌───────────┐
                     │  │                 │                  │  │─────▶│ LM Studio │
                     │  │           Self-RAG Loop            │  │      │   (LLM)   │
                     │  └────────────────────────────────────┘  │      └───────────┘
                     └──────────────────────────────────────────┘
```
| Layer | Directory | Responsibility |
|---|---|---|
| API | `src/api/` | HTTP endpoints, request/response handling, JWT auth |
| Controller | `src/controllers/` | Business logic orchestration, RAG pipeline |
| Services | `src/services/` | Domain logic — ingestion, retrieval, generation |
| Infrastructure | `src/vector_db/` | Database clients, schema management |
| Core | `src/core/` | Config, security, shared utilities |
| Models | `src/models/` | Pydantic schemas, data models |
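To make the layering concrete, here is a deliberately simplified, hypothetical sketch of how a request travels API → Controller (the real route and controller live in `src/api/v1/query.py` and `src/controllers/rag_controller.py`; names below are illustrative):

```python
from fastapi import APIRouter, Depends
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1")

class QueryRequest(BaseModel):
    query: str
    top_k: int = 5

class RAGController:
    """Orchestrates services: HyDE -> hybrid search -> rerank -> generate."""
    def answer(self, query: str, top_k: int) -> dict:
        return {"answer": "...", "sources": []}   # stub

def get_controller() -> RAGController:            # dependency-injection boundary
    return RAGController()

@router.post("/query")
def query_endpoint(req: QueryRequest,
                   controller: RAGController = Depends(get_controller)):
    # The route only validates I/O; all business logic stays in the controller.
    return controller.answer(req.query, req.top_k)
```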
## RAG Pipeline — Step by Step

When a user sends a query, the following happens inside the `RAGController`:
```
User Query
    │
    ▼
┌─────────────────────────────────────────────┐
│  Step 1: QUERY ENHANCEMENT (HyDE)           │
│                                             │
│  • LLM generates a "hypothetical answer"    │
│  • That answer is embedded into a vector    │
│  • This vector is used for retrieval        │
│  • Why? The hypothetical doc is closer in   │
│    embedding space to real answers than the │
│    original question                        │
└─────────────────────────────────────────────┘
```
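In code, HyDE boils down to one extra LLM call before retrieval. A minimal sketch, assuming the `openai` client pointed at LM Studio and a `sentence-transformers` embedder (the prompt wording and function name are illustrative, not the project's actual API):

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer

llm = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="lm-studio")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_vector(query: str):
    # 1. Ask the LLM for a plausible (possibly imperfect) answer passage.
    resp = llm.chat.completions.create(
        model="local-model",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that answers: {query}",
        }],
    )
    hypothetical_doc = resp.choices[0].message.content
    # 2. Embed the hypothetical answer instead of the raw question;
    #    it usually lies closer to real answer chunks in embedding space.
    return embedder.encode(hypothetical_doc, normalize_embeddings=True)
```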
```
┌─────────────────────────────────────────────┐
│  Step 2: HYBRID SEARCH                      │
│                                             │
│  Two parallel search strategies:            │
│                                             │
│  Dense Search (Milvus):                     │
│  • COSINE similarity on HNSW index          │
│  • Captures semantic meaning                │
│                                             │
│  Sparse Search (BM25):                      │
│  • Keyword-based term matching              │
│  • Captures exact keyword relevance         │
│                                             │
│  Fusion:                                    │
│  • Reciprocal Rank Fusion (RRF)             │
│  • Merges both ranked lists                 │
│  • RRF(d) = Σ 1/(k + rank(d))               │
└─────────────────────────────────────────────┘
```
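The fusion formula translates directly into a few lines of Python. A sketch using the conventional constant k = 60 (the project's actual value may differ):

```python
def rrf_fuse(dense_ids: list, sparse_ids: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion: RRF(d) = Σ 1 / (k + rank(d)) over both lists."""
    scores = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked well by BOTH dense and sparse search rises to the top:
print(rrf_fuse(["a", "b", "c"], ["a", "c", "d"]))  # ['a', 'c', 'b', 'd']
```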
```
┌─────────────────────────────────────────────┐
│  Step 3: RERANKING                          │
│                                             │
│  • Cross-Encoder: ms-marco-MiniLM-L-6-v2    │
│  • Scores each (query, passage) pair        │
│  • Much more accurate than a bi-encoder     │
│  • Re-orders by true relevance score        │
│  • Selects top-K most relevant chunks       │
└─────────────────────────────────────────────┘
```
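A sketch of the reranking step with `sentence-transformers` (the model name matches the tech stack table; the helper function shape is illustrative):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list, top_k: int = 5) -> list:
    # The cross-encoder reads query and passage *together*, so it can model
    # token-level interactions that separate bi-encoder embeddings cannot.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]
```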
```
┌─────────────────────────────────────────────┐
│  Step 4: PROMPT OPTIMIZATION                │
│                                             │
│  Lost-in-the-Middle Reordering:             │
│  • LLMs pay more attention to start/end     │
│  • Best chunks → positions 1 and N          │
│  • Weaker chunks → middle positions         │
│                                             │
│  Prompt Compression:                        │
│  • Removes filler phrases                   │
│  • Normalizes whitespace                    │
│  • Trims to token budget                    │
└─────────────────────────────────────────────┘
```
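One straightforward way to implement the reordering, shown as a sketch (the project's `prompt_optimizer.py` may differ in detail): take chunks sorted best-first and alternate them onto the front and back of the context.

```python
def reorder_lost_in_the_middle(chunks_best_first: list) -> list:
    """Place the strongest chunks at the start and end of the context,
    pushing the weakest into the middle, where LLM attention is lowest."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Relevance order 1..5 becomes [1, 3, 5, 4, 2]: best chunks at both ends.
print(reorder_lost_in_the_middle(["1", "2", "3", "4", "5"]))
```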
```
┌─────────────────────────────────────────────┐
│  Step 5: LLM GENERATION                     │
│                                             │
│  • Context + query → LLM (LM Studio)        │
│  • System prompt enforces grounding         │
│  • "Only answer from the provided context"  │
│  • Supports streaming (SSE)                 │
└─────────────────────────────────────────────┘
```
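Because LM Studio exposes an OpenAI-compatible API, generation is a plain chat-completions call. A minimal streaming sketch (the system prompt wording here is illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="lm-studio")

SYSTEM = "Answer ONLY from the provided context. If the answer is not there, say so."

def generate(context: str, query: str):
    stream = client.chat.completions.create(
        model="local-model",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        stream=True,  # tokens arrive incrementally; the API relays them as SSE
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```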
```
┌─────────────────────────────────────────────┐
│  Step 6: SELF-RAG (Reflection)              │
│                                             │
│  • LLM evaluates its own answer             │
│  • Checks: Is it grounded? Complete?        │
│  • If confidence < threshold:               │
│      → Generates a refined query            │
│      → Re-retrieves with new query          │
│      → Merges new context + regenerates     │
│  • Maximum 1 retry to avoid loops           │
└─────────────────────────────────────────────┘
    │
    ▼
Final Answer + Sources + Metadata
```
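The reflection step reduces to a small piece of control flow. This is a sketch: `retrieve`, `generate`, `evaluate`, and `refine_query` are stand-ins for the pipeline stages above, and the 0.7 threshold is illustrative:

```python
def answer_with_self_rag(query, retrieve, generate, evaluate, refine_query,
                         threshold: float = 0.7):
    context = retrieve(query)                      # Steps 1-4
    answer = generate(context, query)              # Step 5
    confidence = evaluate(answer, context, query)  # LLM grades its own answer
    if confidence < threshold:
        new_query = refine_query(query, answer)    # sharper follow-up query
        context = context + retrieve(new_query)    # merge old + new evidence
        answer = generate(context, query)          # exactly one retry, no loops
    return answer
```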
When a document is uploaded:

```
Upload File (PDF/TXT/MD)
        │
        ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────┐
│  Load File   │────▶│   Chunking   │────▶│  Embedding   │────▶│  Milvus  │
│   (PyPDF2)   │     │  (512 chars, │     │  (MiniLM-L6, │     │  Insert  │
│              │     │ 128 overlap) │     │   384-dim)   │     │          │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────┘
        │                                        │
   Text cleaning                          L2-normalized
   (whitespace,                           dense vectors
   special chars)                         (cosine ready)
```
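The chunk → embed hop in miniature, as a sketch using the splitter and model named above (`langchain_text_splitters` is assumed as the splitter's package; the project's `chunker.py`/`embedder.py` wrap this differently):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128)
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim output

def prepare(text: str):
    chunks = splitter.split_text(text)
    # normalize_embeddings=True L2-normalizes, so inner product == cosine.
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    return list(zip(chunks, vectors))  # ready to insert into Milvus
```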
## Folder Structure

```
RAG-LLM/
│
├── src/                          # Main application source
│   ├── main.py                   # FastAPI app factory, CORS, rate limiting, lifespan
│   ├── rag_utils.py              # Legacy RAG utilities (preserved for reference)
│   │
│   ├── core/                     # Cross-cutting concerns
│   │   ├── config.py             # Pydantic Settings — all env vars centralized
│   │   └── security.py           # JWT auth, password hashing, RBAC, input sanitization
│   │
│   ├── models/                   # Data models
│   │   ├── schemas.py            # Pydantic request/response schemas for all endpoints
│   │   └── database.py           # In-memory user store (replace with DB in production)
│   │
│   ├── api/                      # API layer
│   │   ├── dependencies.py       # FastAPI dependency injection (user store, vector client)
│   │   └── v1/                   # Versioned API routes
│   │       ├── auth.py           # POST /register, POST /login, GET /me, GET /users
│   │       ├── ingest.py         # POST /ingest, GET /ingest, DELETE /ingest/{doc_id}
│   │       └── query.py          # POST /query (JSON + SSE streaming)
│   │
│   ├── controllers/              # Business logic orchestration
│   │   └── rag_controller.py     # Full RAG pipeline: HyDE → Search → Rerank → LLM → Self-RAG
│   │
│   ├── services/                 # Domain services (core RAG logic lives here)
│   │   ├── ingestion/            # Document processing pipeline
│   │   │   ├── chunker.py        # Text loading (PDF/TXT) + RecursiveCharacterTextSplitter
│   │   │   ├── embedder.py       # Sentence-transformer embeddings (all-MiniLM-L6-v2)
│   │   │   └── indexer.py        # Orchestrates: load → chunk → embed → store in Milvus
│   │   │
│   │   ├── retrieval/            # Search & retrieval strategies
│   │   │   ├── hybrid_search.py  # Dense (Milvus) + Sparse (BM25) with RRF fusion
│   │   │   ├── reranker.py       # Cross-encoder reranking (ms-marco-MiniLM-L-6-v2)
│   │   │   ├── hyde.py           # HyDE: hypothetical document embedding for better recall
│   │   │   └── sub_query.py      # Decomposes complex queries into 2-4 sub-queries
│   │   │
│   │   └── generator/            # LLM interaction & response generation
│   │       ├── llm_client.py     # Multi-provider LLM client (LM Studio/OpenAI/Ollama)
│   │       ├── prompt_optimizer.py # Lost-in-the-middle reordering + compression
│   │       └── self_rag.py       # Self-reflection loop with confidence evaluation
│   │
│   ├── vector_db/                # Vector database layer
│   │   ├── client.py             # Milvus client — CRUD, hybrid search, schema migration
│   │   └── schema.py             # Collection schema constants and index configs
│   │
│   └── utils/                    # Shared utilities
│       ├── logger/logger.py      # Centralized logging config
│       └── embedding/generic.py  # Advanced TextTransformer (multiple chunking strategies)
│
├── workers/                      # Async task processing
│   ├── celery_app.py             # Celery configuration (Redis broker)
│   └── ingestion_worker.py       # Background document ingestion task
│
├── tests/                        # Test suite
│   ├── test_api.py               # Integration tests (12 tests covering all endpoints)
│   └── sample.txt                # Sample document for testing
│
├── notebooks/                    # Jupyter notebooks
│   ├── rag_explained.ipynb       # Interactive walkthrough of the RAG pipeline
│   ├── embedding.ipynb           # Embedding experiments
│   └── vectordb.ipynb            # Vector database experiments
│
├── frontend/                     # Frontend app (React + Vite + shadcn/ui)
├── data/                         # Data directories
├── docker-compose.yml            # Full stack: Milvus + Redis + API + Worker
├── Dockerfile                    # Multi-stage Docker build
├── requirements.txt              # Python dependencies
└── .env                          # Environment variables
```
## Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| API Framework | FastAPI | Async HTTP server with auto-generated OpenAPI docs |
| Vector Database | Milvus | Dense vector storage with HNSW/COSINE indexing |
| Embedding Model | all-MiniLM-L6-v2 | 384-dim sentence embeddings |
| Reranker Model | ms-marco-MiniLM-L-6-v2 | Cross-encoder for passage reranking |
| LLM Provider | LM Studio / OpenAI / Ollama | Text generation via OpenAI-compatible API |
| Task Queue | Celery + Redis | Async background document processing |
| Auth | JWT (python-jose) | Token-based authentication with RBAC |
| Rate Limiting | SlowAPI | Per-endpoint rate limits |
| Sparse Search | rank_bm25 | BM25 keyword matching for hybrid search |
## Getting Started

Prerequisites:

- Python 3.10+
- Docker & Docker Compose (for Milvus & Redis)
- LM Studio running locally (or any OpenAI-compatible LLM server)
```bash
# Start Milvus (vector DB) and Redis (task queue)
docker-compose up -d standalone redis

# Install Python dependencies
pip install -r requirements.txt
```

Edit `.env` with your settings:
```env
# LLM (point to your LM Studio or OpenAI endpoint)
LLM_BASE_URL=http://127.0.0.1:1234/v1
LLM_API_KEY=lm-studio
LLM_MODEL=local-model

# Milvus
MILVUS_HOST=localhost
MILVUS_PORT=19530

# JWT Secret (change in production!)
JWT_SECRET_KEY=super-secret-change-me-in-prod
```

Then start the API:

```bash
uvicorn src.main:app --host 0.0.0.0 --port 8081
```

Visit http://localhost:8081/docs for interactive API documentation.
Quick workflow:

1. `POST /api/v1/auth/register` → create a user
2. `POST /api/v1/auth/login` → get a JWT token
3. Click **Authorize** and enter `Bearer <your_token>`
4. `POST /api/v1/ingest` → upload a document
5. `GET /api/v1/ingest` → see your ingested documents
6. `POST /api/v1/query` → ask questions about your documents
## API Reference

### Auth

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| `POST` | `/api/v1/auth/register` | None | Create a new user account |
| `POST` | `/api/v1/auth/login` | None | Login — returns JWT token |
| `GET` | `/api/v1/auth/me` | Bearer | Current user info |
| `GET` | `/api/v1/auth/users` | Admin | List all users |
### Ingestion

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| `POST` | `/api/v1/ingest` | Bearer | Upload document (PDF/TXT/MD) |
| `GET` | `/api/v1/ingest` | Bearer | List your ingested documents |
| `DELETE` | `/api/v1/ingest/{doc_id}` | Bearer | Delete a specific document |
| `DELETE` | `/api/v1/ingest` | Bearer | Delete all your documents |
### Query

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| `POST` | `/api/v1/query` | Bearer | Ask a question (supports SSE streaming) |
Query Request Body:
```json
{
  "query": "What is retrieval-augmented generation?",
  "stream": false,
  "enable_hyde": true,
  "enable_reranking": true,
  "enable_self_rag": true,
  "top_k": 5,
  "filters": {}
}
```

### System

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/` | App info |
| `GET` | `/health` | Health check + Milvus status |
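For reference, a minimal Python client for the query endpoint. This is a sketch: it assumes the `requests` package, username/password login fields, and an `access_token` field in the login response — check `/docs` for the exact schema.

```python
import requests

BASE = "http://localhost:8081/api/v1"

# Hypothetical credentials/field names; see the Swagger UI for the real schema.
token = requests.post(
    f"{BASE}/auth/login",
    json={"username": "alice", "password": "secret"},
).json()["access_token"]

resp = requests.post(
    f"{BASE}/query",
    headers={"Authorization": f"Bearer {token}"},
    json={"query": "What is retrieval-augmented generation?",
          "stream": False, "top_k": 5},
)
print(resp.json())
```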
## Configuration

All settings are in `.env` and loaded via `src/core/config.py`:
| Variable | Default | Description |
|---|---|---|
| `LLM_BASE_URL` | `http://127.0.0.1:1234/v1` | LLM endpoint |
| `LLM_MODEL` | `local-model` | Model name |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Sentence-transformer model |
| `EMBEDDING_DIM` | `384` | Embedding dimension |
| `CHUNK_SIZE` | `512` | Characters per chunk |
| `CHUNK_OVERLAP` | `128` | Overlap between chunks |
| `RETRIEVAL_TOP_K` | `20` | Candidates from search |
| `RETRIEVAL_FINAL_K` | `5` | Final chunks after reranking |
| `ENABLE_HYDE` | `true` | Enable HyDE query enhancement |
| `ENABLE_RERANKING` | `true` | Enable cross-encoder reranking |
| `ENABLE_SELF_RAG` | `true` | Enable self-reflection loop |
| `RATE_LIMIT_QUERY` | `30/minute` | Query endpoint rate limit |
| `JWT_SECRET_KEY` | ... | JWT signing key |
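This is roughly what a centralized settings class looks like with the `pydantic-settings` package — a sketch covering a subset of the variables above; the project's actual `config.py` may use a different Pydantic version or structure:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Reads values from the environment / .env, falling back to defaults.
    model_config = SettingsConfigDict(env_file=".env")

    llm_base_url: str = "http://127.0.0.1:1234/v1"
    llm_model: str = "local-model"
    embedding_model: str = "all-MiniLM-L6-v2"
    embedding_dim: int = 384
    chunk_size: int = 512
    chunk_overlap: int = 128
    enable_hyde: bool = True

settings = Settings()  # e.g. settings.llm_base_url
```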
## Understanding the Code — Jupyter Notebook

For a deep dive into how each component works, open the interactive notebook:

```bash
jupyter notebook notebooks/rag_explained.ipynb
```

The notebook walks through:
- Document Chunking — how text is split into overlapping segments
- Embedding — how chunks become 384-dim vectors
- Vector Search — how Milvus finds similar documents
- BM25 Sparse Search — keyword-based retrieval
- Hybrid Fusion (RRF) — combining dense + sparse results
- Cross-Encoder Reranking — improving result quality
- HyDE — hypothetical document embeddings
- Prompt Optimization — lost-in-the-middle reordering
- Self-RAG — self-reflection and re-retrieval
```bash
# Run all 12 integration tests
python -m pytest tests/test_api.py -v
```

Tests cover: health checks, authentication, document ingestion, and RAG queries.
```bash
# Full stack (Milvus + Redis + API + Celery Worker)
docker-compose up -d

# API only (if infra already running)
docker-compose up -d rag-api
```

## License

MIT
Screenshots:

- Streaming chat with RAG responses and source attribution
- Drag-and-drop upload, document listing with metadata
- Running the selene-1-mini-llama-3.1-8b model locally