A production-quality RAG system that lets developers ask natural language questions about documentation and receive grounded, source-cited answers — powered by a local LLM running entirely on your machine. No API keys, no cloud dependencies, no data leaves your laptop.
Supports both Naive RAG (fixed retrieve-then-generate pipeline) and Agentic RAG (LLM decides when and how to search, can perform multiple searches per question).
Developer asks: "What is the difference between a 401 and 403 error?"
┌─────────────────────────┐
│  Developer Question     │
└────────────┬────────────┘
             │
┌────────────▼────────────┐
│  Embedding (MiniLM)     │
│  Question → 384 numbers │
└────────────┬────────────┘
             │
┌────────────▼────────────┐
│  ChromaDB Vector Search │
│  Find closest chunks    │
└────────────┬────────────┘
             │
┌────────────▼────────────┐
│  Mistral via Ollama     │
│  Generate grounded      │
│  answer from chunks     │
└────────────┬────────────┘
             │
┌────────────▼────────────┐
│  Answer + Sources       │
│  "401 means token is    │
│  missing or invalid..." │
│  [authentication.md]    │
└─────────────────────────┘
System response:
{
"answer": "A 401 Unauthorized error means your token is missing or invalid. Request a new token via POST /auth/token. A 403 Forbidden error means your token is valid but you lack permission for the action.",
"sources": [
{"file_name": "authentication.md", "similarity_score": 0.551},
{"file_name": "error_handling.md", "similarity_score": 0.359}
],
"model": "mistral"
}

Every answer is grounded in your documentation and cites its sources, which limits hallucination.
This project implements both Naive RAG and Agentic RAG. Switch between them with one environment variable.
| Feature | Naive RAG | Agentic RAG |
|---|---|---|
| Search strategy | Always retrieves once | LLM decides when and how to search |
| Complex questions | Hopes all info is in top 3 chunks | Breaks question into sub-queries |
| Missing info | Generates answer anyway | Can say "I don't have enough info" |
| Follow-up searches | Never | Searches again with rephrased query |
| Speed | Fast (~5 seconds) | Slower (~15 seconds) |
| Reliability | Predictable, consistent | Depends on LLM reasoning quality |
| Best for | Simple, well-defined questions | Complex, multi-part questions |
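
To make the contrast in the table concrete, here is a minimal sketch of the two control flows. The `retrieve`, `generate`, and `decide` callables are hypothetical stand-ins, not the functions in `src/rag/`; the project's agentic pipeline is built on LangGraph and may be structured differently. The `max_steps` cap is also an assumption.

```python
from typing import Callable, List, Optional

Retrieve = Callable[[str], List[str]]        # query -> chunks
Generate = Callable[[str, List[str]], str]   # (question, chunks) -> answer
# Decide returns the next search query, or None when the model is ready to answer.
Decide = Callable[[str, List[str]], Optional[str]]


def naive_rag(question: str, retrieve: Retrieve, generate: Generate) -> dict:
    """Fixed pipeline: exactly one retrieval, then one generation."""
    chunks = retrieve(question)
    return {"answer": generate(question, chunks), "search_queries": []}


def agentic_rag(
    question: str,
    retrieve: Retrieve,
    generate: Generate,
    decide: Decide,
    max_steps: int = 3,  # hypothetical cap so a confused model cannot loop forever
) -> dict:
    """Loop: the model picks each search query until it decides to stop."""
    chunks: List[str] = []
    queries: List[str] = []
    for _ in range(max_steps):
        query = decide(question, chunks)
        if query is None:
            break
        queries.append(query)
        for chunk in retrieve(query):
            if chunk not in chunks:  # keep each retrieved chunk only once
                chunks.append(chunk)
    return {"answer": generate(question, chunks), "search_queries": queries}
```

In the naive flow the number of searches is fixed at one; in the agentic flow it depends on what `decide` returns, which is why its latency and reliability vary with the model's reasoning.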
Switch modes by setting PIPELINE_MODE in your .env:
PIPELINE_MODE=naive # default — fast and predictable
PIPELINE_MODE=agentic # LLM-driven search decisions
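
As a rough illustration of how a mode switch like this can be read and validated, the sketch below uses only the standard library. The project itself loads settings through Pydantic Settings, so `resolve_pipeline_mode` and `VALID_MODES` are hypothetical names, and the case-insensitive handling is an assumption.

```python
import os

VALID_MODES = ("naive", "agentic")


def resolve_pipeline_mode(env=None) -> str:
    """Return the configured mode, defaulting to 'naive' when unset."""
    env = os.environ if env is None else env
    mode = env.get("PIPELINE_MODE", "naive").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"PIPELINE_MODE must be one of {VALID_MODES}, got {mode!r}")
    return mode
```

Failing fast on an unknown value avoids silently falling back to a pipeline the user did not ask for.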
| Component | Technology | Purpose |
|---|---|---|
| Orchestration | LangChain + LangGraph | Pipeline wiring and agent framework |
| Local LLM | Ollama (Mistral 7B) | Answer generation, runs on your machine |
| Vector DB | ChromaDB | Stores and searches document embeddings |
| Embeddings | SentenceTransformers (all-MiniLM-L6-v2) | Converts text to 384-dim vectors |
| API | FastAPI | REST endpoint for queries |
| UI | Streamlit | Chat interface with source citations |
| Evaluation | RAGAS | Measures faithfulness, relevancy, precision |
| Deployment | Docker + Docker Compose | One-command containerized deployment |
| Testing | pytest | 63 automated tests across all modules |
Evaluated using 7 test cases with Mistral as the local LLM judge:
| Metric | Score | What it measures |
|---|---|---|
| Faithfulness | 1.000 | Does the answer only use info from retrieved docs? |
| Context Precision | 1.000 | Were the right chunks retrieved for the question? |
| Answer Relevancy | 0.458 | Does the answer address the question asked? |
| Overall | 0.819 | Unweighted mean of the three metrics |
Faithfulness at 1.0 means that, across these 7 test cases, every claim in the answers traces back to
a retrieved document chunk. Context precision at 1.0 means the retriever found the correct
documentation for every test question. Answer relevancy was affected by timeouts during
local CPU inference: most questions returned nan because the RAGAS evaluation timed out, so the 0.458 figure should be read with caution.
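
The arithmetic behind these numbers can be checked with a few lines. The reported overall of 0.819 is consistent with a plain mean of the three metric scores. How RAGAS or the project's evaluation script actually aggregates nan values is not stated here, so the nan-skipping helper and the per-question list below are assumptions for illustration only.

```python
import math


def mean_ignoring_nan(scores):
    """Average the scores that are real numbers; return nan if none are."""
    valid = [s for s in scores if not math.isnan(s)]
    return sum(valid) / len(valid) if valid else math.nan


# Overall score from the three reported metrics (plain mean).
overall = mean_ignoring_nan([1.000, 1.000, 0.458])
print(round(overall, 3))  # 0.819

# Hypothetical per-question relevancy scores where most judge calls timed out.
per_question = [0.9, math.nan, math.nan, 0.5, math.nan, math.nan, math.nan]
print(round(mean_ignoring_nan(per_question), 2))  # 0.7
```

When only two of seven questions produce a score, the average rests on a very small sample, which is why a metric with many nan results is hard to interpret.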
- Python 3.11+ (tested on 3.13)
- Ollama installed (ollama.com)
- ~4GB disk space for Mistral model
# Clone the repo
git clone https://github.com/sumreen7/developer-knowledge-rag.git
cd developer-knowledge-rag
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Mac/Linux
# Install dependencies
pip install --upgrade pip
pip install -e ".[dev]"
pip install langchain-chroma langchain-huggingface langchain-ollama \
langchain-text-splitters langchain langgraph markdown
# Copy environment config
cp .env.example .env
# Start Ollama and pull Mistral (separate terminal)
ollama serve
ollama pull mistral

# 1. Ingest your documentation
python -m src.ingestion.run_ingestion
# 2. Start the API server
.venv/bin/uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
# 3. Start the chat UI (separate terminal)
streamlit run ui/app.py
# 4. Open the chat UI
open http://localhost:8501
# 5. Or query the API directly
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "How do I authenticate with the API?"}'

# Start Ollama
docker-compose up -d ollama
# Pull the model
docker-compose exec ollama ollama pull mistral
# Ingest documents
docker-compose --profile ingestion up ingestion
# Start API + UI
docker-compose up -d api ui
# Open the UI
open http://localhost:8501

Drop any .md, .txt, or .pdf file into data/raw/ and run ingestion:

python -m src.ingestion.run_ingestion

The pipeline uses incremental ingestion: it queries ChromaDB's metadata index for already-embedded file names and skips them automatically. Only new, previously unseen documents get chunked, embedded, and stored. No duplicate vectors, no redundant computation, no manual vectorstore cleanup.
| Mode | Command | Behavior |
|---|---|---|
| Incremental (default) | python -m src.ingestion.run_ingestion | Skips already-indexed files, processes only new ones |
| Force re-index | python -m src.ingestion.run_ingestion --force | Deletes vectorstore, re-embeds entire corpus from scratch |
| Single-file re-index | python -m src.ingestion.run_ingestion --reindex auth.md | Deletes stale chunks for one file, re-embeds it |
# Drop a new file into the data directory
cp deployment_guide.md data/raw/
# Run ingestion — only the new file gets processed
python -m src.ingestion.run_ingestion
# Output:
# Skipping 4 already-ingested file(s)
# deployment_guide.md → 3 chunk(s)
# New files ingested: 1
# Total in store: 9

Use --reindex when a document has been edited and needs re-embedding,
or --force when you want to rebuild the entire vector index from scratch.
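
The three ingestion modes can be sketched as a single planning step that decides what to delete and what to embed. This is a simplified model under stated assumptions: `plan_ingestion` is a hypothetical helper, the set of indexed names stands in for the lookup against ChromaDB's metadata, and the real `--force` path deletes the whole vectorstore instead of listing files.

```python
from pathlib import Path
from typing import Iterable, Optional, Set

SUPPORTED = {".md", ".txt", ".pdf"}


def plan_ingestion(
    files: Iterable[str],
    indexed: Set[str],
    force: bool = False,
    reindex: Optional[str] = None,
) -> dict:
    """Decide which files to embed and which stale entries to delete first."""
    candidates = sorted(f for f in files if Path(f).suffix.lower() in SUPPORTED)
    if force:
        # Rebuild everything: drop all existing entries, embed every file.
        return {"delete": sorted(indexed), "embed": candidates}
    if reindex is not None:
        # Replace one file's chunks and leave the rest of the index alone.
        return {
            "delete": [reindex] if reindex in indexed else [],
            "embed": [reindex] if reindex in candidates else [],
        }
    # Default: embed only files whose names are not yet in the index.
    return {"delete": [], "embed": [f for f in candidates if f not in indexed]}
```

Because the default mode keys on file names alone, an edited file with an unchanged name is skipped, which is the case `--reindex` exists for.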
POST /query: ask a question about the documentation.
Request:
{
"question": "How do I authenticate with the API?",
"k": 3
}

Response:
{
"question": "How do I authenticate with the API?",
"answer": "Use Bearer token authentication. Send a POST request to /auth/token...",
"sources": [
{
"file_name": "authentication.md",
"chunk_index": 0,
"content_preview": "Authentication Guide...",
"similarity_score": 0.611
}
],
"model": "mistral",
"num_chunks_retrieved": 3,
"agent_steps": 0,
"search_queries": []
}

GET /health: check pipeline status.
{
"status": "healthy",
"pipeline_mode": "naive",
"ollama": "connected",
"model": "mistral",
"config": {
"ollama_model": "mistral",
"embedding_model": "all-MiniLM-L6-v2",
"chroma_collection": "docs"
}
}

Interactive Swagger UI for testing the API in your browser. FastAPI serves it at /docs by default.
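
For scripted access, the query endpoint can also be called from Python with the standard library alone. The sketch below assumes the server is running on port 8000 as in the uvicorn command shown earlier; `build_query_request` and `ask` are hypothetical helper names, and the 120-second timeout is an arbitrary allowance for slow local inference.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed: matches the uvicorn command above


def build_query_request(question: str, k: int = 3) -> urllib.request.Request:
    """Build the POST /query request without sending it."""
    body = json.dumps({"question": question, "k": k}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, k: int = 3, timeout: float = 120.0) -> dict:
    """Send the question and return the parsed JSON response."""
    request = build_query_request(question, k)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API up, `ask("How do I authenticate with the API?")` should return a dictionary shaped like the response above, so `result["answer"]` and `result["sources"]` can be read directly.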
.venv/bin/pytest tests/ -v

63 tests across all modules:
| Module | Tests | What they cover |
|---|---|---|
| API | 10 | Endpoint responses, validation, pipeline integration |
| Embeddings | 6 | Model loading, ChromaDB storage, collection stats |
| Chunker | 9 | Splitting, metadata inheritance, edge cases |
| Loader | 7 | File loading, metadata, unsupported types |
| LLM | 7 | Ollama client, prompt template, error handling |
| RAG Pipeline | 10 | End-to-end query, sources, response format |
| Retrieval | 10 | Similarity search, scores, context formatting |
| Health | 4 | API health check, config exposure |
python -m src.evaluation.evaluate

Runs 7 test cases through the full pipeline, uses Mistral as the local judge,
and saves timestamped results to evaluation_results/.
developer-knowledge-rag/
├── src/
│ ├── api/
│ │ └── main.py # FastAPI — POST /query, GET /health
│ ├── embeddings/
│ │ └── embedder.py # SentenceTransformers → ChromaDB + incremental indexing
│ ├── evaluation/
│ │ └── evaluate.py # RAGAS evaluation script
│ ├── ingestion/
│ │ ├── loader.py # Loads .md, .txt, .pdf files
│ │ ├── chunker.py # Splits documents into chunks
│ │ └── run_ingestion.py # Incremental pipeline: load → chunk → embed → store
│ ├── llm/
│ │ └── ollama_client.py # Mistral via Ollama wrapper
│ ├── rag/
│ │ ├── pipeline.py # Naive RAG: retrieve once → generate
│ │ ├── agentic_pipeline.py # Agentic RAG: LLM-driven search decisions
│ │ ├── run_rag.py # Test naive pipeline
│ │ └── run_agentic.py # Test agentic pipeline
│ ├── retrieval/
│ │ ├── retriever.py # ChromaDB similarity search
│ │ └── run_retrieval.py # Test retrieval standalone
│ └── config.py # Pydantic Settings from .env
├── tests/ # 63 tests mirroring src/ structure
├── data/raw/ # Your documentation files go here
├── ui/
│ └── app.py # Streamlit chat interface
├── vectorstore/ # ChromaDB files (auto-generated)
├── evaluation_results/ # RAGAS JSON reports
├── Dockerfile # FastAPI backend container
├── Dockerfile.ui # Streamlit UI container
├── docker-compose.yml # Full stack orchestration
├── .dockerignore
├── .env.example
├── pyproject.toml # Dependencies and project config
└── README.MD
| Phase | Description | Status |
|---|---|---|
| 1 | Project setup, config, health check API | ✅ |
| 2 | Document ingestion pipeline with incremental indexing | ✅ |
| 3 | Document chunking (RecursiveCharacterTextSplitter) | ✅ |
| 4 | Embeddings + ChromaDB vector store | ✅ |
| 5 | Retrieval pipeline (similarity search) | ✅ |
| 6 | Ollama LLM integration (Mistral) | ✅ |
| 7 | RAG pipeline (retrieve → generate) | ✅ |
| 8 | FastAPI inference endpoint | ✅ |
| 9 | Streamlit chat interface | ✅ |
| 10 | RAGAS evaluation (faithfulness, relevancy, precision) | ✅ |
| 11 | Docker deployment + Agentic RAG upgrade | ✅ |
Documents are split into overlapping pieces using LangChain's
RecursiveCharacterTextSplitter. It cuts at natural boundaries
in this priority: paragraph → line → sentence → word.
chunk_size = 512 characters
chunk_overlap = 50 characters
The overlap means text near a chunk boundary appears in both neighboring chunks, so an answer that straddles the boundary is less likely to be cut in half.
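
To show how the size and overlap settings interact, here is a deliberately simplified sliding-window chunker. It is not the boundary-aware logic of RecursiveCharacterTextSplitter: it cuts at fixed character offsets and only illustrates the overlap arithmetic. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts 462 characters after the last
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, a 1,000-character document yields windows starting at offsets 0, 462, and 924, and the last 50 characters of each window repeat as the first 50 of the next.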
Each chunk is converted into 384 numbers by the all-MiniLM-L6-v2 model.
Chunks with similar meaning produce similar numbers. When a developer asks
a question, the question is also converted to 384 numbers — and ChromaDB
finds the chunks whose numbers are closest.
This is why the system can answer "how do I log in?" even when the docs say "Bearer token authentication" — same meaning, similar vectors.
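
The idea of "closest numbers" can be illustrated with cosine similarity over toy vectors. This is a sketch under assumptions: the vectors are made-up 3-dimensional stand-ins for real 384-dimensional embeddings, `rate_limits.md` is a hypothetical file, and cosine similarity is one common choice of metric; which distance function the project's ChromaDB collection uses is not stated here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return the k chunk names most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
chunks = {
    "authentication.md": [0.9, 0.1, 0.0],
    "error_handling.md": [0.4, 0.8, 0.1],
    "rate_limits.md": [0.0, 0.2, 0.9],
}
question = [0.8, 0.2, 0.1]  # pretend embedding of "how do I log in?"
print(top_k(question, chunks, k=2))  # ['authentication.md', 'error_handling.md']
```

The question vector points in nearly the same direction as the authentication chunk, so that chunk ranks first even though the two texts share no keywords.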
All settings are managed via .env file using Pydantic Settings:
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=mistral
EMBEDDING_MODEL=all-MiniLM-L6-v2
CHROMA_PERSIST_DIRECTORY=./vectorstore
DATA_DIRECTORY=./data/raw
CHUNK_SIZE=512
CHUNK_OVERLAP=50
PIPELINE_MODE=naive
API_DEBUG=true
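
For readers unfamiliar with typed settings, the sketch below shows the general idea of mapping these variables onto typed fields with defaults. It is a standard-library stand-in, not the project's `src/config.py`, which uses Pydantic Settings. The `Settings` fields, the `load_settings` helper, and the default values (copied from the example above) are all assumptions.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the example .env values above (assumed, not confirmed).
    ollama_base_url: str = "http://localhost:11434"
    ollama_model: str = "mistral"
    embedding_model: str = "all-MiniLM-L6-v2"
    chroma_persist_directory: str = "./vectorstore"
    data_directory: str = "./data/raw"
    chunk_size: int = 512
    chunk_overlap: int = 50
    pipeline_mode: str = "naive"
    api_debug: bool = True


def load_settings(env=None) -> Settings:
    """Build Settings from environment variables, falling back to the defaults."""
    env = os.environ if env is None else env
    values = {}
    for field in fields(Settings):
        raw = env.get(field.name.upper())
        if raw is None:
            continue  # variable not set: keep the default
        if isinstance(field.default, bool):  # check bool before int: bool is an int subclass
            values[field.name] = raw.strip().lower() in ("1", "true", "yes")
        elif isinstance(field.default, int):
            values[field.name] = int(raw)
        else:
            values[field.name] = raw
    return Settings(**values)
```

Converting values at load time means a typo such as `CHUNK_SIZE=big` fails at startup instead of deep inside the chunker.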