Developer Documentation AI Assistant

A production-quality RAG system that lets developers ask natural language questions about documentation and receive grounded, source-cited answers — powered by a local LLM running entirely on your machine. No API keys, no cloud dependencies, no data leaves your laptop.

Supports both Naive RAG (fixed retrieve-then-generate pipeline) and Agentic RAG (LLM decides when and how to search, can perform multiple searches per question).

How It Works

Developer asks: "What is the difference between a 401 and 403 error?"

                        ┌─────────────────────────┐
                        │   Developer Question     │
                        └────────────┬────────────┘
                                     │
                        ┌────────────▼────────────┐
                        │   Embedding (MiniLM)     │
                        │   Question → 384 numbers │
                        └────────────┬────────────┘
                                     │
                        ┌────────────▼────────────┐
                        │   ChromaDB Vector Search │
                        │   Find closest chunks    │
                        └────────────┬────────────┘
                                     │
                        ┌────────────▼────────────┐
                        │   Mistral via Ollama     │
                        │   Generate grounded      │
                        │   answer from chunks     │
                        └────────────┬────────────┘
                                     │
                        ┌────────────▼────────────┐
                        │   Answer + Sources       │
                        │   "401 means token is    │
                        │   missing or invalid..." │
                        │   [authentication.md]    │
                        └─────────────────────────┘

System response:

{
  "answer": "A 401 Unauthorized error means your token is missing or invalid.
             Request a new token via POST /auth/token. A 403 Forbidden error
             means your token is valid but you lack permission for the action.",
  "sources": [
    {"file_name": "authentication.md", "similarity_score": 0.551},
    {"file_name": "error_handling.md", "similarity_score": 0.359}
  ],
  "model": "mistral"
}

Every answer is generated from chunks retrieved from your documentation and cites the files it drew on, which limits hallucination.
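The generation step in the diagram depends on how the retrieved chunks are packed into the prompt. The sketch below shows one plausible way to do that. The function name `build_prompt`, the chunk dictionary keys, and the prompt wording are illustrative assumptions, not the project's actual template.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved chunks (illustrative only)."""
    # Label each chunk with its source file so the model can cite it.
    context = "\n\n".join(
        f"[{chunk['file_name']}]\n{chunk['text']}" for chunk in chunks
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical chunk contents, loosely based on the example answer above.
chunks = [
    {"file_name": "authentication.md", "text": "A 401 means the token is missing or invalid."},
    {"file_name": "error_handling.md", "text": "A 403 means the token lacks permission."},
]
prompt = build_prompt("What is the difference between a 401 and 403 error?", chunks)
```

Keeping the file name next to each chunk is what makes source citation possible in the final response.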

Naive RAG vs Agentic RAG

This project implements both approaches. Switch between them with one environment variable.

Feature	Naive RAG	Agentic RAG
Search strategy	Always retrieves once	LLM decides when and how to search
Complex questions	Hopes all info is in top 3 chunks	Breaks question into sub-queries
Missing info	Generates answer anyway	Can say "I don't have enough info"
Follow-up searches	Never	Searches again with rephrased query
Speed	Fast (~5 seconds)	Slower (~15 seconds)
Reliability	Predictable, consistent	Depends on LLM reasoning quality
Best for	Simple, well-defined questions	Complex, multi-part questions
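
The agentic column can be pictured as a loop in which the model either requests another search or commits to an answer. The following sketch uses stub callables in place of Mistral and ChromaDB. The names `agentic_answer`, `decide`, `search`, and `max_steps` are hypothetical, and the project's LangGraph implementation may be structured quite differently.

```python
from typing import Callable


def agentic_answer(
    question: str,
    decide: Callable[[str, list[str]], dict],
    search: Callable[[str], list[str]],
    max_steps: int = 3,
) -> dict:
    """Loop until the decider answers or the step budget is spent."""
    context: list[str] = []
    queries: list[str] = []
    for _ in range(max_steps):
        action = decide(question, context)
        if action["type"] == "answer":
            return {"answer": action["text"], "search_queries": queries}
        # Otherwise the decider asked for another search.
        queries.append(action["query"])
        context.extend(search(action["query"]))
    # Budget exhausted: admit the gap instead of guessing.
    return {"answer": "I don't have enough info", "search_queries": queries}


# Stubs standing in for the LLM and the vector store.
def fake_decide(question: str, context: list[str]) -> dict:
    if len(context) < 2:
        return {"type": "search", "query": f"{question} (attempt {len(context) + 1})"}
    return {"type": "answer", "text": "Answer based on: " + "; ".join(context)}


def fake_search(query: str) -> list[str]:
    return [f"chunk for '{query}'"]


result = agentic_answer("401 vs 403", fake_decide, fake_search)
```

The step budget is the main design lever here: it bounds latency at the cost of occasionally giving up on a question that one more search would have answered.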

Switch modes by setting PIPELINE_MODE in your .env:

PIPELINE_MODE=naive      # default — fast and predictable
PIPELINE_MODE=agentic    # LLM-driven search decisions
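
A minimal sketch of how such a switch might be read and validated at startup. The helper `resolve_pipeline_mode` is a hypothetical name; the project itself loads settings through Pydantic Settings, so this stdlib version only illustrates the idea.

```python
import os
from typing import Mapping, Optional

VALID_MODES = {"naive", "agentic"}


def resolve_pipeline_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured mode, falling back to 'naive' when unset."""
    source = os.environ if env is None else env
    mode = source.get("PIPELINE_MODE", "naive").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(
            f"PIPELINE_MODE must be one of {sorted(VALID_MODES)}, got {mode!r}"
        )
    return mode
```

Failing fast on an unknown value avoids silently running the wrong pipeline after a typo in .env.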

Tech Stack

Component	Technology	Purpose
Orchestration	LangChain + LangGraph	Pipeline wiring and agent framework
Local LLM	Ollama (Mistral 7B)	Answer generation, runs on your machine
Vector DB	ChromaDB	Stores and searches document embeddings
Embeddings	SentenceTransformers (all-MiniLM-L6-v2)	Converts text to 384-dim vectors
API	FastAPI	REST endpoint for queries
UI	Streamlit	Chat interface with source citations
Evaluation	RAGAS	Measures faithfulness, relevancy, precision
Deployment	Docker + Docker Compose	One-command containerized deployment
Testing	pytest	63 automated tests across all modules

Evaluation Results (RAGAS)

Evaluated using 7 test cases with Mistral as the local LLM judge:

Metric	Score	What it measures
Faithfulness	1.000	Does the answer only use info from retrieved docs?
Context Precision	1.000	Were the right chunks retrieved for the question?
Answer Relevancy	0.458	Does the answer address the question asked?
Overall	0.819	Mean of the three metrics above

A faithfulness score of 1.0 means that, across the 7 test cases, every claim in the generated answers traced back to a retrieved document chunk. A context precision of 1.0 means the retriever found the correct documentation for every test question. The answer relevancy score is not reliable: RAGAS evaluation timed out under local CPU inference, so most questions returned nan for this metric.
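
The Overall figure is consistent with an equal-weight mean of the three reported metrics, and any per-question nan values have to be excluded before averaging. The sketch below works through both calculations. The helper name `nan_mean` and the per-question values are made up for illustration and are not taken from the actual evaluation run.

```python
import math


def nan_mean(values: list[float]) -> float:
    """Average the values, skipping nan entries; nan if nothing is left."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan


# Reported metric scores from the table above.
scores = {"faithfulness": 1.000, "context_precision": 1.000, "answer_relevancy": 0.458}
overall = round(nan_mean(list(scores.values())), 3)  # (1.000 + 1.000 + 0.458) / 3

# Hypothetical per-question relevancy scores where most evaluations timed out.
per_question = [math.nan, 0.5, math.nan, math.nan, 0.4, math.nan, math.nan]
relevancy = nan_mean(per_question)  # averages only the two scored questions
```

With only a couple of scored questions behind it, a metric computed this way says little about the system as a whole.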

Getting Started

Prerequisites

Python 3.11+ (tested on 3.13)
Ollama installed (ollama.com)
~4GB disk space for Mistral model

Setup

# Clone the repo
git clone https://github.com/sumreen7/developer-knowledge-rag.git
cd developer-knowledge-rag

# Create virtual environment
python -m venv .venv
source .venv/bin/activate    # Mac/Linux

# Install dependencies
pip install --upgrade pip
pip install -e ".[dev]"
pip install langchain-chroma langchain-huggingface langchain-ollama \
            langchain-text-splitters langchain langgraph markdown

# Copy environment config
cp .env.example .env

# Start the Ollama server (keep it running in a separate terminal)
ollama serve

# Pull the Mistral model (from another terminal)
ollama pull mistral

Run the Pipeline

# 1. Ingest your documentation
python -m src.ingestion.run_ingestion

# 2. Start the API server
.venv/bin/uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000

# 3. Start the chat UI (separate terminal)
streamlit run ui/app.py

# 4. Open the chat UI (macOS; on Linux use xdg-open or paste the URL into a browser)
open http://localhost:8501

# 5. Or query the API directly
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I authenticate with the API?"}'

Run with Docker

# Start Ollama
docker-compose up -d ollama

# Pull the model
docker-compose exec ollama ollama pull mistral

# Ingest documents
docker-compose --profile ingestion up ingestion

# Start API + UI
docker-compose up -d api ui

# Open the UI (macOS; on Linux use xdg-open or paste the URL into a browser)
open http://localhost:8501

Adding Your Own Documentation

Drop any .md, .txt, or .pdf file into data/raw/ and run ingestion:

python -m src.ingestion.run_ingestion

The pipeline uses incremental ingestion — it queries ChromaDB's metadata index for already-embedded file names and skips them automatically. Only new, previously unseen documents get chunked, embedded, and stored. No duplicate vectors, no redundant computation, no manual vectorstore cleanup.
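
The skip logic described above can be sketched as a simple set-membership filter. In the real pipeline the already-indexed names would come from ChromaDB metadata; here a plain set stands in for that lookup, and `select_new_files` and `SUPPORTED_SUFFIXES` are hypothetical names.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".md", ".txt", ".pdf"}


def select_new_files(candidates: list[Path], indexed_names: set[str]) -> list[Path]:
    """Keep supported files whose names are not already in the index."""
    return [
        path
        for path in candidates
        if path.suffix.lower() in SUPPORTED_SUFFIXES and path.name not in indexed_names
    ]


# A plain set stands in for the file names stored in the vector store.
indexed = {"authentication.md", "error_handling.md"}
found = [
    Path("data/raw/authentication.md"),
    Path("data/raw/deployment_guide.md"),
    Path("data/raw/logo.png"),  # unsupported type, ignored
]
to_ingest = select_new_files(found, indexed)
```

Matching on file name alone means an edited file with an unchanged name is still skipped, which is why a separate re-index path is needed for modified documents.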

Ingestion Modes

Mode	Command	Behavior
Incremental (default)	`python -m src.ingestion.run_ingestion`	Skips already-indexed files, processes only new ones
Force re-index	`python -m src.ingestion.run_ingestion --force`	Deletes vectorstore, re-embeds entire corpus from scratch
Single-file re-index	`python -m src.ingestion.run_ingestion --reindex auth.md`	Deletes stale chunks for one file, re-embeds it

Example: Adding a New Document

# Drop a new file into the data directory
cp deployment_guide.md data/raw/

# Run ingestion — only the new file gets processed
python -m src.ingestion.run_ingestion

# Output:
# Skipping 4 already-ingested file(s)
# deployment_guide.md → 3 chunk(s)
# New files ingested: 1
# Total in store:     9

Use --reindex when a document has been edited and needs re-embedding, or --force when you want to rebuild the entire vector index from scratch.
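
For readers curious how these flags could be wired up, here is a minimal argparse sketch. Only the flag names `--force` and `--reindex` come from this README; the functions `build_parser` and `ingestion_mode`, and the choice to make the two flags mutually exclusive, are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with the two re-indexing flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run_ingestion")
    # Assumption: rebuilding everything and re-indexing one file do not combine.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--force", action="store_true", help="rebuild the whole index")
    group.add_argument("--reindex", metavar="FILE", help="re-embed a single file")
    return parser


def ingestion_mode(argv: list[str]) -> str:
    """Map command-line arguments to a mode label."""
    args = build_parser().parse_args(argv)
    if args.force:
        return "force"
    if args.reindex:
        return f"reindex:{args.reindex}"
    return "incremental"
```

Calling `ingestion_mode([])` yields the incremental default, mirroring the behavior described in the Ingestion Modes table.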

API Reference

POST /query

Ask a question about the documentation.

Request:

{
  "question": "How do I authenticate with the API?",
  "k": 3
}

Response:

{
  "question": "How do I authenticate with the API?",
  "answer": "Use Bearer token authentication. Send a POST request to /auth/token...",
  "sources": [
    {
      "file_name": "authentication.md",
      "chunk_index": 0,
      "content_preview": "Authentication Guide...",
      "similarity_score": 0.611
    }
  ],
  "model": "mistral",
  "num_chunks_retrieved": 3,
  "agent_steps": 0,
  "search_queries": []
}
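
The same request can be issued from Python using only the standard library. This is a sketch under the assumption that the server runs at http://localhost:8000 as in the curl example; `build_query_request` and `summarize_response` are hypothetical helper names, and the field names follow the response shown above.

```python
import json
import urllib.request


def build_query_request(
    question: str, k: int = 3, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) a POST /query request."""
    body = json.dumps({"question": question, "k": k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_response(payload: dict) -> str:
    """Format the answer followed by its cited source files and scores."""
    sources = ", ".join(
        f"{s['file_name']} ({s['similarity_score']:.3f})" for s in payload["sources"]
    )
    return f"{payload['answer']}\nSources: {sources}"


request = build_query_request("How do I authenticate with the API?")
# To send it against a running server:
# with urllib.request.urlopen(request) as resp:
#     print(summarize_response(json.load(resp)))
```

Separating request construction from sending keeps the example usable without a live server.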

GET /health

Check pipeline status.

{
  "status": "healthy",
  "pipeline_mode": "naive",
  "ollama": "connected",
  "model": "mistral",
  "config": {
    "ollama_model": "mistral",
    "embedding_model": "all-MiniLM-L6-v2",
    "chroma_collection": "docs"
  }
}

GET /docs

Interactive Swagger UI for testing the API in your browser.

Running Tests

.venv/bin/pytest tests/ -v

63 tests across all modules:

Module	Tests	What they cover
API	10	Endpoint responses, validation, pipeline integration
Embeddings	6	Model loading, ChromaDB storage, collection stats
Chunker	9	Splitting, metadata inheritance, edge cases
Loader	7	File loading, metadata, unsupported types
LLM	7	Ollama client, prompt template, error handling
RAG Pipeline	10	End-to-end query, sources, response format
Retrieval	10	Similarity search, scores, context formatting
Health	4	API health check, config exposure

Running Evaluation

python -m src.evaluation.evaluate

Runs 7 test cases through the full pipeline, uses Mistral as the local judge, and saves timestamped results to evaluation_results/.

Project Structure

developer-knowledge-rag/
├── src/
│   ├── api/
│   │   └── main.py                # FastAPI — POST /query, GET /health
│   ├── embeddings/
│   │   └── embedder.py            # SentenceTransformers → ChromaDB + incremental indexing
│   ├── evaluation/
│   │   └── evaluate.py            # RAGAS evaluation script
│   ├── ingestion/
│   │   ├── loader.py              # Loads .md, .txt, .pdf files
│   │   ├── chunker.py             # Splits documents into chunks
│   │   └── run_ingestion.py       # Incremental pipeline: load → chunk → embed → store
│   ├── llm/
│   │   └── ollama_client.py       # Mistral via Ollama wrapper
│   ├── rag/
│   │   ├── pipeline.py            # Naive RAG: retrieve once → generate
│   │   ├── agentic_pipeline.py    # Agentic RAG: LLM-driven search decisions
│   │   ├── run_rag.py             # Test naive pipeline
│   │   └── run_agentic.py         # Test agentic pipeline
│   ├── retrieval/
│   │   ├── retriever.py           # ChromaDB similarity search
│   │   └── run_retrieval.py       # Test retrieval standalone
│   └── config.py                  # Pydantic Settings from .env
├── tests/                         # 63 tests mirroring src/ structure
├── data/raw/                      # Your documentation files go here
├── ui/
│   └── app.py                     # Streamlit chat interface
├── vectorstore/                   # ChromaDB files (auto-generated)
├── evaluation_results/            # RAGAS JSON reports
├── Dockerfile                     # FastAPI backend container
├── Dockerfile.ui                  # Streamlit UI container
├── docker-compose.yml             # Full stack orchestration
├── .dockerignore
├── .env.example
├── pyproject.toml                 # Dependencies and project config
└── README.MD

Build Phases

Phase	Description	Status
1	Project setup, config, health check API	✅
2	Document ingestion pipeline with incremental indexing	✅
3	Document chunking (RecursiveCharacterTextSplitter)	✅
4	Embeddings + ChromaDB vector store	✅
5	Retrieval pipeline (similarity search)	✅
6	Ollama LLM integration (Mistral)	✅
7	RAG pipeline (retrieve → generate)	✅
8	FastAPI inference endpoint	✅
9	Streamlit chat interface	✅
10	RAGAS evaluation (faithfulness, relevancy, precision)	✅
11	Docker deployment + Agentic RAG upgrade	✅

How Chunking Works

Documents are split into overlapping pieces using LangChain's RecursiveCharacterTextSplitter. It cuts at natural boundaries in this priority: paragraph → line → sentence → word.

chunk_size    = 512 characters
chunk_overlap = 50 characters

The overlap reduces the chance that an answer sitting on a chunk boundary is cut in half: the last 50 characters of one chunk are repeated at the start of the next.
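
To make the size and overlap settings concrete, the sketch below implements a plain fixed-size sliding window. This is a deliberately simplified stand-in, not RecursiveCharacterTextSplitter: it ignores paragraph and sentence boundaries and only shows how the two numbers interact. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1200 characters of filler text to split.
text = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = sliding_chunks(text)
```

With the defaults, each window advances 462 characters, so the tail of every chunk reappears at the head of the next one.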

How Embeddings Work

Each chunk is converted into 384 numbers by the all-MiniLM-L6-v2 model. Chunks with similar meaning produce similar numbers. When a developer asks a question, the question is also converted to 384 numbers — and ChromaDB finds the chunks whose numbers are closest.

This is why the system can answer "how do I log in?" even when the docs say "Bearer token authentication" — same meaning, similar vectors.
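
The notion of "closest" can be illustrated with cosine similarity over toy vectors. This is a sketch only: the 3-dimensional vectors are invented, `rate_limits.md` is a hypothetical file name, and cosine similarity is used as a common choice since the distance metric configured for the project's ChromaDB collection is not stated here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Return the k chunk names most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy 3-dimensional vectors; real all-MiniLM-L6-v2 embeddings have 384 dimensions.
chunk_vecs = {
    "authentication.md": [0.9, 0.1, 0.0],
    "rate_limits.md": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for the embedding of "how do I log in?"
ranked = rank_chunks(query_vec, chunk_vecs)
```

The query vector points in nearly the same direction as the authentication chunk, so that chunk ranks first even though no words are compared directly.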

Configuration

All settings are managed through a .env file using Pydantic Settings:

OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=mistral
EMBEDDING_MODEL=all-MiniLM-L6-v2
CHROMA_PERSIST_DIRECTORY=./vectorstore
DATA_DIRECTORY=./data/raw
CHUNK_SIZE=512
CHUNK_OVERLAP=50
PIPELINE_MODE=naive
API_DEBUG=true

Built With

LangChain — LLM orchestration
LangGraph — Agentic RAG framework
Ollama — Local LLM inference
ChromaDB — Vector database
SentenceTransformers — Text embeddings
FastAPI — REST API
Streamlit — Chat UI
RAGAS — RAG evaluation
Docker — Containerized deployment
