🚀 A repo-aware AI assistant that can answer questions about your codebase with citations, built to run entirely in the browser — no servers, no API keys, zero cost.
- Retrieval-Augmented Generation (RAG): combines semantic + exact code search with a lightweight text generator.
- Embeddings & Search:
- `sentence-transformers/all-MiniLM-L6-v2` for embeddings (semantic similarity).
- ripgrep MCP for fast exact search.
- Chroma MCP as a vector database for semantic search.
- Generation:
- `Xenova/LaMini-Flan-T5-248M` (~248M params, quantized) for lightweight instruction following in the browser.
- Client-Side Deployment:
- All models run in the browser using Transformers.js with WebGPU/WebAssembly.
- No backend, no API costs.
- Memory:
- Short-term memory of recent turns for continuity.
- Persistent chat history saved in `localStorage`.
- Ability to pin notes (embedded and retrieved like code chunks).
- Optional rolling summaries of the dialogue.
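The memory scheme above (recent turns kept verbatim, older turns folded into a rolling summary) can be sketched as follows. The app itself implements this in JavaScript in the browser; this Python sketch is purely illustrative, and the class and parameter names are hypothetical, not taken from the codebase.

```python
from collections import deque

class ChatMemory:
    """Illustrative sketch (hypothetical names): keep the last N turns
    verbatim for continuity, and fold evicted turns into a rolling
    summary string so long dialogues still fit a small context window."""

    def __init__(self, max_recent=4):
        self.recent = deque(maxlen=max_recent)  # short-term memory
        self.summary = ""                        # optional rolling summary

    def add_turn(self, question, answer):
        if len(self.recent) == self.recent.maxlen:
            old_q, old_a = self.recent[0]
            # Naive stand-in for summarization: a real app could ask
            # the generator to compress the evicted turn instead.
            self.summary += f" Q: {old_q} A: {old_a[:60]}"
        self.recent.append((question, answer))

    def context(self):
        turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.recent)
        return f"Summary so far:{self.summary}\n{turns}" if self.summary else turns

mem = ChatMemory(max_recent=2)
mem.add_turn("What does build_index.py do?", "It walks the repo and embeds chunks.")
mem.add_turn("Where is the index stored?", "In public/index.json.")
print(mem.context())
```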
- UI:
- Built with Next.js + React + Tailwind CSS.
- Chat interface with sources listed per answer.
- Adjustable Top-K slider.
- Reset chat, add notes, see retrieval scores.
- Deployment:
- Deployed serverlessly on Vercel.
- First load downloads models (cached in IndexedDB for reuse).
```
Repo → Python ingestion (build_index.py) → index.json (chunks + embeddings)
                    |
                    v
        Next.js App (Vercel) ←→ Transformers.js (browser)
                    |
                    v
User Q → Embedder → Cosine Similarity (top-K chunks) → Generator
                    |
                    v
            Answer + Citations
```
- **Preprocessing**
- A Python script (`build_index.py`) walks through repo files.
- Splits code/docs into overlapping chunks.
- Generates embeddings with `all-MiniLM-L6-v2`.
- Saves everything into `public/index.json`.
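The overlapping-chunk step can be sketched like this. The chunk size and overlap values are illustrative assumptions, not the settings `build_index.py` necessarily uses; the point is that each chunk keeps its line span so answers can cite file + line ranges.

```python
def chunk_lines(lines, chunk_size=40, overlap=10):
    """Split a file's lines into overlapping chunks, recording 1-indexed
    line spans for citation display. chunk_size must exceed overlap."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(lines) - overlap, 1), step):
        end = min(start + chunk_size, len(lines))
        chunks.append({
            "text": "\n".join(lines[start:end]),
            "start_line": start + 1,
            "end_line": end,
        })
        if end == len(lines):
            break  # last chunk reached the end of the file
    return chunks
```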
- **Retrieval**
- On each query, the question is embedded in-browser.
- Chunks are scored by cosine similarity.
- Top-K most relevant chunks (and pinned notes) are selected.
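The scoring step above is plain cosine similarity over the precomputed embeddings. A minimal sketch (the real app does this in the browser over `index.json`; the chunk dict shape here is an assumption):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, chunks, k=3):
    """Score every chunk embedding against the query embedding and
    return the k highest-scoring (score, chunk) pairs."""
    scored = [(cosine(query_vec, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]
```

The adjustable Top-K slider in the UI simply changes `k` here: a larger `k` gives the generator more context at the cost of prompt length.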
- **Generation**
- A prompt is built with:
- The user’s question,
- Recent chat history,
- Retrieved context chunks,
- Guidance for handling subjective questions.
- `LaMini-Flan-T5-248M` generates a concise answer.
- Sources are displayed with file + line spans.
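Assembling the prompt from those four pieces can be sketched as below. The exact wording and ordering are assumptions, not the app's actual template; small instruction-tuned models like LaMini-Flan-T5 are sensitive to prompt phrasing, so the real template may differ.

```python
def build_prompt(question, history, contexts):
    """Combine retrieved chunks, chat history, and the question into a
    single prompt. Each context is labeled with its file + line span so
    the model's answer can be matched back to citations."""
    ctx = "\n\n".join(
        f"[{c['file']} L{c['start_line']}-{c['end_line']}]\n{c['text']}"
        for c in contexts
    )
    return (
        "Answer the question using only the context below. "
        "If the question is subjective, say so and give a balanced view.\n\n"
        f"Context:\n{ctx}\n\n"
        f"Chat history:\n{history}\n\n"
        f"Question: {question}\nAnswer:"
    )
```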
- **Deployment**
- Everything runs client-side.
- First model download is heavy (~100–250 MB), but the files are cached for subsequent visits.
- Frontend: Next.js, React, Tailwind CSS
- AI/ML:
- Transformers.js
- Hugging Face models (`MiniLM`, `LaMini-Flan-T5-248M`)
- Chroma DB (for semantic search, via MCP)
- ripgrep (for exact code search, via MCP)
- Ingestion: Python, sentence-transformers
- Deployment: Vercel (serverless, static hosting)
- Retrieval-Augmented Generation (RAG)
- Embeddings & semantic search
- Vector databases (Chroma)
- Exact search (ripgrep)
- Browser-only ML deployment (Transformers.js, WebGPU/WASM)
- Next.js/React/Tailwind frontend development
- Python scripting for data preprocessing
- Serverless deployment (Vercel)
- Trade-off analysis: cost vs performance vs usability
- Communicating model constraints in UI/UX
- Small model (248M params) → short context window (~512–1024 tokens), limited reasoning.
- First load is slow → large model files (~100–250 MB) must download to the browser.
- Answers are factual extracts → no deep reasoning across many files.
- Best for demo / proof-of-concept, not production-level repo QA.
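The short context window is the binding constraint in practice: retrieved chunks must be trimmed to leave room for the question, history, and answer. A rough sketch of that budgeting (the whitespace token count and the numbers are crude assumptions; the real tokenizer counts differently):

```python
def fit_to_budget(chunks, max_tokens=512, reserved=128):
    """Greedily keep top-ranked chunks until an approximate token budget
    is used up, reserving space for question, history, and answer.
    Assumes chunks are already sorted best-first by retrieval score."""
    budget = max_tokens - reserved
    kept, used = [], 0
    for c in chunks:
        cost = len(c["text"].split())  # crude proxy for token count
        if used + cost > budget:
            break
        kept.append(c)
        used += cost
    return kept
```

This is why raising the Top-K slider eventually stops helping: extra chunks past the budget never reach the generator.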
Deployed on Vercel:
👉 Live Demo Link