An end-to-end repository question answering system that indexes a public GitHub codebase, retrieves grounded code evidence, and generates cited answers through a retrieval-augmented generation pipeline.
This project includes:
- a React frontend for repository submission and conversational querying
- a FastAPI backend for indexing and answering questions
- a hybrid retrieval pipeline with semantic search, BM25, and reranking
- an evaluation harness for measuring retrieval quality and answer grounding
Strong applied AI projects usually show:
- clear system design
- practical backend and frontend integration
- retrieval and ranking logic beyond a single prompt
- measurable evaluation instead of anecdotal demos
- thoughtful tradeoffs around cost, latency, and persistence
Code Compass brings those elements together in one end-to-end application.
- A user pastes a GitHub repository URL into the UI.
- The backend clones the repository into a temporary local directory.
- Source files are filtered and chunked using tree-sitter and fallback text chunking.
- The system generates embeddings for chunks and stores them in a Qdrant-backed vector layer.
- At query time, the system retrieves evidence with:
- semantic vector search
- lexical BM25 search
- reciprocal rank fusion
- cross-encoder reranking
- The top grounded chunks are passed to the LLM to generate a concise answer.
- The UI displays the answer with file-level citations and GitHub source links.
┌──────────────────────┐
│ React UI │
│ repo submit + chat │
│ citations + status │
└──────────┬───────────┘
│ HTTP / JSON
▼
┌──────────────────────┐
│ FastAPI Server │
│ routes + session │
│ validation │
└──────────┬───────────┘
│
▼
┌──────────────────────────────────────────────┐
│ CodebaseRAGSystem │
│ indexing orchestration + query orchestration │
└───────┬───────────────┬───────────────┬──────┘
│ │ │
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ RepoFetcher │ │ CodeParser │ │ Embeddings │
│ clone/filter │ │ tree-sitter │ │ Vertex/local │
└──────┬───────┘ │ fallback │ └──────┬───────┘
│ └──────┬───────┘ │
│ │ │
└────────────┬────┴────────────┬────┘
▼ ▼
┌──────────────┐ ┌──────────────┐
│ SQLite │ │ Qdrant │
│ repo/status │ │ vector store │
│ metadata │ └──────┬───────┘
└──────────────┘ │
▼
┌──────────────────┐
│ Hybrid Retrieval │
│ semantic + BM25 │
│ + reranking │
└────────┬─────────┘
▼
┌──────────────────┐
│ LLM Answerer │
│ grounded answer │
│ + citations │
└──────────────────┘
- React 19
- Tailwind CSS
- Axios for API communication
Responsibilities:
- collect the GitHub repository URL
- poll indexing state
- send chat questions and prior conversation turns
- render markdown-like answers
- display cited files, symbols, and line ranges
Main entry points:
- FastAPI
- Pydantic
- SQLAlchemy
- SQLite for lightweight repository metadata
Responsibilities:
- validate requests
- manage session-scoped repository state
- run indexing in the background
- execute retrieval and answer generation
- return grounded answers and source metadata
Main entry points:
- tree-sitter for code-aware chunking
- Vertex AI or local embeddings for semantic retrieval depending on environment
- BM25 for lexical retrieval
- reciprocal rank fusion to combine retrieval channels
- a cross-encoder reranker for final source ordering
- Gemini or Groq-backed LLM generation depending on environment configuration
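The lexical side of that stack can be illustrated with a minimal Okapi BM25 scorer. This is a self-contained sketch of the standard formula, not the project's actual implementation; the tokenization and the `k1`/`b` defaults are illustrative assumptions:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    `docs` is a list of token lists; `query` is a token list.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency per term
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            num = tf[term] * (k1 + 1)
            den = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores

# toy corpus: three code chunks reduced to token lists
docs = [
    "def fetch repo clone".split(),
    "class parser tree sitter chunk".split(),
    "embedding vector store qdrant".split(),
]
scores = bm25_scores("tree sitter chunk".split(), docs)
print(scores)  # the second chunk matches all query terms
```

BM25's per-term scoring is what lets exact identifiers and file names win even when they are semantically unremarkable.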
Core modules:
- `server/src/code_parser.py`
- `server/src/embeddings.py`
- `server/src/hybrid_search.py`
- `server/src/vector_store.py`
- `server/src/repo_fetcher.py`
`POST /api/repos/index`

- The backend registers the repo against a session
- Background task clones the repo
- Files are filtered by extension, directory, and size
- Files are chunked into code-aware segments
- Embeddings are generated for each chunk
- Chunks are stored in the vector layer and in in-memory retrieval state
- Metadata and progress are exposed back to the UI
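Because indexing runs as a background task, clients poll for progress. A minimal polling sketch is below; the status values (`"indexing"`, `"ready"`, `"failed"`) and the injected `fetch_status` callable are assumptions standing in for a real GET against the status route:

```python
import time

def wait_for_indexing(fetch_status, poll_seconds=2.0, timeout=600):
    """Poll a status source until indexing finishes or times out.

    `fetch_status` is any callable returning a status dict; the
    status values here are illustrative, not the exact API contract.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = fetch_status()
        if state.get("status") in ("ready", "failed"):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("indexing did not finish in time")

# usage with a fake status source (stands in for the HTTP status endpoint)
states = iter([{"status": "indexing"}, {"status": "indexing"}, {"status": "ready"}])
state = wait_for_indexing(lambda: next(states), poll_seconds=0.0)
print(state)  # {'status': 'ready'}
```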
`POST /api/query`

- The backend validates the session and repository status
- The question is expanded using lightweight intent heuristics
- Semantic search retrieves candidate chunks
- BM25 retrieves lexical matches
- Results are fused and reranked
- Final sources are selected and passed to the LLM
- The backend returns:
  - `answer`
  - `confidence`
  - `sources`
  - repository metadata
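Assuming a response of roughly that shape (the field names below are inferred from the list above, not a verbatim schema), the citation rendering the UI performs can be sketched as:

```python
def format_answer(response):
    """Render an answer with file-level citations from a query response.

    The field names ("answer", "sources", "file_path", "start_line",
    "end_line") are illustrative assumptions, not the exact schema.
    """
    lines = [response["answer"], ""]
    for i, src in enumerate(response.get("sources", []), start=1):
        lines.append(f"[{i}] {src['file_path']}:{src['start_line']}-{src['end_line']}")
    return "\n".join(lines)

sample = {
    "answer": "Indexing is run as a FastAPI background task.",
    "confidence": 0.82,
    "sources": [
        {"file_path": "server/src/repo_fetcher.py", "start_line": 10, "end_line": 42},
    ],
}
rendered = format_answer(sample)
print(rendered)
```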
- fast iteration speed
- strong request validation through Pydantic
- simple background task support
- clean fit for JSON APIs and model-driven backend code
- straightforward stateful UI for a single-page workflow
- easy integration with polling, chat state, and citation rendering
- strong ecosystem for incremental iteration
- better chunk boundaries than naive fixed-length splitting
- lets the system reason around functions, classes, and symbols
- improves retrieval quality for implementation-focused questions
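When tree-sitter cannot produce clean syntactic boundaries for a file, the fallback text chunker takes over. A minimal sketch of such a fallback (line-based windows with overlap; the window sizes are illustrative defaults, not the project's actual parameters):

```python
def fallback_chunks(text, max_lines=40, overlap=5):
    """Split source text into overlapping line-based chunks.

    Used only when no syntax-aware boundaries are available; the
    parameters here are illustrative assumptions.
    """
    lines = text.splitlines()
    chunks = []
    step = max_lines - overlap
    for start in range(0, max(len(lines), 1), step):
        window = lines[start:start + max_lines]
        if not window:
            break
        chunks.append("\n".join(window))
        if start + max_lines >= len(lines):
            break
    return chunks

source = "\n".join(f"line {i}" for i in range(100))
chunks = fallback_chunks(source)
print(len(chunks))  # 3 overlapping 40-line windows
```

The overlap keeps a definition and its first uses from being split across a hard boundary, which matters less once tree-sitter boundaries are available.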
Pure semantic search misses exact symbols and file names. Pure lexical search misses semantic intent. This project combines both because code questions often need:
- exact identifiers
- nearby implementation detail
- cross-file semantic similarity
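Reciprocal rank fusion combines the two ranked lists without needing their scores to be calibrated against each other. A compact sketch of the standard RRF formula (the constant `k=60` is the common default from the literature, assumed rather than taken from the project's code):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list is ordered best-first; score(d) = sum over lists of
    1 / (k + rank(d)), with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# toy example: one semantic ranking, one BM25 ranking
semantic = ["chunk_a", "chunk_b", "chunk_c"]
bm25 = ["chunk_c", "chunk_a", "chunk_d"]
fused = reciprocal_rank_fusion([semantic, bm25])
print(fused)  # chunk_a ranks first: it scores well in both lists
```

Because RRF only looks at ranks, a chunk that appears in both channels is rewarded even when the channels' raw scores live on incomparable scales.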
- simple vector abstraction
- local in-memory mode for fast demos
- cloud-compatible path for later deployment
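The contract the vector layer needs is small: upsert embedded chunks with payloads, then return nearest neighbors by similarity. A dependency-free sketch of that contract is below; Qdrant's in-memory mode provides the same operations under its own API, so this is an illustration of the abstraction, not of Qdrant's actual client:

```python
import math

class InMemoryVectorStore:
    """Minimal stand-in for the vector layer, for illustration only."""

    def __init__(self):
        self._points = {}  # id -> (vector, payload)

    def upsert(self, point_id, vector, payload):
        self._points[point_id] = (vector, payload)

    def search(self, query, top_k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(
            self._points.items(),
            key=lambda item: cosine(query, item[1][0]),
            reverse=True,
        )
        return [(pid, payload) for pid, (vec, payload) in ranked[:top_k]]

store = InMemoryVectorStore()
store.upsert("c1", [1.0, 0.0], {"file": "a.py"})
store.upsert("c2", [0.0, 1.0], {"file": "b.py"})
hits = store.search([0.9, 0.1], top_k=1)
print(hits)  # the query vector is closest to c1
</imports>```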
- enough for lightweight repository/session metadata
- very low operational overhead
- matches the project goal of staying simple while preserving useful state
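The metadata SQLite has to hold is modest, on the order of one row per indexed repository. A sketch with the standard library's `sqlite3` module; the column names are illustrative, not the project's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # file-backed in the real service
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS repos (
        repo_id TEXT PRIMARY KEY,
        url TEXT NOT NULL,
        status TEXT NOT NULL,          -- e.g. indexing / ready / failed
        chunk_count INTEGER DEFAULT 0
    )
    """
)
conn.execute(
    "INSERT INTO repos (repo_id, url, status) VALUES (?, ?, ?)",
    ("r1", "https://github.com/example/repo", "indexing"),
)
# background indexing finishes and flips the status
conn.execute(
    "UPDATE repos SET status = ?, chunk_count = ? WHERE repo_id = ?",
    ("ready", 412, "r1"),
)
row = conn.execute(
    "SELECT status, chunk_count FROM repos WHERE repo_id = 'r1'"
).fetchone()
print(row)  # ('ready', 412)
```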
Local development and the evaluation harness are designed around Vertex AI:
- Vertex AI Gemini for answer generation
- Vertex AI embeddings for semantic retrieval
This setup is useful for:
- higher quality local experiments
- benchmark runs with RAGAS
- comparing retrieval and answer quality in a stronger managed-model environment
The production deployment target is:
- frontend on Vercel
- backend on Hugging Face Spaces
Production inference is configured differently from local/eval:
- Groq-hosted Llama for answer generation
- lightweight local sentence-transformer embeddings for semantic retrieval
This production setup was chosen to fit comfortably within the Hugging Face Spaces free-tier constraints while keeping the retrieval and answer pipeline intact.
Recommended production runtime:
```
LLM_PROVIDER=groq
EMBEDDING_PROVIDER=local
LIGHTWEIGHT_LOCAL_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
```
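A sketch of how those variables might drive provider selection at startup; the function name and defaults are illustrative assumptions, and the project's actual wiring lives in the backend configuration:

```python
import os

def resolve_providers(env=os.environ):
    """Pick LLM and embedding backends from environment variables.

    Defaults to Vertex AI (the local/eval setup); production overrides
    them via the variables shown above. Names here are illustrative.
    """
    llm = env.get("LLM_PROVIDER", "vertex")
    embeddings = env.get("EMBEDDING_PROVIDER", "vertex")
    model = env.get(
        "LIGHTWEIGHT_LOCAL_EMBEDDING_MODEL",
        "sentence-transformers/all-MiniLM-L6-v2",
    )
    return {"llm": llm, "embeddings": embeddings, "embedding_model": model}

# production-style environment
prod_env = {"LLM_PROVIDER": "groq", "EMBEDDING_PROVIDER": "local"}
config = resolve_providers(prod_env)
print(config)
```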
- Vercel hosts the React frontend
- Hugging Face Spaces hosts the FastAPI backend
- the backend is packaged and deployed as a Docker Space
- GitHub Actions syncs the backend code to the Space on pushes to `main`
The backend is deployed with Docker using:
The container:
- installs Python dependencies
- copies the backend application
- starts the FastAPI app with Uvicorn on port `7860`
Continuous deployment is handled through:
The workflow:
- runs on pushes to `main`
- syncs the `server/` directory to the Hugging Face Space
- triggers the Docker Space rebuild automatically
The project includes an end-to-end eval harness that calls the live API instead of mocking the retrieval pipeline.
Files:
The benchmark currently measures:
- retrieval hit rate
- top-1 hit rate
- mean reciprocal rank
- source recall
- duplicate source rate
- keyword-based answer checks
- grounded answer rate
- optional RAGAS judge metrics such as faithfulness and answer relevancy
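The rank-based metrics above are straightforward to compute from per-case retrieval results. A simplified sketch (the harness's exact matching rules may differ; "relevance" here is plain set membership on file paths):

```python
def retrieval_metrics(cases):
    """Compute hit rate, top-1 hit rate, and mean reciprocal rank.

    Each case is (retrieved_files, relevant_files); a "hit" means any
    retrieved file is in the relevant set. A miss contributes 0 to MRR.
    """
    hits = top1 = 0
    rr_total = 0.0
    for retrieved, relevant in cases:
        relevant = set(relevant)
        ranks = [i for i, f in enumerate(retrieved, start=1) if f in relevant]
        if ranks:
            hits += 1
            rr_total += 1.0 / ranks[0]
            if ranks[0] == 1:
                top1 += 1
    n = len(cases)
    return {"hit_rate": hits / n, "top1_rate": top1 / n, "mrr": rr_total / n}

# three toy cases: a top-1 hit, a rank-2 hit, and a miss
cases = [
    (["a.py", "b.py"], ["a.py"]),
    (["c.py", "a.py"], ["a.py"]),
    (["d.py"], ["a.py"]),
]
metrics = retrieval_metrics(cases)
print(metrics)  # hit_rate 2/3, top1_rate 1/3, mrr 0.5
```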
The project includes a measurable end-to-end evaluation workflow alongside the product itself.
Latest expanded internal benchmark:
- 37 evaluation cases
- 8 categories
- 4 multi-turn conversation cases
Headline metrics from the current benchmark run:
| Metric | Result |
|---|---|
| Retrieval hit rate | 95.0% |
| Top-1 hit rate | 75.0% |
| Mean reciprocal rank | 0.85 |
| Source recall | 74.2% |
| Faithfulness (RAGAS) | 0.917 |
| Answer relevancy (RAGAS) | 0.843 |
| Context precision (RAGAS) | 0.767 |
What these numbers mean:
- the system retrieves at least one relevant source for the large majority of benchmark cases
- the first-ranked source is relevant in most cases, with the biggest remaining opportunity in canonical file ranking for harder prompts
- the benchmark includes architecture, API, setup, docs, tests, cross-file, and conversation-style questions
- the evaluation provides a solid engineering benchmark for the current system and repo scope
Benchmark strengths:
- strong retrieval on architecture and setup questions
- grounded answers with source-linked citations
- measurable end-to-end performance instead of anecdotal examples
Benchmark-exposed weaknesses:
- duplicate source retrieval still appears in some cases
- some cross-file and test-heavy questions remain harder than single-file API questions
- canonical implementation files are not always ranked first on the hardest prompts
- full-stack architecture with a clear data flow
- code-aware retrieval rather than plain document retrieval
- practical hybrid search design
- session-aware repo isolation
- source-grounded answer generation
- explicit benchmark and evaluation workflow
- retrieval state is intentionally session-scoped and mostly in memory
- cloned repositories are temporary and deleted after indexing
- repository metadata is lightweight and persisted separately from vector state
- if the backend restarts, repositories must be re-indexed
- the benchmark is strong for the current project scope and can be expanded further across repositories over time
```shell
cd server
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python server_app.py
```

The backend runs on http://localhost:8000.
```shell
cd ui
npm install
npm start
```

The frontend runs on http://localhost:3000.
Create `ui/.env`:

```
REACT_APP_API_URL=http://localhost:8000
```

From the `server` directory:
```shell
CODEBASE_RAG_API_URL=http://localhost:8000 \
CODEBASE_RAG_SESSION_ID=<session-id> \
CODEBASE_RAG_REPO_ID=<repo-id> \
CODEBASE_RAG_EVAL_OUTPUT=evals/latest_eval_report.json \
python evals/run_eval.py
```

The output report includes:
- eval-set audit warnings
- headline metrics
- category breakdowns
- case-by-case detail
- a summary string suitable for project reporting
If you want to save the latest run as a JSON artifact:
```
CODEBASE_RAG_EVAL_OUTPUT=evals/latest_eval_report.json
```

Project layout:

```
server/
  server_app.py
  evals/
  src/
ui/
  src/
README.md
```

