Languages: English | 中文
This service indexes Git repositories from common hosts (GitLab, GitHub, Gitea) or any clone URL (via manual trigger) into a searchable vector knowledge base. On main/master push webhooks—or when you call the trigger API—it pulls code, chunks it at the function level (falls back to file level when parsing yields zero functions), optionally generates a one-line description per chunk via an LLM, then embeds and upserts into Chroma. You can query via HTTP APIs, use the bundled admin UI, feed hits into Dify (API Tool), or call the code Q&A endpoints.
Admin UI (`/admin/`): an Overview page with indexed projects and quick links, plus a Semantic search page for natural-language queries over the vector index.
| Overview | Semantic search |
|---|---|
| ![]() | ![]() |
- Auto indexing: webhooks for GitLab, GitHub, and Gitea on `main`/`master` push (a serial worker avoids concurrent write failures); other hosts can use the manual trigger or CI calling the same enqueue API
- Progress tracking: enqueue returns a `job_id`; query job status/progress at any time
- Semantic search: results include `path`, `name`, `start_line`, `end_line`, etc. for quick navigation
- Code Q&A (optional LLM): `POST /api/code-chat` and a streaming variant (see OpenAPI at `/docs`)
```bash
cp .env.example .env
docker compose up -d
curl "http://localhost:8000/health"
```

Then open:

- `http://localhost:8000/docs` for OpenAPI
- `http://localhost:8000/admin/` for the admin UI
Minimum required settings before first useful indexing:
- Embeddings (see Environment variables): the default `EMBED_PROVIDER=ollama` needs `OLLAMA_BASE_URL` and `EMBED_MODEL`; use `EMBED_PROVIDER=openai` with `OPENAI_EMBED_BASE_URL`, `OPENAI_EMBED_API_KEY`, and an OpenAI `EMBED_MODEL` (e.g. `text-embedding-3-small`). The embedding `OPENAI_*` settings are not the same as the chat `OPENAI_*` settings.
- Private HTTPS repos (optional): `GIT_HTTPS_TOKEN` or `GITLAB_ACCESS_TOKEN`
- Webhook signature verification (recommended outside a LAN): set each platform's secret
| Path | Purpose |
|---|---|
| `backend/app/` | Python / FastAPI service, indexing, wiki, vector store |
| `backend/requirements.txt` | Backend dependencies |
| `frontend/` | React + Vite admin UI (built assets served at `/admin/`) |
| `docs/images/` | README screenshots |
| `LICENSE` | MIT license |
| `scripts/` | Helper scripts |
Webhook push (GitLab / GitHub / Gitea, main/master) / Manual trigger (any Git URL)
↓
Enqueue job (SQLite persistence, single serial worker)
↓
clone_or_pull: clone/pull repo (optional HTTPS auth: GIT_HTTPS_TOKEN or GITLAB_ACCESS_TOKEN, see below)
↓
collect_files: scan common code/config files (skip `node_modules` / `.git` / `.env` etc.)
↓
parse_functions: Tree-sitter function-level parsing (0 results → file-level fallback)
↓
describe_chunks (optional): one-line descriptions via `LLM_PROVIDER` (Dify / Azure OpenAI / OpenAI-compatible)
↓
generate_wiki: static wiki (MkDocs / Starlight / VitePress) under DATA_DIR/wiki_sites
↓
upsert_vector_store: embeddings (`EMBED_PROVIDER`: Ollama or OpenAI-compatible) → Chroma upsert
↓
query/search: semantic retrieval (for Dify/frontend)
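The `parse_functions` fallback rule can be illustrated with a small sketch. This is not the service's actual code: the chunk dict shape is illustrative, loosely mirroring the metadata fields later returned by search.

```python
def chunk_file(path: str, source: str, functions: list) -> list:
    """Function-level chunks; fall back to one file-level chunk when parsing
    yields zero functions (the behavior described in the pipeline above)."""
    if functions:
        return [{"path": path, "kind": "function", "name": f["name"],
                 "start_line": f["start_line"], "end_line": f["end_line"]}
                for f in functions]
    # Zero functions parsed: index the whole file as a single coarse chunk.
    return [{"path": path, "kind": "file", "name": path,
             "start_line": 1, "end_line": len(source.splitlines())}]

print(chunk_file("util.sh", "echo hi\n", []))
# [{'path': 'util.sh', 'kind': 'file', 'name': 'util.sh', 'start_line': 1, 'end_line': 1}]
```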
```bash
cp .env.example .env
```

Typical minimum config:

- Private HTTPS repos: set `GITLAB_ACCESS_TOKEN` and/or `GIT_HTTPS_TOKEN` (see Private HTTPS clone); for a GitHub PAT use `GIT_HTTPS_USERNAME=x-access-token`
- Embeddings: set `EMBED_PROVIDER` (`ollama` or `openai`) and the matching variables (see Environment variables); changing the provider or embedding dimension requires clearing `DATA_DIR/chroma` and re-indexing
- Optional LLM descriptions / code chat: set `LLM_PROVIDER` to `dify`, `azure_openai`, or `openai` (default `openai`) and configure only that provider's keys (see LLM provider)
```bash
docker compose up -d
```

- Service: `http://localhost:8000`
- Docs: `http://localhost:8000/docs`
- Rebuild after code changes: `docker compose build --no-cache && docker compose up -d`
All webhook routes only enqueue on `main` or `master` pushes. A successful enqueue returns `job_id` in the JSON body.
If the corresponding secret env var is unset, signature verification is skipped (convenient for LAN testing; not recommended on the public internet).
In the project Settings → Webhooks:
- URL: `http://<host>:8000/webhook/gitlab`
- Secret: same as `GITLAB_WEBHOOK_SECRET` (sent/verified per your GitLab setup)
- Events: Push events
- The handler accepts `object_kind=push` only.
In the repo Settings → Webhooks → Add webhook:
- Payload URL: `http://<host>:8000/webhook/github`
- Content type: `application/json`
- Secret: same as `GITHUB_WEBHOOK_SECRET` (HMAC SHA-256, header `X-Hub-Signature-256`)
- Events: Just the push event (ping events are acknowledged with `200` and ignored)
In the repo Settings → Webhooks:
- URL: `http://<host>:8000/webhook/gitea`
- Secret: same as `GITEA_WEBHOOK_SECRET` (header `X-Gitea-Signature`, HMAC-SHA256 hex of the raw body)
- Events: Push
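The two signature schemes above (GitHub's `sha256=`-prefixed `X-Hub-Signature-256` and Gitea's bare-hex `X-Gitea-Signature`) come down to a few lines of HMAC. A minimal sketch, not the service's actual handler code:

```python
import hashlib
import hmac

def verify_github(secret: str, raw_body: bytes, header_value: str) -> bool:
    """Check X-Hub-Signature-256: 'sha256=' + hex HMAC-SHA256 of the raw body."""
    expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_value)

def verify_gitea(secret: str, raw_body: bytes, header_value: str) -> bool:
    """Check X-Gitea-Signature: bare hex HMAC-SHA256 of the raw body."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_value)

body = b'{"ref":"refs/heads/main"}'
sig = hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
print(verify_github("s3cret", body, "sha256=" + sig))  # True
print(verify_gitea("s3cret", body, sig))               # True
```

Note `hmac.compare_digest` for constant-time comparison; comparing with `==` leaks timing information.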
For private repositories over HTTPS, the worker injects credentials into the clone URL, provided the URL does not already embed userinfo:
| Variable | Role |
|---|---|
| `GIT_HTTPS_TOKEN` | If set, takes precedence over `GITLAB_ACCESS_TOKEN` |
| `GITLAB_ACCESS_TOKEN` | Still supported (e.g. a GitLab `read_repository` PAT) |
| `GIT_HTTPS_USERNAME` | HTTP basic username for the token; defaults to `oauth2` (GitLab) when empty. For GitHub, set `x-access-token` |
You can also set `GIT_HTTPS_USERNAME` from the admin Settings UI (`/admin/`).
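Credential injection as described above might look like the following sketch; `inject_credentials` is a hypothetical helper, and the worker's real logic may differ in details:

```python
from urllib.parse import urlsplit, urlunsplit

def inject_credentials(url: str, token: str, username: str = "oauth2") -> str:
    """Insert basic-auth userinfo into an HTTPS clone URL, unless the URL
    already embeds userinfo or is not HTTPS (e.g. an SSH URL)."""
    parts = urlsplit(url)
    if parts.scheme != "https" or "@" in parts.netloc:
        return url  # leave SSH URLs and URLs with existing userinfo untouched
    netloc = f"{username}:{token}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(inject_credentials("https://gitlab.example.com/acme/backend.git", "glpat-123"))
# https://oauth2:glpat-123@gitlab.example.com/acme/backend.git
```

For GitHub, call it with `username="x-access-token"`, matching the `GIT_HTTPS_USERNAME` guidance above.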
```bash
curl -X POST "http://localhost:8000/webhook/trigger" \
  -H "Content-Type: application/json" \
  -d '{"repo_url":"https://github.com/acme/backend.git","project_id":"acme/backend","project_name":"Display name"}'
```

Use any HTTPS or SSH URL your runtime can `git clone`. The optional `project_name` is a human-readable label (e.g. a Chinese name); it is stored on the job and shown on the wiki home page, the site title, and `manifest.json`.
```bash
curl -X POST "http://localhost:8000/api/index-jobs/enqueue" \
  -H "Content-Type: application/json" \
  -d '{"repo_url":"https://github.com/acme/backend.git","project_id":"acme/backend","project_name":"Display name"}'
```

- POST `/api/query`
```bash
curl -X POST "http://localhost:8000/api/query" \
  -H "Content-Type: application/json" \
  -d '{"query":"How is user login implemented?","project_id":"my-repo","top_k":10}'
```

- GET `/api/search`

```bash
curl "http://localhost:8000/api/search?q=login&project_id=my-repo&top_k=10"
```

Response shape:
```json
{
  "results": [
    {
      "score": 0.123,
      "content": "...",
      "metadata": {
        "path": "app/auth.py",
        "name": "login",
        "kind": "function",
        "start_line": 10,
        "end_line": 88
      }
    }
  ]
}
```

- GET `/api/projects`: list indexed projects and doc counts; each item includes `project_name` (display name, may be `null`); optional `q` (substring match on `project_id` or `project_name`), `limit`/`offset` pagination (omit `limit` for the full list, backward compatible)
- DELETE `/api/projects/{project_id}`: remove the project's vectors and metadata
- POST `/api/projects/{project_id}/reindex`: enqueue a rebuild for the project
- GET `/api/projects/{project_id}/vectors`: inspect stored vectors with pagination
- GET `/api/project/index-status?project_id=xxx`: check whether a project is indexed (`indexed`/`doc_count`)
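The hits returned by `/api/query` and `/api/search` carry enough metadata to jump straight to the code. A small client-side sketch turning hits into `path:start-end` locations; the sample payload just mirrors the response shape shown above:

```python
import json

# Sample payload matching the documented response shape (illustrative values).
response = json.loads("""
{"results": [{"score": 0.123, "content": "...",
  "metadata": {"path": "app/auth.py", "name": "login", "kind": "function",
               "start_line": 10, "end_line": 88}}]}
""")

def to_location(hit: dict) -> str:
    """Render one search hit as an editor-friendly jump target."""
    m = hit["metadata"]
    return f'{m["path"]}:{m["start_line"]}-{m["end_line"]} ({m["name"]})'

print([to_location(h) for h in response["results"]])
# ['app/auth.py:10-88 (login)']
```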
- Auth UI: `GET /api/auth/status`, `POST /api/auth/login`, `GET /api/auth/me`
- Admin settings: `GET /api/admin/settings`
- Storage insight: `GET /api/admin/storage`
- LLM usage metrics: `GET /api/admin/llm-usage`
- Code chat feedback: `POST /api/code-chat/feedback`
After `describe_chunks`, the worker runs a wiki build before the vector upsert. Failures are logged only and do not block indexing. The default `WIKI_BACKEND=mkdocs` (Material, Python-only); `starlight` or `vitepress` need Node.js + npm (included in this repo's Dockerfile; install them yourself on bare metal). The first build runs `npm install` under `wiki_work/<project_id>` (npm registry access required). Output: `DATA_DIR/wiki_sites/<project_id>/site/`, served by the API.
- Browse: `http://<host>:8000/wiki/<project_id>/site/` (same `project_id` rules as under `repos/`)
- Metadata: `GET /api/wiki/{project_id}` → the last `manifest.json` (includes `wiki_backend`, commit, timestamps, counts)
Pages include overview, architecture (when an LLM is configured), file index (tree), per-file symbol pages, and a symbol table (split into parts when large). MkDocs uses Lunr search; Starlight/VitePress use built-in local search. On each symbol, Functionality shows only the LLM-generated one-line description from the indexing pipeline (same field as in the vector store); if the model is not configured or generation failed, a placeholder explains that. When a source docstring differs from that text, it appears separately under Source Docstring.
Indexing runs through a serial queue (avoids concurrent writes to Chroma / local repo dirs). Job state is persisted in SQLite so you can still query history after restarts.
- List jobs: `GET /api/index-jobs?limit=50&offset=0` (optional `status`/`project_id` filters; in the response, `total` is the full match count, `jobs` is the current page, and `limit`/`offset` echo the request)
- Get one job: `GET /api/index-jobs/{job_id}`
- Cancel a job: `POST /api/index-jobs/{job_id}/cancel` (supports `queued` and `running`; running jobs are terminated)
- Retry a failed/cancelled job: `POST /api/index-jobs/{job_id}/retry`
- Precheck a repo before enqueue: `POST /api/index-jobs/precheck`
Key fields:
- status: `queued`/`running`/`succeeded`/`failed`/`cancelled`
- progress: 0-100
- step: stage name (e.g. `clone_or_pull`/`parse_functions`/`generate_wiki`/`upsert_vector_store`)
- message: human-friendly stage message
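On the client side, these fields are enough to render a progress line and decide when to stop polling. A minimal sketch with hypothetical helper names (the status values and fields come from the list above):

```python
# The three states after which a job never changes again.
TERMINAL = {"succeeded", "failed", "cancelled"}

def summarize(job: dict) -> str:
    """One-line progress summary built from the documented job fields."""
    return f'[{job["status"]}] {job.get("progress", 0)}% {job.get("step", "")}'.strip()

def done(job: dict) -> bool:
    """True once polling GET /api/index-jobs/{job_id} can stop."""
    return job["status"] in TERMINAL

job = {"status": "running", "progress": 40, "step": "parse_functions",
       "message": "parsing functions"}
print(summarize(job))  # [running] 40% parse_functions
print(done(job))       # False
```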
Most values can also be changed from the admin UI (`/admin/` → Settings); overrides are stored in `DATA_DIR/ui_overrides.json` and take precedence over `.env` for supported keys.
| Variable | Description |
|---|---|
| `DATA_DIR` | Data directory (default `./data`; commonly `/data` in containers) |
Embeddings — set `EMBED_PROVIDER` to `ollama` (default) or `openai`:
| Variable | When | Description |
|---|---|---|
| `EMBED_PROVIDER` | Always | `ollama` (default) or `openai`. |
| `OLLAMA_BASE_URL` | `EMBED_PROVIDER=ollama` | Ollama base URL (default `http://localhost:11434`; in Docker often `http://host.docker.internal:11434`). |
| `OLLAMA_API_KEY` | Optional | Bearer token if Ollama sits behind a gateway. |
| `EMBED_MODEL` | Always | With Ollama: Ollama embeddings model name. With OpenAI: embeddings model id (e.g. `text-embedding-3-small`). If you change the provider/model/dimension, clear `DATA_DIR/chroma` and re-index. |
| `OPENAI_EMBED_BASE_URL` | `EMBED_PROVIDER=openai` | OpenAI-compatible embeddings API base (usually ends with `/v1`). Independent of the `OPENAI_BASE_URL` used for chat. |
| `OPENAI_EMBED_API_KEY` | `EMBED_PROVIDER=openai` | API key for embeddings only. Independent of `OPENAI_API_KEY`. |
| Variable | Description |
|---|---|
| `GITLAB_WEBHOOK_SECRET` | GitLab webhook secret (if unset, verification is skipped) |
| `GITHUB_WEBHOOK_SECRET` | GitHub webhook secret for `X-Hub-Signature-256` (if unset, skipped) |
| `GITEA_WEBHOOK_SECRET` | Gitea webhook secret for `X-Gitea-Signature` (if unset, skipped) |
| `GITLAB_ACCESS_TOKEN` | HTTPS clone token (private repos); still used if `GIT_HTTPS_TOKEN` is empty |
| `GIT_HTTPS_TOKEN` | HTTPS clone token; overrides `GITLAB_ACCESS_TOKEN` when set |
| `GIT_HTTPS_USERNAME` | HTTPS basic username for clone (`oauth2` by default; GitHub: `x-access-token`) |
| `GITLAB_EXTERNAL_URL` | Optional base URL for "open repo" links when no `repo_url` is stored (GitLab-style paths) |
| `WIKI_BACKEND` | `mkdocs` (default) / `starlight` / `vitepress` (the last two need Node.js + npm; the Docker image includes them) |
| `WIKI_ENABLED` | `false` / `0` disables wiki generation after describe (default: on) |
| `SKIP_WIKI` | `1` skips the wiki for a run without changing `WIKI_ENABLED` |
| `EMBED_MAX_CHARS` | Max characters sent per embedding request before truncation (default 30000) |
| `CONTENT_LANGUAGE` | `zh` or `en`; LLM/wiki output language (default `zh`) |
| `INDEX_EXCLUDE_PATTERNS` | Comma- or newline-separated path globs to skip during indexing |
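The exact matching semantics of `INDEX_EXCLUDE_PATTERNS` are not documented here; a plausible sketch assuming `fnmatch`-style globs (both helper names are hypothetical):

```python
from fnmatch import fnmatch

def parse_patterns(raw: str) -> list:
    """Split a comma- or newline-separated INDEX_EXCLUDE_PATTERNS value."""
    parts = raw.replace(",", "\n").splitlines()
    return [p.strip() for p in parts if p.strip()]

def excluded(path: str, patterns: list) -> bool:
    """True if any glob matches the repo-relative path.

    Note fnmatch's '*' crosses '/' boundaries, so 'vendor/*' also matches
    nested files like 'vendor/lib/a.py'."""
    return any(fnmatch(path, pat) for pat in patterns)

patterns = parse_patterns("vendor/*, *.min.js\n docs/generated/*")
print(excluded("vendor/lib/a.py", patterns))  # True
print(excluded("src/app.py", patterns))       # False
```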
Chunk descriptions and code Q&A use exactly one backend, chosen by `LLM_PROVIDER`: `dify`, `azure_openai`, or `openai` (default). Configure only the variables for the selected provider.
| Variable | Description |
|---|---|
| `LLM_PROVIDER` | `dify` \| `azure_openai` \| `openai` (default `openai`). |
| `DIFY_API_KEY` | Dify API key (when `LLM_PROVIDER=dify`). |
| `DIFY_BASE_URL` | Dify API base URL (default `https://api.dify.ai/v1`). |
| `AZURE_OPENAI_API_KEY` | Azure OpenAI key (when `LLM_PROVIDER=azure_openai`). |
| `AZURE_OPENAI_ENDPOINT` | Azure endpoint (e.g. `https://xxx.cognitiveservices.azure.com`). |
| `AZURE_OPENAI_VERSION` | Azure API version. |
| `AZURE_OPENAI_DEPLOYMENT` | Azure deployment name. |
| `OPENAI_API_KEY` | OpenAI (or compatible) key for chat (when `LLM_PROVIDER=openai`). |
| `OPENAI_BASE_URL` | Chat/completions base URL (default `https://api.openai.com/v1`). |
| `OPENAI_MODEL` | Chat model or deployment name. |
If no LLM is configured for the selected provider, indexing and retrieval still work; chunk descriptions may be missing or placeholders, and code Q&A will not call a model.
| Variable | Description |
|---|---|
| `REPOS_CACHE_MAX_GB` | Soft cap on the total size (GiB) of `DATA_DIR/repos` mirrors; evicts other projects' dirs LRU-style before clone/pull; `0` disables |
| `REPOS_CACHE_MAX_COUNT` | Max number of cached repo dirs under `DATA_DIR/repos` (including the current job); evicts other projects LRU-style; `0` disables |
| `SKIP_VECTOR_STORE` | If `1`, runs clone/parse/(optional LLM) but skips the Chroma upsert (useful for local validation) |
| `INCREMENTAL_INDEX` | Set to `1` / `true` to enable incremental vector indexing (default off). Requires `project_index.sqlite3` to already have `last_indexed_commit` and vectors to use stable `gv2_` IDs; otherwise it automatically falls back to full indexing |
| `FORCE_FULL_INDEX` | Forces a full vector reindex for a run (overrides incremental mode) |
| `WIKI_KEEP_WORK` | `1` keeps the intermediate `wiki_work/<project_id>` for debugging |
| `WIKI_MAX_FILE_PAGES` | Max per-path file pages (default 5000) |
| `WIKI_SYMBOL_ROWS_PER_FILE` | Max symbol table rows per Markdown file (default 4000) |
| `NPM_REGISTRY` | Optional. When using `starlight` / `vitepress`, sets `npm_config_registry` for npm if non-empty. Alternatively set `npm_config_registry` / `NPM_CONFIG_REGISTRY` in the environment (the uppercase form is mapped to `npm_config_registry` for subprocesses) |
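The LRU eviction behind `REPOS_CACHE_MAX_GB`/`REPOS_CACHE_MAX_COUNT` can be sketched as a pure planning function. This is a hypothetical helper illustrating the policy described in the table (evict least-recently-used mirrors of *other* projects until under both caps, never the current job's mirror; `0` disables a cap); the real implementation is not shown here:

```python
def plan_evictions(mirrors: dict, current: str,
                   max_gb: float = 0, max_count: int = 0) -> list:
    """mirrors maps project_id -> (last_used_ts, size_gb).
    Returns project_ids to remove, oldest first; 'current' is never evicted."""
    others = sorted((pid for pid in mirrors if pid != current),
                    key=lambda pid: mirrors[pid][0])  # least recently used first
    evict = []
    total_gb = sum(size for _, size in mirrors.values())
    count = len(mirrors)
    for pid in others:
        over_gb = max_gb and total_gb > max_gb      # 0 disables the size cap
        over_count = max_count and count > max_count  # 0 disables the count cap
        if not (over_gb or over_count):
            break
        evict.append(pid)
        total_gb -= mirrors[pid][1]
        count -= 1
    return evict

mirrors = {"a": (100.0, 3.0), "b": (200.0, 3.0), "cur": (300.0, 3.0)}
print(plan_evictions(mirrors, "cur", max_gb=5))  # ['a', 'b']
```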
By default under `DATA_DIR`:

- Repo mirrors: `DATA_DIR/repos/<project_id>/...` (optional `REPOS_CACHE_MAX_GB`/`REPOS_CACHE_MAX_COUNT` auto-remove the least-recently-used other project mirrors to save disk; the repo for the job currently indexing is never deleted)
- Vector store: `DATA_DIR/chroma/`
- Jobs DB: `DATA_DIR/index_jobs.sqlite3`
- Project vector index metadata: `DATA_DIR/project_index.sqlite3` (doc_count, display name, plus the incremental fields `last_indexed_commit`/`last_embed_model`)
- Static wiki: `DATA_DIR/wiki_sites/<project_id>/site/` plus `manifest.json` (the intermediate `wiki_work/` is removed unless `WIKI_KEEP_WORK=1`)
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r backend/requirements.txt
cp .env.example .env
cd backend && uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload
```

Admin UI local dev (optional, second terminal; set `CORS_ORIGINS=http://localhost:5173`):

```bash
cd frontend && npm install && npm run dev
```

Notes:
- When started from `backend/`, the default `DATA_DIR=./data` resolves to `backend/data/`; set `DATA_DIR=../data` in `.env` if you want data at the repository root.
- On startup the service attempts to start the indexing queue worker; vector store/embedding objects are typically loaded on the first index or first query.
- If you see `No function-level chunks parsed ...; using file-level fallback`, parsing produced zero function chunks and the service fell back to file-level chunks (retrieval still works, but granularity is coarser).
When adding or changing any public API, queue behavior, environment variable, or admin page:
- Update both `README.md` and `README.zh-CN.md` in the same PR.
- Keep the same section order and endpoint coverage in both files.
- Ensure at least one runnable curl example still works after the change.

