Git Code Indexing → Vector Store (Chroma) → Semantic Search (Dify / Admin UI)

Languages: English | 中文

This service indexes Git repositories from common hosts (GitLab, GitHub, Gitea), or from any clone URL via manual trigger, into a searchable vector knowledge base. On a push to main or master (delivered by webhook), or when you call the trigger API, it pulls the code, chunks it at the function level (falling back to file level when parsing yields zero functions), optionally generates a one-line description per chunk with an LLM, then embeds the chunks and upserts them into Chroma. You can query via HTTP APIs, use the bundled admin UI, feed hits into Dify (API Tool), or call the code Q&A endpoints.

Screenshots

Admin UI (/admin/): Overview with indexed projects, quick links, and Semantic search with natural-language queries over the vector index.

[Screenshots: Overview | Semantic search]

What you get

Auto indexing: webhooks for GitLab, GitHub, and Gitea on main/master push (serial worker avoids concurrent write failures); other hosts can use manual trigger or CI calling the same enqueue API
Progress tracking: returns a job_id on enqueue; query job status/progress anytime
Semantic search: results include path, name, start_line, end_line, etc. for quick navigation
Code Q&A (optional LLM): POST /api/code-chat and streaming variant (see OpenAPI at /docs)

30-second quick start

cp .env.example .env
docker compose up -d
curl "http://localhost:8000/health"

Then open:

http://localhost:8000/docs for OpenAPI
http://localhost:8000/admin/ for admin UI

Minimum required settings before first useful indexing:

Embeddings (see Environment variables): the default EMBED_PROVIDER=ollama needs OLLAMA_BASE_URL and EMBED_MODEL; alternatively, use EMBED_PROVIDER=openai with OPENAI_EMBED_BASE_URL, OPENAI_EMBED_API_KEY, and an OpenAI EMBED_MODEL (e.g. text-embedding-3-small). The embedding settings (OPENAI_EMBED_*) are separate from the chat settings (OPENAI_BASE_URL, OPENAI_API_KEY).
Private HTTPS repos (optional): GIT_HTTPS_TOKEN or GITLAB_ACCESS_TOKEN
Webhook signature verification (recommended outside a LAN): set the secret for each platform you use (GITLAB_WEBHOOK_SECRET, GITHUB_WEBHOOK_SECRET, GITEA_WEBHOOK_SECRET)

Repository layout

| Path | Purpose |
| --- | --- |
| `backend/app/` | Python / FastAPI service, indexing, wiki, vector store |
| `backend/requirements.txt` | Backend dependencies |
| `frontend/` | React + Vite admin UI (built assets served at `/admin/`) |
| `docs/images/` | README screenshots |
| `LICENSE` | MIT license |
| `scripts/` | Helper scripts |

Workflow (matches the code)

Webhook push (GitLab / GitHub / Gitea, main/master) / Manual trigger (any Git URL)
  ↓
Enqueue job (SQLite persistence, single serial worker)
  ↓
clone_or_pull: clone/pull repo (optional HTTPS auth: GIT_HTTPS_TOKEN or GITLAB_ACCESS_TOKEN, see below)
  ↓
collect_files: scan common code/config files (skip `node_modules` / `.git` / `.env` etc.)
  ↓
parse_functions: Tree-sitter function-level parsing (0 results → file-level fallback)
  ↓
describe_chunks (optional): one-line descriptions via `LLM_PROVIDER` (Dify / Azure OpenAI / OpenAI-compatible)
  ↓
generate_wiki: static wiki (MkDocs / Starlight / VitePress) under DATA_DIR/wiki_sites
  ↓
upsert_vector_store: embeddings (`EMBED_PROVIDER`: Ollama or OpenAI-compatible) → Chroma upsert
  ↓
query/search: semantic retrieval (for Dify/frontend)
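
The parse_functions step and its file-level fallback can be sketched as follows. This is only an illustration for Python sources: it uses the standard-library `ast` module as a stand-in for Tree-sitter, and the `Chunk` type and `chunk_source` function are hypothetical names, not the service's actual code.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str
    name: str
    kind: str  # "function" or "file"
    start_line: int
    end_line: int
    content: str


def chunk_source(path: str, source: str) -> list[Chunk]:
    """Split Python source into function chunks; fall back to one file chunk."""
    lines = source.splitlines()
    chunks: list[Chunk] = []
    try:
        tree = ast.parse(source)
    except SyntaxError:
        tree = None  # unparsable input is treated like "zero functions"
    if tree is not None:
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                body = "\n".join(lines[node.lineno - 1 : node.end_lineno])
                chunks.append(
                    Chunk(path, node.name, "function", node.lineno, node.end_lineno, body)
                )
    if not chunks:
        # File-level fallback: the whole file becomes a single chunk.
        chunks.append(Chunk(path, path, "file", 1, max(len(lines), 1), source))
    return chunks
```

The chunk fields mirror the metadata returned by semantic search (path, name, kind, start_line, end_line), so a hit can be mapped straight back to a source location.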

Quick start (Docker)

1) Configure `.env`

cp .env.example .env

Typical minimum config:

Private HTTPS repos: set GITLAB_ACCESS_TOKEN and/or GIT_HTTPS_TOKEN (see Private HTTPS clone); for GitHub PAT use GIT_HTTPS_USERNAME=x-access-token
Embeddings: set EMBED_PROVIDER (ollama or openai) and the matching variables (see Environment variables); changing provider or embedding dimension requires clearing DATA_DIR/chroma and re-indexing
Optional LLM descriptions / code chat: set LLM_PROVIDER to dify, azure_openai, or openai (default openai) and configure only that provider’s keys (see LLM provider)

2) Start

docker compose up -d

Service: http://localhost:8000
Docs: http://localhost:8000/docs
Rebuild after code changes: docker compose build --no-cache && docker compose up -d

Webhooks (auto-index on push)

Every webhook route enqueues a job only for pushes to main or master. A successful enqueue returns job_id in the JSON body.

If the corresponding secret env var is unset, signature verification is skipped (convenient for LAN testing; not recommended on the public internet).
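
As a rough sketch of the branch filter: push payloads from GitLab, GitHub, and Gitea all carry the pushed branch in a `ref` field such as `refs/heads/main`. The function name `should_enqueue` is hypothetical and the service's real check may differ.

```python
ALLOWED_REFS = {"refs/heads/main", "refs/heads/master"}


def should_enqueue(payload: dict) -> bool:
    """Return True only for pushes to main or master (assumed check)."""
    return payload.get("ref") in ALLOWED_REFS
```

Pushes to feature branches or tags (for example `refs/tags/v1.0`) fall through and are not enqueued.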

GitLab

In the project Settings → Webhooks:

URL: http://<host>:8000/webhook/gitlab
Secret token: same as GITLAB_WEBHOOK_SECRET (GitLab sends it in the X-Gitlab-Token request header)
Events: Push events
Handler accepts object_kind=push only.

GitHub

In the repo Settings → Webhooks → Add webhook:

Payload URL: http://<host>:8000/webhook/github
Content type: application/json
Secret: same as GITHUB_WEBHOOK_SECRET (HMAC SHA-256, header X-Hub-Signature-256)
Events: Just the push event (ping events are acknowledged with HTTP 200 and otherwise ignored)

Gitea

In the repo Settings → Webhooks:

URL: http://<host>:8000/webhook/gitea
Secret: same as GITEA_WEBHOOK_SECRET (header X-Gitea-Signature, HMAC-SHA256 hex of raw body)
Events: Push
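
The GitHub and Gitea signature checks described above can be sketched with the standard library. GitHub sends `sha256=<hex>` in X-Hub-Signature-256, while Gitea sends the bare hex digest in X-Gitea-Signature; both are HMAC-SHA256 over the raw request body. The function names are hypothetical, and returning True for an empty secret mirrors the "verification is skipped" behavior noted earlier.

```python
import hashlib
import hmac
from typing import Optional


def _hmac_hex(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()


def _same(a: str, b: str) -> bool:
    # Constant-time comparison; bytes avoid errors on non-ASCII header values.
    return hmac.compare_digest(a.encode(), b.encode())


def verify_github(secret: str, body: bytes, header: Optional[str]) -> bool:
    """Check X-Hub-Signature-256, which has the form 'sha256=<hex digest>'."""
    if not secret:
        return True  # secret unset: verification is skipped
    if not header or not header.startswith("sha256="):
        return False
    return _same(header[len("sha256="):], _hmac_hex(secret, body))


def verify_gitea(secret: str, body: bytes, header: Optional[str]) -> bool:
    """Check X-Gitea-Signature, a bare hex HMAC-SHA256 of the raw body."""
    if not secret:
        return True  # secret unset: verification is skipped
    if not header:
        return False
    return _same(header, _hmac_hex(secret, body))
```

The digest must be computed over the exact bytes received, before any JSON parsing, since re-serialising the payload can change whitespace and invalidate the signature.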

Private HTTPS clone

For private repositories over HTTPS, the worker injects credentials into the clone URL when the URL has no embedded userinfo (no `user:password@` part):

| Variable | Role |
| --- | --- |
| `GIT_HTTPS_TOKEN` | If set, takes precedence over `GITLAB_ACCESS_TOKEN` |
| `GITLAB_ACCESS_TOKEN` | Still supported (e.g. GitLab `read_repository` PAT) |
| `GIT_HTTPS_USERNAME` | HTTP basic username for the token; default when empty is `oauth2` (GitLab). For GitHub, set `x-access-token` |

You can also set GIT_HTTPS_USERNAME from the admin Settings UI (/admin/).
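
A minimal sketch of the credential injection rules, assuming the settings arrive as a plain dict. The function `inject_credentials` is a hypothetical name; only the variable names, the precedence, and the `oauth2` default come from the table above.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def inject_credentials(url: str, env: dict) -> str:
    """Add user:token@ to an HTTPS clone URL that has no userinfo yet."""
    parts = urlsplit(url)
    # GIT_HTTPS_TOKEN wins over GITLAB_ACCESS_TOKEN when both are set.
    token = env.get("GIT_HTTPS_TOKEN") or env.get("GITLAB_ACCESS_TOKEN")
    # Leave SSH URLs, token-less setups, and URLs with userinfo untouched.
    if parts.scheme != "https" or not token or "@" in parts.netloc:
        return url
    user = env.get("GIT_HTTPS_USERNAME") or "oauth2"
    netloc = f"{quote(user, safe='')}:{quote(token, safe='')}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Percent-encoding the username and token keeps characters such as `@` or `/` inside a token from corrupting the URL.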

Manual indexing (without Webhook)

Option A: `/webhook/trigger`

curl -X POST "http://localhost:8000/webhook/trigger" \
  -H "Content-Type: application/json" \
  -d '{"repo_url":"https://github.com/acme/backend.git","project_id":"acme/backend","project_name":"Display name"}'

Use any HTTPS or SSH URL that your runtime can git clone. project_name is optional: a human-readable label (for example, a Chinese name) that is stored on the job and shown on the Wiki home page, in the site title, and in manifest.json.

Option B: `/api/index-jobs/enqueue` (equivalent enqueue API)

curl -X POST "http://localhost:8000/api/index-jobs/enqueue" \
  -H "Content-Type: application/json" \
  -d '{"repo_url":"https://github.com/acme/backend.git","project_id":"acme/backend","project_name":"Display name"}'

Query (for Dify / frontend)

Semantic search

POST /api/query

curl -X POST "http://localhost:8000/api/query" \
  -H "Content-Type: application/json" \
  -d '{"query":"How is user login implemented?","project_id":"my-repo","top_k":10}'

GET /api/search

curl "http://localhost:8000/api/search?q=login&project_id=my-repo&top_k=10"

Response shape:

{
  "results": [
    {
      "score": 0.123,
      "content": "...",
      "metadata": {
        "path": "app/auth.py",
        "name": "login",
        "kind": "function",
        "start_line": 10,
        "end_line": 88
      }
    }
  ]
}
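
To show how a client might consume this shape, here is a small sketch that turns each hit into a `path:start-end` locator. `format_hits` is a hypothetical helper, and it makes no assumption about whether a higher or lower score is better, so it keeps the server's ordering.

```python
import json


def format_hits(response_text: str) -> list[str]:
    """Render search results as 'path:start-end name (score=...)' lines."""
    hits = json.loads(response_text).get("results", [])
    lines = []
    for hit in hits:
        meta = hit.get("metadata", {})
        location = f"{meta.get('path')}:{meta.get('start_line')}-{meta.get('end_line')}"
        lines.append(f"{location} {meta.get('name')} (score={hit.get('score')})")
    return lines
```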

Projects / index status

GET /api/projects: list indexed projects with their doc counts. Each item includes project_name (display name, may be null). Optional parameters: q (substring match on project_id or project_name) and limit/offset for pagination (omit limit to get the full list, which keeps older clients working)
DELETE /api/projects/{project_id}: remove project vectors and metadata
POST /api/projects/{project_id}/reindex: enqueue a rebuild for the project
GET /api/projects/{project_id}/vectors: inspect stored vectors with pagination
GET /api/project/index-status?project_id=xxx: check whether a project is indexed (indexed/doc_count)

Admin / auth / operations APIs

Auth UI: GET /api/auth/status, POST /api/auth/login, GET /api/auth/me
Admin settings: GET /api/admin/settings
Storage insight: GET /api/admin/storage
LLM usage metrics: GET /api/admin/llm-usage
Code chat feedback: POST /api/code-chat/feedback

Static Wiki (MkDocs / Starlight / VitePress)

After describe_chunks, the worker runs a wiki build before vector upsert. Failures are logged only and do not block indexing. Default WIKI_BACKEND=mkdocs (Material, Python-only). starlight or vitepress need Node.js + npm (included in this repo’s Dockerfile; install them yourself on bare metal). The first build runs npm install under wiki_work/<project_id> (npm registry access required). Output: DATA_DIR/wiki_sites/<project_id>/site/, served by the API.

Browse: http://<host>:8000/wiki/<project_id>/site/ (same project_id rules as under repos/)
Metadata: GET /api/wiki/{project_id} → last manifest.json (includes wiki_backend, commit, timestamps, counts)

Pages include overview, architecture (when an LLM is configured), file index (tree), per-file symbol pages, and a symbol table (split into parts when large). MkDocs uses Lunr search; Starlight/VitePress use built-in local search. On each symbol, Functionality shows only the LLM-generated one-line description from the indexing pipeline (same field as in the vector store); if the model is not configured or generation failed, a placeholder explains that. When a source docstring differs from that text, it appears separately under Source Docstring.

Index queue & job progress

Indexing runs through a serial queue (avoids concurrent writes to Chroma / local repo dirs). Job state is persisted in SQLite so you can still query history after restarts.

List jobs: GET /api/index-jobs?limit=50&offset=0 (optional status / project_id filters; response total is the full match count, jobs is the current page, limit/offset echo the request)
Get one job: GET /api/index-jobs/{job_id}
Cancel a job: POST /api/index-jobs/{job_id}/cancel (supports queued and running; running jobs are terminated)
Retry a failed/cancelled job: POST /api/index-jobs/{job_id}/retry
Precheck a repo before enqueue: POST /api/index-jobs/precheck

Key fields:

status: queued / running / succeeded / failed / cancelled
progress: 0-100
step: stage name (e.g. clone_or_pull / parse_functions / generate_wiki / upsert_vector_store)
message: human-friendly stage message
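
A polling client for job progress might look like the sketch below. It assumes the job JSON exposes status, progress, step, and message as top-level fields, which the list above suggests but does not guarantee; `wait_for_job` and `http_get_json` are hypothetical names.

```python
import json
import time
import urllib.request
from typing import Callable

TERMINAL = {"succeeded", "failed", "cancelled"}


def http_get_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_job(
    base_url: str,
    job_id: str,
    fetch: Callable[[str], dict] = http_get_json,
    interval: float = 2.0,
    max_polls: int = 300,
) -> dict:
    """Poll GET /api/index-jobs/{job_id} until the job reaches a terminal status."""
    url = f"{base_url}/api/index-jobs/{job_id}"
    for _ in range(max_polls):
        job = fetch(url)
        print(f"{job.get('status')} {job.get('progress')}% {job.get('step')}: {job.get('message')}")
        if job.get("status") in TERMINAL:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

For example, `wait_for_job("http://localhost:8000", job_id)` with the job_id returned by an enqueue call. The `fetch` parameter exists so the loop can be exercised without a running server.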

Environment variables

Most values can also be changed from the admin UI (/admin/ → Settings); overrides are stored in DATA_DIR/ui_overrides.json and take precedence over .env for supported keys.
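
The precedence rule can be pictured with the sketch below. It assumes that ui_overrides.json is a flat JSON object mapping variable names to values, which is a guess about the file format, and both function names are hypothetical. It also ignores the "supported keys" restriction for brevity.

```python
import json
from pathlib import Path


def load_overrides(path) -> dict:
    """Read UI overrides; a missing file simply means no overrides."""
    p = Path(path)
    if not p.is_file():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


def resolve_setting(key: str, overrides: dict, env: dict, default=None):
    """UI override first, then the .env / process environment, then the default."""
    if key in overrides:
        return overrides[key]
    return env.get(key, default)
```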

Required for baseline indexing

| Variable | Description |
| --- | --- |
| `DATA_DIR` | Data directory (default `./data`; commonly `/data` in containers) |

Embeddings — set EMBED_PROVIDER to ollama (default) or openai:

| Variable | When | Description |
| --- | --- | --- |
| `EMBED_PROVIDER` | Always | `ollama` (default) or `openai`. |
| `OLLAMA_BASE_URL` | `EMBED_PROVIDER=ollama` | Ollama base URL (default `http://localhost:11434`; Docker often `http://host.docker.internal:11434`). |
| `OLLAMA_API_KEY` | Optional | Bearer token if Ollama sits behind a gateway. |
| `EMBED_MODEL` | Always | With Ollama: Ollama embeddings model name. With OpenAI: embeddings model id (e.g. `text-embedding-3-small`). If you change provider/model/dimension, clear `DATA_DIR/chroma` and re-index. |
| `OPENAI_EMBED_BASE_URL` | `EMBED_PROVIDER=openai` | OpenAI-compatible embeddings API base (usually ends with `/v1`). Independent of `OPENAI_BASE_URL` used for chat. |
| `OPENAI_EMBED_API_KEY` | `EMBED_PROVIDER=openai` | API key for embeddings only. Independent of `OPENAI_API_KEY`. |
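
A pre-flight check for the embedding settings could be sketched like this. `missing_embed_settings` is a hypothetical helper, not part of the service. It treats OLLAMA_BASE_URL as optional because the table documents a default for it.

```python
def missing_embed_settings(env: dict) -> list[str]:
    """Return the embedding variables that still need a value."""
    provider = (env.get("EMBED_PROVIDER") or "ollama").lower()
    if provider == "ollama":
        # OLLAMA_BASE_URL has a documented default, so only the model is checked.
        required = ["EMBED_MODEL"]
    elif provider == "openai":
        required = ["EMBED_MODEL", "OPENAI_EMBED_BASE_URL", "OPENAI_EMBED_API_KEY"]
    else:
        raise ValueError(f"unsupported EMBED_PROVIDER: {provider!r}")
    return [key for key in required if not env.get(key)]
```

Note that the chat variables (`OPENAI_BASE_URL`, `OPENAI_API_KEY`) are deliberately not consulted, since the embedding settings are independent of them.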

Common optional settings

| Variable | Description |
| --- | --- |
| `GITLAB_WEBHOOK_SECRET` | GitLab webhook secret (if unset, verification skipped) |
| `GITHUB_WEBHOOK_SECRET` | GitHub webhook secret for `X-Hub-Signature-256` (if unset, skipped) |
| `GITEA_WEBHOOK_SECRET` | Gitea webhook secret for `X-Gitea-Signature` (if unset, skipped) |
| `GITLAB_ACCESS_TOKEN` | HTTPS clone token (private repos); still used if `GIT_HTTPS_TOKEN` is empty |
| `GIT_HTTPS_TOKEN` | HTTPS clone token; overrides `GITLAB_ACCESS_TOKEN` when set |
| `GIT_HTTPS_USERNAME` | HTTPS basic username for clone (`oauth2` default; GitHub: `x-access-token`) |
| `GITLAB_EXTERNAL_URL` | Optional base URL for “open repo” links when no `repo_url` is stored (GitLab-style paths) |
| `WIKI_BACKEND` | `mkdocs` (default) / `starlight` / `vitepress` (last two need Node.js + npm; Docker image includes them). |
| `WIKI_ENABLED` | `false` / `0` disables wiki generation after describe (default: on). |
| `SKIP_WIKI` | `1` skips wiki for a run without changing `WIKI_ENABLED`. |
| `EMBED_MAX_CHARS` | Max characters sent per embedding request before truncation (default `30000`). |
| `CONTENT_LANGUAGE` | `zh` or `en`; LLM/wiki output language (default `zh`). |
| `INDEX_EXCLUDE_PATTERNS` | Comma- or newline-separated path globs to skip during indexing. |
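
As an illustration of how an `INDEX_EXCLUDE_PATTERNS` value might be interpreted, the sketch below splits on commas and newlines and matches with the standard-library `fnmatch` module. The service's actual glob semantics are not documented here and may differ (for instance in how `*` treats `/`); both function names are hypothetical.

```python
import fnmatch
import re


def parse_patterns(raw: str) -> list[str]:
    """Split a comma- or newline-separated value into clean glob patterns."""
    return [p.strip() for p in re.split(r"[,\n]", raw) if p.strip()]


def is_excluded(path: str, patterns: list[str]) -> bool:
    """True if the repo-relative path matches any pattern (fnmatch rules)."""
    return any(fnmatch.fnmatchcase(path, pat) for pat in patterns)
```

With `fnmatch`, `*` also matches path separators, so `vendor/*` excludes everything below `vendor/`, including nested directories.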

LLM provider

Chunk descriptions and code Q&A use exactly one backend, chosen by LLM_PROVIDER: dify, azure_openai, or openai (default). Configure only the variables for the selected provider.

| Variable | Description |
| --- | --- |
| `LLM_PROVIDER` | `dify` \| `azure_openai` \| `openai` (default `openai`). |
| `DIFY_API_KEY` | Dify API key (when `LLM_PROVIDER=dify`). |
| `DIFY_BASE_URL` | Dify API base URL (default `https://api.dify.ai/v1`). |
| `AZURE_OPENAI_API_KEY` | Azure OpenAI key (when `LLM_PROVIDER=azure_openai`). |
| `AZURE_OPENAI_ENDPOINT` | Azure endpoint (e.g. `https://xxx.cognitiveservices.azure.com`). |
| `AZURE_OPENAI_VERSION` | Azure API version. |
| `AZURE_OPENAI_DEPLOYMENT` | Azure deployment name. |
| `OPENAI_API_KEY` | OpenAI (or compatible) key for chat (when `LLM_PROVIDER=openai`). |
| `OPENAI_BASE_URL` | Chat/completions base URL (default `https://api.openai.com/v1`). |
| `OPENAI_MODEL` | Chat model or deployment name. |

If no LLM is configured for the selected provider, indexing and retrieval still work; chunk descriptions may be missing or placeholders, and code Q&A will not call a model.

Advanced controls

| Variable | Description |
| --- | --- |
| `REPOS_CACHE_MAX_GB` | Soft cap on total size (GiB) of `DATA_DIR/repos` mirrors; evicts other projects’ dirs LRU-style before clone/pull; `0` disables |
| `REPOS_CACHE_MAX_COUNT` | Max number of cached repo dirs under `DATA_DIR/repos` (including the current job); evicts other projects LRU-style; `0` disables |
| `SKIP_VECTOR_STORE` | If `1`, runs clone/parse/(optional LLM) but skips Chroma upsert (useful for local validation). |
| `INCREMENTAL_INDEX` | Set to `1` / `true` to enable incremental vector indexing (default off). Requires `project_index.sqlite3` to already have `last_indexed_commit` and vectors to use stable `gv2_` IDs; otherwise it automatically falls back to full indexing. |
| `FORCE_FULL_INDEX` | Force a full vector reindex for a run (overrides incremental mode). |
| `WIKI_KEEP_WORK` | `1` keeps intermediate `wiki_work/<project_id>` for debugging. |
| `WIKI_MAX_FILE_PAGES` | Max per-path file pages (default `5000`). |
| `WIKI_SYMBOL_ROWS_PER_FILE` | Max symbol table rows per Markdown file (default `4000`). |
| `NPM_REGISTRY` | Optional. When using `starlight` / `vitepress`, sets `npm_config_registry` for npm if non-empty. Or set `npm_config_registry` / `NPM_CONFIG_REGISTRY` in the environment (the uppercase form is mapped to `npm_config_registry` for subprocesses). |
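
The interaction between `INCREMENTAL_INDEX` and `FORCE_FULL_INDEX` can be summarised as a small decision function. This is a sketch of the documented rules only: `choose_index_mode` is a hypothetical name, and how the service actually inspects stored vector IDs is an assumption.

```python
from typing import Iterable, Optional


def choose_index_mode(
    incremental_enabled: bool,
    force_full: bool,
    last_indexed_commit: Optional[str],
    vector_ids: Iterable[str],
) -> str:
    """Return 'incremental' only when every documented precondition holds."""
    if force_full or not incremental_enabled:
        return "full"
    if not last_indexed_commit:
        return "full"  # nothing recorded to diff against
    ids = list(vector_ids)
    if not ids or not all(i.startswith("gv2_") for i in ids):
        return "full"  # legacy or missing IDs: fall back to a full index
    return "incremental"
```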

Data & persistence

By default under DATA_DIR:

Repo mirrors: DATA_DIR/repos/<project_id>/... (optional REPOS_CACHE_MAX_GB / REPOS_CACHE_MAX_COUNT to auto-remove least-recently-used other project mirrors to save disk; never deletes the repo for the job currently indexing)
Vector store: DATA_DIR/chroma/
Jobs DB: DATA_DIR/index_jobs.sqlite3
Project vector index metadata: DATA_DIR/project_index.sqlite3 (doc_count, display name, plus incremental fields last_indexed_commit / last_embed_model)
Static wiki: DATA_DIR/wiki_sites/<project_id>/site/ plus manifest.json (intermediate wiki_work/ removed unless WIKI_KEEP_WORK=1)
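
The repo-mirror eviction policy (least recently used first, never the project being indexed) can be sketched as below. The `RepoDir` type and `plan_evictions` function are hypothetical, and the sketch assumes the current project's mirror is already part of the list when counting.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class RepoDir:
    project_id: str
    size_bytes: int
    last_used: float  # e.g. a Unix timestamp


def plan_evictions(repos, current_project, max_gb=0.0, max_count=0):
    """Return project_ids to delete, oldest first; a limit of 0 is disabled."""
    kept = list(repos)
    # Only other projects are candidates, least recently used first.
    candidates = sorted(
        (r for r in repos if r.project_id != current_project),
        key=lambda r: r.last_used,
    )

    def over_limit() -> bool:
        too_many = max_count > 0 and len(kept) > max_count
        too_big = max_gb > 0 and sum(r.size_bytes for r in kept) > max_gb * GIB
        return too_many or too_big

    evicted = []
    for victim in candidates:
        if not over_limit():
            break
        kept.remove(victim)
        evicted.append(victim.project_id)
    return evicted
```

Because the current project is excluded from the candidates, the size cap is soft: a single large repository can exceed it on its own.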

Development (local)

python3 -m venv .venv
source .venv/bin/activate
pip install -r backend/requirements.txt
cp .env.example .env
cd backend && uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload

Admin UI local dev (optional, second terminal; set CORS_ORIGINS=http://localhost:5173):

cd frontend && npm install && npm run dev

Notes:

When started from backend/, the default DATA_DIR=./data resolves to backend/data/; set DATA_DIR=../data in .env if you want data at the repository root.
On startup the service attempts to start the indexing queue worker; vector store/embedding objects are typically loaded on first index or first query.
If you see `No function-level chunks parsed ...; using file-level fallback`, parsing produced zero function chunks and the service fell back to file-level chunks (retrieval still works, but granularity is coarser).

README maintenance rule

When adding or changing any public API, queue behavior, environment variable, or admin page:

Update both README.md and README.zh-CN.md in the same PR.
Keep the same section order and endpoint coverage in both files.
Ensure at least one runnable curl example still works after the change.

License

MIT
