Add git mirror hosting to cc-catalog-svc for air-gapped deployments #115
When git.enabled, the service mirrors each CodeCollection git_url into bare repos under /data/git, serves them via git smart HTTP at /git, and rewrites catalog git_url responses to public_base_url/<slug>.git once mirrored. Includes scheduled sync, admin trigger, and status API. Co-authored-by: Cursor <cursoragent@cursor.com>
cc-catalog-svc imageTag:
|
CI clones configured CodeCollection repos into /opt/cc-catalog/git during the Docker build; runtime sync can be disabled so pods with no egress still serve read-only git smart HTTP from baked mirrors. Co-authored-by: Cursor <cursoragent@cursor.com>
Enable git in config.bake.yaml, bake from sources without requiring runtime git.enabled, fail the Docker build when no *.git dirs are produced, and add a test for the bake manifest. Co-authored-by: Cursor <cursoragent@cursor.com>
Use x-access-token Basic auth (GitHub git HTTPS requirement) instead of Bearer, and limit config.bake.yaml to the six runwhen-contrib air-gap repos. Co-authored-by: Cursor <cursoragent@cursor.com>
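A minimal sketch of the header this commit describes, assuming the token is a GitHub App installation token or personal access token and that the username is the literal string `x-access-token`. The function name `github_basic_auth_header` is hypothetical and not taken from the PR.

```python
import base64


def github_basic_auth_header(token: str) -> str:
    """Build an HTTP Basic Authorization header value for git over HTTPS.

    Assumption: the username is the literal "x-access-token" and the
    password is the token itself, as the commit message describes.
    """
    raw = f"x-access-token:{token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

One way to hand such a header to git is the `http.extraHeader` config key (for example `git -c http.extraHeader="Authorization: <value>" clone ...`); whether the PR does it this way is not stated.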
DictBackend keys must be "/<slug>.git" with Repo objects, not bare paths with bytes keys. Without this, git clone returned "No git repository was found at /rw-generic-codecollection.git" despite present: true in the API. Co-authored-by: Cursor <cursoragent@cursor.com>
- Added `a2wsgi` as a dependency in `pyproject.toml` to support WSGI middleware.
- Updated the import of `WSGIMiddleware` in `main.py` to use `a2wsgi` instead of `starlette.middleware.wsgi` for improved compatibility with Dulwich's git smart HTTP.
- Added a new test in `test_git_http.py` to verify the functionality of the WSGI mount using `a2wsgi`. This ensures that Dulwich streams refs correctly via the WSGI write callback.
Co-authored-by: Cursor <cursoragent@cursor.com>
Use Dulwich make_wsgi_chain so GunzipFilter handles gzip POST bodies from real git clients, which was causing RPC/curl-18 transfer errors on multi-ref repos. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace Dulwich with native git http-backend so platform gitget's fetch(depth=2, tags=True) works; Dulwich crashed on shallow upload-pack and caused taskiq-worker RPC failures. Co-authored-by: Cursor <cursoragent@cursor.com>
Security & correctness:
* Slug validation in repo_bare_path/list_bare_repo_slugs blocks
`..`, slashes, and other path-escape attempts before joining onto
data_dir.
* Git HTTP only serves slugs derived from repos_to_sync(cfg) via the
new make_git_wsgi_app(allowed_slugs=...) parameter, so leftover
*.git directories on disk are 404 rather than served.
* Path router in the WSGI app explicitly allow-lists smart-HTTP
endpoints (info/refs, git-upload-pack, HEAD) and rejects anything
else without spawning git http-backend.
Robustness:
* Process-wide lock around run_git_sync prevents the scheduler and
POST /admin/sync-git from racing on the same bare repos.
* sync_one_repo detects an incomplete bare clone (missing HEAD or
objects/) and re-clones rather than looping forever on
`git remote update` against junk.
* sync_one_repo runs `git remote set-url` when the configured
upstream changes, so an operator edit to git_url is honored on the
next sync.
* WSGI app streams git http-backend output to the client in 64 KiB
chunks via the WSGI write() callback instead of buffering the full
packfile in memory.
Air-gap correctness:
* git.data_dir default flipped to /opt/cc-catalog/git so release
images with build-time baked mirrors work out of the box.
* run_git_sync(force=True) no longer bypasses runtime_sync=False;
callers must pass allow_runtime_sync=True. POST /admin/sync-git
accepts ?allow_runtime_sync=<bool> and defaults to false so a
stray click can't egress to github.com from an air-gapped pod.
* populate_baked_head_commits backfills git_head_commit from disk
at startup so /api/v1/git/repos reports HEAD for baked mirrors
before any runtime sync has run.
Schema:
* init_db now applies idempotent ALTER TABLE ... ADD COLUMN for
git_head_commit / git_last_synced / git_last_sync_error so
already-deployed databases pick up the new columns on upgrade.
Docs:
* docs/GIT_MIRROR.md — full operator runbook (deployment scenarios,
security model, troubleshooting).
* README.md and ARCHITECTURE.md reference the new doc; admin/git
endpoints documented in the API table.
* config.airgap.yaml gains an inline note about the
allow_runtime_sync override.
Misc:
* Drop unused `repos_to_sync` import from scripts/bake_git_mirrors.
Tests:
* 104 → 104 passing. New coverage for slug sanitization, path
traversal 404, allowed_slugs scoping, incomplete-clone recovery,
origin URL refresh, sync lock contention, baked head_commit
backfill, and in-place column add on upgrade.
Co-authored-by: Cursor <cursoragent@cursor.com>
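The slug validation listed under "Security & correctness" could look roughly like the sketch below. It is an illustration only: `safe_bare_path` and `_SLUG_RE` are hypothetical names, and the character class is borrowed from the `_PATH_RE` pattern quoted later in this conversation rather than from `repo_bare_path` itself.

```python
import re
from pathlib import Path

# Hypothetical slug pattern: starts with an alphanumeric, then up to 199
# characters from a conservative set. No slashes are possible.
_SLUG_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,199}")


def safe_bare_path(data_dir: str, slug: str) -> Path:
    """Return <data_dir>/<slug>.git, refusing anything that could escape data_dir."""
    # fullmatch (rather than match with "$") also rejects a trailing newline.
    if not _SLUG_RE.fullmatch(slug) or ".." in slug:
        raise ValueError(f"invalid slug: {slug!r}")
    root = Path(data_dir).resolve()
    candidate = (root / f"{slug}.git").resolve()
    # Belt and braces: the resolved path must sit directly under data_dir.
    if candidate.parent != root:
        raise ValueError(f"slug escapes data_dir: {slug!r}")
    return candidate
```

Rejecting any slug that contains `..` is stricter than necessary (it also refuses `a..b`), which seems an acceptable trade for a simple rule.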
    git_cfg = app_cfg.git
    if git_cfg.enabled and git_cfg.public_base_url and repo_exists(git_cfg.data_dir, cc.slug):
        return public_git_url(cc.slug, git_cfg)
    return cc.git_url
Catalog rewrite ignores HTTP allowlist
High Severity
The Git mirror's public_base_url may be advertised for incomplete or non-existent repos because repo_exists is too lenient, checking only for a HEAD file. Additionally, the /git Smart HTTP endpoint's allowed_slugs list is static, preventing newly added repos from being served after a config reload.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit ffdd304. Configure here.
    app_cfg = cfg or get_config()
    git_cfg = app_cfg.git
    if git_cfg.enabled and git_cfg.public_base_url and repo_exists(git_cfg.data_dir, cc.slug):
        return public_git_url(cc.slug, git_cfg)
Incomplete mirror still rewrites URL
Medium Severity
Catalog rewrite and HTTP serving treat a repo as present when HEAD exists, but sync_one_repo only treats mirrors with an objects directory as complete. Interrupted clones or corrupt trees can pass repo_exists yet fail clone/fetch; with runtime_sync false, sync never repairs them.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit ffdd304. Configure here.
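One possible way to address the two findings above is to make the "is this mirror present" check match what `sync_one_repo` already treats as complete (HEAD plus an objects directory). The sketch below is hedged: `mirror_looks_complete` is a hypothetical name, not the PR's `repo_exists`, and it checks only on-disk layout, not pack integrity.

```python
from pathlib import Path


def mirror_looks_complete(data_dir: str, slug: str) -> bool:
    """Return True only when the bare repo has both HEAD and objects/.

    Assumption: a bare mirror lives at <data_dir>/<slug>.git. This is a
    layout check only; a corrupt pack would still pass.
    """
    repo = Path(data_dir) / f"{slug}.git"
    return (repo / "HEAD").is_file() and (repo / "objects").is_dir()
```

Using the same predicate for the catalog rewrite, the HTTP router, and the sync path would keep the three from disagreeing about which repos exist.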
    )
    conn.execute(
        text(f'ALTER TABLE {table_name} ADD COLUMN {col_name} {col_ddl}')
    )
Parallel startup column migration race
Medium Severity
Legacy git column migration uses plain ALTER TABLE ... ADD COLUMN without IF NOT EXISTS. Multiple pods starting together on Postgres can both see missing columns and race; the loser’s init_db can fail and prevent readiness.
Reviewed by Cursor Bugbot for commit ffdd304. Configure here.
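A hedged sketch of two ways to make the column add tolerate a losing racer. On PostgreSQL, `ADD COLUMN IF NOT EXISTS` is available; SQLite has no such clause, so the sketch falls back to catching the duplicate-column error. The helper names and the `codecollections` table in the usage note are hypothetical, and the demo uses the stdlib `sqlite3` module rather than the SQLAlchemy connection shown in the excerpt.

```python
import sqlite3


def add_column_ddl(table: str, column: str, ddl: str, dialect: str) -> str:
    """Build the ALTER statement; only PostgreSQL gets IF NOT EXISTS here."""
    if dialect == "postgresql":
        return f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {column} {ddl}"
    return f"ALTER TABLE {table} ADD COLUMN {column} {ddl}"


def add_column_tolerant(conn: sqlite3.Connection, table: str, column: str, ddl: str) -> bool:
    """Add a column on SQLite; return False if it was already there."""
    try:
        conn.execute(add_column_ddl(table, column, ddl, "sqlite"))
        return True
    except sqlite3.OperationalError as exc:
        # SQLite reports "duplicate column name: <col>" when the column exists.
        if "duplicate column name" in str(exc).lower():
            return False
        raise
```

For example, calling `add_column_tolerant(conn, "codecollections", "git_head_commit", "TEXT")` twice adds the column once and reports `False` the second time instead of failing startup.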
…e.yaml - Introduced a new entry for `ss-rw-cli-codecollection` with its corresponding git URL and image registry. - Ensured consistency in the format of the configuration file by maintaining proper indentation and structure.
        ["clone", "--mirror", upstream_url, dest],
        git_cfg.auth,
        timeout=git_cfg.clone_timeout_seconds,
    )
Sync can delete repo during clone
High Severity
run_git_sync holds a process lock, but Smart HTTP does not. sync_one_repo can shutil.rmtree a bare repo and re-clone while git http-backend is serving the same path, causing intermittent clone/fetch failures or corrupt packs.
Reviewed by Cursor Bugbot for commit 25063c3. Configure here.
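One way to narrow the window this finding describes is to build the replacement repo in a staging directory and swap it in with renames, rather than deleting the live path and cloning into it. The sketch below is an assumption-laden illustration: `replace_repo_atomically` and the `populate` callback are hypothetical, and the swap still leaves a brief moment between the two renames where the destination does not exist, so it reduces the problem rather than eliminating it.

```python
import os
import shutil
import tempfile
from typing import Callable


def replace_repo_atomically(dest: str, populate: Callable[[str], None]) -> None:
    """Fill a staging dir via populate(), then rename it over dest.

    Assumption: staging and dest are on the same filesystem so that
    os.rename is a cheap metadata operation.
    """
    parent = os.path.dirname(os.path.abspath(dest))
    staging = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    try:
        populate(staging)  # e.g. run a mirror clone into the staging dir
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    trash = None
    if os.path.exists(dest):
        # Move the old repo aside instead of deleting it in place.
        trash = tempfile.mkdtemp(prefix=".trash-", dir=parent)
        os.rename(dest, os.path.join(trash, "old"))
    os.rename(staging, dest)
    if trash is not None:
        shutil.rmtree(trash, ignore_errors=True)
```

If `populate` fails, the existing repo is left untouched, which also avoids turning a transient upstream error into a missing mirror.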
    acquired = _SYNC_LOCK.acquire(blocking=False)
    if not acquired:
        summary["skipped"] = "another git sync is already running"
        return summary
Git sync lock per process
Medium Severity
run_git_sync serializes work with a module-level threading.Lock in each process only. Multiple API replicas sharing a writable git.data_dir (e.g. /data/git on a PVC) with runtime_sync enabled can run git clone/remote update on the same bare repo concurrently and corrupt packs or refs.
Reviewed by Cursor Bugbot for commit 25063c3. Configure here.
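A sketch of how the in-process lock could be complemented by an advisory file lock inside the shared data directory, so that separate processes also exclude each other. This is an assumption, not the PR's approach: `cross_process_sync_lock` and the `.sync.lock` filename are hypothetical, `fcntl.flock` is Unix-only, and its behaviour on network filesystems varies, so replicas on different nodes sharing a PVC may not be covered.

```python
import fcntl
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def cross_process_sync_lock(data_dir: str) -> Iterator[bool]:
    """Yield True if the advisory lock was acquired, False if someone else holds it."""
    lock_path = os.path.join(data_dir, ".sync.lock")  # hypothetical filename
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    acquired = False
    try:
        try:
            # Non-blocking exclusive lock, mirroring acquire(blocking=False).
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        if acquired:
            fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

A caller would wrap the sync body in `with cross_process_sync_lock(data_dir) as ok:` and record a "skipped" summary when `ok` is false, in the same spirit as the excerpt above.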
Before this commit, when two canonical `<ref>-<cc_sha7>-<rt_sha7>` tags shared the same ref (e.g. two `main-...-...` builds), both the OCI source's `resolve_latest` and `_upsert_refs`'s grouping tiebreak fell through to a lex-on-image_tag sort. That sort is wrong for our schema: `cc_sha7` is hex, so `main-1...` sorts ASCII-before `main-d...` even when the `1...` push happened weeks later. The catalog kept reporting a stale `latest_image_tag` until either tag aged out.
Fix has two parts:
1. `OCISource.discover_refs` now enriches each tag in a tiebreak group with `built_at`: it GETs the manifest, prefers a `Last-Modified` header when present (JFrog / Harbor / Quay), and otherwise descends into the manifest's `config.digest` blob (or, for OCI image indices, the first child platform manifest) and reads the `created` field written by buildkit. Failures per tag are tolerated.
2. `_upsert_refs` now uses `(built_at, image_tag)` for its grouping tiebreak, matching `resolve_latest` exactly so `is_latest` always lands on the row that `resolve_latest` declared.
Enrichment runs only when a ref has more than one canonical tag, so single-tag-per-ref polls do zero extra HTTP work.
Also fix a pre-existing test count drift in `test_bake_git_mirrors.py` that hardcoded 6 entries while `config.bake.yaml` now ships 7.
Co-authored-by: Cursor <cursoragent@cursor.com>
    write(leftover)
    _stream_body(proc.stdout, write)
    proc.wait()
    _log_proc_stderr(proc, path)
stderr may block streaming
Medium Severity
While streaming git http-backend stdout to the client, stderr is not read until after the body finishes. Verbose CGI stderr can fill the pipe buffer and stall the subprocess, hanging clones on that worker.
Reviewed by Cursor Bugbot for commit be3224d. Configure here.
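A common way to avoid this kind of stall is to drain stderr on a background thread while stdout is being streamed. The sketch below illustrates that pattern under stated assumptions: `stream_with_stderr_drain` is a hypothetical helper, it does not reproduce the PR's CGI header parsing, and it keeps all stderr in memory, which a real implementation would probably cap.

```python
import subprocess
import threading
from typing import Callable, List, Sequence, Tuple


def stream_with_stderr_drain(
    cmd: Sequence[str],
    write: Callable[[bytes], None],
    chunk_size: int = 64 * 1024,
) -> Tuple[int, bytes]:
    """Stream the child's stdout to write() while a thread empties stderr."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stderr_chunks: List[bytes] = []

    def _drain() -> None:
        # Reading continuously keeps the stderr pipe from filling up.
        for chunk in iter(lambda: proc.stderr.read(4096), b""):
            stderr_chunks.append(chunk)

    drainer = threading.Thread(target=_drain, daemon=True)
    drainer.start()
    for chunk in iter(lambda: proc.stdout.read(chunk_size), b""):
        write(chunk)
    returncode = proc.wait()
    drainer.join()
    return returncode, b"".join(stderr_chunks)
```

The collected stderr can then be logged after the body finishes, as the excerpt above already does, without the child ever blocking on a full pipe.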
…hex-larger.
Same bug as the cc-catalog-svc fix one commit prior: when two canonical `<ref>-<cc_sha7>-<rt_sha7>` tags share the same ref (two `main-...-...` builds in the same registry), both `resolve_latest` and `_upsert_versions`'s grouping tiebreak fell through to a lex-on-image_tag sort, which is wrong for our hex-prefixed schema: `main-1xxxxxx-...` sorts ASCII-before `main-dxxxxxx-...` even when the `1xxxxxx` push happened weeks later.
Fix mirrors the cc-catalog-svc one:
1. `OCISource.discover_refs` enriches each tag in a tiebreak group with `built_at`: it GETs the manifest, prefers a `Last-Modified` header when present (JFrog / Harbor / Quay), and otherwise descends into the manifest's `config.digest` blob (or the first child platform manifest for OCI image indices) and reads the `created` field set by buildkit. Per-tag failures are tolerated.
2. `_upsert_versions` uses `(built_at, image_tag)` for its grouping tiebreak, matching `resolve_latest` exactly so `is_latest` always lands on the row `resolve_latest` declared.
Enrichment runs only when a ref has more than one canonical tag, so single-tag-per-ref polls do zero extra HTTP work. The shared requests.Session reuses the bearer token across manifest + config-blob fetches.
No tests added (cc-registry-v2 has no organized pytest suite for the backend). The fix has been validated via the matching cc-catalog-svc test in the same branch, which covers the identical contract.
Co-authored-by: Cursor <cursoragent@cursor.com>
Container Images BuiltTag:
|
Field report from the JFrog-fronted air-gap deployment: after the previous fix, the catalog STILL kept the older `main-de76dd0-71dfdc4` as the surviving row for ref=main and stamped its `image_built_at` with a timestamp suspiciously close to `last_synced` (same second).
Root cause: JFrog Artifactory's docker-remote repository proxies an upstream registry but populates the `Last-Modified` HTTP header on manifest responses with JFrog's local CACHE mtime, i.e. when JFrog last refreshed the manifest from upstream. After the user cleared JFrog's cache, both manifest tags were re-pulled within a few hundred millis of each other in cc-catalog-svc's poll loop, and whichever GET landed last got the freshest mtime. That happened to be the older image, so its `built_at` beat the newer image's.
The OCI Distribution spec does not require `Last-Modified` at all, and the only field guaranteed to reflect when an image was actually built is the manifest's `config.created` (set by buildkit / docker buildx unconditionally). Strip the Last-Modified fast path from both OCISource implementations (cc-catalog-svc + cc-registry-v2) and always go to the config blob.
Cost: one extra HTTP call per tiebreak tag (manifest GET still required; we now always also GET the config blob). Negligible even for CCs with many competing tags.
Tests:
- Update `test_discover_refs_enriches_built_at_on_tiebreak` and `test_discover_refs_enrichment_tolerates_per_tag_failure` to mock config blobs instead of `Last-Modified`.
- Add `test_discover_refs_ignores_misleading_last_modified_from_jfrog` which reproduces the production scenario exactly: the OLDER build has a NEWER Last-Modified, but the source must still pick the actually-newer build by reading `config.created`.
Architecture doc updated to call out the JFrog cache-mtime pitfall.
Co-authored-by: Cursor <cursoragent@cursor.com>
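Reading `created` out of an image config blob might look like the sketch below. It covers only the parsing step, not the manifest or blob HTTP fetches, and `built_at_from_config_blob` is a hypothetical name. The timestamp is assumed to be RFC 3339, possibly with nanosecond precision, so the fraction is truncated to microseconds before parsing; a blob without a usable `created` yields `None`.

```python
import json
import re
from datetime import datetime
from typing import Optional

# RFC 3339 timestamp with an optional fractional part of any length.
_RFC3339_RE = re.compile(
    r"(?P<base>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"(?:\.(?P<frac>\d+))?"
    r"(?P<tz>Z|[+-]\d{2}:\d{2})"
)


def built_at_from_config_blob(blob: bytes) -> Optional[datetime]:
    """Return the image's `created` time, or None if absent or unparseable."""
    try:
        created = json.loads(blob).get("created")
    except (ValueError, AttributeError):
        return None  # not JSON, or not a JSON object
    if not isinstance(created, str):
        return None
    match = _RFC3339_RE.fullmatch(created)
    if match is None:
        return None
    # datetime supports microseconds only, so truncate or pad to 6 digits.
    frac = (match.group("frac") or "0")[:6].ljust(6, "0")
    tz = "+00:00" if match.group("tz") == "Z" else match.group("tz")
    return datetime.fromisoformat(f"{match.group('base')}.{frac}{tz}")
```

Returning `None` rather than raising keeps per-tag failures tolerable, in line with the behaviour the commit messages describe.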
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 7 total unresolved issues (including 6 from previous reviews).
Reviewed by Cursor Bugbot for commit 50920bf. Configure here.
    _PATH_RE = re.compile(
        r"^/(?P<slug>[A-Za-z0-9][A-Za-z0-9._-]{0,199})\.git"
        r"(?P<rest>/info/refs|/git-upload-pack|/git-receive-pack|/HEAD)$"
    )
Path regex allows git-receive-pack despite read-only design
Medium Severity
_PATH_RE includes /git-receive-pack as an allowed path, but the service is explicitly read-only. The documentation in GIT_MIRROR.md states the WSGI router only accepts three paths (/info/refs, /git-upload-pack, /HEAD), yet the regex permits a fourth. This causes git http-backend to be spawned for push requests unnecessarily. While the CGI currently rejects pushes (no http.receivepack configured), filtering at the routing layer is the documented intent and provides defense-in-depth against accidental bare-repo misconfiguration.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 50920bf. Configure here.
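A sketch of what a read-only router check could look like if the push endpoint were dropped from the pattern. The regex is the one quoted above minus `/git-receive-pack`; the `is_read_only_request` helper and its handling of the `service` query parameter are assumptions added for illustration, not code from the PR.

```python
import re
from urllib.parse import parse_qs

# Same shape as the quoted _PATH_RE, without /git-receive-pack.
_READ_ONLY_PATH_RE = re.compile(
    r"/(?P<slug>[A-Za-z0-9][A-Za-z0-9._-]{0,199})\.git"
    r"(?P<rest>/info/refs|/git-upload-pack|/HEAD)"
)


def is_read_only_request(path: str, query_string: str = "") -> bool:
    """Accept only fetch-side smart-HTTP endpoints."""
    match = _READ_ONLY_PATH_RE.fullmatch(path)
    if match is None:
        return False
    if match.group("rest") == "/info/refs":
        # Push discovery arrives as /info/refs?service=git-receive-pack.
        service = parse_qs(query_string).get("service", [])
        return service in ([], ["git-upload-pack"])
    return True
```

With this check in front of the CGI, a push attempt is rejected at the routing layer and `git http-backend` is never spawned for it.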


When git.enabled, the service mirrors each CodeCollection git_url into bare repos under /data/git, serves them via git smart HTTP at /git, and rewrites catalog git_url responses to public_base_url/<slug>.git once mirrored. Includes scheduled sync, admin trigger, and status API.
Note
Medium Risk
Adds new network-facing git smart-HTTP surface plus scheduled/admin-driven git sync that shells out to `git`, and introduces lightweight DB schema migration logic; misconfiguration or edge cases could expose/serve unintended repos or cause sync/load issues.
Overview
Adds an optional git mirror service to `cc-catalog-svc`: when `git.enabled` is set, the app maintains bare mirrors of CodeCollection `git_url`s, serves them read-only over smart HTTP (mounted at `git.mount_path`), and exposes a new status API (`GET /api/v1/git/repos*`).
Integrates git syncing into operations via scheduler support (`scheduler.git_sync_minutes`) and a new admin trigger (`POST /api/v1/admin/sync-git` with an air-gap override flag), and updates catalog responses to rewrite `git_url` to `git.public_base_url/<slug>.git` when a local mirror exists.
Updates packaging and release flow to support build-time "baked" mirrors: CI passes BuildKit secret/token and build args, the Dockerfile adds a `git-bake` stage to clone mirrors into `/opt/cc-catalog/git`, and runtime includes `git` plus startup backfill of baked HEAD commits. Includes DB/model changes to persist git sync metadata and a small `init_db` in-place column-addition mechanism, plus new air-gap/bake config examples and documentation.
Reviewed by Cursor Bugbot for commit 50920bf. Bugbot is set up for automated code reviews on this repo. Configure here.