Merged
3 changes: 3 additions & 0 deletions .github/workflows/cc-catalog-svc.yaml
Original file line number Diff line number Diff line change
@@ -206,6 +206,9 @@ jobs:
build-args: |
GITHUB_SHA=${{ github.sha }}
GITHUB_REF=${{ github.ref }}
BAKE_GIT_MIRRORS=true
secrets: |
github_token=${{ secrets.GITHUB_TOKEN }}

- name: Release summary
if: needs.create-release.outputs.release_tag != ''
55 changes: 52 additions & 3 deletions cc-catalog-svc/Dockerfile
@@ -1,5 +1,6 @@
# syntax=docker/dockerfile:1.4
# cc-catalog-svc image.
#
# Two layers worth of intent:
#
# 1. A small Python runtime (python:3.12-slim) running a single uvicorn
@@ -10,11 +11,20 @@
# the de-facto tool for OCI image copy. We pin a specific version so
# builds are reproducible.
#
# Air-gap git mirrors:
# CI has outbound git access; the running container may not. When
# BAKE_GIT_MIRRORS=true (default in CI), the git-bake stage clones bare
# mirrors into /opt/cc-catalog/git — outside the /data volume so a PVC
# mount does not hide baked repos. Deploy with git.data_dir=/opt/cc-catalog/git
# and git.runtime_sync=false (see config-examples/config.airgap.yaml).
#
# Read-only-rootfs friendly: the only writable path the service needs is
# /data (SQLite DB + crane's auth cache). Mount a volume there in k8s.

ARG PYTHON_VERSION=3.12-slim
ARG CRANE_VERSION=v0.20.2
ARG BAKE_GIT_MIRRORS=true
ARG BAKE_CONFIG=config-examples/config.bake.yaml

# -----------------------------------------------------------------------------
# Stage 1: fetch crane. Doing this in a tiny scratch-ish stage keeps the
@@ -39,7 +49,39 @@ RUN set -eux; \
/usr/local/bin/crane version

# -----------------------------------------------------------------------------
# Stage 2: runtime image.
# Stage 2: bake git mirrors (CI only — needs outbound network).
# -----------------------------------------------------------------------------
FROM python:${PYTHON_VERSION} AS git-bake
ARG BAKE_GIT_MIRRORS
ARG BAKE_CONFIG
RUN apt-get update \
&& apt-get install -y --no-install-recommends git ca-certificates \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY pyproject.toml README.md ./
COPY app ./app
COPY scripts ./scripts
COPY config-examples/config.bake.yaml ./config-examples/config.bake.yaml
RUN pip install --no-cache-dir .
RUN mkdir -p /opt/cc-catalog/git
RUN --mount=type=secret,id=github_token \
if [ "${BAKE_GIT_MIRRORS}" = "true" ]; then \
export GITHUB_TOKEN="$(cat /run/secrets/github_token 2>/dev/null || true)"; \
python scripts/bake_git_mirrors.py \
--config "${BAKE_CONFIG}" \
--dest /opt/cc-catalog/git; \
baked="$(find /opt/cc-catalog/git -mindepth 1 -maxdepth 1 -type d -name '*.git' | wc -l)"; \
if [ "${baked}" -eq 0 ]; then \
echo "ERROR: BAKE_GIT_MIRRORS=true but no *.git dirs under /opt/cc-catalog/git" >&2; \
exit 1; \
fi; \
echo "git bake: ${baked} bare repo(s) in /opt/cc-catalog/git"; \
else \
echo "BAKE_GIT_MIRRORS=false; skipping git mirror bake"; \
fi
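
The guard at the end of the bake step counts top-level `*.git` directories and fails the build when there are none. A minimal Python sketch of the same check follows. It makes no assumption about what `scripts/bake_git_mirrors.py` does internally, and `count_baked_repos` is a hypothetical helper that is not part of this PR.

```python
from pathlib import Path


def count_baked_repos(dest: Path) -> int:
    """Count immediate child directories of ``dest`` whose name ends in ``.git``.

    Roughly equivalent to:
        find dest -mindepth 1 -maxdepth 1 -type d -name '*.git' | wc -l
    Symlinks are skipped because ``find -type d`` does not follow them.
    """
    if not dest.is_dir():
        return 0
    return sum(
        1
        for child in dest.iterdir()
        if child.is_dir() and not child.is_symlink() and child.name.endswith(".git")
    )
```

A caller that wants the Dockerfile's behaviour would exit non-zero when the count is `0` and baking was requested.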

# -----------------------------------------------------------------------------
# Stage 3: runtime image.
# -----------------------------------------------------------------------------
FROM python:${PYTHON_VERSION} AS runtime

@@ -52,10 +94,17 @@ ENV PYTHONDONTWRITEBYTECODE=1 \
# operator instinct carries over.
RUN groupadd --system --gid 1000 catalog \
&& useradd --system --uid 1000 --gid 1000 --home /home/catalog --create-home catalog \
&& mkdir -p /app /data \
&& chown -R catalog:catalog /app /data
&& mkdir -p /app /data /data/git /opt/cc-catalog/git \
&& chown -R catalog:catalog /app /data /opt/cc-catalog

COPY --from=crane /usr/local/bin/crane /usr/local/bin/crane
COPY --from=git-bake /opt/cc-catalog/git /opt/cc-catalog/git

# git binary kept for optional runtime_sync in connected environments.
RUN apt-get update \
&& apt-get install -y --no-install-recommends git \
&& rm -rf /var/lib/apt/lists/* \
&& chown -R catalog:catalog /opt/cc-catalog/git

WORKDIR /app
# README.md is referenced from pyproject.toml's `readme = "README.md"`, so
10 changes: 10 additions & 0 deletions cc-catalog-svc/README.md
@@ -142,6 +142,15 @@ Admin (writes):
|---|---|
| POST | `/api/v1/admin/reload-config` |
| POST | `/api/v1/admin/sync-catalog` |
| POST | `/api/v1/admin/sync-git?allow_runtime_sync=<bool>` |

Git mirror hosting (read-only — see [docs/GIT_MIRROR.md](docs/GIT_MIRROR.md)):

| Method | Path |
|---|---|
| GET | `/api/v1/git/repos` |
| GET | `/api/v1/git/repos/{slug}` |
| GET/POST | `/git/{slug}.git/info/refs`, `/git/{slug}.git/git-upload-pack` (smart HTTP) |
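
For orientation, a read-only fetch over git smart HTTP makes two requests: a `GET` on `info/refs` with `service=git-upload-pack`, then a `POST` to `git-upload-pack`. The sketch below only builds those URLs for a given slug. `smart_http_urls` and the example host are illustrative assumptions, and the `/git` prefix assumes the default `mount_path`.

```python
from urllib.parse import quote, urlencode


def smart_http_urls(base_url: str, slug: str) -> dict[str, str]:
    """Build the clone URL and the two smart HTTP endpoints for one mirror."""
    # Assumes the service is mounted at the default "/git" prefix.
    repo = f"{base_url.rstrip('/')}/git/{quote(slug, safe='')}.git"
    return {
        "clone": repo,
        # Ref advertisement: GET with the service name as a query parameter.
        "info_refs": f"{repo}/info/refs?{urlencode({'service': 'git-upload-pack'})}",
        # Pack negotiation and transfer: POST.
        "upload_pack": f"{repo}/git-upload-pack",
    }


urls = smart_http_urls("https://cc-catalog.example.com", "my-codecollection")
print(urls["clone"])
```

In practice `git clone <clone URL>` issues both requests itself; the breakdown is mainly useful when writing ingress rules or debugging with `curl`.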

Health:

@@ -212,5 +221,6 @@ without network, without Postgres.

- [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) — design rationale, HA, schema
- [docs/JFROG.md](docs/JFROG.md) — JFrog destination operator guide
- [docs/GIT_MIRROR.md](docs/GIT_MIRROR.md) — git mirror hosting for air-gap deployments
- [docs/PAPI_INTEGRATION.md](docs/PAPI_INTEGRATION.md) — how PAPI consumes the catalog and the `?destination=` extension
- `cc-registry-v2/docs/CCV.md` — the contract this service is compatible with (in the sibling project)
97 changes: 96 additions & 1 deletion cc-catalog-svc/app/config.py
@@ -276,6 +276,100 @@ class SchedulerConfig(BaseModel):
mirror_poll_minutes: int = Field(5, ge=1, le=1440)
mirror_workers: int = Field(2, ge=1, le=32)
per_job_timeout_seconds: int = Field(600, ge=10, le=3600)
git_sync_minutes: int = Field(
60,
ge=1,
le=1440,
description="How often to fetch upstream git repos into local mirrors.",
)


class GitAuth(BaseModel):
"""Auth for fetching upstream git repos (GitHub PAT, etc.)."""

token_env: Optional[str] = Field(
None,
description="Env var holding a Bearer token (GitHub PAT, etc.).",
)
user_env: Optional[str] = Field(
None,
description="Env var holding HTTP Basic username.",
)
pass_env: Optional[str] = Field(
None,
description="Env var holding HTTP Basic password or token-as-password.",
)

@model_validator(mode="after")
def _exactly_one_or_none(self) -> "GitAuth":
token = bool(self.token_env)
basic_any = bool(self.user_env) or bool(self.pass_env)
basic_both = bool(self.user_env) and bool(self.pass_env)
if token and basic_any:
raise ValueError("git.auth: set EITHER token_env OR user_env+pass_env, not both")
if basic_any and not basic_both:
raise ValueError("git.auth: user_env and pass_env must be set together")
return self
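
The validator accepts three shapes: no auth, a token alone, or a username and password together. The dependency-free sketch below restates that rule so the accepted and rejected combinations are easy to see. `check_git_auth` is a hypothetical stand-in for the pydantic validator, and the env var names in the usage lines are made up.

```python
from typing import Optional


def check_git_auth(
    token_env: Optional[str] = None,
    user_env: Optional[str] = None,
    pass_env: Optional[str] = None,
) -> str:
    """Return 'none', 'token', or 'basic'; raise ValueError for invalid combos."""
    token = bool(token_env)
    basic_any = bool(user_env) or bool(pass_env)
    basic_both = bool(user_env) and bool(pass_env)
    if token and basic_any:
        raise ValueError("set EITHER token_env OR user_env+pass_env, not both")
    if basic_any and not basic_both:
        raise ValueError("user_env and pass_env must be set together")
    if token:
        return "token"
    return "basic" if basic_both else "none"


print(check_git_auth())                                         # no auth configured
print(check_git_auth(token_env="GITHUB_TOKEN"))                 # token only
print(check_git_auth(user_env="GIT_USER", pass_env="GIT_PASS")) # basic pair
```

Passing only `user_env`, or a token together with either basic field, raises `ValueError`.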


class GitServiceConfig(BaseModel):
"""Local git mirror + smart HTTP service for air-gapped deployments.

When enabled, cc-catalog-svc maintains bare mirror clones of each
configured CodeCollection ``git_url`` and serves them read-only at
``mount_path``. Catalog API responses rewrite ``git_url`` to
``public_base_url/<slug>.git`` once a mirror exists.

Release images bake mirrors at Docker build time into
``/opt/cc-catalog/git`` (outside the ``/data`` volume). Set
``data_dir: /opt/cc-catalog/git`` and ``runtime_sync: false`` for
air-gapped deployments with no outbound git access.
"""

enabled: bool = False
data_dir: str = Field(
"/opt/cc-catalog/git",
description=(
"Directory for bare mirror repos (<slug>.git). Defaults to "
"/opt/cc-catalog/git so release images with build-time baked "
"mirrors work out of the box. Override to a writable path on "
"the data PVC (e.g. /data/git) if you want runtime_sync to "
"persist fetched objects across pod restarts."
),
)
mount_path: str = Field(
"/git",
description="HTTP path prefix for git smart HTTP (clone URL path).",
)
public_base_url: Optional[str] = Field(
None,
description=(
"External base URL for clone commands, e.g. "
"https://cc-catalog.example.com/git. Omit to keep upstream git_url "
"in catalog responses even when mirrors exist."
),
)
runtime_sync: bool = Field(
True,
description=(
"When false, skip scheduled/admin background fetch from upstream. "
"Use with build-time baked mirrors in air-gapped environments."
),
)
auth: GitAuth = Field(default_factory=GitAuth)
codecollections: list[str] = Field(
default_factory=list,
description=("Slugs to mirror. Empty = every CC with a git_url from sources."),
)
clone_timeout_seconds: int = Field(900, ge=30, le=7200)
fetch_timeout_seconds: int = Field(600, ge=30, le=3600)

@field_validator("auth", mode="before")
@classmethod
def _none_to_empty_auth(cls, v):
if v is None:
return {}
return v
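
The `GitServiceConfig` docstring and the `public_base_url` description together imply a simple rule for the `git_url` returned by the catalog API: rewrite only when a public base URL is configured and a mirror exists. A minimal sketch of that rule, assuming it is the whole decision; `effective_git_url` is a hypothetical name and the rewriting code itself is not shown in this diff.

```python
from typing import Optional


def effective_git_url(
    slug: str,
    upstream_url: str,
    public_base_url: Optional[str],
    mirror_exists: bool,
) -> str:
    """Pick the git_url a catalog response should carry for one CodeCollection."""
    if public_base_url and mirror_exists:
        # Mirror is present and externally addressable: point clients at it.
        return f"{public_base_url.rstrip('/')}/{slug}.git"
    # No public_base_url, or nothing mirrored yet: keep the upstream URL.
    return upstream_url


print(effective_git_url(
    "my-cc",
    "https://github.com/example/my-cc.git",
    "https://cc-catalog.example.com/git",
    mirror_exists=True,
))
```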


class CatalogAPIConfig(BaseModel):
@@ -304,10 +398,11 @@ class AppConfig(BaseModel):
storage: StorageConfig = Field(default_factory=StorageConfig)
catalog_api: CatalogAPIConfig = Field(default_factory=CatalogAPIConfig)
scheduler: SchedulerConfig = Field(default_factory=SchedulerConfig)
git: GitServiceConfig = Field(default_factory=GitServiceConfig)
sources: list[SourceConfig] = Field(default_factory=list)
destinations: list[DestinationConfig] = Field(default_factory=list)

@field_validator("storage", "catalog_api", "scheduler", mode="before")
@field_validator("storage", "catalog_api", "scheduler", "git", mode="before")
@classmethod
def _none_to_default(cls, v, info):
# `key:` in YAML => None; treat as "use defaults".
60 changes: 52 additions & 8 deletions cc-catalog-svc/app/db.py
@@ -23,7 +23,7 @@
from contextlib import contextmanager
from typing import Iterator

from sqlalchemy import create_engine, event
from sqlalchemy import create_engine, event, inspect, text
from sqlalchemy.engine import Engine
from sqlalchemy.orm import Session, sessionmaker

@@ -32,6 +32,18 @@

logger = logging.getLogger(__name__)

# Columns added after the initial schema cut. ``init_db`` ensures these
# exist on already-deployed databases without requiring a full Alembic
# pipeline. Format: {table_name: {column_name: SQL type for ADD COLUMN}}.
# Keep types portable across sqlite + postgres (no ``SERIAL`` etc.).
_LEGACY_COLUMN_ADDITIONS: dict[str, dict[str, str]] = {
"codecollections": {
"git_head_commit": "VARCHAR(80)",
"git_last_synced": "TIMESTAMP",
"git_last_sync_error": "TEXT",
},
}

_engine: Engine | None = None
_SessionLocal: sessionmaker[Session] | None = None

@@ -79,17 +91,49 @@ def get_session_factory() -> sessionmaker[Session]:


def init_db() -> None:
"""Create all tables. Idempotent.

We intentionally use `Base.metadata.create_all` rather than Alembic
for the first cut: the schema is small (5 tables), single-writer
(the scheduler), and the service is greenfield. When the schema
starts evolving we'll add Alembic; until then `create_all` keeps the
bootstrap path trivial.
"""Create all tables and apply lightweight in-place migrations.

We intentionally use ``Base.metadata.create_all`` rather than
Alembic for the first cut: the schema is small (5 tables),
single-writer (the scheduler), and the service is greenfield. When
the schema starts evolving in earnest we'll add Alembic.

``create_all`` only creates *missing* tables, so a column added to
an existing table after the initial release is never created on
upgrade, and queries that reference it then fail.
``_apply_legacy_column_additions`` closes that gap by issuing a
plain ``ALTER TABLE ... ADD COLUMN`` (valid on both sqlite and pg)
for every column registered in ``_LEGACY_COLUMN_ADDITIONS`` that
the inspector reports as missing.
"""
engine = get_engine()
logger.info("initializing schema on %s", _safe_dsn(str(engine.url)))
Base.metadata.create_all(engine)
_apply_legacy_column_additions(engine)


def _apply_legacy_column_additions(engine: Engine) -> None:
"""Idempotently add columns introduced after the initial schema."""
inspector = inspect(engine)
existing_tables = set(inspector.get_table_names())
for table_name, columns in _LEGACY_COLUMN_ADDITIONS.items():
if table_name not in existing_tables:
# Table absent even after create_all (not in the metadata); nothing to patch.
continue
present = {c["name"] for c in inspector.get_columns(table_name)}
missing = {name: ddl for name, ddl in columns.items() if name not in present}
if not missing:
continue
with engine.begin() as conn:
for col_name, col_ddl in missing.items():
logger.info(
"init_db: adding missing column %s.%s (%s)",
table_name,
col_name,
col_ddl,
)
conn.execute(
text(f'ALTER TABLE {table_name} ADD COLUMN {col_name} {col_ddl}')
)

Parallel startup column migration race

Medium Severity

Legacy git column migration uses plain ALTER TABLE ... ADD COLUMN without IF NOT EXISTS. Multiple pods starting together on Postgres can both see missing columns and race; the loser’s init_db can fail and prevent readiness.

Reviewed by Cursor Bugbot for commit ffdd304.
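
One way to tolerate the race described in this comment is to attempt the `ALTER TABLE` and, if it fails, re-check whether the column now exists before re-raising. The sketch below shows the pattern with the standard-library `sqlite3` module so it runs without a database server. It is not the fix applied in this PR, and `add_column_tolerant` and `column_names` are hypothetical helpers. On Postgres alone, `ADD COLUMN IF NOT EXISTS` is the simpler option, but SQLite does not accept that clause. Note also that on Postgres a failed statement aborts the enclosing transaction, so the re-check would need its own transaction or a savepoint.

```python
import sqlite3


def column_names(conn: sqlite3.Connection, table: str) -> set[str]:
    """Return the column names of ``table`` (row[1] of PRAGMA table_info is the name)."""
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def add_column_tolerant(
    conn: sqlite3.Connection, table: str, column: str, ddl_type: str
) -> bool:
    """Add ``column`` if absent. Return True if this call added it, else False."""
    if column in column_names(conn, table):
        return False
    try:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    except sqlite3.OperationalError:
        # Another starter may have added the column between our check and
        # our ALTER. Treat that as success; re-raise anything else.
        if column in column_names(conn, table):
            return False
        raise
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE codecollections (id INTEGER PRIMARY KEY)")
print(add_column_tolerant(conn, "codecollections", "git_head_commit", "VARCHAR(80)"))
print(add_column_tolerant(conn, "codecollections", "git_head_commit", "VARCHAR(80)"))
```

Table and column names are interpolated directly into the SQL here, as in the PR's own code, so they must come from a trusted constant such as `_LEGACY_COLUMN_ADDITIONS` and never from user input.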



@contextmanager
17 changes: 17 additions & 0 deletions cc-catalog-svc/app/git_http/__init__.py
@@ -0,0 +1,17 @@
"""Git smart HTTP serving for mirrored bare repositories."""

from app.git_http.server import (
is_valid_slug,
list_bare_repo_slugs,
make_git_wsgi_app,
repo_bare_path,
repo_exists,
)

__all__ = [
"is_valid_slug",
"list_bare_repo_slugs",
"make_git_wsgi_app",
"repo_bare_path",
"repo_exists",
]