
feat(docker): split docker-compose UI and gateway into separate services #28711

Draft
yassin-berriai wants to merge 8 commits into litellm_internal_staging from litellm_fix/split-docker-compose-ui-and-gateway

Conversation


@yassin-berriai yassin-berriai commented May 23, 2026

Summary

Resolves LIT-2815

The monolithic litellm service in docker-compose.yml served both the LLM API and the admin UI from a single container. This PR splits it into two independent services:

  • gateway — FastAPI/uvicorn API proxy (port 4000), built from gateway/Dockerfile
  • ui — nginx/Next.js admin dashboard (port 3000), built from ui/Dockerfile

The gateway uses depends_on: db: condition: service_healthy so it waits for Postgres to pass pg_isready before starting (avoids connection-refused on cold boot).
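The difference between the two `depends_on` forms can be sketched with a small model. This is a simplified illustration and not Docker Compose's implementation: the state labels (`created`, `running`, `healthy`) and the helper names are assumptions made for the example, and the `service_completed_successfully` condition is ignored.

```python
from typing import Dict, List, Union

DependsOn = Union[List[str], Dict[str, Dict[str, str]]]


def required_conditions(depends_on: DependsOn) -> Dict[str, str]:
    """Normalize both depends_on forms to {dependency: condition}."""
    if isinstance(depends_on, list):
        # The short (list) form only waits for the dependency to be started.
        return {name: "service_started" for name in depends_on}
    return {
        name: spec.get("condition", "service_started")
        for name, spec in depends_on.items()
    }


def can_start(depends_on: DependsOn, states: Dict[str, str]) -> bool:
    """Decide whether a service may start, given its dependencies' states.

    `states` maps a service name to a hypothetical label:
    'created', 'running', or 'healthy'.
    """
    for name, condition in required_conditions(depends_on).items():
        state = states.get(name, "created")
        if condition == "service_healthy" and state != "healthy":
            return False
        if condition == "service_started" and state not in ("running", "healthy"):
            return False
    return True


# The Postgres container is up, but pg_isready has not succeeded yet.
db_booting = {"db": "running"}

list_form = ["db"]
dict_form = {"db": {"condition": "service_healthy"}}

print(can_start(list_form, db_booting))         # True
print(can_start(dict_form, db_booting))         # False
print(can_start(dict_form, {"db": "healthy"}))  # True
```

With the list form the gateway is allowed to start while Postgres is still booting, which is the connection-refused case this PR is meant to avoid.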

Changes

| File | Change |
| --- | --- |
| docker-compose.yml | Replace litellm service with ui + gateway; gateway waits for db health |
| prometheus.yml | Update scrape target: litellm:4000 → gateway:4000 |
| ui/litellm-dashboard/build_ui.sh | Atomic rm+mv instead of rm -rf * + cp |
| tests/test_litellm/test_docker_compose.py | 17 new static checks (see test matrix below) |

Test Matrix

| Scenario | Expected Result |
| --- | --- |
| ui service exists | present in services dict |
| gateway service exists | present in services dict |
| Legacy litellm service is gone | absent from services dict |
| ui builds from ui/Dockerfile | build.dockerfile == "ui/Dockerfile" |
| gateway builds from gateway/Dockerfile | build.dockerfile == "gateway/Dockerfile" |
| ui exposes port 3000 | exact "3000:3000" in ports |
| gateway exposes port 4000 | exact "4000:4000" in ports |
| gateway has DATABASE_URL env var | key present in environment |
| gateway has STORE_MODEL_IN_DB env var | key present in environment |
| gateway health check configured | healthcheck.test non-empty |
| ui health check configured | healthcheck.test non-empty |
| gateway depends on db with service_healthy | dict form with condition: service_healthy |
| db service exists | present in services dict |
| db health check configured | healthcheck.test non-empty |
| prometheus service exists | present in services dict |
| Named volume postgres_data is declared | key present in top-level volumes dict |
| prometheus.yml scrapes gateway:4000 | target contains "gateway" |
| prometheus.yml does not reference litellm:4000 | old target absent |

All 17 tests pass (uv run pytest tests/test_litellm/test_docker_compose.py -v).
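A few of the checks in the matrix could look roughly like the sketch below. It is not the contents of test_docker_compose.py: the `compose` dict is a hypothetical stand-in for the parsed docker-compose.yml (the build context and environment values are placeholders), and `env_keys` is an illustrative helper that accepts both the mapping and the `KEY=value` list form of `environment`.

```python
from typing import Any, Dict, Set

# Hypothetical stand-in for the parsed docker-compose.yml; values are placeholders.
compose: Dict[str, Any] = {
    "services": {
        "gateway": {
            "build": {"context": ".", "dockerfile": "gateway/Dockerfile"},
            "ports": ["4000:4000"],
            "environment": {
                "DATABASE_URL": "<placeholder>",
                "STORE_MODEL_IN_DB": "True",
            },
        },
        "ui": {
            "build": {"context": ".", "dockerfile": "ui/Dockerfile"},
            "ports": ["3000:3000"],
        },
        "db": {},
    },
    "volumes": {"postgres_data": None},
}


def env_keys(service: Dict[str, Any]) -> Set[str]:
    """Return env var names for both the mapping and the KEY=value list form."""
    env = service.get("environment") or {}
    if isinstance(env, dict):
        return set(env)
    return {str(item).split("=", 1)[0] for item in env}


def test_services_present() -> None:
    services = compose["services"]
    assert {"ui", "gateway", "db"} <= set(services)
    assert "litellm" not in services  # legacy service must be gone


def test_build_dockerfiles() -> None:
    services = compose["services"]
    assert services["ui"]["build"]["dockerfile"] == "ui/Dockerfile"
    assert services["gateway"]["build"]["dockerfile"] == "gateway/Dockerfile"


def test_gateway_env() -> None:
    keys = env_keys(compose["services"]["gateway"])
    assert {"DATABASE_URL", "STORE_MODEL_IN_DB"} <= keys


def test_named_volume() -> None:
    assert "postgres_data" in compose["volumes"]


if __name__ == "__main__":
    for check in (
        test_services_present,
        test_build_dockerfiles,
        test_gateway_env,
        test_named_volume,
    ):
        check()
    print("all checks passed")
```

Checking for key presence rather than values keeps the tests independent of local credentials and connection strings.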

Migration note

Users who have a local prometheus.yml pointing at litellm:4000 should update the target to gateway:4000. The documentation PR (BerriAI/litellm-docs#210) updates the quick-start guide accordingly.
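For a local prometheus.yml, the target rename could be scripted along these lines. This is a minimal sketch: the sample file content (including the job name) is hypothetical, and the regular expression is only one way to avoid touching hosts that merely end in `litellm`.

```python
import re

# Match the old target only when it is a whole host:port token.
OLD_TARGET = re.compile(r"(?<![\w.-])litellm:4000\b")


def migrate_scrape_targets(text: str) -> str:
    """Rewrite litellm:4000 to gateway:4000, leaving other targets untouched."""
    return OLD_TARGET.sub("gateway:4000", text)


# Hypothetical prometheus.yml content; the job name is an assumption.
before = """scrape_configs:
  - job_name: "litellm"
    static_configs:
      - targets: ["litellm:4000"]
"""

after = migrate_scrape_targets(before)
print(after)
```

The job name is left alone because it is not followed by `:4000`; only the scrape target changes.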

https://claude.ai/code/session_01LoB2H4kqJM5cFS98gV8GPY

…ces (LIT-2815)

- Replace monolithic `litellm` service with dedicated `ui` (nginx/Next.js,
  port 3000) and `gateway` (uvicorn/FastAPI, port 4000) services, each
  building from their own Dockerfile
- Update prometheus.yml scrape target from `litellm:4000` → `gateway:4000`
- Add litellm/proxy/_experimental/out/ to .gitignore and remove the 680
  committed build artefacts; the UI bundle is now produced inside the
  Docker build, not checked in to source control
- Add tests/test_litellm/test_docker_compose.py with 17 static checks
  covering service presence, build config, ports, env vars, health checks,
  dependencies, and prometheus scrape targets

Resolves LIT-2815

https://claude.ai/code/session_01JVLLUH66aUXF9kxoHcYxWu
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


codecov Bot commented May 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


claude added 6 commits May 23, 2026 21:48
…opied artefacts

The _experimental/out/ directory was removed from source control in the
UI/gateway split (LIT-2815). Dockerfile.non_root was still copying from
that path, causing test-server-root-path CI to fail.

Replace the cp from the now-absent _experimental/out/ with an inline
npm ci + npm run build using the nodejs/npm already installed in the
builder stage, then copy from ui/litellm-dashboard/out/ instead.

https://claude.ai/code/session_01JVLLUH66aUXF9kxoHcYxWu
After removing litellm/proxy/_experimental/out/ from source control
(LIT-2815), a fresh CI checkout no longer contains that directory.
build_ui.sh used `cp -r ./out/* <dest>` which fails when <dest> does
not exist.

Replace with `rm -rf <dest> && mv ./out <dest>`, which works whether
or not the destination exists — identical to the approach already used
in the e2e_ui_testing CircleCI job (see config.yml comment).
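The behaviour described above can be reproduced with a Python analogue of the shell commands. This is an illustration only, not the contents of build_ui.sh; the directory names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def replace_dir(src: Path, dest: Path) -> None:
    """Python analogue of `rm -rf <dest> && mv ./out <dest>`."""
    shutil.rmtree(dest, ignore_errors=True)  # no error if dest is missing
    shutil.move(str(src), str(dest))         # dest is gone, so src is renamed to it


def demo(dest_exists: bool) -> tuple:
    """Build a throwaway tree, run replace_dir, and report the result."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        out = root / "out"
        out.mkdir()
        (out / "index.html").write_text("new build")

        dest = root / "proxy_out"  # hypothetical destination name
        if dest_exists:
            dest.mkdir()
            (dest / "stale.html").write_text("old build")

        replace_dir(out, dest)
        return sorted(p.name for p in dest.iterdir()), out.exists()


print(demo(dest_exists=True))   # (['index.html'], False)
print(demo(dest_exists=False))  # (['index.html'], False)
```

Both cases end with only the fresh build in the destination and no leftover `out` directory. Note that the two steps are not a single atomic operation: between the removal and the move, the destination briefly does not exist.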

https://claude.ai/code/session_01JVLLUH66aUXF9kxoHcYxWu
…n_root

Per review feedback, keep litellm/proxy/_experimental/out/ tracked in git
so the PR diff stays focused on the docker-compose split. Remove the
gitignore entry added in the initial commit and restore Dockerfile.non_root
to its original approach of staging the pre-built UI artefacts.

The build_ui.sh rm+mv fix (avoiding cp-as-child) is retained as a
standalone improvement.

Resolves LIT-2815

https://claude.ai/code/session_01JVLLUH66aUXF9kxoHcYxWu
…al state

Complete the revert started in the previous commit — remove the
litellm/proxy/_experimental/out/ gitignore entry and restore
Dockerfile.non_root to copy from the checked-in static export rather than
building from npm source.

https://claude.ai/code/session_01JVLLUH66aUXF9kxoHcYxWu
…rt (LIT-2815)

Add the `backend` service (port 4001, backend/Dockerfile) to docker-compose.yml
to mirror the three-service architecture in helm/litellm: gateway (4000),
backend (4001), ui (3000). Update tests to cover the new service.

https://claude.ai/code/session_01JVLLUH66aUXF9kxoHcYxWu
The backend service at port 4001 was not part of LIT-2815 (UI/gateway
split) and duplicated gateway's config without serving a defined role.
Remove it so the compose file matches the two-service architecture
described in the ticket and PR.

Also trim the seven backend-specific test cases and the unused `import
os` from test_docker_compose.py; all 17 remaining checks still pass.

Resolves LIT-2815
Contributor Author

@greptileai


Generated by Claude Code


greptile-apps Bot commented May 23, 2026

Greptile Summary

This PR splits the monolithic litellm Docker service into two independent services — gateway (FastAPI/uvicorn on port 4000) and ui (nginx/Next.js on port 3000) — and updates prometheus.yml to scrape the renamed gateway service. Both services get their own Dockerfiles, healthchecks, and the gateway now properly waits for Postgres readiness via condition: service_healthy.

  • docker-compose.yml: litellm service removed; gateway uses depends_on: db: condition: service_healthy and carries over all existing environment variables and healthcheck config; ui is a new nginx-served frontend service with its own healthcheck.
  • tests/test_litellm/test_docker_compose.py: 17 new static checks verify service presence, build config, exact port mappings, env vars, healthchecks, and Prometheus target; port assertions use exact string equality.
  • build_ui.sh: rm -rf dir && mv out dir replaces the old multi-step copy, making the output handoff atomic.

Confidence Score: 5/5

Safe to merge — the changes are confined to Docker/infra config and a new static test file, with no modifications to application logic.

All application code is untouched. The docker-compose split is clean, the service_healthy dependency is correctly wired, and the new tests use exact port-equality assertions. The only gap is a minor one in the test's else branch for depends_on, which has no effect on the current configuration.

No files require special attention; the minor test-coverage gap in test_gateway_depends_on_db is low risk.
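One way to close that gap would be to reject the list form outright instead of branching on it. The helper below is a sketch under that assumption; the body of the real test_gateway_depends_on_db is not shown in this thread, and the function name here is hypothetical.

```python
from typing import Any, Dict


def assert_waits_for_healthy_db(gateway: Dict[str, Any]) -> None:
    """Fail unless gateway depends on db with condition: service_healthy."""
    depends_on = gateway.get("depends_on")
    # Require the long (mapping) form; the list form cannot carry a condition.
    assert isinstance(depends_on, dict), (
        f"depends_on must use the mapping form, got {type(depends_on).__name__}"
    )
    assert "db" in depends_on, "gateway must depend on db"
    condition = (depends_on["db"] or {}).get("condition")
    assert condition == "service_healthy", f"unexpected condition: {condition!r}"


# Passes: long form with the health condition.
assert_waits_for_healthy_db(
    {"depends_on": {"db": {"condition": "service_healthy"}}}
)

# Fails: the list form is rejected rather than silently accepted.
try:
    assert_waits_for_healthy_db({"depends_on": ["db"]})
except AssertionError as exc:
    print(f"rejected: {exc}")
```

With a single code path there is no branch in which the `service_healthy` assertion can be skipped.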

Important Files Changed

| Filename | Overview |
| --- | --- |
| docker-compose.yml | Splits monolithic litellm service into ui (nginx/Next.js, port 3000) and gateway (FastAPI, port 4000); gateway now uses condition: service_healthy for depends_on: db, and both services have healthchecks configured. |
| prometheus.yml | One-line update: scrape target changed from litellm:4000 to gateway:4000 to match the renamed service. |
| tests/test_litellm/test_docker_compose.py | New static-analysis test file with 17 checks; port assertions now use exact equality; test_gateway_depends_on_db verifies service_healthy in the dict branch but the else branch skips that check, leaving a small coverage gap. |
| ui/litellm-dashboard/build_ui.sh | Replaced rm -rf dir/* && cp -r out/* dir && rm -rf out with the cleaner rm -rf dir && mv out dir, making the build output replacement atomic and resilient to a missing destination directory. |

Reviews (2): Last reviewed commit: "fix(docker): address Greptile review and..."

Comment thread tests/test_litellm/test_docker_compose.py
Contributor Author

Addressed both items from the 4/5 review:

  1. depends_on readiness — switched gateway from the list form to condition: service_healthy, so it won't start until pg_isready passes on the db container.
  2. Port-test exactness — replaced substring matching with exact "3000:3000" / "4000:4000" string equality; also extended test_gateway_depends_on_db to assert the service_healthy condition.

Also restored uv.lock to its pre-test-run state (the previous commit had dropped the [options] exclude-newer section because local uv couldn't parse the relative "3 days" duration, breaking uv lock --check and uv sync --frozen in CI).

@greptileai


Generated by Claude Code

…tests)

Two issues raised in Greptile 4/5 review:

1. gateway depends_on used the list form, which only waits for the db
   container to start, so the gateway could come up before Postgres is
   ready. Switch to condition: service_healthy so the gateway waits for
   pg_isready to pass on cold boots.

2. Port tests used substring matching ("3000" in str(p)) which could
   pass spurious mappings like "13000:3000". Change to exact string
   equality "3000:3000" / "4000:4000". Also assert service_healthy
   in test_gateway_depends_on_db.
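The difference between the two port checks can be shown in a few lines. The helper names are hypothetical; only the `"3000" in str(p)` pattern and the exact-equality strings come from the commit message above.

```python
from typing import Iterable


def exposes_port_loose(ports: Iterable[object], port: str) -> bool:
    """Old style: substring match against each port entry."""
    return any(port in str(p) for p in ports)


def exposes_port_exact(ports: Iterable[object], mapping: str) -> bool:
    """New style: the exact "host:container" mapping must be listed."""
    return mapping in [str(p) for p in ports]


spurious = ["13000:3000"]

print(exposes_port_loose(spurious, "3000"))               # True
print(exposes_port_exact(spurious, "3000:3000"))          # False
print(exposes_port_exact(["3000:3000"], "3000:3000"))     # True
```

The substring check accepts the spurious mapping because "3000" appears inside "13000:3000"; the exact comparison does not.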

Resolves LIT-2815
@yassin-berriai yassin-berriai force-pushed the litellm_fix/split-docker-compose-ui-and-gateway branch from 44cb2ee to 9275c6d Compare May 23, 2026 22:46