fix(ops-controller): inject decrypted runtime secrets into compose subprocesses #60
Merged
AlienWalker1995 merged 1 commit on Jun 26, 2026
The ops-controller recreates services via its own docker-compose subprocess
(dashboard recreate, POST /compose/*, /services/*/recreate), but those only saw
the auto-loaded `.env` — never `~/.ai-toolkit/runtime/.env`. So any
secret-dependent service it recreated came up with secrets unset and crash-looped
(oauth2-proxy: `cookie_secret must be 16, 24, or 32 bytes`). This is the root
cause behind the 2026-06-26 outage and the bandaids that followed it.
Fix (least privilege): mount the already-decrypted `runtime/.env` read-only into
ops-controller and inject it into the compose subprocess env via a shared
`_compose_env()` helper (also de-duplicates 3 inline env constructions). compose
interpolates `${VAR}` from the process env, so recreated secret services now get
real values. ops-controller gets the decrypted env only — never the age key;
decryption stays host-only.
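As a rough illustration of the injection mechanism (not the code in this commit), a child process only sees the variables passed to it through `env=`. In the sketch below, `run_child_with_env` and `DEMO_COOKIE_SECRET` are hypothetical names, the value is fabricated, and a Python one-liner stands in for the compose subprocess.

```python
import os
import subprocess
import sys


def run_child_with_env(extra_env):
    """Run a child process with extra variables merged over the parent env.

    The child is a stand-in for the compose subprocess: it only echoes one
    variable so the effect of the injection is visible.
    """
    # Later keys win, so extra_env overrides anything inherited from os.environ.
    env = {**os.environ, **extra_env}
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ.get('DEMO_COOKIE_SECRET', ''))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Fabricated value, for illustration only.
    print(run_child_with_env({"DEMO_COOKIE_SECRET": "fabricated-value"}))
```

Without the merged variables the child sees only the parent's environment, which matches the failure mode described above: the recreated service starts with its secrets unset.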
- ops-controller/main.py: add `_load_runtime_env()` + `_compose_env()`; use in
`_recreate_service`, `_run_compose`, and the `/services/*/recreate` endpoint.
Add caddy/oauth2-proxy/searxng to ALLOWED_SERVICES (now safe to recreate).
- docker-compose.yml: mount `${HOME}/.ai-toolkit/runtime/.env:/run/runtime.env:ro`
+ `RUNTIME_ENV_FILE`; correct the stale watchdog comment.
- docs: secrets runbook gains a "how services receive secrets at runtime" section
(two --env-file model + ops-controller injection + local-only boundary);
README + secrets/README updated to match.
- tests: parsing, missing-file/dir degradation, runtime-overrides-process-env,
extra-overrides-all (fabricated values only).
No secret values are committed — only path references and architecture.
Validated end-to-end: ops-controller `/compose/up oauth2-proxy` (the previously
broken path) brings it up healthy with no cookie_secret error; 36 ops-controller
tests pass; ruff clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
The ops-controller recreates services via its own `docker-compose` subprocess (dashboard "recreate", `POST /compose/*`, `/services/*/recreate`). Those subprocesses only ever saw the auto-loaded `.env`, never `~/.ai-toolkit/runtime/.env`. So any secret-dependent service ops-controller recreated came up with its secrets unset and crash-looped (oauth2-proxy: `cookie_secret must be 16, 24, or 32 bytes`).
This is the root cause behind the 2026-06-26 oauth2-proxy outage (and the bandaids that followed: placeholder secrets in `.env`, empty stub files, compose-path rewrites).
Fix (least privilege)
- Mount the already-decrypted `runtime/.env` read-only into ops-controller (`/run/runtime.env`) and inject it into the compose subprocess env via a shared `_compose_env()` helper (which also de-duplicates 3 inline `env` constructions). docker-compose interpolates `${VAR}` from the process env, so recreated secret services now get real values.
- ops-controller gets the decrypted env only, never the age key; decryption stays host-only (`make decrypt-secrets`). A compromised ops-controller (already docker-socket-privileged) leaks current secrets but cannot decrypt `.sops` history or re-derive the key.
- `caddy`/`oauth2-proxy`/`searxng` added to `ALLOWED_SERVICES`: now safe to recreate. (The self-heal watchdog already covers them via its exclude-list model; the "hermes-only" comment was stale and is corrected.)
Changes
- `ops-controller/main.py`: `_load_runtime_env()` + `_compose_env()`; used in `_recreate_service`, `_run_compose`, and the `/services/*/recreate` endpoint. `ALLOWED_SERVICES` += caddy/oauth2-proxy/searxng.
- `docker-compose.yml`: read-only `runtime/.env` mount + `RUNTIME_ENV_FILE`; corrected watchdog comment.
- `docs/runbooks/secrets.md`: new "How services receive secrets at runtime" section (two-`--env-file` model, ops-controller injection, local-only boundary). README + `secrets/README.md` updated to match.
- `tests/test_ops_controller_compose_env.py`: parsing, missing-file/dir degradation, runtime-overrides-process-env, extra-overrides-all (fabricated values only).
No secret values are committed, only path references and architecture.
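
For readers who want the shape of the helpers without opening the diff, here is a minimal sketch of the behaviour the tests describe: parse the runtime file, degrade to an empty mapping when it is missing or is a directory, and merge with the precedence process env < runtime file < extra. The function names mirror the PR, but the signatures, the `base` parameter, and the quote handling are assumptions, not the merged implementation.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def load_runtime_env(path: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; return {} if the file is unusable."""
    p = Path(path)
    # Covers both a missing file and a directory left behind by a bind
    # mount whose host-side source did not exist.
    if not p.is_file():
        return {}
    try:
        text = p.read_text(encoding="utf-8")
    except OSError:
        return {}
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        # Assumption: strip one pair of matching surrounding quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        if key:
            env[key] = value
    return env


def compose_env(
    runtime_path: str,
    extra: Optional[Mapping[str, str]] = None,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build the subprocess env: base < runtime file < extra."""
    env = dict(os.environ if base is None else base)
    env.update(load_runtime_env(runtime_path))
    if extra:
        env.update(extra)
    return env
```

The returned mapping would be passed as `env=` to `subprocess.run`, so compose can interpolate `${VAR}` from it. Returning an empty mapping on a missing file keeps recreate working for services that need no secrets.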
Validation
- `docker exec ops-controller` → `POST /compose/up {"service":"oauth2-proxy"}` (the previously-broken path) → oauth2-proxy comes up healthy in ~31s, no `cookie_secret` error. Verified the 32-byte cookie secret is read inside ops-controller.
- `ruff` clean.
🤖 Generated with Claude Code