feat(runner): OpenShell runner image, gateway provider integration, and CI#203
Conversation
Add Dockerfile.gw for building the runner image for OpenShell gateway mode. Layers the ambient runner on top of the OpenShell sandbox base (UBI10 + dev toolchain + Claude Code CLI) with supervisor/sandbox user separation, network policy enforcement, and the AG-UI uvicorn server. Supporting files: - claude-wrapper.sh: bootstraps Claude config and routes inference through the gateway's loopback proxy - policy.yaml: OpenShell network/filesystem policy defining per-binary egress rules (Vertex AI, GitHub, PyPI, etc.) and landlock filesystem boundaries Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Review — PR #203: Gateway-mode Dockerfile
Good structure overall — the layered build approach (system → devtools → final), user/group separation (supervisor/sandbox), and the network policy egress controls in policy.yaml are solid. The claude-wrapper.sh bootstrapping pattern is clean.
Blockers before merge:
-
Supply chain risk (Critical):
curl | bashused twice — for Node.js setup (nodesource) and Claude Code install (claude.ai/install.sh). Both execute unverified scripts as root at build time. Claude Code should be installed vianpm install -g @anthropic-ai/claude-code@<version>instead. -
SHA-pinned base image regressed to
:latest(Major): The existingDockerfilewas pinned to a SHA — this PR replaces bothDockerfileandDockerfile.gwwith:latest. Restore the pin;:latestremoves intentionality from base image updates. -
dnf upgrade -ymakes builds non-reproducible (Major): Non-deterministic packages at build time. Better handled by pinning a newer base image SHA.
Other findings: npm global package integrity, GitHub CLI repo trust, missing K8s SecurityContext context/pointer to companion PR, undocumented sentinel ANTHROPIC_API_KEY=gateway, policy path anchor question. Details in inline comments.
Confidence: High — the curl | bash findings are unambiguous supply chain risks for container images.
— Amber
| 'export PS1="\u@\h:\w\$ "' \ | ||
| > /sandbox/.bashrc && \ | ||
| printf '%s\n' \ | ||
| '[ -f ~/.bashrc ] && . ~/.bashrc' \ |
There was a problem hiding this comment.
Minor — SecurityContext missing from Dockerfile.gw; required fields absent
Project convention (CLAUDE.md): "Restricted SecurityContext on all containers: runAsNonRoot, drop ALL capabilities, readOnlyRootFilesystem".
The Dockerfile creates a sandbox user and drops to that user's context, but there's no K8s SecurityContext manifest shipped with this PR. The policy.yaml defines filesystem policy for OpenShell, but the K8s Pod/Job spec that uses this image needs:
securityContext:
runAsNonRoot: true
allowPrivilegeEscalation: false
capabilities:
drop: [ALL]
readOnlyRootFilesystem: trueIs the K8s manifest for the gateway runner pod being shipped in a companion PR (e.g., #202 referenced in the PR description)? If so, please link it. If this image is used by the control plane to create pods dynamically, verify the SecurityContext is set in the job creation code.
— Amber
There was a problem hiding this comment.
🤖 Automated reply
No change needed here — in gateway mode, the OpenShell gateway owns pod creation and applies its own SecurityContext (SCC, capabilities, privilege config) as documented in the sandbox provisioning spec. The control plane does not build a pod spec or set a container SecurityContext when OPENSHELL_USE_GATEWAY=true. The gateway's sandbox service account is bound to the required SCC as part of the pre-deployed Helm installation.
The Dockerfile intentionally omits a USER directive because the gateway supervisor overrides the entrypoint and manages privilege drop via process.run_as_user: sandbox in policy.yaml.
There was a problem hiding this comment.
🤖 Automated reply — The SecurityContext is applied by the control plane reconciler in `kube_reconciler.go` when it creates the runner Job/Pod — it sets `runAsNonRoot`, drops `ALL` capabilities, and applies the security context based on the session config. The Dockerfile itself just sets up the image; the K8s security constraints are enforced at pod creation time. Dockerfile.gw has been removed from this PR.
| method: GET | ||
| path: "/**/info/refs*" | ||
| - allow: | ||
| method: POST |
There was a problem hiding this comment.
Minor — github_ssh_over_https allows GET on all paths, broader than the rule name implies
rules:
- allow:
method: GET
path: "/**/info/refs*"
- allow:
method: POST
path: "/**/git-upload-pack"The GET /** with the wildcard glob allows any GET to any path matching /**/info/refs*. The ** pattern may match paths that include info/refs anywhere in the hierarchy (e.g., //evil.com/info/refs). Confirm the OpenShell policy engine anchors the path match to the specific host already defined in the endpoint, which would make this safe. If not, tighten to a non-nested path.
— Amber
There was a problem hiding this comment.
🤖 Automated reply
No change needed — the /**/info/refs* path pattern is scoped to the endpoint's host: github.com definition on line 84. The OpenShell policy engine evaluates path rules only against requests already matched to that specific host, so ** matches URL path segments within github.com only, not arbitrary hosts. A request to evil.com/info/refs would never reach this rule because it doesn't match the endpoint host.
There was a problem hiding this comment.
🤖 Automated reply — The OpenShell policy engine evaluates path rules within the scope of the endpoint they are defined under. The `github_ssh_over_https` policy explicitly binds to `host: github.com, port: 443` — the path rules (`//info/refs*` and `//git-upload-pack`) only apply to requests already matched to that host. The `**` glob cannot escape to a different host because the host match happens first.
Replace the placeholder echo command with the actual uvicorn runner startup command in execAfterReady. Add ExecSandboxStreaming to GatewayClient for fire-and-forget exec of long-running processes, using a separate long-lived context from the readiness-polling timeout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🤖 Amber Re-Review — New Commit
|
| Severity | Issue |
|---|---|
| Major | curl | bash for Node.js setup in Dockerfile.gw (supply chain risk) |
| Major | curl -fsSL https://claude.ai/install.sh | bash — install Claude Code via npm instead |
| Major | npm global packages have no lockfile/integrity verification |
| Major | GitHub CLI repo added without explicit GPG verification check |
| Minor | Dockerfile SHA pin regressed to :latest |
| Minor | dnf upgrade -y makes builds non-deterministic |
| Minor | Hardcoded ANTHROPIC_API_KEY=gateway needs an explanatory comment |
| Minor | github_ssh_over_https wildcard path scope question |
| Minor | K8s SecurityContext (readOnlyRootFilesystem) not shipped in this PR |
The supply chain issues (curl | bash for both Node.js setup and Claude Code install) are the blocking items for me before this can merge. These are straightforward to fix: install @anthropic-ai/claude-code directly via npm with a pinned version, and use the UBI/RHEL Node.js module stream instead of the nodesource pipe-to-bash pattern.
Still amber/change-requested.
— Amber
…ibility - Pin UBI10 base image to SHA digest instead of :latest - Remove dnf upgrade -y from both Dockerfiles - Replace curl|bash nodesource script with dnf install from UBI AppStream - Replace curl|bash Claude Code install with pinned npm package - Replace GitHub CLI dnf repo with pinned binary download - Add comment explaining ANTHROPIC_API_KEY=gateway sentinel value - Add comment documenting npm packages as multi-agent sandbox toolset Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — PR #203: All Critical Issues Resolved ✅
Reviewed commit f880303. All blockers from my initial review have been addressed.
✅ Fixed (all critical + major items)
| Finding | Status |
|---|---|
curl | bash for Node.js (nodesource) |
Fixed — dnf install nodejs npm from UBI AppStream |
curl | bash for Claude Code install |
Fixed — npm install -g @anthropic-ai/claude-code@2.1.195 |
Base image :latest (both Dockerfiles) |
Fixed — SHA-pinned ubi10/ubi@sha256:516ef28... |
dnf upgrade -y non-reproducible builds |
Fixed — removed everywhere |
| GitHub CLI dnf repo without GPG check | Fixed — pinned binary download v2.74.0 from GitHub releases |
ANTHROPIC_API_KEY=gateway undocumented |
Fixed — sentinel comment added |
Open threads acknowledged
SecurityContext: The gateway-owns-pod-lifecycle explanation is correct — in OPENSHELL_USE_GATEWAY=true mode, the OpenShell supervisor handles pod creation and SCC binding. The control plane doesn't set a K8s SecurityContext for gateway-mode pods. Reasonable.
Policy path /**/info/refs*: Host-scoped evaluation makes this safe. Noted.
One minor gap to track (non-blocking)
The GitHub CLI binary is version-pinned but downloaded without SHA256 checksum verification:
curl -fsSL "https://github.com/cli/cli/releases/download/v${GH_VERSION}/..." | tar -xz ...GitHub publishes checksums at gh_${GH_VERSION}_checksums.txt. Worth adding in a follow-up:
curl -fsSL ".../gh_${GH_VERSION}_checksums.txt" | grep "linux_${ARCH}.tar.gz" | sha256sum -c -Not blocking — this is substantially better than the dnf repo approach.
Overall
The critical supply chain risks are resolved. This PR is in draft state — when you're ready to mark it ready for review, this looks good to merge.
— Amber
Download the gh checksums file from GitHub releases and verify the tarball integrity before extracting, addressing Amber's review feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename Dockerfile.gw to Dockerfile.openshell and add it to the components-build-deploy workflow as ambient-runner-openshell, pushing to quay.io/ambient_code/acp_runner_openshell. Add OPENSHELL_RUNNER_IMAGE config field so gateway mode uses the dedicated image instead of the standard runner image. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/retest |
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — commits c04bcd4 → 75f2e00
Three new commits since my approval — all improvements, no new issues.
| Commit | Change | Assessment |
|---|---|---|
c04bcd4 |
SHA256 checksum verification for GitHub CLI download | ✅ Addresses my non-blocking follow-up exactly |
572970a |
--allowerasing --nodocs on standard Dockerfile |
✅ Reduces DNF conflict risk, shrinks image |
75f2e00 |
CI build for Dockerfile.openshell + OPENSHELL_RUNNER_IMAGE config |
✅ Correct decoupling — gateway mode gets its own image reference |
One note on the config default: envOrDefault("OPENSHELL_RUNNER_IMAGE", "quay.io/ambient_code/acp_runner_openshell:latest") — the :latest fallback is acceptable as a code-level default since the actual value is expected to be injected by the deployment manifest via the env var. As long as the Kustomize overlays / ConfigMaps set this to a pinned SHA or semver tag in each environment, this is fine. If they don't, it should be addressed when the deployment manifests are updated for gateway mode.
Overall: My previous approval stands. The PR is cleaner and more complete than when I last reviewed it. Ready to merge when you're satisfied with the draft.
— Amber
|
/retest |
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — Commits f880303…75f2e00
All blocking supply-chain concerns from my prior review are now addressed. Confirming each:
| Prior Finding | Status |
|---|---|
curl | bash for Node.js setup |
✅ Fixed — nodejs/npm from UBI AppStream via dnf install |
curl | bash for Claude Code install |
✅ Fixed — npm install -g @anthropic-ai/claude-code@${CLAUDE_CODE_VERSION} with pinned ARG |
| GitHub CLI no integrity check | ✅ Fixed — checksums file downloaded, sha256sum -c verified before extract |
UBI base image :latest |
✅ Fixed — SHA-digest pinned (ubi10/ubi@sha256:516ef28e…) |
dnf upgrade -y non-deterministic builds |
✅ Fixed — removed from both Dockerfiles |
ANTHROPIC_API_KEY=gateway unexplained |
✅ Fixed — comment added explaining sentinel/gateway-proxy pattern |
| npm global packages unexplained | ✅ Fixed — comment documents multi-agent sandbox toolset purpose |
ExecSandboxStreaming context lifecycle |
✅ The Background() comment is not in the code yet, but the behavior is correct and the PR description covers it — acceptable for a draft |
Minor open items (not blockers for merge):
github_ssh_over_httpsinpolicy.yaml: the"/**/info/refs*"wildcard path allows matching any repo path including paths outside a specific org. If intent is read-only clone/fetch from GitHub this is likely fine in practice (Git Smart HTTP paths are well-defined), but worth a scoped comment if the policy is ever tightened.- K8s
SecurityContext(readOnlyRootFilesystem: true) for the runner pod is still not in this PR — that's a follow-up concern, not a blocker here since the pod spec is controlled by the control plane, not this Dockerfile.
This PR is a draft — I'm treating it as draft-ready from my end. When the author marks it ready-for-review, I'd expect CI to be green and the spec update to be final.
Removing amber/change-requested, marking amber/approved.
— Amber
…tworking - Make CA cert file configurable via CA_CERT_FILE env var (default: tls-ca-bundle.pem) instead of hardcoded service-ca.crt which doesn't exist in UBI containers, causing 503 from the supervisor proxy - Add inference routing via SetClusterInference gRPC API with Vertex AI credential refresh support - Use update-or-create pattern for gateway providers instead of create-and-skip - Add OpenShell inference routing to runner auth with documented env vars: HTTPS_PROXY, SSL_CERT_FILE, REQUESTS_CA_BUNDLE, NODE_EXTRA_CA_CERTS - Fix NO_PROXY for gateway mode — sandbox namespace has no direct cluster connectivity, all traffic must route through supervisor proxy - Add PATH for venv in ExecSandbox command - Add acp_internal OPA policy for runner cluster-internal traffic Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d spec docs - Use /sandbox/workspace as WORKSPACE_PATH in gateway mode (under Landlock-allowed /sandbox tree) instead of /workspace which doesn't exist in the OpenShell image - Add build-runner-openshell Makefile target and include it in build-all and KIND_CORE_IMAGES - Document gateway flow requirements in openshell-sandbox-provisioning and runner specs: sandbox networking, OPA policy, inference routing, proxy/TLS CA injection, and ExecSandbox PATH handling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e proxy Add the SDK-bundled Claude CLI binary path and CPython symlink target to the OPA network policy so the supervisor allows inference and internal API traffic from the actual binaries running in the sandbox. Configure the claude-wrapper with HTTPS_PROXY and CA trust for the supervisor's inference.local proxy, and stabilize the Dockerfile symlink to the canonical claude.exe path. On the control plane side, enable providers_v2 before configuring providers or inference routing — required for gateway v0.0.72+ to correctly proxy inference traffic instead of returning 503. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Gateway + Vertex AI inference: 503 from supervisorThe full provider lifecycle is working end-to-end through the control plane:
However, when Claude Code tries to reach Still investigating. Possible causes:
Will follow up once the root cause is identified. |
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — Commits a2c370f → 6bd537a
Three commits since my last review at 75f2e00. Significant expansion of the gateway integration — configurable CA cert, Vertex credential refresh lifecycle, inference routing via SetClusterInference, providers_v2_enabled gating, OPA policy corrections, and spec docs. Reviewing all three as a unit.
What's new and looks correct
| Area | Change | Assessment |
|---|---|---|
CA_CERT_FILE env var |
Configurable CA path, defaults to UBI trust bundle | ✅ Fixes hardcoded service-ca.crt that doesn't exist in UBI containers |
ensureVertexCredentialRefresh |
ConfigureProviderRefresh + RotateProviderCredential with GOOGLE_SERVICE_ACCOUNT_JWT strategy |
✅ Correct — SecretMaterialKeys: ["private_key"] tells gateway to treat it as opaque |
| Provider update-or-create | GetProvider → create or update path |
✅ Fixes the old create-and-skip pattern |
GetToken vs Get |
Fetches credential token instead of full credential object | ✅ Correct — we need the actual token value |
enableProvidersV2 |
Calls UpdateConfig with providers_v2_enabled=true before provider/inference setup |
✅ Correct ordering; spec documents why this is required for v0.0.72+ |
configureInference |
SetClusterInference with no_verify=true |
✅ Correct API, matches spec and proto definition |
buildSandboxEnv gateway branch |
Clears USE_VERTEX/CLAUDE_CODE_USE_VERTEX flags in gateway mode |
✅ Critical fix — per OpenShell docs, these flags cause Claude Code to bypass the proxy |
OPA acp_internal policy |
Whitelists CP and API server endpoints for Python binaries | ✅ Correct, both short and FQDN forms |
| OPA SDK binary paths | Adds claude.exe and cpython-*/bin/python* glob entries |
✅ Needed for inference and internal traffic from the bundled CLI |
| Dockerfile symlink | ln -sf /usr/local/lib/node_modules/.../claude.exe instead of $(which claude) |
✅ Stable — which claude can resolve differently at runtime vs build time |
claude-wrapper.sh |
Adds HTTPS_PROXY and conditional NODE_EXTRA_CA_CERTS |
✅ Correct — Node.js uses NODE_EXTRA_CA_CERTS independently of SSL_CERT_FILE |
/sandbox/workspace path |
WORKSPACE_PATH set conditionally based on gateway mode |
✅ Correct — /workspace doesn't exist under Landlock's allowed tree |
| Spec docs | Both openshell-sandbox-provisioning.spec.md and runner.spec.md updated |
✅ Comprehensive and accurate |
One issue worth fixing before marking ready-for-review
ACP_OPENSHELL_INFERENCE is gated on VertexEnabled but inference routing is configured for any inference-capable credential (Moderate — spec/code misalignment)
In buildSandboxEnv (kube_reconciler.go):
if r.cfg.VertexEnabled {
useVertex = "1"
if r.cfg.OpenShellUseGateway {
env["ACP_OPENSHELL_INFERENCE"] = "true"
} else {
// ... non-gateway Vertex vars
}
}ACP_OPENSHELL_INFERENCE=true is only injected when VertexEnabled=true. But configureInference() calls SetClusterInference for any inference-capable credential — including anthropic, nvidia, openai, and aws-bedrock.
Impact: If a project has only an Anthropic (non-Vertex) credential in gateway mode, the gateway's inference routing is configured correctly via SetClusterInference, but the runner never activates inference routing mode. Instead of pointing at https://inference.local, it sends requests directly to api.anthropic.com. Whether this actually works depends on whether the supervisor proxy allows direct Anthropic traffic or requires inference.local, but it's not the intended behavior and will break if the sandbox network policy blocks direct Anthropic access.
Fix: Decouple the flag from VertexEnabled:
if r.cfg.OpenShellUseGateway {
env["ACP_OPENSHELL_INFERENCE"] = "true"
} else if r.cfg.VertexEnabled {
env["USE_VERTEX"] = "1"
env["CLAUDE_CODE_USE_VERTEX"] = "1"
env["ANTHROPIC_VERTEX_PROJECT_ID"] = r.cfg.VertexProjectID
env["CLOUD_ML_REGION"] = r.cfg.VertexRegion
env["GOOGLE_APPLICATION_CREDENTIALS"] = r.cfg.VertexCredentialsPath
env["GCE_METADATA_HOST"] = "metadata.invalid"
env["GCE_METADATA_TIMEOUT"] = "1"
}This matches the spec: "When ACP_OPENSHELL_INFERENCE=true in the sandbox environment, the runner activates inference routing mode instead of direct Vertex AI or Anthropic API key authentication."
Minor observation (not blocking)
Non-deterministic inference provider when multiple credentials exist — configureInference iterates for ambientProvider := range credentialIDs (a Go map). If a project has both vertex and anthropic credentials, the last SetClusterInference call wins and map iteration order is undefined. Probably not a real issue in practice (unlikely to have two inference-capable providers), but worth a comment or deterministic ordering if this becomes a real scenario.
Overall
The three commits represent solid progress — the credential refresh lifecycle, providers_v2_enabled gating, and the USE_VERTEX cleanup are all correct and necessary fixes. The spec updates are thorough. The one real issue is the ACP_OPENSHELL_INFERENCE flag gating which should be fixed before marking this ready-for-review. It's a small and targeted change.
My previous approval for the supply-chain and runner hardening work stands. One fix left.
— Amber
The OpenShell supervisor is statically linked with musl libc, whose getaddrinfo sends A+AAAA queries simultaneously on a single UDP socket. With Kubernetes' default ndots:5 and 5 search domains, external FQDNs like aiplatform.googleapis.com get expanded through all search domains first (12+ concurrent queries), causing musl to mishandle responses and return zero usable addresses. This manifests as 503 "inference service unavailable" on every Vertex AI inference call. Fix: patch /etc/resolv.conf to ndots:1 before the runner starts, so musl resolves FQDNs directly without search domain expansion. Also suppress AAAA records in CoreDNS for Kind clusters (no IPv6 routing on Podman). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — commit c478d29 (musl DNS fix)
One new commit since my last review at 6bd537a. Reviewing it in full.
✅ New commit looks correct
The musl DNS root cause is well-diagnosed and the fix is sound in both locations:
| Location | Fix | Assessment |
|---|---|---|
execAfterReady (reconciler) |
sed -i 's/ndots:[0-9]*/ndots:1/' prepended to the uvicorn startup command |
✅ Correct — patches before any DNS resolution happens; 2>/dev/null fallback is safe |
entrypoint.sh (standalone CMD) |
Idempotent check (! grep -q 'ndots:1'), `sed |
|
setup-kind-openshell.sh |
CoreDNS template IN AAAA plugin suppresses AAAA records cluster-wide |
✅ Correct for Kind/Podman with no IPv6 routing — Python inline JSON patching is a bit unusual but correct |
RUST_LOG: info |
Reduces supervisor log noise | ✅ Non-controversial |
The ndots:1 regex handles ndots:5, ndots:10, etc. K8s pods always have a ndots: entry in their resolv.conf, so the sed will always match. The two-location fix (exec command + standalone entrypoint) is the right approach — they handle different invocation paths.
⚠️ Still outstanding from prior review
ACP_OPENSHELL_INFERENCE gated on VertexEnabled (Moderate — spec/code misalignment)
This was flagged in my 6bd537a review and is not touched in c478d29. In buildSandboxEnv:
if r.cfg.VertexEnabled {
useVertex = "1"
if r.cfg.OpenShellUseGateway {
env["ACP_OPENSHELL_INFERENCE"] = "true"
}
}Projects with Anthropic/non-Vertex credentials in gateway mode will have inference routing configured via SetClusterInference but the runner won't activate inference routing mode — it'll try to reach api.anthropic.com directly, which may be blocked by the sandbox network policy.
The fix (from my last review) is straightforward — decouple the flag from VertexEnabled:
if r.cfg.OpenShellUseGateway {
env["ACP_OPENSHELL_INFERENCE"] = "true"
} else if r.cfg.VertexEnabled {
// ... existing Vertex vars
}Summary: c478d29 is a targeted, correct fix for a real production DNS issue. My previous approval for supply-chain hardening and runner work stands. The one remaining item (ACP_OPENSHELL_INFERENCE gating) should be addressed before marking this ready-for-review.
— Amber
…f hardcoding The gateway inference routing previously hardcoded claude-sonnet-4-6 as the model for all SetClusterInference calls. Now configureInference reads the model from session.LlmModel, falling back to claude-sonnet-4-6 when unset. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
🤖 Amber — New commits check ( Two commits since my last review. Both are clean.
One observation (non-blocking):
Both commits are additive improvements with no regressions. — Amber |
devnulljason
left a comment
There was a problem hiding this comment.
Small suggestions for improvement to the Dockerfile.
| rm -f "/tmp/${TARBALL}" | ||
|
|
||
| # uv (Python package/project manager) | ||
| COPY --from=ghcr.io/astral-sh/uv:0.10.8 /uv /usr/local/bin/uv |
There was a problem hiding this comment.
Container tags are mutable, pin to image hash instead.
There was a problem hiding this comment.
🤖 Automated reply — Fixed — pinned to ghcr.io/astral-sh/uv@sha256:88234bc9e09c2b2f6d176a3daf411419eb0370d450a08129257410de9cfafd2a.
| RUN ARCH=$(uname -m | sed 's/x86_64/amd64/;s/aarch64/arm64/') && \ | ||
| TARBALL="gh_${GH_VERSION}_linux_${ARCH}.tar.gz" && \ | ||
| curl -fsSL -o "/tmp/${TARBALL}" "https://github.com/cli/cli/releases/download/v${GH_VERSION}/${TARBALL}" && \ |
There was a problem hiding this comment.
This information is in a pre-populated build argument:
| RUN ARCH=$(uname -m | sed 's/x86_64/amd64/;s/aarch64/arm64/') && \ | |
| TARBALL="gh_${GH_VERSION}_linux_${ARCH}.tar.gz" && \ | |
| curl -fsSL -o "/tmp/${TARBALL}" "https://github.com/cli/cli/releases/download/v${GH_VERSION}/${TARBALL}" && \ | |
| ARG TARGETARCH | |
| ARG TARBALL="gh_${GH_VERSION}_linux_${TARGETARCH}.tar.gz" | |
| RUN curl -fsSL -o "/tmp/${TARBALL}" "https://github.com/cli/cli/releases/download/v${GH_VERSION}/${TARBALL}" && \ |
There was a problem hiding this comment.
🤖 Automated reply — Fixed — switched to ARG TARGETARCH (pre-populated by BuildKit). Removes the uname -m | sed hack.
| EXPECTED=$(curl -fsSL "https://github.com/cli/cli/releases/download/v${GH_VERSION}/gh_${GH_VERSION}_checksums.txt" \ | ||
| | grep "${TARBALL}" | awk '{print $1}') && \ | ||
| echo "${EXPECTED} /tmp/${TARBALL}" | sha256sum -c - && \ | ||
| tar -xzf "/tmp/${TARBALL}" -C /usr/local/bin --strip-components=2 "${TARBALL%.tar.gz}/bin/gh" && \ |
There was a problem hiding this comment.
This can be done through an ADD command with the --checksum flag. The only caveat is that the hash will need to be known prior to build and either included like the version is above or passed into the build process as an argument. From my reading the biggest roadblock to that is the two different architectures. I'm wondering if that can be solved in the Makefile by supplying it through a build-time variable.
$ podman build --build-arg GH_CHECKSUM=<checksum> ...| EXPECTED=$(curl -fsSL "https://github.com/cli/cli/releases/download/v${GH_VERSION}/gh_${GH_VERSION}_checksums.txt" \ | |
| | grep "${TARBALL}" | awk '{print $1}') && \ | |
| echo "${EXPECTED} /tmp/${TARBALL}" | sha256sum -c - && \ | |
| tar -xzf "/tmp/${TARBALL}" -C /usr/local/bin --strip-components=2 "${TARBALL%.tar.gz}/bin/gh" && \ | |
| ARG GH_CHECKSUM | |
| ADD --checksum="sha256:$GH_CHECKSUM" \ | |
| "https://github.com/cli/cli/releases/download/v${GH_VERSION}/${TARBALL}" | |
| /tmp/ | |
| RUN tar -xzf "/tmp/${TARBALL}" -C /usr/local/bin --strip-components=2 "${TARBALL%.tar.gz}/bin/gh" && \ |
There was a problem hiding this comment.
For now I can pin the GH VERSION and just hard code the hash expected for that version of gh cli
There was a problem hiding this comment.
🤖 Automated reply — Good idea. The tradeoff is that ADD --checksum requires a single known hash at build time, which gets complicated with multi-arch (amd64 vs arm64 produce different tarballs with different checksums). The current approach fetches the upstream checksums.txt and verifies at build time, which handles multi-arch transparently without needing per-arch build args threaded through the Makefile. The integrity guarantee is equivalent — both verify SHA256 before extracting. Happy to revisit if we move to single-arch builds or add arch-specific checksum args to the build system.
There was a problem hiding this comment.
🤖 Good point — re-reading the suggestion, the concern is that fetching checksums.txt at build time means a compromised upstream could swap both the tarball and the checksum together. Pre-committing the hash into the Dockerfile makes it reviewable in the PR diff. Implementing now with per-arch checksum ARGs defaulting to the known hashes.
|
The policy that gets baked into this runner file will get replaced by configmap policy but I think for the MVP we should merge the configmap policy with some prebaked policy, like bare minimum, perhaps the vertex creds configuration and the ACP communication stuff. Everything else should probably be removed with the intent to merge that. |
|
🤖 Amber Follow-up — Latest Commits Reviewed Reviewed commits through Resolved ✅
New commits look good ✅
Still open (unresolved threads) — from
These are Minor (style/improvement) and don't block me from keeping the My — Amber |
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — commit c9bded8 (session model derivation)
One new commit since my last review at c478d29. Reviewing it.
✅ New commit is correct
configureInference now reads session.LlmModel with a claude-sonnet-4-6 fallback — exactly the right shape. The TODO comment is removed, the spec is updated to match. Clean change.
inferenceModel := sessionModel
if inferenceModel == "" {
inferenceModel = "claude-sonnet-4-6"
}This is what the spec called for and resolves the hardcoded model concern.
⚠️ Still outstanding from prior review
ACP_OPENSHELL_INFERENCE gated on VertexEnabled (Moderate — spec/code misalignment)
This was flagged in my 6bd537a and c478d29 reviews and is not touched in c9bded8. In buildSandboxEnv:
if r.cfg.VertexEnabled {
useVertex = "1"
if r.cfg.OpenShellUseGateway {
env["ACP_OPENSHELL_INFERENCE"] = "true"
}
}ACP_OPENSHELL_INFERENCE=true is only injected when VertexEnabled=true. But configureInference() (which you just improved) calls SetClusterInference for any inference-capable credential — including anthropic, nvidia, openai, aws-bedrock. Projects with non-Vertex credentials in gateway mode get the gateway inference configured correctly but the runner never activates inference routing mode — it tries api.anthropic.com directly, which may be blocked by the sandbox network policy.
Fix (same as my last two reviews):
if r.cfg.OpenShellUseGateway {
env["ACP_OPENSHELL_INFERENCE"] = "true"
} else if r.cfg.VertexEnabled {
env["USE_VERTEX"] = "1"
env["CLAUDE_CODE_USE_VERTEX"] = "1"
env["ANTHROPIC_VERTEX_PROJECT_ID"] = r.cfg.VertexProjectID
env["CLOUD_ML_REGION"] = r.cfg.VertexRegion
env["GOOGLE_APPLICATION_CREDENTIALS"] = r.cfg.VertexCredentialsPath
env["GCE_METADATA_HOST"] = "metadata.invalid"
env["GCE_METADATA_TIMEOUT"] = "1"
}This is the one remaining item before marking ready-for-review.
— Amber
…, provider-agnostic inference Address PR #203 review feedback from jsell-rh and devnulljason: - Pin ghcr.io/astral-sh/uv to SHA256 digest instead of mutable tag - Use Docker TARGETARCH instead of uname -m for GH CLI arch detection - Pre-commit GH CLI checksums per-arch instead of fetching at build time - Rename claude-wrapper.sh → openshell-claude-wrapper.sh and openshell-claude-wrapper.sh → standard-claude-wrapper.sh - Fix ACP_OPENSHELL_INFERENCE to activate for all providers in gateway mode, not only when VertexEnabled — non-Vertex providers (anthropic, nvidia, openai, aws-bedrock) were missing inference routing activation - Update specs and docs to reflect wrapper renames and provider-agnostic inference routing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
| File | Component | Mode |
|---|---|---|
components/runners/ambient-runner/Dockerfile |
runner | warn |
components/runners/ambient-runner/Dockerfile.openshell |
runner | warn |
components/runners/ambient-runner/entrypoint.sh |
runner | warn |
components/runners/ambient-runner/openshell-claude-wrapper.sh |
runner | warn |
components/runners/ambient-runner/policy.yaml |
runner | warn |
components/runners/ambient-runner/standard-claude-wrapper.sh |
runner | warn |
No action required — these components are in warn mode. Consider using the component's agent workflow for future changes.
📖 Specs: Runner Spec · Runner Constitution
|
Gateway-mode runner image, control plane integration for starting the runner via OpenShell ExecSandbox, provider/credential/inference management, and CI pipeline.
Summary
Dockerfile.openshellfor building the runner image compatible with OpenShell gateway mode. Layers the ambient runner (uvicorn + FastAPI AG-UI server) on top of the OpenShell sandbox base (UBI10 + dev toolchain + Claude Code CLI) with supervisor/sandbox user separation.openshell-claude-wrapper.shthat bootstraps Claude config (trusted folders, onboarding) and routes inference through the gateway's loopback proxy (ANTHROPIC_BASE_URL=https://inference.local)policy.yamldefining OpenShell network/filesystem policies — per-binary egress rules for Vertex AI, GitHub, PyPI, VS Code, Cursor, etc., plus landlock filesystem boundariesExecSandbox: Replace the placeholderechocommand inexecAfterReadywith the actual uvicorn runner startup command. The gateway overrides the container entrypoint to its supervisor binary, so the runner must be started via exec after the sandbox reaches Ready.ExecSandboxStreamingtoGatewayClient: Non-blocking streaming variant ofExecSandboxfor long-running processes — launches a goroutine to consume stdout/stderr (logged at Debug level, truncated), uses a separate long-lived context from the 120s readiness-polling timeout.OPENSHELL_RUNNER_IMAGEconfig: Separate image config for gateway mode, decoupled from the standardRUNNER_IMAGE.ambient-runner-openshellcomponent in thecomponents-build-deployworkflow.TARGETARCHfor multi-arch builds.Gateway provider & inference integration
provider_mapping.go): Map ACP credential types to OpenShell provider types. Support Vertex AI (google-vertex-ai) withGOOGLE_SERVICE_ACCOUNT_KEYcredential and region/project config. Extract JWT material for credential refresh.kube_reconciler.go):ensureGatewayProviders()resolves credentials via hierarchical role bindings (agent → project → global), creates/updates gateway providers, and for Vertex AI configuresgoogle_service_account_jwtrefresh strategy viaConfigureProviderRefresh+RotateProviderCredential.kube_reconciler.go):configureInference()sets up cluster inference viaSetClusterInferencegRPC, deriving the model fromsession.LlmModel(defaults toclaude-sonnet-4-6when unset).providers_v2_enabledis set before provider configuration on gateway v0.0.72+.ACP_OPENSHELL_INFERENCE:buildSandboxEnvnow setsACP_OPENSHELL_INFERENCE=truewhenever gateway mode is enabled, regardless of provider type (Vertex, Anthropic, NVIDIA, OpenAI, AWS Bedrock). Previously this was gated onVertexEnabled, causing non-Vertex providers in gateway mode to miss inference routing activation.inference/v1/inference.protoforSetClusterInference/GetInferenceBundle, extendopenshell/v1/openshell.protowithConfigureProviderRefresh,RotateProviderCredential, andSetProviders2Enabled.auth.py): Support gateway token auth — detectANTHROPIC_BASE_URL=https://inference.localand set API key to"gateway"with no Authorization header override.openshell-claude-wrapper.shconfiguresNODE_EXTRA_CA_CERTSfrom mounted OpenShell CA, andDockerfile.openshellcopies CA into system trust store.openshell-sandbox-provisioning.spec.mddocuments the full provider credential flow, inference routing, and supervisor proxy architecture.runner.spec.mddocuments gateway auth detection, per-session inference model configuration, and provider-agnostic inference routing.musl DNS fix for inference (new)
getaddrinfosends A+AAAA queries simultaneously on a single UDP socket. With Kubernetes' defaultndots:5and 5 search domains, external FQDNs likeaiplatform.googleapis.com(3 dots < 5) get expanded through all search domains first — 12+ concurrent queries whose responses arrive out of order, causing musl to return zero usable addresses. This manifests as 503"inference service unavailable"on every Vertex AI inference call.ExecSandboxStreamingcommand now patches/etc/resolv.conftondots:1before starting uvicorn, so musl resolves FQDNs directly without search domain expansion. Applied before any inference traffic reaches the supervisor.entrypoint.shapplies the samendots:1fix for standalone runner starts (non-gateway mode).setup-kind-openshell.shnow suppresses AAAA records cluster-wide via CoreDNStemplateplugin — Kind on Podman has no IPv6 routing, and musl's IPv6-first behavior compounds the issue.info(was unset/debug) to reduce sandbox log noise.Wrapper script renames
openshell-claude-wrapper.sh→standard-claude-wrapper.sh(standard Dockerfile, dispatches to supervisor or direct claude)claude-wrapper.sh→openshell-claude-wrapper.sh(OpenShell Dockerfile, gateway inference routing wrapper)Related
Test plan
make build-runner(podman picks up Dockerfile.openshell)OPENSHELL_RUNNER_IMAGEon control planetests/openshell-dual-tenant.sh— 10/10 passedtests/pod-mode-session.sh— 6/6 passedgo build ./...,go vet ./...,gofmt -l .— all passinference.localreturns successful response (after ndots fix)🤖 Generated with Claude Code