fix(runner): prevent infinite recursion in openshell claude wrapper and move inference env vars to sandbox level#214
Conversation
The wrapper is installed as /usr/local/bin/claude, so when Claude Code spawns subagents that invoke `claude` by name, PATH resolves back to the wrapper causing infinite recursion. Add a re-entry guard that execs the real binary directly on subsequent invocations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…evention The env var approach doesn't work because the OpenShell sandbox supervisor spawns each command in a clean environment that does not inherit exports. Switch to a /tmp file guard that persists across invocations within the same sandbox instance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY from the claude wrapper script into the reconciler's buildSandboxEnv so every tool in the sandbox gets them, not just processes launched through the wrapper. Remove HTTPS_PROXY, NO_PROXY, and NODE_EXTRA_CA_CERTS from the wrapper since those are already set by the reconciler and supervisor. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Analysis — PR #214
Summary: Tight, correct fix for a real recursion bug. The file-based guard is the right call given the sandbox supervisor's env-clearing behavior. Two minor notes before you mark this ready.
Findings
[Minor] PR description mentions _CLAUDE_WRAPPER_ACTIVE env var — code uses a file guard
The bullet in the PR body says "Adds a _CLAUDE_WRAPPER_ACTIVE re-entry guard" but the implementation uses GUARD="/tmp/.claude-wrapper-initialized" — a file, not an env var. The code comment correctly explains the rationale (env not inherited between sandbox invocations). Just update the PR description to match the implementation before marking ready.
[Minor] Guard file persists across top-level invocations within the pod lifetime
Once created, /tmp/.claude-wrapper-initialized is never removed. In the current single-session sandbox model this is fine — one pod, one top-level claude invocation, then the pod terminates. But if the runner ever retries or re-invokes claude in the same pod (e.g., after a crash-restart of the wrapper), the second invocation would skip the HOME=/sandbox and ANTHROPIC_BASE_URL exports and exec directly to the real binary. Worth a comment noting the assumption that /tmp is ephemeral per pod and the guard is intentionally not cleaned up.
Checklist
| Item | Result |
|---|---|
No panic() in production Go |
N/A (bash script) |
| Guard mechanism correct for env-clearing sandbox | ✅ |
| No secrets / tokens introduced | ✅ |
| Comment explains non-obvious invariant | ✅ (file vs env var rationale) |
| Test plan reflects real deployment verification |
Confidence: 95% — Logic is correct for the single-session sandbox model. The minor items are clarity improvements, not blockers.
Approving. Ping me when you uncheck the test plan and mark it ready — I'll re-verify before merge.
— Amber
|
No description provided. |
…ges in kind This commit addresses three related issues with the OpenShell runner in kind-local development: 1. **Infinite recursion fix**: The openshell-claude-wrapper.sh script was infinitely recursing, calling itself and accumulating --bare arguments until hitting ARG_MAX. This occurred because npm's global install created a symlink at /usr/local/bin/claude. When the Dockerfile copied the wrapper to that location, it followed the symlink and overwrote claude.exe itself. Fixed by removing the npm symlink before copying the wrapper, and updating the wrapper to call claude.exe directly. Also merged guard file protection from PR #214. 2. **Local image support**: Added control-plane-local-images-patch.yaml to override RUNNER_IMAGE, OPENSHELL_RUNNER_IMAGE, and MCP_IMAGE env vars with localhost/ prefix for kind-local deployments. This triggers imagePullPolicy=IfNotPresent via existing logic in kube_reconciler.go. 3. **Gateway imagePullPolicy**: OpenShell gateway was hardcoded to imagePullPolicy=Always for sandbox pods, causing pull failures for localhost/ images. Added image_pull_policy = "IfNotPresent" to the gateway.toml configuration. 4. **kind-reload targets**: Refactored kind-rebuild to call individual kind-reload-* targets instead of build-all + blanket restart. Added new kind-reload-runner-openshell target for rapid iteration on the OpenShell runner during development. Changes: - Makefile: Refactored kind-rebuild, added kind-reload-runner-openshell - Dockerfile.openshell: Remove npm symlink before COPY wrapper - openshell-claude-wrapper.sh: Call claude.exe directly (not /opt/claude/bin/claude) - gateway/configmap.yaml: Set image_pull_policy = "IfNotPresent" - kind-local overlay: Add control-plane-local-images-patch.yaml Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Kyle Squizzato <kysquizz@redhat.com>
…ges in kind Adds Dockerfile fix and kind-local image configuration to complement the wrapper guard file protection already in this branch. 1. **Infinite recursion fix (Dockerfile)**: npm's global install creates a symlink at /usr/local/bin/claude. The Dockerfile COPY followed this symlink and overwrote claude.exe itself, causing the wrapper to call itself infinitely. Fixed by removing the npm symlink before copying the wrapper, and calling claude.exe directly. 2. **Local image support**: Added control-plane-local-images-patch.yaml to override RUNNER_IMAGE, OPENSHELL_RUNNER_IMAGE, and MCP_IMAGE env vars with localhost/ prefix for kind-local deployments. This triggers imagePullPolicy=IfNotPresent via existing logic in kube_reconciler.go. 3. **Gateway imagePullPolicy**: OpenShell gateway was hardcoded to imagePullPolicy=Always for sandbox pods, causing pull failures for localhost/ images. Added image_pull_policy = "IfNotPresent" to the gateway.toml configuration. 4. **kind-reload targets**: Refactored kind-rebuild to call individual kind-reload-* targets instead of build-all + blanket restart. Added new kind-reload-runner-openshell target for rapid iteration. Changes: - Makefile: Refactored kind-rebuild, added kind-reload-runner-openshell - Dockerfile.openshell: Remove npm symlink before COPY wrapper - openshell-claude-wrapper.sh: Call claude.exe directly - gateway/configmap.yaml: Set image_pull_policy = "IfNotPresent" - kind-local overlay: Add control-plane-local-images-patch.yaml Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Kyle Squizzato <kysquizz@redhat.com>
Moves gateway image_pull_policy configuration from base manifest to conditional injection based on LOCAL_IMAGES env var. This ensures IfNotPresent policy only applies in kind-local development, not production. Changes: - gateway/manifests.go: Add logic to inject image_pull_policy when LOCAL_IMAGES=true - gateway/configmap.yaml: Remove unconditional image_pull_policy setting - kind/control-plane-env-patch.yaml: Set LOCAL_IMAGES=true for kind overlay - kind-local/control-plane-local-images-patch.yaml: Update comment Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Kyle Squizzato <kysquizz@redhat.com>
|
| File | Component | Mode |
|---|---|---|
components/runners/ambient-runner/Dockerfile.openshell |
runner | warn |
components/runners/ambient-runner/openshell-claude-wrapper.sh |
runner | warn |
No action required — these components are in warn mode. Consider using the component's agent workflow for future changes.
📖 Specs: Runner Spec · Runner Constitution
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — PR #214 (2 new commits since f2ca98a)
Both minor items from my previous review are resolved — the wrapper comment clearly explains the file-vs-env rationale, and the description now correctly describes the file guard. Re-approving with two new minor notes on the additions.
New findings
[Minor] TOML injection in manifests.go is fragile
ApplyConfigOverrides searches for the line containing app_armor_profile and inserts image_pull_policy immediately after. If that string ever moves or is renamed in the upstream gateway TOML template, the injection silently does nothing — developers will see sandbox pull failures in kind without a clear error from the control plane. Since this only affects kind-local dev (not production), blast radius is low, but it would be worth adding a log line when the anchor string isn't found:
// after the for loop, before the SetNestedMap
if !injected {
r.logger.Warn().Msg("LOCAL_IMAGES: could not find app_armor_profile anchor in gateway.toml — image_pull_policy not injected")
}[Minor] kind-rebuild now excludes the OpenShell runner
The refactored kind-rebuild calls kind-reload-ambient-ui, kind-reload-ambient-control-plane, and kind-reload-ambient-api-server — but not kind-reload-runner-openshell. The original used build-all which built everything. Developers doing a full rebuild to pick up runner changes will be silently missing that image. Either add kind-reload-runner-openshell to the chain, or update the kind-rebuild help text to note it's excluded.
What's good in the new commits
rm -f /usr/local/bin/claudebeforeCOPYis the correct fix for the symlink-overwrite problem — clean and targeted- Wrapper comment block is now excellent: explains why env vars moved to sandbox level and why file guard vs env var
LOCAL_IMAGESgating inmanifests.goover a static configmap change is the right approach — production deployments won't accidentally getIfNotPresentANTHROPIC_API_KEY=unused-for-inference-routingcomment inkube_reconciler.gois clearer than the previous"gateway"sentinel
Confidence: 93% — The minor items are quality-of-life, not correctness issues. Approving.
— Amber
Summary
/usr/local/bin/claude, so when Claude Code spawns subagents that invokeclaudeby name, PATH resolves back to the wrapper causing infinite recursion/tmp/.claude-wrapper-initialized) that skips wrapper setup on subsequent invocations within the same sandbox instance — env vars can't be used because the sandbox supervisor spawns each command in a clean environmentANTHROPIC_BASE_URLandANTHROPIC_API_KEYfrom the wrapper into the reconciler'sbuildSandboxEnvso every tool in the sandbox gets them, not just ClaudeTest plan
🤖 Generated with Claude Code