Skip to content

fix(runner): prevent infinite recursion in openshell claude wrapper and move inference env vars to sandbox level#214

Merged
jsell-rh merged 6 commits into
mainfrom
fix/openshell-claude-wrapper-recursion
Jul 1, 2026
Merged

fix(runner): prevent infinite recursion in openshell claude wrapper and move inference env vars to sandbox level#214
jsell-rh merged 6 commits into
mainfrom
fix/openshell-claude-wrapper-recursion

Conversation

@bsquizz

@bsquizz bsquizz commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • The openshell claude wrapper is installed as /usr/local/bin/claude, so when Claude Code spawns subagents that invoke claude by name, PATH resolves back to the wrapper causing infinite recursion
  • Adds a file-based re-entry guard (/tmp/.claude-wrapper-initialized) that skips wrapper setup on subsequent invocations within the same sandbox instance — env vars can't be used because the sandbox supervisor spawns each command in a clean environment
  • Moves ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY from the wrapper into the reconciler's buildSandboxEnv so every tool in the sandbox gets them, not just Claude

Test plan

  • Deploy OpenShell runner and verify subagent spawning works without recursion
  • Verify single-agent sessions still work correctly
  • Verify inference routing works with env vars set at sandbox level

🤖 Generated with Claude Code

bsquizz and others added 4 commits June 30, 2026 19:04
The wrapper is installed as /usr/local/bin/claude, so when Claude Code
spawns subagents that invoke `claude` by name, PATH resolves back to the
wrapper causing infinite recursion. Add a re-entry guard that execs the
real binary directly on subsequent invocations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…evention

The env var approach doesn't work because the OpenShell sandbox supervisor
spawns each command in a clean environment that does not inherit exports.
Switch to a /tmp file guard that persists across invocations within the
same sandbox instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY from the claude wrapper
script into the reconciler's buildSandboxEnv so every tool in the
sandbox gets them, not just processes launched through the wrapper.
Remove HTTPS_PROXY, NO_PROXY, and NODE_EXTRA_CA_CERTS from the wrapper
since those are already set by the reconciler and supervisor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@jsell-rh jsell-rh left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Amber Analysis — PR #214

Summary: Tight, correct fix for a real recursion bug. The file-based guard is the right call given the sandbox supervisor's env-clearing behavior. Two minor notes before you mark this ready.


Findings

[Minor] PR description mentions _CLAUDE_WRAPPER_ACTIVE env var — code uses a file guard

The bullet in the PR body says "Adds a _CLAUDE_WRAPPER_ACTIVE re-entry guard" but the implementation uses GUARD="/tmp/.claude-wrapper-initialized" — a file, not an env var. The code comment correctly explains the rationale (env not inherited between sandbox invocations). Just update the PR description to match the implementation before marking ready.

[Minor] Guard file persists across top-level invocations within the pod lifetime

Once created, /tmp/.claude-wrapper-initialized is never removed. In the current single-session sandbox model this is fine — one pod, one top-level claude invocation, then the pod terminates. But if the runner ever retries or re-invokes claude in the same pod (e.g., after a crash-restart of the wrapper), the second invocation would skip the HOME=/sandbox and ANTHROPIC_BASE_URL exports and exec directly to the real binary. Worth a comment noting the assumption that /tmp is ephemeral per pod and the guard is intentionally not cleaned up.


Checklist

Item Result
No panic() in production Go N/A (bash script)
Guard mechanism correct for env-clearing sandbox
No secrets / tokens introduced
Comment explains non-obvious invariant ✅ (file vs env var rationale)
Test plan reflects real deployment verification ⚠️ Both items unchecked — expected for draft

Confidence: 95% — Logic is correct for the single-session sandbox model. The minor items are clarity improvements, not blockers.

Approving. Ping me when you uncheck the test plan and mark it ready — I'll re-verify before merge.

Amber

@jsell-rh

Copy link
Copy Markdown
Collaborator

No description provided.

@bsquizz bsquizz changed the title fix(runner): prevent infinite recursion in openshell claude wrapper fix(runner): prevent infinite recursion in openshell claude wrapper and move inference env vars to sandbox level Jun 30, 2026
@bsquizz bsquizz marked this pull request as ready for review June 30, 2026 23:36
@bsquizz bsquizz enabled auto-merge June 30, 2026 23:36
@github-actions github-actions Bot added auto-merge-pending PR eligible for auto-merge, waiting for checks component/control-plane component/runner labels Jun 30, 2026
squizzi added a commit that referenced this pull request Jun 30, 2026
…ges in kind

This commit addresses three related issues with the OpenShell runner in
kind-local development:

1. **Infinite recursion fix**: The openshell-claude-wrapper.sh script was
   infinitely recursing, calling itself and accumulating --bare arguments
   until hitting ARG_MAX. This occurred because npm's global install created
   a symlink at /usr/local/bin/claude. When the Dockerfile copied the wrapper
   to that location, it followed the symlink and overwrote claude.exe itself.
   Fixed by removing the npm symlink before copying the wrapper, and updating
   the wrapper to call claude.exe directly. Also merged guard file protection
   from PR #214.

2. **Local image support**: Added control-plane-local-images-patch.yaml
   to override RUNNER_IMAGE, OPENSHELL_RUNNER_IMAGE, and MCP_IMAGE env
   vars with localhost/ prefix for kind-local deployments. This triggers
   imagePullPolicy=IfNotPresent via existing logic in kube_reconciler.go.

3. **Gateway imagePullPolicy**: OpenShell gateway was hardcoded to
   imagePullPolicy=Always for sandbox pods, causing pull failures for
   localhost/ images. Added image_pull_policy = "IfNotPresent" to the
   gateway.toml configuration.

4. **kind-reload targets**: Refactored kind-rebuild to call individual
   kind-reload-* targets instead of build-all + blanket restart. Added
   new kind-reload-runner-openshell target for rapid iteration on the
   OpenShell runner during development.

Changes:
- Makefile: Refactored kind-rebuild, added kind-reload-runner-openshell
- Dockerfile.openshell: Remove npm symlink before COPY wrapper
- openshell-claude-wrapper.sh: Call claude.exe directly (not /opt/claude/bin/claude)
- gateway/configmap.yaml: Set image_pull_policy = "IfNotPresent"
- kind-local overlay: Add control-plane-local-images-patch.yaml

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Kyle Squizzato <kysquizz@redhat.com>
@squizzi squizzi disabled auto-merge June 30, 2026 23:46
…ges in kind

Adds Dockerfile fix and kind-local image configuration to complement the
wrapper guard file protection already in this branch.

1. **Infinite recursion fix (Dockerfile)**: npm's global install creates
   a symlink at /usr/local/bin/claude. The Dockerfile COPY followed this
   symlink and overwrote claude.exe itself, causing the wrapper to call
   itself infinitely. Fixed by removing the npm symlink before copying
   the wrapper, and calling claude.exe directly.

2. **Local image support**: Added control-plane-local-images-patch.yaml
   to override RUNNER_IMAGE, OPENSHELL_RUNNER_IMAGE, and MCP_IMAGE env
   vars with localhost/ prefix for kind-local deployments. This triggers
   imagePullPolicy=IfNotPresent via existing logic in kube_reconciler.go.

3. **Gateway imagePullPolicy**: OpenShell gateway was hardcoded to
   imagePullPolicy=Always for sandbox pods, causing pull failures for
   localhost/ images. Added image_pull_policy = "IfNotPresent" to the
   gateway.toml configuration.

4. **kind-reload targets**: Refactored kind-rebuild to call individual
   kind-reload-* targets instead of build-all + blanket restart. Added
   new kind-reload-runner-openshell target for rapid iteration.

Changes:
- Makefile: Refactored kind-rebuild, added kind-reload-runner-openshell
- Dockerfile.openshell: Remove npm symlink before COPY wrapper
- openshell-claude-wrapper.sh: Call claude.exe directly
- gateway/configmap.yaml: Set image_pull_policy = "IfNotPresent"
- kind-local overlay: Add control-plane-local-images-patch.yaml

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Kyle Squizzato <kysquizz@redhat.com>
Moves gateway image_pull_policy configuration from base manifest to
conditional injection based on LOCAL_IMAGES env var. This ensures
IfNotPresent policy only applies in kind-local development, not production.

Changes:
- gateway/manifests.go: Add logic to inject image_pull_policy when LOCAL_IMAGES=true
- gateway/configmap.yaml: Remove unconditional image_pull_policy setting
- kind/control-plane-env-patch.yaml: Set LOCAL_IMAGES=true for kind overlay
- kind-local/control-plane-local-images-patch.yaml: Update comment

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Kyle Squizzato <kysquizz@redhat.com>
@github-actions

Copy link
Copy Markdown
Contributor

⚠️ SDD Preflight — Managed Paths Modified

This PR modifies files in SDD-managed component(s). These components are migrating to Spec-Driven Development.

File Component Mode
components/runners/ambient-runner/Dockerfile.openshell runner warn
components/runners/ambient-runner/openshell-claude-wrapper.sh runner warn

No action required — these components are in warn mode. Consider using the component's agent workflow for future changes.

📖 Specs: Runner Spec · Runner Constitution

@jsell-rh jsell-rh left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Amber Re-Review — PR #214 (2 new commits since f2ca98a)

Both minor items from my previous review are resolved — the wrapper comment clearly explains the file-vs-env rationale, and the description now correctly describes the file guard. Re-approving with two new minor notes on the additions.


New findings

[Minor] TOML injection in manifests.go is fragile

ApplyConfigOverrides searches for the line containing app_armor_profile and inserts image_pull_policy immediately after. If that string ever moves or is renamed in the upstream gateway TOML template, the injection silently does nothing — developers will see sandbox pull failures in kind without a clear error from the control plane. Since this only affects kind-local dev (not production), blast radius is low, but it would be worth adding a log line when the anchor string isn't found:

// after the for loop, before the SetNestedMap
if !injected {
    r.logger.Warn().Msg("LOCAL_IMAGES: could not find app_armor_profile anchor in gateway.toml — image_pull_policy not injected")
}

[Minor] kind-rebuild now excludes the OpenShell runner

The refactored kind-rebuild calls kind-reload-ambient-ui, kind-reload-ambient-control-plane, and kind-reload-ambient-api-server — but not kind-reload-runner-openshell. The original used build-all which built everything. Developers doing a full rebuild to pick up runner changes will be silently missing that image. Either add kind-reload-runner-openshell to the chain, or update the kind-rebuild help text to note it's excluded.


What's good in the new commits

  • rm -f /usr/local/bin/claude before COPY is the correct fix for the symlink-overwrite problem — clean and targeted
  • Wrapper comment block is now excellent: explains why env vars moved to sandbox level and why file guard vs env var
  • LOCAL_IMAGES gating in manifests.go over a static configmap change is the right approach — production deployments won't accidentally get IfNotPresent
  • ANTHROPIC_API_KEY=unused-for-inference-routing comment in kube_reconciler.go is clearer than the previous "gateway" sentinel

Confidence: 93% — The minor items are quality-of-life, not correctness issues. Approving.

Amber

@jsell-rh jsell-rh disabled auto-merge July 1, 2026 00:04
@jsell-rh jsell-rh enabled auto-merge July 1, 2026 00:04
@jsell-rh jsell-rh added this pull request to the merge queue Jul 1, 2026
Merged via the queue into main with commit ba95a7a Jul 1, 2026
40 of 43 checks passed
@jsell-rh jsell-rh deleted the fix/openshell-claude-wrapper-recursion branch July 1, 2026 00:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants