fix(gateway): patch resolv.conf ndots and set inference env vars at sandbox level#213

Closed
bsquizz wants to merge 1 commit into main from fix/openshell-sandbox-ndots-inference

@bsquizz bsquizz commented Jun 30, 2026

Collaborator

Summary

  • Patch /etc/resolv.conf to ndots:1 via ExecSandbox before the runner entrypoint starts, fixing the musl libc DNS resolution failure that causes inference.local to hang with NET:FAIL
  • Move ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY from the Claude-specific wrapper to the sandbox environment so all tools (claude, opencode, etc.) get them automatically
  • Remove redundant proxy/TLS env var setup from auth.py and the wrapper — the OpenShell supervisor already injects these
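
For context on the first item, here is a minimal Go sketch of the ndots rule described in resolv.conf(5): a name with fewer dots than ndots is tried against the search list before it is tried as-is. The function name `candidateNames` and the search list are assumptions made for illustration, not code from this PR, and the sketch shows only the query ordering, not why the supervisor's lookups fail.

```go
package main

import (
	"fmt"
	"strings"
)

// candidateNames returns the order in which a stub resolver that follows the
// resolv.conf(5) ndots rule would try names. Simplified for illustration:
// real resolvers differ in detail, and some skip the search list entirely
// once the name has at least ndots dots.
func candidateNames(name string, search []string, ndots int) []string {
	// A trailing dot marks the name as fully qualified: no search list.
	if strings.HasSuffix(name, ".") {
		return []string{name}
	}
	absolute := name + "."
	var viaSearch []string
	for _, domain := range search {
		viaSearch = append(viaSearch, name+"."+domain+".")
	}
	if strings.Count(name, ".") >= ndots {
		// Enough dots: the name as-is goes first.
		return append([]string{absolute}, viaSearch...)
	}
	// Too few dots: walk the search list first, the name as-is last.
	return append(viaSearch, absolute)
}

func main() {
	// Search list typical of a pod in namespace "default" (an assumption).
	search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, ndots := range []int{5, 1} {
		fmt.Printf("ndots:%d -> %v\n", ndots, candidateNames("inference.local", search, ndots))
	}
}
```

With ndots:5, inference.local (one dot) is expanded through every search domain before the bare name is queried; with ndots:1 the bare name is queried first.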

Test plan

  • Rebuild control plane and runner images, deploy to kind with OPENSHELL_USE_GATEWAY=true
  • Create a sandbox session and verify resolv.conf has ndots:1
  • Run claude interactively via openshell sandbox connect — should reach inference.local without hanging
  • Verify ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY are present in sandbox env for all processes

🤖 Generated with Claude Code

…andbox level

The OpenShell supervisor (musl libc) fails to resolve upstream inference
endpoints with Kubernetes' default ndots:5, causing inference.local to
hang with NET:FAIL. Patch resolv.conf to ndots:1 via ExecSandbox before
the runner starts so the supervisor picks up the fix on its first DNS
lookup.

Move ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY from the Claude-specific
wrapper to the sandbox environment so all tools (claude, opencode, etc.)
get them automatically. Remove redundant proxy/TLS env var setup from
auth.py and the wrapper — the supervisor already injects these.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bsquizz bsquizz closed this Jun 30, 2026

@jsell-rh jsell-rh left a comment

Collaborator


🤖 Amber Analysis — PR #213

Summary: Clean, well-reasoned refactor that centralizes inference env var management at the sandbox level and moves ndots patching to the right place in the lifecycle. CI is green across all checks.

Confidence: High (95%) — the sequencing logic is correct and the code is straightforward.


What's Good

  • Correct fix for the ndots race: Moving the patch from entrypoint.sh (inside the container, after the supervisor's first DNS lookup) to ExecSandbox before ExecSandboxStreaming ensures the supervisor picks up ndots:1 before its first inference DNS query. The old entrypoint approach was fundamentally too late.
  • Sandbox-level env vars: Setting ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY in buildSandboxEnv instead of the Claude-specific wrapper means all tools (opencode, any future additions) inherit them automatically — correct architectural decision.
  • Non-fatal patch failure: Warn-and-continue on ndots patch error is the right call; a failed sed shouldn't abort the runner.
  • auth.py cleanup: Reading os.environ.get("ANTHROPIC_API_KEY", "inference-routing") instead of hardcoding is cleaner and correctly picks up whatever the control plane sets.
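
To make the ndots patch in the first bullet concrete, the rewrite it performs on resolv.conf can be sketched as a pure Go function. The PR applies the change with sed through ExecSandbox and the exact command is not shown in this conversation, so `setNdots` below is a hypothetical equivalent rather than the PR's code.

```go
package main

import (
	"fmt"
	"strings"
)

// setNdots returns resolv.conf content in which every "options" line carries
// ndots:<n>. If no options line exists, one is appended.
// Hypothetical helper: it mirrors the intent of the sed patch, not its text.
func setNdots(conf string, n int) string {
	want := fmt.Sprintf("ndots:%d", n)
	lines := strings.Split(strings.TrimRight(conf, "\n"), "\n")
	found := false
	for i, line := range lines {
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "options" {
			continue // nameserver, search, comments: left untouched
		}
		found = true
		replaced := false
		for j := 1; j < len(fields); j++ {
			if strings.HasPrefix(fields[j], "ndots:") {
				fields[j] = want
				replaced = true
			}
		}
		if !replaced {
			fields = append(fields, want)
		}
		lines[i] = strings.Join(fields, " ")
	}
	if !found {
		lines = append(lines, "options "+want)
	}
	return strings.Join(lines, "\n") + "\n"
}

func main() {
	// Sample content; the nameserver address is illustrative.
	conf := "search default.svc.cluster.local svc.cluster.local cluster.local\n" +
		"nameserver 10.96.0.10\n" +
		"options ndots:5\n"
	fmt.Print(setNdots(conf, 1))
}
```

Other options on the same line are preserved, which is the property a sed substitution scoped to the ndots token would also have.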

Minor Issues

Nit — sentinel value inconsistency (kube_reconciler.go, buildSandboxEnv)

env["ANTHROPIC_API_KEY"] = "unused-for-inference-routing"

This value travels all the way to auth.py and gets returned as the SDK's Bearer token value (the gateway proxy ignores it, so it's functionally fine). The old wrapper used "gateway", the old runner used "inference-routing", now it's "unused-for-inference-routing". Pick one canonical sentinel and document it in a single place. Suggestion: keep "inference-routing" for consistency with existing logs/docs.
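
One way to act on this nit is to define the sentinel once and reference it wherever the env map is built. The sketch below is a simplified stand-in: the real signature of buildSandboxEnv is not shown in this review, and the constant name `inferenceRoutingSentinel` and the base URL passed in `main` are assumptions.

```go
package main

import "fmt"

// inferenceRoutingSentinel is the placeholder API key handed to tools in the
// sandbox. Per the review, the gateway proxy ignores the value, so it only
// needs to be non-empty and consistent. The constant name is hypothetical.
const inferenceRoutingSentinel = "inference-routing"

// buildSandboxEnv is a simplified stand-in for the reconciler function of the
// same name; the real one is not shown in this PR conversation.
func buildSandboxEnv(baseURL string) map[string]string {
	return map[string]string{
		"ANTHROPIC_BASE_URL": baseURL,
		"ANTHROPIC_API_KEY":  inferenceRoutingSentinel,
	}
}

func main() {
	// The URL scheme and host are illustrative, not taken from the PR.
	env := buildSandboxEnv("http://inference.local")
	fmt.Println(env["ANTHROPIC_BASE_URL"], env["ANTHROPIC_API_KEY"])
}
```

The Python default in auth.py would still need to match this value by hand; a comment next to the constant pointing at auth.py is the cheapest way to keep the two in step.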

Nit — no timeout on execCtx (kube_reconciler.go:475)

execCtx := context.Background()

ExecSandbox is synchronous. If the gRPC call hangs, this hangs indefinitely. A context.WithTimeout(context.Background(), 10*time.Second) would be a safety net for the sed command. Low risk in practice, but worth hardening.


Checklist

| Check | Status |
| --- | --- |
| No panic() in production Go | |
| Errors wrapped with context | |
| No tokens in logs | |
| Structured logging with context fields | |
| SecurityContext unchanged | |
| CI passing | ✅ All gates green |

Draft status noted — test plan items are unchecked. Code itself is solid; happy to re-review once the e2e validation is done.

— Amber

@github-actions

Contributor

⚠️ SDD Preflight — Managed Paths Modified

This PR modifies files in SDD-managed component(s). These components are migrating to Spec-Driven Development.

| File | Component | Mode |
| --- | --- | --- |
| components/runners/ambient-runner/entrypoint.sh | runner | warn |
| components/runners/ambient-runner/openshell-claude-wrapper.sh | runner | warn |

No action required — these components are in warn mode. Consider using the component's agent workflow for future changes.

📖 Specs: Runner Spec · Runner Constitution
