fix(gateway): patch resolv.conf ndots and set inference env vars at sandbox level #213
bsquizz wants to merge 1 commit
…andbox level

The OpenShell supervisor (musl libc) fails to resolve upstream inference endpoints with Kubernetes' default ndots:5, causing inference.local to hang with NET:FAIL. Patch resolv.conf to ndots:1 via ExecSandbox before the runner starts so the supervisor picks up the fix on its first DNS lookup.

Move ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY from the Claude-specific wrapper to the sandbox environment so all tools (claude, opencode, etc.) get them automatically.

Remove redundant proxy/TLS env var setup from auth.py and the wrapper; the supervisor already injects these.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jsell-rh left a comment
🤖 Amber Analysis — PR #213
Summary: Clean, well-reasoned refactor that centralizes inference env var management at the sandbox level and moves ndots patching to the right place in the lifecycle. CI is green across all checks.
Confidence: High (95%) — the sequencing logic is correct and the code is straightforward.
What's Good
- Correct fix for the ndots race: Moving the patch from `entrypoint.sh` (inside the container, after the supervisor's first DNS lookup) to `ExecSandbox` before `ExecSandboxStreaming` ensures the supervisor picks up `ndots:1` before its first inference DNS query. The old entrypoint approach was fundamentally too late.
- Sandbox-level env vars: Setting `ANTHROPIC_BASE_URL` and `ANTHROPIC_API_KEY` in `buildSandboxEnv` instead of the Claude-specific wrapper means all tools (opencode, any future additions) inherit them automatically. This is the correct architectural decision.
- Non-fatal patch failure: Warn-and-continue on ndots patch error is the right call; a failed sed shouldn't abort the runner.
- `auth.py` cleanup: Reading `os.environ.get("ANTHROPIC_API_KEY", "inference-routing")` instead of hardcoding is cleaner and correctly picks up whatever the control plane sets.
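The ndots patch praised above is, in effect, a small text rewrite of `/etc/resolv.conf` wrapped in warn-and-continue error handling. The PR applies it with `sed` through `ExecSandbox`, and the exact command is not quoted in this review, so the following is only a minimal Python sketch of an equivalent rewrite. The names `patch_ndots` and `patch_ndots_nonfatal`, and the `read`/`write` callables, are hypothetical and not taken from the PR.

```python
import logging
import re
from typing import Callable

log = logging.getLogger("ndots-patch")

# Matches an existing "ndots:<n>" token on an options line.
_NDOTS_RE = re.compile(r"\bndots:\d+\b")


def patch_ndots(resolv_conf: str, ndots: int = 1) -> str:
    """Return resolv.conf text with the ndots option forced to `ndots`."""
    out = []
    seen_options = False
    for line in resolv_conf.splitlines():
        if line.split()[:1] == ["options"]:
            seen_options = True
            if _NDOTS_RE.search(line):
                # Rewrite the existing value, e.g. ndots:5 -> ndots:1.
                line = _NDOTS_RE.sub(f"ndots:{ndots}", line)
            else:
                # Options line exists but carries no ndots token yet.
                line = f"{line.rstrip()} ndots:{ndots}"
        out.append(line)
    if not seen_options:
        # No options line at all: add one.
        out.append(f"options ndots:{ndots}")
    return "\n".join(out) + "\n"


def patch_ndots_nonfatal(
    read: Callable[[], str], write: Callable[[str], None], ndots: int = 1
) -> bool:
    """Apply the patch; on I/O failure, warn and let startup continue."""
    try:
        write(patch_ndots(read(), ndots))
        return True
    except OSError as exc:
        log.warning("ndots patch failed, continuing: %s", exc)
        return False
```

The boolean return value lets a caller record whether the patch landed without ever raising, which matches the non-fatal behaviour the review endorses.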
Minor Issues
Nit: sentinel value inconsistency (`kube_reconciler.go`, `buildSandboxEnv`)
`env["ANTHROPIC_API_KEY"] = "unused-for-inference-routing"`
This value travels all the way to `auth.py` and gets returned as the SDK's Bearer token value (the gateway proxy ignores it, so it's functionally fine). The old wrapper used `"gateway"`, the old runner used `"inference-routing"`, and now it's `"unused-for-inference-routing"`. Pick one canonical sentinel and document it in a single place. Suggestion: keep `"inference-routing"` for consistency with existing logs/docs.
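One way to act on this nit is to define the sentinel once and have every reader fall back to it. The sketch below assumes the suggested value `"inference-routing"` and uses hypothetical names (`INFERENCE_ROUTING_SENTINEL`, `resolve_api_key`, `is_sentinel`). It mirrors the `os.environ.get` lookup described for `auth.py` but is not the PR's code.

```python
import os
from typing import Mapping, Optional

# Single canonical placeholder. The gateway proxy ignores the value,
# so it only needs to be consistent across logs and docs.
INFERENCE_ROUTING_SENTINEL = "inference-routing"


def resolve_api_key(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the key set by the control plane, else the sentinel."""
    env = os.environ if environ is None else environ
    return env.get("ANTHROPIC_API_KEY", INFERENCE_ROUTING_SENTINEL)


def is_sentinel(api_key: str) -> bool:
    """True when the key is the placeholder rather than a real credential."""
    return api_key == INFERENCE_ROUTING_SENTINEL
```

Passing the environment as a parameter keeps the lookup testable without mutating `os.environ`.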
Nit: no timeout on `execCtx` (`kube_reconciler.go:475`)
`execCtx := context.Background()`
`ExecSandbox` is synchronous. If the gRPC call hangs, this hangs indefinitely. A `context.WithTimeout(context.Background(), 10*time.Second)` would be a safety net for the sed command. Low risk in practice, but worth hardening.
Checklist
| Check | Status |
|---|---|
| No `panic()` in production Go | ✅ |
| Errors wrapped with context | ✅ |
| No tokens in logs | ✅ |
| Structured logging with context fields | ✅ |
| SecurityContext unchanged | ✅ |
| CI passing | ✅ All gates green |
Draft status noted — test plan items are unchecked. Code itself is solid; happy to re-review once the e2e validation is done.
— Amber
| File | Component | Mode |
|---|---|---|
| `components/runners/ambient-runner/entrypoint.sh` | runner | warn |
| `components/runners/ambient-runner/openshell-claude-wrapper.sh` | runner | warn |
No action required — these components are in warn mode. Consider using the component's agent workflow for future changes.
📖 Specs: Runner Spec · Runner Constitution
Summary
- Patch `/etc/resolv.conf` to `ndots:1` via `ExecSandbox` before the runner entrypoint starts, fixing the musl libc DNS resolution failure that causes `inference.local` to hang with `NET:FAIL`
- Move `ANTHROPIC_BASE_URL` and `ANTHROPIC_API_KEY` from the Claude-specific wrapper to the sandbox environment so all tools (claude, opencode, etc.) get them automatically
- Remove redundant proxy/TLS env var setup from `auth.py` and the wrapper; the OpenShell supervisor already injects these

Test plan
- [ ] `OPENSHELL_USE_GATEWAY=true`
- [ ] `resolv.conf` has `ndots:1`
- [ ] `claude` interactively via `openshell sandbox connect`: should reach `inference.local` without hanging
- [ ] `ANTHROPIC_BASE_URL` and `ANTHROPIC_API_KEY` are present in sandbox env for all processes

🤖 Generated with Claude Code