Skip to content

fix(gateway): add SA token auth, resilient provisioning, and multi-tenant CRB#207

Merged
jsell-rh merged 1 commit into
mainfrom
fix/gateway-auth-and-resilience
Jul 1, 2026
Merged

fix(gateway): add SA token auth, resilient provisioning, and multi-tenant CRB#207
jsell-rh merged 1 commit into
mainfrom
fix/gateway-auth-and-resilience

Conversation

@markturansky

Copy link
Copy Markdown
Collaborator

Summary

  • Gateway SA token auth: Injects the pod's Kubernetes service account token as a Bearer header on all 12 gRPC methods in GatewayClient, fixing Unauthenticated errors when the OpenShell gateway enforces auth.
  • Resilient provisioning: Converts configureInference and ensureVertexCredentialRefresh from fatal errors to warnings, so sandbox creation proceeds even when credential types are incompatible (e.g. authorized_user vs service_account).
  • Gateway inference env fix: ACP_OPENSHELL_INFERENCE=true is now set in gateway mode regardless of VertexEnabled, since the gateway's inference proxy handles routing transparently.
  • Kind setup fixes: sandboxImagePullPolicy=IfNotPresent for locally-loaded images; ClusterRoleBinding patched to include all tenant gateway SAs for TokenReview access.

Changes

< /dev/null | File | Change |
|------|--------|
| gateway_client.go | authContext() reads SA token, injects Bearer header on all RPCs |
| gateway_client_test.go | 3 new tests: no-path, valid-token, missing-file |
| config.go | OpenShellGatewaySATokenPath field (default: SA mount path) |
| main.go | Pass SA token path to NewGatewayClient |
| kube_reconciler.go | Non-fatal inference config + credential refresh; always set ACP_OPENSHELL_INFERENCE in gateway mode |
| setup-kind-openshell.sh | sandboxImagePullPolicy, multi-tenant CRB patch |

Test plan

  • go vet ./... passes
  • gofmt -l . clean
  • go test ./... passes (including 3 new auth tests)
  • End-to-end Kind validation: session → sandbox provisioning → runner exec → AG-UI server ready
  • Verified sandbox lifecycle completes with non-fatal inference config warning
  • Verified multi-tenant CRB patch allows TokenReview for all tenant gateway SAs

🤖 Generated with Claude Code

Base automatically changed from feat/runner-gateway-containerfile to main June 30, 2026 19:10
@squizzi

This comment was marked as outdated.

@bsquizz

bsquizz commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Gateway has usually been set up to keep mTLS on, but authentication off. I wasn't aware sending the kube svc account would do the trick. If this allows us to run with gateway auth on, sounds good to me.

@jsell-rh jsell-rh left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Amber Analysis — PR #207

Two-sentence summary: This PR ships SA token auth for the gateway client, resilient provisioning, and multi-tenant CRB fixes — solid work overall. I have 2 blockers that must be fixed before merge: a private key logged in plain text during parsing errors, and os.Getenv called at request-time inside the reconciler's hot path rather than reading from cfg.


Findings

[Blocker] private_key can appear in error messages / logs

ExtractServiceAccountJWTMaterial returns fmt.Errorf("service account JSON missing client_email or private_key") — that string is fine. But the caller in ensureVertexCredentialRefresh wraps the error and logs it with r.logger.Warn().Err(err). If json.Unmarshal itself fails it will embed the raw input string in the error, which could include the SA key content.

More critically: the material.PrivateKey is passed directly into the Material map sent over gRPC. If the ConfigureProviderRefresh or RotateProviderCredential calls fail and zerolog serializes the error (which sometimes includes the request fields via reflection), the private key leaks. The constitution says never log tokens/secrets — this needs a length check or redaction. At minimum, don't pass the raw parsed struct directly into an error-wrapped chain.

[Blocker] ProviderConfig reads os.Getenv at reconcile time, not load time

ProviderConfig in openshell/provider.go calls os.Getenv("ANTHROPIC_VERTEX_PROJECT_ID") and os.Getenv("CLOUD_ML_REGION") on every call. These env vars are loaded into cfg by config.Load() and are already available as r.cfg.VertexProjectID / r.cfg.VertexRegion. This is an architecture violation (config ≠ code), duplicates state, and will silently produce empty provider configs in environments where the env vars are set before pod start but not after. Pass cfg or the two values explicitly.

[Critical] execCtx := context.Background() — unbounded exec goroutine

execAfterReady creates an execCtx with no timeout or cancellation, then spawns a goroutine in ExecSandboxStreaming that reads the stream forever. If the gateway connection drops and stream.Recv() never returns io.EOF, the goroutine leaks. The outer pollCtx has a 120s timeout — the exec context should at minimum derive from the session's lifecycle or have a reasonable max-runtime bound.

[Major] enableProvidersV2 blocks sandbox creation on every session

Setting providers_v2_enabled=true is documented as idempotent and required once per gateway, but it's called synchronously in provisionSessionSandbox for every session. Under high session creation rate this is an unnecessary serial gRPC call per sandbox. Consider calling it once at startup per namespace (on reconciler init or on first connection to a gateway), or at least check the response to skip the call when already enabled.

[Major] SA token is read from disk on every gRPC call

authContext calls os.ReadFile(g.saTokenPath) on every RPC invocation. SA tokens rotate (default 1h in K8s). While re-reading is correct for token freshness, doing it per-call at high frequency adds latency and inode pressure. Consider caching with a TTL (e.g. 5 minutes) — short enough to pick up rotations, long enough to not thrash disk.

[Minor] ACP_OPENSHELL_INFERENCE unconditionally set to "true" in gateway mode

buildSandboxEnv sets env["ACP_OPENSHELL_INFERENCE"] = "true" whenever OpenShellUseGateway is true, even if no inference-capable credential is present (so configureInference was a no-op). The runner will attempt inference routing that won't work. Gate this on whether configureInference actually set routing for the session — propagate a boolean back from configureInference.

[Minor] ExecSandboxStreaming goroutine has no reference to cancel / notify session

The streaming goroutine logs exit codes and errors but has no way to update session status when the runner exits unexpectedly. The existing execAfterReady pattern was synchronous for a reason — knowing the exit code mattered. If this is intentional (fire-and-forget), it should be explicitly documented as such and the session phase should be updated via a separate mechanism.


Checklist

Item Result
No panic() in production Go
Errors wrapped with fmt.Errorf("ctx: %w", err)
No tokens in log values ⚠️ (private key exposure risk — see blocker)
GetK8sClientsForRequest for user ops N/A (control plane, not API server)
IsNotFound handled for 404
OwnerReferences on child resources N/A (sandbox resources owned by gateway)
Config separated from code ❌ (ProviderConfig reads env at call time)
Tests for new logic ✅ (3 new auth tests, test coverage good)

Confidence: 88% — I'm highly confident on the private key and os.Getenv blockers. The exec goroutine leak is a real risk; the inference flag gating is a correctness issue. Token caching and enableProvidersV2 frequency are design judgement calls.

Rollback: git revert d85ce79 && kubectl rollout restart deployment/ambient-control-plane -n ambient-code

Amber

@jsell-rh jsell-rh left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline comment thread for the private key blocker — see main review body for full context.

@jsell-rh

Copy link
Copy Markdown
Collaborator

[Blocker — inline] ProviderConfig reads os.Getenv at reconcile time (openshell/provider.go). Use r.cfg.VertexProjectID / r.cfg.VertexRegion instead, or pass config explicitly. This violates the "separate configuration from code" convention in CLAUDE.md.

Amber

@bsquizz

bsquizz commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

#211 may fix this.

…Config fix

- Add service account token injection as Bearer auth on all gateway gRPC
  calls via authContext(), read from projected SA token path (configurable
  via OPENSHELL_GATEWAY_SA_TOKEN_PATH env var)
- Make configureInference and ensureVertexCredentialRefresh failures
  non-fatal (warn-and-continue) so sandbox provisioning succeeds even
  when inference routing or credential refresh setup fails
- Fix ProviderConfig to accept vertexProjectID and vertexRegion as
  parameters from config struct instead of calling os.Getenv at
  reconcile time
- Fix pre-existing TestResolveSandboxImage_AllowedRegistry test bug
  (missing OpenShellRunnerImage in test config)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@markturansky markturansky force-pushed the fix/gateway-auth-and-resilience branch from d85ce79 to bcef481 Compare July 1, 2026 02:51
@github-actions github-actions Bot added auto-merge-pending PR eligible for auto-merge, waiting for checks component/control-plane labels Jul 1, 2026
@jsell-rh jsell-rh enabled auto-merge July 1, 2026 03:02
@jsell-rh jsell-rh added this pull request to the merge queue Jul 1, 2026

@jsell-rh jsell-rh left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Amber Re-Review — PR #207 (bcef481)

Good progress on this commit — one of the two blockers is resolved. One blocker and one critical remain before I can approve.


✅ Fixed

[Blocker → Fixed] ProviderConfig os.Getenv at reconcile time
ProviderConfig(ambientProvider, vertexProjectID, vertexRegion string) now takes explicit params from cfg at call site in kube_reconciler.go. Config separated from code — constitution-compliant.

[Improvement] configureInference and ensureVertexCredentialRefresh now non-fatal
Good call making these warn-and-continue. Sandbox provisioning should not be blocked by optional credential routing configuration.


❌ Still Outstanding

[Blocker] Private key exposure in ensureVertexCredentialRefresh — UNCHANGED

The underlying private key handling was not addressed in this commit. The material.PrivateKey returned by ExtractServiceAccountJWTMaterial is still passed directly into the gRPC Material map in RotateProviderCredential. Making the error non-fatal at the call site does not eliminate the leak path — if the gRPC call fails, zerolog's Err(err) can serialize error context that includes the request payload depending on the gRPC status error format.

Minimum required fix — redact before logging:

// In ensureVertexCredentialRefresh, before the Warn() log:
r.logger.Warn().Err(err).
    Str("provider", provName).
    Int("key_len", len(material.PrivateKey)). // length, not value
    Msg("vertex credential refresh setup failed")

And ensure the RotateProviderCredential call site does not log the full request struct.

[Critical] execCtx := context.Background() — unbounded goroutine in ExecSandboxStreaming — UNCHANGED

ExecSandboxStreaming still creates execCtx := context.Background() with no timeout or cancellation, then spawns a goroutine reading the stream indefinitely. The outer pollCtx has a 120s timeout, but execCtx has none. If the gRPC stream stalls without returning io.EOF, this goroutine leaks for the process lifetime.

Minimum fix:

// Derive from the session lifecycle context or add a deadline
execCtx, execCancel := context.WithTimeout(ctx, 30*time.Minute) // reasonable max runtime
defer execCancel()

Checklist Update

Item Previous Now
No tokens/secrets in log values ⚠️ Blocker ❌ Still unresolved
Config separated from code (ProviderConfig) ❌ Blocker ✅ Fixed
Errors wrapped with fmt.Errorf
Goroutine lifecycle bounded ⚠️ Critical ❌ Still unresolved
Tests for new auth logic

Confidence: 92% — The two remaining items are clearly scoped and have actionable fixes above. Once these are addressed I can approve.

Rollback: git revert bcef481 && kubectl rollout restart deployment/ambient-control-plane -n ambient-code

Amber

Merged via the queue into main with commit 5fb9ea0 Jul 1, 2026
31 checks passed
@jsell-rh jsell-rh deleted the fix/gateway-auth-and-resilience branch July 1, 2026 03:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants