fix(gateway): add SA token auth, resilient provisioning, and multi-tenant CRB#207
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
|
Gateway has usually been set up to keep mTLS on, but authentication off. I wasn't aware sending the kube svc account would do the trick. If this allows us to run with gateway auth on, sounds good to me. |
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Analysis — PR #207
Two-sentence summary: This PR ships SA token auth for the gateway client, resilient provisioning, and multi-tenant CRB fixes — solid work overall. I have 2 blockers that must be fixed before merge: a private key logged in plain text during parsing errors, and os.Getenv called at request-time inside the reconciler's hot path rather than reading from cfg.
Findings
[Blocker] private_key can appear in error messages / logs
ExtractServiceAccountJWTMaterial returns fmt.Errorf("service account JSON missing client_email or private_key") — that string is fine. But the caller in ensureVertexCredentialRefresh wraps the error and logs it with r.logger.Warn().Err(err). If json.Unmarshal itself fails it will embed the raw input string in the error, which could include the SA key content.
More critically: the material.PrivateKey is passed directly into the Material map sent over gRPC. If the ConfigureProviderRefresh or RotateProviderCredential calls fail and zerolog serializes the error (which sometimes includes the request fields via reflection), the private key leaks. The constitution says never log tokens/secrets — this needs a length check or redaction. At minimum, don't pass the raw parsed struct directly into an error-wrapped chain.
[Blocker] ProviderConfig reads os.Getenv at reconcile time, not load time
ProviderConfig in openshell/provider.go calls os.Getenv("ANTHROPIC_VERTEX_PROJECT_ID") and os.Getenv("CLOUD_ML_REGION") on every call. These env vars are loaded into cfg by config.Load() and are already available as r.cfg.VertexProjectID / r.cfg.VertexRegion. This is an architecture violation (config ≠ code), duplicates state, and will silently produce empty provider configs in environments where the env vars are set before pod start but not after. Pass cfg or the two values explicitly.
[Critical] execCtx := context.Background() — unbounded exec goroutine
execAfterReady creates an execCtx with no timeout or cancellation, then spawns a goroutine in ExecSandboxStreaming that reads the stream forever. If the gateway connection drops and stream.Recv() never returns io.EOF, the goroutine leaks. The outer pollCtx has a 120s timeout — the exec context should at minimum derive from the session's lifecycle or have a reasonable max-runtime bound.
[Major] enableProvidersV2 blocks sandbox creation on every session
Setting providers_v2_enabled=true is documented as idempotent and required once per gateway, but it's called synchronously in provisionSessionSandbox for every session. Under high session creation rate this is an unnecessary serial gRPC call per sandbox. Consider calling it once at startup per namespace (on reconciler init or on first connection to a gateway), or at least check the response to skip the call when already enabled.
[Major] SA token is read from disk on every gRPC call
authContext calls os.ReadFile(g.saTokenPath) on every RPC invocation. SA tokens rotate (default 1h in K8s). While re-reading is correct for token freshness, doing it per-call at high frequency adds latency and inode pressure. Consider caching with a TTL (e.g. 5 minutes) — short enough to pick up rotations, long enough to not thrash disk.
[Minor] ACP_OPENSHELL_INFERENCE unconditionally set to "true" in gateway mode
buildSandboxEnv sets env["ACP_OPENSHELL_INFERENCE"] = "true" whenever OpenShellUseGateway is true, even if no inference-capable credential is present (so configureInference was a no-op). The runner will attempt inference routing that won't work. Gate this on whether configureInference actually set routing for the session — propagate a boolean back from configureInference.
[Minor] ExecSandboxStreaming goroutine has no reference to cancel / notify session
The streaming goroutine logs exit codes and errors but has no way to update session status when the runner exits unexpectedly. The existing execAfterReady pattern was synchronous for a reason — knowing the exit code mattered. If this is intentional (fire-and-forget), it should be explicitly documented as such and the session phase should be updated via a separate mechanism.
Checklist
| Item | Result |
|---|---|
No panic() in production Go |
✅ |
Errors wrapped with fmt.Errorf("ctx: %w", err) |
✅ |
| No tokens in log values | |
GetK8sClientsForRequest for user ops |
N/A (control plane, not API server) |
IsNotFound handled for 404 |
✅ |
| OwnerReferences on child resources | N/A (sandbox resources owned by gateway) |
| Config separated from code | ❌ (ProviderConfig reads env at call time) |
| Tests for new logic | ✅ (3 new auth tests, test coverage good) |
Confidence: 88% — I'm highly confident on the private key and os.Getenv blockers. The exec goroutine leak is a real risk; the inference flag gating is a correctness issue. Token caching and enableProvidersV2 frequency are design judgement calls.
Rollback: git revert d85ce79 && kubectl rollout restart deployment/ambient-control-plane -n ambient-code
— Amber
jsell-rh
left a comment
There was a problem hiding this comment.
Inline comment thread for the private key blocker — see main review body for full context.
|
[Blocker — inline] — Amber |
|
#211 may fix this. |
…Config fix - Add service account token injection as Bearer auth on all gateway gRPC calls via authContext(), read from projected SA token path (configurable via OPENSHELL_GATEWAY_SA_TOKEN_PATH env var) - Make configureInference and ensureVertexCredentialRefresh failures non-fatal (warn-and-continue) so sandbox provisioning succeeds even when inference routing or credential refresh setup fails - Fix ProviderConfig to accept vertexProjectID and vertexRegion as parameters from config struct instead of calling os.Getenv at reconcile time - Fix pre-existing TestResolveSandboxImage_AllowedRegistry test bug (missing OpenShellRunnerImage in test config) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
d85ce79 to
bcef481
Compare
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — PR #207 (bcef481)
Good progress on this commit — one of the two blockers is resolved. One blocker and one critical remain before I can approve.
✅ Fixed
[Blocker → Fixed] ProviderConfig os.Getenv at reconcile time
ProviderConfig(ambientProvider, vertexProjectID, vertexRegion string) now takes explicit params from cfg at call site in kube_reconciler.go. Config separated from code — constitution-compliant.
[Improvement] configureInference and ensureVertexCredentialRefresh now non-fatal
Good call making these warn-and-continue. Sandbox provisioning should not be blocked by optional credential routing configuration.
❌ Still Outstanding
[Blocker] Private key exposure in ensureVertexCredentialRefresh — UNCHANGED
The underlying private key handling was not addressed in this commit. The material.PrivateKey returned by ExtractServiceAccountJWTMaterial is still passed directly into the gRPC Material map in RotateProviderCredential. Making the error non-fatal at the call site does not eliminate the leak path — if the gRPC call fails, zerolog's Err(err) can serialize error context that includes the request payload depending on the gRPC status error format.
Minimum required fix — redact before logging:
// In ensureVertexCredentialRefresh, before the Warn() log:
r.logger.Warn().Err(err).
Str("provider", provName).
Int("key_len", len(material.PrivateKey)). // length, not value
Msg("vertex credential refresh setup failed")And ensure the RotateProviderCredential call site does not log the full request struct.
[Critical] execCtx := context.Background() — unbounded goroutine in ExecSandboxStreaming — UNCHANGED
ExecSandboxStreaming still creates execCtx := context.Background() with no timeout or cancellation, then spawns a goroutine reading the stream indefinitely. The outer pollCtx has a 120s timeout, but execCtx has none. If the gRPC stream stalls without returning io.EOF, this goroutine leaks for the process lifetime.
Minimum fix:
// Derive from the session lifecycle context or add a deadline
execCtx, execCancel := context.WithTimeout(ctx, 30*time.Minute) // reasonable max runtime
defer execCancel()Checklist Update
| Item | Previous | Now |
|---|---|---|
| No tokens/secrets in log values | ❌ Still unresolved | |
Config separated from code (ProviderConfig) |
❌ Blocker | ✅ Fixed |
Errors wrapped with fmt.Errorf |
✅ | ✅ |
| Goroutine lifecycle bounded | ❌ Still unresolved | |
| Tests for new auth logic | ✅ | ✅ |
Confidence: 92% — The two remaining items are clearly scoped and have actionable fixes above. Once these are addressed I can approve.
Rollback: git revert bcef481 && kubectl rollout restart deployment/ambient-control-plane -n ambient-code
— Amber
Summary
Bearerheader on all 12 gRPC methods inGatewayClient, fixingUnauthenticatederrors when the OpenShell gateway enforces auth.configureInferenceandensureVertexCredentialRefreshfrom fatal errors to warnings, so sandbox creation proceeds even when credential types are incompatible (e.g.authorized_uservsservice_account).ACP_OPENSHELL_INFERENCE=trueis now set in gateway mode regardless ofVertexEnabled, since the gateway's inference proxy handles routing transparently.sandboxImagePullPolicy=IfNotPresentfor locally-loaded images;ClusterRoleBindingpatched to include all tenant gateway SAs forTokenReviewaccess.Changes
< /dev/null | File | Change |
|------|--------|
|
gateway_client.go|authContext()reads SA token, injects Bearer header on all RPCs ||
gateway_client_test.go| 3 new tests: no-path, valid-token, missing-file ||
config.go|OpenShellGatewaySATokenPathfield (default: SA mount path) ||
main.go| Pass SA token path toNewGatewayClient||
kube_reconciler.go| Non-fatal inference config + credential refresh; always setACP_OPENSHELL_INFERENCEin gateway mode ||
setup-kind-openshell.sh|sandboxImagePullPolicy, multi-tenant CRB patch |Test plan
go vet ./...passesgofmt -l .cleango test ./...passes (including 3 new auth tests)🤖 Generated with Claude Code