feat: implement agent sandbox configuration#201
Conversation
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Review — PR #201: Agent Sandbox Configuration
Good architecture overall — the full-stack approach (OpenAPI → DB migration → SDKs → control plane → UI) is clean and the ConvertAgent/PresentAgent round-trip via JSONB is correct. The configmap-driven agent declaration idea is solid. Several issues need resolution before merge.
Findings Summary
| Severity | Count | Issues |
|---|---|---|
| Critical | 2 | Pruning heuristic deletes human-created agents; zero tests on new business logic |
| Major | 4 | Polling instead of Informers; SQL injection in search; resolveSandboxImage skips registry validation + swallows parse errors; select block structure causes early return |
| Minor | 2 | Silent json.Marshal discards in handler and rollback migration |
Must-fix before merge
- Pruning false-positives (configmap_reconciler.go:297) — agents with non-empty
Entrypointset by humans will be deleted. Use an origin annotation to identify ConfigMap-managed agents. - Missing tests (configmap_reconciler.go:148) — 343 lines of new logic, zero test coverage. At minimum
parseAgentDeclarationtable tests. selectblock structure (main.go:262) — the innerselectinside theOpenShellUseGatewaybranch returns before the outer one is reached; useerrgroupto unify.- Registry validation (kube_reconciler.go:381) — tenant-supplied image references flow directly into
CreateSandboxRequestwithout allowlist validation.
What's working well
ConvertAgent/PresentAgentJSONB round-trip is correct- Migration has proper
Rollbackpath mergeAgentEnvironmentcorrectly doesn't overwrite existing env keys (session env takes precedence)- SDK regeneration is consistent across Go / Python / TypeScript
Since this is a draft PR the intent is iteration — happy to re-review once the critical items are addressed.
— Amber
| } | ||
| if len(decl.Environment) > 0 { | ||
| if raw, err := json.Marshal(decl.Environment); err == nil { | ||
| agent.Environment = string(raw) |
There was a problem hiding this comment.
Critical — pruneRemovedAgents deletes agents based on heuristic Entrypoint != ""
The pruning logic identifies ConfigMap-managed agents by a.Entrypoint != "":
if a.Entrypoint != "" && !declaredAgents[a.Name] {
sdk.Agents().Delete(ctx, a.ID)
}This is dangerous for two reasons:
-
False positives: A human-created agent that happens to have a non-empty
Entrypointwill be deleted on the next sync cycle, with no warning and no recovery. The entrypoint field is a general agent config field, not a ConfigMap-origin marker. -
Cross-namespace collision:
pruneRemovedAgentslists all agents for the project (size=500) and prunes any with a non-empty entrypoint not indeclaredAgents. If two namespaces in the same project each have a ConfigMap declaring an agent, one namespace's sync will prune the other's agents.
The correct approach is to track ConfigMap origin explicitly. Options:
- Store a label/annotation on the agent at creation time (e.g.,
annotations["ambient.ai/source"] = "configmap"andannotations["ambient.ai/configmap-namespace"] = ns) and filter on that during pruning - Or scope pruning per-namespace: only prune agents whose name was declared in this namespace's ConfigMaps and isn't in
declaredAgents
// At creation time
agent.Annotations = `{"ambient.ai/source":"configmap","ambient.ai/namespace":"` + namespace + `"}`
// At prune time
if sourceNS, ok := a.Annotations["ambient.ai/namespace"]; ok && sourceNS == namespace && !declaredAgents[a.Name] {
sdk.Agents().Delete(ctx, a.ID)
}— Amber
| select { | ||
| case <-ctx.Done(): | ||
| s.logger.Info().Msg("configmap syncer stopped") | ||
| return ctx.Err() |
There was a problem hiding this comment.
Major — 30-second polling; use Informers per project pattern
configMapSyncInterval = 30 * time.Second is a polling loop, same issue flagged in PR #200 for the gateway syncer. The existing control-plane codebase uses K8s Informers (SharedIndexInformer via factory) for reactive reconciliation.
This introduces up to 30s latency on agent declaration changes and doesn't match the established pattern.
The factory parameter is already passed to NewConfigMapSyncer — use it:
cmInformer := factory.Core().V1().ConfigMaps().Informer()
cmInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { s.handleCMEvent(ctx, obj) },
UpdateFunc: func(_, obj interface{}) { s.handleCMEvent(ctx, obj) },
DeleteFunc: func(obj interface{}) { s.handleCMDelete(ctx, obj) },
})
factory.Start(ctx.Done())
cache.WaitForCacheSync(ctx.Done(), cmInformer.HasSynced)Filter to only ambient.ai/kind=agent labelled ConfigMaps using WithTweakListOptions. This gives instant reactivity and eliminates the polling goroutine entirely.
— Amber
| } | ||
| if p.Content != "" && p.RepoURL != "" { | ||
| return nil, fmt.Errorf("payload[%d]: cannot specify both content and repo_url", i) | ||
| } |
There was a problem hiding this comment.
Major — findAgentByName uses SQL-injectable Search parameter
agents, err := sdk.Agents().List(ctx, &types.ListOptions{Size: 100, Search: fmt.Sprintf("name = '%s'", name)})The name value comes from the ConfigMap YAML name field which is validated only for emptiness. A crafted ConfigMap with name: foo' OR '1'='1 would inject into the search query. ConfigMaps are namespace-admin-editable, making this a real attack surface.
Fix: use %q or escape the value, or use a parameterised search API if the SDK supports it:
Search: fmt.Sprintf("name = %q", name)
// or
Search: "name = '" + strings.ReplaceAll(name, "'", "''") + "'"Also: Size: 100 silently truncates if the project has >100 agents — the lookup will miss agents beyond page 1. Use pagination or search with a tighter filter.
— Amber
| if decl.RepoURL != "" { | ||
| patch["repo_url"] = decl.RepoURL | ||
| } | ||
| if decl.LlmModel != "" { |
There was a problem hiding this comment.
Major — upsertAgent silently discards json.Marshal errors
labelsJSON, _ := json.Marshal(decl.Labels)
patch["labels"] = string(labelsJSON)And the same pattern for annotations. json.Marshal on a map[string]string won't fail in practice, but the _ discard pattern was flagged in CLAUDE.md: "Never silently swallow partial failures: Every error path must propagate or be collected, not discarded".
More importantly, if marshal did fail (e.g., a future type change), string(nil) becomes "" and the patch silently sets labels to an empty string, corrupting the agent record.
labelsJSON, err := json.Marshal(decl.Labels)
if err != nil {
return fmt.Errorf("marshalling agent labels: %w", err)
}
patch["labels"] = string(labelsJSON)— Amber
| func (r *SimpleKubeReconciler) resolveSandboxImage(agent *types.Agent) string { | ||
| if agent != nil && agent.SandboxTemplate != "" { | ||
| var tpl struct { | ||
| Image string `json:"image"` |
There was a problem hiding this comment.
Major — resolveSandboxImage re-parses JSON from a string field on every sandbox creation
func (r *SimpleKubeReconciler) resolveSandboxImage(agent *types.Agent) string {
if agent != nil && agent.SandboxTemplate != "" {
var tpl struct {
Image string `json:"image"`
}
if err := json.Unmarshal([]byte(agent.SandboxTemplate), &tpl); err == nil && tpl.Image != "" {
return tpl.Image
}
}
return r.cfg.RunnerImage
}agent.SandboxTemplate is a JSON string (serialised SandboxTemplate struct). The SDK type types.Agent.SandboxTemplate is typed as string here, but the OpenAPI model has a proper SandboxTemplate struct. The re-parse-from-string pattern defeats type safety.
Two concerns:
- If the JSON is malformed (DB corruption, partial write), the error is silently swallowed and
RunnerImageis used — caller never knows the agent's declared image was ignored. - The image value comes from a tenant-editable field. There's no validation it matches an allowed registry prefix before being passed to
CreateSandboxRequest. Tenant A could declareimage: evil.registry/malicious:latestand have it used for their session.
Minimum fix — log the parse error:
if err := json.Unmarshal(...); err != nil {
r.logger.Warn().Err(err).Str("agent_id", agent.ID).Msg("failed to parse sandbox_template; using default image")
}And add registry prefix validation matching your allowed image registries.
— Amber
| return fmt.Errorf("pod status syncer: %w", podSyncErr) | ||
| case cmSyncErr := <-cmSyncErrCh: | ||
| return fmt.Errorf("configmap syncer: %w", cmSyncErr) | ||
| } |
There was a problem hiding this comment.
Major — select block inside cfg.OpenShellUseGateway branch causes early return, skipping the outer select
if cfg.OpenShellUseGateway {
cmSyncer := ...
go func() { cmSyncErrCh <- cmSyncer.Run(ctx) }()
select { // ← inner select: returns when ANY goroutine exits
case tsErr := <-tsErrCh: ...
case infErr := <-infErrCh: ...
case podSyncErr := <-podSyncErrCh: ...
case cmSyncErr := <-cmSyncErrCh: ...
}
}
select { // ← outer select: NEVER reached when OpenShellUseGateway=true
case tsErr := <-tsErrCh: ...
...
}When OpenShellUseGateway=true the function returns as soon as any single goroutine terminates (including a clean ctx.Done() shutdown of cmSyncer), and the outer select is never reached. In normal operation cmSyncer.Run() returns ctx.Err() on shutdown, so the control plane exits immediately after the context is cancelled — that's correct, but it would also exit immediately if cmSyncer.Run returns early for any reason (panic-recovered, bug), silently taking down the control plane.
Unify into a single select (or use an errgroup):
g, gctx := errgroup.WithContext(ctx)
g.Go(func() error { return tokenSrv.Run(gctx) })
g.Go(func() error { return informerRunner.Run(gctx) })
g.Go(func() error { return podSyncer.Run(gctx) })
if cfg.OpenShellUseGateway {
g.Go(func() error { return cmSyncer.Run(gctx) })
}
return g.Wait()— Amber
| } | ||
| if patch.Payloads != nil { | ||
| if raw, merr := json.Marshal(patch.Payloads); merr == nil { | ||
| s := string(raw) |
There was a problem hiding this comment.
Minor — silent json.Marshal error discard in Patch handler
if patch.Providers != nil {
if raw, merr := json.Marshal(patch.Providers); merr == nil {
s := string(raw)
found.Providers = &s
}
}If json.Marshal fails (shouldn't for []string, but pattern is inconsistent with the project convention), the patch is silently ignored — the field isn't updated, no error is returned, no log. The client receives a 200 with stale data. Either handle the error:
raw, merr := json.Marshal(patch.Providers)
if merr != nil {
writeJSONError(w, http.StatusInternalServerError, fmt.Errorf("marshalling providers: %w", merr))
return
}
s := string(raw)
found.Providers = &sOr, since []string literally cannot fail marshal, add a comment explaining why the error is intentionally ignored to document the invariant.
— Amber
| }, | ||
| Rollback: func(tx *gorm.DB) error { | ||
| cols := []string{ | ||
| "entrypoint", "providers", "payloads", |
There was a problem hiding this comment.
Minor — SQL injection in rollback migration via string concatenation
if err := tx.Exec("ALTER TABLE agents DROP COLUMN IF EXISTS " + col).Error; err != nil {col comes from the hardcoded slice []string{"entrypoint", "providers", ...} which is fine today. But this pattern is worth flagging — if col ever becomes variable (e.g., refactored to accept a column list), this becomes an injection vector. Use a format string instead:
tx.Exec(fmt.Sprintf("ALTER TABLE agents DROP COLUMN IF EXISTS %s", col))Or even better, validate col against an allowlist before use. The Migrate path already uses parameterised statements — apply the same care to Rollback.
— Amber
| return | ||
| } | ||
|
|
||
| declaredAgents := map[string]bool{} |
There was a problem hiding this comment.
Critical — Zero tests for configmap_reconciler.go (343 lines of new business logic)
Project convention (Constitution Principle IV): tests required. The new configmap_reconciler.go adds 343 lines with no test file. The most impactful test targets:
parseAgentDeclaration— table-driven: valid YAML parses correctly,namemissing returns error, payload with bothcontent+repo_urlreturns error, invalid YAML returns errorpruneRemovedAgents— mock SDK client: verify only ConfigMap-origin agents (once the origin marker is added) are pruned, not hand-created agentsupsertAgent— create path and update path via fake SDK; verify fields are set correctly on first create vs patch on updatesyncNamespaceAgents— mockkube.ListConfigMapsByLabelreturning multi-key ConfigMap; verify all keys are iterated and errors on one key don't abort others
Minimum to unblock: parseAgentDeclaration tests (pure function, no mocking needed) and one integration-style test for pruneRemovedAgents once the origin-marker fix is in.
func TestParseAgentDeclaration(t *testing.T) {
cases := []struct {
name string
yaml string
wantErr bool
}{
{"valid", "name: my-agent\nentrypoint: claude\n", false},
{"missing name", "entrypoint: claude\n", false}, // validated by caller
{"both content and repo_url", "name: a\npayloads:\n- sandbox_path: /x\n content: foo\n repo_url: http://x\n", true},
{"invalid yaml", "name: [bad", true},
}
...
}— Amber
|
🤖 Amber — Follow-up Review (64866fb) Good progress on the latest commit. Several prior findings are resolved: ✅ Addressed:
Still needs attention before merge: Major — 30-second polling still in place (
Major — The warning log is good, but the core concern stands: Minor —
Major — Test coverage still incomplete (
These are the most likely sources of production incidents and the hardest to debug without test coverage. Keeping — Amber |
While this is a good idea I think it's a bit out of scope for this effort. I imagine we'll be heavily reviewing these configmaps that end up in tenant workspaces for now. |
|
🤖 Amber Re-Review — commit Good fix pass. Checking each item from my earlier review:
Remaining open items: Major — 30s polling still in place; Informer not used
Major — The warning log was added (good), but there's still no registry allowlist check. A tenant-editable Minor — Migration rollback SQL still uses string concatenation tx.Exec("ALTER TABLE agents DROP COLUMN IF EXISTS " + col)The values come from a hardcoded slice (safe today), but the pattern was flagged. Use Minor — If the project has >100 agents, the lookup will miss agents on page 2+. This is a correctness bug for large installations. Either paginate or use a tighter search filter. Summary: 7 of 9 original findings addressed — solid fix pass. The polling and image registry concerns are the remaining blockers for a non-draft review. Happy to re-check when those are addressed. — Amber |
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — Progress Update
Good progress on the latest commit (64866fb). Here's what was addressed and what remains.
✅ Fixed (6 of 9 issues)
| Issue | Status |
|---|---|
pruneRemovedAgents false-positive deletion (Critical) |
Fixed — isConfigMapManaged + namespace-scoped origin annotations |
findAgentByName SQL injection (Major) |
Fixed — strings.ReplaceAll escaping |
Silent json.Marshal discards in configmap_reconciler (Major) |
Fixed — errors now propagated |
Silent json.Marshal discards in Patch handler (Major) |
Fixed — returns GeneralError |
Dead outer select when gateway enabled (Major) |
Fixed — nil channel pattern, unified select |
resolveSandboxImage silent parse failure (Major) |
Partially fixed — warning log added ✅ |
Tests for configmap_reconciler.go (Critical) |
Fixed — TestParseAgentDeclaration + TestIsConfigMapManaged with good coverage |
⚠️ Still Open
Major — 30-second polling loop not replaced with Informer (previously flagged)
configMapSyncInterval = 30 * time.Second and the polling time.Ticker in Run() are unchanged. The factory is passed into NewConfigMapSyncer but not used for reactive watching. PR #200 has the same pattern. This introduces up to 30s latency on agent declaration changes and is inconsistent with the established Informer pattern in the rest of the codebase.
I recognise this is a design scope question — if the intent is to do this in a follow-up PR, please leave a // TODO: replace with Informer comment in Run() and link the follow-up issue so it doesn't get lost.
Minor — resolveSandboxImage still accepts arbitrary registry (no allowlist)
The parse error is now logged, but there's still no validation that tpl.Image matches an allowed registry prefix before it's passed to CreateSandboxRequest. A tenant-controlled ConfigMap can specify image: evil.registry/malicious:latest. This should be validated against an allowlist (e.g., quay.io/, registry.access.redhat.com/) or at minimum documented as an accepted risk with a // TODO.
Minor — Rollback migration SQL pattern unchanged
tx.Exec("ALTER TABLE agents DROP COLUMN IF EXISTS " + col)The col slice is hardcoded, so this isn't an active vulnerability. Still worth fixing the pattern before this lands — use fmt.Sprintf or validate col against an allowlist so it's safe if the list ever becomes dynamic.
Still amber/change-requested for the polling issue — that's an architectural deviation from project patterns and should be resolved or formally deferred.
— Amber
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — commit 64866fb
Thanks for the fast turnaround on the feedback. I've verified the latest commit against all 9 original findings. Most are resolved; two remain.
✅ Fixed
| Finding | Status |
|---|---|
pruneRemovedAgents false-positive deletions (origin heuristic) |
Fixed — ambient.ai/source + ambient.ai/source-namespace annotations, isConfigMapManaged guards prune correctly |
| Cross-namespace prune collision | Fixed — namespace-scoped prune via annotation |
findAgentByName SQL injection |
Fixed — strings.ReplaceAll(name, "'", "''") applied to both name and project_id |
json.Marshal silently discarded in upsertAgent |
Fixed — error propagated with fmt.Errorf |
json.Marshal silently discarded in handler.go Patch |
Fixed — all four fields (providers, payloads, environment, sandbox_template) now return proper error responses |
main.go nested inner/outer select (dead outer select) |
Fixed — unified single select with nil-channel pattern for conditional cmSyncErrCh |
resolveSandboxImage silent parse failure |
Fixed — WARN log with agent name on unmarshal error |
Zero tests for configmap_reconciler.go |
Fixed — TestParseAgentDeclaration, TestIsConfigMapManaged, TestParsePolicyDeclaration added |
| Minor SQL injection pattern in rollback migration | Acceptable as-is (column list is hardcoded; add a comment noting the invariant) |
❌ Still Needs Work — 2 blockers
1. 30-second polling (Major — unchanged)
configMapSyncInterval = 30 * time.Second with a time.NewTicker loop was flagged in both this PR and #200 as conflicting with the project's established Informer pattern. This remains unaddressed. The factory is already passed to NewConfigMapSyncer; please wire up a SharedIndexInformer with ambient.ai/kind=agent label filtering. The original comment has the code sketch.
2. No registry allowlist on resolveSandboxImage (Critical — security)
resolveSandboxImage now logs parse errors ✅, but the underlying security concern is unaddressed: a tenant-editable ConfigMap field (sandbox_template.image) is used verbatim as the container image reference with no registry validation. A tenant namespace-admin can declare image: evil.registry/malicious:latest and have it execute for their session.
Minimum fix: validate the parsed image against an allowed registry prefix list before returning it. If the allowed registries are already defined somewhere in config, use that list. Otherwise add an ALLOWED_IMAGE_REGISTRIES env var (comma-separated prefixes) with a reasonable default.
func (r *SimpleKubeReconciler) resolveSandboxImage(agent *types.Agent) string {
if agent != nil && agent.SandboxTemplate != "" {
var tpl struct{ Image string `json:"image"` }
if err := json.Unmarshal([]byte(agent.SandboxTemplate), &tpl); err != nil {
r.logger.Warn().Err(err).Str("agent", agent.Name).Msg("failed to parse sandbox_template JSON, using default image")
} else if tpl.Image != "" {
if err := r.validateImageRegistry(tpl.Image); err != nil {
r.logger.Warn().Err(err).Str("image", tpl.Image).Msg("agent declared disallowed image registry, using default")
} else {
return tpl.Image
}
}
}
return r.cfg.RunnerImage
}Both remaining items need addressing before this can merge. Happy to help draft either fix if useful.
— Amber
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — commit daf33ba0
Great progress on this commit. The Informer implementation is solid and I'm clearing most of my previous blockers.
✅ Fixed in this commit
| Finding | Status |
|---|---|
| 30-second polling loop (Major) | Fixed — dynamicinformer.NewFilteredDynamicSharedInformerFactory with 2s debounce and 5-minute resync. Tombstone handling in enqueueNamespace is correct. |
findAgentByName / findProviderByName / findPolicyByName Size:100 (Minor) |
Fixed — Size: 1 on all three |
| Migration rollback SQL pattern (Minor) | Fixed — fmt.Sprintf used in both migration rollbacks |
The Informer implementation is well done — label selector filtering, cache sync check with per-GVR error reporting, debounce pattern, and namespace-scoped sync function are all correct. The DynamicClient() accessor on KubeClient is the right way to expose this without breaking encapsulation.
❌ Still Needs Work — 1 remaining blocker
Registry allowlist on resolveSandboxImage (Critical — security)
kube_reconciler.go's resolveSandboxImage was not touched in this commit. The security concern from my previous two reviews remains open: a tenant namespace-admin can declare sandbox_template: '{"image": "evil.registry/malicious:latest"}' in their ConfigMap and have it used as the container image for their session, with no registry validation.
This is the only remaining blocker. The fix is straightforward — validate the parsed image against an allowed registry prefix list:
func (r *SimpleKubeReconciler) resolveSandboxImage(agent *types.Agent) string {
if agent != nil && agent.SandboxTemplate != "" {
var tpl struct{ Image string `json:"image"` }
if err := json.Unmarshal([]byte(agent.SandboxTemplate), &tpl); err != nil {
r.logger.Warn().Err(err).Str("agent", agent.Name).Msg("failed to parse sandbox_template JSON, using default image")
} else if tpl.Image != "" {
if err := r.validateImageRegistry(tpl.Image); err != nil {
r.logger.Warn().Err(err).Str("image", tpl.Image).Msg("agent declared disallowed image registry, using default")
} else {
return tpl.Image
}
}
}
return r.cfg.RunnerImage
}With validateImageRegistry checking against an ALLOWED_IMAGE_REGISTRIES env var (comma-separated prefixes, e.g. quay.io/,registry.access.redhat.com/).
Staying amber/change-requested for the registry allowlist. One fix left — this is close.
— Amber
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — Commits 035ded1, daf33ba
Good fix pass on daf33ba. The three Major items I flagged are resolved:
| Prior Finding | Status |
|---|---|
| 30s ticker polling instead of Informer | ✅ Fixed — dynamicinformer.NewFilteredDynamicSharedInformerFactory with label selector, 2s debounce, 5-minute resync |
findAgentByName truncates at Size: 100 |
✅ Fixed — Size: 1 for all three find*ByName functions |
| Migration rollback SQL string concatenation | ✅ Fixed — fmt.Sprintf used throughout |
The Informer implementation in Run() is idiomatic: tombstone handling in enqueueNamespace, WaitForCacheSync before the event loop, debounced per-namespace sync, and initial full sync after cache warm. Well done.
Remaining concerns:
Major — Test coverage gap has grown, not shrunk
035ded1 added three full sync pipelines (providers, policies) mirroring the same patterns as agents — syncNamespaceProviders, upsertProvider, pruneRemovedProviders, syncNamespacePolicies, upsertPolicy, pruneRemovedPolicies — with zero tests. The existing configmap_reconciler_test.go covers only parseAgentDeclaration and isConfigMapManaged.
Still missing coverage for the functions most likely to cause production incidents:
upsertAgent— create-vs-update branching, annotation injection, error propagationpruneRemovedAgents— namespace-scope correctness, no false positives on hand-created agentssyncNamespaceAgents— per-key error isolation (one bad key shouldn't abort the rest)- All equivalent provider/policy functions from
035ded1
For a feature gated behind OPENSHELL_USE_GATEWAY=true that modifies live agent state, this is a meaningful risk surface.
Major — resolveSandboxImage still accepts arbitrary tenant images (carried from prior review)
I understand the author's position that operator review of tenant ConfigMaps mitigates this for now. I'm keeping this open as a blocker because the control plane is being designed for multi-tenant use, and manual review doesn't scale. Minimum mitigation before non-draft: validate SandboxTemplate.image has a prefix matching an allowlist (e.g. quay.io/ambient_code/, ghcr.io/nvidia/). If the team decides to defer this explicitly, I'll accept a code comment documenting the known risk and a follow-up issue number.
Minor — pruneRemoved* functions use Size: 500
pruneRemovedAgents, pruneRemovedProviders, and pruneRemovedPolicies all query with Size: 500. A project with more than 500 agents/providers/policies will silently fail to prune stale entries. This is unlikely today but worth noting — either paginate or document the limit.
Minor — sdk.Policys() grammatical typo
Should be sdk.Policies(). If this is generated SDK code, fix it in the generator; if handwritten, rename it.
Summary: daf33ba addressed all three of my Major blockers from the previous review — solid work. The remaining blockers are the test coverage gap (amplified by the large 035ded1 addition) and the image registry validation. Happy to sketch out fake-SDK test scaffolding for the sync functions if that would speed things up.
Still amber/change-requested until tests and the image validation are addressed.
— Amber
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber — Re-review after feedback round
Good progress — the majority of my previous findings have been addressed well. The informer-based sync, namespace-scoped pruning with origin annotations, unified select block, and json.Marshal error handling are all done correctly. Tests for parseAgentDeclaration and isConfigMapManaged are solid.
Two issues remain that need fixing before this can merge:
-
Critical (security):
resolveSandboxImagepasses unvalidated image strings from tenant-editableSandboxTemplatedirectly intoCreateSandboxRequestwith no registry allowlist. Any tenant with ConfigMap write access in their namespace can run arbitrary container images through the control plane. -
Minor: SQL escaping in
findAgentByName/findProviderByName/findPolicyByNameusesstrings.ReplaceAll("'", "''")— acceptable given K8s name constraints, but DNS label validation would be cleaner and more explicit.
The draft status of this PR is appropriate. Please address the registry validation before marking ready for review.
— Amber
There was a problem hiding this comment.
Critical (security) — resolveSandboxImage no registry allowlist (still unaddressed)
The warning log on parse failure was added ✅, but the core security gap remains: the image string returned from this function flows into CreateSandboxRequest without any validation of the registry or image reference format. A tenant who can write a ConfigMap declaring an agent (namespace-admin privilege) can set sandbox_template: '{"image":"docker.io/attacker/malware:latest"}' and have the control plane launch it.
Minimum fix — add a registry prefix allowlist checked before returning the image:
var allowedRegistries = []string{
"quay.io/ambient_code/",
"ghcr.io/nvidia/",
// add others as needed; drive from r.cfg.AllowedImageRegistries for configurability
}
func (r *SimpleKubeReconciler) resolveSandboxImage(agent *types.Agent) string {
if agent != nil && agent.SandboxTemplate != "" {
var tpl struct {
Image string `json:"image"`
}
if err := json.Unmarshal([]byte(agent.SandboxTemplate), &tpl); err != nil {
r.logger.Warn().Err(err).Str("agent", agent.Name).Msg("failed to parse sandbox_template JSON, using default image")
} else if tpl.Image != "" {
for _, prefix := range allowedRegistries {
if strings.HasPrefix(tpl.Image, prefix) {
return tpl.Image
}
}
r.logger.Warn().Str("agent", agent.Name).Str("image", tpl.Image).
Msg("sandbox_template image not in allowed registries, using default")
}
}
return r.cfg.RunnerImage
}Even if the allowlist is empty/permissive at launch, having the validation hook in place means it can be tightened without code changes (drive from a config field). Without it, there's no mechanism to restrict what images tenants can run.
— Amber
There was a problem hiding this comment.
Minor — SQL escaping is still quote-substitution, not parameterized
The strings.ReplaceAll(name, "'", "''") pattern mitigates basic ' injection, but is fragile — it relies on the SDK's query parser never recognising alternative comment sequences, null bytes, or multi-byte encoding edge cases. If the SDK ever supports parameterized searches (e.g. positional ? or named $name bindings), that would be the right fix.
For now this is acceptable given the ConfigMap name comes through K8s label selectors (which enforce DNS label syntax) and the K8s API already validates it. Just flagging so future contributors don't mistake this pattern for a safe general escape function.
If the SDK doesn't support parameters, the correct defensive layer is to validate name against a DNS label regex before building the search string:
var dnsLabelRE = regexp.MustCompile(`^[a-z0-9]([a-z0-9\-]*[a-z0-9])?$`)
func validateAgentName(name string) error {
if !dnsLabelRE.MatchString(name) {
return fmt.Errorf("invalid agent name %q: must be a DNS label", name)
}
return nil
}Call this in upsertAgent/upsertProvider/upsertPolicy before the find*ByName call. This makes the allowable character set explicit and removes the need for quote-escaping entirely.
— Amber
|
🤖 Amber Re-Review — PR #201 (8 new commits since initial review) Great progress — nearly all findings are resolved. Here's the scorecard: ✅ Fixed
❌ Remaining — registry allowlist still unaddressed (Critical thread from 2026-06-29T23:35) The thread I added in the second review round ( The warning log was added (✅) but the allowlist was explicitly called out as the remaining gap. The minimum fix is a config-driven prefix check: var allowedRegistries = []string{
"quay.io/ambient_code/",
"ghcr.io/nvidia/",
}
for _, prefix := range allowedRegistries {
if strings.HasPrefix(tpl.Image, prefix) {
return tpl.Image
}
}
r.logger.Warn().Str("image", tpl.Image).Msg("sandbox_template image not in allowed registries, using default")
return r.cfg.RunnerImageEven as an empty/permissive allowlist at launch (driven from a config field), the validation hook needs to exist so it can be tightened without code changes. Without it, there's no mechanism to restrict what images tenants can run. This is a security boundary issue — please address before merge. Confidence: High on both the remaining gap and the fix. — Amber |
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — PR #201 — One Critical Security Issue Remaining
Good progress — most issues from my prior review have been addressed. Here's the verification:
| Finding | Status |
|---|---|
pruneRemovedAgents heuristic deletion (Entrypoint != "") |
✅ Fixed — now uses ambient.ai/source + ambient.ai/source-namespace annotations |
| Cross-namespace pruning collision | ✅ Fixed — isConfigMapManaged scopes by namespace |
| 30s polling → Informer pattern | ✅ dynamicinformer.NewFilteredDynamicSharedInformerFactory + 2s debounce implemented |
findAgentByName SQL injection |
✅ strings.ReplaceAll(name, "'", "''") + Size:1 |
json.Marshal errors silently discarded |
✅ Fixed — errors propagated with fmt.Errorf |
select block dead code / nested select |
✅ Fixed — unified select with nil channel pattern when gateway disabled |
| Silent marshal error in Patch handler | ✅ Addressed |
| SQL injection in rollback migration | ✅ fmt.Sprintf used |
Zero tests for configmap_reconciler.go |
✅ TestParseAgentDeclaration, TestIsConfigMapManaged, TestParsePolicyDeclaration added |
🚨 One Critical Still Unresolved:
resolveSandboxImage — no registry allowlist (my second review comment PRRT_kwDOS2zRNs6NH3NM, not outdated)
The warning log on parse failure was added ✅, but the security gap is still open: the image string flows into CreateSandboxRequest without any registry validation. A tenant who can write a ConfigMap (namespace-admin) can declare sandbox_template: '{"image":"docker.io/attacker/malware:latest"}' and the control plane will launch it.
This is a blocker per my security checklist — tenant-editable field driving container image selection with no allowlist.
Minimum fix:
var allowedRegistryPrefixes = []string{
"quay.io/ambient_code/",
"ghcr.io/nvidia/",
// drive from r.cfg.AllowedRunnerImageRegistries for configurability
}
// After parsing tpl.Image, before returning:
for _, prefix := range allowedRegistryPrefixes {
if strings.HasPrefix(tpl.Image, prefix) {
return tpl.Image
}
}
r.logger.Warn().Str("image", tpl.Image).Msg("sandbox_template image not in allowed registries, using default")
return r.cfg.RunnerImageEverything else looks good. Please address this one item and I'll approve.
— Amber
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — Follow-up on 51fc078 (2026-06-30)
Great progress on this PR. I reviewed the 8 new commits since my last pass (2026-06-29T23:35) and tracked every finding.
✅ All prior findings resolved — one security gap remains
| Finding | Status |
|---|---|
pruneRemovedAgents false-positive origin heuristic |
✅ Fixed — isConfigMapManaged annotation check, namespace-scoped |
| 30s polling loop → Informers | ✅ Fixed — dynamicinformer.NewFilteredDynamicSharedInformerFactory with 2s debounce |
SQL injection in findAgentByName/Provider/Policy |
✅ Fixed — strings.ReplaceAll escape + Size:1 lookup |
json.Marshal errors silently discarded |
✅ Fixed — all marshal errors returned or logged |
main.go dual-select dead code path |
✅ Fixed — nil channel pattern, single unified select |
Missing tests for configmap_reconciler.go |
✅ Fixed — TestParseAgentDeclaration, TestIsConfigMapManaged, TestParsePolicyDeclaration |
resolveSandboxImage silent parse failure |
✅ Fixed — warn log added |
| Migration SQL concatenation | ✅ Fixed — fmt.Sprintf |
handler.go silent json.Marshal discard |
✅ Fixed — errors.GeneralError returned |
❌ Outstanding Blocker: resolveSandboxImage missing registry allowlist
kube_reconciler.go — resolveSandboxImage:
func (r *SimpleKubeReconciler) resolveSandboxImage(agent *types.Agent) string {
if agent != nil && agent.SandboxTemplate != "" {
var tpl struct {
Image string `json:"image"`
}
if err := json.Unmarshal([]byte(agent.SandboxTemplate), &tpl); err != nil {
r.logger.Warn().Err(err).Str("agent", agent.Name).Msg("failed to parse sandbox_template JSON, using default image")
} else if tpl.Image != "" {
return tpl.Image // ← no registry validation
}
}
return r.cfg.RunnerImage
}The image string flows directly into CreateSandboxRequest.Spec.Template.Image without any registry prefix check. A namespace admin can write a ConfigMap declaring sandbox_template: '{"image":"docker.io/attacker/malware:latest"}' and the control plane will provision that image as a gateway sandbox. ConfigMap write access is scoped to namespace-admins — that's a real, tenant-accessible attack surface.
Minimum fix — add a registry prefix allowlist before returning the resolved image. Drive it from a config field so it can be tightened without code changes:
func (r *SimpleKubeReconciler) resolveSandboxImage(agent *types.Agent) string {
if agent != nil && agent.SandboxTemplate != "" {
var tpl struct {
Image string `json:"image"`
}
if err := json.Unmarshal([]byte(agent.SandboxTemplate), &tpl); err != nil {
r.logger.Warn().Err(err).Str("agent", agent.Name).Msg("failed to parse sandbox_template JSON, using default image")
} else if tpl.Image != "" {
if r.isSandboxImageAllowed(tpl.Image) {
return tpl.Image
}
r.logger.Warn().
Str("agent", agent.Name).
Str("image", tpl.Image).
Msg("sandbox_template image not in allowed registries, using default")
}
}
return r.cfg.RunnerImage
}
func (r *SimpleKubeReconciler) isSandboxImageAllowed(image string) bool {
if len(r.cfg.AllowedSandboxRegistries) == 0 {
// No allowlist configured — block all tenant-declared images as safe default
return false
}
for _, prefix := range r.cfg.AllowedSandboxRegistries {
if strings.HasPrefix(image, prefix) {
return true
}
}
return false
}Add AllowedSandboxRegistries []string to KubeReconcilerConfig (env var ALLOWED_SANDBOX_REGISTRIES, comma-separated). An empty list safely falls back to the default image. This is a one-day fix and the only blocker remaining.
The new UI code (providers/policies tables, detail pages, YAML generation) looks clean — proper unknown/Record<string, unknown> types, React Query for data ops, no any violations found. SDK regeneration is consistent. The informer-based ConfigMap sync is a solid implementation.
Keeping amber/change-requested until the registry allowlist is in place. Happy to re-review immediately after.
— Amber
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review (2026-06-30 — 8 new commits since initial review)
Summary: Great progress. 9 of 10 prior findings are fixed — the Informer-based sync, origin-annotated pruning, Size:1 lookups, json.Marshal error handling, unified main.go select, fmt.Sprintf migration SQL, and unit tests all landed cleanly. One critical security issue remains unaddressed.
✅ Fixed
| Finding | Status |
|---|---|
pruneRemovedAgents false-positive deletion heuristic |
Fixed — origin annotation (ambient.ai/source + ambient.ai/source-namespace) + namespace-scoped pruning |
| 30s polling loop | Fixed — dynamicinformer.NewFilteredDynamicSharedInformerFactory with 2s debounce + 5min resync |
findAgentByName SQL-injectable search / Size:100 truncation |
Fixed — strings.ReplaceAll escaping + Size:1 |
upsertAgent silent json.Marshal error discard |
Fixed — errors propagated |
main.go inner select early-return dead code |
Fixed — nil channel pattern, unified select |
resolveSandboxImage silent parse error swallow |
Fixed — Warn() log added |
Silent json.Marshal in Patch handler (handler.go) |
Fixed |
| Migration rollback string concatenation | Fixed — fmt.Sprintf |
Zero tests for configmap_reconciler.go |
Fixed — TestParseAgentDeclaration, TestIsConfigMapManaged, TestParsePolicyDeclaration |
| SQL escaping note | Acknowledged |
❌ Still Blocking: Registry allowlist missing in resolveSandboxImage
The warning log was added (thank you), but the core security gap is unaddressed. The image value from agent.SandboxTemplate flows into CreateSandboxRequest without any registry validation.
Attack surface: Any namespace-admin who can write a ConfigMap with ambient.ai/kind=agent can declare:
sandbox_template:
image: docker.io/attacker/malware:latestThe control plane will log a warning if JSON is malformed — but if the JSON is valid (as above), it returns the attacker-controlled image with no warning, no check. The control plane then calls CreateSandbox with that image.
Minimum fix (drive the allowlist from config so it's tightenable without code changes):
func (r *SimpleKubeReconciler) resolveSandboxImage(agent *types.Agent) string {
if agent != nil && agent.SandboxTemplate != "" {
var tpl struct {
Image string `json:"image"`
}
if err := json.Unmarshal([]byte(agent.SandboxTemplate), &tpl); err != nil {
r.logger.Warn().Err(err).Str("agent", agent.Name).Msg("failed to parse sandbox_template JSON, using default image")
} else if tpl.Image != "" {
if r.isAllowedImage(tpl.Image) {
return tpl.Image
}
r.logger.Warn().Str("agent", agent.Name).Str("image", tpl.Image).
Msg("sandbox_template image not in allowed registries, using default")
}
}
return r.cfg.RunnerImage
}
func (r *SimpleKubeReconciler) isAllowedImage(image string) bool {
if len(r.cfg.AllowedImageRegistries) == 0 {
return true // permissive if unconfigured — but log a one-time warning at startup
}
for _, prefix := range r.cfg.AllowedImageRegistries {
if strings.HasPrefix(image, prefix) {
return true
}
}
return false
}Add AllowedImageRegistries []string to KubeReconcilerConfig (populated from a comma-separated env var). This doesn't require locking down the list immediately — even with an empty/permissive default, the hook is in place and can be tightened via config without code changes.
Confidence: High — the path from tenant ConfigMap → resolveSandboxImage → CreateSandboxRequest.Spec.Template.Image is direct and unambiguous in the current code.
— Amber
Add ConfigMap-based agent declarations with sandbox config fields. Full-stack implementation: OpenAPI schema, DB migration, SDK regeneration, control plane ConfigMap syncer, and UI manifest rendering. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The go-sdk requires Go 1.25.0 but the workflow was using the generator's go.mod (Go 1.24.4), causing `go fmt` to fail. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The symlink pointed at ../workflows which does not exist, causing Konflux git-clone to reject the repo with "symlink check: found 1 symlink(s) pointing outside the directory". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add origin annotations (ambient.ai/source, ambient.ai/source-namespace) to track ConfigMap-managed agents and scope pruning by namespace - Fix SQL injection in findAgentByName by escaping single quotes - Handle json.Marshal errors in configmap_reconciler and handler instead of silently discarding them - Unify main.go select blocks into a single select using nil channel pattern to eliminate dead code when gateway is enabled - Add warning log when resolveSandboxImage fails to parse sandbox_template - Add unit tests for parseAgentDeclaration and isConfigMapManaged - Update example namespace to tenant-a Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements full-stack provider and policy support: OpenAPI specs, API server plugins (CRUD endpoints), SDK generation for Go/Python/TS, ConfigMap reconciler for provider/policy declarations, and UI adapters using the TS SDK with React Query hooks. Renames ConfigMap terminology to GitOps in the UI per views spec. Merges sandbox config into a table-driven Configuration card and adds payload support to the Generate YAML form. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s, safe migration SQL Replace 30s ticker polling with a dynamic SharedInformerFactory that watches ConfigMaps cluster-wide filtered by ambient.ai/kind label. Events are debounced (2s) and trigger per-namespace sync; a 5-minute resync period provides a safety net for missed events. Change findAgentByName/findProviderByName/findPolicyByName from Size:100 to Size:1 since exact-name searches return at most one match. Replace string concatenation in migration rollback SQL with fmt.Sprintf to avoid accidental injection if the column slice becomes variable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…AML generation Rewrite provider and policy tables with TanStack React Table matching the agents UI pattern: sortable columns, global search filter, keyboard navigation (j/k/Enter), lifecycle badge (GitOps source), and click-to-navigate to detail pages. Add detail pages for providers and policies with tabbed layout (Manifest, Annotations), table-driven config display, ConfigMap YAML card with copy/download, and GitOps info banner. Add "Generate YAML" sheets to both list pages for creating ConfigMap declarations via form input without API calls. Extract ConfigMapYamlPreview to a shared component and add YAML generation utilities for providers (providerToConfigMapYaml) and policies (policyToConfigMapYaml with recursive YAML serializer). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dashboard
- Make provider/policy badges clickable on agents page, routing to their respective pages
- Reorder toolbar layout: title + search on left, action buttons on right (agents, providers, policies, sessions)
- Fix "Last Updated" column sorting on agents, providers, and policies tables (changed from col.display to col.accessor)
- Remove namespace column from policies table (users only see their namespace)
- Update empty state descriptions for policies/providers to match agents ("declared via GitOps")
- Adjust dashboard grid template: reduce Status max-width 240px→200px, increase Since column 80px→110px
- Increase search input width from max-w-xs to w-80 on all list pages
- Update policies search placeholder to "Filter by name..." (removed namespace)
- Add info text to model selector: "Overrides the agent's configured default model"
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Kyle Squizzato <kysquizz@redhat.com>
Introduce allowed-registry validation for agent sandbox templates in the control plane, update the SDK generator to emit structured Go/Python/TS types (map, slice, pointer) instead of raw strings for agent fields like environment, payloads, providers, and sandbox_template, and improve Makefile build-error messaging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Kyle Squizzato <kysquizz@redhat.com>
cff40de to
91adddc
Compare
Addresses PR review feedback from @jsell-rh: **What changed:** - Added `validateResourceName()` function that enforces DNS label syntax (RFC 1123: lowercase alphanumeric or hyphens, cannot start/end with hyphen) - Replaced SQL quote-escaping (`strings.ReplaceAll(name, "'", "''")`) with upfront validation in `findAgentByName`, `findProviderByName`, and `findPolicyByName` - Added comprehensive test coverage for DNS label validation including edge cases (SQL injection attempts, uppercase, special chars, etc.) **Why this matters:** The previous approach relied on quote-substitution to mitigate SQL injection, which is fragile — it assumes the SDK's query parser never recognizes alternative comment sequences, null bytes, or multi-byte encoding edge cases. The new approach validates names against the DNS label regex before building search strings. Since ConfigMap names come through K8s label selectors (which already enforce DNS label syntax), this makes the validation contract explicit and removes the need for escaping entirely. **Registry allowlist (Critical issue) — Already Fixed:** The registry allowlist validation was already implemented in commit 91adddc (2026-06-30 09:57:34). The review comment was from 2026-06-29 and hasn't seen the latest commits yet. No further action needed on that front. Fixes: #201 (review feedback) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Kyle Squizzato <kysquizz@redhat.com>
PR Review ResponseThanks for the detailed review @jsell-rh! I've addressed the remaining feedback: ✅ Critical — Registry Allowlist (Already Fixed)The registry allowlist validation was already implemented in commit 91adddc (2026-06-30 09:57:34): func (r *SimpleKubeReconciler) resolveSandboxImage(agent *types.Agent) string {
if agent != nil && agent.SandboxTemplate != nil && agent.SandboxTemplate.Image != "" {
if !r.isAllowedRegistry(agent.SandboxTemplate.Image) {
r.logger.Warn().Str("agent", agent.Name).Str("image", agent.SandboxTemplate.Image).
Strs("allowed", r.cfg.AllowedSandboxRegistries).
Msg("sandbox image not in allowed registries, using default image")
return r.cfg.RunnerImage
}
return agent.SandboxTemplate.Image
}
return r.cfg.RunnerImage
}The review comments were from 2026-06-29 and haven't seen the latest commits yet. ✅ Minor — DNS Label Validation (Just Fixed)Replaced SQL quote-escaping with DNS label validation in commit 205a31b: Changes:
Why this is better: Since K8s already enforces DNS label syntax for ConfigMap names, this validation is defensive but correct. Test coverage: All existing reconciler tests pass ✅ |
squizzi
left a comment
There was a problem hiding this comment.
Re-Review Response
Thanks for the thorough review! Both issues have been addressed:
✅ Registry Allowlist (Critical) — Already Fixed in 91adddc
The registry allowlist validation was implemented yesterday (2026-06-30 09:57:34) in commit 91adddc, after your review. The implementation matches your suggestion:
- Added
AllowedSandboxRegistries []stringto config - Implemented
isAllowedRegistry()prefix checker - Falls back to default
RunnerImageif not in allowlist - Warns when blocking an unauthorized registry
✅ DNS Label Validation (Minor) — Fixed in 205a31b
Replaced SQL quote-escaping with proper DNS label validation:
Changes:
validateResourceName()enforces RFC 1123 (lowercase alphanumeric + hyphens, no leading/trailing hyphen)- Applied to
findAgentByName,findProviderByName,findPolicyByNamebefore query construction - Added test coverage including SQL injection attempts
Why better:
Quote-substitution is fragile (assumes no alternative comment syntax, encoding tricks, etc.). DNS label validation makes the contract explicit and removes escaping need.
Both changes have test coverage and pass all existing tests. Ready for re-review.
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — Approved
All critical and major issues from my initial review have been addressed. Here's what I verified:
Critical issues — RESOLVED ✅
- Registry allowlist for
resolveSandboxImageis now in place (isAllowedRegistry+ALLOWED_SANDBOX_REGISTRIESenv var with sensible defaults ofquay.io/ambient_code/andghcr.io/nvidia/). Tests cover blocked/allowed/empty-allowlist cases. pruneRemovedAgentsnow filters byambient.ai/source=configmapannotation, scoped by namespace. Cross-namespace pruning risk eliminated.
Major issues — RESOLVED ✅
- Informer-based ConfigMap sync replaces the 30s polling ticker. Debounced events with 5-min resync — correct pattern.
json.Marshalerrors in handler and configmap reconciler are now propagated instead of silently discarded.selectblock unified (nil-channel pattern) — no dead code path whenOpenShellUseGateway=true.- SQL migration rollback uses
fmt.Sprintf— consistent with convention.
Tests — ADDED ✅
parseAgentDeclarationtable-driven testsisConfigMapManagedtestsTestResolveSandboxImage_AllowedRegistry(6 cases including blocked registries) andTestResolveSandboxImage_EmptyAllowlist
Minor remaining (acceptable)
- SQL escaping in
findAgentByNameusesstrings.ReplaceAll— acceptable since ConfigMap names go through K8s DNS label validation upstream. Flagged for awareness in case the pattern spreads.
One thing to confirm:
The RUNNER_IMAGE default in config.go was changed from quay.io/ambient_code/acp_claude_runner:latest to quay.io/ambient_code/ambient_runner_openshell:latest. This is a breaking change for non-gateway deployments that rely on the default. Is this intentional? If gateway mode is now the default runner, existing deployments without OPENSHELL_USE_GATEWAY=true will try to use the openshell image. Either confirm this is intentional or gate the default behind a condition.
Threads addressed (can now be resolved)
- pruneRemovedAgents heuristic → fixed with origin annotations
- 30s polling → Informer with debounce
- findAgentByName SQL → ReplaceAll escape
- json.Marshal silent discard → propagated
- resolveSandboxImage registry check → allowlist with tests
- select dead code → nil-channel pattern
- Migration rollback SQL → fmt.Sprintf
- Zero tests → parseAgentDeclaration + resolveSandboxImage tests added
Great work addressing a large set of issues thoroughly. LGTM with the RUNNER_IMAGE default question to confirm.
— Amber
The SDK type generation (91adddc) changed providers from string to string[], payloads from string to Array<Record>, and sandbox_template from string to Record. The adapter was still JSON.stringify-ing these values, causing a type mismatch that broke the CI build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jsell-rh
left a comment
There was a problem hiding this comment.
🤖 Amber Re-Review — Commits 91adddcb, 205a31be, 35fe35df, 6d420b07
All remaining blockers are resolved. Verified the new commits:
91adddcb — Registry allowlist (Critical blocker — ✅ Fixed)
The implementation is exactly right:
ALLOWED_SANDBOX_REGISTRIESenv var, comma-separated, configurable at runtime without code changes ✅- Sensible defaults (
quay.io/ambient_code/,ghcr.io/nvidia/) baked in when env var unset ✅ - Empty allowlist correctly blocks all custom images (falls back to default, not permissive) ✅
isAllowedRegistryis a clean, separately-testable helper ✅resolveSandboxImagenow operates on*types.SandboxTemplate(typed struct) rather than a raw JSON string — the SDK type improvement in this commit makes this much cleaner ✅- Tests cover allowed, blocked, nil template, empty image, and empty allowlist — all the meaningful cases ✅
205a31be — DNS label validation (Minor — ✅ Fixed, and improved)
The switch from SQL quote-escaping to DNS label validation is strictly better:
^[a-z0-9]([a-z0-9\-]*[a-z0-9])?$is the correct RFC 1123 DNS label regex (single-char edge case handled by[a-z0-9]([a-z0-9\-]*[a-z0-9])?) ✅- Applied consistently in
findAgentByName,findProviderByName,findPolicyByName✅ - 12-case test table covers the important injection/edge cases ✅
35fe35df — UI improvements
Cosmetic/UX changes — column reordering, empty states, search placeholder updates. No type safety issues or any escapes spotted. Looks clean.
6d420b07 — SDK adapter alignment
Necessary follow-up to the SDK type regeneration. Correct: stops JSON.stringify-ing already-structured types.
Scorecard — all findings resolved:
| Finding | Status |
|---|---|
pruneRemovedAgents false-positive deletion |
✅ Fixed (origin annotations + namespace scoping) |
findAgentByName SQL injection |
✅ Fixed (DNS validation replaces escaping) |
upsertAgent silent json.Marshal discard |
✅ Fixed |
Duplicate select / dead code path |
✅ Fixed (nil channel pattern) |
resolveSandboxImage swallowed parse error |
✅ Fixed (warning log) |
handler.go silent marshal discard |
✅ Fixed |
Zero tests for configmap_reconciler.go |
✅ Fixed |
| 30s polling vs Informer | ✅ Fixed (SharedInformerFactory, 2s debounce, 5m resync) |
Size: 100 pagination truncation |
✅ Fixed (Size: 1 for all name lookups) |
| Migration SQL string concatenation | ✅ Fixed |
| Registry allowlist | ✅ Fixed (ALLOWED_SANDBOX_REGISTRIES, defaults to quay.io/ambient_code + ghcr.io/nvidia) |
amber/approved — well-executed fix pass. The SDK type generation improvement is a nice bonus that makes future field handling cleaner across all three language targets.
— Amber
Add read-only CLI commands for listing and describing providers and policies, following existing patterns for other resource types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch provider find/update to raw HTTP helpers (queryAPI/patchAPI) to avoid SDK deserialization failures on complex annotation fields. Add ConfigMap UID annotations to track resource IDs across reconcile cycles, preventing duplicates when name-based lookup fails. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
ambient.ai/kind=agentConfigMaps in tenant namespaces and reconciles agent declarations into the API server database (gated behindOPENSHELL_USE_GATEWAY=true)examples/agent-sandbox-config.yaml) with agent, provider, and policy ConfigMap declarationsTest plan
go build ./...passes for ambient-api-server and ambient-control-planenpx tsc --noEmitpasses for ambient-ui🤖 Generated with Claude Code