Mark provisioned agent auth failures as error by mfreeman451 · Pull Request #114 · GoogleCloudPlatform/scion

mfreeman451 · 2026-04-10T00:41:01Z

Summary

mark provisioned agents as error when startup fails after GetAgent() has already created agent-info.json
cover auth-resolution failures that happen before runtime.Run() is called
keep the existing run-failure coverage and add focused auth-failure coverage

Testing

go test ./pkg/agent -run 'TestStart_(ErrorPropagation_Tmux|ErrorPropagation_Tmux_Missing|RunFailureMarksAgentInfoError|AuthFailureMarksAgentInfoError)$'\n

mfreeman451

lgtm

mfreeman451

lgtm

mfreeman451

lgtm

mfreeman451 · 2026-04-11T13:50:09Z

Applied the same cleanup pattern here.

Changes:

converted the prompt handling path to guard-clause flow
stopped silently dropping local config/status update failures
stopped silently dropping prompt.md and scion-agent.json writes
replaced the inline stringly tmux classification with explicit runtime-error classification
primary missing-binary detection now uses errors.Is(err, exec.ErrNotFound), with only a narrow shell-text fallback
updated focused tests accordingly

Current head: 3d360f2e

ptone · 2026-04-11T21:22:23Z

PR #114 Review: Mark provisioned agent auth failures as error

Branch: fix/provisioned-agent-auth-failure-state
Files Changed: pkg/agent/run.go, pkg/agent/error_prop_test.go
Diff: +291 / -28

Executive Summary

This PR adds a defer-based error-marking mechanism so that agents whose agent-info.json has already been written (i.e., provisioned via GetAgent()) are correctly marked as "error" when subsequent steps (auth resolution, container launch) fail. The approach is sound, but a new unscoped container-existence check introduces a grove-isolation bypass and duplicates existing logic.

Critical Issues

1. Duplicate & Unscoped Container Existence Check — Grove Isolation Bypass

File: pkg/agent/run.go, lines 60–85 (new code)

The PR adds a new container-existence check at the top of Start() that calls m.Runtime.List(ctx, nil) (no label filter) and matches containers by raw name/ContainerID without any grove scoping:

// Line 60-85 (NEW)
agents, err := m.Runtime.List(ctx, nil) // lists ALL containers
if err == nil {
    for _, a := range agents {
        if a.ContainerID == opts.Name || strings.EqualFold(a.Name, opts.Name) || ... {
            // ... may return or DELETE without grove check
        }
    }
}

The original grove-scoped check remains at lines 100–128, performing the same logic but correctly filtered by label and grove:

// Line 100-128 (EXISTING)
agents, err = m.Runtime.List(ctx, map[string]string{"scion.name": slug})
// ... with matchAgentGrove(a, groveName, groveID) guard

Problems:

Security / Correctness: The unscoped check can match and delete a container belonging to a different grove if the names collide (line 80: m.Runtime.Delete(ctx, a.ContainerID)). It can also return a running container from another grove as if it belongs to the current grove (line 76: return &a, nil).
Performance: Two List() calls to the runtime per Start() invocation. The first fetches all containers.
Redundancy: The second check already handles the same logic with proper scoping.

Suggested Fix: Remove the unscoped block (lines 60–85) entirely unless there is a specific requirement to match containers that lack Scion labels. If such a requirement exists, the grove-isolation guard (matchAgentGrove) must be applied, which requires projectDir resolution to happen first.

Observations

2. `provisioned` Variable Is Always `true` — Dead Guard

File: pkg/agent/run.go, lines 159, 162

provisioned := true  // always true, never reassigned
defer func() {
    if err == nil || !provisioned {  // !provisioned is always false
        return
    }
    // ...
}()

provisioned is set to true unconditionally after GetAgent() succeeds and is never modified. The !provisioned check in the defer is dead code. If this was intended as a future-proofing hook, it should be documented. Otherwise, simplify:

defer func() {
    if err == nil {
        return
    }
    if updateErr := UpdateAgentConfig(...); updateErr != nil {
        util.Debugf(...)
    }
}()

3. Defer Interaction After Successful `Runtime.Run()`

File: pkg/agent/run.go, lines 849–888

After a successful Runtime.Run(), the function calls UpdateAgentConfig(..., "running", ...) at line 858. If the container is subsequently found to have exited (line 873), the function returns an error, which triggers the defer to overwrite the status to "error". This is correct behavior — the agent was running but died, and the final state should be "error". Worth noting that this is intentional and well-designed.

However, there is a subtle edge case: if m.Runtime.List() at line 863 fails (network timeout, etc.), err is temporarily set to a non-nil value but the function falls through to line 888 which returns nil error. The defer correctly sees err == nil and skips. This is safe but somewhat fragile — a future refactor adding an early return between lines 863–888 could accidentally trigger the defer and overwrite the "running" status.

4. `classifyLaunchRuntimeError` Format String

File: pkg/agent/run.go, line 45

return fmt.Errorf("failed to launch container: %w in image %q: %w", ErrTmuxBinaryNotFound, resolvedImage, err)

The dual %w is valid since Go 1.20, but the message structure embeds the first wrapped error mid-sentence. The resulting message reads: "failed to launch container: tmux binary not found in image "myimage": <original error>". This is readable but unconventional. Since this function already exists on main (from PR #111), this is not a blocking issue for this PR.

5. Error Handling for `os.WriteFile` — Good Improvement

File: pkg/agent/run.go, lines 161, 632–636

Previously, os.WriteFile errors for prompt.md and scion-agent.json were silently ignored (_ =). The PR now checks these errors and returns them. This is a meaningful correctness improvement — silent write failures could lead to agents starting with stale or missing configuration.

Positive Feedback

Named return + defer pattern for error marking is idiomatic Go and elegantly covers all failure paths after provisioning without requiring error-marking boilerplate at every return statement.
profileNameForError tracking (line 160, updated at ~line 291) correctly ensures the resolved profile name is used in error marking rather than the raw input — important for multi-profile setups.
Test improvements — TestStart_ErrorPropagation_Tmux_Missing now properly wraps exec.ErrNotFound and asserts with errors.Is(err, ErrTmuxBinaryNotFound) instead of fragile string matching. The new TestStart_RunFailureMarksAgentInfoError and TestStart_AuthFailureMarksAgentInfoError tests provide good coverage for the core feature (verifying agent-info.json is marked "error" and the correct runtime is recorded).
Guard clause reordering for finalScionCfg == nil (lines 169–174) — placing the nil check first is more readable.

Final Verdict

Needs Rework. The core defer-based error-marking mechanism is well-implemented and the tests are solid. However, the unscoped container-existence check (lines 60–85) is a blocking issue — it introduces a grove-isolation bypass that could delete or return containers from unrelated groves. This block should be removed or refactored to include grove scoping before merge. The provisioned dead-code observation is minor and non-blocking.

mfreeman451 · 2026-04-11T23:22:32Z

This branch is empty against current main after the recent upstream merges, so I am closing it as superseded.

mfreeman451 commented Apr 10, 2026

View reviewed changes

mfreeman451 marked this pull request as ready for review April 10, 2026 00:48

mfreeman451 force-pushed the fix/provisioned-agent-auth-failure-state branch from 593bc6e to 8ae1cbd Compare April 11, 2026 05:15

mfreeman451 commented Apr 11, 2026

View reviewed changes

mfreeman451 force-pushed the fix/provisioned-agent-auth-failure-state branch from bfa7893 to 3d360f2 Compare April 11, 2026 06:52

ptone added the Needs rework label Apr 11, 2026

Check agent config write during provisioned auth failures

fab9d7d

mfreeman451 force-pushed the fix/provisioned-agent-auth-failure-state branch from 3d360f2 to fab9d7d Compare April 11, 2026 23:07

mfreeman451 closed this Apr 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mark provisioned agent auth failures as error#114

Mark provisioned agent auth failures as error#114
mfreeman451 wants to merge 1 commit intoGoogleCloudPlatform:mainfrom
mfreeman451:fix/provisioned-agent-auth-failure-state

mfreeman451 commented Apr 10, 2026

Uh oh!

mfreeman451 left a comment

Uh oh!

mfreeman451 left a comment

Uh oh!

mfreeman451 left a comment

Uh oh!

mfreeman451 commented Apr 11, 2026

Uh oh!

ptone commented Apr 11, 2026

Uh oh!

mfreeman451 commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mfreeman451 commented Apr 10, 2026

Summary

Testing

Uh oh!

mfreeman451 left a comment

Choose a reason for hiding this comment

Uh oh!

mfreeman451 left a comment

Choose a reason for hiding this comment

Uh oh!

mfreeman451 left a comment

Choose a reason for hiding this comment

Uh oh!

mfreeman451 commented Apr 11, 2026

Uh oh!

ptone commented Apr 11, 2026

PR #114 Review: Mark provisioned agent auth failures as error

Executive Summary

Critical Issues

1. Duplicate & Unscoped Container Existence Check — Grove Isolation Bypass

Observations

2. provisioned Variable Is Always true — Dead Guard

3. Defer Interaction After Successful Runtime.Run()

4. classifyLaunchRuntimeError Format String

5. Error Handling for os.WriteFile — Good Improvement

Positive Feedback

Final Verdict

Uh oh!

mfreeman451 commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

2. `provisioned` Variable Is Always `true` — Dead Guard

3. Defer Interaction After Successful `Runtime.Run()`

4. `classifyLaunchRuntimeError` Format String

5. Error Handling for `os.WriteFile` — Good Improvement