Skip to content

Mark provisioned agent auth failures as error#114

Closed
mfreeman451 wants to merge 1 commit intoGoogleCloudPlatform:mainfrom
mfreeman451:fix/provisioned-agent-auth-failure-state
Closed

Mark provisioned agent auth failures as error#114
mfreeman451 wants to merge 1 commit intoGoogleCloudPlatform:mainfrom
mfreeman451:fix/provisioned-agent-auth-failure-state

Conversation

@mfreeman451
Copy link
Copy Markdown
Contributor

Summary

  • mark provisioned agents as error when startup fails after GetAgent() has already created agent-info.json
  • cover auth-resolution failures that happen before runtime.Run() is called
  • keep the existing run-failure coverage and add focused auth-failure coverage

Testing

  • go test ./pkg/agent -run 'TestStart_(ErrorPropagation_Tmux|ErrorPropagation_Tmux_Missing|RunFailureMarksAgentInfoError|AuthFailureMarksAgentInfoError)$'\n

Copy link
Copy Markdown
Contributor Author

@mfreeman451 mfreeman451 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mfreeman451 mfreeman451 marked this pull request as ready for review April 10, 2026 00:48
@mfreeman451 mfreeman451 force-pushed the fix/provisioned-agent-auth-failure-state branch from 593bc6e to 8ae1cbd Compare April 11, 2026 05:15
Copy link
Copy Markdown
Contributor Author

@mfreeman451 mfreeman451 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Copy Markdown
Contributor Author

@mfreeman451 mfreeman451 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mfreeman451 mfreeman451 force-pushed the fix/provisioned-agent-auth-failure-state branch from bfa7893 to 3d360f2 Compare April 11, 2026 06:52
@mfreeman451
Copy link
Copy Markdown
Contributor Author

Applied the same cleanup pattern here.

Changes:

  • converted the prompt handling path to guard-clause flow
  • stopped silently dropping local config/status update failures
  • stopped silently dropping prompt.md and scion-agent.json writes
  • replaced the inline stringly tmux classification with explicit runtime-error classification
  • primary missing-binary detection now uses errors.Is(err, exec.ErrNotFound), with only a narrow shell-text fallback
  • updated focused tests accordingly

Current head: 3d360f2e

@ptone
Copy link
Copy Markdown
Member

ptone commented Apr 11, 2026

PR #114 Review: Mark provisioned agent auth failures as error

Branch: fix/provisioned-agent-auth-failure-state
Files Changed: pkg/agent/run.go, pkg/agent/error_prop_test.go
Diff: +291 / -28


Executive Summary

This PR adds a defer-based error-marking mechanism so that agents whose agent-info.json has already been written (i.e., provisioned via GetAgent()) are correctly marked as "error" when subsequent steps (auth resolution, container launch) fail. The approach is sound, but a new unscoped container-existence check introduces a grove-isolation bypass and duplicates existing logic.


Critical Issues

1. Duplicate & Unscoped Container Existence Check — Grove Isolation Bypass

File: pkg/agent/run.go, lines 60–85 (new code)

The PR adds a new container-existence check at the top of Start() that calls m.Runtime.List(ctx, nil) (no label filter) and matches containers by raw name/ContainerID without any grove scoping:

// Line 60-85 (NEW)
agents, err := m.Runtime.List(ctx, nil) // lists ALL containers
if err == nil {
    for _, a := range agents {
        if a.ContainerID == opts.Name || strings.EqualFold(a.Name, opts.Name) || ... {
            // ... may return or DELETE without grove check
        }
    }
}

The original grove-scoped check remains at lines 100–128, performing the same logic but correctly filtered by label and grove:

// Line 100-128 (EXISTING)
agents, err = m.Runtime.List(ctx, map[string]string{"scion.name": slug})
// ... with matchAgentGrove(a, groveName, groveID) guard

Problems:

  1. Security / Correctness: The unscoped check can match and delete a container belonging to a different grove if the names collide (line 80: m.Runtime.Delete(ctx, a.ContainerID)). It can also return a running container from another grove as if it belongs to the current grove (line 76: return &a, nil).
  2. Performance: Two List() calls to the runtime per Start() invocation. The first fetches all containers.
  3. Redundancy: The second check already handles the same logic with proper scoping.

Suggested Fix: Remove the unscoped block (lines 60–85) entirely unless there is a specific requirement to match containers that lack Scion labels. If such a requirement exists, the grove-isolation guard (matchAgentGrove) must be applied, which requires projectDir resolution to happen first.


Observations

2. provisioned Variable Is Always true — Dead Guard

File: pkg/agent/run.go, lines 159, 162

provisioned := true  // always true, never reassigned
defer func() {
    if err == nil || !provisioned {  // !provisioned is always false
        return
    }
    // ...
}()

provisioned is set to true unconditionally after GetAgent() succeeds and is never modified. The !provisioned check in the defer is dead code. If this was intended as a future-proofing hook, it should be documented. Otherwise, simplify:

defer func() {
    if err == nil {
        return
    }
    if updateErr := UpdateAgentConfig(...); updateErr != nil {
        util.Debugf(...)
    }
}()

3. Defer Interaction After Successful Runtime.Run()

File: pkg/agent/run.go, lines 849–888

After a successful Runtime.Run(), the function calls UpdateAgentConfig(..., "running", ...) at line 858. If the container is subsequently found to have exited (line 873), the function returns an error, which triggers the defer to overwrite the status to "error". This is correct behavior — the agent was running but died, and the final state should be "error". Worth noting that this is intentional and well-designed.

However, there is a subtle edge case: if m.Runtime.List() at line 863 fails (network timeout, etc.), err is temporarily set to a non-nil value but the function falls through to line 888 which returns nil error. The defer correctly sees err == nil and skips. This is safe but somewhat fragile — a future refactor adding an early return between lines 863–888 could accidentally trigger the defer and overwrite the "running" status.

4. classifyLaunchRuntimeError Format String

File: pkg/agent/run.go, line 45

return fmt.Errorf("failed to launch container: %w in image %q: %w", ErrTmuxBinaryNotFound, resolvedImage, err)

The dual %w is valid since Go 1.20, but the message structure embeds the first wrapped error mid-sentence. The resulting message reads: "failed to launch container: tmux binary not found in image "myimage": <original error>". This is readable but unconventional. Since this function already exists on main (from PR #111), this is not a blocking issue for this PR.

5. Error Handling for os.WriteFile — Good Improvement

File: pkg/agent/run.go, lines 161, 632–636

Previously, os.WriteFile errors for prompt.md and scion-agent.json were silently ignored (_ =). The PR now checks these errors and returns them. This is a meaningful correctness improvement — silent write failures could lead to agents starting with stale or missing configuration.


Positive Feedback

  1. Named return + defer pattern for error marking is idiomatic Go and elegantly covers all failure paths after provisioning without requiring error-marking boilerplate at every return statement.

  2. profileNameForError tracking (line 160, updated at ~line 291) correctly ensures the resolved profile name is used in error marking rather than the raw input — important for multi-profile setups.

  3. Test improvementsTestStart_ErrorPropagation_Tmux_Missing now properly wraps exec.ErrNotFound and asserts with errors.Is(err, ErrTmuxBinaryNotFound) instead of fragile string matching. The new TestStart_RunFailureMarksAgentInfoError and TestStart_AuthFailureMarksAgentInfoError tests provide good coverage for the core feature (verifying agent-info.json is marked "error" and the correct runtime is recorded).

  4. Guard clause reordering for finalScionCfg == nil (lines 169–174) — placing the nil check first is more readable.


Final Verdict

Needs Rework. The core defer-based error-marking mechanism is well-implemented and the tests are solid. However, the unscoped container-existence check (lines 60–85) is a blocking issue — it introduces a grove-isolation bypass that could delete or return containers from unrelated groves. This block should be removed or refactored to include grove scoping before merge. The provisioned dead-code observation is minor and non-blocking.

@mfreeman451 mfreeman451 force-pushed the fix/provisioned-agent-auth-failure-state branch from 3d360f2 to fab9d7d Compare April 11, 2026 23:07
@mfreeman451
Copy link
Copy Markdown
Contributor Author

This branch is empty against current main after the recent upstream merges, so I am closing it as superseded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants