Mark failed local agent launches as error #111

Merged
ptone merged 5 commits into GoogleCloudPlatform:main from mfreeman451:fix/local-start-run-failure-state on Apr 11, 2026

@mfreeman451
Contributor

Summary

  • mark provisioned local agents as error when the runtime launch fails after agent-info.json has already been written
  • preserve the provisioned workspace/home for diagnosis instead of leaving a phantom created agent in list/status output
  • add focused regression coverage for the launch-failure state transition

Problem

The local scion start path provisions agent state before attempting the runtime launch. If Runtime.Run then fails, the provisioned agent-info.json stays in created state forever even though no container ever started. In practice that leaves stale phantom agents behind in list output after launch failures.
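
A minimal sketch of the state transition described above, assuming a simplified agent-info.json layout. The AgentInfo struct, the startAgent and helper function names, the launch callback standing in for Runtime.Run, and the "running" status are all hypothetical; only the created and error states come from this PR.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// AgentInfo is a hypothetical, trimmed-down stand-in for agent-info.json.
type AgentInfo struct {
	Name   string `json:"name"`
	Status string `json:"status"`
	Error  string `json:"error,omitempty"`
}

// writeAgentInfo persists info as agent-info.json inside dir.
func writeAgentInfo(dir string, info AgentInfo) error {
	data, err := json.MarshalIndent(info, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "agent-info.json"), data, 0o644)
}

// readAgentInfo loads agent-info.json from dir.
func readAgentInfo(dir string) (AgentInfo, error) {
	var info AgentInfo
	data, err := os.ReadFile(filepath.Join(dir, "agent-info.json"))
	if err != nil {
		return info, err
	}
	err = json.Unmarshal(data, &info)
	return info, err
}

// startAgent provisions state first, then launches. If the launch fails,
// it rewrites the status to "error" and leaves dir in place for diagnosis.
func startAgent(dir, name string, launch func() error) error {
	info := AgentInfo{Name: name, Status: "created"}
	if err := writeAgentInfo(dir, info); err != nil {
		return fmt.Errorf("provision agent %q: %w", name, err)
	}
	if err := launch(); err != nil {
		info.Status = "error"
		info.Error = err.Error()
		if werr := writeAgentInfo(dir, info); werr != nil {
			// Report both failures rather than dropping the write error.
			return errors.Join(err, werr)
		}
		return err
	}
	info.Status = "running" // assumed post-launch state, not taken from the PR
	return writeAgentInfo(dir, info)
}

// statusAfterFailedLaunch runs startAgent with a failing launch in a
// temporary directory and returns the status recorded on disk.
func statusAfterFailedLaunch() (string, error) {
	dir, err := os.MkdirTemp("", "agent-sketch-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	launchErr := errors.New("runtime launch failed")
	err = startAgent(dir, "demo", func() error { return launchErr })
	if !errors.Is(err, launchErr) {
		return "", fmt.Errorf("expected launch error, got %v", err)
	}
	info, err := readAgentInfo(dir)
	return info.Status, err
}

func main() {
	status, err := statusAfterFailedLaunch()
	if err != nil {
		fmt.Println("sketch failed:", err)
		return
	}
	fmt.Println("status after failed launch:", status)
}
```

The point of the sketch is the ordering: the state file is written before the launch, so the failure branch has to rewrite it, otherwise the agent stays in created.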

Validation

  • go test ./pkg/agent -run 'TestStart_(ErrorPropagation_Tmux|ErrorPropagation_Tmux_Missing|RunFailureMarksAgentInfoError)$'

Contributor Author

@mfreeman451 mfreeman451 left a comment

lgtm

@mfreeman451 mfreeman451 marked this pull request as ready for review April 9, 2026 22:07
@mfreeman451 mfreeman451 force-pushed the fix/local-start-run-failure-state branch from ac7a584 to 286fd84 Compare April 11, 2026 05:15
Contributor Author

@mfreeman451 mfreeman451 left a comment

lgtm

@mfreeman451 mfreeman451 force-pushed the fix/local-start-run-failure-state branch from 0454b40 to 3373b75 Compare April 11, 2026 06:50
@mfreeman451
Contributor Author

Follow-up cleanup is in.

Changes:

  • local config/status update failures are no longer silently dropped; they are now logged at debug level
  • errors from the prompt.md and scion-agent.json writes are no longer silently ignored
  • replaced the inline string-matching tmux classification with explicit runtime-error classification
  • primary missing-binary detection now uses errors.Is(err, exec.ErrNotFound), with only a narrow fallback for shell-generated tmux: command not found text
  • updated focused tests accordingly
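
The classification described in the list above could look roughly like the following sketch. The launchErrorKind type, its constants, and the function names are hypothetical, and the sketch assumes the launch path only execs tmux, so any exec.ErrNotFound is treated as a missing tmux binary.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// launchErrorKind is a hypothetical classification of runtime launch errors.
type launchErrorKind int

const (
	launchErrorOther launchErrorKind = iota
	launchErrorTmuxMissing
)

// classifyLaunchError prefers the typed check and only falls back to a
// narrow text match for shell-generated "tmux: command not found" output.
func classifyLaunchError(err error) launchErrorKind {
	switch {
	case err == nil:
		return launchErrorOther
	case errors.Is(err, exec.ErrNotFound):
		return launchErrorTmuxMissing
	case strings.Contains(err.Error(), "tmux: command not found"):
		return launchErrorTmuxMissing
	default:
		return launchErrorOther
	}
}

// sampleLaunchErrors returns illustrative errors: a LookPath-style failure,
// a shell-generated message, and an unrelated failure.
func sampleLaunchErrors() []error {
	return []error{
		&exec.Error{Name: "tmux", Err: exec.ErrNotFound},
		errors.New("sh: 1: tmux: command not found"),
		errors.New("image pull failed"),
	}
}

func main() {
	for _, err := range sampleLaunchErrors() {
		missing := classifyLaunchError(err) == launchErrorTmuxMissing
		fmt.Printf("%q -> tmux missing: %v\n", err.Error(), missing)
	}
}
```

Keeping the text fallback to one exact substring limits false positives from unrelated errors that happen to mention tmux.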

Current head: 3373b758

@ptone ptone merged commit 7e7f9e2 into GoogleCloudPlatform:main Apr 11, 2026
1 check passed
scion-gteam bot pushed a commit to ptone/scion that referenced this pull request Apr 12, 2026
* Mark failed local agent launches as error

* Log local config update failures during failed local starts

* Use typed tmux launch error on failed local starts

* Check prompt and agent config writes during failed local starts

* Fix awkward dual-%w error message in classifyLaunchRuntimeError

Reorder the format string so the sentence reads naturally:
"failed to launch container in image <img>: tmux binary not found: <cause>"
instead of embedding the sentinel error mid-clause.

---------

Co-authored-by: Preston Holmes <ptone@google.com>
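
The message reordering from the last commit above can be illustrated with a short sketch. The errTmuxNotFound sentinel, the launchError function, and the demo cause are hypothetical names; the sketch assumes Go 1.20 or later, where fmt.Errorf accepts more than one %w verb.

```go
package main

import (
	"errors"
	"fmt"
)

// errTmuxNotFound is a hypothetical sentinel for the missing-binary case.
var errTmuxNotFound = errors.New("tmux binary not found")

// errDemoCause stands in for the underlying runtime error.
var errDemoCause = errors.New("exit status 127")

// launchError places the sentinel after the image clause and the cause last,
// so the message reads as one sentence while both errors stay wrapped.
func launchError(image string, cause error) error {
	return fmt.Errorf("failed to launch container in image %s: %w: %w",
		image, errTmuxNotFound, cause)
}

// wrapsBoth reports whether err wraps both the sentinel and the cause.
func wrapsBoth(err, cause error) bool {
	return errors.Is(err, errTmuxNotFound) && errors.Is(err, cause)
}

func main() {
	err := launchError("demo:latest", errDemoCause)
	fmt.Println(err)
	// prints: failed to launch container in image demo:latest: tmux binary not found: exit status 127
	fmt.Println(wrapsBoth(err, errDemoCause)) // prints: true
}
```

Callers can still match on either error with errors.Is, so the reordering only affects how the message reads.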