Skip to content

fix: exclude idle and waiting_for_input from stalled detection#109

Merged
ptone merged 2 commits intoGoogleCloudPlatform:mainfrom
Empiria:fix/sciontool-idle-status-on-startup
Apr 13, 2026
Merged

fix: exclude idle and waiting_for_input from stalled detection#109
ptone merged 2 commits intoGoogleCloudPlatform:mainfrom
Empiria:fix/sciontool-idle-status-on-startup

Conversation

@meatballs
Copy link
Copy Markdown
Contributor

Summary

Partial fix for #108 — addresses the stalled detection false positives for idle agents.

The MarkStalledAgents SQL query marks agents as stalled when their last_activity_event exceeds the threshold (5 minutes). However, idle and waiting_for_input are normal waiting states where no activity events are expected — the agent is waiting for user input, not hung.

Fix: Add idle and waiting_for_input to the exclusion list alongside other terminal/sticky states.

Note: this PR does not address the deeper phase persistence issue described in #108 (hub losing running phase after PTY disconnect). That appears to be a separate state management bug in the heartbeat reconciliation path.

Test plan

  • go build / go vet clean
  • Start agent, wait >5 minutes idle — verify it stays running in hub UI
  • Verify agents in thinking/executing states still get marked as stalled after threshold

The stalled detection handler marks agents as stalled when their last
activity event exceeds the threshold (default 5 minutes). However, it
did not exclude agents in idle or waiting_for_input states, which are
normal waiting states where no activity events are expected.

This caused agents waiting for user input to be incorrectly marked as
stalled after 5 minutes of inactivity, even though the container was
running normally.

Add idle and waiting_for_input to the exclusion list in the
MarkStalledAgents SQL query.
@ptone
Copy link
Copy Markdown
Member

ptone commented Apr 12, 2026

PR #109 Review: fix: exclude idle and waiting_for_input from stalled detection

Executive Summary

This is a small, well-motivated one-line SQL fix that prevents false-positive stalled detection for agents in idle and waiting_for_input states. However, the PR is broken: it fails two existing unit tests that were written to assert the opposite behavior, and those tests must be updated to match the new semantics.

Critical Issues

1. Failing Tests (Blocking)

The existing TestMarkStalledAgents and TestMarkStalledAgents_Idempotent tests fail with this change. Confirmed by running go test ./pkg/store/sqlite/ -run TestMarkStalledAgents:

  • TestMarkStalledAgents (line 2209): The test explicitly lists idle and waiting_for_input in stalledActivities and expects 4 agents to be marked stalled. With the fix, only 2 are marked (thinking, executing), causing assertion failures.
  • TestMarkStalledAgents_Idempotent (line 2390): Sets up an agent with activity: "idle" and expects it to be marked stalled on the first call. It now returns 0 results.

Suggested Fix in pkg/store/sqlite/sqlite_test.go:

// Line 2209: Remove idle and waiting_for_input from the "should be stalled" list
stalledActivities := []string{"thinking", "executing"}

Add new test cases to the "should NOT be marked stalled" section verifying idle and waiting_for_input are excluded:

// Idle activity (normal waiting state — should not be stalled)
idleAgent := &store.Agent{
    ID: api.NewUUID(), Slug: "idle-agent", Name: "Idle Agent",
    Template: "claude", GroveID: grove.ID, Phase: string(state.PhaseCreated),
    Visibility: store.VisibilityPrivate,
}
require.NoError(t, s.CreateAgent(ctx, idleAgent))
require.NoError(t, s.UpdateAgentStatus(ctx, idleAgent.ID, store.AgentStatusUpdate{
    Phase: "running", Activity: "idle",
}))
_, err = s.db.ExecContext(ctx,
    "UPDATE agents SET last_activity_event = ?, last_seen = ? WHERE id = ?",
    staleActivityTime, recentHeartbeat, idleAgent.ID)
require.NoError(t, err)

// waiting_for_input activity (normal waiting state — should not be stalled)
waitingAgent := &store.Agent{
    ID: api.NewUUID(), Slug: "waiting-agent", Name: "Waiting Agent",
    Template: "claude", GroveID: grove.ID, Phase: string(state.PhaseCreated),
    Visibility: store.VisibilityPrivate,
}
require.NoError(t, s.CreateAgent(ctx, waitingAgent))
require.NoError(t, s.UpdateAgentStatus(ctx, waitingAgent.ID, store.AgentStatusUpdate{
    Phase: "running", Activity: "waiting_for_input",
}))
_, err = s.db.ExecContext(ctx,
    "UPDATE agents SET last_activity_event = ?, last_seen = ? WHERE id = ?",
    staleActivityTime, recentHeartbeat, waitingAgent.ID)
require.NoError(t, err)

And add verification assertions at the end:

a, err = s.GetAgent(ctx, idleAgent.ID)
require.NoError(t, err)
assert.Equal(t, "idle", a.Activity, "idle agent should not be stalled")

a, err = s.GetAgent(ctx, waitingAgent.ID)
require.NoError(t, err)
assert.Equal(t, "waiting_for_input", a.Activity, "waiting_for_input agent should not be stalled")

For TestMarkStalledAgents_Idempotent, change the setup activity from "idle" to "thinking":

// Line 2390-2391
require.NoError(t, s.UpdateAgentStatus(ctx, agent.ID, store.AgentStatusUpdate{
    Phase: "running", Activity: "thinking",  // was "idle"
}))

Observations

Semantic Correctness of the Fix

The fix is semantically correct. Looking at the state model in pkg/agent/state/state.go (line 144):

case ActivityWaitingForInput, ActivityBlocked, ActivityCompleted, ActivityLimitsExceeded:
    // these are sticky

waiting_for_input is already defined as a sticky activity alongside blocked (which was already excluded). It makes sense to exclude it from stalled detection for the same reason blocked is excluded. Similarly, idle represents a normal resting state where no activity events are expected.

Minor: Comment Could Be Updated

The inline SQL comment on line 1745 says "terminal/sticky activity" but idle is neither terminal nor sticky. Consider updating to:

-- Are not already in a terminal/sticky/waiting activity or already stalled/offline

Positive Feedback

Final Verdict

Needs Rework. The logic change is correct and well-motivated, but the PR ships with 2 broken tests (TestMarkStalledAgents, TestMarkStalledAgents_Idempotent). These tests must be updated to reflect the new intended behavior before merging. The PR's own test plan checkbox ("Verify agents in thinking/executing states still get marked as stalled after threshold") is the right check, but the existing tests assert the old behavior and will fail CI.

Per PR GoogleCloudPlatform#109 review from ptone: the SQL fix correctly excludes
idle and waiting_for_input from stalled detection, but the
existing tests asserted the old behaviour and were failing.

Changes in pkg/store/sqlite/sqlite_test.go:
- TestMarkStalledAgents: remove idle/waiting_for_input from
  the "should be stalled" list (now only thinking/executing),
  add them as explicit "should NOT be stalled" cases with
  stale activity + recent heartbeat to verify exclusion.
- TestMarkStalledAgents_Idempotent: change setup activity
  from idle to thinking so the agent is eligible for stalling.

Also updates the inline SQL comment in sqlite.go to say
"terminal/sticky/waiting" since idle is neither terminal
nor sticky.
@meatballs
Copy link
Copy Markdown
Contributor Author

Thanks for the review @ptone — pushed 0f3acca3 updating the tests.

  • TestMarkStalledAgents: removed idle and waiting_for_input from stalledActivities, added them as explicit non-stalled cases with stale activity + recent heartbeat so the exclusion is verified.
  • TestMarkStalledAgents_Idempotent: changed setup activity from idle to thinking so the agent is eligible for stalling.
  • Updated the inline SQL comment to "terminal/sticky/waiting" per your nit.

go test ./pkg/store/sqlite/ -count=1 passes locally.

meatballs added a commit to Empiria/scion that referenced this pull request Apr 12, 2026
Per PR GoogleCloudPlatform#109 review from ptone: the SQL fix correctly excludes
idle and waiting_for_input from stalled detection, but the
existing tests asserted the old behaviour and were failing.

Changes in pkg/store/sqlite/sqlite_test.go:
- TestMarkStalledAgents: remove idle/waiting_for_input from
  the "should be stalled" list (now only thinking/executing),
  add them as explicit "should NOT be stalled" cases with
  stale activity + recent heartbeat to verify exclusion.
- TestMarkStalledAgents_Idempotent: change setup activity
  from idle to thinking so the agent is eligible for stalling.

Also updates the inline SQL comment in sqlite.go to say
"terminal/sticky/waiting" since idle is neither terminal
nor sticky.
@ptone
Copy link
Copy Markdown
Member

ptone commented Apr 13, 2026

PR #109 Review: fix: exclude idle and waiting_for_input from stalled detection

PR: #109
Files changed: pkg/store/sqlite/sqlite.go, pkg/store/sqlite/sqlite_test.go
Diff size: +42 / -4


Executive Summary

Low-risk, well-scoped bug fix that correctly prevents false-positive stall detection for agents in legitimate waiting states (idle, waiting_for_input). The change is minimal, the SQL modification is correct, and the new test coverage is thorough.


Critical Issues

None.


Observations

1. Consistency gap with MarkStaleAgentsOffline (informational, non-blocking)

The MarkStaleAgentsOffline function (line 1644) excludes completed, limits_exceeded, blocked, and offline from offline marking — but does not exclude idle or waiting_for_input. This means an idle agent with a stale heartbeat will still be correctly marked offline (process died), which is the right semantic distinction:

  • Stalled = process alive (recent heartbeat) but no activity progress → idle/waiting_for_input should NOT trigger this.
  • Offline = process dead (stale heartbeat) → idle/waiting_for_input SHOULD be marked offline.

This is consistent and correct. The TestMarkStaleAgentsOffline test at line 2021 confirms idle and waiting_for_input are still expected to transition to offline on heartbeat loss. No action needed — just noting the design is sound.

2. TestMarkStalledAgents_Idempotent fix (line 2391)

The idempotency test previously used "idle" as the activity for its test agent. Since idle is now excluded from stall detection, the test would have silently passed without actually exercising the stall→stall idempotency path. Changing it to "thinking" is the correct fix.

3. Minor: test at line 2209 still references old list in surrounding context

The existing (pre-PR) TestMarkStaleAgentsOffline test at line 2021 still lists idle and waiting_for_input in activeActivities for the offline detection test. This is correct behavior for that test (those states should go offline), but a future reader might momentarily be confused by the asymmetry. A brief comment there could help, but this is not blocking.


Positive Feedback

  • Surgical SQL change: Only the NOT IN clause is modified, minimizing blast radius.
  • Test coverage is excellent: Two dedicated test agents (idleAgent, waitingAgent) are created with the exact stalled-detection preconditions (stale last_activity_event + recent heartbeat + phase = 'running'), and assertions verify they retain their original activity after MarkStalledAgents runs.
  • Idempotency test updated correctly: The author caught that the idempotency test used idle and fixed it to thinking, ensuring the test still exercises meaningful stall transitions.
  • PR description is clear about scope and what it does NOT address (the deeper phase persistence issue in Hub marks running agents as stalled/idle, loses phase on PTY disconnect #108).

Final Verdict

Approve. This is a clean, correct fix with good test coverage. No blocking issues found.

@ptone ptone merged commit 1dd289a into GoogleCloudPlatform:main Apr 13, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants