fix: exclude idle and waiting_for_input from stalled detection#109
Conversation
The stalled detection handler marks agents as stalled when their last activity event exceeds the threshold (default 5 minutes). However, it did not exclude agents in idle or waiting_for_input states, which are normal waiting states where no activity events are expected. This caused agents waiting for user input to be incorrectly marked as stalled after 5 minutes of inactivity, even though the container was running normally. Add idle and waiting_for_input to the exclusion list in the MarkStalledAgents SQL query.
PR #109 Review: fix: exclude idle and waiting_for_input from stalled detectionExecutive SummaryThis is a small, well-motivated one-line SQL fix that prevents false-positive stalled detection for agents in Critical Issues1. Failing Tests (Blocking)The existing
Suggested Fix in // Line 2209: Remove idle and waiting_for_input from the "should be stalled" list
stalledActivities := []string{"thinking", "executing"}Add new test cases to the "should NOT be marked stalled" section verifying // Idle activity (normal waiting state — should not be stalled)
idleAgent := &store.Agent{
ID: api.NewUUID(), Slug: "idle-agent", Name: "Idle Agent",
Template: "claude", GroveID: grove.ID, Phase: string(state.PhaseCreated),
Visibility: store.VisibilityPrivate,
}
require.NoError(t, s.CreateAgent(ctx, idleAgent))
require.NoError(t, s.UpdateAgentStatus(ctx, idleAgent.ID, store.AgentStatusUpdate{
Phase: "running", Activity: "idle",
}))
_, err = s.db.ExecContext(ctx,
"UPDATE agents SET last_activity_event = ?, last_seen = ? WHERE id = ?",
staleActivityTime, recentHeartbeat, idleAgent.ID)
require.NoError(t, err)
// waiting_for_input activity (normal waiting state — should not be stalled)
waitingAgent := &store.Agent{
ID: api.NewUUID(), Slug: "waiting-agent", Name: "Waiting Agent",
Template: "claude", GroveID: grove.ID, Phase: string(state.PhaseCreated),
Visibility: store.VisibilityPrivate,
}
require.NoError(t, s.CreateAgent(ctx, waitingAgent))
require.NoError(t, s.UpdateAgentStatus(ctx, waitingAgent.ID, store.AgentStatusUpdate{
Phase: "running", Activity: "waiting_for_input",
}))
_, err = s.db.ExecContext(ctx,
"UPDATE agents SET last_activity_event = ?, last_seen = ? WHERE id = ?",
staleActivityTime, recentHeartbeat, waitingAgent.ID)
require.NoError(t, err)And add verification assertions at the end: a, err = s.GetAgent(ctx, idleAgent.ID)
require.NoError(t, err)
assert.Equal(t, "idle", a.Activity, "idle agent should not be stalled")
a, err = s.GetAgent(ctx, waitingAgent.ID)
require.NoError(t, err)
assert.Equal(t, "waiting_for_input", a.Activity, "waiting_for_input agent should not be stalled")For // Line 2390-2391
require.NoError(t, s.UpdateAgentStatus(ctx, agent.ID, store.AgentStatusUpdate{
Phase: "running", Activity: "thinking", // was "idle"
}))ObservationsSemantic Correctness of the FixThe fix is semantically correct. Looking at the state model in case ActivityWaitingForInput, ActivityBlocked, ActivityCompleted, ActivityLimitsExceeded:
// these are sticky
Minor: Comment Could Be UpdatedThe inline SQL comment on line 1745 says "terminal/sticky activity" but Positive Feedback
Final VerdictNeeds Rework. The logic change is correct and well-motivated, but the PR ships with 2 broken tests ( |
Per PR GoogleCloudPlatform#109 review from ptone: the SQL fix correctly excludes idle and waiting_for_input from stalled detection, but the existing tests asserted the old behaviour and were failing. Changes in pkg/store/sqlite/sqlite_test.go: - TestMarkStalledAgents: remove idle/waiting_for_input from the "should be stalled" list (now only thinking/executing), add them as explicit "should NOT be stalled" cases with stale activity + recent heartbeat to verify exclusion. - TestMarkStalledAgents_Idempotent: change setup activity from idle to thinking so the agent is eligible for stalling. Also updates the inline SQL comment in sqlite.go to say "terminal/sticky/waiting" since idle is neither terminal nor sticky.
|
Thanks for the review @ptone — pushed
|
Per PR GoogleCloudPlatform#109 review from ptone: the SQL fix correctly excludes idle and waiting_for_input from stalled detection, but the existing tests asserted the old behaviour and were failing. Changes in pkg/store/sqlite/sqlite_test.go: - TestMarkStalledAgents: remove idle/waiting_for_input from the "should be stalled" list (now only thinking/executing), add them as explicit "should NOT be stalled" cases with stale activity + recent heartbeat to verify exclusion. - TestMarkStalledAgents_Idempotent: change setup activity from idle to thinking so the agent is eligible for stalling. Also updates the inline SQL comment in sqlite.go to say "terminal/sticky/waiting" since idle is neither terminal nor sticky.
PR #109 Review: fix: exclude idle and waiting_for_input from stalled detectionPR: #109 Executive SummaryLow-risk, well-scoped bug fix that correctly prevents false-positive stall detection for agents in legitimate waiting states ( Critical IssuesNone. Observations1. Consistency gap with
|
Summary
Partial fix for #108 — addresses the stalled detection false positives for idle agents.
The
MarkStalledAgentsSQL query marks agents as stalled when theirlast_activity_eventexceeds the threshold (5 minutes). However,idleandwaiting_for_inputare normal waiting states where no activity events are expected — the agent is waiting for user input, not hung.Fix: Add
idleandwaiting_for_inputto the exclusion list alongside other terminal/sticky states.Note: this PR does not address the deeper phase persistence issue described in #108 (hub losing
runningphase after PTY disconnect). That appears to be a separate state management bug in the heartbeat reconciliation path.Test plan
go build/go vetcleanrunningin hub UIthinking/executingstates still get marked as stalled after threshold