Overview
Phase 2 of 3 for fixing stuck sessions (Issue #235).
Implement automatic session cleanup when agents disconnect from the Control Plane. This prevents sessions from getting stuck in the first place by proactively handling agent disconnections.
Problem
Currently, when an agent disconnects (crashes, network issue, scaling down), its sessions remain in "running" or "terminating" states until the Session Reconciler eventually force-terminates them after 10 minutes.
This creates:
- Poor UX: Users see stale "running" sessions that aren't actually running
- Resource waste: Kubernetes resources may linger until manual cleanup
- Delayed feedback: 10-minute wait before reconciler kicks in
Solution
Add an agent disconnect handler to the Control Plane's AgentHub:
// When agent WebSocket disconnects:
func (h *AgentHub) onAgentDisconnect(agentID string) {
    // 1. Find all sessions assigned to this agent
    sessions := h.db.GetSessionsByAgent(agentID)

    // 2. Mark sessions appropriately based on state:
    for _, session := range sessions {
        switch session.State {
        case "running":
            // Mark as "disconnected" (new state)
            h.db.UpdateSessionState(session.ID, "disconnected")
        case "terminating":
            // Already terminating, let reconciler handle
            // (it will force-terminate after threshold)
        case "pending":
            // Mark as failed (never started successfully)
            h.db.UpdateSessionState(session.ID, "failed")
        }
    }

    // 3. Emit metrics
    h.metrics.RecordAgentDisconnect(agentID, len(sessions))
}
New Session State: "disconnected"
Add a new session state to indicate the session was running but its agent is gone:
- State: disconnected
- Meaning: Session was running, but agent is no longer connected
- UI Display: Show warning icon, "Agent Disconnected" status
- User Actions:
  - Delete (clean up)
  - Wait for agent to reconnect (session may auto-recover)
  - Terminate (force cleanup via reconciler)
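The transitions involving the new state can be sketched as a small lookup table. This is only an illustration: the `SessionState` type, the constant names, and `CanTransition` are hypothetical, the table covers just the edges this proposal touches, and mapping the "Terminate" action to the "terminating" state is an assumption.

```go
package main

import "fmt"

// SessionState is a hypothetical string-backed type for session states.
type SessionState string

const (
	StatePending      SessionState = "pending"
	StateRunning      SessionState = "running"
	StateDisconnected SessionState = "disconnected" // new state proposed in this issue
	StateTerminating  SessionState = "terminating"
	StateFailed       SessionState = "failed"
)

// allowedTransitions lists only the transitions this proposal touches.
// Assumption: the real state machine has more edges than shown here.
var allowedTransitions = map[SessionState][]SessionState{
	StatePending:      {StateFailed},                    // agent lost before the session started
	StateRunning:      {StateDisconnected},              // agent lost while the session was running
	StateDisconnected: {StateRunning, StateTerminating}, // recovered, or user chose Terminate
}

// CanTransition reports whether from -> to appears in the table above.
func CanTransition(from, to SessionState) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(StateRunning, StateDisconnected))
	fmt.Println(CanTransition(StateDisconnected, StateRunning))
	fmt.Println(CanTransition(StateTerminating, StateDisconnected))
}
```

Keeping the table in one place would let both the disconnect handler and the recovery path reject unexpected state changes instead of writing states blindly.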
Graceful Degradation
If the agent reconnects within a grace period (e.g., 5 minutes):
- Sessions in "disconnected" state can be recovered
- Agent re-establishes VNC tunnels
- Sessions transition back to "running"
This handles temporary network blips without disrupting user sessions.
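A minimal sketch of the grace-period check, assuming the decision is based on the proposed disconnected_at timestamp. The `Recoverable` function, the `gracePeriod` constant, and the example timestamp are hypothetical names introduced for illustration; the 5-minute value just mirrors the example figure above and would likely be configurable.

```go
package main

import (
	"fmt"
	"time"
)

// gracePeriod mirrors the "e.g., 5 minutes" figure; the real value is an assumption.
const gracePeriod = 5 * time.Minute

// exampleDisconnectedAt is an arbitrary timestamp used only by the demo below.
var exampleDisconnectedAt = time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)

// Recoverable reports whether a session marked disconnected at disconnectedAt
// may still be recovered at time now, given the grace period.
func Recoverable(disconnectedAt, now time.Time, grace time.Duration) bool {
	return now.Sub(disconnectedAt) <= grace
}

func main() {
	for _, elapsed := range []time.Duration{2 * time.Minute, 6 * time.Minute} {
		now := exampleDisconnectedAt.Add(elapsed)
		fmt.Printf("reconnect after %v: recoverable=%v\n",
			elapsed, Recoverable(exampleDisconnectedAt, now, gracePeriod))
	}
}
```

Passing `now` in as a parameter, rather than calling `time.Now()` inside the function, keeps the check easy to unit test.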
Implementation Plan
1. Database Changes
- Add "disconnected" to session state enum
- Add disconnected_at timestamp column
- Migration 008
2. AgentHub Changes
- Add onAgentDisconnect() handler
- Call handler in UnregisterAgent() or WebSocket close handler
- Track agent disconnect events
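One way the wiring could look, as a sketch only. The `AgentHub` below is a stripped-down stand-in, not the real hub: its fields, the `onDisconnect` hook, `NewAgentHub`, and the boolean return value are assumptions. The point it illustrates is that if both `UnregisterAgent()` and the WebSocket close handler can fire for the same agent, the disconnect handler should still run only once.

```go
package main

import (
	"fmt"
	"sync"
)

// AgentHub is a simplified stand-in for the real hub (assumed shape).
type AgentHub struct {
	mu           sync.Mutex
	agents       map[string]struct{}
	onDisconnect func(agentID string) // hypothetical hook for onAgentDisconnect
}

func NewAgentHub(onDisconnect func(string)) *AgentHub {
	return &AgentHub{agents: make(map[string]struct{}), onDisconnect: onDisconnect}
}

func (h *AgentHub) RegisterAgent(agentID string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.agents[agentID] = struct{}{}
}

// UnregisterAgent removes the agent and runs the disconnect handler.
// It returns false if the agent was already gone, so a second call
// (for example from the WebSocket close handler) is a no-op.
func (h *AgentHub) UnregisterAgent(agentID string) bool {
	h.mu.Lock()
	_, connected := h.agents[agentID]
	delete(h.agents, agentID)
	h.mu.Unlock()

	if !connected {
		return false
	}
	// Run the handler outside the lock so slow DB updates do not block the hub.
	h.onDisconnect(agentID)
	return true
}

func main() {
	calls := 0
	hub := NewAgentHub(func(agentID string) {
		calls++
		fmt.Println("cleaning up sessions for", agentID)
	})
	hub.RegisterAgent("agent-1")
	hub.UnregisterAgent("agent-1") // e.g. WebSocket close handler
	hub.UnregisterAgent("agent-1") // duplicate call does nothing
	fmt.Println("handler calls:", calls)
}
```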
3. Session Recovery Logic
- When agent reconnects, check for "disconnected" sessions
- Verify session pods still exist
- Re-establish VNC tunnels
- Transition back to "running"
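The recovery steps above can be sketched roughly as follows. `PodChecker` and `TunnelManager` are hypothetical interfaces standing in for the Kubernetes client and the VNC tunnel code, and the in-memory `Session` struct replaces the database; none of these names come from the codebase.

```go
package main

import "fmt"

// Session is a minimal in-memory stand-in for a session row.
type Session struct {
	ID    string
	State string
}

// PodChecker and TunnelManager are hypothetical interfaces for this sketch.
type PodChecker interface {
	PodExists(sessionID string) bool
}

type TunnelManager interface {
	Reestablish(sessionID string) error
}

// RecoverSessions moves "disconnected" sessions back to "running" when their
// pod still exists and the tunnel can be rebuilt. Anything else is left
// untouched for the reconciler or the user to handle. It returns the IDs of
// the sessions it recovered.
func RecoverSessions(sessions []*Session, pods PodChecker, tunnels TunnelManager) []string {
	var recovered []string
	for _, s := range sessions {
		if s.State != "disconnected" {
			continue // only disconnected sessions are candidates
		}
		if !pods.PodExists(s.ID) {
			continue // pod is gone, nothing to reconnect to
		}
		if err := tunnels.Reestablish(s.ID); err != nil {
			continue // tunnel failed, leave the session disconnected
		}
		s.State = "running"
		recovered = append(recovered, s.ID)
	}
	return recovered
}

// Fakes used only by the demo below.
type fakePods map[string]bool

func (f fakePods) PodExists(id string) bool { return f[id] }

type fakeTunnels struct{}

func (fakeTunnels) Reestablish(string) error { return nil }

func main() {
	sessions := []*Session{
		{ID: "s1", State: "disconnected"},
		{ID: "s2", State: "disconnected"}, // pod no longer exists
		{ID: "s3", State: "terminating"},  // not eligible for recovery
	}
	recovered := RecoverSessions(sessions, fakePods{"s1": true, "s3": true}, fakeTunnels{})
	fmt.Println("recovered:", recovered)
}
```

Returning the recovered IDs makes it straightforward to feed the count into a recovery metric.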
4. UI Changes
- Display "disconnected" state with warning icon
- Show "Agent Disconnected" message
- Add "Retry" button to attempt recovery
- Add "Terminate" button to force cleanup
5. Metrics
- sessions_disconnected_total: Sessions marked disconnected
- sessions_recovered_total: Sessions recovered after reconnect
- agent_disconnects_total: Agent disconnect events
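A rough sketch of where each counter would be incremented, using plain standard-library atomics (Go 1.19+). This is an assumption-heavy stand-in: the real Control Plane presumably exports metrics through its existing metrics library, and the `Metrics` struct and `RecordRecovery` method are hypothetical. `RecordAgentDisconnect` keeps the argument shape used in the handler above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Metrics holds the three proposed counters as in-process atomics.
// Assumption: a real implementation would register these with the
// project's metrics exporter instead.
type Metrics struct {
	SessionsDisconnectedTotal atomic.Int64 // sessions_disconnected_total
	SessionsRecoveredTotal    atomic.Int64 // sessions_recovered_total
	AgentDisconnectsTotal     atomic.Int64 // agent_disconnects_total
}

// RecordAgentDisconnect counts one disconnect event and the sessions it
// marked disconnected. agentID is unused here; it could become a label.
func (m *Metrics) RecordAgentDisconnect(agentID string, sessionsMarked int) {
	m.AgentDisconnectsTotal.Add(1)
	m.SessionsDisconnectedTotal.Add(int64(sessionsMarked))
}

// RecordRecovery counts sessions that returned to "running" after a reconnect.
func (m *Metrics) RecordRecovery(sessionsRecovered int) {
	m.SessionsRecoveredTotal.Add(int64(sessionsRecovered))
}

func main() {
	var m Metrics
	m.RecordAgentDisconnect("agent-1", 3)
	m.RecordRecovery(2)
	fmt.Println("agent_disconnects_total:", m.AgentDisconnectsTotal.Load())
	fmt.Println("sessions_disconnected_total:", m.SessionsDisconnectedTotal.Load())
	fmt.Println("sessions_recovered_total:", m.SessionsRecoveredTotal.Load())
}
```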
Acceptance Criteria
Benefits
- Prevents stuck sessions: Proactive cleanup instead of reactive
- Better UX: Immediate feedback when agent is gone
- Resource efficiency: Faster cleanup cycle
- Graceful recovery: Handle temporary network issues
Related
Timeline
- Milestone: v2.1.0
- Priority: P2 (Important)
- Effort: ~2-3 days
  - 1 day: Database migration + AgentHub handler
  - 1 day: Recovery logic + testing
  - 0.5 day: UI updates + metrics
🤖 Generated with Claude Code