Agent Disconnect Cleanup - Auto-terminate sessions when agent disconnects #237

@JoshuaAFerguson

Overview

Phase 2 of 3 for fixing stuck sessions (Issue #235).

Implement automatic session cleanup when agents disconnect from the Control Plane. This prevents sessions from getting stuck in the first place by proactively handling agent disconnections.

Problem

Currently, when an agent disconnects (crash, network issue, scale-down), its sessions stay in the "running" or "terminating" state until the Session Reconciler force-terminates them after 10 minutes.

This creates:

  • Poor UX: Users see stale "running" sessions that aren't actually running
  • Resource waste: Kubernetes resources may linger until manual cleanup
  • Delayed feedback: 10-minute wait before reconciler kicks in

Solution

Add an agent disconnect handler to the Control Plane's AgentHub:

// onAgentDisconnect runs when an agent's WebSocket connection closes.
// Error handling is omitted in this sketch.
func (h *AgentHub) onAgentDisconnect(agentID string) {
    // 1. Find all sessions assigned to this agent.
    sessions := h.db.GetSessionsByAgent(agentID)

    // 2. Update each session according to its current state.
    for _, session := range sessions {
        switch session.State {
        case "running":
            // Mark as "disconnected" (new state) so it can be recovered later.
            h.db.UpdateSessionState(session.ID, "disconnected")

        case "terminating":
            // Already terminating: leave it to the reconciler,
            // which force-terminates after its threshold.

        case "pending":
            // Mark as "failed" (the session never started successfully).
            h.db.UpdateSessionState(session.ID, "failed")
        }
    }

    // 3. Emit metrics.
    h.metrics.RecordAgentDisconnect(agentID, len(sessions))
}

New Session State: "disconnected"

Add a new session state to indicate the session was running but its agent is gone:

  • State: disconnected
  • Meaning: Session was running, but agent is no longer connected
  • UI Display: Show warning icon, "Agent Disconnected" status
  • User Actions:
    • Delete (clean up)
    • Wait for agent to reconnect (session may auto-recover)
    • Terminate (force cleanup via reconciler)
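One way to keep the new state from leaking into unexpected transitions is an explicit transition table. The sketch below is illustrative only: the SessionState type, the constant names, and CanTransition are hypothetical, and the table lists only the transitions this issue describes (treating the user "Terminate" action as a move to "terminating" is an assumption).

```go
package main

import "fmt"

// SessionState is a hypothetical string type for the states named in this issue.
type SessionState string

const (
	StatePending      SessionState = "pending"
	StateRunning      SessionState = "running"
	StateDisconnected SessionState = "disconnected"
	StateTerminating  SessionState = "terminating"
	StateFailed       SessionState = "failed"
)

// allowedTransitions covers only the transitions discussed in this issue.
// The real state machine has others that are not modelled here.
var allowedTransitions = map[SessionState][]SessionState{
	StatePending:      {StateFailed},       // agent lost before the session started
	StateRunning:      {StateDisconnected}, // agent lost while the session was running
	StateDisconnected: {StateRunning, StateTerminating}, // recovered, or user terminates (assumed)
}

// CanTransition reports whether moving from one state to another is listed above.
func CanTransition(from, to SessionState) bool {
	for _, s := range allowedTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(StateRunning, StateDisconnected)) // true
	fmt.Println(CanTransition(StateDisconnected, StateRunning)) // true
	fmt.Println(CanTransition(StateFailed, StateRunning))       // false
}
```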

Graceful Degradation

If the agent reconnects within a grace period (e.g., 5 minutes):

  • Sessions in "disconnected" state can be recovered
  • Agent re-establishes VNC tunnels
  • Sessions transition back to "running"

This handles temporary network blips without disrupting user sessions.

Implementation Plan

1. Database Changes

  • Add "disconnected" to session state enum
  • Add disconnected_at timestamp column
  • Migration 008

2. AgentHub Changes

  • Add onAgentDisconnect() handler
  • Call handler in UnregisterAgent() or WebSocket close handler
  • Track agent disconnect events
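If both UnregisterAgent() and the WebSocket close path can fire for the same connection, the handler should run only once. The sketch below shows one way to guarantee that; the hub type is a stripped-down stand-in for AgentHub, and its fields and constructor are assumptions, not the existing code.

```go
package main

import (
	"fmt"
	"sync"
)

// hub is a minimal stand-in for AgentHub; the real type has more fields.
type hub struct {
	mu           sync.Mutex
	agents       map[string]struct{}
	onDisconnect func(agentID string)
}

func newHub(onDisconnect func(string)) *hub {
	return &hub{agents: make(map[string]struct{}), onDisconnect: onDisconnect}
}

// RegisterAgent records the agent as connected.
func (h *hub) RegisterAgent(id string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.agents[id] = struct{}{}
}

// UnregisterAgent removes the agent and runs the disconnect handler at most
// once per registration. It returns false if the agent was not registered.
func (h *hub) UnregisterAgent(id string) bool {
	h.mu.Lock()
	_, ok := h.agents[id]
	delete(h.agents, id)
	h.mu.Unlock()
	if !ok {
		return false
	}
	// Call the handler outside the lock so slow database work
	// does not block other agents from registering.
	h.onDisconnect(id)
	return true
}

func main() {
	calls := 0
	h := newHub(func(string) { calls++ })
	h.RegisterAgent("agent-1")
	h.UnregisterAgent("agent-1")
	h.UnregisterAgent("agent-1") // second call is a no-op
	fmt.Println(calls)           // 1
}
```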

3. Session Recovery Logic

  • When agent reconnects, check for "disconnected" sessions
  • Verify session pods still exist
  • Re-establish VNC tunnels
  • Transition back to "running"
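The recovery steps above could be sketched roughly as follows. PodChecker, TunnelManager, Session, and RecoverSessions are hypothetical names introduced for illustration, with in-memory fakes standing in for Kubernetes and the VNC tunnel layer. The issue does not say what happens when a pod is gone or a tunnel cannot be rebuilt, so this sketch simply leaves such sessions in "disconnected" for the reconciler. The grace-period check is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Session is a simplified, hypothetical session record.
type Session struct {
	ID    string
	State string
}

// PodChecker and TunnelManager are hypothetical interfaces, not existing APIs.
type PodChecker interface {
	PodExists(sessionID string) (bool, error)
}

type TunnelManager interface {
	Reestablish(sessionID string) error
}

// RecoverSessions moves "disconnected" sessions back to "running" when their
// pod still exists and the tunnel can be re-established. It returns the
// number of sessions recovered. Sessions that cannot be recovered are left
// untouched (an assumption; the issue does not specify this case).
func RecoverSessions(sessions []*Session, pods PodChecker, tunnels TunnelManager) int {
	recovered := 0
	for _, s := range sessions {
		if s.State != "disconnected" {
			continue
		}
		exists, err := pods.PodExists(s.ID)
		if err != nil || !exists {
			continue
		}
		if err := tunnels.Reestablish(s.ID); err != nil {
			continue
		}
		s.State = "running"
		recovered++
	}
	return recovered
}

// In-memory fakes for the demo.
type fakePods map[string]bool

func (f fakePods) PodExists(id string) (bool, error) { return f[id], nil }

type fakeTunnels struct{ fail map[string]bool }

func (f fakeTunnels) Reestablish(id string) error {
	if f.fail[id] {
		return errors.New("tunnel could not be re-established")
	}
	return nil
}

func main() {
	sessions := []*Session{
		{ID: "s1", State: "disconnected"}, // pod still exists
		{ID: "s2", State: "disconnected"}, // pod is gone
		{ID: "s3", State: "running"},      // not affected
	}
	n := RecoverSessions(sessions, fakePods{"s1": true}, fakeTunnels{})
	fmt.Println(n)                                   // 1
	fmt.Println(sessions[0].State, sessions[1].State) // running disconnected
}
```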

4. UI Changes

  • Display "disconnected" state with warning icon
  • Show "Agent Disconnected" message
  • Add "Retry" button to attempt recovery
  • Add "Terminate" button to force cleanup

5. Metrics

  • sessions_disconnected_total: Sessions marked disconnected
  • sessions_recovered_total: Sessions recovered after reconnect
  • agent_disconnects_total: Agent disconnect events
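As a rough illustration of the three counters, here is a dependency-free sketch using sync/atomic. The Metrics type and its methods are hypothetical; the real implementation would presumably register these counters with whatever metrics library the Control Plane already uses. Note that this version expects the number of sessions actually moved to "disconnected", which may differ from the total number of sessions on the agent.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Metrics holds the three counters proposed in this issue.
// This is a stand-in type, not the project's metrics package.
type Metrics struct {
	sessionsDisconnected atomic.Int64
	sessionsRecovered    atomic.Int64
	agentDisconnects     atomic.Int64
}

// RecordAgentDisconnect counts one disconnect event and the sessions that
// were marked "disconnected" as a result. agentID is unused here but could
// become a label in a real metrics backend.
func (m *Metrics) RecordAgentDisconnect(agentID string, sessionsMarked int) {
	m.agentDisconnects.Add(1)
	m.sessionsDisconnected.Add(int64(sessionsMarked))
}

// RecordRecovery counts sessions that returned to "running" after a reconnect.
func (m *Metrics) RecordRecovery(n int) {
	m.sessionsRecovered.Add(int64(n))
}

// Snapshot returns the current counter values keyed by metric name.
func (m *Metrics) Snapshot() map[string]int64 {
	return map[string]int64{
		"sessions_disconnected_total": m.sessionsDisconnected.Load(),
		"sessions_recovered_total":    m.sessionsRecovered.Load(),
		"agent_disconnects_total":     m.agentDisconnects.Load(),
	}
}

func main() {
	m := &Metrics{}
	m.RecordAgentDisconnect("agent-1", 3)
	m.RecordRecovery(2)
	snap := m.Snapshot()
	fmt.Println(snap["agent_disconnects_total"])     // 1
	fmt.Println(snap["sessions_disconnected_total"]) // 3
	fmt.Println(snap["sessions_recovered_total"])    // 2
}
```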

Acceptance Criteria

  • Agent disconnect triggers session state updates
  • Sessions in "running" → "disconnected" when agent disconnects
  • Sessions in "pending" → "failed" when agent disconnects
  • "disconnected" sessions shown in UI with warning
  • Sessions can recover if agent reconnects within grace period
  • Metrics track disconnect and recovery events
  • Migration 008 adds "disconnected" state and timestamp
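The state-transition criteria above can be exercised without a database. The following is a minimal in-memory sketch: memStore and handleDisconnect are hypothetical stand-ins for the real store and the onAgentDisconnect handler, kept only close enough to check the expected transitions.

```go
package main

import "fmt"

// memStore is a hypothetical in-memory replacement for the session database.
type memStore struct {
	states map[string]string // session ID -> state
	agents map[string]string // session ID -> agent ID
}

// sessionsByAgent returns the IDs of all sessions assigned to agentID.
func (m *memStore) sessionsByAgent(agentID string) []string {
	var ids []string
	for id, a := range m.agents {
		if a == agentID {
			ids = append(ids, id)
		}
	}
	return ids
}

// handleDisconnect applies the transitions from this issue and returns how
// many sessions changed state. "terminating" sessions are left for the
// reconciler.
func handleDisconnect(m *memStore, agentID string) int {
	changed := 0
	for _, id := range m.sessionsByAgent(agentID) {
		switch m.states[id] {
		case "running":
			m.states[id] = "disconnected"
			changed++
		case "pending":
			m.states[id] = "failed"
			changed++
		}
	}
	return changed
}

func main() {
	m := &memStore{
		states: map[string]string{"s1": "running", "s2": "pending", "s3": "terminating", "s4": "running"},
		agents: map[string]string{"s1": "a", "s2": "a", "s3": "a", "s4": "b"},
	}
	fmt.Println(handleDisconnect(m, "a")) // 2
	// prints: disconnected failed terminating running
	fmt.Println(m.states["s1"], m.states["s2"], m.states["s3"], m.states["s4"])
}
```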

Benefits

  • Prevents stuck sessions: Proactive cleanup instead of reactive
  • Better UX: Immediate feedback when agent is gone
  • Resource efficiency: Faster cleanup cycle
  • Graceful recovery: Handle temporary network issues

Timeline

  • Milestone: v2.1.0
  • Priority: P2 (Important)
  • Effort: ~2-3 days
    • 1 day: Database migration + AgentHub handler
    • 1 day: Recovery logic + testing
    • 0.5 day: UI updates + metrics

🤖 Generated with Claude Code
