Overview
Phase 3 of 3 for fixing stuck sessions (Issue #235).
Implement a Kubernetes CronJob that periodically scans for and cleans up orphaned session resources (Deployments, Services, PVCs) that have no corresponding database record.
This is a "belt-and-suspenders" approach to handle edge cases where database and Kubernetes state diverge.
Problem
Even with Phase 1 (Session Reconciler) and Phase 2 (Agent Disconnect Cleanup), edge cases can still leave orphaned Kubernetes resources:
- Database corruption: Session record deleted but K8s resources remain
- Manual deletions: Admin deletes session in DB but not in K8s
- Migration issues: Database rollback leaves resources behind
- Agent bugs: Agent creates resources but fails to update DB
- Catastrophic failures: Control Plane crash during session creation
These orphaned resources:
- Waste cluster capacity (CPU, memory, storage)
- Cost money (PVCs persist indefinitely)
- Clutter namespaces (hundreds of ghost sessions)
- Confuse operators (which resources are real?)
Solution
Add a Helm chart template for a Kubernetes CronJob that:
- Scans all session-labeled resources in the cluster
- Queries database for each session ID
- Deletes resources with no matching database record
- Logs and emits metrics for cleanup actions
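The core of the scan is a set difference: session IDs found in the cluster minus session IDs found in the database. A minimal sketch of that step follows, assuming both sides have already been dumped to files with one ID per line. The function name `find_orphans` and the file layout are illustrative, not part of the existing codebase; how the IDs are collected (label key, table name) is left out because the issue does not specify it.

```shell
# find_orphans CLUSTER_IDS_FILE DB_IDS_FILE
# Prints every session ID that appears in CLUSTER_IDS_FILE (one ID per line)
# but has no exact match in DB_IDS_FILE. (Hypothetical helper for illustration.)
find_orphans() {
  cluster_ids=$1
  db_ids=$2
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
  # grep exits 1 when it selects nothing; treat that as "no orphans", not an error.
  grep -Fxv -f "$db_ids" "$cluster_ids" || [ $? -eq 1 ]
}

# Demo with local files instead of a real cluster and database.
workdir=$(mktemp -d)
printf '%s\n' sess-a sess-b sess-c > "$workdir/cluster.txt"
printf '%s\n' sess-a sess-c > "$workdir/db.txt"
find_orphans "$workdir/cluster.txt" "$workdir/db.txt"   # prints: sess-b
rm -rf "$workdir"
```

One design point worth considering: a failed or empty database query would make every session look orphaned, so a real script should probably abort when the query fails rather than pass an empty file to this step.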
Implementation Plan
1. CronJob Definition
Create chart/templates/gc-cronjob.yaml with:
- Configurable schedule (default: every 6 hours)
- Database connection from secrets
- Grace period configuration
- Dry-run mode support
2. Garbage Collector Script
Create scripts/k8s-session-gc.sh that:
- Lists all session deployments
- Checks each against database
- Deletes orphans older than grace period
- Logs actions and emits metrics
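The grace-period check in the script could look roughly like the sketch below. It compares a resource's creation time against the current time in epoch seconds. The function names are hypothetical, and `to_epoch` assumes GNU `date` (the `-d` option), which may not be available in every container image.

```shell
# is_past_grace CREATED_EPOCH NOW_EPOCH GRACE_HOURS
# Succeeds (exit 0) only when the resource is strictly older than the grace period.
is_past_grace() {
  age_seconds=$(( $2 - $1 ))
  [ "$age_seconds" -gt $(( $3 * 3600 )) ]
}

# to_epoch converts a Kubernetes creationTimestamp (RFC 3339, for example
# 2024-01-01T00:00:00Z) to epoch seconds. Assumes GNU date.
to_epoch() {
  date -u -d "$1" +%s
}

# Demo with a fixed "now" (2024-01-02T00:00:00Z) so the result is repeatable.
now=1704153600
grace_hours=24

# Created exactly 24 hours before "now": still inside the grace period.
if is_past_grace 1704067200 "$now" "$grace_hours"; then echo "delete"; else echo "keep"; fi   # prints: keep

# Created 26 hours before "now": eligible for cleanup.
if is_past_grace 1704060000 "$now" "$grace_hours"; then echo "delete"; else echo "keep"; fi   # prints: delete
```

Passing "now" as an argument instead of calling `date` inside the function keeps the check easy to test.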
3. RBAC Permissions
Create chart/templates/gc-rbac.yaml:
- ServiceAccount for GC job
- Role with minimal permissions (get, list, delete deployments/services/PVCs)
- RoleBinding scoped to StreamSpace namespace
4. Helm Configuration
Add to chart/values.yaml:
gc:
  enabled: true
  schedule: "0 */6 * * *"
  gracePeriod: 24  # hours
  dryRun: false
Safety Features
- Grace Period: Do not delete resources less than 24 hours old (configurable via gc.gracePeriod)
- Dry Run Mode: Test without actual deletions
- Audit Logging: Log all cleanup decisions
- Namespace Scoping: Only clean StreamSpace namespace
- Label Filtering: Only clean session-labeled resources
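Dry-run mode and audit logging can share one small wrapper, so that every destructive command is logged and only executed when dry run is off. This is a sketch under assumptions: the `DRY_RUN` variable and the `run_or_log` name are hypothetical, and how the chart would pass `gc.dryRun` into the container is not defined in this issue.

```shell
# DRY_RUN is assumed to be provided through the environment; default is off.
DRY_RUN=${DRY_RUN:-false}

# run_or_log COMMAND [ARGS...]
# Logs the command, and executes it only when DRY_RUN is not "true".
run_or_log() {
  if [ "$DRY_RUN" = "true" ]; then
    echo "[dry-run] would run: $*"
  else
    echo "[gc] running: $*"
    "$@"
  fi
}

# Demo: with dry run on, nothing is executed and only the intent is logged.
DRY_RUN=true
run_or_log rm -f /tmp/example-orphan   # prints: [dry-run] would run: rm -f /tmp/example-orphan
```

In the GC script, a deletion might then be written as `run_or_log kubectl delete deployment "$name" -n "$namespace"`, where `$name` and `$namespace` are placeholders for values the script would supply.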
Acceptance Criteria
Benefits
- Cost savings: Reclaim wasted resources
- Namespace hygiene: Remove ghost sessions
- Operational confidence: Catches edge cases
- Belt-and-suspenders: Safety net for Phases 1 & 2
Related
Timeline
- Milestone: v2.2.0
- Priority: P3 (Nice to have)
- Effort: ~2 days
🤖 Generated with Claude Code