Kubernetes Garbage Collector - CronJob to clean orphaned session resources #238

@JoshuaAFerguson

Overview

Phase 3 of 3 for fixing stuck sessions (Issue #235).

Implement a Kubernetes CronJob that periodically scans for and cleans up orphaned session resources (Deployments, Services, PVCs) that have no corresponding database record.

This is a "belt-and-suspenders" approach to handle edge cases where database and Kubernetes state diverge.

Problem

Even with Phase 1 (Session Reconciler) and Phase 2 (Agent Disconnect Cleanup), edge cases can still leave orphaned Kubernetes resources:

  1. Database corruption: Session record deleted but K8s resources remain
  2. Manual deletions: Admin deletes session in DB but not in K8s
  3. Migration issues: Database rollback leaves resources behind
  4. Agent bugs: Agent creates resources but fails to update DB
  5. Catastrophic failures: Control Plane crash during session creation

These orphaned resources:

  • Waste cluster capacity (CPU, memory, storage)
  • Cost money (PVCs persist indefinitely)
  • Clutter namespaces (hundreds of ghost sessions)
  • Confuse operators (which resources are real?)

Solution

Add a Helm chart template for a Kubernetes CronJob that:

  1. Scans all session-labeled resources in the StreamSpace namespace
  2. Queries database for each session ID
  3. Deletes resources with no matching database record
  4. Logs and emits metrics for cleanup actions

Implementation Plan

1. CronJob Definition

Create chart/templates/gc-cronjob.yaml with:

  • Configurable schedule (default: every 6 hours)
  • Database connection from secrets
  • Grace period configuration
  • Dry-run mode support
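
A minimal sketch of what chart/templates/gc-cronjob.yaml could look like is shown below. It is not the final template: the resource name suffix `-session-gc`, the values `gc.image` and `gc.databaseSecret`, the secret key `url`, the environment variable names, and the assumption that the image ships the script at `/scripts/k8s-session-gc.sh` are all hypothetical and not defined anywhere in this issue.

```yaml
{{- if .Values.gc.enabled }}
apiVersion: batch/v1
kind: CronJob
metadata:
  name: {{ .Release.Name }}-session-gc          # hypothetical naming scheme
  namespace: {{ .Release.Namespace }}
spec:
  schedule: {{ .Values.gc.schedule | quote }}   # default "0 */6 * * *"
  concurrencyPolicy: Forbid                     # never run two GC passes at once
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          serviceAccountName: {{ .Release.Name }}-session-gc
          restartPolicy: Never
          containers:
            - name: session-gc
              image: {{ .Values.gc.image | quote }}   # assumed value, not in this issue
              command: ["/bin/sh", "/scripts/k8s-session-gc.sh"]
              env:
                - name: GC_GRACE_PERIOD_HOURS         # assumed variable name
                  value: {{ .Values.gc.gracePeriod | quote }}
                - name: GC_DRY_RUN                    # assumed variable name
                  value: {{ .Values.gc.dryRun | quote }}
                - name: DATABASE_URL                  # assumed variable name
                  valueFrom:
                    secretKeyRef:
                      name: {{ .Values.gc.databaseSecret | quote }}  # assumed value
                      key: url                                       # assumed key
{{- end }}
```

Wrapping the whole manifest in the `gc.enabled` check means nothing is rendered when the feature is disabled, and `concurrencyPolicy: Forbid` keeps a slow run from overlapping with the next scheduled one.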

2. Garbage Collector Script

Create scripts/k8s-session-gc.sh that:

  • Lists all session Deployments, Services, and PVCs
  • Checks each against database
  • Deletes orphans older than grace period
  • Logs actions and emits metrics

3. RBAC Permissions

Create chart/templates/gc-rbac.yaml:

  • ServiceAccount for GC job
  • Role with minimal permissions (get, list, delete deployments/services/PVCs)
  • RoleBinding scoped to StreamSpace namespace
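
The three objects above might be sketched in chart/templates/gc-rbac.yaml roughly as follows. The `-session-gc` name suffix is an assumption carried over for illustration. The verbs and resources are the ones listed in this issue, and a namespaced Role and RoleBinding (rather than ClusterRole and ClusterRoleBinding) keep the job confined to the release namespace.

```yaml
{{- if .Values.gc.enabled }}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ .Release.Name }}-session-gc        # hypothetical naming scheme
  namespace: {{ .Release.Namespace }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: {{ .Release.Name }}-session-gc
  namespace: {{ .Release.Namespace }}
rules:
  # Deployments live in the "apps" API group
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "delete"]
  # Services and PVCs live in the core ("") API group
  - apiGroups: [""]
    resources: ["services", "persistentvolumeclaims"]
    verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: {{ .Release.Name }}-session-gc
  namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: {{ .Release.Name }}-session-gc
subjects:
  - kind: ServiceAccount
    name: {{ .Release.Name }}-session-gc
    namespace: {{ .Release.Namespace }}
{{- end }}
```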

4. Helm Configuration

Add to chart/values.yaml:

gc:
  enabled: true
  schedule: "0 */6 * * *"
  gracePeriod: 24  # hours
  dryRun: false

Safety Features

  1. Grace Period: Don't delete resources younger than gc.gracePeriod (default: 24 hours)
  2. Dry Run Mode: Test without actual deletions
  3. Audit Logging: Log all cleanup decisions
  4. Namespace Scoping: Only clean the StreamSpace namespace
  5. Label Filtering: Only clean session-labeled resources
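
One way to exercise these safety features on a first rollout is to enable the job in dry-run mode and review its audit log before allowing real deletions. The override below is only a sketch: the file name is hypothetical, and it would be passed to Helm with the `-f` flag on top of the chart defaults.

```yaml
# values-gc-dryrun.yaml (hypothetical override file)
gc:
  enabled: true
  schedule: "0 */6 * * *"
  gracePeriod: 24  # hours
  dryRun: true     # log cleanup decisions only, delete nothing
```

Once the logged candidates look correct, setting `dryRun` back to `false` turns on actual cleanup.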

Acceptance Criteria

  • CronJob runs on a configurable schedule
  • Detects orphaned Deployments, Services, PVCs
  • Respects grace period before cleanup
  • Dry-run mode available for testing
  • RBAC properly scoped
  • Prometheus metrics emitted
  • Can be disabled via gc.enabled: false
  • Documentation in chart/README.md

Benefits

  • Cost savings: Reclaim wasted resources
  • Namespace hygiene: Remove ghost sessions
  • Operational confidence: Catches edge cases
  • Belt-and-suspenders: Safety net for Phases 1 & 2

Related

Timeline

  • Milestone: v2.2.0
  • Priority: P3 (Nice to have)
  • Effort: ~2 days

🤖 Generated with Claude Code

Labels: enhancement (New feature or request)
