
fix: surface pod scheduling failures in session status conditions #619

Merged
jeremyeder merged 1 commit into ambient-code:main from jeremyeder:feature/surface-pod-scheduling-failures on Feb 13, 2026

Conversation

@jeremyeder
Collaborator

Summary

  • When a pod can't be scheduled (insufficient memory/CPU, node affinity, etc.), the session stays in "Creating" indefinitely with no explanation to the user
  • Adds a surfacePodSchedulingFailure helper that reads pod.Status.Conditions for PodScheduled=False and surfaces the scheduler's message (e.g., "0/1 nodes are available: 1 Insufficient memory") as a PodScheduled condition on the AgenticSession CR
  • Called from both UpdateSessionFromPodStatus (controller-runtime path) and monitorPod (legacy watch path)
  • Phase remains "Creating" (correct — the pod exists, just can't schedule), but users now see why

Test plan

  • go build ./... and go vet ./... pass (verified locally)
  • Deploy to a cluster with constrained resources, create a session that can't schedule, and verify that kubectl get agenticsession <name> -o jsonpath='{.status.conditions}' shows PodScheduled: False with the scheduler message (illustrated below)
  • Verify normal sessions still get PodScheduled: True with the node name once scheduled
  • Verify no behavior change when pod schedules successfully (existing NodeName check takes precedence)
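For reference, a successful repro should show a condition roughly like the one below in the jsonpath output. This is illustrative only: the reason and message come from the scheduler, the exact field set depends on the AgenticSession CRD's condition schema, and timestamps are omitted.

[{"type":"PodScheduled","status":"False","reason":"Unschedulable","message":"0/1 nodes are available: 1 Insufficient memory"}]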

🤖 Generated with Claude Code

When a pod can't be scheduled (e.g., insufficient memory/CPU), the
session was stuck in "Creating" with no explanation. The scheduler's
reason is available in pod.Status.Conditions but the operator never
read it. Now surfaces PodScheduled=False with the scheduler message
(e.g., "0/1 nodes are available: 1 Insufficient memory") so users
can see why their session isn't starting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions bot commented Feb 12, 2026

Claude Code Review

Summary

This PR adds visibility into pod scheduling failures by surfacing Kubernetes scheduler messages to the AgenticSession status conditions. When a pod cannot be scheduled due to resource constraints or affinity rules, users will now see the specific reason (e.g., "0/1 nodes are available: 1 Insufficient memory") instead of an indefinite "Creating" state with no explanation.

The implementation is clean, well-documented, and follows established operator patterns. The change is minimal (22 lines) and strategically placed in both the controller-runtime reconciliation path and the legacy watch-based monitoring path.

Issues by Severity

🚫 Blocker Issues

None - the PR is ready to merge.

🔴 Critical Issues

None identified.

🟡 Major Issues

None identified.

🔵 Minor Issues

1. Missing nil safety check for pod parameter

While the current callers always pass valid pod objects, defensive programming suggests adding a nil check:

func surfacePodSchedulingFailure(pod *corev1.Pod, statusPatch *StatusPatch) {
    if pod == nil {
        return
    }
    for _, cond := range pod.Status.Conditions {
        // ...
    }
}

Rationale: Prevents potential panics if future code paths call this with a nil pod. Following the error handling pattern: "handle edge cases gracefully, don't assume perfect inputs."

Impact: Low - current code is safe, but this improves robustness.


2. No unit test coverage for the new function

The new surfacePodSchedulingFailure function lacks unit tests. While the integration test plan is good, unit tests would help verify:

  • Behavior when PodScheduled=False condition exists
  • Behavior when no scheduling failure condition exists
  • Behavior when multiple conditions exist
  • Edge cases (empty conditions slice, nil pod)

Example test structure:

func TestSurfacePodSchedulingFailure(t *testing.T) {
    tests := []struct {
        name           string
        pod            *corev1.Pod
        expectedStatus string
        expectedReason string
    }{
        {
            name: "pod with scheduling failure",
            pod: &corev1.Pod{
                Status: corev1.PodStatus{
                    Conditions: []corev1.PodCondition{
                        {
                            Type:    corev1.PodScheduled,
                            Status:  corev1.ConditionFalse,
                            Reason:  "Unschedulable",
                            Message: "0/1 nodes are available: 1 Insufficient memory",
                        },
                    },
                },
            },
            expectedStatus: "False",
            expectedReason: "Unschedulable",
        },
        // ... more test cases
    }
    // ...
}
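
A run loop to go with the table could follow the standard shape below. This is a sketch only: it assumes StatusPatch exposes the recorded updates through an exported Conditions slice of conditionUpdate values (a hypothetical accessor; substitute whatever the operator's StatusPatch actually provides), and it treats an empty expectedStatus as "no condition should be recorded".

for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        // Start from an empty patch; the function under test should record at most one condition.
        patch := &StatusPatch{}
        surfacePodSchedulingFailure(tt.pod, patch)

        if len(patch.Conditions) == 0 { // Conditions is the hypothetical accessor noted above
            if tt.expectedStatus != "" {
                t.Fatalf("expected a condition with status %q, got none", tt.expectedStatus)
            }
            return
        }
        got := patch.Conditions[0]
        if got.Status != tt.expectedStatus || got.Reason != tt.expectedReason {
            t.Errorf("got status=%q reason=%q, want status=%q reason=%q",
                got.Status, got.Reason, tt.expectedStatus, tt.expectedReason)
        }
    })
}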

Impact: Medium - reduces regression risk and documents expected behavior.


3. Behavior when no scheduling condition exists is correct but undocumented

If a pod has no scheduling conditions at all (rare but possible during pod initialization), the function silently returns without adding any condition. This is correct behavior, but the code doesn't explicitly document this:

func surfacePodSchedulingFailure(pod *corev1.Pod, statusPatch *StatusPatch) {
    // Early return if no conditions exist - this is normal for newly created pods
    if len(pod.Status.Conditions) == 0 {
        return
    }
    
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
            statusPatch.AddCondition(conditionUpdate{
                Type:    conditionPodScheduled,
                Status:  "False",
                Reason:  string(cond.Reason),
                Message: cond.Message,
            })
            return
        }
    }
    // If we reach here, pod has conditions but PodScheduled=False wasn't found.
    // This is normal - the pod may not have been scheduled yet but also hasn't failed.
}

Impact: Very low - current code works correctly, this is purely for clarity.

Positive Highlights

1. Excellent code placement

  • Added to both UpdateSessionFromPodStatus (controller-runtime path) and monitorPod (legacy watch path)
  • Ensures consistent behavior across both execution paths
  • Demonstrates understanding of the dual-path architecture

2. Follows established patterns

  • Uses StatusPatch.AddCondition correctly (components/operator/internal/handlers/reconciler.go:358)
  • Follows the same structure as existing condition updates (e.g., conditionPodScheduled with True status)
  • Consistent with error handling patterns: specific logging, graceful handling

3. Clear and helpful documentation

  • Function comment explains the purpose and includes a real-world example message
  • PR description thoroughly documents the problem, solution, and test plan
  • Code is self-documenting with meaningful variable names

4. Minimal, surgical change

  • Only 22 lines added, 0 lines deleted
  • No unnecessary refactoring or scope creep
  • Follows the principle: "Only make changes that are directly requested or clearly necessary"

5. Proper type conversion

  • Handles the reason field with an explicit conversion: Reason: string(cond.Reason) (cond.Reason is already a plain string, so the conversion is a harmless no-op)
  • Type-safe field access without unchecked type assertions
  • Adheres to "Type-Safe Unstructured Access" pattern from CLAUDE.md

6. No token/secret handling

  • No security concerns - only surfacing non-sensitive scheduler messages
  • No RBAC implications - reads pod conditions the operator already has access to
  • No risk of exposing sensitive data in logs or status

Recommendations

Priority 1 (Nice-to-have for this PR):

  1. Add nil check for pod parameter in surfacePodSchedulingFailure
  2. Add unit tests for the new function

Priority 2 (Future work):

  1. Consider adding metrics/logging when scheduling failures are detected (e.g., log.Printf("Pod %s scheduling failed: %s", pod.Name, cond.Message))
  2. Add similar condition surfacing for other pod conditions (e.g., ContainersReady, Initialized) to provide even more visibility; a possible generalization is sketched below
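
One possible shape for that follow-up, shown purely as a sketch: it reuses the StatusPatch, conditionUpdate, and AddCondition names from the snippets above and assumes conditionUpdate's Type and Status fields are plain strings; the real types in the operator may differ.

// surfacePodCondition mirrors an arbitrary pod condition onto the AgenticSession status.
// condType names the pod condition to copy (e.g., corev1.ContainersReady).
func surfacePodCondition(pod *corev1.Pod, condType corev1.PodConditionType, statusPatch *StatusPatch) {
    if pod == nil {
        return
    }
    for _, cond := range pod.Status.Conditions {
        if cond.Type != condType {
            continue
        }
        statusPatch.AddCondition(conditionUpdate{
            Type:    string(cond.Type), // assumes Type is a plain string; adjust to the operator's constant type
            Status:  string(cond.Status),
            Reason:  cond.Reason,
            Message: cond.Message,
        })
        return
    }
}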

Approval Recommendation: APPROVE

This PR delivers exactly what it promises: better UX by surfacing pod scheduling failures to users. The implementation is solid, follows project patterns, and introduces no regressions. The minor issues noted are enhancements rather than blockers.

The code quality is excellent and adheres to the repository's strict standards:

  • ✅ No panic in production code
  • ✅ Type-safe access patterns
  • ✅ Proper error handling (early returns)
  • ✅ Clear documentation
  • ✅ Follows operator reconciliation patterns

Reviewed against:

  • .claude/context/backend-development.md - Operator patterns ✅
  • .claude/patterns/error-handling.md - Error patterns ✅
  • .claude/patterns/k8s-client-usage.md - K8s client usage ✅
  • CLAUDE.md sections: Backend and Operator Development Standards ✅


What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

jeremyeder merged commit 0851aba into ambient-code:main on Feb 13, 2026
18 checks passed