
fix: surface pod scheduling failures in session status conditions #619

Merged
jeremyeder merged 1 commit into ambient-code:main from jeremyeder:feature/surface-pod-scheduling-failures on Feb 13, 2026

Conversation

@jeremyeder
Collaborator

Summary

  • When a pod can't be scheduled (insufficient memory/CPU, node affinity, etc.), the session stays in "Creating" indefinitely with no explanation to the user
  • Adds a surfacePodSchedulingFailure helper that reads pod.Status.Conditions for PodScheduled=False and surfaces the scheduler's message (e.g., "0/1 nodes are available: 1 Insufficient memory") as a PodScheduled condition on the AgenticSession CR
  • Called from both UpdateSessionFromPodStatus (controller-runtime path) and monitorPod (legacy watch path)
  • Phase remains "Creating" (correct — the pod exists, just can't schedule), but users now see why

Test plan

  • go build ./... and go vet ./... pass (verified locally)
  • Deploy to a cluster with constrained resources, create a session that can't schedule, and verify that kubectl get agenticsession <name> -o jsonpath='{.status.conditions}' shows PodScheduled: False with the scheduler message (illustrated below)
  • Verify normal sessions still get PodScheduled: True with the node name once scheduled
  • Verify no behavior change when pod schedules successfully (existing NodeName check takes precedence)
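For reference, a successful repro should show a condition roughly like the one below in the jsonpath output. This is illustrative only: the reason and message come from the scheduler, the exact field set depends on the AgenticSession CRD's condition schema, and timestamps are omitted.

[{"type":"PodScheduled","status":"False","reason":"Unschedulable","message":"0/1 nodes are available: 1 Insufficient memory"}]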

🤖 Generated with Claude Code

When a pod can't be scheduled (e.g., insufficient memory/CPU), the
session was stuck in "Creating" with no explanation. The scheduler's
reason is available in pod.Status.Conditions but the operator never
read it. Now surfaces PodScheduled=False with the scheduler message
(e.g., "0/1 nodes are available: 1 Insufficient memory") so users
can see why their session isn't starting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions bot commented Feb 12, 2026

Claude Code Review

Summary

This PR adds visibility into pod scheduling failures by surfacing Kubernetes scheduler messages to the AgenticSession status conditions. When a pod cannot be scheduled due to resource constraints or affinity rules, users will now see the specific reason (e.g., "0/1 nodes are available: 1 Insufficient memory") instead of an indefinite "Creating" state with no explanation.

The implementation is clean, well-documented, and follows established operator patterns. The change is minimal (22 lines) and strategically placed in both the controller-runtime reconciliation path and the legacy watch-based monitoring path.

Issues by Severity

🚫 Blocker Issues

None - the PR is ready to merge.

🔴 Critical Issues

None identified.

🟡 Major Issues

None identified.

🔵 Minor Issues

1. Missing nil safety check for pod parameter

While the current callers always pass valid pod objects, defensive programming suggests adding a nil check:

func surfacePodSchedulingFailure(pod *corev1.Pod, statusPatch *StatusPatch) {
    if pod == nil {
        return
    }
    for _, cond := range pod.Status.Conditions {
        // ...
    }
}

Rationale: Prevents potential panics if future code paths call this with a nil pod. Following the error handling pattern: "handle edge cases gracefully, don't assume perfect inputs."

Impact: Low - current code is safe, but this improves robustness.


2. No unit test coverage for the new function

The new surfacePodSchedulingFailure function lacks unit tests. While the integration test plan is good, unit tests would help verify:

  • Behavior when PodScheduled=False condition exists
  • Behavior when no scheduling failure condition exists
  • Behavior when multiple conditions exist
  • Edge cases (empty conditions slice, nil pod)

Example test structure:

func TestSurfacePodSchedulingFailure(t *testing.T) {
    tests := []struct {
        name           string
        pod            *corev1.Pod
        expectedStatus string
        expectedReason string
    }{
        {
            name: "pod with scheduling failure",
            pod: &corev1.Pod{
                Status: corev1.PodStatus{
                    Conditions: []corev1.PodCondition{
                        {
                            Type:    corev1.PodScheduled,
                            Status:  corev1.ConditionFalse,
                            Reason:  "Unschedulable",
                            Message: "0/1 nodes are available: 1 Insufficient memory",
                        },
                    },
                },
            },
            expectedStatus: "False",
            expectedReason: "Unschedulable",
        },
        // ... more test cases
    }
    // ...
}
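
A run loop to go with the table could follow the standard shape below. This is a sketch only: it assumes StatusPatch exposes the recorded updates through an exported Conditions slice of conditionUpdate values (a hypothetical accessor; substitute whatever the operator's StatusPatch actually provides), and it treats an empty expectedStatus as "no condition should be recorded".

for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        // Start from an empty patch; the function under test should record at most one condition.
        patch := &StatusPatch{}
        surfacePodSchedulingFailure(tt.pod, patch)

        if len(patch.Conditions) == 0 { // Conditions is the hypothetical accessor noted above
            if tt.expectedStatus != "" {
                t.Fatalf("expected a condition with status %q, got none", tt.expectedStatus)
            }
            return
        }
        got := patch.Conditions[0]
        if got.Status != tt.expectedStatus || got.Reason != tt.expectedReason {
            t.Errorf("got status=%q reason=%q, want status=%q reason=%q",
                got.Status, got.Reason, tt.expectedStatus, tt.expectedReason)
        }
    })
}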

Impact: Medium - reduces regression risk and documents expected behavior.


3. Behavior when no scheduling condition exists is correct but undocumented

If a pod has no scheduling conditions at all (rare but possible during pod initialization), the function silently returns without adding any condition. This is correct behavior, but the code doesn't explicitly document this:

func surfacePodSchedulingFailure(pod *corev1.Pod, statusPatch *StatusPatch) {
    // Early return if no conditions exist - this is normal for newly created pods
    if len(pod.Status.Conditions) == 0 {
        return
    }
    
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
            statusPatch.AddCondition(conditionUpdate{
                Type:    conditionPodScheduled,
                Status:  "False",
                Reason:  string(cond.Reason),
                Message: cond.Message,
            })
            return
        }
    }
    // If we reach here, pod has conditions but PodScheduled=False wasn't found.
    // This is normal - the pod may not have been scheduled yet but also hasn't failed.
}

Impact: Very low - current code works correctly, this is purely for clarity.

Positive Highlights

1. Excellent code placement

  • Added to both UpdateSessionFromPodStatus (controller-runtime path) and monitorPod (legacy watch path)
  • Ensures consistent behavior across both execution paths
  • Demonstrates understanding of the dual-path architecture

2. Follows established patterns

  • Uses StatusPatch.AddCondition correctly (components/operator/internal/handlers/reconciler.go:358)
  • Follows the same structure as existing condition updates (e.g., conditionPodScheduled with True status)
  • Consistent with error handling patterns: specific logging, graceful handling

3. Clear and helpful documentation

  • Function comment explains the purpose and includes a real-world example message
  • PR description thoroughly documents the problem, solution, and test plan
  • Code is self-documenting with meaningful variable names

4. Minimal, surgical change

  • Only 22 lines added, 0 lines deleted
  • No unnecessary refactoring or scope creep
  • Follows the principle: "Only make changes that are directly requested or clearly necessary"

5. Proper type conversion

  • Handles the reason field with an explicit conversion: Reason: string(cond.Reason) (cond.Reason is already a plain string, so the conversion is a harmless no-op)
  • Type-safe field access without unchecked type assertions
  • Adheres to "Type-Safe Unstructured Access" pattern from CLAUDE.md

6. No token/secret handling

  • No security concerns - only surfacing non-sensitive scheduler messages
  • No RBAC implications - reads pod conditions the operator already has access to
  • No risk of exposing sensitive data in logs or status

Recommendations

Priority 1 (Nice-to-have for this PR):

  1. Add nil check for pod parameter in surfacePodSchedulingFailure
  2. Add unit tests for the new function

Priority 2 (Future work):

  1. Consider adding metrics/logging when scheduling failures are detected (e.g., log.Printf("Pod %s scheduling failed: %s", pod.Name, cond.Message))
  2. Add similar condition surfacing for other pod conditions (e.g., ContainersReady, Initialized) to provide even more visibility; a possible generalization is sketched below
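
One possible shape for that follow-up, shown purely as a sketch: it reuses the StatusPatch, conditionUpdate, and AddCondition names from the snippets above and assumes conditionUpdate's Type and Status fields are plain strings; the real types in the operator may differ.

// surfacePodCondition mirrors an arbitrary pod condition onto the AgenticSession status.
// condType names the pod condition to copy (e.g., corev1.ContainersReady).
func surfacePodCondition(pod *corev1.Pod, condType corev1.PodConditionType, statusPatch *StatusPatch) {
    if pod == nil {
        return
    }
    for _, cond := range pod.Status.Conditions {
        if cond.Type != condType {
            continue
        }
        statusPatch.AddCondition(conditionUpdate{
            Type:    string(cond.Type), // assumes Type is a plain string; adjust to the operator's constant type
            Status:  string(cond.Status),
            Reason:  cond.Reason,
            Message: cond.Message,
        })
        return
    }
}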

Approval Recommendation: APPROVE

This PR delivers exactly what it promises: better UX by surfacing pod scheduling failures to users. The implementation is solid, follows project patterns, and introduces no regressions. The minor issues noted are enhancements rather than blockers.

The code quality is excellent and adheres to the repository's strict standards:

  • ✅ No panic in production code
  • ✅ Type-safe access patterns
  • ✅ Proper error handling (early returns)
  • ✅ Clear documentation
  • ✅ Follows operator reconciliation patterns

Reviewed against:

  • .claude/context/backend-development.md - Operator patterns ✅
  • .claude/patterns/error-handling.md - Error patterns ✅
  • .claude/patterns/k8s-client-usage.md - K8s client usage ✅
  • CLAUDE.md sections: Backend and Operator Development Standards ✅


What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

jeremyeder merged commit 0851aba into ambient-code:main on Feb 13, 2026
18 checks passed