fix: surface pod scheduling failures in session status conditions (#619)
When a pod can't be scheduled (e.g., insufficient memory or CPU), the session was stuck in "Creating" with no explanation. The scheduler's reason is available in `pod.Status.Conditions`, but the operator never read it. The operator now surfaces `PodScheduled=False` with the scheduler's message (e.g., "0/1 nodes are available: 1 Insufficient memory") so users can see why their session isn't starting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
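For context, this is roughly what the scheduler records on the pod itself when scheduling fails; the operator now copies it onto the session. (A sketch of the standard Kubernetes pod status shape; the values are illustrative, not taken from this repository.)

```yaml
# Pod status written by kube-scheduler on a scheduling failure.
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: "False"
      reason: Unschedulable
      message: "0/1 nodes are available: 1 Insufficient memory."
```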
## Claude Code Review

### Summary

This PR adds visibility into pod scheduling failures by surfacing Kubernetes scheduler messages to the AgenticSession status conditions. When a pod cannot be scheduled due to resource constraints or affinity rules, users will now see the specific reason (e.g., "0/1 nodes are available: 1 Insufficient memory") instead of an indefinite "Creating" state with no explanation. The implementation is clean, well-documented, and follows established operator patterns. The change is minimal (22 lines) and strategically placed in both the controller-runtime reconciliation path and the legacy watch-based monitoring path.

### Issues by Severity

🚫 **Blocker Issues**: None - the PR is ready to merge.

🔴 **Critical Issues**: None identified.

🟡 **Major Issues**: None identified.

🔵 **Minor Issues**

**1. Missing nil safety check for `pod` parameter**

While the current callers always pass valid pod objects, defensive programming suggests adding a nil check:

```go
func surfacePodSchedulingFailure(pod *corev1.Pod, statusPatch *StatusPatch) {
    if pod == nil {
        return
    }
    for _, cond := range pod.Status.Conditions {
        // ...
    }
}
```

Rationale: prevents potential panics if future code paths call this with a nil pod, following the error-handling pattern "handle edge cases gracefully, don't assume perfect inputs."

Impact: Low - current code is safe, but this improves robustness.

**2. No unit test coverage for the new function**

The new `surfacePodSchedulingFailure` helper has no accompanying unit tests.
Example test structure:

```go
func TestSurfacePodSchedulingFailure(t *testing.T) {
    tests := []struct {
        name           string
        pod            *corev1.Pod
        expectedStatus string
        expectedReason string
    }{
        {
            name: "pod with scheduling failure",
            pod: &corev1.Pod{
                Status: corev1.PodStatus{
                    Conditions: []corev1.PodCondition{
                        {
                            Type:    corev1.PodScheduled,
                            Status:  corev1.ConditionFalse,
                            Reason:  "Unschedulable",
                            Message: "0/1 nodes are available: 1 Insufficient memory",
                        },
                    },
                },
            },
            expectedStatus: "False",
            expectedReason: "Unschedulable",
        },
        // ... more test cases
    }
    // ...
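    // Sketch of one possible runner for the table above (not from the PR).
    // It assumes a hypothetical StatusPatch value that records added
    // conditions in an exported Conditions slice; adapt the assertions to
    // the real StatusPatch type.
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            patch := &StatusPatch{}
            surfacePodSchedulingFailure(tt.pod, patch)
            if len(patch.Conditions) == 0 {
                t.Fatal("expected a PodScheduled condition to be recorded")
            }
            got := patch.Conditions[0]
            if got.Status != tt.expectedStatus || got.Reason != tt.expectedReason {
                t.Errorf("got (%s, %s), want (%s, %s)",
                    got.Status, got.Reason, tt.expectedStatus, tt.expectedReason)
            }
        })
    }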
}
```

Impact: Medium - reduces regression risk and documents expected behavior.

**3. Undocumented behavior when no scheduling conditions exist**

If a pod has no scheduling conditions at all (rare, but possible during pod initialization), the function silently returns without adding any condition. This is correct behavior, but the code doesn't explicitly document it:

```go
func surfacePodSchedulingFailure(pod *corev1.Pod, statusPatch *StatusPatch) {
    // Early return if no conditions exist - this is normal for newly created pods.
    if len(pod.Status.Conditions) == 0 {
        return
    }
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
            statusPatch.AddCondition(conditionUpdate{
                Type:    conditionPodScheduled,
                Status:  "False",
                Reason:  string(cond.Reason),
                Message: cond.Message,
            })
            return
        }
    }
    // If we reach here, the pod has conditions but PodScheduled=False wasn't found.
    // This is normal - the pod may not have been scheduled yet but also hasn't failed.
}
```

Impact: Very low - current code works correctly; this is purely for clarity.

### Positive Highlights

1. Excellent code placement ✅
2. Follows established patterns ✅
3. Clear and helpful documentation ✅
4. Minimal, surgical change ✅
5. Proper type conversion ✅
6. No token/secret handling ✅
### Recommendations

Priority 1 (Nice-to-have for this PR):
Priority 2 (Future work):
### Approval Recommendation: ✅ APPROVE

This PR delivers exactly what it promises: better UX by surfacing pod scheduling failures to users. The implementation is solid, follows project patterns, and introduces no regressions. The minor issues noted are enhancements rather than blockers.

The code quality is excellent and adheres to the repository's strict standards:
Reviewed against:
### What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360), not just generic code review guidelines.
## Summary
- Add a `surfacePodSchedulingFailure` helper that reads `pod.Status.Conditions` for `PodScheduled=False` and surfaces the scheduler's message (e.g., "0/1 nodes are available: 1 Insufficient memory") as a `PodScheduled` condition on the AgenticSession CR (see the sketch below)
- Hook the helper into both `UpdateSessionFromPodStatus` (controller-runtime path) and `monitorPod` (legacy watch path)
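For illustration, the surfaced condition might look like this on the session resource (a sketch only: the shape follows standard Kubernetes condition conventions, and the exact CRD status schema is an assumption, not taken from this PR):

```yaml
# Hypothetical AgenticSession status after a scheduling failure.
status:
  conditions:
    - type: PodScheduled
      status: "False"
      reason: Unschedulable
      message: "0/1 nodes are available: 1 Insufficient memory"
```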
## Test plan

- `go build ./...` and `go vet ./...` pass (verified locally)
- `kubectl get agenticsession <name> -o jsonpath='{.status.conditions}'` shows `PodScheduled: False` with the scheduler message
- Shows `PodScheduled: True` with the node name once scheduled (the `NodeName` check takes precedence)

🤖 Generated with Claude Code