OCPBUGS-81716: Add retry logic for transient network errors in restartKubeletOnNode #30966
BhargaviGudi wants to merge 1 commit into openshift:main
Conversation
@BhargaviGudi: This pull request references Jira Issue OCPBUGS-81716, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: BhargaviGudi. The full list of commands accepted by this bot can be found here.
/jira refresh
@BhargaviGudi: This pull request references Jira Issue OCPBUGS-81716, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-disruptive-longrunning
@BhargaviGudi: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/343f3690-324b-11f1-94ea-e53d6b882dcb-0
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/node/node_utils.go`:
- Around line 200-215: The retry loop in restartKubeletOnNode is blocking and
sleeps even after the final attempt and isn't cancelable; change the signature
to accept a context.Context (restartKubeletOnNode(ctx context.Context, oc
*exutil.CLI, nodeName string)), remove the unconditional time.Sleep on the final
attempt by only sleeping when attempt < maxAttempts, and replace time.Sleep with
a context-aware wait (select between time.After(backoff) and ctx.Done()) so the
retry/backoff can be canceled; keep using ExecOnNodeWithChroot and
isTransientNetworkError to detect retryable errors and propagate the lastErr
when context is done or retries exhausted.
📒 Files selected for processing (1)
test/extended/node/node_utils.go
 func restartKubeletOnNode(oc *exutil.CLI, nodeName string) error {
-	_, err := ExecOnNodeWithChroot(oc, nodeName, "systemctl", "restart", "kubelet")
-	return err
+	var lastErr error
+	for attempt := 1; attempt <= 3; attempt++ {
+		_, err := ExecOnNodeWithChroot(oc, nodeName, "systemctl", "restart", "kubelet")
+		if err == nil {
+			return nil
+		}
+		lastErr = err
+		if isTransientNetworkError(err) {
+			framework.Logf("Attempt %d/3 to restart kubelet on %s failed: %v, retrying...", attempt, nodeName, err)
+			time.Sleep(time.Duration(attempt*5) * time.Second)
+			continue
+		}
+		return err
+	}
+	return fmt.Errorf("failed to restart kubelet on %s after 3 attempts: %w", nodeName, lastErr)
 }
Retry loop adds a no-op 15s delay on final failure and cannot be canceled.
Line [210] sleeps even on the last attempt (Line [202]), so failure path pays an extra 15s with no next retry. Also, the backoff is not context-aware, which can overrun cleanup windows (see test/extended/node/node_swap_cnv.go:242-262 where restart runs in defer with test ctx).
Proposed fix
-func restartKubeletOnNode(oc *exutil.CLI, nodeName string) error {
+func restartKubeletOnNode(ctx context.Context, oc *exutil.CLI, nodeName string) error {
+ const maxRetries = 3
var lastErr error
- for attempt := 1; attempt <= 3; attempt++ {
+ for attempt := 0; attempt <= maxRetries; attempt++ {
_, err := ExecOnNodeWithChroot(oc, nodeName, "systemctl", "restart", "kubelet")
if err == nil {
return nil
}
lastErr = err
- if isTransientNetworkError(err) {
- framework.Logf("Attempt %d/3 to restart kubelet on %s failed: %v, retrying...", attempt, nodeName, err)
- time.Sleep(time.Duration(attempt*5) * time.Second)
- continue
+ if !isTransientNetworkError(err) {
+ return fmt.Errorf("failed to restart kubelet on %s: %w", nodeName, err)
}
- return err
+ if attempt == maxRetries {
+ break
+ }
+ backoff := time.Duration((attempt+1)*5) * time.Second
+ framework.Logf("Attempt %d/%d to restart kubelet on %s failed: %v; retrying in %s",
+ attempt+1, maxRetries+1, nodeName, err, backoff)
+ timer := time.NewTimer(backoff)
+ select {
+ case <-ctx.Done():
+ timer.Stop()
+ return fmt.Errorf("context canceled while restarting kubelet on %s: %w", nodeName, ctx.Err())
+ case <-timer.C:
+ }
}
- return fmt.Errorf("failed to restart kubelet on %s after 3 attempts: %w", nodeName, lastErr)
+ return fmt.Errorf("failed to restart kubelet on %s after %d attempts: %w", nodeName, maxRetries+1, lastErr)
 }

// call sites (examples):
err = restartKubeletOnNode(ctx, oc, cnvWorkerNode)
restartKubeletOnNode(ctx, oc, nodeName)
Scheduling required tests:
All test cases in the openshift/disruptive-longrunning suite are passing successfully.
/verified by CI
@BhargaviGudi: This PR has been marked as verified.
/override ci/prow/e2e-aws-ovn-fips
@sosiouxme: Overrode contexts on behalf of sosiouxme: ci/prow/e2e-aws-ovn-fips, ci/prow/e2e-aws-ovn-serial-1of2, ci/prow/e2e-gcp-ovn
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@BhargaviGudi: all tests passed! Full PR test history. Your PR dashboard.
/unhold
@cpmeadors and @ngopalak-redhat Could you please help review this PR? Thanks!
Add retry logic for transient network errors in restartKubeletOnNode
Bug: https://redhat.atlassian.net/browse/OCPBUGS-81716
Problem
TC4 CNV swap tests were failing intermittently with "connection refused" errors when running oc debug commands to restart kubelet on nodes.
Solution
Added a retry mechanism to restartKubeletOnNode that:
Network errors handled: