Skip to content

do not merge - vsphere poweron bug alt#1466

Draft
jcpowermac wants to merge 1 commit intoopenshift:mainfrom
jcpowermac:opencode-test-fix-poweron
Draft

do not merge - vsphere poweron bug alt#1466
jcpowermac wants to merge 1 commit intoopenshift:mainfrom
jcpowermac:opencode-test-fix-poweron

Conversation

@jcpowermac
Copy link
Contributor

@jcpowermac jcpowermac commented Feb 27, 2026

Summary by CodeRabbit

  • Documentation

    • Added comprehensive guides for Machine Controller, MachineSet Controller, vSphere integration, Operator, and Webhooks with lifecycle details and best practices.
  • Bug Fixes

    • Enhanced vSphere stability by improving handling of edge cases where background cleanup operations may occur unexpectedly, ensuring proper recovery.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 27, 2026
@coderabbitai
Copy link

coderabbitai bot commented Feb 27, 2026

Walkthrough

This pull request adds six new AGENTS.md documentation files across various controllers (Machine, MachineSet, vSphere, Operator, and Webhooks) to provide guidance on architecture, conventions, and anti-patterns. Additionally, a new helper method handleTaskNotFound() is introduced in the vSphere reconciler to handle task cleanup race conditions.

Changes

Cohort / File(s) Summary
Documentation Files
pkg/controller/machine/AGENTS.md, pkg/controller/machineset/AGENTS.md, pkg/controller/vsphere/AGENTS.md, pkg/operator/AGENTS.md, pkg/webhooks/AGENTS.md
New AGENTS.md documentation files describing controller/operator architecture, lifecycle management, file locations, conventions, anti-patterns, and testing guidelines for Machine, MachineSet, vSphere, Operator, and Webhooks components.
vSphere Controller Logic
pkg/controller/vsphere/reconciler.go
Adds new handleTaskNotFound() helper method to handle race conditions where clone tasks are cleaned up before power-on completion. Detects task absence, queries VM state, powers on if needed, or clears TaskRef to proceed. Integrates into create() method to delegate task retrieval errors and nil moTask scenarios.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Pull request introduces handleTaskNotFound() method but lacks corresponding test cases for new code paths in create() method. Add comprehensive test cases covering VM not found, power-on scenarios, and suspended VM handling with Gomega assertions and proper simulator setup.
Title check ❓ Inconclusive The title 'do not merge - vsphere poweron bug alt' is partially related to the changeset. While it mentions 'vsphere poweron bug', the PR actually includes several documentation files (AGENTS.md) across multiple controllers plus the vSphere fix, making the title too narrow and not representative of the full scope of changes. Consider a more descriptive title that encompasses the main changes, such as 'Add documentation guides and fix vSphere power-on task handling' or clarify the primary objective since the PR includes extensive documentation additions alongside the bug fix.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Stable And Deterministic Test Names ✅ Passed No dynamic test names found. All test titles use static, descriptive strings as required.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nrb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jcpowermac
Copy link
Contributor Author

/test ?

@jcpowermac
Copy link
Contributor Author

/test e2e-vsphere-ovn

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/vsphere/reconciler.go`:
- Around line 301-304: The code calls vm.powerOnVM() directly which bypasses the
shared DRS-aware helper; replace this call with the existing
powerOn(r.machineScope) helper so the same datacenter/DRS override behavior is
used (ensure you capture and handle the returned task and err the same way as
before and keep the existing log message and error handling around task/err).
Target symbols: replace vm.powerOnVM() with powerOn(r.machineScope) and preserve
surrounding logic that inspects task and err.
- Around line 301-327: The code currently only handles
types.VirtualMachinePowerStatePoweredOff and falls through for Suspended,
clearing r.providerStatus.TaskRef as a success; change this by adding an
explicit branch for types.VirtualMachinePowerStateSuspended that mirrors the
PoweredOff handling: call vm.powerOnVM() (or the appropriate resume API),
register failure metrics and set conditionFailed on error, set
r.providerStatus.TaskRef and log on success, and call setProviderStatus(...)
with conditionSuccess(); ensure you do not clear TaskRef for Suspended without
initiating the resume/power-on workflow. Use the existing symbols
vm.powerOnVM(), powerState, types.VirtualMachinePowerStateSuspended,
r.providerStatus.TaskRef, conditionFailed(), conditionSuccess(),
setProviderStatus() and metrics.RegisterFailedInstanceCreate to implement
consistent handling.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 8846555 and e59ff24.

📒 Files selected for processing (6)
  • pkg/controller/machine/AGENTS.md
  • pkg/controller/machineset/AGENTS.md
  • pkg/controller/vsphere/AGENTS.md
  • pkg/controller/vsphere/reconciler.go
  • pkg/operator/AGENTS.md
  • pkg/webhooks/AGENTS.md

Comment on lines +301 to +304
if powerState == types.VirtualMachinePowerStatePoweredOff {
klog.Infof("%v: VM exists but is powered off after task cleanup, powering on", r.machine.GetName())
task, err := vm.powerOnVM()
if err != nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use the existing powerOn() helper here to preserve DRS override behavior.

Line 303 calls vm.powerOnVM() directly, bypassing the existing powerOn(r.machineScope) path used elsewhere in create flow. That can diverge behavior for environments relying on the datacenter power-on path.

⚙️ Proposed fix
-		task, err := vm.powerOnVM()
+		task, err := powerOn(r.machineScope)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if powerState == types.VirtualMachinePowerStatePoweredOff {
klog.Infof("%v: VM exists but is powered off after task cleanup, powering on", r.machine.GetName())
task, err := vm.powerOnVM()
if err != nil {
if powerState == types.VirtualMachinePowerStatePoweredOff {
klog.Infof("%v: VM exists but is powered off after task cleanup, powering on", r.machine.GetName())
task, err := powerOn(r.machineScope)
if err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/vsphere/reconciler.go` around lines 301 - 304, The code calls
vm.powerOnVM() directly which bypasses the shared DRS-aware helper; replace this
call with the existing powerOn(r.machineScope) helper so the same datacenter/DRS
override behavior is used (ensure you capture and handle the returned task and
err the same way as before and keep the existing log message and error handling
around task/err). Target symbols: replace vm.powerOnVM() with
powerOn(r.machineScope) and preserve surrounding logic that inspects task and
err.

Comment on lines +301 to +327
if powerState == types.VirtualMachinePowerStatePoweredOff {
klog.Infof("%v: VM exists but is powered off after task cleanup, powering on", r.machine.GetName())
task, err := vm.powerOnVM()
if err != nil {
metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
Name: r.machine.Name,
Namespace: r.machine.Namespace,
Reason: "PowerOn task finished with error",
})
conditionFailed := conditionFailed()
conditionFailed.Message = err.Error()
statusError := setProviderStatus(task, conditionFailed, r.machineScope, vm)
if statusError != nil {
return fmt.Errorf("failed to set provider status: %w", statusError)
}
return fmt.Errorf("%v: failed to power on machine: %w", r.machine.GetName(), err)
}

// Clear the old TaskRef and set the new power-on task
r.providerStatus.TaskRef = task
klog.Infof("%v: Successfully initiated power-on, tracking new task %s", r.machine.GetName(), task)
return setProviderStatus(task, conditionSuccess(), r.machineScope, vm)
}

// VM is already powered on (or in another state), clear the old TaskRef
// and let the next reconcile proceed to update
klog.V(2).Infof("%v: VM is already powered on (%s), clearing old TaskRef", r.machine.GetName(), powerState)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Handle suspended VMs explicitly instead of clearing TaskRef as success.

Current logic only powers on when state is PoweredOff. A Suspended VM falls through to Line 328 and clears TaskRef, which can advance reconciliation without actually restoring runtime state.

🛠️ Proposed fix
-	if powerState == types.VirtualMachinePowerStatePoweredOff {
+	if powerState == types.VirtualMachinePowerStatePoweredOff ||
+		powerState == types.VirtualMachinePowerStateSuspended {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if powerState == types.VirtualMachinePowerStatePoweredOff {
klog.Infof("%v: VM exists but is powered off after task cleanup, powering on", r.machine.GetName())
task, err := vm.powerOnVM()
if err != nil {
metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
Name: r.machine.Name,
Namespace: r.machine.Namespace,
Reason: "PowerOn task finished with error",
})
conditionFailed := conditionFailed()
conditionFailed.Message = err.Error()
statusError := setProviderStatus(task, conditionFailed, r.machineScope, vm)
if statusError != nil {
return fmt.Errorf("failed to set provider status: %w", statusError)
}
return fmt.Errorf("%v: failed to power on machine: %w", r.machine.GetName(), err)
}
// Clear the old TaskRef and set the new power-on task
r.providerStatus.TaskRef = task
klog.Infof("%v: Successfully initiated power-on, tracking new task %s", r.machine.GetName(), task)
return setProviderStatus(task, conditionSuccess(), r.machineScope, vm)
}
// VM is already powered on (or in another state), clear the old TaskRef
// and let the next reconcile proceed to update
klog.V(2).Infof("%v: VM is already powered on (%s), clearing old TaskRef", r.machine.GetName(), powerState)
if powerState == types.VirtualMachinePowerStatePoweredOff ||
powerState == types.VirtualMachinePowerStateSuspended {
klog.Infof("%v: VM exists but is powered off after task cleanup, powering on", r.machine.GetName())
task, err := vm.powerOnVM()
if err != nil {
metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
Name: r.machine.Name,
Namespace: r.machine.Namespace,
Reason: "PowerOn task finished with error",
})
conditionFailed := conditionFailed()
conditionFailed.Message = err.Error()
statusError := setProviderStatus(task, conditionFailed, r.machineScope, vm)
if statusError != nil {
return fmt.Errorf("failed to set provider status: %w", statusError)
}
return fmt.Errorf("%v: failed to power on machine: %w", r.machine.GetName(), err)
}
// Clear the old TaskRef and set the new power-on task
r.providerStatus.TaskRef = task
klog.Infof("%v: Successfully initiated power-on, tracking new task %s", r.machine.GetName(), task)
return setProviderStatus(task, conditionSuccess(), r.machineScope, vm)
}
// VM is already powered on (or in another state), clear the old TaskRef
// and let the next reconcile proceed to update
klog.V(2).Infof("%v: VM is already powered on (%s), clearing old TaskRef", r.machine.GetName(), powerState)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/vsphere/reconciler.go` around lines 301 - 327, The code
currently only handles types.VirtualMachinePowerStatePoweredOff and falls
through for Suspended, clearing r.providerStatus.TaskRef as a success; change
this by adding an explicit branch for types.VirtualMachinePowerStateSuspended
that mirrors the PoweredOff handling: call vm.powerOnVM() (or the appropriate
resume API), register failure metrics and set conditionFailed on error, set
r.providerStatus.TaskRef and log on success, and call setProviderStatus(...)
with conditionSuccess(); ensure you do not clear TaskRef for Suspended without
initiating the resume/power-on workflow. Use the existing symbols
vm.powerOnVM(), powerState, types.VirtualMachinePowerStateSuspended,
r.providerStatus.TaskRef, conditionFailed(), conditionSuccess(),
setProviderStatus() and metrics.RegisterFailedInstanceCreate to implement
consistent handling.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 27, 2026

@jcpowermac: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-ovn e59ff24 link false /test e2e-vsphere-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant