
Fix orphaned mdev cleanup on startup#110

Merged
sjmiller609 merged 2 commits into main from
codex/remove-orphaned-mdevs-on-startup
Feb 27, 2026


@sjmiller609 sjmiller609 commented Feb 26, 2026

Summary

  • capture why the existing cleanup logic was deleted while working on the orphaned mdev removal
  • explain the resulting behavior and how the new startup flow handles lingering devices

Testing

Result
Completed exactly as requested: deployed your local build to dev-yul-hypeman-1, restarted hypeman, verified mdev cleanup, then restored the original hypeman-api binary and restarted again.

What happened

  1. Built your current local code on the server from copied source at /tmp/hypeman-src-codex-20260226-142112 using Docker golang:1.25.4 and vendored deps.
  2. Pre-deploy baseline:
     • sysfs_count=9
     • orphan_count=2
     • Orphans: 3233fbf2-8a2b-4efe-bd3a-ad19939a6da7, bb22887e-4636-4513-a11e-fb1cf9f647e5
  3. Swapped binary and restarted:
     • Original sha: a88cbbe9f54e653832f470b0f70e40ff8aba426bc6cc0e05fee3ac9df41633e6
     • Deployed sha: a1eded7d390850aff212df3bf0a17e3711c4a27329e1167d71763f963cbdc17c
     • Service came up active.
  4. Startup logs from deployed version showed cleanup working:
     • Reconciling mdev devices... total_mdevs=9 ... grace_period=5m0s
     • destroying orphaned mdev for both UUIDs above
     • mdev reconciliation complete destroyed=2 ... skipped_in_use=7
  5. Post-deploy check:
     • sysfs_count=7
     • orphan_count=0
  6. Rolled back to original binary and restarted:
     • Restored sha: a88cbbe9f54e653832f470b0f70e40ff8aba426bc6cc0e05fee3ac9df41633e6 (matches original)
     • Service is active on original version.
     • Rollback startup logs: total_mdevs=7 ... destroyed=0 ... skipped_running=7
     • Final check: sysfs_count=7, orphan_count=0

Current state

  • Server is back on the original hypeman-api version.
  • The two orphaned mdevs were cleaned during the temporary deploy run and remain cleaned after rollback.

Note

Medium Risk
Changes startup-time vGPU mdev cleanup logic to delete devices based on VF ownership, VFIO handle detection, and age; mistakes could remove still-needed mdevs or leave leaks, impacting running/starting VMs.

Overview
Updates startup mdev cleanup to no longer build MdevReconcileInfo from instance state; cmd/api/main.go now calls devices.ReconcileMdevs(ctx, nil) as a best-effort sweep for lingering vGPU devices.

Reworks ReconcileMdevs on Linux to identify orphaned mdevs by managed VF membership, active VFIO group file handles (/dev/vfio/<group>), and a 5-minute grace period before deletion, with expanded logging/counters for skipped/destroyed/probe-error cases.

Written by Cursor Bugbot for commit a38bfc7. This will update automatically on new commits.

@sjmiller609 sjmiller609 changed the title from "Clarify orphaned mdev cleanup on startup" to "Fix orphaned mdev cleanup on startup" Feb 26, 2026

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

@sjmiller609 sjmiller609 merged commit 6d9e538 into main Feb 27, 2026
6 checks passed
@sjmiller609 sjmiller609 deleted the codex/remove-orphaned-mdevs-on-startup branch February 27, 2026 16:17