nvme_driver: Free all resources when keepalive is off #3086
alandau wants to merge 3 commits into microsoft:main
Conversation
When NVMe keepalive was off, the driver didn't free all resources on shutdown. For example, the IoIssuers kept within Namespaces were still referencing DMA allocations, so these DMA regions were saved (if MANA keepalive was on). However, after servicing with keepalive off, the NVMe manager didn't restore these allocations, which failed validation ("unrestored allocations found").
This PR makes three changes:
1. Namespace now references IoIssuers via a Weak rather than an Arc, which doesn't prevent resources from being freed when the NVMe manager shuts down.
2. NVMe manager is shut down before saving when KA is off to free all resources (including DMA allocations) before save. It's still (partially) shut down after save when KA is on, as before.
3. Added a test that verifies servicing succeeds for all 4 combinations of (NVMe, MANA) keepalive.
Pull request overview
This PR fixes an NVMe servicing teardown/resource-lifetime issue when NVMe keepalive is disabled, ensuring DMA allocations don’t remain referenced and inadvertently get persisted via other keepalive paths (e.g., MANA).
Changes:
- Switch Namespace→IoIssuers references from Arc to Weak so namespaces don't keep IO/DMA-related resources alive past NVMe manager shutdown.
- Adjust Underhill servicing flow to shut down the NVMe manager before save when NVMe keepalive is disabled (freeing resources before any DMA-manager state is captured).
- Add a servicing test intended to cover all 4 combinations of (NVMe, MANA) keepalive.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vmm_tests/vmm_tests/tests/tests/multiarch/openhcl_servicing.rs | Adds a new servicing test for NVMe/MANA keepalive combinations. |
| vm/devices/storage/disk_nvme/nvme_driver/src/namespace.rs | Replaces Arc<IoIssuers> with Weak<IoIssuers> and upgrades on demand when issuing IO. |
| vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs | Updates IoIssuers::get to return Arc<Issuer> (clone) instead of &Issuer to support the new ownership model. |
| openhcl/underhill_core/src/dispatch/mod.rs | Shuts down NVMe manager before save when NVMe KA is off; retains post-save partial shutdown behavior when NVMe KA is on. |
```rust
// If keepalive is disabled, reset all user-mode NVMe devices before
// save to free all resources, including DMA allocations.
if !nvme_keepalive_enabled {
    shutdown_nvme_manager(
        &mut self.nvme_manager,
        nvme_keepalive_enabled,
        correlation_id,
    )
    .await;
}
```
shutdown_nvme_manager(&mut self.nvme_manager, ...) is invoked before self.save(...) when NVMe keepalive is disabled, and it take()s (drops) the NvmeManager. If self.save(...) later returns an error, the error path only calls resume_drivers() (which currently doesn’t recreate/restart nvme_manager), so the VM may resume running without NVMe devices available. Consider keeping the manager available on save failure (e.g., avoid take() until after a successful save, or add a non-consuming “reset/free resources for save” path, or explicitly reinitialize nvme_manager in the failure rollback).
@mattkur: Does this comment make sense? Can save fail (with the guest expected to continue running as if nothing happened)? If yes, I don't like any of the presented options since the NVMe manager's state is torn down by the time save failed, and recreating it might be a PITA.
Hypothetically, yes. But I don't know if we currently have an idempotent save. @jstarks: interested in your architectural perspective. (1) do we want save to be idempotent, and (2) do you think we're there?
Perhaps I'm understanding the word "idempotent" differently... In my understanding idempotent means a second run will not do anything differently from the first run (i.e. the result of run 1+2 is the same as the result of run 1). I think this is orthogonal to the issue at hand, which is: can save fail and are we expected to recover in this case? (unless I'm misunderstanding something)
You're right, and I blame weekend brain. I mean: do we expect save -> resume to work in openhcl (without a restore and reload)