Skip to content

mana_driver: fix shutdown/save panic with outstanding vport references#3148

Draft
erfrimod wants to merge 2 commits intomicrosoft:mainfrom
erfrimod:erfrimod/netvsp-shutdown-crash-fix
Draft

mana_driver: fix shutdown/save panic with outstanding vport references#3148
erfrimod wants to merge 2 commits intomicrosoft:mainfrom
erfrimod:erfrimod/netvsp-shutdown-crash-fix

Conversation

@erfrimod
Copy link
Copy Markdown
Contributor

Both manadevice::shutdown and manadevice::save call Arc::into_inner(self.inner).unrwap() which will panic if any vport still holds a Arc reference. For example: a race where shutdown runs while vport actions like retarget_interrupts are in-flight.

  • Removing the exclusive Arc ownership requirement.
  • Locks the driver and use new take_device() method to extract the device backing directly.
  • Driver is cleaned up when the last Arc reference is dropped. There is still a chance for it to be immediate, but it may have to wait for vport operations to conclude.
  • Added tests to verify the new behavior.

Copilot AI review requested due to automatic review settings March 27, 2026 23:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a panic in the MANA driver shutdown/save paths when there are outstanding Arc<Inner> references (e.g., in-flight vport operations), by removing the exclusive Arc::into_inner(..).unwrap() requirement and making stale vport operations fail gracefully.

Changes:

  • Rework ManaDevice::{save,shutdown} to lock the GDMA driver and extract the backing device via a new take_device() API rather than requiring exclusive Arc ownership.
  • Update vport DMA-client accessors/callers to return Result and propagate errors when the backing device has already been taken.
  • Add unit tests covering take_device() semantics and shutdown/save behavior with a stale vport.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
vm/devices/net/net_mana/src/lib.rs Propagates new fallible vport.dma_client() behavior via ? when allocating bounce buffers.
vm/devices/net/mana_driver/src/tests.rs Adds tests for take_device() and verifies shutdown/save don’t panic with stale vports.
vm/devices/net/mana_driver/src/mana.rs Implements shutdown/save without Arc::into_inner; makes Vport::dma_client() fallible and avoids panics when device is taken.
vm/devices/net/mana_driver/src/gdma_driver.rs Introduces try_device()/take_device(), removes into_device(), and converts some device-access panics into errors.

Comment on lines +181 to +189
// Perform save and extract the device through the Mutex. Avoid exclusive
// Arc ownership, which can fail if a Vport reference exists.
let mut gdma = self.inner.gdma.lock().await;
let save_result = gdma.save().await;
let device = gdma
.take_device()
.expect("device should not have been taken yet");
drop(gdma);
drop(self.inner);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering if exclusive ownership was to be avoided, but I think by this point in the code we should expect no other Vports to still exist. If one does, a hard failure is warranted -- otherwise code will be trying to access aspects of the MANA device after it's already been torn down.

Is drop() really required on gdma in this case?

Comment on lines +317 to +327

// Perform deregister and extract the device through the Mutex. Avoid exclusive
// Arc ownership, which can fail if a Vport reference exists.
let dev_id = self.inner.dev_id;
let mut gdma = self.inner.gdma.lock().await;
let result = gdma.deregister_device(dev_id).await;
let device = gdma
.take_device()
.expect("device should not have been taken yet");
drop(gdma);
drop(self.inner);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern as above.

Comment on lines +803 to +805
/// After this call, operations that require the device (e.g. DMA
/// allocation, interrupt retargeting) will fail with an error instead
/// of panicking.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that this allows for long-term buildup of deadwood resources. The painful part of unwrap(), expect(), and the like is the crash; the benefit is that we're incentivized to ensure all allocated resources are cleaned up in their logical order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants