mana_driver: fix shutdown/save panic with outstanding vport references#3148
mana_driver: fix shutdown/save panic with outstanding vport references#3148erfrimod wants to merge 2 commits intomicrosoft:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes a panic in the MANA driver shutdown/save paths when there are outstanding Arc<Inner> references (e.g., in-flight vport operations), by removing the exclusive Arc::into_inner(..).unwrap() requirement and making stale vport operations fail gracefully.
Changes:
- Rework
ManaDevice::{save,shutdown}to lock the GDMA driver and extract the backing device via a newtake_device()API rather than requiring exclusiveArcownership. - Update vport DMA-client accessors/callers to return
Resultand propagate errors when the backing device has already been taken. - Add unit tests covering
take_device()semantics and shutdown/save behavior with a stale vport.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| vm/devices/net/net_mana/src/lib.rs | Propagates new fallible vport.dma_client() behavior via ? when allocating bounce buffers. |
| vm/devices/net/mana_driver/src/tests.rs | Adds tests for take_device() and verifies shutdown/save don’t panic with stale vports. |
| vm/devices/net/mana_driver/src/mana.rs | Implements shutdown/save without Arc::into_inner; makes Vport::dma_client() fallible and avoids panics when device is taken. |
| vm/devices/net/mana_driver/src/gdma_driver.rs | Introduces try_device()/take_device(), removes into_device(), and converts some device-access panics into errors. |
| // Perform save and extract the device through the Mutex. Avoid exclusive | ||
| // Arc ownership, which can fail if a Vport reference exists. | ||
| let mut gdma = self.inner.gdma.lock().await; | ||
| let save_result = gdma.save().await; | ||
| let device = gdma | ||
| .take_device() | ||
| .expect("device should not have been taken yet"); | ||
| drop(gdma); | ||
| drop(self.inner); |
There was a problem hiding this comment.
I was considering if exclusive ownership was to be avoided, but I think by this point in the code we should expect no other Vports to still exist. If one does, a hard failure is warranted -- otherwise code will be trying to access aspects of the MANA device after it's already been torn down.
Is drop() really required on gdma in this case?
|
|
||
| // Perform deregister and extract the device through the Mutex. Avoid exclusive | ||
| // Arc ownership, which can fail if a Vport reference exists. | ||
| let dev_id = self.inner.dev_id; | ||
| let mut gdma = self.inner.gdma.lock().await; | ||
| let result = gdma.deregister_device(dev_id).await; | ||
| let device = gdma | ||
| .take_device() | ||
| .expect("device should not have been taken yet"); | ||
| drop(gdma); | ||
| drop(self.inner); |
| /// After this call, operations that require the device (e.g. DMA | ||
| /// allocation, interrupt retargeting) will fail with an error instead | ||
| /// of panicking. |
There was a problem hiding this comment.
I'm concerned that this allows for long-term buildup of deadwood resources. The painful part of unwrap(), expect(), and the like is the crash; the benefit is that we're incentivized to ensure all allocated resources are cleaned up in their logical order.
Both manadevice::shutdown and manadevice::save call Arc::into_inner(self.inner).unrwap() which will panic if any vport still holds a Arc reference. For example: a race where shutdown runs while vport actions like retarget_interrupts are in-flight.