fix(net): break tx loop on non-recoverable errors to prevent busy-loop by areporeporepo · Pull Request #606 · containers/libkrun

areporeporepo · 2026-03-25T18:14:23Z

Summary

Fixes #602.

When process_tx() returns a persistent error other than NothingWritten (e.g. Internal or ProcessNotRunning), the loop in process_tx_loop() continues spinning if the guest keeps supplying TX buffers, because tx_has_deferred_frame is false and has_new_entries is true:

process_tx() → Err(Internal) → tx_has_deferred_frame = false
enable_notification() → has_new_entries = true

Fix

Break immediately on non-recoverable errors. The worker will be re-entered on the next:

TX queue kick (guest adds new descriptors)
Backend socket writable event (edge-triggered epoll OUT)

This is safe because Internal and ProcessNotRunning errors are non-transient — retrying immediately won't resolve them, but the next epoll cycle gives the system a chance to recover or the guest driver to reset.

Change

 self.tx_has_deferred_frame = match self.process_tx() {
     Err(TxError::Backend(WriteError::NothingWritten)) => true,
     Err(e) => {
-        false
+        break;
     }
     _ => false,
 };

Testing

cargo clippy --features net -- -D warnings passes
cargo fmt -- --check passes
CI unit tests (Linux-only, relying on PR checks)

When process_tx() returns a persistent error other than NothingWritten (e.g. Internal or ProcessNotRunning), the loop in process_tx_loop() continues spinning if the guest keeps supplying TX buffers, because tx_has_deferred_frame is false and has_new_entries is true. Break immediately on non-recoverable errors instead of returning false. The worker will be re-entered on the next TX queue kick or backend socket writable event from epoll. Fixes: containers#602 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: anh nguyen <anh.nqqq@icloud.com> Signed-off-by: anh nguyen <29374105+areporeporepo@users.noreply.github.com>

slp · 2026-03-27T11:53:34Z

/gemini review

slp · 2026-03-27T11:53:59Z

@mtjhrc PTAL

gemini-code-assist

Code Review

This pull request modifies the virtio net worker to break the TX processing loop upon encountering non-recoverable errors. This change is intended to prevent potential busy-looping scenarios where a guest continues to provide TX buffers despite backend failures. I have no feedback to provide as there were no review comments to evaluate.

mtjhrc

LGTM, but note that this fixes only the immediate issue in the inner loop (like described in the PR text).

nirs · 2026-04-05T17:57:33Z

src/devices/src/virtio/net/worker.rs

+                    // busy-loop when the guest keeps supplying TX buffers.
+                    // The worker will be re-entered on the next TX queue kick
+                    // or backend-writable event.
+                    break;


I don't think this is right:

What would be the value of self.tx_has_deferred_frame after the break? before this change the value is false, which means we don't need to set up a timer.

Even if the value is false, this is not right since the code is not clear.

If we break here, we don't call tx_q.queue.enable_notification(&self.mem) which may not be correct. The current code always call tx_q.queue.enable_notification on each iteration.

There is no busy loop - I think the issue is wrong. The purpose of the worker is to write frames from the guest. If we we cannot write frames we should drop the fame, log an error and continue. In the worst case we will have unrecoverable error that will drop all frames and write lot of logs. There is nothing wrong about this, and it will make the issue easy to debug.

The issue was suggested by Gemini, and I opened it for discussion. I suggest to finish the discussion on the issue before making changes.

Address review feedback: the v1 break skipped enable_notification(), which could cause the guest to stop sending TX queue kicks (especially with EVENT_IDX where enable_notification writes avail_event). Also explicitly reset tx_has_deferred_frame to false for clarity. Fixes: containers#602 Signed-off-by: anh nguyen <29374105+areporeporepo@users.noreply.github.com>

areporeporepo · 2026-04-05T19:16:30Z

hi @mtjhrc, new commit just re-enables the notification before breaking tx loop per @nirs feedback. Thanks! lmk which vers u think would work better

areporeporepo requested review from MatiasVara, dorindabassey, jakecorrenti, mtjhrc, slp and tylerfanelli as code owners March 25, 2026 18:14

Copilot AI review requested due to automatic review settings March 25, 2026 18:14

Copilot started reviewing on behalf of areporeporepo March 25, 2026 18:14 View session

gemini-code-assist bot reviewed Mar 27, 2026

View reviewed changes

mtjhrc previously approved these changes Mar 31, 2026

View reviewed changes

nirs suggested changes Apr 5, 2026

View reviewed changes

areporeporepo dismissed mtjhrc’s stale review via 8e9d85d April 5, 2026 19:07

areporeporepo mentioned this pull request Apr 5, 2026

Possible busy-loop on process_tx errors #602

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(net): break tx loop on non-recoverable errors to prevent busy-loop#606

fix(net): break tx loop on non-recoverable errors to prevent busy-loop#606
areporeporepo wants to merge 2 commits intocontainers:mainfrom
areporeporepo:fix/tx-busy-loop-on-persistent-errors

areporeporepo commented Mar 25, 2026

Uh oh!

slp commented Mar 27, 2026

Uh oh!

slp commented Mar 27, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

mtjhrc left a comment •

edited

Loading

Uh oh!

nirs Apr 5, 2026

Uh oh!

areporeporepo commented Apr 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

areporeporepo commented Mar 25, 2026

Summary

Fix

Change

Testing

Uh oh!

slp commented Mar 27, 2026

Uh oh!

slp commented Mar 27, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

mtjhrc left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nirs Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

areporeporepo commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mtjhrc left a comment •

edited

Loading

areporeporepo commented Apr 5, 2026 •

edited

Loading