Skip to content

fix(net): break tx loop on non-recoverable errors to prevent busy-loop#606

Open
areporeporepo wants to merge 2 commits intocontainers:mainfrom
areporeporepo:fix/tx-busy-loop-on-persistent-errors
Open

fix(net): break tx loop on non-recoverable errors to prevent busy-loop#606
areporeporepo wants to merge 2 commits intocontainers:mainfrom
areporeporepo:fix/tx-busy-loop-on-persistent-errors

Conversation

@areporeporepo
Copy link
Copy Markdown

Summary

Fixes #602.

When process_tx() returns a persistent error other than NothingWritten (e.g. Internal or ProcessNotRunning), the loop in process_tx_loop() continues spinning if the guest keeps supplying TX buffers, because tx_has_deferred_frame is false and has_new_entries is true:

process_tx() → Err(Internal) → tx_has_deferred_frame = false
enable_notification() → has_new_entries = true

Fix

Break immediately on non-recoverable errors. The worker will be re-entered on the next:

  • TX queue kick (guest adds new descriptors)
  • Backend socket writable event (edge-triggered epoll OUT)

This is safe because Internal and ProcessNotRunning errors are non-transient — retrying immediately won't resolve them, but the next epoll cycle gives the system a chance to recover or the guest driver to reset.

Change

 self.tx_has_deferred_frame = match self.process_tx() {
     Err(TxError::Backend(WriteError::NothingWritten)) => true,
     Err(e) => {
-        false
+        break;
     }
     _ => false,
 };

Testing

  • cargo clippy --features net -- -D warnings passes
  • cargo fmt -- --check passes
  • CI unit tests (Linux-only, relying on PR checks)

When process_tx() returns a persistent error other than NothingWritten
(e.g. Internal or ProcessNotRunning), the loop in process_tx_loop()
continues spinning if the guest keeps supplying TX buffers, because
tx_has_deferred_frame is false and has_new_entries is true.

Break immediately on non-recoverable errors instead of returning false.
The worker will be re-entered on the next TX queue kick or backend
socket writable event from epoll.

Fixes: containers#602

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: anh nguyen <anh.nqqq@icloud.com>
Signed-off-by: anh nguyen <29374105+areporeporepo@users.noreply.github.com>
@slp
Copy link
Copy Markdown
Collaborator

slp commented Mar 27, 2026

/gemini review

@slp
Copy link
Copy Markdown
Collaborator

slp commented Mar 27, 2026

@mtjhrc PTAL

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the virtio net worker to break the TX processing loop upon encountering non-recoverable errors. This change is intended to prevent potential busy-looping scenarios where a guest continues to provide TX buffers despite backend failures. I have no feedback to provide as there were no review comments to evaluate.

mtjhrc
mtjhrc previously approved these changes Mar 31, 2026
Copy link
Copy Markdown
Collaborator

@mtjhrc mtjhrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but note that this fixes only the immediate issue in the inner loop (like described in the PR text).

// busy-loop when the guest keeps supplying TX buffers.
// The worker will be re-entered on the next TX queue kick
// or backend-writable event.
break;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right:

  • What would be the value of self.tx_has_deferred_frame after the break? before this change the value is false, which means we don't need to set up a timer.
  • Even if the value is false, this is not right since the code is not clear.
  • If we break here, we don't call tx_q.queue.enable_notification(&self.mem) which may not be correct. The current code always call tx_q.queue.enable_notification on each iteration.
  • There is no busy loop - I think the issue is wrong. The purpose of the worker is to write frames from the guest. If we we cannot write frames we should drop the fame, log an error and continue. In the worst case we will have unrecoverable error that will drop all frames and write lot of logs. There is nothing wrong about this, and it will make the issue easy to debug.

The issue was suggested by Gemini, and I opened it for discussion. I suggest to finish the discussion on the issue before making changes.

Address review feedback: the v1 break skipped enable_notification(),
which could cause the guest to stop sending TX queue kicks (especially
with EVENT_IDX where enable_notification writes avail_event). Also
explicitly reset tx_has_deferred_frame to false for clarity.

Fixes: containers#602

Signed-off-by: anh nguyen <29374105+areporeporepo@users.noreply.github.com>
@areporeporepo
Copy link
Copy Markdown
Author

areporeporepo commented Apr 5, 2026

hi @mtjhrc, new commit just re-enables the notification before breaking tx loop per @nirs feedback. Thanks! lmk which vers u think would work better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Possible busy-loop on process_tx errors

4 participants