Skip to content

fix(box_impl): offload blocking handler.stop() and metrics() to spawn_blocking#415

Open
lilongen wants to merge 1 commit intoboxlite-ai:mainfrom
lilongen:fix/spawn-blocking-handler
Open

fix(box_impl): offload blocking handler.stop() and metrics() to spawn_blocking#415
lilongen wants to merge 1 commit intoboxlite-ai:mainfrom
lilongen:fix/spawn-blocking-handler

Conversation

@lilongen
Copy link
Copy Markdown

@lilongen lilongen commented Apr 1, 2026

Summary

  • ShimHandler::stop() uses a std::thread::sleep(50ms) polling loop (up to 2 seconds).
    Called directly from async BoxImpl::stop(), this blocks a Tokio worker thread.
    Similarly, handler.metrics() performs sysinfo syscalls (~5ms), also a blocking operation.
  • Wrap LiveState.handler from std::sync::Mutex<...> to Arc<std::sync::Mutex<...>>
    so it can be cloned and moved into spawn_blocking closures.
  • Both handler.stop() and handler.metrics() now run on Tokio's dedicated blocking
    thread pool via tokio::task::spawn_blocking, keeping worker threads free.
  • Lock poisoning uses asymmetric handling: stop() swallows it (shutdown must not be
    blocked by a poisoned lock), metrics() propagates it (monitoring should surface
    anomalies).

Files changed

File Change
boxlite/src/litebox/box_impl.rs Arc-wrap handler field; spawn_blocking for stop() and metrics()

Test plan

  • cargo test -p boxlite --no-default-features --lib — 593 passed (main branch)
  • cargo clippy -p boxlite --no-default-features --lib -- -D warnings — clean
  • cargo fmt -- --check — clean

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 1, 2026 02:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates boxlite’s BoxImpl lifecycle/observability paths to avoid blocking Tokio worker threads by running VM handler shutdown and metrics collection on Tokio’s blocking thread pool.

Changes:

  • Wrap LiveState.handler in Arc<std::sync::Mutex<...>> so it can be moved into spawn_blocking closures.
  • Move handler.metrics() into tokio::task::spawn_blocking to avoid blocking async workers during syscalls.
  • Move handler.stop() into tokio::task::spawn_blocking to avoid blocking async workers during stop’s polling/sleep loop.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

h.metrics()
})
.await
.map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??;
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The join error message from spawn_blocking is too generic; when this triggers it will be hard to tell whether the failure happened during metrics collection vs some other blocking task. Consider including operation context (e.g., metrics spawn_blocking join) and/or the box_id in the error message to make debugging easier.

Suggested change
.map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??;
.map_err(|e| {
BoxliteError::Internal(format!(
"metrics spawn_blocking join (box_id={}): {}",
self.config.id, e
))
})??;

Copilot uses AI. Check for mistakes.
}
})
.await
.map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??;
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: the spawn_blocking join error message here doesn't indicate which operation failed. Adding context (e.g., stop spawn_blocking join) and/or the box_id would make operational debugging significantly easier.

Suggested change
.map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??;
.map_err(|e| {
BoxliteError::Internal(format!(
"stop spawn_blocking join (box_id={}): {}",
self.config.id, e
))
})??;

Copilot uses AI. Check for mistakes.
Comment on lines +365 to +368
if let Ok(mut h) = handler.lock() {
h.stop()
} else {
Ok(())
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the handler mutex is poisoned, this path currently returns Ok(()) without attempting to stop the handler. That can leave the VM/shim running and leak resources. If the intent is to not fail shutdown on poisoning, consider recovering the guard via PoisonError::into_inner() and still calling stop(), while continuing to swallow the poisoning error itself.

Suggested change
if let Ok(mut h) = handler.lock() {
h.stop()
} else {
Ok(())
match handler.lock() {
Ok(mut h) => h.stop(),
Err(poisoned) => {
tracing::warn!("Handler mutex poisoned during shutdown; recovering guard to stop handler");
let mut h = poisoned.into_inner();
h.stop()
}

Copilot uses AI. Check for mistakes.
…_blocking

ShimHandler::stop() uses a std::thread::sleep(50ms) polling loop (up to 2s).
Called directly from async BoxImpl::stop(), this blocks a Tokio worker thread.
Similarly, handler.metrics() performs sysinfo syscalls (~5ms blocking I/O).

Changes:
- Wrap LiveState.handler in Arc so it can be cloned into spawn_blocking closures
- handler.stop() and handler.metrics() now run on Tokio's blocking thread pool
- Lock poisoning uses asymmetric handling: stop() swallows (shutdown must not
  be blocked), metrics() propagates (monitoring should surface anomalies)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lilongen lilongen force-pushed the fix/spawn-blocking-handler branch from 8aaef1f to 8d043d1 Compare April 2, 2026 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants