fix(box_impl): offload blocking handler.stop() and metrics() to spawn_blocking#415
fix(box_impl): offload blocking handler.stop() and metrics() to spawn_blocking#415lilongen wants to merge 1 commit intoboxlite-ai:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates boxlite’s BoxImpl lifecycle/observability paths to avoid blocking Tokio worker threads by running VM handler shutdown and metrics collection on Tokio’s blocking thread pool.
Changes:
- Wrap
LiveState.handlerinArc<std::sync::Mutex<...>>so it can be moved intospawn_blockingclosures. - Move
handler.metrics()intotokio::task::spawn_blockingto avoid blocking async workers during syscalls. - Move
handler.stop()intotokio::task::spawn_blockingto avoid blocking async workers during stop’s polling/sleep loop.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| h.metrics() | ||
| }) | ||
| .await | ||
| .map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??; |
There was a problem hiding this comment.
The join error message from spawn_blocking is too generic; when this triggers it will be hard to tell whether the failure happened during metrics collection vs some other blocking task. Consider including operation context (e.g., metrics spawn_blocking join) and/or the box_id in the error message to make debugging easier.
| .map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??; | |
| .map_err(|e| { | |
| BoxliteError::Internal(format!( | |
| "metrics spawn_blocking join (box_id={}): {}", | |
| self.config.id, e | |
| )) | |
| })??; |
| } | ||
| }) | ||
| .await | ||
| .map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??; |
There was a problem hiding this comment.
Same as above: the spawn_blocking join error message here doesn't indicate which operation failed. Adding context (e.g., stop spawn_blocking join) and/or the box_id would make operational debugging significantly easier.
| .map_err(|e| BoxliteError::Internal(format!("spawn_blocking join: {}", e)))??; | |
| .map_err(|e| { | |
| BoxliteError::Internal(format!( | |
| "stop spawn_blocking join (box_id={}): {}", | |
| self.config.id, e | |
| )) | |
| })??; |
| if let Ok(mut h) = handler.lock() { | ||
| h.stop() | ||
| } else { | ||
| Ok(()) |
There was a problem hiding this comment.
When the handler mutex is poisoned, this path currently returns Ok(()) without attempting to stop the handler. That can leave the VM/shim running and leak resources. If the intent is to not fail shutdown on poisoning, consider recovering the guard via PoisonError::into_inner() and still calling stop(), while continuing to swallow the poisoning error itself.
| if let Ok(mut h) = handler.lock() { | |
| h.stop() | |
| } else { | |
| Ok(()) | |
| match handler.lock() { | |
| Ok(mut h) => h.stop(), | |
| Err(poisoned) => { | |
| tracing::warn!("Handler mutex poisoned during shutdown; recovering guard to stop handler"); | |
| let mut h = poisoned.into_inner(); | |
| h.stop() | |
| } |
…_blocking ShimHandler::stop() uses a std::thread::sleep(50ms) polling loop (up to 2s). Called directly from async BoxImpl::stop(), this blocks a Tokio worker thread. Similarly, handler.metrics() performs sysinfo syscalls (~5ms blocking I/O). Changes: - Wrap LiveState.handler in Arc so it can be cloned into spawn_blocking closures - handler.stop() and handler.metrics() now run on Tokio's blocking thread pool - Lock poisoning uses asymmetric handling: stop() swallows (shutdown must not be blocked), metrics() propagates (monitoring should surface anomalies) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8aaef1f to
8d043d1
Compare
Summary
ShimHandler::stop()uses astd::thread::sleep(50ms)polling loop (up to 2 seconds).Called directly from async
BoxImpl::stop(), this blocks a Tokio worker thread.Similarly,
handler.metrics()performs sysinfo syscalls (~5ms), also a blocking operation.LiveState.handlerfromstd::sync::Mutex<...>toArc<std::sync::Mutex<...>>so it can be cloned and moved into
spawn_blockingclosures.handler.stop()andhandler.metrics()now run on Tokio's dedicated blockingthread pool via
tokio::task::spawn_blocking, keeping worker threads free.stop()swallows it (shutdown must not beblocked by a poisoned lock),
metrics()propagates it (monitoring should surfaceanomalies).
Files changed
boxlite/src/litebox/box_impl.rsspawn_blockingforstop()andmetrics()Test plan
cargo test -p boxlite --no-default-features --lib— 593 passed (main branch)cargo clippy -p boxlite --no-default-features --lib -- -D warnings— cleancargo fmt -- --check— clean🤖 Generated with Claude Code