fix(server): isolate CUDA tensor registry on a dedicated thread#301
Merged
Conversation
The gRPC (tonic) and HTTP (axum) servers share one tokio runtime. Several async handlers took `registry.lock()` and called CudaTensorRegistry methods inline on async worker threads. Those methods take the Python GIL and run `torch.cuda.empty_cache()`, a CUDA device sync that never returns on a wedged GPU. A single stuck call pinned a worker; a few pinned every worker, so the whole runtime went dark -- including pure-Rust `/health` and `/metrics` (executor starvation, not GIL contention). A few `POST /instances/cleanup` on an idle server were enough to reproduce it. Move the registry behind a dedicated `cuda-registry` OS thread (actor) that exclusively owns it and drains commands over an mpsc channel with oneshot replies. RegistryHandle exposes async-only register_layers / drop_instance / clear; every call site only `.await`s a reply, so GIL/CUDA work can no longer run on an async worker by construction. A wedged GPU now pins one thread. All five inline sites are rewired: HTTP cleanup_handler / cleanup_memory_cache_handler and gRPC register_context_batch / unregister_context / shutdown / session-close watcher. The per-handler spawn_blocking boilerplate and Arc<Mutex<CudaTensorRegistry>> are gone. Also: - shutdown no longer releases tensors before signaling: empty_cache() on a wedged GPU would stall the very shutdown path meant to escape it. - the "skip GIL when no live tensors" decision is unified in one release_contexts() instead of duplicated `== 0` checks. - register errors are stringified on the registry thread, so service.rs no longer touches pyo3 at all. Regression test (tests/http_cleanup_hang_repro.rs) wedges the registry actor via the real mechanism (held GIL inside materialize_tensor) and asserts /health, /metrics, and a gRPC health RPC all stay responsive on a 2-worker runtime; verified it hangs >60s against the pre-fix inline path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
b82a3cb to
af550c0
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The gRPC (tonic) and HTTP (axum) servers share one tokio runtime. Several async handlers took
registry.lock()and calledCudaTensorRegistrymethods inline on async worker threads. Those methods take the Python GIL and runtorch.cuda.empty_cache()— a CUDA device sync that never returns on a wedged GPU.A single stuck call pinned a tokio worker; a few of them pinned every worker, so the whole runtime went dark — including pure-Rust
/healthand/metricsthat touch neither the GIL nor the registry. They hung not on the GIL, but because no worker was left to schedule them (executor starvation). A fewPOST /instances/cleanupon an idle server were enough to trigger it.Fix
Move the registry behind a dedicated
cuda-registryOS thread (actor) that exclusively owns it and drains commands over anmpscchannel withoneshotreplies.RegistryHandleexposes async-onlyregister_layers/drop_instance/clear; every call site only.awaits a reply, so GIL/CUDA work can no longer run on an async worker by construction. A wedged GPU now pins exactly one thread.All five inline sites are rewired:
cleanup_handler/cleanup_memory_cache_handlerregister_context_batch/unregister_context/shutdown/ session-close watcherThe per-handler
spawn_blockingboilerplate andArc<Mutex<CudaTensorRegistry>>are gone.Smaller hardening (from an adversarial self-review):
shutdownno longer releases tensors before signaling —empty_cache()on a wedged GPU would stall the very shutdown path meant to escape it.release_contexts()instead of duplicated== 0checks.service.rsno longer touches pyo3 at all.Test
pegaflow-server/tests/http_cleanup_hang_repro.rswedges the registry actor via the real mechanism — holding the process GIL from a side thread, then driving the gRPCregister_context_batchhot path so the actor blocks insidematerialize_tensor'sPython::attach. It first asserts the wedge is real (assert_registry_wedged: a register RPC must time out), then asserts HTTP/health+/metricsand a gRPChealthRPC all answer within 2s on a 2-worker runtime.Verified it hangs >60s against the pre-fix inline path, so it genuinely bites (not a paper guard).
Verification
cargo clippy -p pegaflow-server --all-targets -- -D warnings(default cu12 and--no-default-features --features cuda-13): clean.cargo fmt --all -- --check: clean.🤖 Generated with Claude Code