feat(predict-rlm): reattachable persistent sandboxes + on-demand interrupt (#41, #42) #44
Merged
interrupt() only sent the ws frame and returned, ignoring its `timeout`. The worker blocked in the execute recv loop was still the sole reader of self._ws, so the next request raced it with a concurrent recv and tripped a websockets ConcurrencyError -- surfaced by Fractal running two turns in one event loop (the existing tests masked it via asyncio.run teardown joining the orphaned worker between turns). interrupt() now waits on the execution gate until the interrupted cell drains (bounded by `timeout`), raising so the cancellation path hard-aborts if it never releases. Adds BackendExecutionGate.wait_until_idle() and a deterministic regression test asserting the gate is idle once interrupt returns.
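A minimal sketch of the gate behaviour described above, assuming a condition-variable implementation. Only the name `wait_until_idle` comes from this PR; `ExecutionGateSketch`, `acquire`, `release`, and the free function `interrupt_and_drain` are hypothetical stand-ins, and the real `BackendExecutionGate` may be built differently.

```python
import threading


class ExecutionGateSketch:
    """Hypothetical stand-in for BackendExecutionGate (not the real class)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._busy = False

    def acquire(self):
        # Block until no cell is executing, then mark the gate busy.
        with self._cond:
            while self._busy:
                self._cond.wait()
            self._busy = True

    def release(self):
        # Mark the gate idle and wake every waiter.
        with self._cond:
            self._busy = False
            self._cond.notify_all()

    def wait_until_idle(self, timeout):
        # True once the gate is idle, False if `timeout` seconds elapse first.
        with self._cond:
            return self._cond.wait_for(lambda: not self._busy, timeout=timeout)


def interrupt_and_drain(gate, send_interrupt_frame, timeout):
    """Send the interrupt, then wait (bounded) for the interrupted cell to drain."""
    send_interrupt_frame()  # assumption: a callable that only sends, never receives
    if not gate.wait_until_idle(timeout):
        # Raising lets the caller fall back to a hard abort.
        raise TimeoutError("interrupted cell did not release the execution gate")
```

With this shape, the next request can only start once the previous worker has stopped reading the websocket, which is the property the regression test asserts.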
When the deterministic staging root was nested in a direct workspace mount it relocated to a random mkdtemp path each session, so a reattached container's bind mounts pointed at the prior session's vanished temp dir and the websocket supervisor never started (30s connect timeout). Relocate reusable named sandboxes to a stable per-name temp path instead.
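One way to get a stable per-name temp path is to derive it from a hash of the sandbox name rather than calling `mkdtemp`. This is a sketch under that assumption; the helper name `stable_staging_root` and the `sbx-staging-` prefix are hypothetical and not taken from the PR.

```python
import hashlib
import tempfile
from pathlib import Path


def stable_staging_root(name: str) -> Path:
    """Return the same temp directory for the same sandbox name on every call."""
    # Hash the name so arbitrary characters cannot escape the temp directory.
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:16]
    root = Path(tempfile.gettempdir()) / f"sbx-staging-{digest}"
    # exist_ok=True makes a second session reuse the directory, not fail.
    root.mkdir(parents=True, exist_ok=True)
    return root
```

Because the path depends only on the name, bind mounts recorded by an earlier session still resolve when a later session reattaches.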
…ure (#41)

A connect/handshake failure left the spawned supervisor alive with _websocket_url still set, so the next prewarm/execute on the same backend short-circuited relaunch and reconnected to the dead endpoint (another full startup timeout), while the failed attempt's process and published port lingered. Reset transport state on failure so the next attempt rebuilds. Also adds a real-sbx test that reattaches after an interpreter error and keeps using the sandbox.
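The reset-on-failure pattern can be sketched as follows. Only the attribute name `_websocket_url` comes from the commit message; `TransportSketch`, `ensure_connected`, `_reset_transport`, and the injected `launch` / `connect` callables are hypothetical.

```python
class TransportSketch:
    """Hypothetical illustration of resetting transport state after a failed connect."""

    def __init__(self, launch, connect):
        self._launch = launch      # assumption: returns (process, websocket_url)
        self._connect = connect    # assumption: raises on connect/handshake failure
        self._process = None
        self._websocket_url = None

    def ensure_connected(self):
        # Only launch a supervisor when no endpoint is recorded.
        if self._websocket_url is None:
            self._process, self._websocket_url = self._launch()
        try:
            return self._connect(self._websocket_url)
        except Exception:
            # Without this reset, the next call would skip relaunch and
            # reconnect to the dead endpoint.
            self._reset_transport()
            raise

    def _reset_transport(self):
        process, self._process = self._process, None
        self._websocket_url = None
        if process is not None:
            process.kill()  # do not leave the failed attempt's process behind
```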
A terminal Ctrl-C interrupting an RLM turn delivers SIGINT to the whole
foreground process group. The sbx exec supervisor subprocess was launched
without start_new_session, so the Go child caught the signal, cancelled its
context ("ERROR: context canceled"), and exited. Python only saw an
asyncio.CancelledError during the LLM phase (no execute in flight), so the
in-band #42 teardown never ran and the dead supervisor was handed back as
healthy. The next request then failed with "Sbx supervisor exited
unexpectedly".
Launch all supervisor subprocesses with start_new_session=True so they run in
their own process group, detached from the controlling terminal. Interruption
now flows exclusively through the in-band websocket interrupt frame as #42
intended. shutdown() already kills by explicit pid, so teardown is unaffected.
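The effect of `start_new_session=True` can be checked with a short POSIX-only sketch. The helper names below are hypothetical, and the child is a plain Python sleeper standing in for the supervisor.

```python
import os
import subprocess
import sys


def launch_detached(cmd):
    """Start `cmd` in its own session so a terminal Ctrl-C does not reach it."""
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # the child calls setsid() before exec
    )


def in_own_process_group(proc):
    # A detached child leads its own group, so its pgid differs from ours.
    return os.getpgid(proc.pid) != os.getpgid(0)


def demo():
    proc = launch_detached([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        return in_own_process_group(proc)
    finally:
        # Teardown targets the explicit pid, so detaching does not affect it.
        proc.kill()
        proc.wait()
```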
…egistration

register_runtime_hooks pre-creates the in-sandbox kernel via _ensure_kernel without the host_tool_bridge, so the kernel forks with HOST_TOOL_REQUEST_QUEUE unset. Every later execute reuses that live kernel, so the bridge is never wired and _send_protocol writes tool calls to stdout (DEVNULL in websocket mode); host tool calls vanish and the supervisor hangs until the watchdog. Thread host_tool_bridge through register_runtime_hooks so the kernel is wired to the bridge from creation.
1f525bf to
377bfdf
Compare
glesperance
approved these changes
Jun 18, 2026
Rationale
`SbxBackend` (the Docker Sandboxes execution backend) has two lifecycle gaps that hurt consumers running it interactively:

- No sandbox reuse between sessions (PRD: Reattachable persistent sandboxes in SbxBackend (hot per-directory reuse) #41). Every session does a full `sbx create` → package bootstrap → run → `sbx rm` cycle. The create + bootstrap steps dominate startup latency, and teardown discards a container that the very next session in the same directory immediately recreates. There was no path to discover and reattach to a pre-existing named sandbox.
- No way to interrupt a running cell, and `aexecute` wedges on cancellation (SbxBackend: on-demand execution interrupt + cancellation-safe aexecute #42). `aexecute` runs sync `execute` on an `asyncio.to_thread` worker. Host cancellation only cancels the awaiting coroutine; the worker keeps blocking in `ws.recv()` as the sole reader of the websocket. The next request then raises a websockets `ConcurrencyError`, the cell keeps running in-sandbox until its own timeout (~140s teardown observed), and the event loop can't tear down cleanly. The subprocess supervisor backend already guards this; `SbxBackend` did not.

Both land in `SbxBackend` but are orthogonal: #41 governs the sandbox lifecycle between sessions; #42 governs interrupting a cell within a live session. They only share `shutdown()` adjacency, which is kept isolated.

Summary
Reattachable persistent sandboxes (#41)
- `SbxConfig` gains `reuse: bool` (requires `name`; implies `persist=True` + `remove_on_shutdown=False`) and `stop_on_shutdown: bool`.
- … `name` when `reuse=True` (stable mounts across sessions) instead of `uuid4()`.
- `_start_sbx_and_prepare_supervisor`: probe `sbx ls` → 3-way resolve: running (reattach, skip create/bootstrap), stopped (`sbx start` then reattach), missing/unhealthy (`sbx rm` + clean recreate). Self-healing, never a hard error.
- `destroy()` / `remove(name)` teardown API; reattach lifecycle telemetry (`sbx.reattach.*`).
- `shutdown()` under `reuse=True` never runs `sbx rm` and never deletes the staging root; optional `sbx stop` via `stop_on_shutdown`.
- … (`reuse=False`) behaviour is unchanged.

Concurrent reuse safety (#41)
- `websocket_port` now defaults to `0`: the in-container supervisor binds an ephemeral port and reports it on stdout, which the host reads and publishes. Previously the fixed `8765` meant a second supervisor in the same container's network namespace could not bind and exited, surfacing as a websocket connect/exit error on the second backend. Multiple supervisors now coexist in one container.
- `reuse=True` backends share a deterministic staging root (the container's primary workspace), so running them concurrently in non-direct mode would clobber each other's skill modules / output files and `reset()` would wipe the sibling's state. A host-side attach lock in that shared root now makes the second non-direct attach to a sandbox a live backend already holds fail fast with a clear, actionable error instead of corrupting silently. Stale locks (dead owner process) self-reclaim; the lock releases on `shutdown()`. Direct-mount backends bypass the staging root, take no lock, and may coexist (each supervisor on its own ephemeral port).

Terminal SIGINT isolation (#42)
- Supervisor subprocesses are launched with `start_new_session=True`. A terminal Ctrl-C interrupting an RLM turn was delivered to the whole foreground process group, so the `sbx` Go child caught it, cancelled its context (`ERROR: context canceled`), and exited, while Python only saw an `asyncio.CancelledError` during the LLM phase and handed the dead supervisor back as healthy, failing the next turn. Detaching the process group keeps Ctrl-C off the child; interruption flows exclusively through the in-band websocket interrupt frame.

On-demand interrupt + cancellation-safe `aexecute` (#42)

- … (`backends/supervisor/_payload.py`, shared by sbx): new out-of-band `"interrupt"` JSON-RPC method that trips the same SIGINT → grace → hard-kill → restore-from-snapshot path the execution timeout already uses; no-op ack when idle.
- … (`SbxBackend`): `interrupt()` / `ainterrupt()` send the interrupt frame thread-safely via `ws.send` (never `ws.recv`); `aexecute` now catches `asyncio.CancelledError`, gracefully interrupts so the worker unwinds promptly, and re-raises, keeping the warm sandbox. Documented hard fallback (tear down ws + supervisor) when graceful interrupt fails.

Built test-first: each issue has a `test(sbx)` commit landed before its `feat(sbx)` commit.

Test Plan
- `make test-sbx` (sbx/supervisor unit seam, no Docker): 215 passed, 19 skipped
- `tests/test_supervisor_client.py`: 5 passed
- `test_persist_reattach_destroy_lifecycle` (PRD: Reattachable persistent sandboxes in SbxBackend (hot per-directory reuse) #41 persist → reconnect → destroy): skipped without `sbx login`
- `make test-integration-sbx` on a host with the `sbx` CLI + `sbx login` (exercises PRD: Reattachable persistent sandboxes in SbxBackend (hot per-directory reuse) #41's `sbx ls`/`start`/`stop` CLI assumptions end-to-end)
- … (`TestSbxBackendAttachLock`); ephemeral-port report + two supervisors coexist (`TestSbxSupervisorEphemeralPort`)
- … `test_supervisor_runs_in_its_own_process_group`
- … `time.sleep(120)` cell → next `execute` succeeds within ~timeout, warm variable survives, no `ConcurrencyError`

Closes #41
Closes #42
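
For reviewers, the three-way resolve listed under "Reattachable persistent sandboxes (#41)" can be read as a pure decision function. This is a sketch only: `SandboxState`, `plan_attach`, and the action labels are hypothetical, and it does not invoke the `sbx` CLI or reflect how `_start_sbx_and_prepare_supervisor` is actually structured.

```python
from enum import Enum


class SandboxState(Enum):
    """Hypothetical classification of what the `sbx ls` probe found."""
    RUNNING = "running"
    STOPPED = "stopped"
    MISSING = "missing"


def plan_attach(state: SandboxState, healthy: bool = True) -> list:
    """Return ordered action labels for attaching to a named sandbox."""
    if state is SandboxState.RUNNING and healthy:
        # Hot path: skip create and bootstrap entirely.
        return ["reattach"]
    if state is SandboxState.STOPPED and healthy:
        return ["sbx start", "reattach"]
    # Missing or unhealthy: self-heal with a clean recreate, never a hard error.
    return ["sbx rm", "sbx create", "bootstrap"]
```

Keeping the decision separate from the CLI calls makes each branch easy to unit-test without Docker.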