
feat(predict-rlm): reattachable persistent sandboxes + on-demand interrupt (#41, #42)#44

Merged
magix022 merged 11 commits into main from feat/sbx-reattach-and-interrupt-41-42 on Jun 18, 2026
@magix022 magix022 commented Jun 17, 2026

Rationale

SbxBackend (the Docker Sandboxes execution backend) has two lifecycle gaps that hurt consumers running it interactively:

  1. No sandbox reuse between sessions (PRD: Reattachable persistent sandboxes in SbxBackend (hot per-directory reuse) #41). Every session runs a full sbx create → package bootstrap → run → sbx rm cycle. The create and bootstrap steps dominate startup latency, and teardown discards a container that the next session in the same directory immediately recreates. There is no way to discover and reattach to a pre-existing named sandbox.

  2. No way to interrupt a running cell, and aexecute wedges on cancellation (SbxBackend: on-demand execution interrupt + cancellation-safe aexecute #42). aexecute runs sync execute on an asyncio.to_thread worker. Host cancellation only cancels the awaiting coroutine — the worker keeps blocking in ws.recv() as the sole reader of the websocket. The next request then raises websockets ConcurrencyError, the cell keeps running in-sandbox until its own timeout (~140s teardown observed), and the event loop can't tear down cleanly. The subprocess supervisor backend already guards this; SbxBackend did not.

Both land in SbxBackend but are orthogonal: #41 governs the sandbox lifecycle between sessions; #42 governs interrupting a cell within a live session. They only share shutdown() adjacency, which is kept isolated.
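
The wedge in item 2 can be reproduced without a sandbox. The sketch below is a minimal illustration, not SbxBackend code: `blocking_cell` and `cancel_awaiting_coroutine` are hypothetical names, and `time.sleep` stands in for the worker blocked in ws.recv(). It shows that cancelling the awaiting task leaves the `asyncio.to_thread` worker running.

```python
import asyncio
import threading
import time


def blocking_cell(done: threading.Event, seconds: float) -> None:
    """Stand-in for sync execute blocked in ws.recv(); it cannot see cancellation."""
    time.sleep(seconds)
    done.set()


async def cancel_awaiting_coroutine(seconds: float = 0.3) -> tuple[bool, bool]:
    """Cancel the awaiting task early and report what the worker thread did."""
    done = threading.Event()
    task = asyncio.create_task(asyncio.to_thread(blocking_cell, done, seconds))
    await asyncio.sleep(0.05)  # give the worker thread time to start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # only the awaiting coroutine was cancelled
    still_running_after_cancel = not done.is_set()
    # Nothing interrupted the thread, so it finishes on its own schedule.
    finished_anyway = await asyncio.to_thread(done.wait, seconds * 4)
    return still_running_after_cancel, finished_anyway
```

Both returned flags being true is the failure mode: the caller believes the call ended, while the thread is still the only reader of the connection.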

Summary

Reattachable persistent sandboxes (#41)

  • SbxConfig gains reuse: bool (requires name; implies persist=True + remove_on_shutdown=False) and stop_on_shutdown: bool.
  • Deterministic staging root derived from name when reuse=True (stable mounts across sessions) instead of uuid4().
  • Reattach detection in _start_sbx_and_prepare_supervisor: probe sbx ls → 3-way resolve running (reattach, skip create/bootstrap), stopped (sbx start then reattach), missing/unhealthy (sbx rm + clean recreate). Self-healing — never a hard error.
  • New destroy() / remove(name) teardown API; reattach lifecycle telemetry (sbx.reattach.*).
  • shutdown() under reuse=True never runs sbx rm and never deletes the staging root; optional sbx stop via stop_on_shutdown.
  • Default (reuse=False) behaviour is unchanged.
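
The rules above can be sketched as follows. This is an illustration under stated assumptions, not the real SbxConfig: `SbxConfigSketch`, `staging_root`, `resolve_attach_action`, the default values, the state strings, and the hashing scheme are all hypothetical. Only the field names and the three outcomes come from this PR.

```python
import hashlib
import tempfile
import uuid
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class SbxConfigSketch:
    """Hypothetical stand-in for SbxConfig; defaults here are assumptions."""

    name: Optional[str] = None
    reuse: bool = False
    persist: bool = False
    remove_on_shutdown: bool = True
    stop_on_shutdown: bool = False

    def __post_init__(self) -> None:
        if self.reuse:
            if not self.name:
                raise ValueError("reuse=True requires a sandbox name")
            # reuse implies a persistent sandbox that survives shutdown().
            self.persist = True
            self.remove_on_shutdown = False


def staging_root(config: SbxConfigSketch) -> Path:
    """Stable per-name path under reuse, fresh random path otherwise."""
    base = Path(tempfile.gettempdir())
    if config.reuse and config.name:
        digest = hashlib.sha256(config.name.encode("utf-8")).hexdigest()[:16]
        return base / f"sbx-staging-{digest}"
    return base / f"sbx-staging-{uuid.uuid4().hex}"


def resolve_attach_action(state: Optional[str], healthy: bool = True) -> str:
    """Map a probed sandbox state to one of the three outcomes."""
    if state == "running" and healthy:
        return "reattach"  # skip create and bootstrap
    if state == "stopped" and healthy:
        return "start_then_reattach"
    return "recreate"  # missing or unhealthy: remove, then create cleanly
```

Because every unrecognised state falls through to `recreate`, the resolver has no error branch, which matches the self-healing intent.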

Concurrent reuse safety (#41)

  • Ephemeral supervisor port. websocket_port now defaults to 0: the in-container supervisor binds an ephemeral port and reports it on stdout, which the host reads and publishes. Previously the fixed 8765 meant a second supervisor in the same container's network namespace could not bind and exited — surfacing as a websocket connect/exit error on the second backend. Multiple supervisors now coexist in one container.
  • Guarded case — non-direct concurrent reattach. Two same-name reuse=True backends share a deterministic staging root (the container's primary workspace), so running them concurrently in non-direct mode would clobber each other's skill modules / output files and reset() would wipe the sibling's state. A host-side attach lock in that shared root now makes the second non-direct attach to a sandbox a live backend already holds fail fast with a clear, actionable error instead of corrupting silently. Stale locks (dead owner process) self-reclaim; the lock releases on shutdown(). Direct-mount backends bypass the staging root, take no lock, and may coexist (each supervisor on its own ephemeral port).
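
Both mechanisms can be sketched with the standard library. The code below is a POSIX-oriented illustration, not the implementation in this PR: `bind_ephemeral_port`, `acquire_attach_lock`, `release_attach_lock`, the `attach.lock` file name, and the JSON pid payload are hypothetical.

```python
import json
import os
import socket
from pathlib import Path


def bind_ephemeral_port(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    """Bind port 0 so the OS picks a free port, then read back the real one."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen()
    return sock, sock.getsockname()[1]


def _pid_alive(pid: int) -> bool:
    """POSIX probe: signal 0 checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire_attach_lock(root: Path) -> Path:
    """Create the lock file atomically; reclaim it if its owner is dead."""
    root.mkdir(parents=True, exist_ok=True)
    lock = root / "attach.lock"
    for _ in range(2):  # the second pass runs only after a stale reclaim
        try:
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                owner = json.loads(lock.read_text())["pid"]
            except (OSError, ValueError, KeyError, TypeError):
                owner = None  # unreadable lock is treated as stale
            if owner is not None and _pid_alive(owner):
                raise RuntimeError(f"sandbox already attached by live pid {owner}")
            lock.unlink(missing_ok=True)
            continue
        with os.fdopen(fd, "w") as handle:
            json.dump({"pid": os.getpid()}, handle)
        return lock
    raise RuntimeError("could not acquire attach lock")


def release_attach_lock(lock: Path) -> None:
    """Idempotent release, suitable for calling from shutdown()."""
    lock.unlink(missing_ok=True)
```

One simplification to note: a lock file that exists but has not been written yet is treated as stale here, so a production version would need to close that window between create and write.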

Terminal SIGINT isolation (#42)

  • Supervisor subprocesses are launched with start_new_session=True. A terminal Ctrl-C interrupting an RLM turn was delivered to the whole foreground process group, so the sbx Go child caught it, cancelled its context (ERROR: context canceled), and exited — while Python only saw an asyncio.CancelledError during the LLM phase and handed the dead supervisor back as healthy, failing the next turn. Detaching the process group keeps Ctrl-C off the child; interruption flows exclusively through the in-band websocket interrupt frame.
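
The effect of start_new_session=True can be checked directly on POSIX. The helper names `spawn_detached` and `shares_process_group` below are hypothetical; the sketch only demonstrates the standard `subprocess.Popen` flag, not how the supervisor is actually launched.

```python
import os
import subprocess


def spawn_detached(argv: list[str]) -> subprocess.Popen:
    """Launch a child in its own session, so it leaves our process group."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # the child calls setsid() before exec (POSIX)
    )


def shares_process_group(child: subprocess.Popen) -> bool:
    """True if a terminal Ctrl-C aimed at our group would also reach the child."""
    return os.getpgid(child.pid) == os.getpgid(0)
```

A child started this way leads its own process group, so its group id equals its pid and the terminal's SIGINT no longer reaches it.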

On-demand interrupt + cancellation-safe aexecute (#42)

  • Server (backends/supervisor/_payload.py, shared by sbx): new out-of-band "interrupt" JSON-RPC method that trips the same SIGINT → grace → hard-kill → restore-from-snapshot path the execution timeout already uses; no-op ack when idle.
  • Client (SbxBackend): interrupt() / ainterrupt() send the interrupt frame thread-safely via ws.send (never ws.recv); aexecute now catches asyncio.CancelledError, gracefully interrupts so the worker unwinds promptly, and re-raises — keeping the warm sandbox. Documented hard fallback (tear down ws + supervisor) when graceful interrupt fails.
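
The client-side pattern (catch the cancellation, interrupt, wait for the worker to unwind, re-raise) can be sketched without a websocket. `InterruptibleWorker` is a hypothetical stand-in, and a `threading.Event` replaces the in-band interrupt frame; the real backend sends the frame via ws.send as described above.

```python
import asyncio
import threading


class InterruptibleWorker:
    """Hypothetical stand-in for the backend; an Event replaces the ws frame."""

    def __init__(self) -> None:
        self._interrupt = threading.Event()
        self._idle = threading.Event()
        self._idle.set()

    def execute(self, seconds: float) -> str:
        """Blocking 'cell' that unwinds early once interrupted."""
        self._idle.clear()
        try:
            interrupted = self._interrupt.wait(seconds)
            return "interrupted" if interrupted else "done"
        finally:
            self._idle.set()

    def interrupt(self, timeout: float = 2.0) -> None:
        """Signal the cell, then wait until the worker has actually drained."""
        self._interrupt.set()
        if not self._idle.wait(timeout):
            raise TimeoutError("cell did not drain; fall back to a hard teardown")

    async def aexecute(self, seconds: float) -> str:
        self._interrupt.clear()  # drop any interrupt left from a previous cell
        try:
            return await asyncio.to_thread(self.execute, seconds)
        except asyncio.CancelledError:
            # Unwind the worker first, so the next call finds the backend idle.
            await asyncio.to_thread(self.interrupt)
            raise
```

Re-raising keeps normal asyncio cancellation semantics for the caller, while the worker is already idle by the time the exception propagates.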

Built test-first: each issue has a test(sbx) commit landed before its feat(sbx) commit.

Test Plan

  • make test-sbx — sbx/supervisor unit seam (no Docker): 215 passed, 19 skipped
  • tests/test_supervisor_client.py: 5 passed
  • Real-Docker integration tests collect cleanly, including the new test_persist_reattach_destroy_lifecycle (#41: persist → reconnect → destroy); skipped without sbx login
  • make test-integration-sbx on a host with the sbx CLI + sbx login (exercises #41's sbx ls / start / stop CLI assumptions end-to-end)
  • Concurrent reuse — unit: attach-lock reject / release / idempotent / stale-reclaim (TestSbxBackendAttachLock); ephemeral-port report + two supervisors coexist (TestSbxSupervisorEphemeralPort)
  • Concurrent reuse — integration (real sbx): second non-direct reattach is rejected and the first is unharmed; two direct-mount backends coexist on distinct ephemeral ports
  • Terminal SIGINT isolation — test_supervisor_runs_in_its_own_process_group
  • Manual #42 check: interrupt a time.sleep(120) cell → next execute succeeds within ~timeout, warm variable survives, no ConcurrencyError

Closes #41
Closes #42

magix022 added 5 commits June 16, 2026 17:18
interrupt() only sent the ws frame and returned, ignoring its `timeout`.
The worker blocked in the execute recv loop was still the sole reader of
self._ws, so the next request raced it with a concurrent recv and tripped
a websockets ConcurrencyError -- surfaced by Fractal running two turns in
one event loop (the existing tests masked it via asyncio.run teardown
joining the orphaned worker between turns).

interrupt() now waits on the execution gate until the interrupted cell
drains (bounded by `timeout`), raising so the cancellation path hard-aborts
if it never releases. Adds BackendExecutionGate.wait_until_idle() and a
deterministic regression test asserting the gate is idle once interrupt
returns.
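
A bounded wait of this kind can be built on `threading.Condition`. The class below is a hypothetical sketch, not the real BackendExecutionGate; only the wait_until_idle name comes from this commit.

```python
import threading


class ExecutionGateSketch:
    """Hypothetical gate: one cell at a time, plus a bounded wait for idle."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._busy = False

    def __enter__(self) -> "ExecutionGateSketch":
        with self._cond:
            while self._busy:
                self._cond.wait()
            self._busy = True
        return self

    def __exit__(self, *exc_info: object) -> None:
        with self._cond:
            self._busy = False
            self._cond.notify_all()

    def wait_until_idle(self, timeout: float) -> None:
        """Block until no cell holds the gate; raise if `timeout` elapses first."""
        with self._cond:
            if not self._cond.wait_for(lambda: not self._busy, timeout=timeout):
                raise TimeoutError("cell still running after interrupt")
```

Raising on timeout, instead of returning a flag, lets the cancellation path escalate to the hard abort without an extra check.
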
@magix022 magix022 marked this pull request as draft June 17, 2026 00:42
magix022 added 4 commits June 17, 2026 10:30
When the deterministic staging root was nested in a direct workspace
mount it relocated to a random mkdtemp path each session, so a reattached
container's bind mounts pointed at the prior session's vanished temp dir
and the websocket supervisor never started (30s connect timeout).
Relocate reusable named sandboxes to a stable per-name temp path instead.

…ure (#41)

A connect/handshake failure left the spawned supervisor alive with
_websocket_url still set, so the next prewarm/execute on the same backend
short-circuited relaunch and reconnected to the dead endpoint (another full
startup timeout), while the failed attempt's process and published port
lingered. Reset transport state on failure so the next attempt rebuilds.

Also adds a real-sbx test that reattaches after an interpreter error and
keeps using the sandbox.
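
The reset-on-failure rule can be sketched as below. `TransportStateSketch` and its injected `launch`, `connect`, and `kill` callables are hypothetical; only the `_websocket_url` attribute name comes from this commit.

```python
from typing import Callable, Optional


class TransportStateSketch:
    """Hypothetical holder for the supervisor endpoint and connection state."""

    def __init__(
        self,
        launch: Callable[[], str],
        connect: Callable[[str], None],
        kill: Callable[[], None],
    ) -> None:
        self._launch = launch
        self._connect = connect
        self._kill = kill
        self._websocket_url: Optional[str] = None
        self._connected = False

    def ensure_connected(self) -> str:
        if self._connected and self._websocket_url is not None:
            return self._websocket_url  # healthy transport: reuse it
        if self._websocket_url is None:
            self._websocket_url = self._launch()  # spawn and publish the port
        try:
            self._connect(self._websocket_url)
        except Exception:
            # Drop everything from the failed attempt so the next call relaunches.
            self._kill()
            self._websocket_url = None
            self._connected = False
            raise
        self._connected = True
        return self._websocket_url
```

Without the reset in the `except` branch, the stale URL would short-circuit the relaunch and the next call would dial the dead endpoint again.
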
A terminal Ctrl-C interrupting an RLM turn delivers SIGINT to the whole
foreground process group. The sbx exec supervisor subprocess was launched
without start_new_session, so the Go child caught the signal, cancelled its
context ("ERROR: context canceled"), and exited. Python only saw an
asyncio.CancelledError during the LLM phase (no execute in flight), so the
in-band #42 teardown never ran and the dead supervisor was handed back as
healthy. The next request then failed with "Sbx supervisor exited
unexpectedly".

Launch all supervisor subprocesses with start_new_session=True so they run in
their own process group, detached from the controlling terminal. Interruption
now flows exclusively through the in-band websocket interrupt frame as #42
intended. shutdown() already kills by explicit pid, so teardown is unaffected.

…egistration

register_runtime_hooks pre-creates the in-sandbox kernel via _ensure_kernel
without the host_tool_bridge, so the kernel forks with HOST_TOOL_REQUEST_QUEUE
unset. Every later execute reuses that live kernel, so the bridge is never
wired and _send_protocol writes tool calls to stdout (DEVNULL in websocket
mode) — host tool calls vanish and the supervisor hangs until the watchdog.

Thread host_tool_bridge through register_runtime_hooks so the kernel is wired
to the bridge from creation.
@magix022 magix022 marked this pull request as ready for review June 17, 2026 16:40
@magix022 magix022 force-pushed the feat/sbx-reattach-and-interrupt-41-42 branch 2 times, most recently from 1f525bf to 377bfdf June 17, 2026 20:52
@magix022 magix022 merged commit cbd04cf into main Jun 18, 2026
7 of 8 checks passed
@magix022 magix022 deleted the feat/sbx-reattach-and-interrupt-41-42 branch June 18, 2026 16:05