feat(predict-rlm): reattachable persistent sandboxes + on-demand interrupt (#41, #42) by magix022 · Pull Request #44 · Trampoline-AI/predict-rlm

magix022 · 2026-06-17T00:31:29Z

Rationale

SbxBackend (the Docker Sandboxes execution backend) has two lifecycle gaps that hurt consumers running it interactively:

No sandbox reuse between sessions (#41, "PRD: Reattachable persistent sandboxes in SbxBackend (hot per-directory reuse)"). Every session does a full sbx create → package bootstrap → run → sbx rm cycle. The create and bootstrap steps dominate startup latency, and teardown discards a container that the very next session in the same directory immediately recreates. There was no way to discover and reattach to a pre-existing named sandbox.
No way to interrupt a running cell, and aexecute wedges on cancellation (#42, "SbxBackend: on-demand execution interrupt + cancellation-safe aexecute"). aexecute runs the sync execute on an asyncio.to_thread worker. Host cancellation only cancels the awaiting coroutine; the worker keeps blocking in ws.recv() as the sole reader of the websocket. The next request then raises a websockets ConcurrencyError, the cell keeps running in-sandbox until its own timeout (~140s teardown observed), and the event loop can't tear down cleanly. The subprocess supervisor backend already guards against this; SbxBackend did not.

Both land in SbxBackend but are orthogonal: #41 governs the sandbox lifecycle between sessions; #42 governs interrupting a cell within a live session. Their only point of contact is shutdown(), where the two changes sit side by side but are kept isolated.

Summary

Reattachable persistent sandboxes (#41)

SbxConfig gains reuse: bool (requires name; implies persist=True + remove_on_shutdown=False) and stop_on_shutdown: bool.
Deterministic staging root derived from name when reuse=True (stable mounts across sessions) instead of uuid4().
Reattach detection in _start_sbx_and_prepare_supervisor: probe sbx ls, then resolve three ways: running (reattach, skip create/bootstrap), stopped (sbx start, then reattach), missing/unhealthy (sbx rm + clean recreate). Self-healing: never a hard error.
New destroy() / remove(name) teardown API; reattach lifecycle telemetry (sbx.reattach.*).
shutdown() under reuse=True never runs sbx rm and never deletes the staging root; optional sbx stop via stop_on_shutdown.
Default (reuse=False) behaviour is unchanged.
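
A minimal sketch of the reuse rules and the three-way resolve listed above. ReuseConfig, SandboxAction, resolve_action, the field defaults, and the status strings are hypothetical names chosen for illustration; the real SbxConfig defaults and the sbx ls output format are not shown in this description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


@dataclass
class ReuseConfig:
    """Hypothetical stand-in for the reuse-related SbxConfig fields."""

    name: Optional[str] = None
    reuse: bool = False
    persist: bool = False             # assumed default
    remove_on_shutdown: bool = True   # assumed default
    stop_on_shutdown: bool = False

    def __post_init__(self) -> None:
        if self.reuse:
            if not self.name:
                raise ValueError("reuse=True requires a sandbox name")
            # reuse implies a persistent sandbox that shutdown() leaves in place
            self.persist = True
            self.remove_on_shutdown = False


class SandboxAction(Enum):
    REATTACH = "reattach"                        # skip create/bootstrap
    START_THEN_REATTACH = "start_then_reattach"  # sbx start, then reattach
    RECREATE = "recreate"                        # sbx rm + clean recreate


def resolve_action(status: Optional[str], healthy: bool = True) -> SandboxAction:
    """Map a probed sandbox status to a recovery action; never raises.

    `status` is None when the named sandbox is missing. The strings
    "running" and "stopped" are assumed labels, not real sbx ls output.
    """
    if not healthy:
        return SandboxAction.RECREATE
    if status == "running":
        return SandboxAction.REATTACH
    if status == "stopped":
        return SandboxAction.START_THEN_REATTACH
    return SandboxAction.RECREATE
```

Any unknown or missing status falls through to RECREATE, which is one way to get the "never a hard error" behaviour described above.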

Concurrent reuse safety (#41)

Ephemeral supervisor port. websocket_port now defaults to 0: the in-container supervisor binds an ephemeral port and reports it on stdout, which the host reads and publishes. Previously the fixed 8765 meant a second supervisor in the same container's network namespace could not bind and exited — surfacing as a websocket connect/exit error on the second backend. Multiple supervisors now coexist in one container.
Guarded case: non-direct concurrent reattach. Two same-name reuse=True backends share a deterministic staging root (the container's primary workspace), so running them concurrently in non-direct mode would clobber each other's skill modules and output files, and reset() would wipe the sibling's state. A host-side attach lock in that shared root now makes a second non-direct attach fail fast with a clear, actionable error when a live backend already holds the sandbox, instead of corrupting state silently. Stale locks (dead owner process) self-reclaim; the lock releases on shutdown(). Direct-mount backends bypass the staging root, take no lock, and may coexist (each supervisor on its own ephemeral port).
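
The two mechanisms above can be sketched as follows, assuming a POSIX host. bind_ephemeral, acquire_attach_lock, release_attach_lock, AttachLockHeld, and the .attach.lock file name are hypothetical; the PR's actual lock format and port-reporting protocol are not shown here. The lock is deliberately simplified (the pid is written after the file is created, so a real implementation would need to close that window).

```python
import os
import socket
from pathlib import Path


def bind_ephemeral(host: str = "127.0.0.1") -> socket.socket:
    """Bind to port 0 so the OS picks a free port; read it via getsockname()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen()
    return sock


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


class AttachLockHeld(RuntimeError):
    """Raised when a live process already holds the attach lock."""


def acquire_attach_lock(root: Path) -> Path:
    lock = root / ".attach.lock"  # hypothetical file name
    for _ in range(2):  # second pass runs only after reclaiming a stale lock
        try:
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                owner = int(lock.read_text().strip())
            except (ValueError, FileNotFoundError):
                owner = -1
            if owner > 0 and pid_alive(owner):
                raise AttachLockHeld(
                    f"sandbox staging root {root} is held by live pid {owner}; "
                    "shut that backend down or use a direct mount"
                )
            lock.unlink(missing_ok=True)  # owner is dead: reclaim
            continue
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))
        return lock
    raise AttachLockHeld(f"could not acquire attach lock in {root}")


def release_attach_lock(lock: Path) -> None:
    """Idempotent release, safe to call more than once."""
    lock.unlink(missing_ok=True)
```

Creating the file with O_CREAT | O_EXCL makes acquisition atomic on a local filesystem, which is why the second attach can fail fast rather than race the first.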

Terminal SIGINT isolation (#42)

Supervisor subprocesses are launched with start_new_session=True. A terminal Ctrl-C interrupting an RLM turn was delivered to the whole foreground process group, so the sbx Go child caught it, cancelled its context (ERROR: context canceled), and exited — while Python only saw an asyncio.CancelledError during the LLM phase and handed the dead supervisor back as healthy, failing the next turn. Detaching the process group keeps Ctrl-C off the child; interruption flows exclusively through the in-band websocket interrupt frame.
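
A small illustration of the process-group detachment, assuming a POSIX host. spawn_detached and REPORT_PGID are hypothetical helpers; the child here is a plain Python one-liner standing in for the sbx supervisor process.

```python
import os
import subprocess
import sys
from typing import List, Tuple

# Child command that prints its own process group id.
REPORT_PGID = [sys.executable, "-c", "import os; print(os.getpgid(0))"]


def spawn_detached(cmd: List[str]) -> Tuple[int, int]:
    """Run cmd in a new session; return (child pid, pgid the child reports)."""
    proc = subprocess.Popen(
        cmd,
        start_new_session=True,  # child calls setsid(): new session, new group
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate()
    return proc.pid, int(out.strip())
```

Because the child leads its own process group, its group id equals its pid and differs from the parent's, so a terminal Ctrl-C aimed at the parent's foreground group does not reach it.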

On-demand interrupt + cancellation-safe aexecute (#42)

Server (backends/supervisor/_payload.py, shared by sbx): new out-of-band "interrupt" JSON-RPC method that trips the same SIGINT → grace → hard-kill → restore-from-snapshot path the execution timeout already uses; no-op ack when idle.
Client (SbxBackend): interrupt() / ainterrupt() send the interrupt frame thread-safely via ws.send (never ws.recv); aexecute now catches asyncio.CancelledError, gracefully interrupts so the worker unwinds promptly, and re-raises — keeping the warm sandbox. Documented hard fallback (tear down ws + supervisor) when graceful interrupt fails.
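
The client-side cancellation pattern can be sketched as below. ToyBackend is a hypothetical stand-in, not SbxBackend: a threading.Event replaces the websocket interrupt frame, and execute just waits instead of running a cell. Only the shape of aexecute (catch CancelledError, interrupt, re-raise) follows the description above.

```python
import asyncio
import threading
from typing import Tuple


class ToyBackend:
    """Hypothetical backend with a blocking execute() and an interrupt()."""

    def __init__(self) -> None:
        self._interrupt = threading.Event()
        self.worker_done = threading.Event()

    def execute(self, seconds: float) -> str:
        try:
            # Event.wait returns True if interrupt() fired before the timeout.
            if self._interrupt.wait(timeout=seconds):
                return "interrupted"
            return "completed"
        finally:
            self._interrupt.clear()
            self.worker_done.set()

    def interrupt(self) -> None:
        # Stand-in for sending the in-band interrupt frame.
        self._interrupt.set()

    async def aexecute(self, seconds: float) -> str:
        try:
            return await asyncio.to_thread(self.execute, seconds)
        except asyncio.CancelledError:
            # Cancelling the await does not stop the worker thread, so
            # interrupt it explicitly, then propagate the cancellation.
            self.interrupt()
            raise


async def demo() -> Tuple[bool, bool]:
    backend = ToyBackend()
    task = asyncio.create_task(backend.aexecute(30.0))
    await asyncio.sleep(0.1)  # let the worker thread start
    task.cancel()
    cancelled = False
    try:
        await task
    except asyncio.CancelledError:
        cancelled = True
    # The worker should unwind long before its 30 s wait would expire.
    return cancelled, backend.worker_done.wait(2.0)
```

Without the interrupt() call in the except branch, the worker in this sketch would keep blocking for the full 30 seconds, which mirrors the wedge described in the rationale.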

Built test-first: each issue has a test(sbx) commit landed before its feat(sbx) commit.

Test Plan

make test-sbx — sbx/supervisor unit seam (no Docker): 215 passed, 19 skipped
tests/test_supervisor_client.py — 5 passed
Real-Docker integration tests collect cleanly, including the new test_persist_reattach_destroy_lifecycle (#41 persist → reconnect → destroy); skipped without sbx login
make test-integration-sbx on a host with the sbx CLI + sbx login (exercises #41's sbx ls / start / stop CLI assumptions end-to-end)
Concurrent reuse — unit: attach-lock reject / release / idempotent / stale-reclaim (TestSbxBackendAttachLock); ephemeral-port report + two supervisors coexist (TestSbxSupervisorEphemeralPort)
Concurrent reuse — integration (real sbx): second non-direct reattach is rejected and the first is unharmed; two direct-mount backends coexist on distinct ephemeral ports
Terminal SIGINT isolation — test_supervisor_runs_in_its_own_process_group
Manual #42 check: interrupt a time.sleep(120) cell → next execute succeeds within ~timeout, warm variable survives, no ConcurrencyError

Closes #41
Closes #42

interrupt() only sent the ws frame and returned, ignoring its `timeout`. The worker blocked in the execute recv loop was still the sole reader of self._ws, so the next request raced it with a concurrent recv and tripped a websockets ConcurrencyError -- surfaced by Fractal running two turns in one event loop (the existing tests masked it via asyncio.run teardown joining the orphaned worker between turns). interrupt() now waits on the execution gate until the interrupted cell drains (bounded by `timeout`), raising so the cancellation path hard-aborts if it never releases. Adds BackendExecutionGate.wait_until_idle() and a deterministic regression test asserting the gate is idle once interrupt returns.
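
One way such a gate could look, as a sketch only: wait_until_idle is named in the commit message, but ExecutionGate, interrupt_and_drain, and the condition-variable implementation below are assumptions, not the PR's BackendExecutionGate.

```python
import threading
from typing import Callable


class ExecutionGate:
    """Hypothetical gate: busy while a cell owns the websocket reader."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._busy = False

    def acquire(self) -> None:
        with self._cond:
            while self._busy:
                self._cond.wait()
            self._busy = True

    def release(self) -> None:
        with self._cond:
            self._busy = False
            self._cond.notify_all()

    def wait_until_idle(self, timeout: float) -> bool:
        """Return True once idle, or False if `timeout` seconds elapse first."""
        with self._cond:
            return self._cond.wait_for(lambda: not self._busy, timeout=timeout)


def interrupt_and_drain(
    send_frame: Callable[[], None], gate: ExecutionGate, timeout: float
) -> None:
    """Send the interrupt, then block until the interrupted cell has drained."""
    send_frame()
    if not gate.wait_until_idle(timeout):
        # The caller's cancellation path can treat this as "hard-abort".
        raise TimeoutError(f"execution did not drain within {timeout}s")
```

Returning only after the gate is idle is what guarantees the next request does not start a second concurrent recv on the same websocket.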

When the deterministic staging root was nested in a direct workspace mount it relocated to a random mkdtemp path each session, so a reattached container's bind mounts pointed at the prior session's vanished temp dir and the websocket supervisor never started (30s connect timeout). Relocate reusable named sandboxes to a stable per-name temp path instead.
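
A stable per-name path can be derived without any randomness, for example by hashing the sandbox name. This is a sketch under assumptions: stable_staging_root, the sbx-staging- prefix, and the use of SHA-256 are illustrative and not taken from the PR.

```python
import hashlib
import tempfile
from pathlib import Path


def stable_staging_root(name: str, prefix: str = "sbx-staging-") -> Path:
    """Return the same temp-dir path for the same sandbox name every session.

    Unlike tempfile.mkdtemp(), nothing here is random, so bind mounts
    recorded by an earlier session still point at a path that exists.
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:16]
    return Path(tempfile.gettempdir()) / f"{prefix}{digest}"
```

Hashing also sidesteps names that contain characters unsuitable for a directory name.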

…ure (#41)

A connect/handshake failure left the spawned supervisor alive with _websocket_url still set, so the next prewarm/execute on the same backend short-circuited relaunch and reconnected to the dead endpoint (another full startup timeout), while the failed attempt's process and published port lingered. Reset transport state on failure so the next attempt rebuilds. Also adds a real-sbx test that reattaches after an interpreter error and keeps using the sandbox.
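
The failure mode and the fix can be illustrated with a toy transport. SupervisorTransport and its injected launch/connect callables are hypothetical; only the _websocket_url attribute name comes from the commit message.

```python
from typing import Any, Callable, Optional, Tuple


class SupervisorTransport:
    """Toy transport: launch a supervisor, then connect to its URL."""

    def __init__(
        self,
        launch: Callable[[], Tuple[Any, str]],
        connect: Callable[[str], Any],
    ) -> None:
        self._launch = launch
        self._connect = connect
        self._proc: Optional[Any] = None
        self._websocket_url: Optional[str] = None
        self._ws: Optional[Any] = None

    def ensure_connected(self) -> Any:
        # A set URL short-circuits the relaunch, so it must never outlive
        # a failed connect attempt.
        if self._websocket_url is None:
            self._proc, self._websocket_url = self._launch()
        if self._ws is None:
            try:
                self._ws = self._connect(self._websocket_url)
            except Exception:
                self._reset()  # the fix: drop process + URL so a retry rebuilds
                raise
        return self._ws

    def _reset(self) -> None:
        if self._proc is not None:
            self._proc.terminate()
        self._proc = None
        self._websocket_url = None
        self._ws = None
```

Dropping the _reset() call in the except branch reproduces the bug in this sketch: the second call skips launch and reconnects to the dead URL.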

A terminal Ctrl-C interrupting an RLM turn delivers SIGINT to the whole foreground process group. The sbx exec supervisor subprocess was launched without start_new_session, so the Go child caught the signal, cancelled its context ("ERROR: context canceled"), and exited. Python only saw an asyncio.CancelledError during the LLM phase (no execute in flight), so the in-band #42 teardown never ran and the dead supervisor was handed back as healthy. The next request then failed with "Sbx supervisor exited unexpectedly". Launch all supervisor subprocesses with start_new_session=True so they run in their own process group, detached from the controlling terminal. Interruption now flows exclusively through the in-band websocket interrupt frame as #42 intended. shutdown() already kills by explicit pid, so teardown is unaffected.

…egistration

register_runtime_hooks pre-creates the in-sandbox kernel via _ensure_kernel without the host_tool_bridge, so the kernel forks with HOST_TOOL_REQUEST_QUEUE unset. Every later execute reuses that live kernel, so the bridge is never wired and _send_protocol writes tool calls to stdout (DEVNULL in websocket mode). Host tool calls vanish and the supervisor hangs until the watchdog fires. Thread host_tool_bridge through register_runtime_hooks so the kernel is wired to the bridge from creation.

magix022 added 5 commits June 16, 2026 17:18

test(sbx): cover reattach/persist lifecycle (#41)

0270101

feat(sbx): reattachable persistent sandboxes (#41)

d75346a

test(sbx): cover execution interrupt + cancellation-safe aexecute (#42)

da84a1c

feat(sbx): on-demand interrupt + cancellation-safe aexecute (#42)

df84ad0

magix022 marked this pull request as draft June 17, 2026 00:42

magix022 added 4 commits June 17, 2026 10:30

magix022 marked this pull request as ready for review June 17, 2026 16:40

magix022 force-pushed the feat/sbx-reattach-and-interrupt-41-42 branch 2 times, most recently from 1f525bf to 377bfdf Compare June 17, 2026 20:52

magix022 added 2 commits June 17, 2026 17:10

fix(predict-rlm): isolate reusable sbx websocket supervisors

1611d41

chore: deslop branch changes

511c28c

glesperance approved these changes Jun 18, 2026

magix022 merged commit cbd04cf into main Jun 18, 2026
7 of 8 checks passed

magix022 deleted the feat/sbx-reattach-and-interrupt-41-42 branch June 18, 2026 16:05

magix022 merged 11 commits into main from feat/sbx-reattach-and-interrupt-41-42
