docs: processless by design + how the caller adds hard process ceilings #135

Merged: ivarvong merged 5 commits into main from docs-processless-isolation on Jun 29, 2026
@ivarvong (Owner)

Why

A reader asked the right question after the host-safety work: the interpreter's resource ceilings are cooperative (checked between evaluation steps), so a single native op or a blocking NIF can run to completion before the next check. The truly input-independent guarantee — bound memory and time regardless of what the code does — comes from the runtime, not the interpreter.
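To make the gap concrete, here is a minimal sketch of a cooperative step budget. It is not Pyex's evaluator: `CooperativeSketch` and its `run/2` are hypothetical names invented for illustration. The budget is consulted only between steps, so one expensive step always runs to completion no matter how small the budget is.

```elixir
defmodule CooperativeSketch do
  @moduledoc "Illustrative only: a step budget that is checked between steps."

  # `steps` is a list of zero-arity funs standing in for evaluation steps.
  # The limit is tested before each step starts, never while one is running.
  def run(steps, max_steps) do
    Enum.reduce_while(steps, {:ok, 0}, fn step, {:ok, done} ->
      if done >= max_steps do
        {:halt, {:error, :step_limit}}
      else
        # However long this call takes, nothing interrupts it.
        step.()
        {:cont, {:ok, done + 1}}
      end
    end)
  end
end

# One "native op" that does a lot of work but counts as a single step.
heavy = fn -> Enum.sum(1..5_000_000) end

CooperativeSketch.run([heavy], 1)
# => {:ok, 1}

CooperativeSketch.run([heavy, heavy, heavy], 2)
# => {:error, :step_limit}
```

The first call finishes the whole `heavy` step even with a budget of one, which is the behavior a caller-side process ceiling is meant to bound.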

Pyex deliberately does not provide that itself: run/2 is a synchronous call on the caller's process, and Pyex.BannedCallTracer already enforces on every CI run that Process/Task/spawn never appear in lib/pyex. Keeping it processless is what makes it deterministic, embeddable, and pool-friendly. So the right move is to document how the caller adds the process layer, not to bake one in.

What this adds (docs only)

  • A new "Hard ceilings the caller adds (Pyex is processless by design)" subsection in the README sandbox model, explaining the two-layer resource model:
    • In-process, cooperative (Pyex): step/memory-estimate/output/call-depth/timeout, returning a clean %Pyex.Error{}.
    • Out-of-process, hard (caller): a GC-enforced max_heap_size (with kill: true and include_shared_binaries so large off-heap binaries count too) plus a wall-clock brutal-kill watchdog.
  • A ready-to-use SafeRunner wrapper (spawn_monitor so a guest OOM can't take the caller down, message-passed result, :out_of_memory / :timeout outcomes).
  • A pointer from the Pyex moduledoc so integrators find it from the API docs.
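The caller-side layer described above can be sketched roughly as follows. This is not the README's SafeRunner verbatim: the module name `SafeRunnerSketch`, its option names, and its `{:ok, result}` return shape are assumptions made for illustration, and it takes any zero-arity function so that it does not depend on Pyex's API. In real use the function would wrap the call to `run/2`. The `include_shared_binaries` key is only set on OTP 27 or later, where the sketch assumes it is available.

```elixir
defmodule SafeRunnerSketch do
  @moduledoc "Illustrative only: runs a fun in a monitored, heap-capped process."

  # max_heap_size is expressed in machine words, not bytes.
  @word_size :erlang.system_info(:wordsize)

  def run(fun, opts \\ []) when is_function(fun, 0) do
    timeout_ms = Keyword.get(opts, :timeout_ms, 1_000)
    max_heap_bytes = Keyword.get(opts, :max_heap_bytes, 40 * 1024 * 1024)
    caller = self()
    ref = make_ref()

    # Monitored, not linked: a killed worker cannot take the caller down.
    {pid, mon} =
      Process.spawn(
        fn -> send(caller, {ref, fun.()}) end,
        [:monitor, max_heap_size: heap_limit(max_heap_bytes)]
      )

    receive do
      {^ref, result} ->
        Process.demonitor(mon, [:flush])
        {:ok, result}

      # Nothing else kills the worker before the timeout, so :killed here
      # is taken to mean the runtime's heap limit fired.
      {:DOWN, ^mon, :process, ^pid, :killed} ->
        {:error, :out_of_memory}

      {:DOWN, ^mon, :process, ^pid, reason} ->
        {:error, {:crashed, reason}}
    after
      timeout_ms ->
        # Wall-clock watchdog: an untrappable kill, then wait for the exit.
        Process.exit(pid, :kill)

        receive do
          {:DOWN, ^mon, :process, ^pid, _} -> :ok
        end

        # Drop a result that raced with the kill.
        receive do
          {^ref, _} -> :ok
        after
          0 -> :ok
        end

        {:error, :timeout}
    end
  end

  defp heap_limit(bytes) do
    base = %{size: div(bytes, @word_size), kill: true, error_logger: false}
    otp = :erlang.system_info(:otp_release) |> List.to_integer()
    # Assumption: counting off-heap binaries needs OTP 27 or later.
    if otp >= 27, do: Map.put(base, :include_shared_binaries, true), else: base
  end
end
```

A call such as `SafeRunnerSketch.run(fn -> 42 end)` returns `{:ok, 42}`, while a function that never returns yields `{:error, :timeout}` once `timeout_ms` elapses. The heap limit is checked when the worker garbage-collects, so it bounds memory even if the guest's own accounting misses an allocation.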

Verified

The documented wrapper was exercised end to end, not just written:

  • happy path → {:ok, 42, _}
  • infinite loop with Pyex's own limits disabled → caller's {:error, :timeout} (watchdog fires)
  • unbounded allocation with Pyex's memory budget disabled, 40 MB heap cap → {:error, :out_of_memory} (GC kill fires)

No library code changes — Pyex stays processless. Independent of #134 (different files; no conflict).

🤖 Generated with Claude Code

https://claude.ai/code/session_019NokzcR7BiAigPgC78zpk9

ivarvong and others added 5 commits June 29, 2026 10:36
Pyex never spawns, links, or sleeps a process of its own — run/2 is a
synchronous call on the caller's process, and a CI static analyzer already
enforces that (Process/Task/spawn are banned in lib/pyex). That keeps the
interpreter deterministic, embeddable, and pool-friendly, and leaves the
process model to the caller.

Documents the two-layer resource model and gives a ready-to-use SafeRunner:
the interpreter's cooperative in-process limits, plus the caller's hard,
BEAM-enforced ceilings (max_heap_size with kill + include_shared_binaries,
and a wall-clock brutal-kill watchdog) that bound memory and time
regardless of what the guest does — including gaps in the cooperative
accounting and uninterruptible NIFs. The wrapper was verified end to end:
happy path returns, an infinite loop with limits disabled hits the
watchdog, and an unbounded allocation hits the heap kill.

Adds the guidance to the README sandbox model and a pointer from the Pyex
moduledoc. Docs only; no library code changes (Pyex stays processless).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019NokzcR7BiAigPgC78zpk9
…ledger

A small Plug/Bandit server demonstrating the processless model end to end:
each HTTP request is its own process, so the request handler is the
isolation boundary — it spawns a monitored worker, caps the worker's heap
(GC-enforced), runs the untrusted Python, and watchdogs the wall clock.

Returns proper status codes (200 ok / 400 python_error / 504 timeout /
507 out_of_memory), and on success the response also carries `trace`: the
host's own rendered span tree of every storage op the program caused —
unforgeable by the guest. Verified end to end over real HTTP (curl): a
storage program comes back with its db.set/db.get/db.delete spans, a
runaway loop with 504, a memory bomb with 507.

Referenced from the README sandbox section. Docs/example only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019NokzcR7BiAigPgC78zpk9
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019NokzcR7BiAigPgC78zpk9
507 (WebDAV "Insufficient Storage") and 504 (proxy "Gateway Timeout") were
the wrong shape for "the guest used too much memory / time" — those are
execution verdicts, not transport conditions, and there is no honest HTTP
code for them.

The HTTP status now describes the API call: running and bounding a job is a
successful request (200) whose verdict — ok / error / timeout /
out_of_memory — is a field in the body. 400 is reserved for a malformed HTTP
request (empty/oversized body) and 5xx for a fault in the sandbox itself.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019NokzcR7BiAigPgC78zpk9
…disciplined)

Reshapes the response around the principle that the HTTP status describes the
sandbox SERVICE, never the guest program — so the guest can't move the
operator's 5xx rate (and thus its circuit breakers, health checks, and pager).

Every job that runs returns 200 with a body envelope:
  { run_id, verdict: ok|error|timeout|out_of_memory|host_fault,
    stdout, value | error{type,kind,message,line},
    usage{steps,compute_ms,duration_ms,memory_bytes,output_bytes},
    trace }                       # the host capability ledger

usage + the capability ledger are folded in from Pyex's telemetry
([:pyex, :run, :stop] and [:pyex, :run, :exception]), captured per-worker, so
they're present even when the program FAILED — "what did it touch before it
crashed?" is exactly when the ledger matters. A host_fault (contained
interpreter bug) stays a 200 verdict but also fires a dedicated high-severity
log, keeping service-health and containment-health on separate channels.

400 is reserved for a malformed HTTP request; 5xx for a fault in the service
itself. Verified over curl: ok/error/timeout/out_of_memory all return 200 with
the verdict in the body, and an erroring program still returns its db.set/db.get
ledger + usage.
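
The status discipline above can be sketched as a pure function from an outcome to a status and body. `EnvelopeSketch` and the tagged tuples it accepts are hypothetical names chosen for illustration, not the example server's actual code. Only the verdict names and the 200 / 400 / 5xx split come from the description above.

```elixir
defmodule EnvelopeSketch do
  @moduledoc "Illustrative only: HTTP status describes the service, not the guest."

  @verdicts [:ok, :error, :timeout, :out_of_memory, :host_fault]

  # Any job that actually ran is a successful API call; the verdict is a body field.
  def respond({:ran, verdict, fields}) when verdict in @verdicts and is_map(fields) do
    {200, Map.put(fields, :verdict, verdict)}
  end

  # The HTTP request itself was malformed (for example an empty or oversized body).
  def respond({:bad_request, reason}), do: {400, %{error: reason}}

  # A fault in the sandbox service itself, never caused by the guest's verdict.
  def respond({:service_fault, _reason}), do: {500, %{error: :internal}}
end

EnvelopeSketch.respond({:ran, :timeout, %{run_id: "r1", stdout: ""}})
# => {200, %{run_id: "r1", stdout: "", verdict: :timeout}}
```

Because a guest can only ever influence the `verdict` field, it cannot move the operator's 5xx rate.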

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019NokzcR7BiAigPgC78zpk9
@ivarvong ivarvong merged commit 5dd4afb into main Jun 29, 2026
7 checks passed