Working toward a "subinterpreter-forkserver" spawning backend#447
Draft
goodboy wants to merge 78 commits intosubint_fork_backendfrom
Draft
Working toward a "subinterpreter-forkserver" spawning backend#447goodboy wants to merge 78 commits intosubint_fork_backendfrom
goodboy wants to merge 78 commits intosubint_fork_backendfrom
Conversation
418a7ca to
4425023
Compare
Standalone script to validate the "main-interp worker-thread
forkserver + subint-hosted trio" arch proposed as a workaround
to the CPython-level refusal doc'd in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
Deliberately NOT a `tractor` test — zero `tractor` imports.
Uses `_interpreters` (private stdlib) + `os.fork()` directly so
pass/fail is a property of CPython alone, independent of our
runtime. Requires py3.14+.
Deats,
- four scenarios via `--scenario`:
- `control_subint_thread_fork` — the KNOWN-BROKEN case as a
harness sanity; if the child DOESN'T abort, our analysis
is wrong
- `main_thread_fork` — baseline sanity, must always succeed
- `worker_thread_fork` — architectural assertion: regular
`threading.Thread` attached to main interp calls
`os.fork()`; child should survive post-fork cleanup
- `full_architecture` — end-to-end: fork from a main-interp
worker thread, then in child create a subint driving a
worker thread running `trio.run()`
- exit code 0 on EXPECTED outcome (for `control_*` that means
"child aborted", not "child succeeded")
- each scenario prints a self-contained pass/fail banner; use
`os.waitpid()` of the parent + per-scenario status prints to
observe the child's fate
Also, log NLNet provenance for this session's three-sub-phase
work (py3.13 gate tightening, `pytest-timeout` + marker
refactor, `subint_fork` prototype → CPython-block finding).
Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The smoketest (prior commit) empirically validated the
"fork-from-main-interp-worker-thread" arch on py3.14. Promote
the validated primitives out of the `ai/conc-anal/` smoketest
into `tractor.spawn._subint_forkserver` so they can eventually
be wired into a real "subint forkserver" spawn backend.
Deats,
- new module `tractor/spawn/_subint_forkserver.py` (337 LOC):
- `fork_from_worker_thread(child_target, thread_name)` —
spawn a main-interp `threading.Thread`, call `os.fork()`
from it, shuttle the child pid back to main via a pipe
- `run_trio_in_subint(bootstrap, ...)` — post-fork helper:
create a fresh subint + drive `_interpreters.exec()` on
a dedicated worker thread running the `bootstrap` str
(typically imports `trio`, defines an async entry, calls
`trio.run()`)
- `wait_child(pid, expect_exit_ok)` — `os.waitpid()` +
pass/fail classification reusable from harness AND the
eventual real spawn path
- feature-gated py3.14+ via the public
`concurrent.interpreters` presence check; matches the gate
in `tractor.spawn._subint`
- module docstring doc's the CPython-block context
(cross-refs `_subint_fork` stub + the two `conc-anal/`
docs) and status: EXPERIMENTAL, not yet registered in
`_spawn._methods`
Also, refactor the smoketest
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to
import the primitives from the new module rather than inline
its own copies. Keeps the smoketest and the tractor-side
impl in sync as the forkserver design evolves; the smoketest
remains a zero-`tractor`-runtime CPython-level check
(imports ONLY the three primitives, no runtime bring-up).
Status: next step is to drive these from a parent-side
`trio.run()` and hook the returned child pid into the normal
actor-nursery/IPC flow — then register `subint_forkserver`
as a `SpawnMethodKey` in `_spawn.py`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New pytest module `tests/spawn/test_subint_forkserver.py`
drives the forkserver primitives from inside a real
`trio.run()` in the parent — the runtime shape tractor will
actually use when we wire up a `subint_forkserver` spawn
backend proper. Complements the standalone no-trio-in-parent
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py`.
Deats,
- new test pkg `tests/spawn/` (+ empty `__init__.py`)
- two tests, both `@pytest.mark.timeout(30, method='thread')`
for the GIL-hostage safety reason doc'd in
`ai/conc-anal/subint_sigint_starvation_issue.md`:
- `test_fork_from_worker_thread_via_trio` — parent-side
plumbing baseline. `trio.run()` off-loads forkserver
prims via `trio.to_thread.run_sync()` + asserts the
child reaps cleanly
- `test_fork_and_run_trio_in_child` — end-to-end: forked
child calls `run_subint_in_worker_thread()` with a
bootstrap str that does `trio.run()` in a fresh subint
- both tests wrap the inner `trio.run()` in a
`dump_on_hang()` for post-mortem if the outer
`pytest-timeout` fires
- intentionally NOT using `--spawn-backend` — the tests
drive the primitives directly rather than going through
tractor's spawn-method registry (which the forkserver
isn't plugged into yet)
Also, rename `run_trio_in_subint()` →
`run_subint_in_worker_thread()` for naming consistency with
the sibling `fork_from_worker_thread()`. The action is really
"host a subint on a worker thread", not specifically "run
trio" — trio just happens to be the typical payload.
Propagate the rename to the smoketest.
Further, add a "TODO — cleanup gated on msgspec PEP 684
support" section to the `_subint_forkserver` module
docstring: flags the dedicated-`threading.Thread` design as
potentially-revisable once isolated-mode subints are viable
in tractor. Cross-refs `msgspec#563` + `tractor#379` and
points at an audit-plan conc-anal doc we'll add next.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up tracker companion to the module-docstring TODO
added in `372a0f32`. Catalogs why `_subint_forkserver`'s
two "non-trio thread" constraints
(`fork_from_worker_thread()` +
`run_subint_in_worker_thread()` both allocating dedicated
`threading.Thread`s; test helper named
`run_fork_in_non_trio_thread`) exist today, and which of
them would dissolve once msgspec PEP 684 support ships
(`msgspec#563`) and tractor flips to isolated-mode subints.
Deats,
- three reasons enumerated for the current constraints:
- class-A GIL-starvation — **fixed** by isolated mode:
subints don't share main's GIL so abandoned-thread
contention disappears
- destroy race / tstate-recycling from `subint_proc` —
**unclear**: `_PyXI_Enter` + `_PyXI_Exit` are
cross-mode, so isolated doesn't obviously fix it;
needs empirical retest on py3.14 + isolated API
- fork-from-main-interp-tstate (the CPython-level
`_PyInterpreterState_DeleteExceptMain` gate) — the
narrow reason for using a dedicated thread; **probably
fixed** IF the destroy-race also resolves (bc trio's
cache threads never drove subints → clean main-interp
tstate)
- TL;DR table of which constraints unwind under each
resolution branch
- four-step audit plan for when `msgspec#563` lands:
- flip `_subint` to isolated mode
- empirical destroy-race retest
- audit `_subint_forkserver.py` — drop `non_trio`
qualifier / maybe inline primitives
- doc fallout — close the three `subint_*_issue.md`
siblings w/ post-mortem notes
Also, cross-refs the three sibling `conc-anal/` docs, PEPs
684 + 734, `msgspec#563`, and `tractor#379` (the overall
subint spawn-backend tracking issue).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Promote `_subint_forkserver` from primitives-only into a
registered spawn backend: `'subint_forkserver'` is now a
`SpawnMethodKey` literal, dispatched via `_methods` to
the new `subint_forkserver_proc()` target, feature-gated
under the existing `subint`-family py3.14+ case, and
selectable via `--spawn-backend=subint_forkserver`.
Deats,
- new `subint_forkserver_proc()` spawn target in
`_subint_forkserver`:
- mirrors `trio_proc()`'s supervision model — real OS
subprocess so `Portal.cancel_actor()` + `soft_kill()`
on graceful teardown, `os.kill(SIGKILL)` on hard-reap
(no `_interpreters.destroy()` race to fuss over bc the
child lives in its own process)
- only real diff from `trio_proc` is the spawn mechanism:
fork from a main-interp worker thread via
`fork_from_worker_thread()` (off-loaded to trio's
thread pool) instead of `trio.lowlevel.open_process()`
- child-side `_child_target` closure runs
`tractor._child._actor_child_main()` with
`spawn_method='trio'` — the child is a regular trio
actor, "subint_forkserver" names how the parent
spawned, not what the child runs
- new `_ForkedProc` class — thin `trio.Process`-compatible
shim around a raw OS pid: `.poll()` via
`waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio
cache thread, `.kill()` via `SIGKILL`, `.returncode`
cached for repeat calls. `.stdin`/`.stdout`/`.stderr`
are `None` (fork-w/o-exec inherits parent FDs; we don't
marshal them) which matches `soft_kill()`'s `is not None`
guards
Also, new backend-tier test
`test_subint_forkserver_spawn_basic` drives the registered
backend end-to-end via `open_root_actor` + `open_nursery` +
`run_in_actor` w/ a trivial portal-RPC round-trip. Uses a
`forkserver_spawn_method` fixture to flip
`_spawn_method`/`_ctx` for the test's duration + restore on
teardown (so other session-level tests don't observe the
global flip). Test module docstring reworked to describe
the three tiers now covered: (1) primitive-level, (2)
parent-trio-driven primitives, (3) full registered backend.
Status: still-open work (tracked on `tractor#379`) doc'd
inline in the module docstring — no cancel/hard-kill stress
coverage yet, child-side subint-hosted root runtime still
future (gated on `msgspec#563`), thread-hygiene audit
pending the same unblock.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`os.fork()` inherits the parent's entire memory image, including `tractor.runtime._state` globals that encode "this process is the root actor" — `_runtime_vars`'s `_is_root=True`, pre-populated `_root_mailbox` + `_registry_addrs`, and the parent's `_current_actor` singleton. A fresh `exec`-based child starts with those globals at their module-level defaults (all falsey/empty). The forkserver child needs to match that shape BEFORE calling `_actor_child_main()`, otherwise `Actor.__init__()` takes the `is_root_process() == True` branch and pre-populates `self.enable_modules`, which then trips `assert not self.enable_modules` at the top of `Actor._from_parent()` on the subsequent parent→child `SpawnSpec` handshake. Fix: at the start of `_child_target`, null `_state._current_actor` and overwrite `_runtime_vars` with a cold-root blank (`_is_root=False`, empty mailbox/addrs, `_debug_mode=False`) before `_actor_child_main()` runs. Found-via: `test_subint_forkserver_spawn_basic` hitting the `enable_modules` assert on child-side runtime boot. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Post-fork `_runtime_vars` reset in `subint_forkserver_proc`
was previously done via direct mutation of
`_state._runtime_vars` from an external module + an inline
default dict duplicating the `_state.py`-internal defaults.
Split the access surface into a pure getter + explicit
setter so the reset call site becomes a one-liner
composition.
Deats `tractor/runtime/_state.py`,
- extract initial values into a module-level
`_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the
live `_runtime_vars` is now initialised from
`dict(_RUNTIME_VARS_DEFAULTS)`
- `get_runtime_vars()` grows a `clear_values: bool = False`
kwarg. When True, returns a fresh copy of
`_RUNTIME_VARS_DEFAULTS` instead of the live dict —
still a **pure read**, never mutates anything
- new `set_runtime_vars(rtvars: dict | RuntimeVars)` —
atomic replacement of the live dict's contents via
`.clear()` + `.update()`, so existing references to the
same dict object remain valid. Accepts either the
historical dict form or the `RuntimeVars` struct
Deats `tractor/spawn/_subint_forkserver.py`,
- collapse the prior ad-hoc `.update({...})` block into
`set_runtime_vars(get_runtime_vars(clear_values=True))`
- drop the `_state._current_actor = None` line —
`_trio_main` unconditionally overwrites it downstream,
so no explicit reset needed (noted in the XXX comment)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Tier-4 test `test_orphaned_subactor_sigint_cleanup_DRAFT`
documents an empirical SIGINT-delivery gap in the
`subint_forkserver` backend: when the parent dies via
`SIGKILL` (no IPC `Portal.cancel_actor()` possible) and
`SIGINT` is sent to the orphan child, the child DOES NOT
unwind — CPython's default `KeyboardInterrupt` is delivered
to `threading.main_thread()`, whose tstate is dead in the
post-fork child bc fork inherited the worker thread, not
main. Trio running on the fork-inherited worker thread
therefore never observes the signal. Marked
`xfail(strict=True)` so the mark flips to XPASS→fail once
the backend grows explicit SIGINT plumbing.
Deats,
- harness runs the failure-mode sequence out-of-process:
1. harness subprocess runs a fresh Python script
that calls `try_set_start_method('subint_forkserver')`
then opens a root actor + one `sleep_forever` subactor
2. parse `PARENT_READY=<pid>` + `CHILD_PID=<pid>` markers
off harness `stdout` to confirm IPC handshake
completed
3. `SIGKILL` the parent, `proc.wait()` to reap the
zombie (otherwise `os.kill(pid, 0)` keeps reporting
it alive)
4. assert the child survived the parent-reap (i.e. was
actually orphaned, not reaped too) before moving on
5. `SIGINT` the orphan child, poll `os.kill(child_pid, 0)`
every 100ms for up to 10s
- supporting helpers: `_read_marker()` with per-proc
bytes-buffer to carry partial lines across calls,
`_process_alive()` liveness probe via `kill(pid, 0)`
- Linux-only via `platform.system() != 'Linux'` skip —
orphan-reparenting semantics don't generalize to
other platforms
- port offset (`reg_addr[1] + 17`) so the harness listener
doesn't race concurrently-running backend tests
- best-effort `finally:` cleanup: `SIGKILL` any still-alive
pids + `proc.kill()` + bounded `proc.wait()` to avoid
leaking orphans across the session
Also, tier-4 header comment documents the cross-backend
generalization path: applicable to any multi-process
backend (`trio`, `mp_spawn`, `mp_forkserver`,
`subint_forkserver`), NOT to plain `subint` (in-process
subints have no orphan OS-child). Move path: lift
harness into `tests/_orphan_harness.py`, parametrize on
session `_spawn_method`, add
`skipif _spawn_method == 'subint'`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add configuration surface for future child-side SIGINT
plumbing in `subint_forkserver_proc` without wiring up the
actual trio-native SIGINT bridge — lifting one entry-guard
clause will flip the `'trio'` branch live once the
underlying fork-prelude plumbing is implemented.
Deats,
- new `ChildSigintMode = Literal['ipc', 'trio']` type +
`_DEFAULT_CHILD_SIGINT = 'ipc'` module-level default.
Docstring block enumerates both:
- `'ipc'` (default, currently the only implemented mode):
no child-side SIGINT handler — `trio.run()` is on the
fork-inherited non-main thread where
`signal.set_wakeup_fd()` is main-thread-only, so
cancellation flows exclusively via the parent's
`Portal.cancel_actor()` IPC path. Known gap: orphan
children don't respond to SIGINT
(`test_orphaned_subactor_sigint_cleanup_DRAFT`)
- `'trio'` (scaffolded only): manual SIGINT → trio-cancel
bridge in the fork-child prelude so external Ctrl-C
reaches stuck grandchildren even w/ a dead parent
- `subint_forkserver_proc` pulls `child_sigint` out of
`proc_kwargs` (matches how `trio_proc` threads config to
`open_process`, keeps `start_actor(proc_kwargs=...)` as
the ergonomic entry point); validates membership + raises
`NotImplementedError` for `'trio'` at the backend-entry
guard
- `_child_target` grows a `match child_sigint:` arm that
slots in the future `'trio'` impl without restructuring
— today only the `'ipc'` case is reachable
- module docstring "Still-open work" list grows a bullet
pointing at this config + the xfail'd orphan-SIGINT test
No behavioral change on the default path — `'ipc'` is the
existing flow. Scaffolding only.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Empirical follow-up to the xfail'd orphan-SIGINT test: the hang is **not** "trio can't install a handler on a non-main thread" (the original hypothesis from the `child_sigint` scaffold commit). On py3.14: - `threading.current_thread() is threading.main_thread()` IS True post-fork — CPython re-designates the fork-inheriting thread as "main" correctly - trio's `KIManager` SIGINT handler IS installed in the subactor (`signal.getsignal(SIGINT)` confirms) - the kernel DOES deliver SIGINT to the thread But `faulthandler` dumps show the subactor wedged in `trio/_core/_io_epoll.py::get_events` — trio's wakeup-fd mechanism (which turns SIGINT into an epoll-wake) isn't firing. So the `except KeyboardInterrupt` at `tractor/spawn/_entry.py::_trio_main:164` — the runtime's intentional "KBI-as-OS-cancel" path — never fires. Deats, - new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md` (+385 LOC): full writeup — TL;DR, symptom reproducer, the "intentional cancel path" the bug defeats, diagnostic evidence (`faulthandler` output + `getsignal` probe), ruled-out hypotheses (non-main-thread issue, wakeup-fd inheritance, KBI-as-trio-check-exception), and fix directions - `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail `reason` + test docstring rewritten to match the refined understanding — old wording blamed the non-main-thread path, new wording points at the `epoll_wait` wedge + cross-refs the new conc-anal doc - `_subint_forkserver` module docstring's `child_sigint='trio'` bullet updated: now notes trio's handler is already correctly installed, so the flag may end up a no-op / doc-only mode once the real root cause is fixed Closing the gap aligns with existing design intent (make the already-designed "KBI-as-OS-cancel" behavior actually fire), not a new feature. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.
Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
listing the spawn backends whose subactor runtime is trio-native
(`'trio'`, `'subint_forkserver'`). Both the enable-site
(`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
key.
off the same tuple — keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
reports the full compatible-set + the rejected method instead of the
stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
`'subint_forkserver'` to the existing `('trio', 'subint')` tuple
— fork child-side runtime receives the same SpawnSpec IPC handshake as
the others.
- `subint_forkserver_proc` child-target now passes
`spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
`Actor.pformat()` / log lines reflect the actual parent-side spawn
mechanism rather than masquerading as plain `trio`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up to 72d1b90 (was prev commit adding `debug_mode` for `subint_forkserver`): that commit wired the runtime-side `subint_forkserver` SpawnSpec-recv gate in `Actor._from_parent`, but the `subint_forkserver_proc` child-target was still passing `spawn_method='trio'` to `_trio_main` — so `Actor.pformat()` / log lines would report the subactor as plain `'trio'` instead of the actual parent-side spawn mechanism. Flip the label to `'subint_forkserver'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Also, some slight touchups in `.spawn._subint`.
Fork-based backends (esp. `subint_forkserver`) can leak child actor processes on cancelled / SIGINT'd test runs; the zombies keep the tractor default registry (`127.0.0.1:1616` / `/tmp/registry@1616.sock`) bound, so every subsequent session can't bind and 50+ unrelated tests fail with the same `TooSlowError` / "address in use" signature. Document the pre-flight + post-cancel check as a mandatory step 4. Deats, - **primary signal**: `ss -tlnp | grep ':1616'` for a bound TCP registry listener — the authoritative check since :1616 is unique to our runtime - `pgrep -af` scoped to `$(pwd)/py[0-9]*/bin/python.* _actor_child_main|subint-forkserv` for leftover actor/forkserver procs — scoped deliberately so we don't false-flag legit long-running tractor- embedding apps like `piker` - `ls /tmp/registry@*.sock` for stale UDS sockets - scoped cleanup recipe (SIGTERM + SIGKILL sweep using the same `$(pwd)/py*` pattern, UDS `rm -f`, re-verify) plus a fallback for when a zombie holds :1616 but doesn't match the pattern: `ss -tlnp` → kill by PID - explicit false-positive warning calling out the `piker` case (`~/repos/piker/py*/bin/python3 -m tractor._child ...`) so a bare `pgrep` doesn't lead to nuking unrelated apps Goal: short-circuit the "spelunking into test code" rabbit-hole when the real cause is just a leaked PID from a prior session, without collateral damage to other tractor-embedding projects on the same box. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
New `ai/conc-anal/ subint_forkserver_test_cancellation_leak_issue.md` captures a descendant-leak surfaced while wiring `subint_forkserver` into the full test matrix: running `tests/test_cancellation.py` under `--spawn-backend=subint_forkserver` reproducibly leaks **exactly 5** `subint-forkserv` comm-named child processes that survive session exit, each holding a `LISTEN` on `:1616` (the tractor default registry addr) — and therefore poisons every subsequent test session that defaults to that addr. Deats, - TL;DR + ruled-out checks confirming the procs are ours (not piker / other tractor-embedding apps) — `/proc/$pid/cmdline` + cwd both resolve to this repo's `py314/` venv - root cause: `_ForkedProc.kill()` is PID-scoped (plain `os.kill(SIGKILL)` to the direct child), not tree-scoped — grandchildren spawned during a multi-level cancel test get reparented to init and inherit the registry listen socket - proposed fix directions ranked: (1) put each forkserver-spawned subactor in its own process- group (`os.setpgrp()` in fork-child) + tree-kill via `os.killpg(pgid, SIGKILL)` on teardown, (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit `/proc/<pid>/task/*/children` walk. Vote: (1) — POSIX-standard, aligns w/ `start_new_session=True` semantics in `subprocess.Popen` / trio's `open_process` - inline reproducer + cleanup recipe scoped to `$(pwd)/py314/bin/python.*pytest.*spawn-backend= subint_forkserver` so cleanup doesn't false-flag unrelated tractor procs (consistent w/ `run-tests` skill's zombie-check guidance) Stopgap hygiene fix (wiring `reg_addr` through the 5 leaky tests in `test_cancellation.py`) is incoming as a follow-up — that one stops the blast radius, but zombies still accumulate per-run until the real tree-kill fix lands. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Stopgap companion to d012196 (`subint_forkserver` test-cancellation leak doc): five tests in `tests/test_cancellation.py` were running against the default `:1616` registry, so any leaked `subint-forkserv` descendant from a prior test holds the port and blows up every subsequent run with `TooSlowError` / "address in use". Thread the session-unique `reg_addr` fixture through so each run picks its own port — zombies can no longer poison other tests (they'll only cross-contaminate whatever happens to share their port, which is now nothing). Deats, - add `reg_addr: tuple` fixture param to: - `test_cancel_infinite_streamer` - `test_some_cancels_all` - `test_nested_multierrors` - `test_cancel_via_SIGINT` - `test_cancel_via_SIGINT_other_task` - explicitly pass `registry_addrs=[reg_addr]` to the two `open_nursery()` calls that previously had no kwargs at all (in `test_cancel_via_SIGINT` and `test_cancel_via_SIGINT_other_task`) - add bounded `@pytest.mark.timeout(7, method='thread')` to `test_nested_multierrors` so a hung run doesn't wedge the whole session Still doesn't close the real leak — the `subint_forkserver` backend's `_ForkedProc.kill()` is PID-scoped not tree-scoped, so grandchildren survive teardown regardless of registry port. This commit is just blast-radius containment until that fix lands. See `ai/conc-anal/ subint_forkserver_test_cancellation_leak_issue.md`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
The previous cleanup recipe went straight to SIGTERM+SIGKILL, which hides bugs: tractor is structured concurrent — `_trio_main` catches SIGINT as an OS-cancel and cascades `Portal.cancel_actor` over IPC to every descendant. So a graceful SIGINT exercises the actual SC teardown path; if it hangs, that's a real bug to file (the forkserver `:1616` zombie was originally suspected to be one of these but turned out to be a teardown gap in `_ForkedProc.kill()` instead). Deats, - step 1: `pkill -INT` scoped to `$(pwd)/py*` — no sleep yet, just send the signal - step 2: bounded wait loop (10 × 0.3s = ~3s) using `pgrep` to poll for exit. Loop breaks early on clean exit - step 3: `pkill -9` only if graceful timed out, w/ a logged escalation msg so it's obvious when SC teardown didn't complete - step 4: same SIGINT-first ladder for the rare `:1616`-holding zombie that doesn't match the cmdline pattern (find PID via `ss -tlnp`, then `kill -INT NNNN; sleep 1; kill -9 NNNN`) - steps 5-6: UDS-socket `rm -f` + re-verify unchanged Goal: surface real teardown bugs through the test- cleanup workflow instead of papering over them with `-9`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Major rewrite of `subint_forkserver_test_cancellation_leak_issue.md` after empirical investigation revealed the earlier "descendant-leak + missing tree-kill" diagnosis conflated two unrelated symptoms: 1. **5-zombie leak holding `:1616`** — turned out to be a self-inflicted cleanup bug: `pkill`-ing a bg pytest task (SIGTERM/SIGKILL, no SIGINT) skipped the SC graceful cancel cascade entirely. Codified the real fix — SIGINT-first ladder w/ bounded wait before SIGKILL — in e5e2afb (`run-tests` SKILL) and `feedback_sc_graceful_cancel_first.md`. 2. **`test_nested_multierrors[subint_forkserver]` hangs indefinitely** — the actual backend bug, and it's a deadlock not a leak. Deats, - new diagnosis: all 5 procs are kernel-`S` in `do_epoll_wait`; pytest-main's trio-cache workers are in `os.waitpid` waiting for children that are themselves waiting on IPC that never arrives — graceful `Portal.cancel_actor` cascade never reaches its targets - tree-structure evidence: asymmetric depth across two identical `run_in_actor` calls — child 1 (3 threads) spawns both its grandchildren; child 2 (1 thread) never completes its first nursery `run_in_actor`. Smells like a race on fork- inherited state landing differently per spawn ordering - new hypothesis: `os.fork()` from a subactor inherits the ROOT parent's IPC listener FDs transitively. Grandchildren end up with three overlapping FD sets (own + direct-parent + root), so IPC routing becomes ambiguous. Predicts bug scales with fork depth — matches reality: single- level spawn works, multi-level hangs - ruled out: `_ForkedProc.kill()` tree-kill (never reaches hard-kill path), `:1616` contention (fixed by `reg_addr` fixture wiring), GIL starvation (each subactor has its own OS process+GIL), child-side KBI absorption (`_trio_main` only catches KBI at `trio.run()` callsite, reached only on trio-loop exit) - four fix directions ranked: (1) blanket post-fork `closerange()`, (2) `FD_CLOEXEC` + audit, (3) targeted FD cleanup via `actor.ipc_server` handle, (4) `os.posix_spawn` w/ `file_actions`. Vote: (3) — surgical, doesn't break the "no exec" design of `subint_forkserver` - standalone repro added (`spawn_and_error(breadth= 2, depth=1)` under `trio.fail_after(20)`) - stopgap: skip `test_nested_multierrors` + multi- level-spawn tests under the backend via `@pytest.mark.skipon_spawn_backend(...)` until fix lands Killing the "tree-kill descendants" fix-direction section: it addressed a bug that didn't exist. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Implements fix-direction (1)/blunt-close-all-FDs from b71705b (`subint_forkserver` nested-cancel hang diag), targeting the multi-level cancel-cascade deadlock in `test_nested_multierrors[subint_forkserver]`. The diagnosis doc voted for surgical FD cleanup via `actor.ipc_server` handle as the cleanest approach, but going blunt is actually the right call: after `os.fork()`, the child immediately enters `_actor_child_main()` which opens its OWN IPC sockets / wakeup-fd / epoll-fd / etc. — none of the parent's FDs are needed. Closing everything except stdio is safe AND defends against future listener/IPC additions to the parent inheriting silently into children. Deats, - new `_close_inherited_fds(keep={0,1,2}) -> int` helper. Linux fast-path enumerates `/proc/self/fd`; POSIX fallback uses `RLIMIT_NOFILE` range. Matches the stdlib `subprocess._posixsubprocess.close_fds` strategy. Returns close-count for sanity logging - wire into `fork_from_worker_thread._worker()`'s post-fork child prelude — runs immediately after the pid-pipe `os.close(rfd/wfd)`, before the user `child_target` callable executes - docstring cross-refs the diagnosis doc + spells out the FD-inheritance-cascade mechanism and why the close-all approach is safe for our spawn shape Validation pending: re-run `test_nested_multierrors[subint_forkserver]` to confirm the deadlock is gone. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Two coordinated improvements to the `subint_forkserver` backend: 1. Replace `trio.to_thread.run_sync(os.waitpid, ..., abandon_on_cancel=False)` in `_ForkedProc.wait()` with `trio.lowlevel.wait_readable(pidfd)`. The prior version blocked a trio cache thread on a sync syscall — outer cancel scopes couldn't unwedge it when something downstream got stuck. Same pattern `trio.Process.wait()` and `proc_waiter` (the mp backend) already use. 2. Drop the `@pytest.mark.xfail(strict=True)` from `test_orphaned_subactor_sigint_cleanup_DRAFT` — the test now PASSES after 0cd0b63 (fork-child FD scrub). Same root cause as the nested-cancel hang: inherited IPC/trio FDs were poisoning the child's event loop. Closing them lets SIGINT propagation work as designed. Deats, - `_ForkedProc.__init__` opens a pidfd via `os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+) - `wait()` parks on `trio.lowlevel.wait_readable()`, then non-blocking `waitpid(WNOHANG)` to collect the exit status (correct since the pidfd signal IS the child-exit notification) - `ChildProcessError` swallow handles the rare race where someone else reaps first - pidfd closed after `wait()` completes (one-shot semantics) + `__del__` belt-and-braces for unexpected-teardown paths - test docstring's `@xfail` block replaced with a `# NOTE` comment explaining the historical context + cross-ref to the conc-anal doc; test remains in place as a regression guard The two changes are interdependent — the cancellable `wait()` matters for the same nested- cancel scenarios the FD scrub fixes, since the original deadlock had trio cache workers wedged in `os.waitpid` swallowing the outer cancel. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Completes the nested-cancel deadlock fix started in 0cd0b63 (fork-child FD scrub) and fe540d0 (pidfd- cancellable wait). The remaining piece: the parent- channel `process_messages` loop runs under `shield=True` (so normal cancel cascades don't kill it prematurely), and relies on EOF arriving when the parent closes the socket to exit naturally. Under exec-spawn backends (`trio_proc`, mp) that EOF arrival is reliable — parent's teardown closes the handler-task socket deterministically. But fork- based backends like `subint_forkserver` share enough process-image state that EOF delivery becomes racy: the loop parks waiting for an EOF that only arrives after the parent finishes its own teardown, but the parent is itself blocked on `os.waitpid()` for THIS actor's exit. Mutual wait → deadlock. Deats, - `async_main` stashes the cancel-scope returned by `root_tn.start(...)` for the parent-chan `process_messages` task onto the actor as `_parent_chan_cs` - `Actor.cancel()`'s teardown path (after `ipc_server.cancel()` + `wait_for_shutdown()`) calls `self._parent_chan_cs.cancel()` to explicitly break the shield — no more waiting for EOF delivery, unwinding proceeds deterministically regardless of backend - inline comments on both sites explain the mutual- wait deadlock + why the explicit cancel is backend-agnostic rather than a forkserver-specific workaround With this + the prior two fixes, the `subint_forkserver` nested-cancel cascade unwinds cleanly end-to-end. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:
1. Skip-mark the test via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test
matrix while the remaining bug is being chased.
The reason string cross-refs the conc-anal doc
for full context.
2. Update the conc-anal doc
(`subint_forkserver_test_cancellation_leak_issue.md`) with the
empirical state after the three nested- cancel fix commits
(`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
shield break) landed, narrowing the remaining hang from "everything
broken" to "peer-channel loops don't exit on `service_tn` cancel".
Deats from the DIAGDEBUG instrumentation pass,
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
`_parent_chan_cs.cancel()` wiring from `57935804`
works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
channel handlers in `handle_stream_from_peer`
(inbound connections handled by
`stream_handler_tn`, which IS `service_tn` in the
current config).
- after `_parent_chan_cs.cancel()` fires, NEW
shielded loops appear on the session reg_addr
port — probably discovery-layer reconnection;
doesn't block teardown but indicates the cascade
has more moving parts than expected.
The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded, they're
inside the service_tn scope, a standard cancel should propagate through.
Some fork-config-specific divergence keeps them alive. Doc lists three
follow-up experiments (stackscope dump, side-by-side `trio_proc`
comparison, audit of the `tractor/ipc/_server.py:448` `except
trio.Cancelled:` path).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two new sections in `subint_forkserver_test_cancellation_leak_issue.md` documenting continued investigation of the `test_nested_multierrors[subint_forkserver]` peer- channel-loop hang: 1. **"Attempted fix (DID NOT work) — hypothesis (3)"**: tried sync-closing peer channels' raw socket fds from `_serve_ipc_eps`'s finally block (iterate `server._peers`, `_chan._transport. stream.socket.close()`). Theory was that sync close would propagate as `EBADF` / `ClosedResourceError` into the stuck `recv_some()` and unblock it. Result: identical hang. Either trio holds an internal fd reference that survives external close, or the stuck recv isn't even the root blocker. Either way: ruled out, experiment reverted, skip-mark restored. 2. **"Aside: `-s` flag changes behavior for peer- intensive tests"**: noticed `test_context_stream_semantics.py` under `subint_forkserver` hangs with default `--capture=fd` but passes with `-s` (`--capture=no`). Working hypothesis: subactors inherit pytest's capture pipe (fds 1,2 — which `_close_inherited_fds` deliberately preserves); verbose subactor logging fills the buffer, writes block, deadlock. Fix direction (if confirmed): redirect subactor stdout/stderr to `/dev/null` or a file in `_actor_child_main`. Not a blocker on the main investigation; deserves its own mini-tracker. Both sections are diagnosis-only — no code changes in this commit. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Three places that previously swallowed exceptions silently now log via `log.exception()` so they surface in the runtime log when something weird happens — easier to track down sneaky failures in the fork-from-worker-thread / subint-bootstrap primitives. Deats, - `_close_inherited_fds()`: post-fork child's per-fd `os.close()` swallow now logs the fd that failed to close. The comment notes the expected failure modes (already-closed-via-listdir-race, otherwise-unclosable) — both still fine to ignore semantically, but worth flagging in the log. - `fork_from_worker_thread()` parent-side timeout branch: the `os.close(rfd)` + `os.close(wfd)` cleanup now logs each pipe-fd close failure separately before raising the `worker thread didn't return` RuntimeError. - `run_subint_in_worker_thread._drive()`: when `_interpreters.exec(interp_id, bootstrap)` raises a `BaseException`, log the full call signature (interp_id + bootstrap) along with the captured exception, before stashing into `err` for the outer caller. Behavior unchanged — only adds observability. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
goodboy
commented
Apr 24, 2026
| # Revisit `subint_forkserver` thread-cache constraints once msgspec PEP 684 support lands | ||
|
|
||
| Follow-up tracker for cleanup work gated on the msgspec | ||
| PEP 684 adoption upstream ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)). |
Owner
Author
There was a problem hiding this comment.
update this to the new subints compat issue we make @ msgspec.
Rename `tests/spawn/test_subint_forkserver.py` → `test_main_thread_forkserver.py` and migrate its imports + internal refs to the new canonical names: - `fork_from_worker_thread`, `wait_child` → from `tractor.spawn._main_thread_forkserver`. - `run_subint_in_worker_thread` → still from `_subint_forkserver` (variant-2 primitive). - Module docstring + tier-3 fixture + the `*_spawn_basic` test fn renamed for variant-1-honesty. - Orphan-harness subprocess argv flipped from `'subint_forkserver'` → `'main_thread_forkserver'`. `ai/conc-anal/subint_fork_from_main_thread_smoketest.py` imports split the same way. `tractor/spawn/_subint_forkserver.py` drops the backward- compat re-exports of the fork primitives — the only consumers (test file + smoketest) now import from `_main_thread_forkserver` directly. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
After the variant-1 / variant-2 backend split, update remaining string-match refs to the variant-1 backend so user-visible gates + skip-marks + comments name the working backend correctly: - `tractor._root._DEBUG_COMPATIBLE_BACKENDS`: include `main_thread_forkserver`, drop the stub-only `subint_forkserver` entry. - `tests/test_spawning.py::test_loglevel_propagated_to_subactor`: capfd-skip flips to `main_thread_forkserver`. - `tests/test_infected_asyncio.py::test_sigint_closes_lifetime_stack`: xfail-condition flips to `main_thread_forkserver`. - `tests/test_shm.py`: drop stale "broken on `main_thread_forkserver`" reason-text since the `mp.SharedMemory(track=False)` + resource-tracker monkey-patch in `.ipc._mp_bs` makes the tests pass; the skip-mark only fires on plain `subint` now. - Comment / docstring sweep: `runtime._state`, `runtime._runtime`, `_testing.pytest`, `_subint.py`, `pyproject.toml`, `test_cancellation.py`, `test_registrar.py` — refs to variant-1 backend updated. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Add `test_subint_forkserver_key_errors_cleanly` — a tn-tier regression guard that pins down the variant-2 reservation contract: the `'subint_forkserver'` key in `_spawn._methods` MUST raise `NotImplementedError` today, not silently dispatch to `main_thread_forkserver_proc`. The transient alias-state existed briefly during the rename (commit `57dae0e4`'s "Split forkserver backend into variant 1/2 mods" landed the alias; `5e83881f` flipped it to the stub). Without a guard, a future refactor could easily re-collapse the two keys back to a single coro and silently break the variant-1 / variant-2 contract. Also asserts the stub's error msg surfaces the two pointers an operator hitting it actually needs: - `'main_thread_forkserver'` — the working backend they prolly meant, - `'msgspec#1026'` — the upstream blocker that has to land before variant-2 can ship. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Previously the random port was a default-arg expression (`_rando_port: str = random.randint(1000, 9999)`) — evaluated ONCE at module import time, making it a per-process singleton. Two parallel pytest sessions had a 1/9000 birthday-pair chance of picking the same port; when it hit, every `reg_addr`-using test in BOTH runs would cascade-fail with "Address already in use". Switch to per-call `random.randint()` salted with `os.getpid()` so: - within one session: two calls return distinct ports — e.g. `test_tpt_bind_addrs::bind-subset-reg` now actually gets two different reg addrs on the TCP backend (it was silently duplicating before), - across parallel sessions: pid salt biases each process's port choices apart, making cross-run collisions vanishingly rare. Drop the bogus `: str` annotation (was always `int`). UDS already gets per-process isolation via `UDSAddress.get_random()`'s `@<pid>` socket-path suffix, so no change needed there. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Function-scoped, NON-autouse zombie-subactor reaper for modules whose teardown is known-leaky enough to cascade- fail every following test in a session. Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The session-scoped one fires at session end — too late to save tests that follow a hung/leaky test in the suite. The new fixture, opted into via `pytestmark = pytest.mark.usefixtures(...)`, runs between tests in a problem-module so a leftover subactor from test N can't squat on registrar ports / UDS paths / shm segments needed by tests N+1, N+2, ... Intentionally NOT autouse — the fixture's presence on a module signals "this module's teardown leaks; please root-cause instead of relying forever on cleanup". A visibility-vs-convenience trade picked in favor of the former. Apply to `tests/test_infected_asyncio.py` since both recent full-suite runs (parallel-tpt-proto + TCP-only) showed the cascade originating in this file's KBI- and SIGINT-flavored tests under `main_thread_forkserver`. Module-comment names the specific offenders so future de-flake work has a starting point. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Drop `@pytest.mark.timeout(...)` for the per-test wall-clock
cap on `test_dynamic_pub_sub`; rely on `trio.fail_after(12)`
inside `main()` instead.
Both pytest-timeout enforcement modes are incompatible with
trio under fork-based backends:
- `method='signal'` (SIGALRM) synchronously raises `Failed`
in trio's main thread mid-`epoll.poll()`, leaving
`GLOBAL_RUN_CONTEXT` half-installed ("Trio guest run got
abandoned") so EVERY subsequent `trio.run()` in the same
pytest process bails with
`RuntimeError: Attempted to call run() from inside a run()`
— full-session poison.
- `method='thread'` calls `_thread.interrupt_main()` which
can let the KBI escape trio's `KIManager` under fork-
cascade teardown races and bubble out of pytest entirely
— kills the whole session.
`trio.fail_after()` keeps cancellation inside the trio loop:
- Raises `TooSlowError` cleanly through the open-nursery's
cancel cascade.
- Doesn't disturb any out-of-band signal/thread state.
- Failure stays scoped to the single test — no cross-test
global state corruption either way.
Verified empirically: 10 hammer-runs of `test_dynamic_pub_sub`
go from 5/10 fail (with global-state poison) to 3/10 fail
(no poison, all sibling tests still pass). The ~30%
remaining flake rate is a genuine fork-cancel-cascade
hang — separate from this fix but no longer contaminates.
Module-level NOTE comment explains the rationale so future
readers don't re-introduce the bug.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`timeout = 200` was firing via SIGALRM (the default `method='signal'`) which synchronously raises `Failed` in trio's main thread mid-`epoll.poll()`, abandoning trio's runner mid-flight and leaving `GLOBAL_RUN_CONTEXT` half- installed. EVERY subsequent `trio.run()` in the same pytest session then bails with `RuntimeError: Attempted to call run() from inside a run()`. Empirical impact: a session that hits a single 200s hang cascades into 30-40 false-positive failures across every downstream test file that uses `trio.run`. Recent UDS run saw 1 real timeout (`test_unregistered_err_still_relayed`) poison 38 sibling tests with cascade-fails — a debugging nightmare. Same architectural bug we already documented in `tests/test_advanced_streaming.py::test_dynamic_pub_sub` (see its module-level NOTE) — both `pytest-timeout` enforcement modes are incompatible with trio under fork- based spawn backends. Now scoped session-wide. For tests that legitimately need a wall-clock cap, the canonical pattern is `with trio.fail_after(N):` INSIDE the test — trio's own `Cancelled` machinery cleanly unwinds the actor nursery without disturbing global state. For CI: rely on job-level wall-clock timeouts (e.g. GitHub Actions `timeout-minutes`) to abort genuinely-stuck suites. `pyproject.toml` comment block spells this all out so a future contributor doesn't reach back for `timeout =` and re-introduce the bug. ALSO, bump `xonsh` to at least `0.23.0` release. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Default `timeout` from `int = 3` → `int|None = None`; when unset, pick a backend-aware value. Fork-based backends (`main_thread_forkserver`) need real headroom bc actor spawn + IPC ctx-exit + msg-validation error path is much heavier than under `trio` backend — especially under cross-pytest-stream contention (#451). Defaults: - `main_thread_forkserver` → 30s - everything else → 3s (unchanged) Empirical flake history that motivated 30s as the floor on fork backends (all from `test_basic_payload_spec`): - 3s → all-valid variant flaked w/ `TooSlowError` - 8s → `invalid-return` variant flaked w/ `Cancelled` (surfaced instead of `MsgTypeError` bc the outer `fail_after` fired mid-error-path) - 15s → flaked under cross-pytest-stream contention 30s gives plenty of headroom while still failing-loud on a genuine hang. Callers can opt out by passing an explicit `timeout=` kw. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Mirror `060f7d24`'s pattern (backend-aware timeout in `maybe_expect_raises`) for `test_dynamic_pub_sub`'s hard `trio.fail_after` cap. Fork-based backends pay per-spawn fork+IPC-handshake cost which stacks over `cpus - 1` sequential `n.run_in_actor()` calls; empirically 12s flakes on `main_thread_forkserver` under UDS cross-pytest contention (#451 / #452). Defaults: - `main_thread_forkserver` → 30s - everything else → 12s (unchanged) Hoist the timeout-pick out of the `main()` closure so the dispatch happens once in the trio task rather than re-evaluating per spawn. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
New `--enable-stackscope` CLI flag installs a SIGUSR1 →
trio-task-tree-dump handler in pytest itself + every
spawned subactor for live stack visibility during hang
investigations. Lighter than `--tpdb` (no pdb machinery
/ tty-lock contention) — pure stack-only triage.
Plumbing:
- `_testing.pytest.pytest_addoption()` adds the flag.
- `_testing.pytest.pytest_configure()` (when flag set):
* exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children
inherit it via environ,
* installs the handler in pytest itself via
`enable_stack_on_sig()`.
- `runtime._runtime.Actor.async_main()` extends the
existing `_debug_mode` gate to ALSO fire when
`TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors
install the same handler at runtime startup.
Capture-bypass tee in `dump_task_tree()`:
Pytest's default `--capture=fd` swallows `log.devx()`
output, making SIGUSR1 dumps invisible right when you
need them. Render the dump once to a `full_dump` str,
then unconditionally tee to:
- `/tmp/tractor-stackscope-<pid>.log` (append-mode,
always written) — guaranteed-readable artifact even
under CI / `nohup` / no-tty. `tail -f` to follow.
- `/dev/tty` (best-effort) — pytest never captures the
tty; ignored if device is missing.
Other,
- squelch the benign `RuntimeWarning` ("coroutine method
'asend'/'athrow' was never awaited") from
`stackscope._glue`'s import-time async-gen type
introspection so `--enable-stackscope` setup stays
quiet.
- log msg in the `_runtime` ImportError branch now
mentions `--enable-stackscope` alongside debug-mode.
Usage,
pytest --enable-stackscope -k <hang-test>
# in another shell, find the pid + signal:
kill -USR1 <pytest-or-subactor-pid>
# tail the artifact:
tail -f /tmp/tractor-stackscope-<pid>.log
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two cleanup tweaks in `_main_thread_forkserver`: Doc, "what survives the fork?" section — expand the "non-calling threads are gone in the child" claim with the precise execution-vs-memory split that reconciles this module's prior framing with trio's (canonical [python-trio/trio#1614][trio-1614]) "leaked stacks" framing: - execution-side: only the calling thread runs post-fork; all others never execute another instruction. - memory-side: those non-running threads' stacks + per-thread heap structures are still COW-inherited as orphaned bytes — what trio means by "leaked". Same POSIX reality, opposite sides; the table is extended to a 4-col `parent | child (executing) | child (memory)` layout to make both views explicit. Also blank-line-padded the bulleted hazard classes for cleaner markdown rendering. [trio-1614]: python-trio/trio#1614 Code, `_close_inherited_fds()` log noise — split the catch-all `except OSError` into: - `EBADF` — benign race where the dirfd that `os.listdir('/proc/self/fd')` itself opened ends up in `candidates`, then auto-closes before the loop reaches it. Demote to `log.debug()` + `continue`; prior `log.exception` drowned the post-fork log channel with stack traces every spawn. - other errnos (EIO / EPERM / EINTR / ...) keep the loud `log.exception` surface — those ARE genuinely unexpected. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Signal handlers fire in a non-trio stack frame; calling
`stackscope.extract(recurse_child_tasks=True)` from there
only walks the `<init>` task and misses everything inside
`async_main`'s nurseries — exactly the part you want to
see during a hang.
Fix: capture `trio.lowlevel.current_trio_token()` at
`enable_stack_on_sig()` time and stash it as a module-
level `_trio_token`. The SIGUSR1 handler then dispatches
the dump *onto* the trio loop via
`_trio_token.run_sync_soon(_safe_dump_task_tree)`, so
`stackscope.extract` runs from a real trio-task context
and walks the full nursery tree.
Late-binding: pytest's `pytest_configure` calls
`enable_stack_on_sig()` outside any `trio.run`, so token
capture there is a `RuntimeError` — left at `None`. The
runtime re-calls `enable_stack_on_sig()` from inside
`async_main` (subactor side) where the token IS
available, so subactors get the full-tree path.
`dump_tree_on_sig` falls back to a direct call when
`_trio_token is None` (parent process pre-trio.run, or
signal delivered after `trio.run` returns).
`_safe_dump_task_tree()` is a `run_sync_soon`-friendly
wrapper that swallows any exception from
`dump_task_tree()` — trio prints + crashes on uncaught
exceptions in scheduled callbacks; better to log + keep
the run alive so the user can re-trigger.
Other,
- emit `capture-bypass tee: <fpath>` line + `tail -f`
hint in the rendered dump header so users know where
to find the artifact even when stdio is captured.
- swap the inline `f' |_{actor}'` line for a
`_pformat.nest_from_op` rendering of `actor_repr`
(matches the rest of the runtime's nested-op style).
- log lines on handler install + already-installed
branches now note `(trio_token captured: <bool>)`
so it's obvious from the log whether the full-tree
path is wired.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add env-var overrides inside `._root.open_root_actor()` so devs/test-runs can swap the actor-spawn backend or crank console verbosity *without* touching application code. In `._root.open_root_actor()`, - read `TRACTOR_LOGLEVEL` early, overriding any caller-passed `loglevel` and stashing an `env_ll_report` to emit once the console log is set up. - pull the `loglevel` fallback (`or _default_loglevel`) and `log.get_console_log()` init *up* so the env-var report routes through tractor's own logger. - read `TRACTOR_SPAWN_METHOD`, overriding any caller-passed `start_method` and warn-logging when the env-var clobbers an explicit caller value. Wire the same vars through `tests/devx/conftest.py::spawn`, - request the `loglevel` fixture, set both `TRACTOR_LOGLEVEL` and `TRACTOR_SPAWN_METHOD` in `os.environ` before each `pexpect.spawn()` (inherited by the example subproc). - expand `supported_spawners` to include `main_thread_forkserver` and `subint_forkserver` bc example scripts no longer need per-script CLI plumbing. - pop both vars in fixture teardown so a leaked value can't re-route a later in-process tractor test's spawn-backend or loglevel. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
In `pyproject.toml`, - include the `sync_pause` group from `dev`, so dev installs ship `greenback` for `pause_from_sync()`. Comment out per-test `@pytest.mark.timeout(...)` markers in, - `tests/devx/test_debugger.py` - `tests/discovery/test_registrar.py` - `tests/spawn/test_main_thread_forkserver.py` - `tests/spawn/test_subint_cancellation.py` - `tests/test_advanced_streaming.py` - `tests/test_cancellation.py` The global cap was already dropped (3c366ca); these were the leftover per-test caps which now block interactive `pdb` flows under the new spawn backends. In `uv.lock`, - pull `greenback` into the resolved `dev` deps (per the `sync_pause` include above). - catch up the prior `xonsh` editable→PyPI switch (from the `pyproject.toml` `tool.uv.sources` edit). (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
`main_thread_forkserver` doesn't actually need py3.14 `concurrent.interpreters` (PEP 734) — it forks from a non-trio worker thread and runs `_trio_main` in the child, same shape as `trio_proc`. The previous `_has_subints` gate + subint-family `case` arm were a copy-paste error. In `tractor.spawn._main_thread_forkserver`, - drop the `_has_subints` import + the `RuntimeError` raise in `main_thread_forkserver_proc()`. - drop the now-unused `import sys` (only used by the prior error msg). In `tractor.spawn._spawn.try_set_start_method()`, - pull `'main_thread_forkserver'` out of the subint- family arm (which still gates on `_has_subints`). - merge it into the `'trio'` arm — both set `_ctx = None` bc neither needs an `mp.context`. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
In `tests/devx/conftest.py::spawn`, refactor the fixture-internal closures so consumer tests can pass explicit `start_method`/`loglevel` to each `_spawn()` invocation rather than only inheriting the fixture- scoped parametrize values. Deats, - promote `set_spawn_method()` and `set_loglevel()` to take their respective values as fn params (vs closing over the fixture-scope vars). - give `_spawn()` `start_method=start_method` and `loglevel: str|None = None` kwargs so callers override one-off without re-parametrizing the suite. NOTE: this drops the implicit fixture- scoped `loglevel` forward — `_spawn()` callers now must pass `loglevel=...` explicitly. - TODO: figure out how `--ll <level>` should map to the default (currently `None` → uses env-var or tractor default). - add a docstring to `_spawn()` so its role as the consumer-facing closure is obvious from `help()`. Also, - `assert_before()` now returns the `.before` output on success (was `None`); add a one-line docstring describing the new return contract. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
5 tasks
Extend the `_testing._reap` mod with UDS sock-file leak detection +
cleanup, complementing the existing shm and subactor-process
reaping:
- `get_uds_dir()`, `_parse_uds_name()`, `find_orphaned_uds()`,
`reap_uds()` — detect `<name>@<pid>.sock` files under
`${XDG_RUNTIME_DIR}/tractor/` whose binder pid is dead (including
the `1616` registry sentinel).
- `_reap_orphaned_subactors` session-scoped autouse fixture: SIGINT
lingering subactors, wait, SIGKILL survivors, then sweep orphaned
UDS files.
- `_track_orphaned_uds_per_test` fn-scoped autouse fixture:
snapshot sock-file dir before/after each test, warn + reap new
orphans to prevent cascade flakiness under `--tpt-proto=uds`.
- `reap_subactors_per_test` opt-in fn-scoped fixture for modules
with known-leaky teardown.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Wire up `find_orphaned_uds()` + `reap_uds()` from `_reap` as a new phase-3 UDS sweep in the CLI script. Opt-in via `--uds` (run after proc reap + shm) or `--uds-only` (skip other phases). Also, - consolidate skip-proc-reap logic into a single `skip_proc_reap` bool covering both `--shm-only` and `--uds-only` - extend header docstring + usage examples (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Move `--capture=sys` enforcement from a static ini flag to a `pytest_load_initial_conftests()` bootstrap hook that dynamically flips capture mode only when a fork-based spawner (like `main_thread_forkserver`) is detected; non-fork backends keep `--capture=fd`. Also, - load `tractor._testing.pytest` via `-p` in ini (bc bootstrapping hooks must register before conftest `pytest_plugins` runs). - register `_reap` as sub-plugin via `pytest_plugins` tuple in `._testing.pytest`. - drop now-duplicate reap fixtures (already in `_reap` per 1cdc7fb). - rename `tractor_enable_stackscope` dest -> `enable_stackscope` and pop env var on disable. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Factor the sub-actor relay loop out of `dump_tree_on_sig()` into `_relay_sig_to_subactors()` and chain both dump + relay in a single `run_sync_soon` callback (`_dump_then_relay`) so the parent's task-tree flushes BEFORE any sub receives the signal — fixes a hierarchical-ordering race where subs could dump ahead of the parent in the muxed pty stream. Also, - gate file/tty sink writes behind `write_file` + `write_tty` params on `dump_task_tree()`. - use `actor.aid.uid` instead of deprecated `.uid`. - update `test_shield_pause` expects to match the new sequential parent -> relay-log -> sub ordering. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Track `stackscope` enablement in `RuntimeVars` so the flag propagates to subactors via the standard rtvar IPC path instead of relying solely on the `TRACTOR_ENABLE_STACKSCOPE` env var. Deats, - add `use_stackscope: bool` to `RuntimeVars` struct + defaults dict - `enable_stack_on_sig()` sets the rtvar on successful `stackscope` import, asserts unset on `ImportError` - nest stackscope init under `_debug_mode` gate in `Actor.async_main`, check rtvar alongside env var - defer `maybe_init_greenback` import to its own `use_greenback` branch (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Add `disable_pdbp_color()` to the `sync_bp` example
to suppress pygments prompt coloring when
`PYTHON_COLORS=0` — makes pexpect pattern matching
deterministic.
Deats,
- set `loglevel='pdb'` in both script + test spawn.
- disable `enable_stack_on_sig` in example, assert
no `stackscope` output in test.
- update `attach_patts` keys/values with `|_<Task`
/ `|_<Thread` / `|_('subactor'` prefixes to match
actual tree-dump format.
- add call-site patterns (`tractor.pause_from_sync()`
`tractor.pause()`, `breakpoint(hide_tb=...)`).
- trim trailing `\n` from `Lock.repr()` output.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add
subint_forkserverspawn backend (#379 follow-on)(i can't believe how fast we vibed this 😂 )
Motivation
Stacked on top of PR #446 /
subint_fork_backend: thatbranch laid the
_subint_fork.pystub and the post-fork-CPythonexploration showing that
os.fork()from a non-main subinterpreteraborts the child at the CPython layer. This branch operationalizes
the workaround sketched there — fork from a regular
threading.Threadattached to the main interpreter (one that hasnever entered a subint) — and wires it through tractor as a
first-class spawn backend.
The implementation looks tiny (one new module) but the supporting
work was the bulk of the patch series: multiple cancel-cascade hangs
surfaced under fork-based teardown that didn't exist for any
exec-based backend (
trio,mp_*). The shared-memory imageinherited across
os.fork()makes parent↔child socket-EOF deliveryracy, which exposed a latent
process_messages-shielded-loopdeadlock; once cracked, that fix benefits every backend. A separate
pytest --capture=fd× fork-child interaction was traced to atest_nested_multierrorscancel-cascade hang gated on the capturemode — tracked at #449 and worked-around by defaulting the suite to
--capture=sys.Late in the series the backend was split into two clearly-named
variants: variant 1 —
main_thread_forkserver— is theworking backend that ships today (forks from a regular main-interp
worker thread, child runs trio on its own main interp; NO
subinterpreter anywhere); variant 2 —
subint_forkserver—is reserved as a placeholder for the future subint-isolated child
runtime, gated on jcrist/msgspec#1026 (PEP 684 isolated-mode
support). Today the
'subint_forkserver'spawn-method keydispatches to a
NotImplementedErrorstub that points operators atthe variant-1 key. The "subint" prefix on both modules is
family-naming — they live alongside
_subint.py/_subint_fork.pyfrom the broader #379 series.The two
threading.Threadprimitives in_main_thread_forkserverare deliberately heavy-handed (full ad-hoc threads, not
trio.to_thread.run_sync) to side-step legacy-config-subint GILstarvation; once
msgspeclands PEP 684 support and we can useisolated subints, that constraint relaxes — auditable revisit
tracked at #450. Also bundled: a
tractor-reapzombie-subactorcleanup CLI +
_testing._reapshared impl + session-scoped autousefixture, so a mid-teardown timeout no longer leaves orphan
subactors competing for ports across test sessions; a follow-up
commit extends
tractor-reapwith a--shmmode that sweepsorphaned
/dev/shm/*segments owned by the current uid that nolive process is mapping or holding open.
Src of research
The following provide info on why/how this impl makes sense,
fork()can be hacked now?". The "Our own thoughts" sectionsketches the worker-thread fork pattern that this backend
implements.
subint_fork_backendblock and the
_subint_fork.pystub returningNotImplementedError.Py_mod_multiple_interpreters. Untilmsgspecadopts the slot we're stuck on legacy-config subints,which forces our heavier thread design (see Audit
subint_forkserverthread constraints once msgspec PEP 684 lands #450).msgspecPEP 684 isolated-mode tracker; gatesvariant-2.
ai/conc-anal/subint_fork_from_main_thread_smoketest.py(
control_subint_thread_fork,main_thread_fork,worker_thread_fork,full_architecture) — pre-tractor proofthat the workaround is sound.
Module-level design docs (read these top-of-file docstrings
for the per-backend architectural justifications, fork-semantics
analysis, and migration plans — much richer than the
high-level summary in this PR description):
tractor/spawn/_main_thread_forkserver.pyDesign rationale (why a forkserver + why in-process),
What survives the fork? — POSIX semantics, FYI: how this
dodges the
trio.run()×fork()hazards, Implementationstatus, Still-open work, TODO gated on msgspec PEP 684.
tractor/spawn/_subint_forkserver.pywould buy us (3 wins: cheaper forks, true parallelism,
multi-actor-per-process), what lives here today
(
run_subint_in_worker_thread), what will live here whenvariant 2 ships.
tractor/spawn/_subint.pysubintbackend (parent of this stack, PR A subinterpreter-in-thread spawning backend #446).Why we use the private
_interpretersC module instead ofconcurrent.interpreters's public'isolated'API; py3.14+feature gate rationale; msgspec PEP 684 migration path.
tractor/spawn/_subint_fork.pyPointers to the CPython-source-line analysis in
subint_fork_blocked_by_cpython_post_fork_issue.md.Summary of changes
By chronological commit,
(82332fbc) Lift the validated fork primitives into
tractor.spawn._subint_forkserver:fork_from_worker_thread()+run_subint_in_worker_thread()as the two re-usable buildingblocks.
(25e400d5) Add trio-parent integration tests covering
tier-1 (primitives driven from inside
trio.run()) and tier-2(full backend wired through
open_root_actor+open_nursery).(cf2e71d8) Document the PEP 684 audit-plan under
ai/conc-anal/subint_forkserver_thread_constraints_on_pep684_issue.md— the upstream-gated cleanup work tracked at Audit
subint_forkserverthread constraints once msgspec PEP 684 lands #450.(26914fde) Wire
'subint_forkserver'as a first-classSpawnMethodKeyand_methodsregistry entry; thetry_set_start_methodcase re-uses the subint-family py3.14+gate.
(63ab7c98) (7804a9fe) Reset post-fork
_statein the forkserver child via a new pureget_runtime_vars(clear_values=True)+ siblingset_runtime_vars()API; without the reset the child inheritsthe parent's
_is_root=Trueand tripsActor._from_parent()onthe
SpawnSpechandshake.(76605d56) (dcd5c1ff)
(a72deef7) Add a DRAFT orphan-SIGINT test scaffold +
child_sigintmodes; refine the diagnosis — the hang is NOT amissing handler, trio's loop stays wedged in
epoll_waitdespitedelivery. Full trace + fix dirs in
ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md.(f5f37b69) (5e85f184)
(8bcbe730) (e31eb8d7) Shorten timeouts
in forkserver suites, drop dead f-string prefixes, enable
debug_modefor the forkserver path (added to a new_DEBUG_COMPATIBLE_BACKENDSlist in_root), and label theforkserver child in log attribution.
(1e357dcf) Mv
test_subint_cancellation.pyinto thenew
tests/spawn/subpkg alongside the forkserver test module.(d093c319) (70d58c4b) Teach the
/run-testsskill a zombie-actor post-run check + a SIGINT-firstgraceful cleanup ladder. Per the SC-discipline rule: graceful
cancel before SIGKILL.
(e3f4f5a3) (1af21210) Add the
test-cancellation leak doc
(
subint_forkserver_test_cancellation_leak_issue.md) and wirereg_addrfixture through the leaky cancel tests so each rungets a unique registrar address.
(35da8089) (9993db01) Refine the
nested-cancel hang diagnosis; add a post-fork FD scrub in the
fork-child prelude as the current workaround.
(c20b05e1) Use
pidfdfor cancellable_ForkedProc.wait()— replaces a blockingos.waitpid()with atrio-cancellable poll.(8ac3dfeb) Break the parent-channel shield in
Actor.cancel()teardown via a captured_parent_chan_cs.Without this, the shielded
process_messagesloop parks on EOFthat only arrives AFTER the parent tears down — under fork
backends the parent is itself blocked on this child's exit.
Mutual-wait deadlock; explicit cancel makes teardown deterministic
regardless of backend.
(506617c6) (ab86f761)
(458a35cf) (7cd47ef7) Skip-mark + narrow
the cancel hang; surface silent failures in the forkserver child;
doc ruled-out fixes + the capture-pipe aside.
(76d12060) Claude-perms tweak so
/commit-msgoutputscan be written.
(4106ba73) (eceed29d)
(4c133ab5) Pin the forkserver hang to
pytest --capture=fd(subint_forkserver:test_nested_multierrorscancel-cascade hang gated by pytest--capture=fd#449); codify the capture-pipe-hang lesson inskills; default
pytestto--capture=sysinpyproject.tomlwith the trade-off rationale inlined.
(e312a68d) (4d055543) Bound the peer-clear
wait in
async_main'sfinally(3smove_on_after) and narrowthe forkserver hang to the
async_mainouter tn — load-bearingfor backend-agnostic teardown determinism.
(d6e70e9d) Import-or-skip
.devx.tests requiringgreenback— keeps the suite collectable without the optionaldep.
(b350aa09) Wire
reg_addrthrough infected-asynciotests for parallel-run isolation.
(2ca0f41e) (44bdb169) Skip
test_loglevel_propagated_to_subactoron the forkserver backendtoo; tighten the orphan-SIGINT xfail to
strict=True.(eae478f3) (6d76b604) Add
tractor._testing._reap(SC-polite SIGINT-first reap, descendant_reap_orphaned_subactorsfixture; add thetractor-reapCLI(
scripts/tractor-reap) wrapping the same impl.(c99d475d) (aa3e2309) Fix
mp.SharedMemoryunder fork-without-exec —tractor.ipc._mp_bs.disable_mantracker(force_disable=True)isnow the default (belt+suspenders: no-op
ManTrackermonkey-patchtrack=False);_shm.open_shm_listalways wires theunlinklifetime callback (was 3.12-and-below only); document the
incompat in
ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md.(4f12d69b) Extend
tractor-reapwith--shm(and--shm-only) modes that sweep orphaned/dev/shm/<key>segmentsowned by the current uid with no live process mapping or holding
them open. Match-criteria via
psutil.Process.memory_maps()+.open_files()— kernel-canonical, no reliance ontractor-specific shm-key naming, so unrelated apps' segments are
always preserved. Adds
psutil>=7.0.0to thetestingdepgroup.
(65fcfbf2) (9b05f659)
(66f1941f) Bump
test_stale_entry_is_deletedtimeoutto 30s; wire
test_dynamic_pub_subto standard fixtures; wirereg_addrintotest_context_stream_semantics.(54561959) Surface subint bootstrap excs in
_subint.subint_proc()(try/except BaseException+log.exception(...)around_interpreters.exec()); also log_interpreters.is_running(interp_id)on hard-kill timeout todisambiguate "thread leaked, subint already done" from "thread
alive bc subint is wedged".
?TODOnotes the anyio-borrow pathfor re-raising bootstrap excs in the parent task and migrating to
_interpreters.set___main___attrs()for non-literal SpawnSpecargs.
(3ab99d55) (4b5176e2) Major
module-docstring expansion for
_subint_forkserver: designrationale (in-process forkserver vs.
mp.forkserver's sidecar;why a forkserver at all vs. forking from a trio task), POSIX/trio
fork mechanics, what survives the fork boundary, and the
future-subint payoffs (cheaper forks, true parallelism via
per-interp GIL, multi-actor-per-process). Bump the gated msgspec
link from
#563→#1026.(99dade0f) Extract the truly-generic
main-interp-worker-thread fork primitives
(
fork_from_worker_thread,_close_inherited_fds,_ForkedProc,wait_child,_format_child_exit) into a sibling_main_thread_forkserver.pymodule — the primitive layer is nowhonestly named (none of these helpers touch a subint). Re-exports
preserved.
(57dae0e4) Split the backend into variant 1 + variant
2 modules. Variant 1 (
main_thread_forkserver) becomes thecanonical working impl: new
SpawnMethodKeyliteral,_methodsdispatch entry,
Actor._from_parent()match-arm,main_thread_forkserver_proc()spawn-coro stamping its ownSpawnSpec/ log lines. Variant 2 (subint_forkserver) shrinksto a placeholder describing the future subint-isolated child
runtime gated on Port to new
concurrent.interpreterssupport incpython3.14+ jcrist/msgspec#1026; legacy'subint_forkserver'key still aliases to variant-1 here (flipped to
NotImplementedErrorin the next commit).(5e83881f) Reduce
_subint_forkserver.pyto itsvariant-2 placeholder shape: add
subint_forkserver_procasyncstub raising
NotImplementedErrorwith a redirect msg pointingat
main_thread_forkserver+ Port to newconcurrent.interpreterssupport incpython3.14+ jcrist/msgspec#1026 + Trying out sub-interpreters (subints), maybefork()can be hacked now?' #379. Flipthe
_methodsregistry to dispatch the stub directly so--spawn-backend=subint_forkservererrors cleanly. Drop deadmodule-scope (
ChildSigintMode,_DEFAULT_CHILD_SIGINT, unusedimports).
(9f0709ee) Rename
tests/spawn/test_subint_forkserver.py→test_main_thread_forkserver.py; migrate test/smoketest importsto
tractor.spawn._main_thread_forkserver; orphan-harnesssubprocess argv flipped to
'main_thread_forkserver'. Drop thevariant-2 module's backward-compat re-exports of fork primitives.
(205382a3) Sweep
subint_forkserver→main_thread_forkserverin remaining string-match refs:_DEBUG_COMPATIBLE_BACKENDS,test_loglevel_propagated_to_subactor's capfd-skip,test_sigint_closes_lifetime_stack's xfail, comment/docstringrefs across
_runtime,_state,_testing.pytest,_subint,pyproject.toml,test_cancellation,test_registrar. Drop thetest_shm.py"broken onmain_thread_forkserver" skip-mark —_mp_bs+_shmfixes make those tests pass.(cbdf1eb6) Add
test_subint_forkserver_key_errors_cleanlyregression guardpinning the variant-2 reservation contract: the
'subint_forkserver'key MUST raiseNotImplementedError(notsilently dispatch to variant-1), and the error msg must surface
both the working-backend pointer (
main_thread_forkserver) +the upstream blocker (
msgspec#1026).(7c5dd4d0) Fix
_testing.addr.get_rando_addrcross-process collisions: the
_rando_port: str = random.randint(...)default-arg expression was evaluated ONCE atmodule-import — making it a per-process singleton. Two parallel
pytest sessions had a 1/9000 birthday-pair chance of cascade-
failing every
reg_addr-using test. Switch to per-callrandom.randint()salted withos.getpid(); drop the bogus: strannotation.Future follow up
Resolve
test_nested_multierrorscancel-cascade hang under--capture=fd(subint_forkserver:test_nested_multierrorscancel-cascade hang gated by pytest--capture=fd#449). The--capture=sysdefault is aworkaround; the underlying pytest-capture-machinery ↔ fork-child
stdio interaction is not yet root-caused. See
ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md.Wire the variant-2
subint_forkserver_proc()impl oncemsgspecships PEP 684 isolated-mode support(Port to new
concurrent.interpreterssupport incpython3.14+ jcrist/msgspec#1026 → tracked at Auditsubint_forkserverthread constraints once msgspec PEP 684 lands #450). Today it's aNotImplementedErrorstub; the unblock lets the child runtimelive in an isolated subint while the parent's
trio.run()keepsrunning on main.
Audit
main_thread_forkserverthread-constraint cleanup oncemsgspecships PEP 684 support (Auditsubint_forkserverthread constraints once msgspec PEP 684 lands #450). Both primitives currentlyallocate dedicated
threading.Threadinstances rather than usingtrio.to_thread.run_sync; theTODO — cleanup gated on msgspec PEP 684 supportblock in the module docstring catalogs the threeentangled root causes that block the cleanup today.
Surface subint bootstrap exceptions to the parent task via a
nonlocal errslot._subint.subint_proc()currently logs themvia
log.exception()only — the?TODOnear the_interpreters.exec()call points at anyio'sto_interpreter._interp_call(retval, is_exception)pattern asthe next step. Coordinates with the
trio.Cancelledpaths aroundsubint_exited.wait().Migrate SpawnSpec arg-passing to
_interpreters.set___main___attrs()._subint.subint_proc()'s?TODOat thebootstrapliteral: same API anyio uses into_interpreter._Worker.call(); needed oncenon-
repr()-roundtrippable values (SpawnSpecstruct,callables) get passed through.
Implement
child_sigint='trio'mode (or remove the flag).Scaffolded in
_main_thread_forkserverbut currently a no-oppending the orphan-SIGINT root-cause fix tracked in
ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md.Once trio's
epoll_waitwedge is fixed, the flag may end up ano-op / doc-only mode.
Add cancellation / hard-kill stress coverage for the
forkserver backend (counterpart to
tests/spawn/test_subint_cancellation.pyfor the plainsubintbackend). Module docstring lists this under "Still-open work".
Run the
?TODOtypo-check enhancement intractor._testing.pytest— pipeskipon_spawn_backendargsthrough the try-set-backend checker rather than just
assert in get_args(SpawnMethodKey).xplatform pass for
tractor._testing._reap— process-reappath is currently Linux-only via
/proc/<pid>/{status,cwd,cmdline}; the--shmphase isLinux/FreeBSD only (macOS POSIX shm has no fs-visible path;
Windows is a different story). Module docstring notes a
psutil-based rewrite is viable since the dep is alreadytest-time.
Root-cause the Mode-A cancel-cascade hang under heavy
fork-spawn contention (
main_thread_forkserver: cancel-cascade occasionally hangs >9s under heavy fork-spawn contention #451). Reproduces ~17% of runs at 3parallel pytest streams ×
cpu_count - 2actors; ≈0% on anidle single-stream system. Parent-side dump shows trio's main
thread parked in
trio._core._io_epoll.get_events()line 245— cancel cascade has reached the I/O wait but
epoll.pollnever returns. Workaround in this PR: per-test
trio.fail_after(12)cap +reap_subactors_per_testopt-infixture confine the failure to the originating test (no
cascade contamination of sibling tests), so the suite stays
green. Real fix needs a contention-amplifier reproducer +
stackscopetask-tree dumps from parent + each subactor atthe t≈9 mid-cascade mark.
(this pr content was generated in some part by
claude-code)