fix(codex): force valid model on app-server spawn to avoid silent hang #34

Merged: cskwork merged 7 commits into main from fix/codex-0130-invalid-model-hang on May 17, 2026

@cskwork cskwork commented May 17, 2026

Summary

  • WORKFLOW.md codex.command now passes -c model=gpt-5-codex to codex app-server so an invalid model in the user's ~/.codex/config.toml no longer makes thread/start hang silently.
  • Adds docs/SMA-26/diagnosis.md with reproduction, root cause, and follow-up notes.

Why

codex 0.130.0 does not return a JSON-RPC error on thread/start when the configured model is invalid — it just stops responding. Symphony's _request() has no per-call timeout, so the silent hang surfaces much later as the confusing error: port_exit: subprocess stdout closed once the subprocess is killed for unrelated reasons. Six straight retries on SMA-25 hit exactly this failure mode.

The override pins a valid model at spawn time without mutating the user's home config (other tools that share ~/.codex/ may rely on whatever the user has set).
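
A minimal sketch of the same idea in Python, not Symphony's actual code: the fix in this PR is a one-line change to the `codex.command` string in WORKFLOW.md. The helper name `with_model_override` is hypothetical. It appends the `-c model=...` override to a command string unless one is already present, so `~/.codex/config.toml` is never read or written.

```python
import shlex


def with_model_override(command: str, model: str = "gpt-5-codex") -> list[str]:
    """Return argv for `command` with a `-c model=...` override appended.

    Hypothetical helper for illustration only. An explicit override that is
    already on the command line is left alone.
    """
    argv = shlex.split(command)
    # Look for an existing `-c model=<something>` pair.
    for flag, value in zip(argv, argv[1:]):
        if flag == "-c" and value.startswith("model="):
            return argv
    return argv + ["-c", f"model={model}"]
```

Because the override lives only in the spawned process's argv, other tools that share `~/.codex/` keep seeing whatever model the user configured.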

How verified

Manual JSON-RPC handshake against codex app-server (with and without the override):

( printf '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"clientInfo":{"name":"symphony","version":"0.2.0"}}}\n'
  printf '{"jsonrpc":"2.0","id":2,"method":"thread/start","params":{"cwd":"/tmp"}}\n'
  sleep 5
) | timeout 7 codex app-server -c model=gpt-5-codex

Without -c model=...: only the initialize response comes back, thread/start (id=2) returns zero bytes, timeout kills the process at 7s. With the override: thread/start responds immediately with result.thread.id = 019e34f0-c760-7540-b18e-d9fbfedd65bd plus the expected thread/started notification.

Symphony backend dispatch verified live — see docs/SMA-26/diagnosis.md for the full trace and file:line references.

Test plan

  • Existing pytest suite stays green (pytest -q).
  • On a fresh clone with a default ~/.codex/config.toml, a codex-routed ticket dispatches without port_exit.
  • On a host with no ~/.codex/config.toml, the override has no negative effect — codex picks gpt-5-codex as expected.

Follow-ups (separate)

  • Upstream codex: should return a JSON-RPC error on thread/start for invalid models instead of hanging.
  • Symphony backend: add a per-call timeout to _request() (or at minimum to thread/start) so silent hangs surface fast as a clear RequestTimeout rather than a stale port_exit minutes later.

symphony and others added 7 commits May 17, 2026 16:56
Codifies the _rebuild_backend_for_phase try/except envelope, the
BaseException catch rationale, the _install_fake_backend factory-call
contract, and the WorkspaceManager hot-reload three-setter contract.

Sources: PR #19 (12b4610), PR #21 (dfdbdc8), local follow-up 3a8bb7e.
SMA-24: Verify orchestrator fixes from PR #19 + PR #21 cherry-picks.

- 2 regression tests pinning PR #19 (workflow_dir + reuse_policy +
  hook_env simultaneous reload) and PR #21 (start_session failure
  cleanup) invariants.
- New docs/llm-wiki/orchestrator-phase-transition.md entry.
- Zero production code changes (git diff main -- src/ empty pre-merge).

QA: targeted 91 passed / 1 skipped, full suite 524 passed / 6 skipped.
Verify orchestrator fixes from PR #19 + PR #21 cherry-picks

Source: symphony/SMA-24 0e26e1b
WORKFLOW.md `codex.command` now passes `-c model=gpt-5-codex` so codex
0.130's `thread/start` returns a real response even when the user's
`~/.codex/config.toml` pins an invalid model. Without the override,
codex hangs silently on `thread/start` (no JSON-RPC error, no exit),
and Symphony eventually surfaces this as the confusing
`error: port_exit: subprocess stdout closed` after the subprocess is
killed for unrelated reasons.

Override at the command level keeps the user-level codex config
untouched — other tools that share `~/.codex/` may rely on whatever
the user has set.

Full diagnosis (reproduction, root cause, upstream + Symphony
follow-ups) lives at docs/SMA-26/diagnosis.md.
…nched backends

When the Symphony orchestrator is launched in the background (e.g.
`nohup python -m symphony.cli ... &` or a systemd-style unit with no
TTY), the orchestrator process inherits a closed or half-broken fd 0.
The previous `subprocess.run` for hooks did not pin `stdin`, so the
hook's `bash` and any grandchild it spawned inherited that broken fd.

The most visible victim is `after_create`'s
`scripts/symphony-setup-worktree.sh`, which runs `python -m venv .venv`
+ `pip install`. CPython aborts at startup with

  Fatal Python error: init_sys_streams: can't initialize sys standard
  streams / OSError: [Errno 9] Bad file descriptor

and the orchestrator surfaces this as a confusing
`hook after_create exited 1`. Symptomatically the very first hook
invocation per-backend tends to succeed (workspace freshly created)
while subsequent ones fail — observed live as 4+ consecutive
`hook_failed` after ~9 healthy dispatches against the same backend
on macOS.

The one-line `stdin=subprocess.DEVNULL` pins fd 0 for every hook
spawn regardless of how the parent was launched. No effect on hooks
that don't read stdin (every existing hook in the repo).

Pairs with the `-c model=gpt-5-codex` codex command override on this
branch — both are backend-reliability fixes that make Symphony
behave the same whether you run it from a TTY or in the background.
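
The stdin pin described in this commit message can be sketched as follows. The wrapper name `run_hook` and its parameters are hypothetical, not Symphony's hook runner; only the `stdin=subprocess.DEVNULL` argument comes from the commit.

```python
import subprocess


def run_hook(argv, cwd=None, timeout=None):
    """Run a hook command with fd 0 pinned to the null device (illustrative)."""
    return subprocess.run(
        argv,
        cwd=cwd,
        # Never inherit the parent's fd 0, which may be closed or broken
        # when the orchestrator was started in the background.
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

A child started this way sees an immediate end-of-file on stdin, so interpreters launched by the hook can initialize their standard streams regardless of how the parent was launched.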
@cskwork cskwork merged commit 686504f into main May 17, 2026
2 of 3 checks passed
@cskwork cskwork deleted the fix/codex-0130-invalid-model-hang branch May 17, 2026 09:31
@cskwork cskwork mentioned this pull request May 17, 2026
4 tasks
cskwork added a commit that referenced this pull request May 17, 2026
Patch release rolling up the post-0.6.0 reliability fixes:

- #19 refresh workflow dir on hook reload (cherry-pick)
- #21 stop failed phase-transition backend (cherry-pick)
- #22 isolate corrupt file tickets during scan
- #23 honor symphony.autocommitExclude in commit_workspace_on_done
       (opt-in escape hatch; default behavior unchanged)
- #34 force valid codex model on app-server spawn to avoid silent hang
- #34 pin hook stdin to /dev/null for background-launched backends

All changes are bug fixes restoring intended behavior — no user-facing
feature additions or signature changes. Pinning pyproject.toml and
src/symphony/__init__.py in lockstep.
cskwork added a commit that referenced this pull request May 17, 2026
SMA-25 (Verify autocommitExclude mechanism from PR #23) ran on codex
and was self-Blocked at Learn when the merge gate failed against
.git/worktrees/<ID>/ — the codex sandbox refused writes through the
worktree's git admin dir. That sandbox gap is fixed separately in PR #36.

The agent reached Learn with substantive artefacts though, and they are
worth keeping independently of the merge-gate failure. This recovers
only the high-signal files from `symphony/SMA-25` (commit 9c964fe) and
leaves out the per-turn status echoes and raw pytest JSON/diff runs
that were progress noise rather than reference material.

Recovered:
- docs/SMA-25/{explore,plan,work,qa,learn}/* — phase deliverables
  (Explore notes + reuse inventory, implementation plan, work + qa-rewind
  summaries, QA api-surface + details, Learn details).
- docs/features/SMA-25/index.md — As-Is/To-Be one-pager.
- docs/llm-wiki/workspace-auto-commit-excludes.md — new wiki entry
  covering the opt-in `symphony.autocommitExclude` mechanism, including
  the base-squash safety case Explore surfaced beyond the original brief.
- docs/llm-wiki/INDEX.md — one new row for the wiki entry (kept the
  existing SMA-24 `orchestrator-phase-transition` row, which the
  SMA-25 branch had silently dropped because it was forked before SMA-24
  merged into main).

Not recovered (deliberately):
- docs/SMA-25/todo/turn-*-status.md + blocker.md (stale-worker echo logs
  from the pre-restart cycle).
- docs/SMA-25/qa/runs/*.json + qa/diff/*.diff (raw pytest output, large
  and reproducible from `pytest` directly).
- src/symphony/*.py / tests/*.py / pyproject.toml / __init__.py changes
  on the SMA-25 branch — those were forked before PR #19/#21/#22/#23/#34/
  #36 landed and would revert main.