Skip to content

fix(codex): harden bootstrap with stderr capture, health probe, retry + soft-skip#28

Merged
openlawbot merged 2 commits into
mainfrom
fix/codex-bootstrap-reliability
May 1, 2026
Merged

fix(codex): harden bootstrap with stderr capture, health probe, retry + soft-skip#28
openlawbot merged 2 commits into
mainfrom
fix/codex-bootstrap-reliability

Conversation

@openlawbot
Copy link
Copy Markdown
Collaborator

Summary

  • Root cause: PR #278's Gate Auto-Fix: failed was caused by macOS Gatekeeper silently blocking the codex binary in _dyld_start for 40 minutes. Gate had no stderr capture, no health probe, no retry, and a single shared 2400s timeout for both bootstrap and long-running calls.
  • Adds BOOTSTRAP_TIMEOUT_S (120s), stderr capture, codex_health_check() with 5-min TTL cache + macOS Gatekeeper diagnostics, preflight + one retry + soft-skip in the fix pipeline, check_codex_cli + check_claude_cli in periodic health checks, notify.codex_unavailable with 1-hour throttle, gate doctor Gatekeeper hints, and docs.
  • When codex is unavailable, the fix pipeline now returns FixResult(success=True, pushed=False) — preserving the original review verdict instead of turning the Auto-Fix check red.

Test plan

  • All 190 existing + new tests pass (test_codex.py, test_fixer.py, test_health.py)
  • codex --version returns instantly from ~/.local/bin/codex
  • codex exec successfully creates sessions and returns results
  • New tests cover: stderr capture, bootstrap timeout, health cache hit/miss/invalidation, preflight-skip, retry-then-skip, bootstrap success path

Made with Cursor

… + soft-skip

PR #278's Gate Auto-Fix failure was caused by macOS Gatekeeper silently
blocking the codex binary in _dyld_start for 40 minutes (the full
CODEX_TIMEOUT_S). Gate had no stderr capture, no health probe, no retry,
and a single shared timeout for both bootstrap and long-running calls.

Changes:
- Add BOOTSTRAP_TIMEOUT_S (120s) separate from CODEX_TIMEOUT_S (2400s)
- Capture stderr in bootstrap_codex (now returns 3-tuple with stderr_tail)
- Add codex_health_check() with 5-min TTL cache and macOS Gatekeeper
  quarantine/spctl diagnostics
- Wire preflight + one retry + soft-skip into fix pipeline — when codex
  is unavailable, return FixResult(success=True, pushed=False) to
  preserve the original review verdict instead of going red
- Remove senior-without-delegation degraded path
- Add check_codex_cli + check_claude_cli to periodic health checks
- Add notify.codex_unavailable with 1-hour throttle
- gate doctor prints Gatekeeper remediation hint on macOS
- Document the dyld/Gatekeeper failure mode in health-and-recovery.md

Co-authored-by: Cursor <cursoragent@cursor.com>
@openlawbot openlawbot force-pushed the fix/codex-bootstrap-reliability branch from 39db282 to 68b0127 Compare May 1, 2026 20:37
@openlawbot
Copy link
Copy Markdown
Collaborator Author

Gate Review ✅

Approved with notes — The PR successfully hardens the Codex bootstrap path with stderr capture, a 120 s bootstrap timeout, a cached health probe with macOS Gatekeeper diagnostics, a preflight gate, one retry, and a soft-skip FixResult when Codex is unavailable. Build is clean: 1102 tests pass, lint is green, and no type errors were found. The postcondition audit confirms the return-type change from 2-tuple to 3-tuple in bootstrap_codex is correctly propagated to all callers (fixer.py, fixer_polish.py). Two warnings require author attention before merging: the new _last_codex_alert_ts float in notify.py is read and written without a threading.Lock, creating a TOCTOU race that can break the 1-hour alert-throttle under concurrent reviews; and check_claude_cli in health.py spawns a live subprocess on every health cycle with no caching, while the sibling check_codex_cli properly delegates to a 5-minute TTL cache — an asymmetry the PR's own motivation argues against. Three informational notes are also included but do not block merging.

Warnings

  • gate/notify.py:210 (pattern match) — Module-level _last_codex_alert_ts float is read and written in codex_unavailable() without a threading.Lock. Gate's ThreadPoolExecutor runs concurrent PR reviews; two threads can simultaneously read a stale timestamp, both pass the cooldown guard, and both fire an alert — breaking the 1-hour throttle guarantee. The pattern is inconsistent with _health_lock already used in gate/codex.py.

    Fix: Add _codex_alert_lock = threading.Lock() alongside the timestamp and wrap the read-check-write in with _codex_alert_lock:. This matches the pattern already in gate/codex.py for _health_lock.

  • gate/health.py:556 (pattern match) — check_claude_cli calls subprocess.run(["claude", "--version"], timeout=10) directly on every health-check invocation with no caching, while the sibling check_codex_cli delegates to codex_health_check() which carries a 5-minute TTL cache keyed on binary path + mtime. A Gatekeeper-blocked or hung claude binary will block the health worker thread for the full 10 s timeout on every cycle — the same failure mode this PR was written to prevent for Codex.

    Fix: Introduce a claude_health_check(force: bool = False) -> tuple[bool, str] function in gate/codex.py with a matching 5-minute TTL cache, and have check_claude_cli delegate to it. This also makes macOS Gatekeeper diagnostics available for the Claude binary through the same code path.

Notes

  • gate/codex.py:404 — In the subprocess.TimeoutExpired branch of codex_health_check, the message always includes 'likely Gatekeeper block' even on non-macOS platforms where Gatekeeper does not exist. _diagnose_macos_gatekeeper correctly returns '' on non-macOS, but the surrounding phrase is unconditional.
  • gate/notify.py:226notify.codex_unavailable, health.check_codex_cli, health.check_claude_cli, and setup._gatekeeper_hint lack direct unit tests. The cooldown logic in codex_unavailable (the 3600 s throttle updating _last_codex_alert_ts) is the sole guard against alert-fatigue spam; a future edit to the set-before-vs-after-send order would silently regress without a dedicated test.
  • gate/orchestrator.py:722 — The orchestrator's is_no_op branch unconditionally sets the Auto-Fix check title to 'skipped (no mechanical changes needed)', which now also fires for the new codex-unavailable soft-skip path introduced by this PR. The summary field correctly explains the real reason, but the title is misleading for operators who only see check names.

Build Results

  • ruff: ✅ (0 warnings)
  • pytest: ✅ (1102/1108 passed)

2 warnings, 3 notes across 5 stages (871s, confidence: high)

Findings fixed:
- gate/notify.py — Added `_codex_alert_lock = threading.Lock()` next to the timestamp constants and wrapped the read-check-write of `_last_codex_alert_ts` in `with _codex_alert_lock:` inside `codex_unavailable()`. Network I/O (notify, notify_discord) stays outside the lock so a slow ntfy POST cannot block other workers. Added `import threading` to the module's import block.
- gate/health.py — Added `claude_health_check(force=False) -> tuple[bool, str]` in gate/codex.py mirroring `codex_health_check`: 5-minute TTL cache keyed on (path, mtime), reuse of `_diagnose_macos_gatekeeper` for Gatekeeper diagnostics. Added a parallel `_claude_binary_key()` helper. Refactored `_store_health` to take a cache-key name parameter so both probes share the same cache dict under separate keys. Replaced `check_claude_cli` body in gate/health.py with a thin delegation to `claude_health_check`. Added 2 parity tests (`TestClaudeHealthCheck.test_healthy_claude`, `test_not_in_path`) in tests/test_codex.py.
@openlawbot
Copy link
Copy Markdown
Collaborator Author

Gate Auto-Fix Applied

Fixed 2/2 findings in 1 iteration(s) (d7ce291).

Fixed:

  • gate/notify.py — Added _codex_alert_lock = threading.Lock() next to the timestamp constants and wrapped the read-check-write of _last_codex_alert_ts in with _codex_alert_lock: inside codex_unavailable(). Network I/O (notify, notify_discord) stays outside the lock so a slow ntfy POST cannot block other workers. Added import threading to the module's import block.
  • gate/health.py — Added claude_health_check(force=False) -> tuple[bool, str] in gate/codex.py mirroring codex_health_check: 5-minute TTL cache keyed on (path, mtime), reuse of _diagnose_macos_gatekeeper for Gatekeeper diagnostics. Added a parallel _claude_binary_key() helper. Refactored _store_health to take a cache-key name parameter so both probes share the same cache dict under separate keys. Replaced check_claude_cli body in gate/health.py with a thin delegation to claude_health_check. Added 2 parity tests (TestClaudeHealthCheck.test_healthy_claude, test_not_in_path) in tests/test_codex.py.

@openlawbot
Copy link
Copy Markdown
Collaborator Author

Gate Review ✅

Approved — Build is fully clean (1104 passed, 0 failed, lint/typecheck pass). Both prior warnings have been resolved: the threading race on _last_codex_alert_ts is fixed with _codex_alert_lock, and check_claude_cli now delegates to the cached claude_health_check probe instead of calling subprocess directly on every health cycle. Per the re-review graduation rule, all prior warnings are resolved and only three info-level carryovers remain (misleading 'likely Gatekeeper block' wording on non-macOS in the new claude_health_check, missing timeout/error-path tests for TestClaudeHealthCheck, and no direct unit test for the codex_unavailable cooldown), none of which block a clean approve. The fix pipeline's soft-skip behavior on bootstrap failure is correct and well-tested; postconditions for the 3-tuple return type, health probe caching, and FixResult semantics are all satisfied.

Notes

  • gate/codex.py:470 — The new claude_health_check timeout branch unconditionally appends '— likely Gatekeeper block' even on non-macOS hosts where Gatekeeper does not exist. _diagnose_macos_gatekeeper correctly returns '' on Linux/Windows, but the surrounding phrase is hardcoded. This is the same parity gap that exists in the sibling codex_health_check at line 415.
  • tests/test_codex.py:401TestClaudeHealthCheck only covers the healthy path and binary-not-in-PATH. The sibling TestCodexHealthCheck has 8 cases covering nonzero exit, OSError, cache hit, force=True bypass, mtime invalidation, and TimeoutExpired. The timeout branch (precisely the Gatekeeper scenario this PR targets) is unexercised for claude.
  • gate/notify.py:216 — The 1-hour throttle in codex_unavailable still has no direct unit test. The PR correctly added _codex_alert_lock, but there is no test asserting that two back-to-back calls fire the underlying notify exactly once and advance _last_codex_alert_ts exactly once.

Resolved since last review

  • gate/notify.py — Module-level _last_codex_alert_ts was read and written in codex_unavailable() without a threading.Lock, allowing two concurrent pool workers to both pass the cooldown guard and double-fire the alert. (fixed)
  • gate/health.pycheck_claude_cli called subprocess.run(["claude", "--version"], timeout=10) directly on every health-check invocation with no caching, unlike check_codex_cli which delegates to the 5-minute TTL codex_health_check probe. (fixed)

Build Results

  • ruff: ✅ (0 warnings)
  • pytest: ✅ (1104/1110 passed)

3 notes across 6 stages (376s, confidence: high)

@openlawbot openlawbot merged commit 7586183 into main May 1, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant