fix(codex): harden bootstrap with stderr capture, health probe, retry + soft-skip #28
… + soft-skip

PR #278's Gate Auto-Fix failure was caused by macOS Gatekeeper silently blocking the codex binary in _dyld_start for 40 minutes (the full CODEX_TIMEOUT_S). Gate had no stderr capture, no health probe, no retry, and a single shared timeout for both bootstrap and long-running calls.

Changes:
- Add BOOTSTRAP_TIMEOUT_S (120s) separate from CODEX_TIMEOUT_S (2400s)
- Capture stderr in bootstrap_codex (now returns 3-tuple with stderr_tail)
- Add codex_health_check() with 5-min TTL cache and macOS Gatekeeper quarantine/spctl diagnostics
- Wire preflight + one retry + soft-skip into fix pipeline: when codex is unavailable, return FixResult(success=True, pushed=False) to preserve the original review verdict instead of going red
- Remove senior-without-delegation degraded path
- Add check_codex_cli + check_claude_cli to periodic health checks
- Add notify.codex_unavailable with 1-hour throttle
- gate doctor prints Gatekeeper remediation hint on macOS
- Document the dyld/Gatekeeper failure mode in health-and-recovery.md

Co-authored-by: Cursor <cursoragent@cursor.com>
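To illustrate the first two changes, here is a minimal sketch of a bootstrap call with its own short timeout and a captured stderr tail, returned as a 3-tuple. It is not the PR's actual `bootstrap_codex`: the function name `run_bootstrap`, the helper `_as_text`, and the `STDERR_TAIL_CHARS` cap are assumptions made for illustration; only the 120 s value and the 3-tuple shape come from the description above.

```python
import subprocess

BOOTSTRAP_TIMEOUT_S = 120   # value taken from the PR description
STDERR_TAIL_CHARS = 2000    # hypothetical cap on how much stderr is kept


def _as_text(data):
    """Normalise captured output, which may be None, bytes, or str."""
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode("utf-8", errors="replace")
    return data


def run_bootstrap(cmd, timeout_s=BOOTSTRAP_TIMEOUT_S):
    """Run cmd and return (ok, stdout, stderr_tail).

    A hang (for example a binary blocked before it reaches main) surfaces
    as a failed result after timeout_s instead of consuming the much
    longer timeout used for real work.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        # Partial output on the exception can be None or bytes.
        tail = _as_text(exc.stderr)[-STDERR_TAIL_CHARS:]
        return False, _as_text(exc.stdout), tail or f"timed out after {timeout_s}s"
    except FileNotFoundError as exc:
        return False, "", str(exc)
    return proc.returncode == 0, proc.stdout, proc.stderr[-STDERR_TAIL_CHARS:]
```

Keeping only the tail of stderr bounds what ends up in logs and notifications while still preserving the last lines, which is usually where a loader or permission error appears.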
39db282 to 68b0127
Gate Review ✅ Approved with notes

The PR successfully hardens the Codex bootstrap path with stderr capture, a 120 s bootstrap timeout, a cached health probe with macOS Gatekeeper diagnostics, a preflight gate, one retry, and a soft-skip FixResult when Codex is unavailable. Build is clean: 1102 tests pass, lint is green, and no type errors were found. The postcondition audit confirms the return-type change from 2-tuple to 3-tuple in bootstrap_codex is correctly propagated to all callers (fixer.py, fixer_polish.py).

Two warnings require author attention before merging:
- The new _last_codex_alert_ts float in notify.py is read and written without a threading.Lock, creating a TOCTOU race that can break the 1-hour alert throttle under concurrent reviews.
- check_claude_cli in health.py spawns a live subprocess on every health cycle with no caching, while the sibling check_codex_cli properly delegates to a 5-minute TTL cache, an asymmetry the PR's own motivation argues against.

Three informational notes are also included but do not block merging.

Warnings
Notes
Build Results
2 warnings, 3 notes across 5 stages (871s, confidence: high)
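The preflight gate, single retry, and soft-skip summarised in the review above could be wired together roughly as follows. This is a sketch under stated assumptions, not the code in fixer.py: `run_fix_with_soft_skip`, its callable parameters, and the `summary` field are hypothetical, and the real `FixResult` may carry more fields than the two named in the PR.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class FixResult:
    # success and pushed are named in the PR; summary is a hypothetical extra.
    success: bool
    pushed: bool
    summary: str = ""


def run_fix_with_soft_skip(
    health_check: Callable[[], Tuple[bool, str]],
    bootstrap: Callable[[], Tuple[bool, str, str]],
    do_fix: Callable[[], FixResult],
    max_attempts: int = 2,  # first attempt plus one retry
) -> FixResult:
    """Run the fix only if codex is usable; otherwise skip without failing."""
    healthy, detail = health_check()
    if not healthy:
        # Soft-skip: nothing was pushed, but the check does not go red.
        return FixResult(True, False, f"codex unavailable, fix skipped: {detail}")

    stderr_tail = ""
    for _ in range(max_attempts):
        ok, _stdout, stderr_tail = bootstrap()
        if ok:
            return do_fix()

    return FixResult(True, False, f"codex bootstrap failed, fix skipped: {stderr_tail}")
```

Returning `success=True, pushed=False` keeps an infrastructure problem distinct from a failed fix: callers can tell from `pushed` that no commit was made, while the original review verdict stays as it was.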
Findings fixed:
- gate/notify.py: Added `_codex_alert_lock = threading.Lock()` next to the timestamp constants and wrapped the read-check-write of `_last_codex_alert_ts` in `with _codex_alert_lock:` inside `codex_unavailable()`. Network I/O (notify, notify_discord) stays outside the lock so a slow ntfy POST cannot block other workers. Added `import threading` to the module's import block.
- gate/health.py: Added `claude_health_check(force=False) -> tuple[bool, str]` in gate/codex.py mirroring `codex_health_check`: 5-minute TTL cache keyed on (path, mtime), reuse of `_diagnose_macos_gatekeeper` for Gatekeeper diagnostics. Added a parallel `_claude_binary_key()` helper. Refactored `_store_health` to take a cache-key name parameter so both probes share the same cache dict under separate keys. Replaced `check_claude_cli` body in gate/health.py with a thin delegation to `claude_health_check`. Added 2 parity tests (`TestClaudeHealthCheck.test_healthy_claude`, `test_not_in_path`) in tests/test_codex.py.
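The notify.py fix described above follows a common pattern: hold a lock only for the read-check-write of the timestamp, and do the slow network call after releasing it. A minimal sketch of that pattern, with hypothetical names (`should_send_alert`, `_last_alert_ts`, `_alert_lock`) standing in for the PR's own identifiers, and a `None` sentinel chosen here for simplicity:

```python
import threading
import time
from typing import Optional

ALERT_THROTTLE_S = 3600.0          # 1-hour throttle, as in the PR
_last_alert_ts: Optional[float] = None
_alert_lock = threading.Lock()


def should_send_alert(now: Optional[float] = None) -> bool:
    """Return True for at most one caller per throttle window.

    The check and the update happen under one lock, so two concurrent
    callers cannot both observe a stale timestamp and both send.
    """
    global _last_alert_ts
    if now is None:
        now = time.monotonic()
    with _alert_lock:
        if _last_alert_ts is not None and now - _last_alert_ts < ALERT_THROTTLE_S:
            return False
        _last_alert_ts = now
        return True


def alert_if_due(send) -> bool:
    """Call send() outside the lock so a slow POST cannot block other workers."""
    if not should_send_alert():
        return False
    send()
    return True
```

The lock guards only the timestamp, never the I/O, which matches the note above that notify and notify_discord stay outside the critical section.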
Gate Auto-Fix Applied

Fixed 2/2 findings in 1 iteration(s) (d7ce291). Fixed:
Gate Review ✅ Approved

Build is fully clean (1104 passed, 0 failed, lint/typecheck pass). Both prior warnings have been resolved: the threading race on …

Notes
Resolved since last review
Build Results
3 notes across 6 stages (376s, confidence: high)
Summary
- PR #278's `Gate Auto-Fix: failed` was caused by macOS Gatekeeper silently blocking the codex binary in `_dyld_start` for 40 minutes. Gate had no stderr capture, no health probe, no retry, and a single shared 2400s timeout for both bootstrap and long-running calls.
- Adds `BOOTSTRAP_TIMEOUT_S` (120s), stderr capture, `codex_health_check()` with 5-min TTL cache + macOS Gatekeeper diagnostics, preflight + one retry + soft-skip in the fix pipeline, `check_codex_cli` + `check_claude_cli` in periodic health checks, `notify.codex_unavailable` with 1-hour throttle, `gate doctor` Gatekeeper hints, and docs.
- When codex is unavailable, the fix pipeline returns `FixResult(success=True, pushed=False)`, preserving the original review verdict instead of turning the Auto-Fix check red.

Test plan
- … `test_codex.py`, `test_fixer.py`, `test_health.py`)
- `codex --version` returns instantly from `~/.local/bin/codex`
- `codex exec` successfully creates sessions and returns results

Made with Cursor