fix(qa): timeout-guard the scorer claude -p call (stop intermittent hangs) #1039
…angs blocking forever)
score.sh retries on unparseable/transient output but had NO wall-clock bound on claude -p, so a
genuine hang (stuck stream / slow response that never returns) blocked the entire run forever —
seen repeatedly on combat-sprint + north-star scoring (the run fights+gates GREEN, then scoring
hangs). Wrap claude -p in `timeout ${WORLDOS_SCORE_TIMEOUT:-300}` so a hang becomes empty output
that the existing retry loop catches. Healthy scores are ~60-150s → never fires on a good call.
Unblocks the whole QA feedback loop (every scored run depends on this).
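The interaction between the timeout and the retry loop can be sketched roughly as follows. This is an illustration only, not the contents of score.sh: `score_once`, `score_with_retry`, and `SCORER_CMD` are hypothetical names, the `env -u ...` prefix from the real call is omitted, and the sketch assumes a `timeout` binary (GNU coreutils) is on `PATH`.

```shell
# Sketch only. score_once, score_with_retry and SCORER_CMD are hypothetical
# names, not taken from qa/score.sh.
SCORE_TIMEOUT="${WORLDOS_SCORE_TIMEOUT:-300}"   # seconds, same default as the PR
: "${SCORER_CMD:=claude -p}"                    # override to simulate a hang

score_once() {
  # timeout(1) kills the command after SCORE_TIMEOUT seconds and exits 124.
  # $SCORER_CMD is left unquoted on purpose so it splits into command + args.
  printf '%s' "$1" | timeout "$SCORE_TIMEOUT" $SCORER_CMD
}

score_with_retry() {
  local input="$1" attempts="${2:-3}" i=1 raw
  while [ "$i" -le "$attempts" ]; do
    # Treat a non-zero exit (including 124 on timeout) like empty output.
    raw="$(score_once "$input" 2>/dev/null)" || raw=""
    if [ -n "$raw" ]; then
      printf '%s\n' "$raw"
      return 0
    fi
    echo "attempt $i/$attempts: no usable output" >&2
    i=$((i + 1))
  done
  return 1
}
```

With `SCORER_CMD="sleep 5"` and `SCORE_TIMEOUT=1`, each attempt is cut off after one second and the loop gives up after the configured number of attempts instead of blocking. Discarding partial output on a non-zero exit is a choice made for this sketch; the real script may handle that case differently.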
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e4676a3bea
  # fires on a good call — it only rescues a genuine hang.
  printf '%s' "$INPUT" | env -u ANTHROPIC_BASE_URL -u ANTHROPIC_API_KEY -u ANTHROPIC_AUTH_TOKEN \
-   -u API_TIMEOUT_MS -u CLAUDE_CONFIG_DIR claude -p \
+   -u API_TIMEOUT_MS -u CLAUDE_CONFIG_DIR timeout "${WORLDOS_SCORE_TIMEOUT:-300}" claude -p \
Use worldos_timeout instead of bare timeout
On stock macOS the bare timeout command added here is not available. This scorer is called from the local QA lanes such as run_duo.sh and run_combat_sprint.sh; in that environment env ... timeout ... exits 127 before claude starts, which leaves $RAW empty on every retry and makes every score fail. This reintroduces the coreutils dependency that the repo already fixed for the play lanes with the worldos_timeout/python fallback, so the scorer timeout guard needs to use that shim or an equivalent fallback.
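One possible shape for such a fallback is sketched below. It is not the repo's worldos_timeout, whose implementation is not shown here: `portable_timeout` and `py_timeout` are hypothetical names, and the sketch assumes `python3` is available wherever neither `timeout` nor Homebrew's `gtimeout` is installed.

```shell
# Hypothetical fallback shim; not the repo's worldos_timeout.
py_timeout() {
  # Usage: py_timeout SECONDS CMD [ARGS...]
  # The script is passed with -c (not a heredoc) so the child still
  # inherits stdin from the calling pipeline.
  python3 -c '
import subprocess, sys
try:
    sys.exit(subprocess.run(sys.argv[2:], timeout=float(sys.argv[1])).returncode)
except subprocess.TimeoutExpired:
    sys.exit(124)  # mirror the exit code GNU timeout uses on expiry
' "$@"
}

portable_timeout() {
  # Usage: portable_timeout SECONDS CMD [ARGS...]
  local secs="$1"
  shift
  if command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"
  elif command -v gtimeout >/dev/null 2>&1; then
    gtimeout "$secs" "$@"      # coreutils installed via Homebrew
  else
    py_timeout "$secs" "$@"
  fi
}
```

A call would then look something like `printf '%s' "$INPUT" | portable_timeout 300 env -u ANTHROPIC_API_KEY claude -p`. Note the ordering: `env` can only exec external programs, so a shim written as a shell function has to wrap the `env` call rather than follow it as the bare `timeout` does in the diff.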
…ty (#1038) + scorer timeout-guard (#1039) (#1041) Test-proven code checkpoint on rc2. Still NOT a GA — the GLM re-measure under the current ruler is deferred (scorer hangs on combat-sprint transcripts, #1040). Story above bar at depth (old ruler), mech the real gap, satisfaction green. Co-authored-by: Eva <arncalso@gmail.com>
The Angry-DM/lens scorer (`qa/score.sh`) retries on unparseable output but had no timeout on `claude -p`, so an intermittent hang (stuck stream / slow response) blocked the entire run forever. Seen repeatedly: the run fights + gates GREEN, then scoring hangs (combat-sprint, north-star). One-line fix: wrap `claude -p` in `timeout ${WORLDOS_SCORE_TIMEOUT:-300}` so a hang becomes empty output the existing retry loop catches. Healthy scores are ~60–150s, so it never fires on a good call. Unblocks the QA feedback loop. Shell-only (`bash -n` clean).