Skip to content

test(#100): fix e2e flakiness — condition-based polling instead of fixed sleeps#102

Merged
lis186 merged 3 commits into
mainfrom
worktree-fix-100-e2e-flaky
Jun 21, 2026
Merged

test(#100): fix e2e flakiness — condition-based polling instead of fixed sleeps#102
lis186 merged 3 commits into
mainfrom
worktree-fix-100-e2e-flaky

Conversation

@lis186

@lis186 lis186 commented Jun 21, 2026

Copy link
Copy Markdown
Owner

Closes #100.

Root cause (5 Whys)

The proxy writes logs from a spawned child process. Several e2e tests read index.ndjson / log files after a fixed await setTimeout(500) and then assert. Under the parallel full-suite run (node --test test/*.test.js), many e2e files each spawn a real server, saturating CPU/IO, so the child's async write sometimes lands after the guessed delay → the read finds nothing → expected OpenAI index entry and friends. Each test passes in isolation; the suite failed a different subtest run-to-run.

The codebase already had the right pattern (waitForPort polls /_api/health; waitForIndexEntry polls the index) — it just wasn't applied to these sleep-then-read sites.

Change

  • Add poll-until helpers: waitFor + readIndex (startup.test.js), waitForIndexEntries (websocket-proxy.test.js), modeled on the existing pollers.
  • Convert the sleep-then-read sites to wait for the actual condition (index entry present, file count reached, sysprompt API / SSE snapshot state) instead of a fixed delay.
  • Bonus: faster fast-path — an early write returns immediately rather than always sleeping 500ms.

Scope: only the proven read-gating waits. A few unrelated fixed delays remain (error-recovery health checks, inter-turn WS pacing, teardown settle) — they aren't the observed flaky class and the terminal condition-poll now absorbs upstream timing.

Verification

CCXRAY_HOME=$(mktemp -d) node --test test/*.test.js6 consecutive runs, 947 pass / 0 fail, zero flaky, ~48s each. Files also pass in isolation (startup 62/62, ws 17/17).

Note: while verifying I hit an unrelated artifact — running the suite back-to-back through the npm/shell-pipe wrapper occasionally failed to return even though node --test itself completed and printed its summary. Running node --test directly exits cleanly every time. Not caused by this change; flagging in case it recurs in CI.

🤖 Generated with Claude Code

Justin Lee and others added 2 commits June 21, 2026 13:04
The proxy writes logs from a spawned child process, so tests can only
observe completion by polling the filesystem/API. Reading index.ndjson or
log files after a fixed `setTimeout(500)` flaked under parallel-suite load:
the guessed delay sometimes elapsed before the cross-process write landed,
e.g. "expected OpenAI index entry" in startup.test.js. The suite failed a
different e2e subtest run-to-run; each passed in isolation.

Adds poll-until helpers (waitFor/readIndex in startup.test.js,
waitForIndexEntries in websocket-proxy.test.js) modeled on the existing
waitForPort/waitForIndexEntry, and converts the sleep-then-read sites to
wait for the actual condition (entry present, file count, API/SSE state).
This also speeds up the fast path — a write that lands early returns
immediately instead of always waiting 500ms.

Verified: CCXRAY_HOME=$(mktemp -d) node --test test/*.test.js, 6 consecutive
runs, 947/0, zero flaky, ~48s each.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iew)

Codex review of #102 flagged that waitFor awaited fn() before checking the
deadline, so an async predicate that never settles (e.g. httpGet to a dead
proxy) could hang the test instead of timing out. Race fn() against the
remaining budget and clear the loser timer so it doesn't linger on the event
loop. Also give httpGet its own 5s socket timeout so a silent no-response
rejects (and frees the socket) rather than blocking forever.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lis186

lis186 commented Jun 21, 2026

Copy link
Copy Markdown
Owner Author

Codex 二審:no blocker. One should-fix addressed in f46cbea.

  • should-fixwaitFor awaited fn() before checking the deadline, so an async predicate that never settles (e.g. httpGet to a dead proxy) could hang the test instead of timing out. Fixed by racing fn() against the remaining budget (with clearTimeout so the loser timer doesn't linger), plus a 5s socket timeout on httpGet so a silent no-response rejects and frees the socket.

Re-verified: 9 consecutive full-suite runs via node --test test/*.test.js (empty CCXRAY_HOME), all 947/0, zero flaky.

@lis186 lis186 merged commit b9d6531 into main Jun 21, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test: replace fixed setTimeout waits with condition-based polling to fix e2e flakiness

1 participant