Add heartbeat_max_test_duration to cap per-test heartbeat duration #397

Merged: ianks merged 2 commits into main from heartbeat-max-test-duration on Apr 17, 2026

ianks commented Apr 9, 2026

After #384 removed the heartbeat countdown (to fix the poisoning bug), a stuck test
heartbeats forever, so other workers can never reclaim it via reserve_lost. This PR
adds a per-test heartbeat cap so the entry eventually goes stale and can be stolen.

What changed

  • --heartbeat-max-test-duration SECONDS CLI flag: once a test has been running
    for this long, the heartbeat thread stops ticking. The test's ZSET score then
    goes stale after max_missed_heartbeat_seconds, and another worker can pick it up.
  • Defaults to timeout * 10 when heartbeat is enabled, so existing setups get
    reasonable behavior out of the box.
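
A minimal sketch of the arithmetic behind the cap, not the PR's actual code. The helper names are hypothetical. The strict `<` comparison, and the model in which an entry is stale once its last tick is older than max_missed_heartbeat_seconds, are both assumptions.

```ruby
# Hypothetical helpers for illustration; none of these names come from the PR.

# Default cap: ten times the per-test timeout.
def default_heartbeat_max_test_duration(timeout)
  timeout * 10
end

# True while the heartbeat thread should keep refreshing the entry.
# The strict comparison at the boundary is an assumption.
def keep_ticking?(started_at, now, max_test_duration)
  (now - started_at) < max_test_duration
end

# Upper bound on when the entry becomes stale: the last tick happens no
# later than the cap, and the score then ages by max_missed_heartbeat_seconds.
def stale_no_later_than(started_at, max_test_duration, max_missed_heartbeat_seconds)
  started_at + max_test_duration + max_missed_heartbeat_seconds
end

cap = default_heartbeat_max_test_duration(30) # => 300
keep_ticking?(0.0, 299.0, cap)                # => true
keep_ticking?(0.0, 300.0, cap)                # => false
stale_no_later_than(0.0, cap, 10)             # => 310.0
```

With a 30-second timeout and a 10-second missed-heartbeat window, a stuck test would become stealable at most 310 seconds after it started under these assumptions.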

Timing gotcha

I hit a subtle bug while building this: the heartbeat thread can miss the initial
:tick broadcast (the condition variable is signaled before the thread calls wait),
so the first tick can land up to 1 second late. If elapsed time is measured from
when the thread first woke up, the cap is reached up to 1 second late and the stale
threshold shifts with it. That can put the threshold right at the moment the stuck
test naturally finishes, leaving no steal window.
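
The missed broadcast can be reproduced in isolation with Ruby's core Mutex and ConditionVariable. This is a standalone sketch, not the project's heartbeat code. The method name is hypothetical, and the timeout is shortened from 1 second to 0.05 so the example runs quickly.

```ruby
# Hypothetical demo: a broadcast sent before anyone waits is not queued,
# so a later timed wait returns only when its timeout expires.
def wait_after_missed_broadcast(timeout)
  mutex = Mutex.new
  cond  = ConditionVariable.new

  # No thread is waiting yet, so this wakes nobody and is lost.
  cond.broadcast

  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  mutex.synchronize { cond.wait(mutex, timeout) }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

waited = wait_after_missed_broadcast(0.05)
puts format("waited %.3fs for a signal that had already fired", waited)
```

The waiter sits out roughly the full timeout even though the signal was sent, which is the delay the first tick can inherit.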

The fix is to pass started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
through the tick state from with_heartbeat, so the elapsed calculation is always
anchored to when the test actually started.
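
A rough sketch of the difference between the two anchors. TickState, its fields, and the helper names are hypothetical stand-ins for the real tick state. Only Process.clock_gettime with Process::CLOCK_MONOTONIC is taken from Ruby itself.

```ruby
# Hypothetical tick state; the field names are assumptions for illustration.
TickState = Struct.new(:test_id, :started_at)

def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Elapsed test time, anchored to the start recorded by the caller
# rather than to whenever the heartbeat thread happened to wake up.
def elapsed(state, now = monotonic_now)
  now - state.started_at
end

# Fixed timestamps make the skew visible: the test starts at 100.0,
# the heartbeat thread first wakes at 101.0, and we check at 110.0.
state          = TickState.new("example_test", 100.0)
thread_woke_at = 101.0
now            = 110.0

anchored = elapsed(state, now)  # => 10.0
skewed   = now - thread_woke_at # => 9.0
```

A monotonic clock suits this measurement because it is not affected by wall-clock adjustments.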

closes #395

ianks force-pushed the heartbeat-max-test-duration branch from c0af2a3 to 344a97c on April 12, 2026 03:20
ianks added 2 commits April 17, 2026 15:27
…ost-test reclamation

A stuck test would heartbeat forever (since PR #384 removed the countdown),
preventing other workers from reclaiming it via reserve_lost.

Add --heartbeat-max-test-duration N (defaults to timeout*10) so the heartbeat
thread stops ticking after N seconds per test. Once ticking stops the ZSET
score goes stale and reserve_lost can steal the entry.

The started_at timestamp is passed through the tick state from with_heartbeat
so elapsed is measured from when the test actually started, not from when the
heartbeat thread woke up (which could be up to 1 second late because of the
@cond.wait(1) timeout, skewing the stale threshold).

Fixes #395
ianks force-pushed the heartbeat-max-test-duration branch from 8b1a520 to c321e91 on April 17, 2026 19:27
ianks merged commit 470b2b0 into main on Apr 17, 2026
22 checks passed

Linked issue: Add max_test_duration cap to heartbeat thread