Skip to content

Don't count requeued test failures toward circuit breaker#378

Merged
ianks merged 2 commits intomainfrom
ianks/fix-circuit-breaker-requeue-interaction
Mar 25, 2026
Merged

Don't count requeued test failures toward circuit breaker#378
ianks merged 2 commits intomainfrom
ianks/fix-circuit-breaker-requeue-interaction

Conversation

@ianks
Copy link
Copy Markdown
Contributor

@ianks ianks commented Mar 25, 2026

Summary

the circuit breaker (max_consecutive_failures) was counting requeued test failures as real failures. this meant a worker that hit a few flaky tests in a row would get killed even though the tests were being retried, and if that was the last active worker, those requeued tests just sat in the queue forever. build would finish reporting 0 failures.

before (max_consecutive_failures=3):

                        breaker sees     reality
  test A fails          failure #1       requeued (handled)
  test B fails          failure #2       requeued (handled)
  test C fails          failure #3       requeued (handled)
                        KILL worker

  worker dead. tests are back in queue but no workers left.
  build finishes: 0 failures.

after:

                        breaker sees     reality
  test A fails          (nothing)        requeued (handled)
  test B fails          (nothing)        requeued (handled)
  test C fails          (nothing)        requeued (handled)
  ...
  test A fails again    failure #1       requeue exhausted
  test B fails again    failure #2       requeue exhausted
  test C fails again    failure #3       requeue exhausted
                        KILL worker

  worker dead, but 3 failures are actually recorded.

Changes

  • moved report_failure! / report_success! after the requeue check in handle_test_result
  • requeued tests are now invisible to the circuit breaker. they dont increment or reset the consecutive failure counter
  • only tests that actually fail (requeue exhausted or not attempted) count

Tradeoff

The tradeoff here is that workers which are legitimately in a corrupted state will requeue more tests than before, making a test suite more likely to fail rather than a worker being killed. Ultimately, the tradeoff is that this "less resilient to worker corruption" behavior is better than the current "can silently skip failing tests" behavior.

@ianks ianks force-pushed the ianks/fix-circuit-breaker-requeue-interaction branch 2 times, most recently from 1b5962d to 4599469 Compare March 25, 2026 03:37
@ianks ianks requested review from ChrisBr and gmalette March 25, 2026 03:41
@ianks ianks force-pushed the ianks/fix-circuit-breaker-requeue-interaction branch 4 times, most recently from ab800f4 to 6f05279 Compare March 25, 2026 03:57
@ianks ianks marked this pull request as ready for review March 25, 2026 04:06
Copy link
Copy Markdown

@dpetrick dpetrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasoning sounds good to me, I'd prefer someone with internal knowledge to confirm if possible, but I think it's good to go.

ianks added 2 commits March 25, 2026 15:37
report_failure! was called before the requeue check, so successfully
requeued tests incremented the consecutive failure counter. With
max_consecutive_failures=N and N+ deterministic failures that are all
requeued, the circuit breaker fired prematurely and the worker exited,
stranding requeued tests in the queue with no worker to process them.

Now report_failure!/report_success! is only called when a test genuinely
fails or passes. Requeued tests are invisible to the circuit breaker —
they don't increment or reset the consecutive failure counter.
@ianks ianks force-pushed the ianks/fix-circuit-breaker-requeue-interaction branch from ce68f99 to 3f25c0c Compare March 25, 2026 19:37
@ianks ianks merged commit 5372458 into main Mar 25, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants