test(e2e): auto-retry transient python-e2e flakes (pytest-rerunfailures) #278
Merged
Conversation
The python-e2e suite drives a real server + real LLM, so individual tests flake nondeterministically on transient conditions — workflow still RUNNING at the client timeout, tool-call batches not returning, LLM phrasing variance. A single transient failure currently fails the whole job; observed runs failed a *different* unrelated test each time (test_after_tool_callback_executes, test_http_lifecycle@credentials, ...). Add pytest-rerunfailures and mark every e2e item flaky(reruns=2, reruns_delay=5) via the e2e conftest. A genuinely broken test still fails all 3 attempts; a one-off flake recovers. Configured in conftest (not the CI yaml) so it also covers local e2e runs and needs no workflow-file change. dev extra + uv.lock updated (pytest-rerunfailures 16.3).
Summary
The `python-e2e` CI job drives a real server + real LLM, so individual tests flake nondeterministically, and a single transient failure fails the whole job. While verifying an unrelated fix (#277), consecutive runs each failed a different, unrelated test:

- `test_after_tool_callback_executes`
- workflow still `RUNNING` at client timeout
- `test_http_lifecycle@credentials`

These are transient infra/LLM-latency flakes, not regressions.
Fix
Add `pytest-rerunfailures` and mark every e2e item `flaky(reruns=2, reruns_delay=5)` via the e2e `conftest.py` (`pytest_collection_modifyitems`, scoped to items carrying the `e2e` marker). A genuinely broken test still fails all 3 attempts (the initial run plus 2 reruns), so no real regression is masked, while a one-off flake recovers.

Configured in `conftest` rather than the CI YAML so it also applies to local e2e runs (and needs no workflow-file change).

Verification
- `pytest-rerunfailures` imports; the dynamically-added `flaky` marker carries `{reruns: 2, reruns_delay: 5}`.
- `pytest e2e/ --collect-only` collects 115 tests with the hook in place (the 2 unrelated collection errors are a missing local-only `mcp_test_server` dep that CI installs separately).
- The hook only touches items carrying the `e2e` marker, so unit suites are unaffected.

Note
This is the real fix for "`python-e2e` flakes on a different test each run", complementing #277 (which fixes a genuine, deterministic guardrail bug). Recommend merging #277 first, then this.
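
For reference, the collection hook described under Fix might look roughly like the following. This is a minimal sketch, not the merged code: the constant names `RERUNS` and `RERUNS_DELAY` are illustrative, and it assumes the hook lives in the e2e `conftest.py` and that e2e tests already carry an `e2e` marker, as the description states.

```python
# Sketch of an e2e/conftest.py hook (the merged file may differ).
import pytest

# Values taken from the PR description; the constant names are illustrative.
RERUNS = 2        # reruns after the first failure, so up to 3 attempts in total
RERUNS_DELAY = 5  # seconds to wait between attempts


def pytest_collection_modifyitems(config, items):
    """Attach the pytest-rerunfailures `flaky` marker to every e2e-marked item."""
    for item in items:
        # Skip anything without the `e2e` marker so unit suites keep failing fast.
        if item.get_closest_marker("e2e") is None:
            continue
        item.add_marker(
            pytest.mark.flaky(reruns=RERUNS, reruns_delay=RERUNS_DELAY)
        )
```

Because the marker is added at collection time, the retry policy travels with the test suite itself: a plain `pytest e2e/` run locally gets the same behavior as CI, provided `pytest-rerunfailures` is installed.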