feat(gate): strict pre-push verification before commit_and_push#35
Conversation
build_verify already runs test_cmds, but compare_builds accepts test failures whose count did not increase (pre-existing tolerance). Husky-style pre-push hooks run strict and reject any failure, so fixes that Gate's tolerant view approved were being rejected at git push time (PR #399 motivation). This change adds a strict pass/fail gate between re-review pass and _commit_and_finish that mirrors what the remote hook will face, with hybrid classification: - Pass: short-circuit to commit (zero extra cost — reuses cached build_verify result). - Regression (count increased, missing baseline, parse failure, or generic-test project_type): resume Codex with the failing-test context and retry once within the same iteration; on continued failure fall through to the next iteration WITHOUT reverting so the partial fix is preserved. - Pre-existing (count unchanged on a reliable parser): fail fast with a clear "fix manually" message — Gate does not auto-fix tests it did not break. Wiring: - build_verify gains a strict_pass field (and the raw build_result) alongside its tolerance-applied pass field. - pre_push_verify reuses strict_pass when pre_push_verify_cmds is unset, and runs the explicit list otherwise (replacement of test_cmds — covers repo hooks that do more than tests, e.g. contract drift checks). - MAX_ITERATIONS bumped 2 → 3 (config-driven via new get_fix_max_iterations) to absorb a worst-case third iteration when the pre-push gate trips. Configuration (under [repos.build]): - pre_push_strict (default: true) - pre_push_disable (default: false — escape hatch) - pre_push_verify_cmds (default: [] — falls back to test_cmds) - pre_push_timeout_s (default: 600) Strict-by-default activates only when test_cmds (or pre_push_verify_cmds) is configured, so OSS users without tests are unaffected. Documented in gate.toml.example. Co-authored-by: Cursor <cursoragent@cursor.com>
Gate Review ❌Changes requested — The PR introduces a well-structured pre-push verification gate with solid test coverage (1308/1308 passing, 0 failures), clean lint/typecheck, and no security issues. However, the Logic stage confirmed a test-validated bug on the explicit-cmds path: Errors
Warnings
Notes
Build Results
1 error, 2 warnings, 2 notes across 6 stages (1096s, confidence: high) |
Gate Review SummaryVerdict: ❌ Changes requested · confidence The PR introduces a well-structured pre-push verification gate with solid test coverage (1308/1308 passing, 0 failures), clean lint/typecheck, and no security issues. However, the Logic stage confirmed a test-validated bug on the explicit-cmds path: Blocking findings (1)
Non-blocking notes (4)
Build
Stats1 blocking · 4 other · 6 stages run Next steps
How to override
Updated 2026-05-18 16:56 UTC · sticky comment, replaced in-place on each review cycle. |
…h path Address Gate review of PR #35. The default path of pre_push_verify correctly derives has_parse_failure from build_result["tests"] ["parse_failure"] (set by compile_build's synthesized-block at builder.py:161-189). The explicit-cmds path bypasses compile_build and called _parse_test / _parse_pytest / _parse_generic_test directly, then hardcoded has_parse_failure=False. The asymmetry meant a crashed test runner (segfault, OOM, syntax error in setup) that exited non-zero with no parseable failure count would compare as ``after.failed (0) <= baseline.failed (0)`` and route to pre_existing → fail fast. Correct behavior is to route to regression (safer side) since we genuinely don't know whether the original broke or Gate's edit did. Fix: synthesize the parse_failure signal from the parser output on the explicit-cmds path using the same heuristic compile_build uses — non-zero exit AND zero parseable failures AND zero parseable total. This matches the default path's behavior exactly without coupling either to compile_build's internal block. Adds two regression tests: - test_unparseable_crash_routes_to_regression: crash scenario, baseline=0, after=parseable-zero → must be regression. - test_explicit_cmds_pre_existing_when_counts_match: counterfactual, parser extracted real counts that match baseline → pre_existing (confirms the heuristic is precise, not over-broad). Co-authored-by: Cursor <cursoragent@cursor.com>
Review response — commit 48dff54Error (parse_failure asymmetry) — FIXEDReal defect, test-confirmed by Gate. The default path of Fix synthesizes the same signal from the parser output on the explicit-cmds path: Added two regression tests:
Warnings (gate init discoverability) — DEFERREDBoth
This PR doesn't make those gaps worse — the new keys are documented in Will open a follow-up issue for Note 1 (push without re-rereview) — ACCEPTED RISK, documentedThe plan's design decision was explicit: same-iteration retry only, no second re-review on the retry path. Rationale in the plan: re-running re-review on every retry doubles quota cost on the common case for a narrow class of risk (a Codex retry going broadly off-script when given a targeted test-fix prompt). The plan calls this out as a deliberate trade-off. Note 2 (misleading iter N+1 resume prompt) — DEFERREDReal artifact issue. Fixing it cleanly requires cross-iteration state (e.g., persisting the pre-push error context to a file the iter 2+ prompt selector reads), which the plan explicitly avoided to keep the design simple. Codex behaves correctly in practice because session context carries the prior pre-push prompt forward; only the operator-facing Will track as a follow-up — the right fix is probably writing a Verification on the fix commit:
|
Summary
_commit_and_finishthat mirrors what a remote pre-push hook (e.g. Husky) will face atgit pushtime. Fixes the class of bug PR #399 hit:build_verifyaccepted pre-existing test failures under its tolerance, then Husky'svitest runrejected the push because hooks don't have that tolerance.build_verifynow exposesstrict_passalongsidepass, andpre_push_verifyreuses it by default. Subprocess only fires whenpre_push_verify_cmdsis explicitly set (e.g. to add contract-drift checks beyondtest_cmds).Motivation
PR #399 activity log shows the exact divergence:
Gate already had the data to detect this earlier —
build_result["overall_pass"]is computed strictly before tolerance is applied atfixer.py:296-302. It was then discarded.Design
Classification rules
strict_pass == Truepassstrict_pass == FalseAND (original_buildmissing OR parse failure OR project_type uses_parse_generic_test)regression(uncertain → safer side)strict_pass == FalseANDafter.failed > baseline.failedregressionstrict_pass == FalseANDafter.failed <= baseline.failed(reliable parser)pre_existingIteration handling
pre_push_verify. If still failing, fall through to next iteration without reverting (partial fix preserved). Mirrors the existingbuild_verifysame-iteration retry pattern.Configuration
New keys under
[repos.build](all optional, documented ingate.toml.example):pre_push_strict(default:true)pre_push_disable(default:false) — escape hatchpre_push_verify_cmds(default:[]) — when set, replacestest_cmdsfor the gate. Used for hooks that do more than tests (e.g. contract drift).pre_push_timeout_s(default:600)New keys under
[fix_pipeline]:max_iterations(default:3, with per-repo override[[repos]].max_fix_iterations)Open-source posture
Strict-by-default activates only when
test_cmds(orpre_push_verify_cmds) is configured. Repos without tests are unaffected. Repos with tests can opt out viapre_push_disable = true. Replacement-vs-additive semantics forpre_push_verify_cmdsare documented explicitly so users don't silently drop coverage when overriding.Test plan
tests/test_fixer.py,tests/test_fix_pipeline_config.py)tests/test_fix_pipeline_pre_push.py)1314 passed in 184smax_iter=3_fail_pre_existing_testspathRollback
pre_push_disable = truein[repos.build]. Zero-code revert.strict_passfield onbuild_verifyis additive and harmless to leave in.Out of scope (follow-up)
compare_buildsonly checkspassbooleans for tc/lint (not counts) — same class of bug exists there but is caught earlier by thebuild_verifyretry loop, so user-visible impact is smaller. Separate plan.Made with Cursor