Background
The self-hosted AI checks workflow shipped in #122 is LLM-based, which means every prompt edit (.ai-checks/*.md), runner change (.github/scripts/run_ai_check.py), and model upgrade (AI_CHECK_MODEL) is a quiet regression risk. There's currently no automated way to know whether a prompt tweak that fixes one rule's false negative breaks the other eight, or whether bumping from claude-sonnet-4-6 to a future model holds the verdict matrix.
The manual one-time validation in #123 surfaced this gap concretely: 7 / 9 rules fired correctly on planted violations, 2 missed (asgi-wsgi-scott PASS, branching-and-pr-strategy SKIP — see #123 (comment)). The fixes for those misses live in #125. But after #125 lands we have no way to confirm those rules now fire correctly and the other seven still do — short of re-doing the manual eval PR every time, which won't happen.
This issue is the eval harness for the AI checks themselves: development-level tooling, distinct from #115 (which is the eval harness for application-runtime LLM behaviour — the Fetch Details agent, Scott retrieval quality, etc.). Different concern, different lane, deliberately separate harness.
Why a v0 dataset is already sitting here
The eval branch from #123 (eval/ai-checks-violations) contains exactly the dataset shape we want: one commit per rule, each carrying a planted violation that the rule should flag with a fail verdict. The 9 tuples are:
| Rule | Planted-violation commit | Expected verdict |
| --- | --- | --- |
| pipeline-step-sync | new VERIFYING step in episodes/models.py; README/doc untouched | fail |
| env-var-sync | os.getenv("RAGTIME_FAKE_THING") read; .env.sample / configure.py untouched | fail |
| qdrant-payload-slim | episode_title added to _build_payloads dict | fail |
| entity-creation-race-safety | Entity.objects.create(...) outside _get_or_create_entity | fail |
| comment-discipline | # Added for issue #117 to handle the X case next to a function | fail |
| gh-api-shell-escaping | gh api -f body with bare backticks, not in heredoc | fail |
| asgi-wsgi-scott | chat/views.py; PR description mentions only manage.py runserver | fail |
| branching-and-pr-strategy | feature-pr-docs | fail |
Plus a 10th implicit case: a clean PR (no planted violations) where every rule should pass or skip. That's the no-false-positives signal.
Proposal
1. Dataset format
Capture each tuple as a structured fixture instead of a live git branch — git branches rot, and we want the harness to be hermetic. Each case lives in episodes/tests/fixtures/ai_checks/cases/<rule-slug>/:
diff.patch — the unified diff for the planted violation (what git diff base...HEAD would produce).
metadata.yaml — branch name, base ref, PR title, PR body (mock), commit log (git log --graph --oneline base..HEAD). Necessary for rules like branching-and-pr-strategy that depend on PR/branch metadata once #125 (AI checks driver: pass PR metadata, surface model used, investigate Feature PR Docs false positive) lands.
expected.yaml — verdict: fail, optional must_cite: [<file paths or symbols the details body should reference>], optional must_not_cite: [<files belonging to other rules' fixtures, to detect cross-fires>].
Plus episodes/tests/fixtures/ai_checks/cases/clean-pr/ with an empty/no-violation diff and expected.verdict: pass (or skip).
Backfill the initial dataset from the eval PR (#127) before closing it, so the diffs and PR description are captured exactly as the workflow saw them.
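To make the fixture shape concrete, a minimal loading sketch follows; the CaseFixture dataclass, the load_case helper, and the exact field names are hypothetical illustrations, not part of the proposal — they simply mirror the three files described above.

```python
# Sketch: load one fixture case from the proposed directory layout.
# CaseFixture and load_case are hypothetical names.
from dataclasses import dataclass, field
from pathlib import Path

import yaml

CASES_DIR = Path("episodes/tests/fixtures/ai_checks/cases")

@dataclass
class CaseFixture:
    rule: str              # rule slug, matches .ai-checks/<rule>.md
    diff: str              # unified diff exactly as the workflow saw it
    metadata: dict         # branch name, base ref, PR title/body, commit log
    expected_verdict: str  # "fail" for planted violations; "pass"/"skip" for clean-pr
    must_cite: list = field(default_factory=list)
    must_not_cite: list = field(default_factory=list)

def load_case(case_dir: Path) -> CaseFixture:
    expected = yaml.safe_load((case_dir / "expected.yaml").read_text())
    return CaseFixture(
        rule=case_dir.name,
        diff=(case_dir / "diff.patch").read_text(),
        metadata=yaml.safe_load((case_dir / "metadata.yaml").read_text()),
        expected_verdict=expected["verdict"],
        must_cite=expected.get("must_cite", []),
        must_not_cite=expected.get("must_not_cite", []),
    )
```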
2. Runner
episodes/tests/test_ai_checks_eval.py — pytest, parametrised over (rule, case) pairs. For each pair (see the sketch after this list):
Load the rule body from .ai-checks/<rule>.md.
Load the case's diff.patch + metadata.yaml.
Invoke a thin wrapper around .github/scripts/run_ai_check.py's evaluate() function (refactor it slightly so the diff and metadata can be passed in directly, instead of being read from git diff and env vars).
Assert verdict == expected.verdict.
Assert each entry in must_cite appears in details.
Assert no entry in must_not_cite appears in details (cross-fire detection).
Behind a @pytest.mark.ai_checks_eval marker so it doesn't run on every commit (LLM cost). On-demand + scheduled CI only.
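A sketch of that runner, under stated assumptions: evaluate() has been refactored to accept rule text, diff, and metadata directly and returns an object with verdict and details attributes (both assumptions), load_case comes from the hypothetical loader above, and the pairing logic (each rule against its own case, every rule against clean-pr) is one possible reading of the matrix, not the final design.

```python
# Sketch of episodes/tests/test_ai_checks_eval.py.
from pathlib import Path

import pytest

from episodes.tests.fixtures.ai_checks.loader import load_case  # hypothetical module

RULES_DIR = Path(".ai-checks")
CASES_DIR = Path("episodes/tests/fixtures/ai_checks/cases")

def _pairs():
    """(rule, case) pairs: each rule against its own planted violation,
    plus every rule against clean-pr (the no-false-positives signal)."""
    pairs = []
    for case_dir in sorted(p for p in CASES_DIR.iterdir() if p.is_dir()):
        if case_dir.name == "clean-pr":
            pairs += [(rule.stem, case_dir) for rule in sorted(RULES_DIR.glob("*.md"))]
        else:
            pairs.append((case_dir.name, case_dir))
    return pairs

@pytest.mark.ai_checks_eval  # keeps live-LLM tests out of the default run
@pytest.mark.parametrize("rule, case_dir", _pairs(), ids=lambda v: getattr(v, "name", v))
def test_rule_verdict(rule, case_dir):
    from run_ai_check import evaluate  # .github/scripts put on sys.path via conftest

    case = load_case(case_dir)
    rule_text = (RULES_DIR / f"{rule}.md").read_text()
    result = evaluate(rule=rule_text, diff=case.diff, metadata=case.metadata)

    assert result.verdict == case.expected_verdict
    for needle in case.must_cite:
        assert needle in result.details
    for needle in case.must_not_cite:  # cross-fire detection
        assert needle not in result.details
```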
3. CI workflow
.github/workflows/ai-checks-eval.yml, separate from the existing ai-checks.yml:
Triggers: workflow_dispatch, plus pull_request filtered to .ai-checks/** and .github/scripts/run_ai_check.py and .github/workflows/ai-checks*.yml paths only — so prompt and runner changes get a regression check, but normal feature PRs don't burn LLM budget.
Matrix over AI_CHECK_MODEL candidates (current default + at least one alternate) so we can compare verdict matrices across models when bumping. Stretch goal, can ship single-model first.
Posts a summary table to $GITHUB_STEP_SUMMARY: per (rule × case × model) verdict + pass/fail vs expected.
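One possible shape for the summary step, kept in Python for consistency with the runner; GITHUB_STEP_SUMMARY is the file path GitHub Actions provides to every step, while the results row format here is invented for the sketch.

```python
# Sketch: append the verdict matrix to the job's step summary as a markdown table.
import os

def write_step_summary(results):
    """results: iterable of (rule, case, model, verdict, expected) tuples (assumed shape)."""
    lines = [
        "| Rule | Case | Model | Verdict | Expected | OK |",
        "|---|---|---|---|---|---|",
    ]
    for rule, case, model, verdict, expected in results:
        ok = "✅" if verdict == expected else "❌"
        lines.append(f"| {rule} | {case} | {model} | {verdict} | {expected} | {ok} |")
    with open(os.environ["GITHUB_STEP_SUMMARY"], "a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
```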
4. Cost / hermeticity
LLM calls are live (no mocking — the whole point is to test the actual model's judgement). Per-run cost is bounded: 9 rules × ~10 cases × N models. With claude-sonnet-4-6 and current dataset size that's well under $1/run.
No HTTP fixtures needed — the AI checks runner only talks to the LLM provider, not to GitHub. Unlike #115, no VCR cassettes required.
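Because the calls are live, the suite should stay opt-in locally too. One way to enforce that is a conftest.py hook like the following; the marker name comes from the proposal, while the skip-by-default policy is an assumption about how the team wants plain pytest runs to behave.

```python
# conftest.py sketch: register the marker and skip ai_checks_eval tests unless
# they were explicitly selected with -m, so a plain `pytest` run never spends LLM budget.
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "ai_checks_eval: live-LLM eval of the AI checks (costs money)"
    )

def pytest_collection_modifyitems(config, items):
    if config.getoption("-m"):
        return  # explicit marker selection (e.g. -m ai_checks_eval): run as asked
    skip = pytest.mark.skip(reason="live LLM eval; run with -m ai_checks_eval")
    for item in items:
        if "ai_checks_eval" in item.keywords:
            item.add_marker(skip)
```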
Acceptance criteria
episodes/tests/fixtures/ai_checks/cases/<rule>/ populated with the 9 planted-violation tuples from #127 (eval: validate AI checks with planted-violations PR (#123)), plus a clean-pr/ case.
pytest -m ai_checks_eval runs the matrix locally and prints verdict-vs-expected per case.
.github/workflows/ai-checks-eval.yml runs on PRs touching .ai-checks/** or the runner, posts a verdict-matrix step summary.
asgi-wsgi-scott and branching-and-pr-strategy cases now fire fail.
doc/plans/, doc/features/, doc/sessions/, and CHANGELOG updated.
Out of scope
Depends on / relates to