AI checks eval harness: regression-test rule prompts against planted-violation dataset #128

@rafacm

Description

Background

The self-hosted AI checks workflow shipped in #122 is LLM-based, which means every prompt edit (.ai-checks/*.md), runner change (.github/scripts/run_ai_check.py), and model upgrade (AI_CHECK_MODEL) is a quiet regression risk. There's currently no automated way to know whether a prompt tweak that fixes one rule's false negative breaks the other eight, or whether bumping from claude-sonnet-4-6 to a future model holds the verdict matrix.

The manual one-time validation in #123 surfaced this gap concretely: 7 / 9 rules fired correctly on planted violations, 2 missed (asgi-wsgi-scott PASS, branching-and-pr-strategy SKIP — see #123 (comment)). The fixes for those misses live in #125. But after #125 lands we have no way to confirm those rules now fire correctly and the other seven still do — short of re-doing the manual eval PR every time, which won't happen.

This issue is the eval harness for the AI checks themselves: development-level tooling, distinct from #115 (which is the eval harness for application-runtime LLM behaviour — the Fetch Details agent, Scott retrieval quality, etc.). Different concern, different lane, deliberately separate harness.

Why a v0 dataset is already sitting here

The eval branch from #123 (eval/ai-checks-violations) contains exactly the dataset shape we want: one commit per rule, each carrying a planted violation that the rule should fire `fail` on. The 9 tuples are:

| Rule | Planted-violation commit | Expected verdict |
| --- | --- | --- |
| pipeline-step-sync | new VERIFYING step in `episodes/models.py`; README/doc untouched | fail |
| env-var-sync | `os.getenv("RAGTIME_FAKE_THING")` read; `.env.sample`/`configure.py` untouched | fail |
| qdrant-payload-slim | `episode_title` added to `_build_payloads` dict | fail |
| entity-creation-race-safety | bare `Entity.objects.create(...)` outside `_get_or_create_entity` | fail |
| comment-discipline | `# Added for issue #117 to handle the X case` next to a function | fail |
| gh-api-shell-escaping | `gh api -f body` with bare backticks, not in heredoc | fail |
| asgi-wsgi-scott | edits `chat/views.py`; PR description mentions only `manage.py runserver` | fail |
| branching-and-pr-strategy | merge commit injected into branch history | fail |
| feature-pr-docs | feature-shaped PR with no plan/feature/sessions/CHANGELOG | fail |

Plus a 10th implicit case: a clean PR (no planted violations) where every rule should pass or skip. That's the no-false-positives signal.

Proposal

1. Dataset format

Capture each tuple as a structured fixture instead of a live git branch — git branches rot, and we want the harness to be hermetic.

episodes/tests/fixtures/ai_checks/cases/<rule-slug>/:

  • diff.patch — the unified diff for the planted violation (what git diff base...HEAD would produce).
  • metadata.yaml — branch name, base ref, PR title, PR body (mock), commit log (git log --graph --oneline base..HEAD). Necessary for rules like branching-and-pr-strategy that depend on PR/branch metadata once #125 (AI checks driver: pass PR metadata, surface model used, investigate Feature PR Docs false positive) lands.
  • expected.yaml — verdict: fail, optional must_cite: [<file paths or symbols the details body should reference>], optional must_not_cite: [<files belonging to other rules' fixtures, to detect cross-fires>].

Plus episodes/tests/fixtures/ai_checks/cases/clean-pr/ with an empty/no-violation diff and expected.verdict: pass (or skip).
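As a concrete sketch, a hypothetical expected.yaml for the entity-creation-race-safety case could look like this (field names as proposed above; the cited symbols and cross-fire paths are illustrative, not final):

```yaml
# episodes/tests/fixtures/ai_checks/cases/entity-creation-race-safety/expected.yaml
verdict: fail
must_cite:
  - Entity.objects.create        # the planted bare create() call
must_not_cite:
  - chat/views.py                # belongs to the asgi-wsgi-scott fixture; citing it = cross-fire
```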

Backfill the initial dataset from the eval PR (#127) before closing it, so the diffs and PR description are captured exactly as the workflow saw them.

2. Runner

episodes/tests/test_ai_checks_eval.py — pytest, parametrised over (rule, case) pairs. For each pair:

  • Load the rule body from .ai-checks/<rule>.md.
  • Load the case's diff.patch + metadata.yaml.
  • Invoke a thin wrapper around .github/scripts/run_ai_check.py's evaluate() function (refactor it slightly so the diff and metadata can be passed in directly, instead of being read from git diff and env vars).
  • Assert verdict == expected.verdict.
  • Assert each entry in must_cite appears in details.
  • Assert no entry in must_not_cite appears in details (cross-fire detection).

Behind a @pytest.mark.ai_checks_eval marker so it doesn't run on every commit (LLM cost). On-demand + scheduled CI only.
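A minimal sketch of what that runner could look like, assuming a hypothetical `evaluate_check(rule_body, diff, metadata)` wrapper extracted from run_ai_check.py's `evaluate()` (all names here are illustrative, not the real API):

```python
from pathlib import Path

import pytest
import yaml

CASES_DIR = Path("episodes/tests/fixtures/ai_checks/cases")
RULES_DIR = Path(".ai-checks")


def assert_matches_expected(verdict: str, details: str, expected: dict) -> None:
    """Pure assertion helper: compare one verdict/details pair against an expected.yaml dict."""
    assert verdict == expected["verdict"], f"got {verdict!r}, wanted {expected['verdict']!r}"
    for needle in expected.get("must_cite", []):
        assert needle in details, f"details missing required citation: {needle}"
    for needle in expected.get("must_not_cite", []):
        assert needle not in details, f"cross-fire: details cites {needle}"


@pytest.mark.ai_checks_eval
@pytest.mark.parametrize("case_dir", sorted(CASES_DIR.glob("*")), ids=lambda p: p.name)
def test_rule_fires_as_expected(case_dir):
    expected = yaml.safe_load((case_dir / "expected.yaml").read_text())
    metadata = yaml.safe_load((case_dir / "metadata.yaml").read_text())
    rule_body = (RULES_DIR / f"{case_dir.name}.md").read_text()
    diff = (case_dir / "diff.patch").read_text()
    # evaluate_check is the hypothetical refactored entry point; it makes a live LLM call.
    verdict, details = evaluate_check(rule_body, diff=diff, metadata=metadata)
    assert_matches_expected(verdict, details, expected)
```

The must_cite/must_not_cite checks are deliberately plain substring matches; anything fuzzier would make eval failures harder to diagnose than the prompt regressions they are meant to catch.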

3. CI workflow

.github/workflows/ai-checks-eval.yml, separate from the existing ai-checks.yml:

  • Triggers: workflow_dispatch, plus pull_request filtered to .ai-checks/** and .github/scripts/run_ai_check.py and .github/workflows/ai-checks*.yml paths only — so prompt and runner changes get a regression check, but normal feature PRs don't burn LLM budget.
  • Matrix over AI_CHECK_MODEL candidates (current default + at least one alternate) so we can compare verdict matrices across models when bumping. Stretch goal, can ship single-model first.
  • Posts a summary table to $GITHUB_STEP_SUMMARY: per (rule × case × model) verdict + pass/fail vs expected.
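The trigger and matrix pieces could be wired up roughly like this (a sketch only — job layout, step names, and the alternate-model slot are placeholders):

```yaml
# .github/workflows/ai-checks-eval.yml (sketch)
on:
  workflow_dispatch:
  pull_request:
    paths:
      - ".ai-checks/**"
      - ".github/scripts/run_ai_check.py"
      - ".github/workflows/ai-checks*.yml"

jobs:
  eval:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model: [claude-sonnet-4-6]   # add alternates here when comparing across models
    steps:
      - uses: actions/checkout@v4
      - run: pytest -m ai_checks_eval episodes/tests/test_ai_checks_eval.py
        env:
          AI_CHECK_MODEL: ${{ matrix.model }}
```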

4. Cost / hermeticity

LLM calls are live (no mocking — the whole point is to test the actual model's judgement). Per-run cost is bounded: 9 rules × ~10 cases × N models. With claude-sonnet-4-6 and current dataset size that's well under $1/run.

No HTTP fixtures needed — the AI checks runner only talks to the LLM provider, not to GitHub. Unlike #115, no VCR cassettes required.

Acceptance criteria

Out of scope

  • Eval harness for application-runtime agents (Fetch Details, Scott). That's #115 (Fetch Details agent: LLM evaluation framework — DeepEval, dataset, CI workflow) — different concern, different lane.
  • Adding new AI-checks rules. The 9 existing rules are the test bed.
  • Cost / token aggregation across the matrix. Useful, but follow-on once the harness works.
  • Promoting any AI check to required-status on branch protection. Per-rule decision, separate.

Depends on / relates to
