Background
The self-hosted AI checks workflow shipped in #122 is LLM-based, which means every prompt edit (.ai-checks/*.md), runner change (.github/scripts/run_ai_check.py), and model upgrade (AI_CHECK_MODEL) is a quiet regression risk. There's currently no automated way to know whether a prompt tweak that fixes one rule's false negative breaks the other eight, or whether bumping from claude-sonnet-4-6 to a future model holds the verdict matrix.
The manual one-time validation in #123 surfaced this gap concretely: 7 / 9 rules fired correctly on planted violations, 2 missed (asgi-wsgi-scott PASS, branching-and-pr-strategy SKIP — see #123 (comment)). The fixes for those misses live in #125. But after #125 lands we have no way to confirm those rules now fire correctly and the other seven still do — short of re-doing the manual eval PR every time, which won't happen.
This issue is the eval harness for the AI checks themselves: development-level tooling, distinct from #115 (which is the eval harness for application-runtime LLM behaviour — the Fetch Details agent, Scott retrieval quality, etc.). Different concern, different lane, deliberately separate harness.
Why a v0 dataset is already sitting here
The eval branch from #123 (eval/ai-checks-violations) contains exactly the dataset shape we want: one commit per rule, each carrying a planted violation that the rule should flag with a fail verdict. The 9 tuples are:
| Rule | Planted-violation commit | Expected verdict |
| --- | --- | --- |
| pipeline-step-sync | new VERIFYING step in episodes/models.py; README/doc untouched | fail |
| env-var-sync | os.getenv("RAGTIME_FAKE_THING") read; .env.sample / configure.py untouched | fail |
| qdrant-payload-slim | episode_title added to _build_payloads dict | fail |
| entity-creation-race-safety | Entity.objects.create(...) outside _get_or_create_entity | fail |
| comment-discipline | # Added for issue #117 to handle the X case next to a function | fail |
| gh-api-shell-escaping | gh api -f body with bare backticks, not in heredoc | fail |
| asgi-wsgi-scott | chat/views.py; PR description mentions only manage.py runserver | fail |
| branching-and-pr-strategy | feature-pr-docs | fail |
Plus a 10th implicit case: a clean PR (no planted violations) where every rule should pass or skip. That's the no-false-positives signal.
Proposal
1. Dataset format
Capture each tuple as a structured fixture instead of a live git branch — git branches rot, and we want the harness to be hermetic. Each case lives in episodes/tests/fixtures/ai_checks/cases/<rule-slug>/:
diff.patch — the unified diff for the planted violation (what git diff base...HEAD would produce).
metadata.yaml — branch name, base ref, PR title, PR body (mock), commit log (git log --graph --oneline base..HEAD). Necessary for rules like branching-and-pr-strategy that depend on PR/branch metadata once #125 (AI checks driver: pass PR metadata, surface model used, investigate Feature PR Docs false positive) lands.
expected.yaml — verdict: fail, optional must_cite: [<file paths or symbols the details body should reference>], optional must_not_cite: [<files belonging to other rules' fixtures, to detect cross-fires>].
Plus episodes/tests/fixtures/ai_checks/cases/clean-pr/ with an empty/no-violation diff and expected.verdict: pass (or skip).
Backfill the initial dataset from the eval PR (#127) before closing it, so the diffs and PR description are captured exactly as the workflow saw them.
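To make the fixture shape concrete, a minimal loading sketch follows; the CaseFixture dataclass, the load_case helper, and the exact field names are hypothetical illustrations, not part of the proposal — they simply mirror the three files described above.

```python
# Sketch: load one fixture case from the proposed directory layout.
# CaseFixture and load_case are hypothetical names.
from dataclasses import dataclass, field
from pathlib import Path

import yaml

CASES_DIR = Path("episodes/tests/fixtures/ai_checks/cases")

@dataclass
class CaseFixture:
    rule: str              # rule slug, matches .ai-checks/<rule>.md
    diff: str              # unified diff exactly as the workflow saw it
    metadata: dict         # branch name, base ref, PR title/body, commit log
    expected_verdict: str  # "fail" for planted violations; "pass"/"skip" for clean-pr
    must_cite: list = field(default_factory=list)
    must_not_cite: list = field(default_factory=list)

def load_case(case_dir: Path) -> CaseFixture:
    expected = yaml.safe_load((case_dir / "expected.yaml").read_text())
    return CaseFixture(
        rule=case_dir.name,
        diff=(case_dir / "diff.patch").read_text(),
        metadata=yaml.safe_load((case_dir / "metadata.yaml").read_text()),
        expected_verdict=expected["verdict"],
        must_cite=expected.get("must_cite", []),
        must_not_cite=expected.get("must_not_cite", []),
    )
```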
2. Runner
episodes/tests/test_ai_checks_eval.py — pytest, parametrised over (rule, case) pairs. For each pair (see the sketch after this list):
Load the rule body from .ai-checks/<rule>.md.
Load the case's diff.patch + metadata.yaml.
Invoke a thin wrapper around .github/scripts/run_ai_check.py's evaluate() function (refactor it slightly so the diff and metadata can be passed in directly, instead of being read from git diff and env vars).
Assert verdict == expected.verdict.
Assert each entry in must_cite appears in details.
Assert no entry in must_not_cite appears in details (cross-fire detection).
Behind a @pytest.mark.ai_checks_eval marker so it doesn't run on every commit (LLM cost). On-demand + scheduled CI only.
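A sketch of that runner, under stated assumptions: evaluate() has been refactored to accept rule text, diff, and metadata directly and returns an object with verdict and details attributes (both assumptions), load_case comes from the hypothetical loader above, and the pairing logic (each rule against its own case, every rule against clean-pr) is one possible reading of the matrix, not the final design.

```python
# Sketch of episodes/tests/test_ai_checks_eval.py.
from pathlib import Path

import pytest

from episodes.tests.fixtures.ai_checks.loader import load_case  # hypothetical module

RULES_DIR = Path(".ai-checks")
CASES_DIR = Path("episodes/tests/fixtures/ai_checks/cases")

def _pairs():
    """(rule, case) pairs: each rule against its own planted violation,
    plus every rule against clean-pr (the no-false-positives signal)."""
    pairs = []
    for case_dir in sorted(p for p in CASES_DIR.iterdir() if p.is_dir()):
        if case_dir.name == "clean-pr":
            pairs += [(rule.stem, case_dir) for rule in sorted(RULES_DIR.glob("*.md"))]
        else:
            pairs.append((case_dir.name, case_dir))
    return pairs

@pytest.mark.ai_checks_eval  # keeps live-LLM tests out of the default run
@pytest.mark.parametrize("rule, case_dir", _pairs(), ids=lambda v: getattr(v, "name", v))
def test_rule_verdict(rule, case_dir):
    from run_ai_check import evaluate  # .github/scripts put on sys.path via conftest

    case = load_case(case_dir)
    rule_text = (RULES_DIR / f"{rule}.md").read_text()
    result = evaluate(rule=rule_text, diff=case.diff, metadata=case.metadata)

    assert result.verdict == case.expected_verdict
    for needle in case.must_cite:
        assert needle in result.details
    for needle in case.must_not_cite:  # cross-fire detection
        assert needle not in result.details
```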
3. CI workflow
.github/workflows/ai-checks-eval.yml, separate from the existing ai-checks.yml:
Triggers: workflow_dispatch, plus pull_request filtered to .ai-checks/** and .github/scripts/run_ai_check.py and .github/workflows/ai-checks*.yml paths only — so prompt and runner changes get a regression check, but normal feature PRs don't burn LLM budget.
Matrix over AI_CHECK_MODEL candidates (current default + at least one alternate) so we can compare verdict matrices across models when bumping. Stretch goal, can ship single-model first.
Posts a summary table to $GITHUB_STEP_SUMMARY: per (rule × case × model) verdict + pass/fail vs expected.
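One possible shape for the summary step, kept in Python for consistency with the runner; GITHUB_STEP_SUMMARY is the file path GitHub Actions provides to every step, while the results row format here is invented for the sketch.

```python
# Sketch: append the verdict matrix to the job's step summary as a markdown table.
import os

def write_step_summary(results):
    """results: iterable of (rule, case, model, verdict, expected) tuples (assumed shape)."""
    lines = [
        "| Rule | Case | Model | Verdict | Expected | OK |",
        "|---|---|---|---|---|---|",
    ]
    for rule, case, model, verdict, expected in results:
        ok = "✅" if verdict == expected else "❌"
        lines.append(f"| {rule} | {case} | {model} | {verdict} | {expected} | {ok} |")
    with open(os.environ["GITHUB_STEP_SUMMARY"], "a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
```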
4. Cost / hermeticity
LLM calls are live (no mocking — the whole point is to test the actual model's judgement). Per-run cost is bounded: 9 rules × ~10 cases × N models. With claude-sonnet-4-6 and current dataset size that's well under $1/run.
No HTTP fixtures needed — the AI checks runner only talks to the LLM provider, not to GitHub. Unlike #115, no VCR cassettes required.
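Because the calls are live, the suite should stay opt-in locally too. One way to enforce that is a conftest.py hook like the following; the marker name comes from the proposal, while the skip-by-default policy is an assumption about how the team wants plain pytest runs to behave.

```python
# conftest.py sketch: register the marker and skip ai_checks_eval tests unless
# they were explicitly selected with -m, so a plain `pytest` run never spends LLM budget.
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "ai_checks_eval: live-LLM eval of the AI checks (costs money)"
    )

def pytest_collection_modifyitems(config, items):
    if config.getoption("-m"):
        return  # explicit marker selection (e.g. -m ai_checks_eval): run as asked
    skip = pytest.mark.skip(reason="live LLM eval; run with -m ai_checks_eval")
    for item in items:
        if "ai_checks_eval" in item.keywords:
            item.add_marker(skip)
```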
Acceptance criteria
episodes/tests/fixtures/ai_checks/cases/<rule>/ populated with the 9 planted-violation tuples from #127 (eval: validate AI checks with planted-violations PR (#123)), plus a clean-pr/ case.
pytest -m ai_checks_eval runs the matrix locally and prints verdict-vs-expected per case.
.github/workflows/ai-checks-eval.yml runs on PRs touching .ai-checks/** or the runner, posts a verdict-matrix step summary.
asgi-wsgi-scott and branching-and-pr-strategy cases now fire fail.
doc/plans/, doc/features/, doc/sessions/, and CHANGELOG updated.
Out of scope
Depends on / relates to