
AI checks driver: hallucination patterns not covered by #125 (4 reproducers) #137

@rafacm

Description


Background

The AI Checks driver shipped in #122 has been running on real PRs for a few weeks. The 2026-05-04 PR batch (#129/#131/#133) surfaced four distinct false-positive patterns that all share the same meta-shape: the model produces a verdict that references content not present in the actual file tree at the PR head. #125 already tracks one driver-hardening bundle (pass PR metadata, surface the model used, fix the Feature PR Docs bundle check), but its scope and proposed fixes ("feed the model the branch name") don't address the patterns below — the model is hallucinating diff/file content, not missing PR metadata.

This issue collects those four reproducers as a separate driver-hardening bundle.

Reproducers

1. Branching & PR Strategy reads the synthetic merge commit (#129)

Verdict (model):

The PR branch is not up-to-date with the current main branch; it does not appear to be rebased onto main, which is required for compliance.

Ground truth: The branch was rebased onto main immediately before push. The runner's actions/checkout@v4 defaults to refs/pull/N/merge, which GitHub materializes as a two-parent merge commit even when the PR is fully fast-forwardable. From the run log:

fetch ... +e8b57f8...:refs/remotes/pull/129/merge
checkout refs/remotes/pull/129/merge
HEAD is now at e8b57f8 Merge 9d1897f... into 69a008...

The model inspects git log on the runner, sees a Merge X into Y commit on top, and concludes "not rebased." This fires on every rebased PR, regardless of whether main actually moved underneath the branch. (It was intermittent on the next run for #131 — model output is non-deterministic.)

This is a different mechanism from #125's premise ("the model can't see the branch name"). Even if PR metadata is passed in, the model is still reading the synthetic merge commit from git log on the checked-out tree.

Detail: PR #129 comment

2. Comment Discipline reads diff-removed lines as still present (#131)

Verdict (model):

There are several comments in the episodes/apps.py file that paraphrase what the code does […] For example, comments like # Only initialize DBOS for server/worker entrypoints, plus the management commands that need to talk to the queue. and # Decorators must register before DBOS.launch(); enqueues against late-registered workflows silently fail. can be removed because the functionality is clear from the code itself.

Ground truth: The first comment cited (# Only initialize DBOS for server/worker entrypoints…) was deleted by this PR — it appears in the unified diff as a - line. The model is treating - lines as if they're still present in the file. The second comment is a legitimate non-obvious WHY (silent-failure mode of late-registered DBOS workflows) that the agent kept deliberately per the rule's own "keep WHY" carve-out.

The driver currently passes git diff $BASE_REF...HEAD directly to the model. There's no signal that - means "removed" beyond the standard unified-diff convention, which gpt-4o-mini regularly mishandles on multi-hunk diffs.
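A driver-side mitigation is to pre-split the unified diff so the model only ever sees added lines. A minimal sketch (the function name and the example diff are illustrative, not the shipped driver code):

```python
import re

def added_comment_lines(unified_diff: str) -> list[str]:
    """Return only the comment lines *added* by a unified diff.

    Lines starting with '+' (excluding the '+++' file header) are additions;
    '-' lines are removals and must never reach the model as "present" code.
    """
    added = []
    for line in unified_diff.splitlines():
        if line.startswith("+++"):            # file header, not content
            continue
        if line.startswith("+"):
            content = line[1:].strip()
            if re.match(r"#", content):       # Python comment line
                added.append(content)
    return added

# Hypothetical diff shaped like the #131 reproducer:
diff = """\
--- a/episodes/apps.py
+++ b/episodes/apps.py
@@ -10,3 +10,3 @@
-# Only initialize DBOS for server/worker entrypoints
+# Decorators must register before DBOS.launch()
 DBOS.launch()
"""
# The deleted comment never appears in the output, so the model
# cannot cite it as "still present".
print(added_comment_lines(diff))
```

The removed line is filtered out before any prompt is built, which eliminates the "- line cited as present" failure class by construction.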

Detail: verdict artifact Job 74235387227

3. RAGTIME_* Env Var Sync invents env var names from PR title (#133)

Verdict artifact:

{
  "verdict": "fail",
  "summary": "New environment variables were added but .env.sample and configure.py are not updated accordingly.",
  "details": "The code introduces the variable RAGTIME_SHOW_NAME, which is not referenced in .env.sample or core/management/commands/configure.py."
}

Ground truth:

$ git grep RAGTIME_SHOW_NAME origin/rafacm/download-show-name-fix
(no output)

$ git diff origin/main...origin/rafacm/download-show-name-fix | grep RAGTIME_
     @override_settings(RAGTIME_PODCAST_AGGREGATORS="")

RAGTIME_SHOW_NAME does not exist anywhere in the branch. The PR adds Episode.show_name (a Django model field). The model invented an env var name by combining the rule's RAGTIME_ prefix with the PR's branch/title token show_name. Title-priming hallucination — the same mechanism that produced the #122-era Feature PR Docs confabulation, but on a different rule.

Detail: PR #133 comment and verdict artifact

4. Pipeline Step Documentation Sync misreads rule scope and claims doc isn't updated when it is (#133, post-rebase)

Verdict (model):

The diff adds a new field show_name to the Episode model in episodes/models.py and modifies the download process, but it does not update the documentation in README.md or doc/README.md to reflect these changes.

Two problems, layered:

  1. Diff-reading miss. doc/README.md is updated in the PR — show_name is added to the details field list and the overwrite-semantics paragraph gains an additive-only exception. The model claims the file isn't touched. This is the inverse of pattern #2: instead of reading - lines as still-present, it misses + lines that are present.
  2. Rule-scope misread. Even if README.md lacked any mention of show_name, the rule wouldn't require it. AGENTS.md scopes the rule to "When adding, removing, or changing a pipeline step, update the summary table in README.md and the detailed step descriptions in doc/README.md." This PR doesn't add/remove/change a pipeline step — the steps list is unchanged; only the behavior inside the existing fetch_details step changed. The README summary table at the step-level granularity ("🕷️ Fetch Details ... extracts metadata") is correct and shouldn't enumerate every field. The model expanded "changing a step" to mean "any change touching a step's source file."

This is a scope-creep + diff-miss double hallucination. The driver-side fix that addresses pattern #2 (extracting added vs. removed hunks before handing to the model) helps here too, but the rule-scope-misread component is independent — better-prompted rule guidance or a stronger model is needed for that half.

Detail: PR #133 — Pipeline Step Documentation Sync verdict

Why this is separate from #125

#125's proposal is to pass more context into the prompt (PR metadata, model name in summary, telemetry). That's strictly additive — none of it tells the model "do not invent file content that isn't in the diff." The patterns above are about the model fabricating or misreading the input it already has.

All four share the same failure shape: the verdict either references content not present in the PR, or treats content that is present as absent. Worth fixing as a coordinated bundle.

Why we're not just upgrading the default model

Reasonable counter-question: if all four failures look like small-model hallucinations, why not just bump the default from openai/gpt-4o-mini to a Sonnet-class model and call it a day? Per-pattern susceptibility:

| Pattern | What the model did wrong | Stronger-model fix likelihood |
| --- | --- | --- |
| #1 synthetic merge commit | Read git log and didn't realize HEAD was a synthetic GHA merge ref, not the PR head | Medium — GPT-4o / Sonnet 4.6 often recognize refs/pull/N/merge semantics, but it's an environmental ambiguity even strong models can trip on. Driver fix (proposal A) eliminates it regardless of model. |
| #2 diff - lines | Treated removed lines as if still present | High — reading unified-diff +/- correctly is basic comprehension; gpt-4o-mini specifically gets sloppy on multi-hunk diffs. |
| #3 invented env var | Combined the rule's RAGTIME_ prefix with the PR title token show_name and presented the result as a "fact" | High — classic small-model under-grounding in the input plus over-grounding in surface tokens. |
| #4 missed + lines + scope-creep | Claimed doc/README.md wasn't updated when the diff clearly adds lines to it; also stretched "changing a pipeline step" to mean "any change touching a step's source file" | High for the diff-miss half; Medium for the scope-creep half (rule prompt could be tightened independently of model strength). |

So 3 of 4 are model-capacity issues a Sonnet-class model would almost certainly avoid; #1 is a mixed environmental + model issue that proposal A fixes regardless of model.

Why not wholesale-upgrade anyway? Cost. Sonnet 4.6 is ~10–20× the per-PR cost of gpt-4o-mini, multiplied across 9 rules × every PR. The framing of the proposals below preserves the cheap default for the routine pass case:

  • Proposal B (deterministic prechecks) makes patterns #2 and #3 disappear without spending any model tokens — pure regex on the diff — and deterministically passes the common no-step-change case of #4.
  • Proposal C (auto-retry on stronger model when first verdict is fail) bounds the cost: pay Sonnet rates only when there's a verdict worth double-checking. False-positive rate drops, true-negative rate stays cheap.
  • Proposal D (telemetry on overturned verdicts) gives us the data to revisit the default-model choice in 3–6 months with evidence rather than vibes.

If telemetry shows the cheap default has an unacceptable false-positive rate even after B and C land, that's the moment to bump the default — not now.

Proposal

A. Pin actions/checkout to the PR head SHA (fixes #1)

Set ref: ${{ github.event.pull_request.head.sha }} on the check job's actions/checkout step. The runner will then check out 9d1897f directly (the rebased head) instead of the synthetic refs/pull/N/merge commit. git log on the runner reflects the PR's actual linear history.

Caveat: the diff is still computed via git diff $BASE_REF...HEAD, where $BASE_REF is fetched separately — this part is unaffected.
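A sketch of the workflow change (the job name and surrounding structure are placeholders; only the ref: line is the actual proposal):

```yaml
jobs:
  ai-checks:                 # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Check out the actual PR head, not the synthetic refs/pull/N/merge
          # commit that the default checkout materializes for pull_request events.
          ref: ${{ github.event.pull_request.head.sha }}
          # Full history, so git log on the runner shows the branch's real
          # linear history rather than a single shallow commit.
          fetch-depth: 0
```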

B. Deterministic precheck for path-/symbol-based rules (fixes #2 and #3, mitigates #4, plus extends #125's Option B)

For rules whose verdict rests on "does symbol/path X appear in the diff?", do the check in Python before the model call:

| Rule | Precheck |
| --- | --- |
| ragtime-env-var-sync | Regex \b(RAGTIME_[A-Z_]+)\b against added lines in the diff; pass the resulting set to the model as "Detected new RAGTIME_* references: {…}" (empty set ⇒ verdict pass deterministically, no model call needed). |
| feature-pr-docs | Already covered by #125 Option B. |
| comment-discipline | Extract added comment lines (^\+\s*#) from the diff and hand the model only those lines as a list. Strips diff-format ambiguity entirely. |
| pipeline-step-doc-sync | Two-part precheck: (a) detect whether the PR actually adds/removes/changes a pipeline step (regex on _PIPELINE_DISPATCH, Episode.Status choices, or new @DBOS.step decorators) — if not, deterministic pass with summary "PR does not modify the pipeline-step list." (b) If a step is modified, pre-extract whether README.md and doc/README.md appear in the diff's changed-files list and pass that fact to the model so it can't claim a touched file is untouched. |

This is more invasive than #125 Option B because it touches four rules instead of one, but each is mechanical (one regex per rule). Worth doing as one bundle.
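The ragtime-env-var-sync row as a sketch. The helper name and the pass/defer return convention are assumptions, not existing driver API:

```python
import re

RAGTIME_VAR = re.compile(r"\b(RAGTIME_[A-Z_]+)\b")

def precheck_env_vars(unified_diff: str):
    """Scan only *added* diff lines for new RAGTIME_* references.

    Returns ("pass", set()) when nothing matched: a deterministic pass
    with no model call. Otherwise returns (None, names) so the driver
    can hand the model a grounded set of detected names instead of
    letting it invent them from the PR title.
    """
    names: set[str] = set()
    for line in unified_diff.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            names.update(RAGTIME_VAR.findall(line))
    if not names:
        return "pass", names      # short-circuit: rule cannot fail
    return None, names            # defer to model, with grounded input

# A diff that only adds a model field cannot trip the rule:
verdict, found = precheck_env_vars(
    "+    show_name = models.CharField(max_length=200)"
)
```

With this in front of the model, the #133 reproducer (RAGTIME_SHOW_NAME invented from the branch name) passes deterministically, since the added lines contain no RAGTIME_* token at all.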

C. Fall back to a stronger model for any rule that fails (optional)

If a rule's first-pass verdict is fail with the cheap default (openai/gpt-4o-mini), automatically retry with openai/gpt-4o or anthropic/claude-sonnet-4-6. If the second model also says fail, post that verdict; if it says pass, the original was a hallucination — log it and pass.

Cost is bounded: only failures retry, so true negatives stay cheap. Helps catch any hallucination pattern not specifically deterministic-precheckable.
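A sketch of the retry flow, assuming a hypothetical run_rule(rule, model) callable that returns "pass" or "fail" (both the helper and the model IDs as wired here are illustrative):

```python
CHEAP_MODEL = "openai/gpt-4o-mini"
STRONG_MODEL = "anthropic/claude-sonnet-4-6"

def verdict_with_escalation(rule, run_rule, log=print):
    """Retry a failing cheap-model verdict on a stronger model.

    True negatives (first-pass "pass") never pay the strong-model rate;
    only "fail" verdicts are double-checked.
    """
    first = run_rule(rule, CHEAP_MODEL)
    if first == "pass":
        return "pass"                 # cheap path: no retry, no extra cost
    second = run_rule(rule, STRONG_MODEL)
    if second == "pass":
        # Cheap model likely hallucinated: record it for Proposal D telemetry.
        log(f"overturned: rule={rule} cheap=fail strong=pass")
        return "pass"
    return "fail"                     # both models agree: post the failure
```

The cost bound falls out of the control flow: the strong model is invoked only inside the first-pass-fail branch.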

D. Telemetry to detect ongoing false-positive rate

Already partially covered by #125 ("surface which model was used"). Add: when a verdict is overturned by either #C or by a maintainer comment, log the rule + model + verdict for offline review. Six months from now, this is the data we'd need to decide whether to keep the cheap-model default.
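One possible shape for that log, as an append-only JSONL file (the field names and path are assumptions, not an existing schema):

```python
import json
import datetime

def log_overturned(path, rule, model, original_verdict, overturned_by):
    """Append one JSONL record per overturned verdict for offline review."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "rule": rule,
        "model": model,
        "original_verdict": original_verdict,  # what the cheap model said
        "overturned_by": overturned_by,        # "retry-model" or "maintainer"
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
```

Grouping the file by rule and model later gives the per-rule false-positive rate needed for the keep-the-cheap-default decision.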

Acceptance criteria

  • Open a fresh rebased PR with no real rule violations. Branching & PR Strategy passes deterministically.
  • PR that deletes a comment (no comment additions) passes Comment Discipline deterministically.
  • PR that touches a model field whose name happens to match a RAGTIME_* pattern (e.g. a hypothetical Episode.api_key field) passes RAGTIME_* Env Var Sync deterministically.
  • PR that touches a pipeline-step file but doesn't add/remove/change a step in _PIPELINE_DISPATCH passes Pipeline Step Documentation Sync deterministically.
  • PR that changes a pipeline step and updates only doc/README.md (not README.md) is correctly judged based on whether the README summary table needs updating — i.e., the model sees the actual diff, not a hallucinated absence.
  • No regressions on the substantive rules that already pass correctly (validate against the planted-violations PR #123).

Out of scope

Depends on / relates to

Metadata

Labels: enhancement (New feature or request)