Observation
The self-judge can return the same letter grade across multiple runs on identical input but flag different specific issues each time. Empirically the top issues recur (~66% overlap on highest-priority items), but lower-priority picks are noisy.
Why it matters
When a user runs strict mode and the loop iterates against the bar, the moving target makes "keep iterating until A" expensive: every attempt may surface a new issue that the previous attempt did not. The letter grade alone is therefore not always a reliable signal of whether a revision actually improved the output.
Possible directions
- Median-of-N grading. Run the judge 2-3 times on the baseline, take the median grade and the union of high-priority issues. Stabilizes the iteration target at the cost of N× judge tokens, paid on the first call only.
- Issue-priority threshold. Only iterate on items the judge marks as high-priority; ignore minor flags. Reduces target drift.
- Calibration mode. Optional config flag that runs the judge twice on first call and reports variance to the user before iterating.
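A minimal sketch of the first two directions combined, not a description of the current implementation. Everything named here is an assumption for illustration: the `Verdict` and `Issue` shapes, the `"high"`/`"low"` priority labels, the F-to-A grade scale, and the `judge` callable. For an even N the sketch takes the lower median, so a tie resolves to the stricter grade.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Assumed grade scale, ordered worst to best.
GRADE_ORDER = ["F", "D", "C", "B", "A"]


@dataclass(frozen=True)
class Issue:
    key: str       # hypothetical stable identifier for an issue
    priority: str  # assumed labels: "high" or "low"


@dataclass
class Verdict:
    grade: str
    issues: list[Issue]


def stabilized_baseline(
    judge: Callable[[str], Verdict], text: str, n: int = 3
) -> Verdict:
    """Run the judge n times; return the median grade and the union of
    high-priority issues (first-seen order, deduplicated by key)."""
    runs = [judge(text) for _ in range(n)]

    # Median over grade ranks; lower median when n is even.
    ranks = sorted(GRADE_ORDER.index(run.grade) for run in runs)
    median_grade = GRADE_ORDER[ranks[(len(ranks) - 1) // 2]]

    # Issue-priority threshold: low-priority flags are dropped entirely.
    union: dict[str, Issue] = {}
    for run in runs:
        for issue in run.issues:
            if issue.priority == "high":
                union.setdefault(issue.key, issue)

    return Verdict(median_grade, list(union.values()))
```

The loop would then iterate against the returned `Verdict` instead of a fresh single-run verdict. Deduplicating by `key` assumes the judge can emit a stable identifier per issue; if it only produces free text, some normalization step would be needed first.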
Acceptance criteria
A reproducible fix or config option that reduces "different issues each run" behavior on identical input. Demonstrate via a synthetic test case in tests/.
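One possible shape for the synthetic test case, offered as a sketch under stated assumptions. The `fake_judge` below is a hypothetical, deterministic stand-in for the self-judge: it always returns the same grade and the same high-priority issues, while its low-priority picks rotate with the run index. Jaccard similarity is used as the overlap measure, which is an assumption and may not match how the ~66% figure was computed.

```python
# Hypothetical issue pools for the synthetic judge.
HIGH = ["unclear thesis", "missing example"]
LOW = ["comma splice", "passive voice", "long sentence", "weak verb"]


def fake_judge(run_index):
    """Stand-in judge: stable grade and high-priority issues,
    low-priority picks that change from run to run."""
    low_picks = [LOW[run_index % len(LOW)], LOW[(run_index + 1) % len(LOW)]]
    issues = [(name, "high") for name in HIGH]
    issues += [(name, "low") for name in low_picks]
    return "B", issues


def iteration_targets(verdict, high_only):
    """Issue names the loop would iterate on for a given verdict."""
    _grade, issues = verdict
    return {name for name, prio in issues if prio == "high" or not high_only}


def jaccard(a, b):
    """Overlap between two issue sets, 1.0 meaning identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def test_high_priority_filter_removes_drift():
    runs = [fake_judge(i) for i in range(5)]

    # Without the threshold, consecutive runs disagree on targets.
    unfiltered = [iteration_targets(r, high_only=False) for r in runs]
    assert jaccard(unfiltered[0], unfiltered[1]) < 1.0

    # With the threshold, every run yields the same targets.
    filtered = [iteration_targets(r, high_only=True) for r in runs]
    assert all(jaccard(filtered[0], t) == 1.0 for t in filtered)
```

A test like this only shows that the filter removes drift in low-priority flags. Since the real judge also varies on high-priority items, a fuller test would inject noise there too and assert on a minimum overlap rather than exact equality.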