feat: puzzle quality system — writer-judge loop #232

Open
delebedev wants to merge 9 commits into main from feat/puzzle-quality-system
Conversation

@delebedev
Owner

Summary

Writer-judge pattern for producing high-quality .pzl puzzles with structured reasoning, scoring, and telemetry.

  • Puzzle judge subagent (.claude/agents/puzzle-judge.md) — 5-dimension scoring rubric (Focus, Determinism, Signal, Minimality, Documentation), structured verdicts (PASS/NEEDS_REVISION/REJECT)
  • Upgraded write-puzzle skill — 7-step reasoning chain before writing any .pzl content, design narrative embedded as # comments, auto-dispatches judge with 3-round revision loop, REJECT recovery (fresh attempt from step 1)
  • Findings log (docs/puzzle-findings.md) — quality telemetry with token tracking per judge dispatch
  • Writer self-reflection (docs/puzzle-writer-learnings.md) — friction/surprise/improvement notes accumulate into skill improvements
  • Card palette — curated known-good cards by role + "prefer recording-referenced cards" guidance
  • Calibration — judge validated against 5 existing puzzles, all scores within expected ranges
  • First puzzle from the loop: deathtouch-kill.pzl scored 14/15 PASS
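
The revision loop described in the bullets above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual logic: `run_loop`, `Verdict`, and the `write`, `revise`, and `judge` callables are hypothetical names, and the sketch assumes a REJECT consumes one of the three rounds, which the summary does not state.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ROUNDS = 3  # from the PR summary: 3-round revision loop


@dataclass
class Verdict:
    status: str         # "PASS", "NEEDS_REVISION", or "REJECT"
    score: int          # total out of 15
    feedback: str = ""  # judge notes handed back to the writer


def run_loop(
    write: Callable[[], str],
    revise: Callable[[str, str], str],
    judge: Callable[[str], Verdict],
    max_rounds: int = MAX_ROUNDS,
) -> Optional[str]:
    """Return the accepted puzzle text, or None if no round passed."""
    draft = write()
    for _ in range(max_rounds):
        verdict = judge(draft)
        if verdict.status == "PASS":
            return draft
        if verdict.status == "REJECT":
            # REJECT recovery: discard the draft and restart from step 1.
            # Assumption: the restart still counts against max_rounds.
            draft = write()
        else:
            # NEEDS_REVISION: keep the draft and apply the judge's feedback.
            draft = revise(draft, verdict.feedback)
    return None
```

Passing the writer and judge in as callables keeps the control flow testable without dispatching a real subagent.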

Artifacts

| File | What |
|---|---|
| .claude/agents/puzzle-judge.md | Judge subagent definition |
| .claude/skills/write-puzzle/SKILL.md | Upgraded writer skill |
| docs/puzzle-findings.md | Findings log with calibration + first entry |
| docs/puzzle-writer-learnings.md | Writer self-reflection log |
| matchdoor/src/test/resources/puzzles/deathtouch-kill.pzl | First puzzle from the loop |
| docs/superpowers/specs/2026-03-23-puzzle-quality-system-design.md | Design spec |
| docs/superpowers/plans/2026-03-23-puzzle-quality-system.md | Implementation plan |

Test plan

🤖 Generated with Claude Code

delebedev and others added 9 commits March 22, 2026 22:25
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add --label ralph to skill PR template and CLAUDE.md agent policy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Writer-judge pattern for .pzl puzzles: structured reasoning chain
for the writer, 5-dimension scoring rubric for the judge, 3-round
revision loop, and findings log for telemetry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5-task plan: judge agent, upgraded writer skill, findings log,
calibration against existing puzzles, end-to-end validation.

Also fixes bolt-face.pzl expected score range in spec (12-14 → 10-13).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dings log

- puzzle-judge: 5-dimension scoring rubric (Focus, Determinism, Signal,
  Minimality, Documentation), structured verdicts, read-only evaluation
- write-puzzle: 7-step reasoning chain, narrative comments, auto judge
  dispatch with 3-round revision loop, findings log telemetry
- puzzle-findings.md: append-only quality telemetry log

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All scores within expected ranges — rubric is well-calibrated.
legend-rule: 14/15 PASS, bolt-face: 10/15 NEEDS_REVISION,
fdn-keyword-combat: 7/15 REJECT, lands-only: 5/15 REJECT,
prowess-buff: 7/15 REJECT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
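
To make the calibration numbers above concrete, here is a hedged sketch of how a 15-point total might map to a verdict. The 0-3 range per dimension is inferred from five dimensions summing to 15, and `PASS_MIN` and `REVISION_MIN` are hypothetical cutoffs chosen only so that the five reported results (14 PASS, 10 NEEDS_REVISION, 7 and 5 REJECT) come out as stated. The actual thresholds live in puzzle-judge.md and may differ or include per-dimension floors.

```python
DIMENSIONS = ("Focus", "Determinism", "Signal", "Minimality", "Documentation")
MAX_PER_DIMENSION = 3  # assumption: 5 dimensions x 3 points = 15
PASS_MIN = 13          # hypothetical cutoff, not taken from the rubric
REVISION_MIN = 8       # hypothetical cutoff, not taken from the rubric


def verdict_for(scores):
    """Sum the per-dimension scores and map the total to a verdict."""
    missing = [name for name in DIMENSIONS if name not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        if not 0 <= scores[name] <= MAX_PER_DIMENSION:
            raise ValueError(f"{name} score out of range: {scores[name]}")
    total = sum(scores[name] for name in DIMENSIONS)
    if total >= PASS_MIN:
        verdict = "PASS"
    elif total >= REVISION_MIN:
        verdict = "NEEDS_REVISION"
    else:
        verdict = "REJECT"
    return total, verdict
```

For example, a puzzle scoring 3 on every dimension except a 2 for Documentation totals 14 and maps to PASS under these assumed cutoffs.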
Score: 14/15 PASS. Typhoid Rats (1/1 deathtouch) blocks Juggernaut
(5/3 must-attack) at 3 life. Single win path, zero AI decisions,
full design narrative embedded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Writer skill now instructs the caller to extract total_tokens from
judge agent dispatch results and record per-round + total in the
findings log entry. Enables cost visibility per puzzle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
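
The per-round and total token bookkeeping described above might look like the sketch below. Only the `total_tokens` field name comes from the commit message. The shape of a dispatch result (a dict, one per judge round) and the `summarize_tokens` helper are assumptions made for illustration.

```python
def summarize_tokens(dispatch_results):
    """Collect total_tokens per judge round and the sum across rounds.

    Assumes each dispatch result is a dict carrying a "total_tokens" value.
    """
    per_round = [int(result["total_tokens"]) for result in dispatch_results]
    return {"per_round": per_round, "total": sum(per_round)}
```

The returned mapping could then be formatted into the findings log entry in whatever layout docs/puzzle-findings.md uses.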
…tion

- Card selection guidance: prefer recording-referenced cards, verify
  before designing, curated palette for common roles
- Step 7: writer appends friction/surprise/improvement notes to
  docs/puzzle-writer-learnings.md after each puzzle
- Learnings log: self-improving feedback loop — 3+ repeats graduate
  into skill rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
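
The "3+ repeats graduate into skill rules" idea can be illustrated with a small counter. This is a sketch under a strong assumption: repeats are detected by exact match on normalized note text, whereas the real process likely relies on the writer's judgment about which notes describe the same problem. `graduated_notes` is a hypothetical helper.

```python
from collections import Counter

GRADUATION_THRESHOLD = 3  # from the commit message: 3+ repeats graduate


def graduated_notes(notes, threshold=GRADUATION_THRESHOLD):
    """Return the notes that recur at least `threshold` times, sorted."""
    # Assumption: two notes are "the same" if they match after
    # trimming whitespace and lowercasing.
    counts = Counter(note.strip().lower() for note in notes if note.strip())
    return sorted(note for note, count in counts.items() if count >= threshold)
```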
@github-actions

Test Results

  242 files  ±0  242 suites  ±0   1m 57s ⏱️ -3s
1 006 tests ±0  980 ✅ ±0   26 💤 ±0  0 ❌ ±0 
1 795 runs  ±0  995 ✅ ±0  800 💤 ±0  0 ❌ ±0 

Results for commit 6ebe352. ± Comparison against base commit da6ad8a.

@github-actions

CI Report — Gate

Tests: 825/825 passed (190 skipped)

Coverage: 11.0% (642/5861 lines)

| Module | Coverage | Lines |
|---|---|---|
| tooling | 🔴 11% | 642/5861 |

Slow tests (>3s): 1
  • InstanceIdReallocTest.Limbo grows across multiple plays (5.5s)
