feat: puzzle quality system — writer-judge loop by delebedev · Pull Request #232 · delebedev/leyline

delebedev · 2026-03-24T00:13:54Z

Summary

Writer-judge pattern for producing high-quality .pzl puzzles with structured reasoning, scoring, and telemetry.

Puzzle judge subagent (.claude/agents/puzzle-judge.md) — 5-dimension scoring rubric (Focus, Determinism, Signal, Minimality, Documentation), structured verdicts (PASS/NEEDS_REVISION/REJECT)
Upgraded write-puzzle skill — 7-step reasoning chain before writing any .pzl content, design narrative embedded as # comments, auto-dispatches judge with 3-round revision loop, REJECT recovery (fresh attempt from step 1)
Findings log (docs/puzzle-findings.md) — quality telemetry with token tracking per judge dispatch
Writer self-reflection (docs/puzzle-writer-learnings.md) — friction/surprise/improvement notes accumulate into skill improvements
Card palette — curated known-good cards by role + "prefer recording-referenced cards" guidance
Calibration — judge validated against 5 existing puzzles, all scores within expected ranges
First puzzle from the loop — deathtouch-kill.pzl scored 14/15 PASS
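The control flow described above (judge dispatch, up to 3 revision rounds, fresh attempt on REJECT) might be sketched roughly as follows. This is only an illustration: `run_loop`, `Verdict`, the `write`/`revise`/`judge` callables, and the `max_restarts` cap are hypothetical names and assumptions, not taken from the skill or agent files, and the behavior after the third unsuccessful round is a guess.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    status: str    # "PASS", "NEEDS_REVISION", or "REJECT" (verdict names from the PR summary)
    feedback: str  # judge notes handed back to the writer (hypothetical field)


def run_loop(
    write: Callable[[], str],
    revise: Callable[[str, str], str],
    judge: Callable[[str], Verdict],
    max_rounds: int = 3,    # "3-round revision loop" from the summary
    max_restarts: int = 1,  # assumption: the PR does not say how many fresh attempts are allowed
) -> Tuple[Optional[str], List[str]]:
    """Return (accepted puzzle or None, list of verdict statuses seen)."""
    history: List[str] = []
    for _attempt in range(max_restarts + 1):
        puzzle = write()  # fresh attempt from step 1 of the reasoning chain
        rejected = False
        for round_no in range(1, max_rounds + 1):
            verdict = judge(puzzle)
            history.append(verdict.status)
            if verdict.status == "PASS":
                return puzzle, history
            if verdict.status == "REJECT":
                rejected = True  # abandon this draft and start over
                break
            if round_no < max_rounds:
                puzzle = revise(puzzle, verdict.feedback)
        if not rejected:
            break  # assumption: stop once the revision rounds are exhausted
    return None, history
```

Passing the writer, reviser, and judge as plain callables keeps the sketch testable with stubs; in the actual system these would be skill steps and a subagent dispatch.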

Artifacts

File	What
`.claude/agents/puzzle-judge.md`	Judge subagent definition
`.claude/skills/write-puzzle/SKILL.md`	Upgraded writer skill
`docs/puzzle-findings.md`	Findings log with calibration + first entry
`docs/puzzle-writer-learnings.md`	Writer self-reflection log
`matchdoor/src/test/resources/puzzles/deathtouch-kill.pzl`	First puzzle from the loop
`docs/superpowers/specs/2026-03-23-puzzle-quality-system-design.md`	Design spec
`docs/superpowers/plans/2026-03-23-puzzle-quality-system.md`	Implementation plan

Test plan

Judge calibrated against 5 existing puzzles — scores in expected ranges
End-to-end: wrote deathtouch-kill.pzl through full loop, 14/15 PASS
`just puzzle-check` passes on the new puzzle
Next session: puzzle-judge agent type available for dispatch (registered at session start)
Future: write puzzles for conformance issues #170 (SBA deaths use wrong TransferCategory: Destroy instead of SBA_ZeroToughness/SBA_Damage), #176 (Mill ZoneTransfer annotations emit affectorId=0), #167 (Scry annotation: wrong detail keys and affectedIds), and #191 (DFC / Saga flip wire: ZoneTransfer pair instead of in-place mutation)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add --label ralph to skill PR template and CLAUDE.md agent policy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Writer-judge pattern for .pzl puzzles: structured reasoning chain for the writer, 5-dimension scoring rubric for the judge, 3-round revision loop, and findings log for telemetry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

5-task plan: judge agent, upgraded writer skill, findings log, calibration against existing puzzles, end-to-end validation. Also fixes bolt-face.pzl expected score range in spec (12-14 → 10-13). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…dings log
- puzzle-judge: 5-dimension scoring rubric (Focus, Determinism, Signal, Minimality, Documentation), structured verdicts, read-only evaluation
- write-puzzle: 7-step reasoning chain, narrative comments, auto judge dispatch with 3-round revision loop, findings log telemetry
- puzzle-findings.md: append-only quality telemetry log

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

All scores within expected ranges — rubric is well-calibrated. legend-rule: 14/15 PASS, bolt-face: 10/15 NEEDS_REVISION, fdn-keyword-combat: 7/15 REJECT, lands-only: 5/15 REJECT, prowess-buff: 7/15 REJECT. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
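As a rough illustration of how a total out of 15 could map to a verdict, the sketch below checks the five calibration results against a simple threshold rule. The thresholds (`PASS_MIN`, `REVISE_MIN`) and the 0 to 3 range per dimension are assumptions chosen only to be consistent with the scores listed above; the actual rubric in `puzzle-judge.md` may use different cutoffs or additional rules.

```python
DIMENSIONS = ("Focus", "Determinism", "Signal", "Minimality", "Documentation")
MAX_PER_DIMENSION = 3  # assumption: 5 dimensions x 3 points = 15
PASS_MIN = 12          # hypothetical cutoff, not taken from the rubric
REVISE_MIN = 8         # hypothetical cutoff, not taken from the rubric


def total_score(scores: dict) -> int:
    """Sum per-dimension scores, rejecting missing or out-of-range values."""
    for dim in DIMENSIONS:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        if not 0 <= scores[dim] <= MAX_PER_DIMENSION:
            raise ValueError(f"{dim} out of range: {scores[dim]}")
    return sum(scores[dim] for dim in DIMENSIONS)


def verdict_for(total: int) -> str:
    """Map a total score to a verdict using the assumed thresholds."""
    if total >= PASS_MIN:
        return "PASS"
    if total >= REVISE_MIN:
        return "NEEDS_REVISION"
    return "REJECT"


# Calibration totals and verdicts as reported in this commit message.
CALIBRATION = {
    "legend-rule": (14, "PASS"),
    "bolt-face": (10, "NEEDS_REVISION"),
    "fdn-keyword-combat": (7, "REJECT"),
    "lands-only": (5, "REJECT"),
    "prowess-buff": (7, "REJECT"),
}

mismatches = [
    name
    for name, (total, expected) in CALIBRATION.items()
    if verdict_for(total) != expected
]
```

With these assumed thresholds `mismatches` stays empty, which only shows that the rule is consistent with the five reported data points, not that it is the rule the judge applies.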

Score: 14/15 PASS. Typhoid Rats (1/1 deathtouch) blocks Juggernaut (5/3 must-attack) at 3 life. Single win path, zero AI decisions, full design narrative embedded. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Writer skill now instructs the caller to extract total_tokens from judge agent dispatch results and record per-round + total in the findings log entry. Enables cost visibility per puzzle. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
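A findings log entry with per-round and total token counts could be assembled along these lines. The entry layout, the `JudgeRound` type, and the helper names are hypothetical; only the idea of recording `total_tokens` per judge dispatch plus a sum, and the append-only nature of the log, comes from the commit messages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgeRound:
    verdict: str       # e.g. "PASS"
    score: int         # total out of 15
    total_tokens: int  # value extracted from the judge dispatch result


def format_findings_entry(puzzle: str, rounds: List[JudgeRound]) -> str:
    """Render one log entry: a line per judge round, then the token total."""
    lines = [f"## {puzzle}"]  # hypothetical heading layout
    for i, r in enumerate(rounds, start=1):
        lines.append(f"- round {i}: {r.score}/15 {r.verdict}, {r.total_tokens} tokens")
    lines.append(f"- total judge tokens: {sum(r.total_tokens for r in rounds)}")
    return "\n".join(lines)


def append_entry(path: str, entry: str) -> None:
    """Append-only write: existing entries are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(entry + "\n\n")
```

Keeping formatting separate from the file write makes the entry easy to check before it reaches `docs/puzzle-findings.md`.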

…tion
- Card selection guidance: prefer recording-referenced cards, verify before designing, curated palette for common roles
- Step 7: writer appends friction/surprise/improvement notes to docs/puzzle-writer-learnings.md after each puzzle
- Learnings log: self-improving feedback loop; 3+ repeats graduate into skill rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
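The "3+ repeats graduate into skill rules" idea can be illustrated with a small counter. How repeats are actually detected is not stated in the PR; the sketch assumes notes count as the same when they match after trimming whitespace and lowercasing, and `graduated_notes` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List

GRADUATION_THRESHOLD = 3  # "3+ repeats" from the commit message


def graduated_notes(notes: Iterable[str]) -> List[str]:
    """Return normalized notes that occur at least GRADUATION_THRESHOLD times."""
    # Assumption: a repeat is an exact match after trimming and lowercasing.
    counts = Counter(note.strip().lower() for note in notes)
    return sorted(n for n, c in counts.items() if c >= GRADUATION_THRESHOLD)
```

In practice the matching would probably need to be fuzzier than exact string equality, since the writer is unlikely to phrase the same friction point identically each time.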

github-actions · 2026-03-24T00:15:57Z

Test Results

242 files ±0 242 suites ±0 1m 57s ⏱️ -3s
1 006 tests ±0 980 ✅ ±0 26 💤 ±0 0 ❌ ±0
1 795 runs ±0 995 ✅ ±0 800 💤 ±0 0 ❌ ±0

Results for commit 6ebe352. ± Comparison against base commit da6ad8a.

github-actions · 2026-03-24T00:15:59Z

CI Report — Gate

Tests: 825/825 passed (190 skipped)

Coverage: 11.0% (642/5861 lines)

Module	Coverage	Lines
tooling	🔴 11%	642/5861

Slow tests (>3s): 1

InstanceIdReallocTest.Limbo grows across multiple plays (5.5s)

delebedev and others added 9 commits March 22, 2026 22:25

chore: ignore .claude/worktrees/

dd5e275

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: tag ralph PRs with ralph label

969f71f

Add --label ralph to skill PR template and CLAUDE.md agent policy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

delebedev wants to merge 9 commits into main from feat/puzzle-quality-system
