Refine an Agent Skill via the skill-forge judge → hitl loop, in a sandboxed Docker container.
export ANTHROPIC_API_KEY=sk-...
npx @jumptag/refine-skill ./path/to/skill

Default: 3 iterations max, `claude-sonnet-4-5`, telemetry written to `<path>/.refine/log.json`.

- Node 20+ (for `npx`).
- Docker (Engine 20.10+, daemon running).
- An API key for one of: Anthropic, OpenAI, Google, xAI, Mistral, Groq, OpenRouter, matched to your `--model` choice. See MODELS.md for the full list of supported models, env vars, and where to get keys.
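
The requirements above can be checked before a first run. The following is a minimal preflight sketch, assuming `node --version` prints a string such as `v20.11.0`. The function names `node_major_ok` and `preflight` are illustrative and not part of refine-skill. Only the Anthropic key is checked here (other providers use the env vars listed in MODELS.md), and the Docker Engine version itself is not verified.

```shell
# Illustrative preflight helpers; these names are hypothetical, not part of refine-skill.

# Succeeds when a version string like "v20.11.0" has a major version of 20 or higher.
node_major_ok() {
  nmo_version="${1#v}"            # drop a leading "v" if present
  nmo_major="${nmo_version%%.*}"  # keep everything before the first dot
  case "$nmo_major" in
    ''|*[!0-9]*) return 1 ;;      # reject empty or non-numeric input
  esac
  [ "$nmo_major" -ge 20 ]
}

# Checks Node, the Docker daemon, and the Anthropic key; reports the first problem found.
preflight() {
  if ! command -v node >/dev/null 2>&1 || ! node_major_ok "$(node --version)"; then
    echo "Node 20+ is required" >&2
    return 1
  fi
  if ! docker info >/dev/null 2>&1; then
    echo "Docker daemon is not reachable" >&2
    return 1
  fi
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set" >&2
    return 1
  fi
  echo "preflight ok"
}
```

Calling `preflight` before the `npx` command surfaces a missing prerequisite locally instead of relying on the CLI's own error codes.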
npx @jumptag/refine-skill <path> [options]
| Option | Default | Effect |
|---|---|---|
| `--iterations N` | `3` | Max passes before stopping at the cap |
| `--model M` | `claude-sonnet-4-5` | Any model pi.dev supports |
| `--image TAG` | `ghcr.io/barryroodt/refine-skill:<pkg-version>` | Override the image |
| `--pull POLICY` | `missing` | `always` / `never` / `missing` |
| `--no-log` | off | Skip writing `.refine/log.json` |
| `--dry-run` | off | Print the docker invocation and exit |
| `--verbose` | off | Stream pi output uncut |
| `--pi-timeout SECS` | `600` | Per-pi-call timeout |

The loop exits at the first matching rule:
1. Pass 1 → never stops.
2. Judge produces zero items → `all_obsolete`.
3. All items match "already satisfied / no-op / superseded" → `all_obsolete`.
4. Score change from previous pass < 2 → `delta_below_threshold`.
5. All items match trade-off / diminishing / LOW priority → `tradeoff_floor`.
6. Pass > `--iterations` → `max_iterations` (exit 1, still successful).

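
The first-match ordering of these rules can be sketched as a small shell function. This is an illustration under stated assumptions, not the project's actual outer loop: the name `stop_reason`, its argument order, and the `yes`/`no` flags are hypothetical, scores are treated as integers, and the score change is compared as a signed value.

```shell
# Prints the stop reason for a pass, or "continue" when no rule fires.
# Hypothetical helper; argument names and order are assumptions for illustration.
#   $1 pass number (1-based)
#   $2 number of items the judge produced
#   $3 "yes" if every item reads as already satisfied / no-op / superseded
#   $4 score change from the previous pass (integer, signed: an assumption)
#   $5 "yes" if every item reads as trade-off / diminishing / LOW priority
#   $6 value of --iterations
stop_reason() {
  sr_pass=$1
  sr_items=$2
  sr_obsolete=$3
  sr_delta=$4
  sr_tradeoff=$5
  sr_max=$6

  # Rules are tested in the documented order; the first match wins.
  if [ "$sr_pass" -eq 1 ]; then echo continue; return 0; fi             # pass 1 never stops
  if [ "$sr_items" -eq 0 ]; then echo all_obsolete; return 0; fi        # judge produced nothing
  if [ "$sr_obsolete" = yes ]; then echo all_obsolete; return 0; fi     # every item is obsolete
  if [ "$sr_delta" -lt 2 ]; then echo delta_below_threshold; return 0; fi
  if [ "$sr_tradeoff" = yes ]; then echo tradeoff_floor; return 0; fi
  if [ "$sr_pass" -gt "$sr_max" ]; then echo max_iterations; return 0; fi
  echo continue
}
```

For example, `stop_reason 2 4 no 1 no 3` prints `delta_below_threshold`, because the earlier rules do not match and the score moved by less than 2.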
`.refine/log.json` records the per-pass score, grade, and delta; per-item commit messages and diffs; and the stop reason, model, image tag, and timestamps. Disable with `--no-log`.

docker run --rm -i \
-v "$PWD/path/to/skill:/work" \
-e ANTHROPIC_API_KEY \
ghcr.io/barryroodt/refine-skill:latest \
/work --iterations 3 --model claude-sonnet-4-5

| Code | Meaning |
|---|---|
| 0 | Natural convergence (any of rules 2-5) |
| 1 | Max iterations reached (still successful) |
| 2 | Bad path / missing SKILL.md |
| 3 | Missing / mismatched API key |
| 4 | Docker not available |
| 10 | Pi crash |
| 11 | Judge output malformed |
| 12 | Hitl partial apply |
| 13 | Disk full / OOM |
| 14 | Another refine running on the same path |
| 130 | SIGINT |
| 143 | SIGTERM |
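
Because exit code 1 still counts as a successful run, a wrapper script that treats any non-zero status as failure (for example under `set -e`) would misreport runs that stop at the iteration cap. The sketch below groups the codes from the table into coarse categories; the function name `refine_exit_category` and the category labels are illustrative and not part of the CLI.

```shell
# Maps a refine-skill exit code to a coarse category, following the table above.
# Hypothetical helper; the name and the category labels are assumptions.
refine_exit_category() {
  case "$1" in
    0|1)            echo success ;;        # converged, or stopped at the iteration cap
    2|3|4)          echo setup_error ;;    # bad path, API key problem, or Docker unavailable
    10|11|12|13|14) echo runtime_error ;;  # pi crash, malformed judge output, partial apply, resources, lock
    130|143)        echo interrupted ;;    # SIGINT / SIGTERM
    *)              echo unknown ;;
  esac
}
```

In a script, capture the status right after the run (`npx @jumptag/refine-skill ./path/to/skill; status=$?`) and branch on `refine_exit_category "$status"` rather than on the raw zero/non-zero result.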
specs/2026-05-20-deftly-refine-cli-design.md
refine-skill is a thin orchestration harness around two existing pieces of work:
- Skill Forge by @WrathZA: `skill-forge-judge` + `skill-forge-hitl` provide all the actual refinement logic (scoring rubric, per-item HITL loop). Apache 2.0; pinned tag `2026.04.30`; baked into the image at build time and copied verbatim. See `NOTICE`.
- pi.dev coding agent by @mariozechner: provider-agnostic LLM harness that runs the two skills inside the container.

This project (@jumptag/refine-skill, MIT) just wires them together: Node CLI + bash outer loop + deterministic stop rules + telemetry.