@jumptag/refine-skill

Refine an Agent Skill via the skill-forge judge → hitl loop, in a sandboxed Docker container.

Quick start

export ANTHROPIC_API_KEY=sk-...
npx @jumptag/refine-skill ./path/to/skill

Defaults: at most 3 iterations, model claude-sonnet-4-5, telemetry written to <path>/.refine/log.json.

Requirements

Node 20+ (for npx).
Docker (Engine 20.10+, daemon running).
An API key for one of: Anthropic, OpenAI, Google, xAI, Mistral, Groq, OpenRouter — matched to your --model choice. See MODELS.md for the full list of supported models, env vars, and where to get keys.

Usage

npx @jumptag/refine-skill <path> [options]

| Option | Default | Effect |
| --- | --- | --- |
| `--iterations N` | 3 | Max passes before stopping at the cap |
| `--model M` | `claude-sonnet-4-5` | Any model pi.dev supports |
| `--image TAG` | `ghcr.io/barryroodt/refine-skill:<pkg-version>` | Override the image |
| `--pull POLICY` | `missing` | `always` / `never` / `missing` |
| `--no-log` | off | Skip writing `.refine/log.json` |
| `--dry-run` | off | Print the docker invocation and exit |
| `--verbose` | off | Stream pi output uncut |
| `--pi-timeout SECS` | 600 | Per-pi-call timeout |

How stopping works

The loop exits at the first matching rule:

1. Pass 1 → never stops.
2. Judge produces zero items → all_obsolete.
3. All items match "already satisfied / no-op / superseded" → all_obsolete.
4. Score change from previous pass < 2 → delta_below_threshold.
5. All items match trade-off / diminishing / LOW priority → tradeoff_floor.
6. Pass > --iterations → max_iterations (exit 1, still successful).
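
The first-match ordering above can be sketched as a small shell function. This is an illustration only, not the project's actual implementation: the function name `stop_reason`, its argument order, and the idea of passing item counts as integers are all hypothetical, and it assumes the score change is an integer whose magnitude is compared against the threshold of 2.

```shell
# Hypothetical sketch of the stop rules; names and arguments are illustrative.
#   $1 pass number (1-based)     $2 --iterations cap
#   $3 total judge items         $4 items matching "already satisfied / no-op / superseded"
#   $5 items matching trade-off / diminishing / LOW priority
#   $6 score change from the previous pass (integer)
stop_reason() {
  pass=$1 cap=$2 total=$3 obsolete=$4 tradeoff=$5 delta=$6
  abs_delta=${delta#-}   # assumption: compare the magnitude of the change

  # Rules are checked top to bottom; the first match wins.
  if [ "$pass" -eq 1 ]; then
    echo continue                    # rule 1: pass 1 never stops
  elif [ "$total" -eq 0 ]; then
    echo all_obsolete                # rule 2: judge produced zero items
  elif [ "$obsolete" -eq "$total" ]; then
    echo all_obsolete                # rule 3: every item is obsolete
  elif [ "$abs_delta" -lt 2 ]; then
    echo delta_below_threshold       # rule 4: score barely moved
  elif [ "$tradeoff" -eq "$total" ]; then
    echo tradeoff_floor              # rule 5: only trade-off items remain
  elif [ "$pass" -gt "$cap" ]; then
    echo max_iterations              # rule 6: iteration cap exceeded
  else
    echo continue
  fi
}
```

For example, `stop_reason 2 3 4 1 0 1` prints `delta_below_threshold`, because rules 1 to 3 do not match and the score moved by less than 2.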

Telemetry

.refine/log.json records the score, grade, and delta for each pass, the commit message and diff for each item, and the stop reason, model, image tag, and timestamps for the run. Disable with --no-log.

Running without `npx`

docker run --rm -i \
  -v "$PWD/path/to/skill:/work" \
  -e ANTHROPIC_API_KEY \
  ghcr.io/barryroodt/refine-skill:latest \
  /work --iterations 3 --model claude-sonnet-4-5

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Natural convergence (any of rules 2-5) |
| 1 | Max iterations reached (still successful) |
| 2 | Bad path / missing SKILL.md |
| 3 | Missing / mismatched API key |
| 4 | Docker not available |
| 10 | Pi crash |
| 11 | Judge output malformed |
| 12 | Hitl partial apply |
| 13 | Disk full / OOM |
| 14 | Another refine running on the same path |
| 130 | SIGINT |
| 143 | SIGTERM |
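
Because exit code 1 still counts as a successful run, a calling script that treats any non-zero status as failure would misreport runs that hit the iteration cap. The helpers below are a minimal sketch of how a wrapper might handle this. The function names `refine_succeeded` and `describe_refine_exit` are hypothetical and not part of the CLI; the codes and meanings are copied from the table above.

```shell
# Hypothetical wrapper helpers; not shipped with refine-skill.

# Succeeds when the run converged (0) or stopped at the iteration cap (1).
refine_succeeded() {
  [ "$1" -eq 0 ] || [ "$1" -eq 1 ]
}

# Prints a short description for an exit code from the table above.
describe_refine_exit() {
  case "$1" in
    0)   echo "natural convergence" ;;
    1)   echo "max iterations reached" ;;
    2)   echo "bad path or missing SKILL.md" ;;
    3)   echo "missing or mismatched API key" ;;
    4)   echo "Docker not available" ;;
    10)  echo "pi crash" ;;
    11)  echo "judge output malformed" ;;
    12)  echo "hitl partial apply" ;;
    13)  echo "disk full or OOM" ;;
    14)  echo "another refine running on the same path" ;;
    130) echo "interrupted (SIGINT)" ;;
    143) echo "terminated (SIGTERM)" ;;
    *)   echo "unknown exit code $1" ;;
  esac
}
```

In a CI script, one might capture `$?` right after the `npx @jumptag/refine-skill` call, pass it to `refine_succeeded`, and print `describe_refine_exit` to stderr only when that check fails.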

Spec

specs/2026-05-20-deftly-refine-cli-design.md

Credits

refine-skill is a thin orchestration harness around two existing pieces of work:

Skill Forge by @WrathZA — skill-forge-judge + skill-forge-hitl provide all the actual refinement logic (scoring rubric, per-item HITL loop). Apache 2.0; pinned tag 2026.04.30; baked into the image at build time and copied verbatim. See NOTICE.
pi.dev coding agent by @mariozechner — provider-agnostic LLM harness that runs the two skills inside the container.

This project (@jumptag/refine-skill, MIT) just wires them together: Node CLI + bash outer loop + deterministic stop rules + telemetry.
