ser is a CLI test runner for SKILL.md files. It runs a skill in a fresh workspace, sends the
test prompt to an agent, and checks the observable result: files, stdout, stderr, exit code, JSON,
commands, duration, token usage, and the agent's final response.
It is useful once "I tried the prompt once and it looked fine" stops being enough. Skills tend to become part of real workflows, and those workflows do not always fail loudly. An agent can give a convincing answer while forgetting to create a file, putting it in the wrong place, or dropping an important rule after a small edit to the instructions.
- Finds *.skill-test.yml and *.skill-test.yaml files next to skills.
- Creates a sandbox for each test case.
- Passes the SKILL.md content and test prompt to an adapter.
- Lets the agent work inside the sandbox through file and command tools.
- Verifies the result with assertions.
- Writes reports for people and CI: console, JSON, HTML, JUnit, Markdown, and GitHub Actions annotations.
The short version: ser is pytest for AI Skills.
A manual chat run is a weak regression test. Everything can look reasonable in the conversation while the real contract of the skill sits somewhere else: created files, directory structure, commands, JSON output, exit codes, and the absence of errors.
ser makes those expectations explicit. You describe what the skill should do, then run the same
check locally, in a pull request, or in CI.
Current version: 0.1.0.
The runner can already execute real suite files, run in dry-run mode, use Claude, use the codex
adapter over the OpenAI API, connect to OpenAI-compatible endpoints, run tmpdir and Docker
sandboxes, save artifacts, redact secrets, retry failed calls, and generate several report formats.
The public surface is still small, so pin the version in CI and upgrade deliberately.
Requirements:
- Node.js >=20
- Docker, only when using --sandbox docker
- API keys, only for adapters that call a model
npm install -g skill-eval-runner
ser doctor

For local development from the repository:
npm install
npm run build
npm run dev -- doctor

Start with dry-run. It does not call an LLM and does not create files on behalf of an agent. Its
job is simpler: validate discovery, YAML, config, and reporting without tokens or API keys.
Create skill.skill-test.yml next to SKILL.md:
schema_version: '1.0'
name: sample-skill
tags: [sample, smoke]
skill: ./SKILL.md
adapter: dry-run
tests:
  - name: dry-run-smoke
    prompt: 'Explain what you would do in {{WORKSPACE}}.'
    assertions:
      - type: exit_code
        expected: 0
      - type: stderr_empty
      - type: response_contains
        contains: '[DRY RUN]'
      - type: token_usage_under
        max_total: 1

Run it:
ser run --dry-run --report console,json,html .

Once the suite shape is right, switch to a real adapter and assert on what the agent actually did in the workspace:
schema_version: '1.0'
name: migration-skill
tags: [critical]
skill: ./SKILL.md
adapter: claude
tests:
  - name: creates-user-migration
    prompt: 'Create a user table migration in {{WORKSPACE}}.'
    assertions:
      - type: exit_code
        expected: 0
      - type: stderr_empty
      - type: file_exists
        path: db/migrations/001_create_users.sql
      - type: file_contains
        path: db/migrations/001_create_users.sql
        contains: CREATE TABLE users

Example run with a provider key:
SER_ANTHROPIC_API_KEY=... ser run . --adapter claude --save-artifacts

A minimal suite contains a path to the skill and a list of tests:
schema_version: '1.0'
name: docs-skill
skill: ./SKILL.md
tags: [docs]
setup_files:
  - src: ./fixtures/package.json
    dest: package.json
tests:
  - name: updates-readme
    tags: [smoke]
    prompt: 'Use {{WORKSPACE}} and update the README.'
    assertions:
      - type: file_exists
        path: README.md
      - type: response_not_contains
        contains: 'I cannot'

Useful fields:
| Field | Scope | Purpose |
|---|---|---|
| tags | suite, test | Select fast, slow, critical, or experimental checks. |
| setup_files | suite, test | Copy fixtures into the sandbox before the agent runs. |
| setup_commands | suite, test | Prepare a workspace: install, generate, migrate, build. |
| teardown_commands | suite, test | Clean up temporary state after a case when needed. |
| env | suite, test, config | Pass environment variables to setup commands and adapters. |
| timeout_seconds | suite, test, config | Override the shared timeout for long or short cases. |
| adapter | suite, config, CLI | Choose dry-run, claude, codex, or openai-compat. |
| sandbox | suite, config, CLI | Choose tmpdir or docker. |
| sandbox_config | suite, config | Configure network, memory, CPU, process limits, or Dockerfile details. |
{{WORKSPACE}} in a prompt is replaced with the sandbox path. This is handy when a skill should work
with a test project rather than the runner repository itself.
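The substitution can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not ser's implementation: the helper name render_prompt is hypothetical, and a temporary directory stands in for the sandbox.

```python
import tempfile

PLACEHOLDER = "{{WORKSPACE}}"


def render_prompt(prompt: str, workspace: str) -> str:
    """Replace every {{WORKSPACE}} placeholder with the sandbox path.

    Hypothetical helper for illustration; ser's real code may differ.
    """
    return prompt.replace(PLACEHOLDER, workspace)


# A temporary directory stands in for the sandbox that ser would create.
with tempfile.TemporaryDirectory() as workspace:
    rendered = render_prompt(
        "Create a user table migration in {{WORKSPACE}}.", workspace
    )
    print(rendered)
```

Because the prompt names the sandbox path explicitly, the agent is pointed at the test project rather than at whatever directory the runner was started from.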
ser run [path] # find suites, run tests, write reports
ser init [path] # create example.skill-test.yml and .skilleval.yml
ser validate [path] # validate YAML, skill paths, and suite structure
ser list [path] # show discovered suites and cases
ser report <run.json> --format html
ser report <run.json> --format junit
ser report <run.json> --format markdown
ser doctor # check Node, Docker, keys, and report directories
ser completion bash # completion for bash, zsh, or fish

Common run options:
ser run . --filter 'migration-skill::creates-*'
ser run . --tags critical,not:slow
ser run . --adapter claude --model <model-name>
ser run . --adapter codex --model <model-name>
ser run . --adapter openai-compat --base-url http://localhost:11434/v1
ser run . --sandbox docker --save-artifacts
ser run . --max-cost 5
ser run . --report console,json,html,junit,markdown --report-dir .skilleval-reports

Exit codes:
| Code | Meaning |
|---|---|
| 0 | All error-severity assertions passed. |
| 1 | At least one test failed. |
| 2 | Parsing or configuration error. |
| 3 | Runtime error, adapter error, timeout, interruption, or max-cost guardrail triggered. |
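A CI wrapper can use these codes to tell a skill regression apart from a broken setup. The Python sketch below shows one possible policy. The helper names and the three category labels are assumptions made for illustration; only the code meanings come from the table above.

```python
# Meanings taken from the exit-code table; helper names are hypothetical.
EXIT_MEANINGS = {
    0: "all error-severity assertions passed",
    1: "at least one test failed",
    2: "parsing or configuration error",
    3: "runtime error, adapter error, timeout, interruption, or max-cost guardrail",
}


def describe_exit(code: int) -> str:
    """Return a human-readable meaning for a ser exit code."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def classify_exit(code: int) -> str:
    """One possible CI policy: separate test failures from setup problems."""
    if code == 0:
        return "pass"
    if code == 1:
        return "test-failure"  # the skill's behavior regressed
    return "infrastructure"  # config, adapter, timeout, budget, or unknown


# Example use (assumes ser is installed and on PATH):
#   import subprocess
#   code = subprocess.run(["ser", "run", ".", "--report", "console,junit"]).returncode
#   print(classify_exit(code), "-", describe_exit(code))
```

Keeping code 1 separate from codes 2 and 3 lets a pipeline retry or alert on infrastructure problems without masking a real test failure.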
Assert on behavior you can observe. If a skill should create a file, check the file. If it should return structured data, check the JSON. Response text assertions are useful, but they are usually weaker than workspace assertions.
| Group | Assertions |
|---|---|
| Files | file_exists, file_not_exists, file_contains, file_matches_regex, file_count, file_diff, dir_structure |
| Process | exit_code, stdout_contains, stdout_matches_regex, stderr_contains, stderr_empty, command_ran, duration_under, custom_command, no_exec_errors |
| JSON | json_schema, json_path_equals |
| Agent response | response_contains, response_not_contains, token_usage_under, semantic |
Any assertion can be a warning:
- type: response_contains
  contains: 'used project conventions'
  severity: warning

Warnings appear in reports but do not fail the test case.
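The severity rule can be sketched in a few lines of Python. This is an illustration, not ser's code: the AssertionResult shape, the default severity of "error", and both function names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class AssertionResult:
    """Outcome of one assertion; a hypothetical shape, not ser's report schema."""

    type: str
    passed: bool
    severity: str = "error"  # assumed default when a suite omits severity


def case_passed(results: list[AssertionResult]) -> bool:
    """A case fails only when an error-severity assertion fails."""
    return all(r.passed for r in results if r.severity == "error")


def failed_warnings(results: list[AssertionResult]) -> list[AssertionResult]:
    """Failed warning-severity assertions: reported, but not fatal."""
    return [r for r in results if r.severity == "warning" and not r.passed]


results = [
    AssertionResult("file_exists", passed=True),
    AssertionResult("response_contains", passed=False, severity="warning"),
]
print(case_passed(results))  # the failed warning does not fail the case
print([r.type for r in failed_warnings(results)])
```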
semantic uses an LLM judge. It is useful for nuanced quality checks, but CI should lean on
deterministic assertions: files, commands, JSON, and other concrete outputs.
| Adapter | Use it when |
|---|---|
| dry-run | You want to check suite parsing, discovery, config, and reports. |
| claude | You want to run the skill through the Anthropic Messages API with tools. |
| codex | You want to run the skill through the OpenAI API with the same sandbox. |
| openai-compat | You want to use an OpenAI-compatible endpoint, local model, or gateway. |
API key priority:
1. --api-key
2. SER_ANTHROPIC_API_KEY or SER_OPENAI_API_KEY
3. ANTHROPIC_API_KEY or OPENAI_API_KEY
4. api_key in .skilleval.yml
For openai-compat, you usually also need --base-url or base_url in config.
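The priority order can be read as "first non-empty source wins". The Python sketch below illustrates that reading. The function name and the adapter-to-provider mapping are assumptions: the list above names the variables but does not say which adapter reads which one.

```python
from typing import Mapping, Optional

# Assumed mapping from adapter to provider prefix (not confirmed by the README).
PROVIDER_BY_ADAPTER = {
    "claude": "ANTHROPIC",
    "codex": "OPENAI",
    "openai-compat": "OPENAI",
}


def resolve_api_key(
    adapter: str,
    cli_key: Optional[str],
    env: Mapping[str, str],
    config: Mapping[str, str],
) -> Optional[str]:
    """Return the first non-empty key in the documented priority order."""
    provider = PROVIDER_BY_ADAPTER[adapter]
    candidates = [
        cli_key,                             # 1. --api-key
        env.get(f"SER_{provider}_API_KEY"),  # 2. SER_-prefixed variable
        env.get(f"{provider}_API_KEY"),      # 3. provider's standard variable
        config.get("api_key"),               # 4. api_key in .skilleval.yml
    ]
    return next((key for key in candidates if key), None)


env = {"SER_ANTHROPIC_API_KEY": "ser-key", "ANTHROPIC_API_KEY": "plain-key"}
print(resolve_api_key("claude", None, env, {"api_key": "config-key"}))
```

In the usage line, the SER_-prefixed variable wins over both the standard variable and the config file, because no --api-key value was passed.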
tmpdir is fast and convenient locally. It creates a fresh temporary workspace, but it does not
provide strict network or resource isolation.
docker is slower, but better suited for CI and untrusted skills. It can restrict network, memory,
CPU, and process count.
Example:
sandbox: docker
docker_image: node:20-slim
sandbox_config:
  network: none
  memory_limit: '1g'
  cpu_limit: 1
  process_limit: 128

ser run . --report console,json,html,junit,markdown

Formats:
| Format | Purpose |
|---|---|
| console | Short local summary in the terminal. |
| json | Full machine-readable run result. |
| html | Readable report for review and debugging. |
| junit | CI test report integration. |
| markdown | PR comments or job summaries. |
| github-actions | Workflow annotations in stdout. |
You can rebuild reports from a saved JSON run:
ser report .skilleval-reports/<timestamp>.json --format html
ser report .skilleval-reports/<timestamp>.json --format junit
ser report .skilleval-reports/<timestamp>.json --format markdown

Use --save-artifacts when debugging failures. Failed and errored cases save sandbox contents,
adapter responses, and assertion details into the artifacts directory.
ser looks for .skilleval.yml upward from the given path:
adapter: claude
sandbox: tmpdir
timeout_seconds: 120
concurrency: 1
report_formats:
  - console
  - json
  - junit
report_dir: .skilleval-reports
save_artifacts: false
artifacts_dir: .skilleval-artifacts
redact_secrets: true
retry:
  max_attempts: 2
  backoff: exponential
  delay_ms: 30000
  retry_on: [rate_limit, 5xx, network]
discovery_patterns:
  - '**/*.skill-test.yml'
  - '**/*.skill-test.yaml'

CLI flags take precedence over config. That keeps baseline settings in the repository while a CI job chooses the model, key, concurrency, and report formats.
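The precedence rule amounts to overlaying the flags a user actually passed on top of the config file. The Python sketch below shows that overlay. The function name is hypothetical, the sketch assumes unset flags are represented as None, and it models only CLI versus config, since the order between suite-level and config-level values is not stated here.

```python
from typing import Any, Mapping


def effective_settings(
    config: Mapping[str, Any], cli_flags: Mapping[str, Any]
) -> dict[str, Any]:
    """Overlay CLI flags on config values; unset flags (None) do not override."""
    merged = dict(config)
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged


# Values as they might be loaded from .skilleval.yml.
config = {"adapter": "claude", "sandbox": "tmpdir", "concurrency": 1}
# Only --adapter was passed on the command line in this example.
cli_flags = {"adapter": "codex", "sandbox": None, "concurrency": None}

print(effective_settings(config, cli_flags))
# {'adapter': 'codex', 'sandbox': 'tmpdir', 'concurrency': 1}
```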
Example GitHub Actions workflow:
name: skill-evals
on:
  pull_request:
  push:
    branches: [main]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g skill-eval-runner
      - run: ser run . --adapter claude --report console,junit --report-dir .skilleval-reports
        env:
          SER_ANTHROPIC_API_KEY: ${{ secrets.SER_ANTHROPIC_API_KEY }}

For the first CI run, start with ser run . --dry-run --report console,junit. That checks discovery,
config, and reports before you spend tokens.
npm install
npm run typecheck
npm test
npm run build

Useful local commands:
npm run dev -- list fixtures/sample-skill
npm run dev -- validate fixtures/sample-skill
npm run dev -- run --dry-run fixtures/sample-skill
npm run dev -- run --dry-run --report console,json,html fixtures/sample-skill

ser doctor

Common issues:
| Symptom | Check |
|---|---|
| No suites found | File names must match *.skill-test.yml or *.skill-test.yaml. |
| SKILL.md not found | The skill path is resolved relative to the suite file. |
| Docker fails | Start the Docker daemon or use --sandbox tmpdir. |
| Missing API key | Set SER_ANTHROPIC_API_KEY, SER_OPENAI_API_KEY, or --api-key. |
| Flaky test | Move the check from response text to a file, command, JSON, or another deterministic fact. |
| Secrets appear in logs | Keep redact_secrets: true; it is enabled by default unless SER_REDACT_SECRETS=0. |
MIT. See LICENSE.