StanislavBG/stepproof

stepproof

Regression testing for multi-step AI workflows. Not observability.


You upgraded to gpt-4o-mini. Your LangSmith traces look fine. Three days later a customer reports your extraction step stopped working. You found out from a Slack message, not a test.

stepproof is what you run before you deploy.

npm install -g stepproof

30-second quickstart

Write a scenario:

# classify.yaml
name: "Intent classification"
iterations: 10

steps:
  - id: classify
    provider: anthropic
    model: claude-sonnet-4-6
    prompt: "Classify the intent of this message: {{input}}"
    variables:
      input: "I need to cancel my subscription"
    min_pass_rate: 0.90
    assertions:
      - type: contains
        value: "cancel"
      - type: json_schema
        schema: ./schemas/intent.json

  - id: respond
    provider: openai
    model: gpt-4o
    prompt: "Given intent '{{classify.output}}', write a helpful reply to: {{input}}"
    min_pass_rate: 0.80
    assertions:
      - type: llm_judge
        prompt: "Is this response helpful and on-topic? Answer yes/no."
        pass_on: "yes"
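
The {{input}} and {{classify.output}} placeholders in the scenario above can be read as plain string substitution. The following is a minimal sketch of that idea, not stepproof's actual implementation: renderTemplate is a hypothetical helper, the classify output value is made up, and it assumes that scenario variables and earlier step outputs are both visible to later steps.

```typescript
// Hypothetical sketch of {{placeholder}} substitution; not stepproof's source.
type Vars = Record<string, string>;

// Replace each {{name}} with its value. Unknown names are left untouched
// so a typo stays visible in the rendered prompt.
function renderTemplate(template: string, vars: Vars): string {
  return template.replace(
    /\{\{\s*([\w.]+)\s*\}\}/g,
    (match: string, name: string) =>
      Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

const vars: Vars = { input: "I need to cancel my subscription" };

// Step 1 only needs the scenario variables.
const classifyPrompt = renderTemplate(
  "Classify the intent of this message: {{input}}",
  vars
);

// Made-up model output, for illustration only.
const classifyOutput = "cancel_subscription";

// Step 2 additionally sees the earlier step's output as "classify.output"
// (assumption: outputs are exposed under "<step id>.output").
const respondVars: Vars = { input: vars["input"] as string, "classify.output": classifyOutput };
const respondPrompt = renderTemplate(
  "Given intent '{{classify.output}}', write a helpful reply to: {{input}}",
  respondVars
);
```

Leaving unknown placeholders in place is a design choice of this sketch; a real runner might reject them as a scenario error instead.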

Run it:

stepproof run classify.yaml

Output:

stepproof v0.2.0 — running "Intent classification" (10 iterations)

  step: classify
    ✓ 9/10 passed (90.0%) — threshold: 90% ✓

  step: respond
    ✓ 8/10 passed (80.0%) — threshold: 80% ✓

All steps passed. Exit 0.

Now break it: swap in a cheaper model so the observed pass rate drops below the threshold. The run fails:

  step: classify
    ✗ 5/10 passed (50.0%) — threshold: 90% ✗

1 step failed. Exit 1.
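
The two runs above come down to one comparison per step: passed iterations divided by total iterations, checked against min_pass_rate. Here is a rough sketch of that arithmetic. The type and function names are hypothetical, and it assumes an iteration counts as passed only when all of its assertions hold.

```typescript
// Hypothetical names; a sketch of the threshold check, not stepproof's source.
interface StepResult {
  id: string;
  passed: number;      // iterations that passed (assumed: all assertions held)
  iterations: number;  // total iterations run
  minPassRate: number; // min_pass_rate from the scenario
}

// A step passes when its observed pass rate meets or exceeds the threshold.
function stepPasses(r: StepResult): boolean {
  return r.passed / r.iterations >= r.minPassRate;
}

// Exit 0 only if every step passes; otherwise exit 1.
function exitCode(results: StepResult[]): number {
  return results.every(stepPasses) ? 0 : 1;
}

// The numbers from the two runs shown above.
const healthy: StepResult[] = [
  { id: "classify", passed: 9, iterations: 10, minPassRate: 0.9 },
  { id: "respond", passed: 8, iterations: 10, minPassRate: 0.8 },
];
const regressed: StepResult[] = [
  { id: "classify", passed: 5, iterations: 10, minPassRate: 0.9 },
];
```

With these inputs, exitCode(healthy) is 0 and exitCode(regressed) is 1, matching the output shown above.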

Commands

stepproof run <scenario>

Run a scenario file or directory of scenarios.

stepproof run classify.yaml
stepproof run scenarios/
stepproof run scenarios/ --format sarif --output results.sarif
stepproof run scenarios/ --format junit --output results.xml

Flags:

  • --format <format>: structured output format, either sarif or junit. Omit the flag for the default human-readable terminal output.
  • --output <file>: write output to a file instead of stdout

stepproof init [dir]

Scaffold a starter scenario in the target directory. Defaults to ./scenarios/.

stepproof init
# Creates: ./scenarios/first-test.yaml

stepproof init my-tests
# Creates: ./my-tests/first-test.yaml

The generated first-test.yaml is a working example you can edit and run immediately.


Environment Variables

| Variable | Required | Purpose |
| --- | --- | --- |
| ANTHROPIC_API_KEY | For Anthropic steps | Authenticates calls to Claude models |
| OPENAI_API_KEY | For OpenAI steps | Authenticates calls to GPT models |

Only the keys for the providers you use in your scenarios are required.
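
To make the "only the keys you use" rule concrete, here is a small sketch of a pre-run check. The missingKeys helper is hypothetical and not part of stepproof. It maps each provider named in a scenario to its variable from the table above and reports the ones that are unset.

```typescript
// Hypothetical pre-run check; not part of the stepproof CLI.
type Provider = "anthropic" | "openai";

// Provider-to-variable mapping taken from the table above.
const KEY_FOR_PROVIDER: Record<Provider, string> = {
  anthropic: "ANTHROPIC_API_KEY",
  openai: "OPENAI_API_KEY",
};

// Return the names of required variables that are unset or empty,
// listing each name once even if several steps share a provider.
function missingKeys(
  providers: Provider[],
  env: Record<string, string | undefined>
): string[] {
  const missing: string[] = [];
  for (const p of providers) {
    const name = KEY_FOR_PROVIDER[p];
    if (!env[name] && missing.indexOf(name) === -1) {
      missing.push(name);
    }
  }
  return missing;
}
```

In a Node script you could pass process.env as the second argument and fail early when the returned list is non-empty.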


CI integration

# .github/workflows/ai-regression.yml
name: AI regression tests
on: [push, pull_request]

jobs:
  stepproof:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g stepproof
      - run: stepproof run scenarios/classify.yaml
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Exit code 1 on regression. PR blocked. Done.


Assertions

| Type | What it checks |
| --- | --- |
| contains | Output includes this string |
| not_contains | Output does not include this string |
| regex | Output matches this pattern |
| json_schema | Output is valid JSON matching this schema |
| llm_judge | A second LLM call evaluates the output (boolean verdict) |
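
The three string-based assertion types can be pictured as simple predicates over a step's output. The sketch below is illustrative only. It assumes case-sensitive matching and that regex takes its pattern in a value field like contains does; neither is confirmed by this README. json_schema and llm_judge are left out because they need a schema validator and a model call.

```typescript
// Illustrative predicates for the string-based assertion types.
// Field names and case sensitivity are assumptions, not stepproof's spec.
type Assertion =
  | { type: "contains"; value: string }
  | { type: "not_contains"; value: string }
  | { type: "regex"; value: string };

function check(output: string, a: Assertion): boolean {
  switch (a.type) {
    case "contains":
      // Passes when the substring appears anywhere in the output.
      return output.indexOf(a.value) !== -1;
    case "not_contains":
      // Passes when the substring is absent.
      return output.indexOf(a.value) === -1;
    case "regex":
      // Passes when the pattern matches some part of the output.
      return new RegExp(a.value).test(output);
  }
}
```

Under these assumptions, check("Intent: cancel", { type: "contains", value: "cancel" }) passes, while the same assertion against "Intent: Cancel" fails because of the capital C.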

Structured reports (v0.2.0)

stepproof outputs machine-readable SARIF 2.1.0 and JUnit XML for CI pipeline integration.

SARIF — GitHub Advanced Security / GitLab / Azure DevOps

# Write SARIF to stdout
stepproof run classify.yaml --format sarif

# Write SARIF to file
stepproof run classify.yaml --format sarif --output results.sarif

Integrate with GitHub Advanced Security:

# .github/workflows/ai-regression.yml
- name: Run stepproof
  run: stepproof run scenarios/ --format sarif --output results.sarif

- name: Upload to GitHub Security tab
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: results.sarif
  if: always()
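
For orientation, a SARIF 2.1.0 file is a JSON document with a version, one or more runs, and a list of results per run. The sketch below builds a minimal document for steps that fell below their threshold. The rule ID, message wording, and FailedStep shape are hypothetical; stepproof's real output may be structured differently.

```typescript
// Minimal SARIF 2.1.0 shape for failed steps. Rule ID and message text
// are made up for illustration; stepproof's actual report may differ.
interface FailedStep {
  id: string;
  passed: number;
  iterations: number;
  minPassRate: number;
  file: string; // scenario file the step came from
}

const RULE_ID = "step-below-threshold"; // hypothetical rule identifier

function toSarif(failures: FailedStep[]) {
  return {
    version: "2.1.0",
    runs: [
      {
        tool: { driver: { name: "stepproof", rules: [{ id: RULE_ID }] } },
        // One result per failed step, pointing at its scenario file.
        results: failures.map((f) => ({
          ruleId: RULE_ID,
          level: "error",
          message: {
            text: `Step "${f.id}" passed ${f.passed}/${f.iterations}, below min_pass_rate ${f.minPassRate}`,
          },
          locations: [
            { physicalLocation: { artifactLocation: { uri: f.file } } },
          ],
        })),
      },
    ],
  };
}
```

Serializing the return value with JSON.stringify gives a file of the general shape that SARIF consumers expect.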

JUnit XML — Jenkins / CircleCI / TeamCity

stepproof run classify.yaml --format junit
stepproof run classify.yaml --format junit --output results.xml

# .github/workflows/ai-regression.yml
- name: Run stepproof
  run: stepproof run scenarios/ --format junit --output test-results.xml

- name: Publish test results
  uses: actions/upload-artifact@v4
  with:
    name: test-results
    path: test-results.xml
  if: always()
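
JUnit XML is a loose convention in which a testsuite holds testcase elements and a failed case contains a failure child. As a rough sketch of how steps could map onto that layout, the code below treats each step as one testcase. The mapping and the failure message are assumptions; stepproof's real report may use different attributes.

```typescript
// Sketch: one <testcase> per step, with a <failure> child when the step
// falls below its threshold. The mapping is assumed, not stepproof's spec.
interface StepOutcome {
  id: string;
  passed: number;
  iterations: number;
  minPassRate: number;
}

// Escape characters that are unsafe inside XML attribute values.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function meetsThreshold(s: StepOutcome): boolean {
  return s.passed / s.iterations >= s.minPassRate;
}

function toJUnit(suite: string, steps: StepOutcome[]): string {
  const failures = steps.filter((s) => !meetsThreshold(s)).length;
  const cases = steps.map((s) => {
    const name = escapeXml(s.id);
    if (meetsThreshold(s)) {
      return `    <testcase name="${name}"/>`;
    }
    const msg = escapeXml(
      `${s.passed}/${s.iterations} passed, below min_pass_rate ${s.minPassRate}`
    );
    return `    <testcase name="${name}">\n      <failure message="${msg}"/>\n    </testcase>`;
  });
  const head = [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<testsuites>`,
    `  <testsuite name="${escapeXml(suite)}" tests="${steps.length}" failures="${failures}">`,
  ];
  const tail = [`  </testsuite>`, `</testsuites>`];
  return head.concat(cases, tail).join("\n");
}
```

Escaping matters here because step IDs and scenario names come from user-written YAML and may contain quotes or angle brackets.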

Default output (no --format flag) is unchanged — human-readable terminal output.

Migration note (v0.2.x → v0.3.0): --report still works but is deprecated and will print a warning. Switch to --format at your next convenience. --report will be removed at v1.0.0.


How this is different from LangSmith / Braintrust / Langfuse

| | stepproof | LangSmith / Braintrust |
| --- | --- | --- |
| When it runs | Before deploy (CI) | After deploy (production) |
| What it answers | "Is my pipeline still correct?" | "What did my pipeline do?" |
| Output | Pass/fail with exit code | Traces and dashboards |
| Use case | Regression testing | Observability |

They tell you what happened. We tell you whether to deploy.

These are different jobs. Use both.


Troubleshooting

Error: "scenarios/" is a directory

stepproof run ./scenarios/first-test.yaml   # ← run a specific file
stepproof run ./scenarios/                  # ← or run the whole dir (note trailing slash)

Error parsing scenario: ...

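Your YAML has a syntax or structure error. Common culprits: inconsistent indentation, unquoted {{vars}}, or a missing steps: key. Note that node -e "require('fs').readFileSync('./your.yaml')" only confirms the file exists and is readable; it does not parse the YAML. To catch syntax errors, run the file through a YAML parser or linter (for example, js-yaml or yamllint).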

API errors (401 Unauthorized, 403 Forbidden)

Set the API key for whichever provider your scenario uses:

export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

Only the keys for providers you use in your scenarios are required.

Steps failing when they should pass

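Check min_pass_rate in your scenario. The default is not 100%. A step passes when the fraction of passing iterations meets the threshold, so min_pass_rate: 0.90 tolerates up to 1 failure in 10 iterations. If a step still falls short, improve your prompt or, if that failure rate is acceptable, lower the threshold.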

--format must be "sarif" or "junit"

Only sarif and junit are valid format values. For terminal output, omit the --format flag entirely.

Pro features blocked (SARIF / JUnit output)

SARIF and JUnit formats require a Team license. Set your key:

export PREFLIGHT_LICENSE_KEY=preflight_...
stepproof run scenarios/ --format sarif --output results.sarif

Get a license at the Preflight pricing page.


Scenarios

See /examples for copy-paste ready scenarios.


Roadmap

  • v0.2.0 (current): YAML scenarios, N iterations, 5 assertion types, exit code 1 on failure, OpenAI + Anthropic, SARIF 2.1.0 + JUnit XML reporters, stepproof init scaffolding
  • v0.3.0 (next): Baseline comparison (fail on regression from last run), GitHub Actions native action, provider comparison mode — run the same scenario against two models and diff the results
  • Cloud dashboard (month 3–6): Persistent history, trend charts, team workspaces — never in the CLI

Contributing

Issues and PRs welcome. See CONTRIBUTING.md for dev setup and guidelines. The tool is and will remain free. Cloud features are the business model, not the CLI.


Part of the Preflight suite

stepproof is one tool in the Preflight AI Agent DevOps suite — local-first CLIs covering the full lifecycle from pre-deploy validation to production observability:

| Tool | Purpose | Install |
| --- | --- | --- |
| stepproof | Behavioral regression testing | npm install -g stepproof |
| agent-comply | EU AI Act compliance scanning | npm install -g agent-comply |
| agent-gate | Unified pre-deploy CI gate | npm install -g agent-gate |
| agent-shift | Config versioning + environment promotion | npm install -g agent-shift |
| agent-trace | Local observability: OTel traces in SQLite | npm install -g agent-trace |

Install the full suite:

npm install -g agent-gate stepproof agent-comply agent-shift agent-trace

stepproof — because "I checked manually before the deploy" is not a test.

