A 4-stage AI-assisted wizard that guides you through building benchmark.yaml specification files for the Benchy evaluation engine. Describe your LLM task, configure scoring, set up test data, and export a ready-to-run benchmark — all in a single web UI.
The wizard walks you through four stages:
- Define — describe your task and specify the expected input/output format
- Score — configure how model outputs are graded (per-field weights, pass/fail, semantic scoring)
- Config — set up your test data source and choose the target model/API
- Export — review the assembled benchmark.yaml, validate it, and download
An AI chat assistant (powered by Together AI) is available in every stage. Skill buttons provide guided prompts for each step.
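| Component | Port / command | Role |
|---|---|---|
| bud-clone (React/Vite) | Port 21707 | Wizard UI, notepads, localStorage state |
| artifacts/api-server (Express) | Port 8080 | Chat agent, YAML assembly, data synthesis (Together AI) |
| ~/benchy (Python CLI) | benchy eval / validate | CLI used for validation and evaluation |

Request flow: bud-clone → artifacts/api-server → ~/benchy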
The frontend proxies all /api/* requests to the Express server. The server assembles notepads into a benchmark.yaml and optionally calls benchy validate via subprocess.
| Requirement | Check |
|---|---|
| Node.js 20+ | node --version |
| pnpm | pnpm --version |
| Python 3.12+ | python3 --version |
| Together AI API key | platform.together.ai |
| Git | git --version |
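
As a quick alternative to running each check by hand, the following minimal sketch reports which command-line tools from the table are missing from PATH. The function name `missing_tools` is hypothetical, and the script only checks that each tool is present; it does not verify the minimum versions listed above.

```python
import shutil

# Command-line tools named in the requirements table.
REQUIRED_TOOLS = ["node", "pnpm", "python3", "git"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required command-line tools were found on PATH.")
```

Versions still need to be confirmed with the commands in the Check column.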
Install pnpm if you don't have it:
npm i -g pnpm

The wizard integrates with the Benchy Python CLI for validation and evaluation. Clone the feat/compiler branch:

git clone --branch feat/compiler https://github.com/surus-lat/benchy.git ~/benchy

Set up the Python virtual environment:
cd ~/benchy
bash setup.sh

This creates ~/benchy/.venv with benchy installed. Verify it works:
source ~/benchy/.venv/bin/activate
benchy --help

The wizard expects benchy to live at ~/benchy. If you clone it elsewhere, update the source path in the run instructions below accordingly.
git clone <this-repo-url> ~/Benchy-agent
cd ~/Benchy-agent

From the repo root (installs all workspace packages at once):

pnpm install

Copy the example env file for the API server and fill in your key:

cp artifacts/api-server/.env.example artifacts/api-server/.env

Edit artifacts/api-server/.env:
PORT=8080
TOGETHER_API_KEY=your-key-here
The frontend already has its port set — no changes needed there.
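
Before starting the server, it can help to confirm that the .env file actually contains both settings. The sketch below is a deliberately simple KEY=VALUE reader, not a full dotenv parser, and the helper names `parse_env` and `missing_keys` are hypothetical. It treats the `your-key-here` placeholder as unset.

```python
from pathlib import Path

# The two settings the API server's .env file is expected to define.
REQUIRED_KEYS = ("PORT", "TOGETHER_API_KEY")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return required keys that are absent, empty, or still the placeholder."""
    return [key for key in required if values.get(key, "") in ("", "your-key-here")]


if __name__ == "__main__":
    env_path = Path("artifacts/api-server/.env")
    if env_path.exists():
        print("Missing or unset:", missing_keys(parse_env(env_path.read_text())))
    else:
        print(f"{env_path} not found; run this from the repo root.")
```

An empty list means both settings are present.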
Open a terminal and activate the benchy venv first (this puts benchy on PATH so the server can call benchy validate):
source ~/benchy/.venv/bin/activate
cd ~/Benchy-agent/artifacts/api-server
pnpm dev

You should see: Server listening on port 8080
Why activate the venv here? Venv activation is shell-wide. Once you run source ~/benchy/.venv/bin/activate, the benchy command is available to any subprocess spawned from that shell, including the Express server's calls to benchy validate. You never cd into the benchy repo; just activate and stay here.

If benchy isn't on PATH, the validator falls back to a TypeScript stub. The app won't crash, but validation messages won't come from the real compiler.
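
To see which validator a given terminal would end up with, the PATH lookup can be reproduced in a few lines. This is only an illustration of the decision described above: the real server is Express/TypeScript and its code may differ, and the name `choose_validator` and the return values `"benchy"` and `"stub"` are hypothetical.

```python
import shutil


def choose_validator(which=shutil.which):
    """Return "benchy" when the CLI is found on PATH, otherwise "stub".

    `which` is injectable so the lookup can be replaced in tests.
    """
    return "benchy" if which("benchy") else "stub"


if __name__ == "__main__":
    if choose_validator() == "benchy":
        print("benchy found on PATH: real validation is available.")
    else:
        print("benchy not on PATH: activate the venv before starting the server.")
```

Run it from the same terminal in which you plan to start the API server, since PATH is inherited from that shell.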
Open a second terminal (no venv needed):
cd ~/Benchy-agent/artifacts/bud-clone
pnpm dev

You should see: Local: http://localhost:21707/
Navigate to http://localhost:21707
Use the AI chat at the bottom of the screen, or click the skill buttons for guided prompts at each stage.
After completing the wizard and downloading your benchmark.yaml, run it with the benchy CLI (venv must still be active from Step 5's terminal):
# Quick smoke test — 5 examples, fast feedback
benchy eval --benchmark path/to/your-benchmark.yaml --limit 5 --exit-policy smoke
# Full run
benchy eval --benchmark path/to/your-benchmark.yaml --exit-policy strict

Results are written to:
~/benchy/outputs/benchmark_outputs/<run_id>/<benchmark_name>/
├── run_outcome.json # overall status: passed / degraded / failed
└── run_summary.json # per-field scores
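
For scripting (for example in CI), the overall status can be read back from run_outcome.json. The sketch below assumes the status is stored under a top-level `"status"` key; the file's actual schema is not documented here, so adjust the key after inspecting a real file. The helper names `read_status` and `exit_code_for` are hypothetical.

```python
import json
import sys
from pathlib import Path


def read_status(run_dir):
    """Return the overall status from run_outcome.json, or None if absent."""
    outcome_path = Path(run_dir) / "run_outcome.json"
    with outcome_path.open(encoding="utf-8") as handle:
        outcome = json.load(handle)
    # ASSUMPTION: the status lives under a top-level "status" key.
    return outcome.get("status")


def exit_code_for(status):
    """Map a status to a process exit code: 0 only for "passed"."""
    return 0 if status == "passed" else 1


if __name__ == "__main__" and len(sys.argv) == 2:
    status = read_status(sys.argv[1])
    print(f"Run status: {status}")
    sys.exit(exit_code_for(status))
```

Pass the `<benchmark_name>` results directory as the only argument. Treating "degraded" as a failure is a design choice of this sketch, not benchy's behavior.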
| Variable | Required | Purpose |
|---|---|---|
| PORT | Yes | API server port (default: 8080) |
| TOGETHER_API_KEY | Yes | Powers the chat agent and data synthesis |
These are used by the benchy CLI when it calls model APIs during evaluation — not by the wizard itself:
| Variable | Provider |
|---|---|
| OPENAI_API_KEY | OpenAI models |
| ANTHROPIC_API_KEY | Anthropic / Claude models |
| GOOGLE_API_KEY | Google Gemini models |
| TOGETHER_API_KEY | Together AI models (same key as above) |
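
Before an evaluation run, a short check like the following can list which providers have a key set in the current shell. It only inspects environment variables and does not contact any API; the function name `configured_providers` is hypothetical.

```python
import os

# Environment variables from the table above, mapped to their providers.
PROVIDER_KEYS = {
    "OPENAI_API_KEY": "OpenAI models",
    "ANTHROPIC_API_KEY": "Anthropic / Claude models",
    "GOOGLE_API_KEY": "Google Gemini models",
    "TOGETHER_API_KEY": "Together AI models",
}


def configured_providers(environ=os.environ):
    """Return providers whose API key variable is set to a non-empty value."""
    return [label for name, label in PROVIDER_KEYS.items() if environ.get(name)]


if __name__ == "__main__":
    found = configured_providers()
    print("Configured providers:", ", ".join(found) if found else "none")
```

A key that is present does not guarantee it is valid; that is only confirmed when benchy calls the provider.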
Benchy-agent/
├── artifacts/
│   ├── api-server/              # Express.js REST API (port 8080)
│   │   └── src/
│   │       ├── routes/          # /api/chat, /api/benchmark/*
│   │       └── lib/             # yaml-assembly, data-generator, benchmark-compiler
│   └── bud-clone/               # React/Vite frontend (port 21707)
│       └── src/
│           └── pages/home.tsx   # Wizard UI (stages, notepads, chat, export)
├── lib/
│   ├── api-zod/                 # Zod schemas for API contracts
│   ├── api-spec/                # OpenAPI spec + code generation
│   ├── api-client-react/        # React Query hooks
│   └── db/                      # Drizzle ORM schema (placeholder)
├── docs/                        # Architecture notes
├── RUNNING.md                   # Condensed run instructions
└── FULL-INTEGRATION-PLAN.md     # Roadmap for deeper benchy CLI integration
benchy: command not found when starting the API server
Activate the venv before starting the server: source ~/benchy/.venv/bin/activate
Frontend loads but chat returns errors
Check that the API server is running on port 8080 and that TOGETHER_API_KEY is set in artifacts/api-server/.env.
pnpm install fails
Make sure you're running Node.js 20+. Run node --version to confirm.
benchy setup.sh fails
Ensure Python 3.12+ is installed. On macOS: brew install python@3.12
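
When the frontend loads but chat fails, a quick way to rule out a stopped server is to test whether the two ports accept TCP connections. This is a minimal sketch using only the standard library; the function name `port_is_open` is hypothetical, and a reachable port shows only that something is listening there, not that it is the right process.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    for name, port in (("API server", 8080), ("Frontend", 21707)):
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```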