Benchy Agent

A 4-stage AI-assisted wizard that guides you through building benchmark.yaml specification files for the Benchy evaluation engine. Describe your LLM task, configure scoring, set up test data, and export a ready-to-run benchmark — all in a single web UI.

What it does

The wizard walks you through four stages:

1. Define: describe your task and specify the expected input/output format
2. Score: configure how model outputs are graded (per-field weights, pass/fail, semantic scoring)
3. Config: set up your test data source and choose the target model/API
4. Export: review the assembled benchmark.yaml, validate it, and download

An AI chat assistant (powered by Together AI) is available in every stage. Skill buttons provide guided prompts for each step.

Architecture

bud-clone (React/Vite)       artifacts/api-server (Express)       ~/benchy (Python CLI)
      Port 21707           →       Port 8080                →       benchy eval / validate
   Wizard UI, notepads          Chat agent, YAML assembly
   localStorage state           Data synthesis (Together AI)

The frontend proxies all /api/* requests to the Express server. The server assembles notepads into a benchmark.yaml and optionally calls benchy validate via subprocess.

Prerequisites

Requirement	Check
Node.js 20+	`node --version`
pnpm	`pnpm --version`
Python 3.12+	`python3 --version`
Together AI API key	platform.together.ai
Git	`git --version`

Install pnpm if you don't have it:

npm i -g pnpm
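
If you prefer to check everything in one go, the table above can be turned into a small script. This is only a sketch and not part of the repo: `missing_tools`, `node_major`, and `python_ok` are hypothetical helper names, and `python_ok` checks the interpreter running the script, which may differ from the `python3` on your PATH.

```python
import shutil
import sys

# Executables the setup steps call; names taken from the prerequisites table.
REQUIRED_TOOLS = ["node", "pnpm", "python3", "git"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def node_major(version_text):
    """Extract the major version from `node --version` output such as 'v20.11.0'."""
    return int(version_text.strip().lstrip("v").split(".")[0])


def python_ok(version_info=sys.version_info, minimum=(3, 12)):
    """True if the given interpreter version meets the Python 3.12+ requirement."""
    return tuple(version_info[:2]) >= minimum


print("Missing tools:", ", ".join(missing_tools(REQUIRED_TOOLS)) or "none")
print("Python 3.12+ (this interpreter):", python_ok())
```

To check Node.js as well, pass the text printed by `node --version` to `node_major` and compare the result with 20.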

Step 1 — Clone the Benchy engine

The wizard integrates with the Benchy Python CLI for validation and evaluation. Clone the feat/compiler branch:

git clone --branch feat/compiler https://github.com/surus-lat/benchy.git ~/benchy

Set up the Python virtual environment:

cd ~/benchy
bash setup.sh

This creates ~/benchy/.venv with benchy installed. Verify it works:

source ~/benchy/.venv/bin/activate
benchy --help

The wizard expects benchy to live at ~/benchy. If you clone it elsewhere, update the source path in the run instructions below accordingly.
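
If benchy lives somewhere else, a quick check of the venv location can save a failed server start. The sketch below makes two assumptions that this guide does not confirm: that `setup.sh` produces the usual POSIX venv layout (`.venv/bin/activate`) and that the `benchy` entry point is installed in `.venv/bin`. The function names are hypothetical.

```python
from pathlib import Path


def venv_paths(benchy_root="~/benchy"):
    """Return the files expected under benchy_root (assumed POSIX venv layout)."""
    bin_dir = Path(benchy_root).expanduser() / ".venv" / "bin"
    return {"activate": bin_dir / "activate", "benchy": bin_dir / "benchy"}


def missing_venv_files(benchy_root="~/benchy"):
    """List the expected venv files that do not exist under benchy_root."""
    return [name for name, path in venv_paths(benchy_root).items() if not path.is_file()]
```

Calling `missing_venv_files("/opt/benchy")` (an example path) returns an empty list when both files are present, and the names of whatever is missing otherwise.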

Step 2 — Clone this repo

git clone <this-repo-url> ~/Benchy-agent
cd ~/Benchy-agent

Step 3 — Install JS dependencies

From the repo root (installs all workspace packages at once):

pnpm install

Step 4 — Configure environment variables

Copy the example env file for the API server and fill in your key:

cp artifacts/api-server/.env.example artifacts/api-server/.env

Edit artifacts/api-server/.env:

PORT=8080
TOGETHER_API_KEY=your-key-here

The frontend already has its port set — no changes needed there.
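
A common slip is leaving the placeholder key in place. The following sketch checks the text of a .env file for the two variables listed above. It assumes plain KEY=VALUE lines and uses a deliberately simplified parser; how the server itself loads the file is not covered by this guide, and `parse_env` and `env_problems` are hypothetical names.

```python
REQUIRED = ("PORT", "TOGETHER_API_KEY")
PLACEHOLDER = "your-key-here"


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def env_problems(text):
    """Return human-readable problems found in the .env text."""
    values = parse_env(text)
    problems = [f"{key} is missing or empty" for key in REQUIRED if not values.get(key)]
    if values.get("TOGETHER_API_KEY") == PLACEHOLDER:
        problems.append("TOGETHER_API_KEY still holds the placeholder value")
    return problems


sample = "PORT=8080\nTOGETHER_API_KEY=your-key-here\n"
print(env_problems(sample))
```

To check your real file, read `artifacts/api-server/.env` and pass its contents to `env_problems`.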

Step 5 — Start the API server

Open a terminal and activate the benchy venv first (this puts benchy on PATH so the server can call benchy validate):

source ~/benchy/.venv/bin/activate
cd ~/Benchy-agent/artifacts/api-server
pnpm dev

You should see: Server listening on port 8080

Why activate the venv here? Activating a venv prepends ~/benchy/.venv/bin to PATH for the current shell session, and every process started from that shell inherits that PATH, including the Express server and its calls to benchy validate. You never need to cd into the benchy repo; just activate and stay here.

If benchy isn't on PATH, the validator falls back to a TypeScript stub — the app won't crash, but validation messages won't come from the real compiler.
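
The two points above, PATH inheritance and the fallback when benchy is missing, can be illustrated in a few lines. This is a Python sketch of the idea only, not the server's actual TypeScript code; `choose_validator` and `child_path` are hypothetical names, and the strings "benchy-cli" and "stub" are labels invented for the example.

```python
import os
import shutil
import subprocess
import sys


def choose_validator(which=shutil.which):
    """Pick the real CLI when benchy is found on PATH, otherwise the stub."""
    return "benchy-cli" if which("benchy") else "stub"


def child_path():
    """Return PATH as seen by a child process started from this one."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('PATH', ''))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


print("validator:", choose_validator())
print("child inherits PATH:", child_path() == os.environ.get("PATH", ""))
```

Run it once with the venv active and once without: the first line should switch between the two labels, while the second shows that a child process sees the same PATH as its parent.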

Step 6 — Start the frontend

Open a second terminal (no venv needed):

cd ~/Benchy-agent/artifacts/bud-clone
pnpm dev

You should see: Local: http://localhost:21707/

Step 7 — Open the wizard

Navigate to http://localhost:21707

Use the AI chat at the bottom of the screen, or click the skill buttons for guided prompts at each stage.
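
If the page does not load or the chat stays silent, it helps to confirm that both processes are actually listening. The sketch below only tests whether a TCP connection can be opened on the two ports used in this guide; it says nothing about whether the services behind them are healthy. `port_open` is a hypothetical helper name.

```python
import socket


def port_open(port, host="localhost", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports from this guide: 8080 for the API server, 21707 for the frontend.
for name, port in (("API server", 8080), ("frontend", 21707)):
    print(f"{name} on port {port}:", "up" if port_open(port) else "not reachable")
```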

Step 8 — Run your benchmark

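After completing the wizard and downloading your benchmark.yaml, run it with the benchy CLI. The benchy venv must be active in the shell you use. If the Step 5 terminal is still busy running the server, open a new terminal and run source ~/benchy/.venv/bin/activate there first: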

# Quick smoke test — 5 examples, fast feedback
benchy eval --benchmark path/to/your-benchmark.yaml --limit 5 --exit-policy smoke

# Full run
benchy eval --benchmark path/to/your-benchmark.yaml --exit-policy strict
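
If you launch runs from a script, building the argument list in one place avoids typos. This sketch uses only the flags shown in the two commands above (`--benchmark`, `--limit`, `--exit-policy`); `eval_command` is a hypothetical helper, and the default of "strict" is a choice made for this example.

```python
import shlex


def eval_command(benchmark, exit_policy="strict", limit=None):
    """Build the argument list for `benchy eval` from the flags shown in this guide."""
    cmd = ["benchy", "eval", "--benchmark", str(benchmark)]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    cmd += ["--exit-policy", exit_policy]
    return cmd


smoke = eval_command("path/to/your-benchmark.yaml", exit_policy="smoke", limit=5)
print(shlex.join(smoke))
```

The returned list can be passed to `subprocess.run` from a shell where the benchy venv is active.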

Results are written to:

~/benchy/outputs/benchmark_outputs/<run_id>/<benchmark_name>/
├── run_outcome.json   # overall status: passed / degraded / failed
└── run_summary.json   # per-field scores
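
To collect results across several runs, the directory layout above is enough to locate every outcome file. The sketch below relies only on the `<run_id>/<benchmark_name>/run_outcome.json` layout and returns the parsed JSON untouched, since the field names inside the file are not documented here. `find_run_outcomes` is a hypothetical name.

```python
import json
from pathlib import Path


def find_run_outcomes(outputs_root="~/benchy/outputs/benchmark_outputs"):
    """Return (run_id, benchmark_name, parsed JSON) for every run_outcome.json found."""
    root = Path(outputs_root).expanduser()
    found = []
    for path in sorted(root.glob("*/*/run_outcome.json")):
        outcome = json.loads(path.read_text(encoding="utf-8"))
        found.append((path.parent.parent.name, path.parent.name, outcome))
    return found
```

Calling `find_run_outcomes()` with no arguments scans the default location; an empty list means no runs have been written there yet.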

Environment variables reference

API server (`artifacts/api-server/.env`)

Variable	Required	Purpose
`PORT`	Yes	API server port (default: 8080)
`TOGETHER_API_KEY`	Yes	Powers the chat agent and data synthesis

Provider keys for `benchy eval` (set in your shell, not the `.env`)

These are used by the benchy CLI when it calls model APIs during evaluation — not by the wizard itself:

Variable	Provider
`OPENAI_API_KEY`	OpenAI models
`ANTHROPIC_API_KEY`	Anthropic / Claude models
`GOOGLE_API_KEY`	Google Gemini models
`TOGETHER_API_KEY`	Together AI models (same key as above)
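
Before a long run, it can be worth confirming that the key for your chosen provider is actually exported in the shell. The sketch below checks the variables from the table above; `unset_keys` is a hypothetical helper, and which keys you need depends on the models your benchmark targets.

```python
import os

# Variable names and providers as listed in the table above.
PROVIDER_KEYS = {
    "OPENAI_API_KEY": "OpenAI models",
    "ANTHROPIC_API_KEY": "Anthropic / Claude models",
    "GOOGLE_API_KEY": "Google Gemini models",
    "TOGETHER_API_KEY": "Together AI models",
}


def unset_keys(needed, environ=None):
    """Return the variable names from `needed` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [key for key in needed if not environ.get(key)]


for key in unset_keys(PROVIDER_KEYS):
    print(f"{key} is not set ({PROVIDER_KEYS[key]})")
```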

Project structure

Benchy-agent/
├── artifacts/
│   ├── api-server/          # Express.js REST API (port 8080)
│   │   └── src/
│   │       ├── routes/      # /api/chat, /api/benchmark/*
│   │       └── lib/         # yaml-assembly, data-generator, benchmark-compiler
│   └── bud-clone/           # React/Vite frontend (port 21707)
│       └── src/
│           └── pages/home.tsx   # Wizard UI (stages, notepads, chat, export)
├── lib/
│   ├── api-zod/             # Zod schemas for API contracts
│   ├── api-spec/            # OpenAPI spec + code generation
│   ├── api-client-react/    # React Query hooks
│   └── db/                  # Drizzle ORM schema (placeholder)
├── docs/                    # Architecture notes
├── RUNNING.md               # Condensed run instructions
└── FULL-INTEGRATION-PLAN.md # Roadmap for deeper benchy CLI integration

Troubleshooting

benchy: command not found when starting the API server
Activate the venv before starting the server: source ~/benchy/.venv/bin/activate

Frontend loads but chat returns errors
Check that the API server is running on port 8080 and that TOGETHER_API_KEY is set in artifacts/api-server/.env.

pnpm install fails
Make sure you're running Node.js 20+. Run node --version to confirm.

benchy setup.sh fails
Ensure Python 3.12+ is installed. On macOS: brew install python@3.12
