An open benchmark for evaluating AI coding agents on data engineering tasks across five sequential quality gates.
DEC Bench extends evaluation beyond functional correctness — a pipeline that runs is necessary but insufficient. The benchmark measures correctness, robustness, performance, and production readiness as distinct, ordered dimensions. Gates are sequential: a correct-but-fragile implementation does not score as robust.
Version 0.1 includes 37 scenarios evaluated against Postgres, Redpanda, and ClickHouse. All evaluation runs execute in identical containerized environments.
```sh
git clone https://github.com/514-labs/agent-evals.git
cd agent-evals
curl -fsSL https://decbench.ai/install.sh | sh
export ANTHROPIC_API_KEY=your-key-here
dec-bench build --scenario foo-bar-csv-ingest
dec-bench run --scenario foo-bar-csv-ingest
dec-bench results --latest --scenario foo-bar-csv-ingest
```

Open the repo in Claude Code, Cursor, or Codex and the DEC Bench agent skills auto-load. Ask "get me started" and your agent will walk you through the steps above. See AGENTS.md for the full first-run guide and examples/ for a tour of the worked scenarios.
Each evaluation run is scored against five sequential quality gates:
- Functional — output executes without errors
- Correct — produces expected results on all test cases
- Robust — handles errors and boundary conditions
- Performant — meets latency and throughput targets
- Production — code quality and safety fit for release
Each gate must pass before the next is evaluated. Assertions are deterministic — no LLM-as-judge.
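As a rough mental model of the gate sequencing (illustrative only: the gate names mirror the list above, but the script paths and runner shown here are hypothetical, not the benchmark's actual layout), evaluation runs one deterministic check per gate and stops at the first failure:

```sh
# Hypothetical sketch, not the real runner: one deterministic check per gate,
# evaluated in order; a failure stops the run so later gates are never scored.
for gate in functional correct robust performant production; do
  if ! bash "assertions/${gate}.sh"; then
    echo "stopped at gate: ${gate}"
    break
  fi
done
```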
Three agents supported in v0.1:
| Agent | Provider | API Key Variable |
|---|---|---|
| Claude Code | Anthropic | ANTHROPIC_API_KEY |
| Codex | OpenAI | OPENAI_API_KEY |
| Cursor | Cursor | CURSOR_API_KEY |
```sh
dec-bench build --scenario foo-bar-csv-ingest --agent codex
dec-bench run --scenario foo-bar-csv-ingest --agent codex
```

To run a matrix in parallel (--parallel > 1), set the plural form (ANTHROPIC_API_KEYS / OPENAI_API_KEYS / CURSOR_API_KEYS) to a comma-separated list. The CLI assigns one key per container in round-robin order (with two keys and --parallel 3, for example, containers receive key 1, key 2, then key 1 again):
```sh
export ANTHROPIC_API_KEYS=sk-ant-key1,sk-ant-key2
dec-bench run --matrix --parallel 2
```

Important: Anthropic rate limits apply per organization, not per key — and the same is true for most major providers. Two keys from the same organization share a single rate-limit pool, so multi-key rotation only adds real capacity when the keys come from different Anthropic organizations (e.g., a personal account and a team account).

For more capacity within a single organization, raise your usage tier or reduce --parallel. If only the singular key is set when --parallel > 1, the CLI prints a warning and proceeds.
The harness determines the tooling environment available to the agent during evaluation. Running the same scenario across different harnesses measures whether tooling improves agent performance.
| Harness | Scaffolding | Measures |
|---|---|---|
| `bare` | None — databases and CLIs only | First-principles reasoning |
| `classic-de` | dbt, Airflow, Spark | Applied competency with standard tooling |
| `olap-for-swe` | MooseStack | Code-first OLAP framework leverage |
```sh
dec-bench build --scenario foo-bar-csv-ingest --harness classic-de
dec-bench run --scenario foo-bar-csv-ingest --harness classic-de
```
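The same flags compose into a direct harness comparison. A sketch (commands use only flags shown above; how dec-bench results groups runs across harnesses depends on the CLI):

```sh
dec-bench build --scenario foo-bar-csv-ingest --harness bare
dec-bench run --scenario foo-bar-csv-ingest --harness bare

dec-bench build --scenario foo-bar-csv-ingest --harness olap-for-swe
dec-bench run --scenario foo-bar-csv-ingest --harness olap-for-swe

# view the most recent run; repeat per run to compare gate outcomes
dec-bench results --latest --scenario foo-bar-csv-ingest
```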
48 scenarios are in the preview today: 36 Foo Bar scenarios plus additional domain coverage across advertising, B2B SaaS, B2C SaaS, e-commerce, infra, and UGC.

```sh
dec-bench list
```

A scenario is a matrix, not a single run: 1 scenario × N harnesses × 2 personas = 2N evaluations. Each (scenario, harness) pair owns its own prompts and may carry its own seed data; scenario.json, the root init/, assertions/, and supervisord.conf are shared across harnesses.
```
scenarios/<id>/
  scenario.json                     # declares harnesses[]
  supervisord.conf                  # services (shared)
  init/                             # shared seed data
  assertions/                       # shared gates
  harnesses/<harness-id>/
    prompts/{baseline,informed}.md  # required per pair
    init/                           # optional; only this pair
    install.sh                      # optional; only this pair
```
See AGENTS.md for why prompts and init can differ per harness.
Every scenario runs against real databases, not mocks:
| Component | Role |
|---|---|
| Postgres | Transactional source of truth — schema migrations, referential integrity |
| Redpanda | High-throughput event streaming — topic management, consumer groups |
| ClickHouse | Columnar analytics — materialized views, real-time aggregation |
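To give a flavor of what a deterministic gate assertion against these services might look like, here is a sketch only: the database, table, and topic names are made up, and the client CLIs and connection settings are assumed to be available inside the evaluation container.

```sh
set -euo pipefail

# Correctness: the ClickHouse table holds exactly the expected seed volume
rows=$(clickhouse-client --query "SELECT count() FROM analytics.events")
test "$rows" -eq 10000

# Robustness: malformed input was quarantined in Postgres, not silently dropped
bad=$(psql -tAc "SELECT count(*) FROM quarantine.rejected_rows")
test "$bad" -gt 0

# Streaming: the ingest topic exists in Redpanda
rpk topic list | grep -q '^events'
```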
- Git
- Docker (running)
- An API key for your chosen agent
`dec-bench list` is the fastest install smoke test and does not require Docker or an API key; `build` and `run` do.
| Command | Purpose |
|---|---|
| `dec-bench list` | List available scenarios |
| `dec-bench build` | Build the evaluation image |
| `dec-bench run` | Run the evaluation |
| `dec-bench results` | View run results |
| `dec-bench audit export` | Create audit bundles |
| `dec-bench audit open` | Open the audit interface |
| `dec-bench create` | Scaffold a new scenario |
| `dec-bench validate` | Validate scenario structure |
- Add an agent: create docker/agents/<agent>/run.sh with the invocation logic.
- Add a harness: create apps/web/data/harnesses/<harness>.json with its tools and install script.
- Add a scenario: run dec-bench create --name my-eval --domain ugc --tier tier-1, then define the prompt and assertions (a sketch of this flow follows below). Open the repo in Claude Code, Cursor, or Codex and the dec-bench-create-scenario skill walks through the rest. Worked scenarios live in examples/.
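A minimal sketch of that scenario flow, with assumptions called out: the id passed to --scenario is assumed to match the --name value, and dec-bench validate is assumed to take no arguments; check dec-bench list and the skill if either differs.

```sh
dec-bench create --name my-eval --domain ugc --tier tier-1
# write prompts/{baseline,informed}.md per harness, plus shared init/ and assertions/
dec-bench validate
dec-bench build --scenario my-eval
dec-bench run --scenario my-eval
```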
Skills are checked into the repo at .claude/skills/ and .agents/skills/ and auto-load when you open the repo with Claude Code, Cursor, or Codex. No install step.
| Skill | When it fires |
|---|---|
| `dec-bench-quickstart` | "get started", "install", "first run" |
| `dec-bench-run` | "run scenario", "benchmark", "compare agents" |
| `dec-bench-create-scenario` | "create scenario", "add eval", "write a benchmark for" |
| `dec-bench-local-override` | "test a local moose-cli / ClickHouse / skill build before release" |
dec-bench-local-override guides contributors through substituting a locally-built artifact (moose-cli, moose-lib, ClickHouse, a Claude skill, etc.) into a scenario image to test changes before a release.
The canonical source for skill content is .claude/skills/ (Claude Code's docs-sanctioned discovery path). The .agents/skills/dec-bench-* entries are symlinks back into it, so the npx skills add 514-labs/agent-evals external install path keeps working without a separate copy.
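For reference, that external install path is the single command:

```sh
npx skills add 514-labs/agent-evals
```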
```sh
cargo test --manifest-path apps/cli/Cargo.toml
pnpm --filter @dec-bench/eval-core test
pnpm --filter web test:data
```

We welcome contributions across scenarios, harnesses, evaluation logic, documentation, and tooling. Contributions should meet the quality criteria described in the contributor guidelines.
Open source under MIT. See LICENSE.
DEC Bench is an open research effort by 514 Labs.