DEC Bench

An open benchmark for evaluating AI coding agents on data engineering tasks across five sequential quality gates.

DEC Bench extends evaluation beyond functional correctness — a pipeline that runs is necessary but insufficient. The benchmark measures correctness, robustness, performance, and production readiness as distinct, ordered dimensions. Gates are sequential: a correct-but-fragile implementation does not score as robust.

Version 0.1 includes 37 scenarios evaluated against Postgres, Redpanda, and ClickHouse. All evaluation runs execute in identical containerized environments.

Quick Start

git clone https://github.com/514-labs/agent-evals.git
cd agent-evals
curl -fsSL https://decbench.ai/install.sh | sh

export ANTHROPIC_API_KEY=your-key-here

dec-bench build --scenario foo-bar-csv-ingest
dec-bench run --scenario foo-bar-csv-ingest
dec-bench results --latest --scenario foo-bar-csv-ingest

Open the repo in Claude Code, Cursor, or Codex and the DEC Bench agent skills auto-load. Ask "get me started" and your agent will walk you through the steps above. See AGENTS.md for the full first-run guide and examples/ for a tour of the worked scenarios.

The Five-Gate Model

Each evaluation run is scored against five sequential quality gates:

  1. Functional — output executes without errors
  2. Correct — produces expected results on all test cases
  3. Robust — handles errors and boundary conditions
  4. Performant — meets latency and throughput targets
  5. Production — code quality and safety fit for release

Each gate must pass before the next is evaluated. Assertions are deterministic — no LLM-as-judge.
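As an illustration of the deterministic style, a gate assertion can be an ordinary script that exits non-zero on failure. The sketch below is hypothetical: the table name, expected count, and the exact contract used by files in assertions/ are assumptions, not the benchmark's actual format.

#!/usr/bin/env bash
# Hypothetical correctness assertion (sketch only; the real contract in
# assertions/ may differ). Fails the gate if the ingested row count is wrong.
set -euo pipefail

expected=10000
actual=$(clickhouse-client --query "SELECT count() FROM analytics.events")

if [ "$actual" -ne "$expected" ]; then
  echo "Correct gate failed: expected $expected rows, got $actual" >&2
  exit 1
fi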

Agents

Three agents are supported in v0.1:

Agent        Provider   API Key Variable
Claude Code  Anthropic  ANTHROPIC_API_KEY
Codex        OpenAI     OPENAI_API_KEY
Cursor       Cursor     CURSOR_API_KEY

dec-bench build --scenario foo-bar-csv-ingest --agent codex
dec-bench run --scenario foo-bar-csv-ingest --agent codex

Running with multiple keys

To run a matrix in parallel (--parallel > 1), set the plural form (ANTHROPIC_API_KEYS / OPENAI_API_KEYS / CURSOR_API_KEYS) to a comma-separated list. The CLI assigns one key per container in round-robin order:

export ANTHROPIC_API_KEYS=sk-ant-key1,sk-ant-key2
dec-bench run --matrix --parallel 2

Important: Anthropic rate limits apply per organization, not per key — and the same is true for most major providers. Two keys from the same organization share a single rate-limit pool, so multi-key rotation only adds real capacity when the keys come from different Anthropic organizations (e.g., a personal account and a team account).

For more capacity within a single organization, raise your usage tier or reduce --parallel. If only the singular key is set when --parallel > 1, the CLI prints a warning and proceeds.
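For example, to stay inside a single organization's rate-limit pool, keep the singular key and reduce the matrix to serial execution (the key value is a placeholder):

export ANTHROPIC_API_KEY=sk-ant-your-key
dec-bench run --matrix --parallel 1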

Evaluation Harnesses

The harness determines the tooling environment available to the agent during evaluation. Running the same scenario across different harnesses measures whether tooling improves agent performance.

Harness       Scaffolding                      Measures
bare          None (databases and CLIs only)   First-principles reasoning
classic-de    dbt, Airflow, Spark              Applied competency with standard tooling
olap-for-swe  MooseStack                       Code-first OLAP framework leverage

dec-bench build --scenario foo-bar-csv-ingest --harness classic-de
dec-bench run --scenario foo-bar-csv-ingest --harness classic-de
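
To compare tooling impact directly, build and run the same scenario once per harness. The loop below is just shorthand for repeating the two commands above, and it assumes the scenario declares all three harnesses in its scenario.json:

for h in bare classic-de olap-for-swe; do
  dec-bench build --scenario foo-bar-csv-ingest --harness "$h"
  dec-bench run --scenario foo-bar-csv-ingest --harness "$h"
done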

Scenarios

The preview currently includes 48 scenarios: 36 Foo Bar scenarios plus additional domain coverage across advertising, B2B SaaS, B2C SaaS, e-commerce, infra, and UGC.

dec-bench list

A scenario is a matrix, not a single run: 1 scenario × N harnesses × 2 personas = 2N evaluations. Each (scenario, harness) pair owns its own prompts and may carry its own seed data; scenario.json, root init/, assertions/, and supervisord.conf are shared across harnesses.

scenarios/<id>/
  scenario.json                     # declares harnesses[]
  supervisord.conf                  # services (shared)
  init/                             # shared seed data
  assertions/                       # shared gates
  harnesses/<harness-id>/
    prompts/{baseline,informed}.md  # required per pair
    init/                           # optional; only this pair
    install.sh                      # optional; only this pair

See AGENTS.md for why prompts and init can differ per harness.

Infrastructure

Every scenario runs against real databases, not mocks:

Component   Role
Postgres    Transactional source of truth: schema migrations, referential integrity
Redpanda    High-throughput event streaming: topic management, consumer groups
ClickHouse  Columnar analytics: materialized views, real-time aggregation

Prerequisites

  • Git
  • Docker (running)
  • An API key for your chosen agent

dec-bench list is the fastest install smoke test and does not require Docker or an API key. build and run do.

CLI Reference

Command                  Purpose
dec-bench list           List available scenarios
dec-bench build          Build the evaluation image
dec-bench run            Run the evaluation
dec-bench results        View run results
dec-bench audit export   Create audit bundles
dec-bench audit open     Open the audit interface
dec-bench create         Scaffold a new scenario
dec-bench validate       Validate scenario structure

Extending the Benchmark

Add an agent: Create docker/agents/<agent>/run.sh with the invocation logic.

Add a harness: Create apps/web/data/harnesses/<harness>.json with tools and install script.

Add a scenario: dec-bench create --name my-eval --domain ugc --tier tier-1, then define the prompt and assertions. Open the repo in Claude Code, Cursor, or Codex and the dec-bench-create-scenario skill walks through the rest. Worked scenarios live in examples/.
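
A typical authoring loop after scaffolding, assuming dec-bench validate can be run from the repo root without additional flags:

dec-bench create --name my-eval --domain ugc --tier tier-1
# edit the prompts and assertions under scenarios/my-eval/, then check the structure
dec-bench validate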

Agent Skills

Skills are checked into the repo at .claude/skills/ and .agents/skills/ and auto-load when you open the repo with Claude Code, Cursor, or Codex. No install step.

Skill                       When it fires
dec-bench-quickstart        "get started", "install", "first run"
dec-bench-run               "run scenario", "benchmark", "compare agents"
dec-bench-create-scenario   "create scenario", "add eval", "write a benchmark for"
dec-bench-local-override    "test a local moose-cli / ClickHouse / skill build before release"

dec-bench-local-override guides contributors through substituting a locally-built artifact (moose-cli, moose-lib, ClickHouse, a Claude skill, etc.) into a scenario image to test changes before a release.

The canonical source for skill content is .claude/skills/ (Claude Code's docs-sanctioned discovery path). The .agents/skills/dec-bench-* entries are symlinks back into it, so the npx skills add 514-labs/agent-evals external install path keeps working without a separate copy.
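
The external install path referenced above is a single command:

npx skills add 514-labs/agent-evals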

Testing

cargo test --manifest-path apps/cli/Cargo.toml
pnpm --filter @dec-bench/eval-core test
pnpm --filter web test:data

Contributing

We welcome contributions across scenarios, harnesses, evaluation logic, documentation, and tooling. Contributions should meet the quality criteria described in the contributor guidelines.

License

Open source under MIT. See LICENSE.


DEC Bench is an open research effort by 514 Labs.
