Draft
30 commits
830a836
feat(benchmark): scaffold SlopBench package + report contracts
cursoragent Jun 9, 2026
e26c16a
feat(benchmark): React Doctor scanner (offline, diff-scoped) + passin…
cursoragent Jun 9, 2026
c659625
feat(benchmark): deterministic local slop scorer + weights + golden t…
cursoragent Jun 9, 2026
88ad71d
feat(benchmark): TS-strictness + composition + deslop AST checks (gap…
cursoragent Jun 9, 2026
51270da
feat(benchmark): slop-verify orchestrator + CLI + diff collection + e…
cursoragent Jun 9, 2026
d292b64
feat(benchmark): Harbor/Pier task harness — base image, shared grader…
cursoragent Jun 9, 2026
c24dae6
feat(benchmark): task format-money-util (validated) + --no-dead-code …
cursoragent Jun 9, 2026
92ae8a7
feat(benchmark): task scaffolding + patch generators; add typed-stora…
cursoragent Jun 9, 2026
22af7e5
feat(benchmark): handle-slop task group-by-extend (validated)
cursoragent Jun 9, 2026
5a9eb7c
feat(benchmark): React/vitest task status-pill-variants (validated)
cursoragent Jun 9, 2026
899060e
feat(benchmark): notification-list task (validated; demonstrates reac…
cursoragent Jun 9, 2026
bd4f842
feat(benchmark): format-duration-util task + validate-all script (all…
cursoragent Jun 9, 2026
fd4d160
feat(benchmark): README + results aggregator + methodology docs; scop…
cursoragent Jun 9, 2026
d11756f
feat(benchmark): dashboard-loader async-waterfall task (Next server l…
cursoragent Jun 9, 2026
b1a3000
feat(benchmark): add slugify, parse-query, format-list-extend tasks (…
cursoragent Jun 9, 2026
c214279
feat(benchmark): add truncate-middle + avatar-initials tasks (12 task…
cursoragent Jun 9, 2026
85a8405
ci(benchmark): wire SlopBench task self-test into CI + add benchmark:…
cursoragent Jun 9, 2026
d39dbfe
docs(benchmark): reference pnpm benchmark:validate in README
cursoragent Jun 9, 2026
6d9b7ea
feat(benchmark): icon-button-a11y task (covers accessibility dimensio…
cursoragent Jun 9, 2026
3bea864
feat(benchmark): chunk-util task (covers bundle dimension; 14 tasks; …
cursoragent Jun 9, 2026
b73466a
feat(benchmark): route-handler-json Next App Router task (query filte…
cursoragent Jun 9, 2026
f8a7e9a
feat(benchmark): comment-thread-extend handle-slop React task (16 tas…
cursoragent Jun 9, 2026
df32dbf
feat(benchmark): add paginate-util + retry-async-util tasks (18 tasks…
cursoragent Jun 9, 2026
41fbcb0
feat(benchmark): add unique-by + title-case tasks (20 tasks; full agr…
cursoragent Jun 9, 2026
00781f0
fix(benchmark): regenerate task patches after formatting so all 20 ap…
cursoragent Jun 9, 2026
d8acdb4
docs(benchmark): note to format before generating task patches
cursoragent Jun 9, 2026
885a13a
docs(benchmark): clean up authoring note (pnpm format before gen-task…
cursoragent Jun 9, 2026
6a031d4
test(benchmark): cover results aggregator (pass-rate, means, per-dime…
cursoragent Jun 9, 2026
d9f356e
fix(benchmark): raise vitest timeout for React Doctor-spawning integr…
cursoragent Jun 9, 2026
aa59fb9
fix(docs): format docs/SLOPBENCH.md (CI format:check)
cursoragent Jun 9, 2026
9 changes: 9 additions & 0 deletions .github/workflows/ci.yml
@@ -108,3 +108,12 @@ jobs:
      - name: Smoke test interactive TTY prompt
        if: ${{ matrix.os == 'ubuntu-latest' && matrix.node-version == '22.18.0' }}
        run: pnpm smoke:tty-prompt

      # SlopBench task self-test: every benchmark task's clean reference solution
      # must still pass its functional gate AND score reward > 0. Guards the
      # corpus against drift in the verifier, the scoring profile, or React
      # Doctor. The prior `pnpm build` step left react-doctor + the verifier
      # built; React-component tasks install their own dev deps during grading.
      - name: Validate SlopBench task reference solutions
        if: ${{ matrix.os == 'ubuntu-latest' && matrix.node-version == '22.18.0' }}
        run: pnpm benchmark:validate
105 changes: 105 additions & 0 deletions docs/SLOPBENCH.md
@@ -0,0 +1,105 @@
# SlopBench — methodology

SlopBench (in [`packages/benchmark`](../packages/benchmark)) measures how good a
model is at frontend engineering, with a deliberate focus on **how much React /
TypeScript slop it emits**. It extends the DeepSWE / Harbor approach with a
second, continuous quality axis.

## Why two axes

Correctness-only benchmarks reward a working feature regardless of how it was
built. Real frontend review cares about both: does it work, _and_ is it clean?
SlopBench keeps a hard **functional gate** (hidden behavioral tests) and adds a
**slop score** computed purely by static analysis on the diff:

```
reward = functional_pass × (slop_score / 100)
```

- `functional_pass ∈ {0,1}` — the DeepSWE-style gate.
- `slop_score ∈ [0,100]` — higher = cleaner.

Reporting both separately (plus per-dimension subscores) lets a leaderboard rank
by correctness, by cleanliness, or by the product. Setting the slop weight to
zero recovers a pure correctness benchmark.
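
A minimal sketch of the reward formula above. The helper name `computeReward` is hypothetical and is not taken from the package; it only restates the arithmetic.

```typescript
// Hypothetical helper: reward = functional_pass × (slop_score / 100).
function computeReward(functionalPass: boolean, slopScore: number): number {
  if (!Number.isFinite(slopScore) || slopScore < 0 || slopScore > 100) {
    throw new RangeError("slopScore must be within [0, 100]");
  }
  // The functional gate is binary: a failing task earns nothing,
  // however clean the diff is.
  return functionalPass ? slopScore / 100 : 0;
}

console.log(computeReward(true, 80));
console.log(computeReward(false, 95));
```

Because the gate multiplies the slop score, a clean diff that fails its hidden tests cannot outrank a sloppy diff that passes.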

## How the slop score is computed

The verifier (`slop-verify`, the `@react-doctor/benchmark` package) runs
**offline** over the agent's diff against the task's base commit:

1. **React Doctor** (`--json --no-score --no-dead-code`) — the canonical React
diagnostic engine, scoped to the files the agent changed. Its five categories
(Security, Bugs, Performance, Accessibility, Maintainability) map to four
dimensions: `react-correctness`, `react-performance`, `accessibility`, and
`maintainability`. Specific bundle/waterfall rules are rerouted to the `bundle`
and `async-waterfall` dimensions.
2. **TypeScript strictness** (AST, no type-checker needed) — explicit `any`,
`as` casts, non-null `!`, and `@ts-ignore`/`@ts-nocheck`/`@ts-expect-error`.
3. **Composition** (AST, distilled from Vercel's composition-patterns) —
boolean-prop soup and function-valued render props.
4. **deslop heuristic** — nested ternaries.
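
To make the custom checks in steps 2 and 4 concrete, the sketch below contrasts a function that would plausibly trip them with a cleaner equivalent. The `User` type and both functions are invented for illustration and do not come from any SlopBench task.

```typescript
interface User {
  name?: string;
  role: "admin" | "member" | "guest";
}

// Sloppy: explicit `any`, an `as` cast, a non-null `!`, and a nested ternary.
function labelSloppy(input: any): string {
  const user = input as User;
  const name = user.name!.trim();
  return user.role === "admin" ? `${name} (admin)` : user.role === "member" ? name : "guest";
}

// Cleaner: a typed parameter, explicit handling of a missing name,
// and a lookup table instead of a ternary chain.
const ROLE_SUFFIX: Record<User["role"], string> = {
  admin: " (admin)",
  member: "",
  guest: "",
};

function labelClean(user: User): string {
  if (user.role === "guest") return "guest";
  const name = user.name?.trim() ?? "unknown";
  return `${name}${ROLE_SUFFIX[user.role]}`;
}

console.log(labelClean({ name: " Ada ", role: "admin" }));
```

Both versions agree whenever `name` is present; only the clean one survives a missing name without throwing.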

Each finding is weighted `severity × category × rule-impact`, the per-dimension
penalty is **size-normalized** by the diff's added lines (so large legitimate
features are not punished as hard as the same violations in a tiny diff), and
each dimension scores `clamp(100 − penalty, 0, 100)`. The composite is the
profile-weighted mean across dimensions.
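
The scoring steps in this paragraph can be sketched as follows. This is an illustration under stated assumptions, not the verifier's code: the field names, `penaltyScale`, and `minAddedLines` are hypothetical, and the real normalization in the scoring profile may differ.

```typescript
interface Finding {
  dimension: string;
  severityWeight: number;
  categoryWeight: number;
  ruleImpact: number;
}

interface Profile {
  dimensionWeights: Record<string, number>;
  penaltyScale: number; // assumed: penalty points per unit of weight-per-added-line
  minAddedLines: number; // assumed: floor so tiny diffs do not divide by ~0
}

const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

function scoreDimensions(
  findings: Finding[],
  addedLines: number,
  profile: Profile,
): Record<string, number> {
  // Each finding is weighted severity × category × rule-impact.
  const rawPenalty: Record<string, number> = {};
  for (const finding of findings) {
    const weight = finding.severityWeight * finding.categoryWeight * finding.ruleImpact;
    rawPenalty[finding.dimension] = (rawPenalty[finding.dimension] ?? 0) + weight;
  }
  // Size-normalize by the diff's added lines, then clamp each dimension.
  const denominator = Math.max(addedLines, profile.minAddedLines);
  const scores: Record<string, number> = {};
  for (const dimension of Object.keys(profile.dimensionWeights)) {
    const penalty = ((rawPenalty[dimension] ?? 0) / denominator) * profile.penaltyScale;
    scores[dimension] = clamp(100 - penalty, 0, 100);
  }
  return scores;
}

function compositeScore(scores: Record<string, number>, profile: Profile): number {
  // Profile-weighted mean across dimensions.
  let weighted = 0;
  let total = 0;
  for (const [dimension, weight] of Object.entries(profile.dimensionWeights)) {
    weighted += (scores[dimension] ?? 100) * weight;
    total += weight;
  }
  return total === 0 ? 100 : weighted / total;
}

const profile: Profile = {
  dimensionWeights: { "ts-strictness": 1, composition: 1 },
  penaltyScale: 100,
  minAddedLines: 10,
};
const findings: Finding[] = [
  { dimension: "ts-strictness", severityWeight: 2, categoryWeight: 1, ruleImpact: 1 },
  { dimension: "ts-strictness", severityWeight: 2, categoryWeight: 1, ruleImpact: 1 },
];
console.log(compositeScore(scoreDimensions(findings, 40, profile), profile));
```

With these made-up numbers, the same two findings cost more in a 5-line diff than in a 40-line diff, which is the behavior size-normalization is meant to produce.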

Every number lives in [`scoring-profiles/default.json`](../packages/benchmark/scoring-profiles/default.json)
(mirrored by `src/constants.ts`); the `scoringVersion` is stamped into every
report so scores are reproducible and comparable.

### Why local scoring (not the react.doctor score API)

React Doctor's canonical 0–100 score is a remote API call. Benchmark grading is
**air-gapped** (`allow_internet = false`), so SlopBench computes its own
deterministic score from the offline `diagnostics[]`. The remote API is never on
the grading path.

## Reference influences

The dimensions and checks are grounded in:

- **React Doctor rules** — the React correctness/performance/a11y/security engine.
- **deslop skill** — indirection, dead code, nested ternaries, near-duplicates.
- **Vercel [react-best-practices]** — waterfalls, bundle, re-render, rendering tiers.
- **Vercel [composition-patterns]** — boolean-prop soup, render-props, compound components.
- **Vercel [next-best-practices]** — RSC boundaries, async APIs, `next/image`, bundling.

To avoid double-counting, [`rule-overlap.md`](../packages/benchmark/rule-overlap.md)
records which tool owns each signal; SlopBench only adds checks for gaps React
Doctor does not already cover (TS strictness, composition, and the
nested-ternary deslop heuristic).

[react-best-practices]: https://github.com/vercel-labs/agent-skills/tree/main/skills/react-best-practices
[composition-patterns]: https://github.com/vercel-labs/agent-skills/tree/main/skills/composition-patterns
[next-best-practices]: https://github.com/vercel-labs/next-skills#next-best-practices

## Task families

- **produce-clean** — implement a working feature; slop is measured on the diff.
Measures the slop a model emits _unprompted_ (the instruction never mentions
quality).
- **handle-slop** — the seed ships working-but-sloppy code; a small change is
requested. Measures whether the model _adds_ slop or cleans what it touches.
- **explicit-deslop** _(v2)_ — the instruction asks to clean up while preserving
behavior; isolates capability from inclination.

## Anti-gaming

- Scanners run over the whole diff, not a fixed file the agent can target.
- Suppression escape hatches (`@ts-ignore`, eslint-disable-style comments) are
themselves scored as slop.
- Tests, fixtures, generated files, and lockfiles are excluded from grading, so
an agent neither earns credit for tests nor is charged for vendored slop.
- Hidden tests are applied only at grade time.
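
As a rough sketch of the exclusion rule above, a path filter might look like this. The patterns are hypothetical examples of "tests, fixtures, generated files, and lockfiles"; the verifier's real exclusion list is not shown in this document and may differ.

```typescript
// Hypothetical patterns, for illustration only.
const EXCLUDED_PATTERNS: RegExp[] = [
  /(^|\/)(__tests__|tests?|fixtures|__fixtures__)\//, // test and fixture directories
  /\.(test|spec)\.[cm]?[jt]sx?$/, // colocated test files
  /(^|\/)(pnpm-lock\.yaml|package-lock\.json|yarn\.lock)$/, // lockfiles
  /(^|\/)(dist|build|generated)\//, // build output and generated code
];

function isGradable(path: string): boolean {
  const normalized = path.replace(/\\/g, "/");
  return !EXCLUDED_PATTERNS.some((pattern) => pattern.test(normalized));
}

function gradableFiles(changedFiles: string[]): string[] {
  return changedFiles.filter(isGradable);
}

console.log(
  gradableFiles(["src/Button.tsx", "src/Button.test.tsx", "pnpm-lock.yaml"]),
);
```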

## Reproducibility

- React Doctor + the verifier are installed from a single pinned checkout in the
base image (`tasks/_base/Dockerfile`); pin `REACT_DOCTOR_REF` for a release.
- `doctorVersion` + `scoringVersion` are recorded in every `slop-report.json`.
- `scripts/validate-all.sh` asserts every task's reference solution still passes
and scores `reward > 0` — run it before cutting a benchmark release.

See [`packages/benchmark/README.md`](../packages/benchmark/README.md) for the run
and authoring workflow.
1 change: 1 addition & 0 deletions package.json
@@ -26,6 +26,7 @@
    "release": "pnpm build && pnpm check:published-deps && node scripts/sentry-sourcemaps.mjs && changeset publish",
    "check:published-deps": "node --experimental-strip-types --no-warnings scripts/check-published-deps.ts",
    "smoke:json-report": "node --experimental-strip-types --no-warnings scripts/smoke-json-report.ts",
    "benchmark:validate": "bash packages/benchmark/scripts/validate-all.sh",
    "smoke:tty-prompt": "python3 scripts/smoke-tty-prompt.py",
    "build:schema": "node --experimental-strip-types --no-warnings scripts/generate-config-schema.ts"
  },
149 changes: 149 additions & 0 deletions packages/benchmark/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,149 @@
# SlopBench

A benchmark for measuring how good individual models are at **frontend
engineering — and specifically how much React/TypeScript "slop" they produce**.

Unlike correctness-only SWE benchmarks, SlopBench scores **two axes** per task:

1. **Functional correctness** (gate) — hidden behavioral tests, exactly like
[DeepSWE](https://github.com/datacurve-ai/deep-swe). If the feature does not
work, the task is failed.
2. **Slop score** (0–100, continuous) — how clean the code the model wrote is,
measured **offline** by [React Doctor](https://react.doctor) plus a strict
TypeScript pass, Vercel-derived composition checks, and deslop heuristics.

A model can make the feature work and **still score poorly** for shipping slop
(inline components, array-index keys, `any`, type casts, `@ts-ignore`,
boolean-prop soup, …). The headline **reward** combines them:

```
reward = functional_pass × (slop_score / 100)
```

## Task format

SlopBench uses the [Harbor](https://www.harborframework.com/docs/tasks) task
format (so it runs under [Pier](https://github.com/datacurve-ai/pier) /
Harbor unchanged):

```text
tasks/<id>/
  task.toml               metadata: family, target_dimensions, base commit, image, limits
  instruction.md          the prompt the agent sees (no mention of "slop" / quality)
  seed/                   the starting project (committed as the base commit)
  environment/Dockerfile  reproduces the env (FROM slopbench-base)
  tests/
    test.sh               thin wrapper -> `slopbench-grade` (functional gate + slop scan)
    test.patch            hidden tests, applied at grade time
  solution/               reference clean solution (reviewer aid; never used at grading)
  _authoring/             human-readable source for the patches (solved/ + hidden/)
```

The verifier writes `reward.txt` (the composite float) and a rich
`slop-report.json` artifact (per-dimension scores + every violation).

## Quickstart (Pier — swappable harness)

The task format is harness-agnostic. Pier drives `mini-swe-agent` (model-agnostic)
**and** the CLI agents directly — pass `--agent` to switch:

```bash
git clone https://github.com/millionco/react-doctor
uv tool install datacurve-pier

# Build the shared base image once (provides react-doctor + slop-verify + grader)
docker build -t slopbench-base:latest -f packages/benchmark/tasks/_base/Dockerfile .

# Claude Code as the harness
export ANTHROPIC_API_KEY=...
pier run -p packages/benchmark/tasks --agent claude-code --model anthropic/claude-opus-4-7

# Codex
export OPENAI_API_KEY=...
pier run -p packages/benchmark/tasks --agent codex --model openai/gpt-5.5

# Other harnesses Pier drives directly:
pier run -p packages/benchmark/tasks --agent gemini-cli --model google/gemini-2.5-pro
pier run -p packages/benchmark/tasks --agent opencode --model anthropic/claude-opus-4-7

# Model-agnostic harness (works with any provider)
pier run -p packages/benchmark/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-7
```

Single task or a deterministic subset:

```bash
pier run -p packages/benchmark/tasks/notification-list --agent claude-code
pier run -p packages/benchmark/tasks --agent mini-swe-agent --n-tasks 3 --sample-seed 0
```

## Aggregating results into a scorecard

After a run, turn the per-task reports into one model scorecard:

```bash
node packages/benchmark/scripts/aggregate-results.mjs \
  --logs <pier-logs-dir> --model claude-opus-4-7 \
  --out packages/benchmark/results/claude-opus-4-7.json
```

It reports `functionalPassRate`, `meanSlopScore`, `meanReward`, and per-dimension
means — the shape a (v2) leaderboard renders. A web leaderboard is intentionally
out of scope for v1.
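
For orientation, here is a sketch of the kind of aggregation the scorecard implies. Only `functionalPassRate`, `meanSlopScore`, and `meanReward` are named in this README; the `TaskReport` input shape, the `dimensionMeans` field name, and the choice to average over all tasks (not only passing ones) are assumptions for illustration.

```typescript
// Assumed per-task input shape; the real slop-report.json may differ.
interface TaskReport {
  functionalPass: boolean;
  slopScore: number;
  reward: number;
  dimensions: Record<string, number>;
}

interface Scorecard {
  model: string;
  taskCount: number;
  functionalPassRate: number;
  meanSlopScore: number;
  meanReward: number;
  dimensionMeans: Record<string, number>;
}

const mean = (values: number[]): number =>
  values.length === 0 ? 0 : values.reduce((sum, value) => sum + value, 0) / values.length;

function aggregate(model: string, reports: TaskReport[]): Scorecard {
  // Collect each dimension's scores across tasks before averaging.
  const byDimension: Record<string, number[]> = {};
  for (const report of reports) {
    for (const [dimension, score] of Object.entries(report.dimensions)) {
      (byDimension[dimension] ??= []).push(score);
    }
  }
  const dimensionMeans: Record<string, number> = {};
  for (const [dimension, scores] of Object.entries(byDimension)) {
    dimensionMeans[dimension] = mean(scores);
  }
  return {
    model,
    taskCount: reports.length,
    functionalPassRate: mean(reports.map((report) => (report.functionalPass ? 1 : 0))),
    meanSlopScore: mean(reports.map((report) => report.slopScore)),
    meanReward: mean(reports.map((report) => report.reward)),
    dimensionMeans,
  };
}

const card = aggregate("example-model", [
  { functionalPass: true, slopScore: 80, reward: 0.8, dimensions: { "ts-strictness": 90 } },
  { functionalPass: false, slopScore: 60, reward: 0, dimensions: { "ts-strictness": 70 } },
]);
console.log(JSON.stringify(card, null, 2));
```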

## Slop dimensions

Each violation maps to exactly one dimension (no double-counting — see
[`rule-overlap.md`](./rule-overlap.md)):

| Dimension | Owner |
| ---------------------------------------------------------------------------- | --------------------------------------------------------- |
| `react-correctness`, `react-performance`, `accessibility`, `maintainability` | React Doctor |
| `bundle`, `async-waterfall` | React Doctor (specific rules rerouted) |
| `ts-strictness` | SlopBench TS checks (`any`, casts, `!`, `@ts-ignore`) |
| `composition` | SlopBench Vercel checks (boolean-prop soup, render props) |

Weights live in [`scoring-profiles/default.json`](./scoring-profiles/default.json)
(mirrored by `src/constants.ts`); the active scoring version is stamped into
every report.

## Authoring a new task

```bash
cd packages/benchmark
# 1. scaffold boilerplate (task.toml, test.sh, Dockerfile, solve.sh)
scripts/scaffold-task.sh my-task produce-clean "ts-strictness" \
"node --experimental-strip-types --test tests/my-task.test.ts" \
"My task title" "One-line description"
# 2. author tasks/my-task/seed/, instruction.md,
# _authoring/solved/** (clean reference) and _authoring/hidden/** (hidden tests)
# 3. format first, THEN generate the patches (patches embed seed context,
# so formatting the seed after generating would make them stale)
pnpm format
scripts/gen-task-patches.sh tasks/my-task
# 4. validate end-to-end WITHOUT Docker (seed -> grade reference solution)
scripts/validate-task.sh tasks/my-task --expect-pass
```

Validate the whole corpus (reference solutions must pass + score reward>0):

```bash
scripts/validate-all.sh # from packages/benchmark
pnpm benchmark:validate # from the repo root (also run in CI)
```

Pure-TS tasks use Node's built-in test runner (`node --experimental-strip-types
--test`) and need no dependency install; React tasks use `vitest` +
`react-dom/server` (install happens at image-build time). Both run **air-gapped**
at agent time.

## The verifier CLI

`slop-verify` scores a graded diff directly (used by the grader, handy in dev):

```bash
slop-verify --root <project> --base <git-ref> --json
```

See `slop-verify --help` for all flags (`--profile`, `--functional-pass`,
`--out`, `--fail-under`, …).
4 changes: 4 additions & 0 deletions packages/benchmark/bin/slop-verify.js
@@ -0,0 +1,4 @@
#!/usr/bin/env node
import { runCli } from "../dist/index.mjs";

runCli(process.argv.slice(2));
40 changes: 40 additions & 0 deletions packages/benchmark/package.json
@@ -0,0 +1,40 @@
{
  "name": "@react-doctor/benchmark",
  "version": "0.4.2",
  "private": true,
  "description": "Internal: SlopBench — a Harbor/Pier-compatible benchmark measuring how much React/TypeScript slop a model produces, scored through React Doctor plus a strict TypeScript pass, Vercel-derived AST checks, and deslop heuristics. Not published.",
  "license": "MIT",
  "bin": {
    "slop-verify": "./bin/slop-verify.js"
  },
  "files": [
    "bin/**",
    "dist/**/*.mjs",
    "dist/**/*.d.mts",
    "scoring-profiles/**"
  ],
  "type": "module",
  "sideEffects": false,
  "exports": {
    ".": {
      "types": "./dist/index.d.mts",
      "default": "./dist/index.mjs"
    }
  },
  "scripts": {
    "build": "node -e \"require('node:fs').rmSync('dist', { recursive: true, force: true })\" && cross-env NODE_ENV=production vp pack",
    "test": "vp test run tests",
    "typecheck": "tsc --noEmit"
  },
  "dependencies": {
    "@react-doctor/core": "workspace:*",
    "oxc-parser": "^0.132.0"
  },
  "devDependencies": {
    "@types/node": "^25.6.0",
    "react-doctor": "workspace:*"
  },
  "engines": {
    "node": "^20.19.0 || >=22.12.0"
  }
}
71 changes: 71 additions & 0 deletions packages/benchmark/rule-overlap.md
@@ -0,0 +1,71 @@
# Rule overlap & ownership

SlopBench scores slop from multiple scanners. To avoid **double-counting** the
same defect, every slop signal has exactly one owner. This table is the single
source of truth: when adding a check, confirm React Doctor does not already
cover it — if it does, **defer** and (optionally) route its rule id into a finer
dimension instead of re-implementing detection.

## Ownership by dimension

| Dimension | Owner | How |
| ------------------- | ------------------------------- | ----------------------------------------------------------------------------------------------- |
| `react-correctness` | React Doctor | categories **Security**, **Bugs** |
| `react-performance` | React Doctor | category **Performance** (minus the rules rerouted below) |
| `accessibility` | React Doctor | category **Accessibility** |
| `maintainability` | React Doctor + deslop heuristic | category **Maintainability** (incl. the `ln`/deslop dead-code plugin) + `deslop/nested-ternary` |
| `bundle` | React Doctor (rerouted) | specific Performance-category rule ids → `bundle` |
| `async-waterfall` | React Doctor (rerouted) | specific Performance-category rule ids → `async-waterfall` |
| `ts-strictness` | SlopBench TS checks | React Doctor does **not** cover generic TS slop |
| `composition` | SlopBench Vercel checks | proliferation / render-prop not counted by React Doctor |

## React Doctor rules rerouted to finer dimensions

React Doctor files these under the broad **Performance** category; SlopBench
routes the exact rule ids into dedicated dimensions
(`REACT_DOCTOR_RULE_TO_DIMENSION` in `src/constants.ts`) so the leaderboard can
report them separately. Detection still belongs to React Doctor — we only
relabel the dimension.

- `react-doctor/no-barrel-import` → `bundle`
- `react-doctor/no-full-lodash-import` → `bundle`
- `react-doctor/no-moment` → `bundle`
- `react-doctor/no-undeferred-third-party` → `bundle`
- `react-doctor/prefer-dynamic-import` → `bundle`
- `react-doctor/no-dynamic-import-path` → `bundle`
- `react-doctor/use-lazy-motion` → `bundle`
- `react-doctor/server-sequential-independent-await` → `async-waterfall`
- `react-doctor/tanstack-start-loader-parallel-fetch` → `async-waterfall`
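
The relabeling can be sketched as a rule-id lookup that takes precedence over the category default. This is an assumption about shape only: the maps below copy the rule ids and categories from this document, while `dimensionFor` and the map names are hypothetical and may not match `REACT_DOCTOR_RULE_TO_DIMENSION` in `src/constants.ts`.

```typescript
type Dimension =
  | "react-correctness"
  | "react-performance"
  | "accessibility"
  | "maintainability"
  | "bundle"
  | "async-waterfall";

// Category defaults, as listed in the ownership table.
const CATEGORY_TO_DIMENSION = new Map<string, Dimension>([
  ["Security", "react-correctness"],
  ["Bugs", "react-correctness"],
  ["Performance", "react-performance"],
  ["Accessibility", "accessibility"],
  ["Maintainability", "maintainability"],
]);

// Rule-level overrides, as listed above.
const RULE_TO_DIMENSION = new Map<string, Dimension>([
  ["react-doctor/no-barrel-import", "bundle"],
  ["react-doctor/no-full-lodash-import", "bundle"],
  ["react-doctor/no-moment", "bundle"],
  ["react-doctor/no-undeferred-third-party", "bundle"],
  ["react-doctor/prefer-dynamic-import", "bundle"],
  ["react-doctor/no-dynamic-import-path", "bundle"],
  ["react-doctor/use-lazy-motion", "bundle"],
  ["react-doctor/server-sequential-independent-await", "async-waterfall"],
  ["react-doctor/tanstack-start-loader-parallel-fetch", "async-waterfall"],
]);

// Hypothetical resolver: a rerouted rule id wins over its category.
function dimensionFor(ruleId: string, category: string): Dimension | undefined {
  return RULE_TO_DIMENSION.get(ruleId) ?? CATEGORY_TO_DIMENSION.get(category);
}

console.log(dimensionFor("react-doctor/no-moment", "Performance"));
```

Checking the rule id first keeps each finding in exactly one dimension, so a rerouted rule is not also counted under `react-performance`.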

## Vercel rules deliberately DEFERRED to React Doctor (no custom check)

These Vercel best-practices map onto an existing React Doctor rule, so SlopBench
does **not** add a duplicate detector:

| Vercel rule | Covered by React Doctor |
| ---------------------------------- | ------------------------------------------------------------------------------ |
| `bundle-barrel-imports` | `react-doctor/no-barrel-import`, `no-full-lodash-import` |
| `bundle-dynamic-imports` | `react-doctor/prefer-dynamic-import`, `no-dynamic-import-path` |
| `async-parallel` / waterfalls | `react-doctor/server-sequential-independent-await` |
| `rerender-no-inline-components` | `react-doctor/no-nested-component-definition`, `no-unstable-nested-components` |
| `rerender-derived-state-no-effect` | React Doctor `state-and-effects` rules |
| `react19-no-forwardref` | `react-doctor/forward-ref-uses-ref`, `no-react19-deprecated-apis` |
| `rendering-*` (img, etc.) | `react-doctor/nextjs-no-img-element`, … |

## Signals SlopBench OWNS (custom checks — React Doctor gap)

TypeScript strictness (`src/checks/ts-*.ts`, dimension `ts-strictness`):

- `ts/no-explicit-any` — explicit `any` annotations
- `ts/no-non-null-assertion` — the `!` operator
- `ts/no-type-assertion` — `as Foo` / `<Foo>x` casts (`as const` exempt)
- `ts/ban-ts-comment` — `@ts-ignore` / `@ts-nocheck` / `@ts-expect-error` (scored as error)

Composition (`src/checks/vercel-*.ts`, dimension `composition`):

- `vercel/architecture-boolean-prop-soup` — `*Props` types with ≥ `BOOLEAN_PROP_SOUP_THRESHOLD` boolean flags
- `vercel/patterns-render-prop` — function-valued `render` / `renderX` props
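
As an illustration of the boolean-prop-soup signal, the sketch below shows a `*Props` type with many independent boolean flags next to a variant-based alternative. All names are invented, and whether the first type is actually flagged depends on the value of `BOOLEAN_PROP_SOUP_THRESHOLD`, which is not stated here.

```typescript
// Likely flagged (assuming the threshold is five or fewer): independent
// boolean flags that can contradict each other, e.g. isSmall + isLarge.
interface ButtonSoupProps {
  isPrimary?: boolean;
  isDanger?: boolean;
  isGhost?: boolean;
  isSmall?: boolean;
  isLarge?: boolean;
}

// Alternative: mutually exclusive states expressed as unions.
interface ButtonProps {
  variant: "primary" | "danger" | "ghost";
  size: "small" | "medium" | "large";
}

function buttonClassName({ variant, size }: ButtonProps): string {
  return `btn btn--${variant} btn--${size}`;
}

console.log(buttonClassName({ variant: "danger", size: "small" }));
```

The union form makes invalid combinations unrepresentable, which is the design point behind the composition check.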

deslop (`src/checks/deslop-*.ts`, dimension `maintainability`):

- `deslop/nested-ternary` — nested conditional expressions (one finding per chain)