millionco · NisargIO · Jun 9, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -108,3 +108,12 @@ jobs:
       - name: Smoke test interactive TTY prompt
         if: ${{ matrix.os == 'ubuntu-latest' && matrix.node-version == '22.18.0' }}
         run: pnpm smoke:tty-prompt
+
+      # SlopBench task self-test: every benchmark task's clean reference solution
+      # must still pass its functional gate AND score reward > 0. Guards the
+      # corpus against drift in the verifier, the scoring profile, or React
+      # Doctor. The prior `pnpm build` step left react-doctor + the verifier
+      # built; React-component tasks install their own dev deps during grading.
+      - name: Validate SlopBench task reference solutions
+        if: ${{ matrix.os == 'ubuntu-latest' && matrix.node-version == '22.18.0' }}
+        run: pnpm benchmark:validate
diff --git a/docs/SLOPBENCH.md b/docs/SLOPBENCH.md
@@ -0,0 +1,105 @@
+# SlopBench — methodology
+
+SlopBench (in [`packages/benchmark`](../packages/benchmark)) measures how good a
+model is at frontend engineering, with a deliberate focus on **how much React /
+TypeScript slop it emits**. It extends the DeepSWE / Harbor approach with a
+second, continuous quality axis.
+
+## Why two axes
+
+Correctness-only benchmarks reward a working feature regardless of how it was
+built. Real frontend review cares about both: does it work, _and_ is it clean?
+SlopBench keeps a hard **functional gate** (hidden behavioral tests) and adds a
+**slop score** computed purely by static analysis on the diff:
+
+```
+reward = functional_pass × (slop_score / 100)
+```
+
+- `functional_pass ∈ {0,1}` — the DeepSWE-style gate.
+- `slop_score ∈ [0,100]` — higher = cleaner.
+
+Reporting both separately (plus per-dimension subscores) lets a leaderboard rank
+by correctness, by cleanliness, or by the product. Setting the slop weight to
+zero recovers a pure correctness benchmark.
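+
+As a minimal sketch of the formula above (the function and type names are
+hypothetical, not taken from the verifier source), the reward can be computed
+like this:
+
+```typescript
+// Hypothetical helper; only the formula itself comes from this document.
+type FunctionalPass = 0 | 1;
+
+function computeReward(functionalPass: FunctionalPass, slopScore: number): number {
+  if (slopScore < 0 || slopScore > 100) {
+    throw new RangeError(`slopScore must be in [0, 100], got ${slopScore}`);
+  }
+  // A failed functional gate zeroes the reward regardless of cleanliness.
+  return functionalPass * (slopScore / 100);
+}
+
+const cleanAndWorking = computeReward(1, 92); // 0.92
+const workingButSloppy = computeReward(1, 40); // 0.4
+const cleanButBroken = computeReward(0, 100); // 0
+```
+
+The multiplication means a spotless diff earns nothing unless the hidden tests
+pass, while a working feature is still discounted by the slop it ships.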
+
+## How the slop score is computed
+
+The verifier (`slop-verify`, the `@react-doctor/benchmark` package) runs
+**offline** over the agent's diff against the task's base commit:
+
+1. **React Doctor** (`--json --no-score --no-dead-code`) — the canonical React
+   diagnostic engine, scoped to the files the agent changed. Its five categories
+   map to four dimensions (`react-correctness` for both Security and Bugs,
+   `react-performance`, `accessibility`, and `maintainability`); specific
+   bundle/waterfall rules are rerouted to `bundle` and `async-waterfall`.
+2. **TypeScript strictness** (AST, no type-checker needed) — explicit `any`,
+   `as` casts, non-null `!`, and `@ts-ignore`/`@ts-nocheck`/`@ts-expect-error`.
+3. **Composition** (AST, distilled from Vercel's composition-patterns) —
+   boolean-prop soup and function-valued render props.
+4. **deslop heuristic** — nested ternaries.
+
+Each finding is weighted `severity × category × rule-impact`, the per-dimension
+penalty is **size-normalized** by the diff's added lines (so large legitimate
+features are not punished as hard as the same violations in a tiny diff), and
+each dimension scores `clamp(100 − penalty, 0, 100)`. The composite is the
+profile-weighted mean across dimensions.
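+
+The pipeline in the paragraph above could be sketched roughly as follows. The
+field names, the example weights, and the "per 100 added lines" normalization
+are assumptions made for illustration; the real values and formula live in the
+scoring profile and may differ.
+
+```typescript
+// Sketch only: names, weights, and the normalization constant are assumed.
+interface Finding {
+  dimension: string;
+  severityWeight: number;
+  categoryWeight: number;
+  ruleImpact: number;
+}
+
+const clamp = (value: number, min: number, max: number): number =>
+  Math.min(Math.max(value, min), max);
+
+// Assumed normalization: the raw penalty is scaled per 100 added lines.
+function dimensionScore(findings: Finding[], addedLines: number): number {
+  const rawPenalty = findings.reduce(
+    (sum, f) => sum + f.severityWeight * f.categoryWeight * f.ruleImpact,
+    0,
+  );
+  const penalty = rawPenalty * (100 / Math.max(addedLines, 1));
+  return clamp(100 - penalty, 0, 100);
+}
+
+// Profile-weighted mean of the per-dimension scores.
+function compositeScore(
+  findings: Finding[],
+  addedLines: number,
+  dimensionWeights: Record<string, number>,
+): number {
+  let weightedSum = 0;
+  let totalWeight = 0;
+  for (const [dimension, weight] of Object.entries(dimensionWeights)) {
+    const own = findings.filter((f) => f.dimension === dimension);
+    weightedSum += weight * dimensionScore(own, addedLines);
+    totalWeight += weight;
+  }
+  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
+}
+
+const findings: Finding[] = [
+  { dimension: "ts-strictness", severityWeight: 2, categoryWeight: 1, ruleImpact: 5 },
+];
+const weights = { "ts-strictness": 1, composition: 1 };
+const smallDiff = compositeScore(findings, 20, weights); // 75
+const largeDiff = compositeScore(findings, 200, weights); // 97.5
+```
+
+Under these assumed numbers the same single finding costs far more in a
+20-line diff than in a 200-line one, which is the effect size normalization is
+meant to produce.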
+
+Every number lives in [`scoring-profiles/default.json`](../packages/benchmark/scoring-profiles/default.json)
+(mirrored by `packages/benchmark/src/constants.ts`); the `scoringVersion` is
+stamped into every report so scores are reproducible and comparable.
+
+### Why local scoring (not the react.doctor score API)
+
+React Doctor's canonical 0–100 score comes from a remote API call. Benchmark
+grading is **air-gapped** (`allow_internet = false`), so SlopBench computes its own
+deterministic score from the offline `diagnostics[]`. The remote API is never on
+the grading path.
+
+## Reference influences
+
+The dimensions and checks are grounded in:
+
+- **React Doctor rules** — the React correctness/performance/a11y/security engine.
+- **deslop skill** — indirection, dead code, nested ternaries, near-duplicates.
+- **Vercel [react-best-practices]** — waterfalls, bundle, re-render, rendering tiers.
+- **Vercel [composition-patterns]** — boolean-prop soup, render-props, compound components.
+- **Vercel [next-best-practices]** — RSC boundaries, async APIs, `next/image`, bundling.
+
+To avoid double-counting, [`rule-overlap.md`](../packages/benchmark/rule-overlap.md)
+records which tool owns each signal; SlopBench only adds checks for gaps React
+Doctor does not already cover (TS strictness, composition, and the
+nested-ternary deslop heuristic).
+
+[react-best-practices]: https://github.com/vercel-labs/agent-skills/tree/main/skills/react-best-practices
+[composition-patterns]: https://github.com/vercel-labs/agent-skills/tree/main/skills/composition-patterns
+[next-best-practices]: https://github.com/vercel-labs/next-skills#next-best-practices
+
+## Task families
+
+- **produce-clean** — implement a working feature; slop is measured on the diff.
+  Measures the slop a model emits _unprompted_ (the instruction never mentions
+  quality).
+- **handle-slop** — the seed ships working-but-sloppy code; a small change is
+  requested. Measures whether the model _adds_ slop or cleans what it touches.
+- **explicit-deslop** _(v2)_ — the instruction asks to clean up while preserving
+  behavior; isolates capability from inclination.
+
+## Anti-gaming
+
+- Scanners run over the whole diff, not a fixed file the agent can target.
+- Suppression escape hatches (`@ts-ignore`, eslint-disable-style comments) are
+  themselves scored as slop.
+- Tests, fixtures, generated files, and lockfiles are excluded from grading, so
+  an agent neither earns credit for tests nor is charged for vendored slop.
+- Hidden tests are applied only at grade time.
+
+## Reproducibility
+
+- React Doctor + the verifier are installed from a single pinned checkout in the
+  base image (`packages/benchmark/tasks/_base/Dockerfile`); pin
+  `REACT_DOCTOR_REF` for a release.
+- `doctorVersion` + `scoringVersion` are recorded in every `slop-report.json`.
+- `packages/benchmark/scripts/validate-all.sh` (`pnpm benchmark:validate` from
+  the repo root) asserts every task's reference solution still passes and
+  scores `reward > 0`. Run it before cutting a benchmark release.
+
+See [`packages/benchmark/README.md`](../packages/benchmark/README.md) for the run
+and authoring workflow.
diff --git a/package.json b/package.json
@@ -26,6 +26,7 @@
     "release": "pnpm build && pnpm check:published-deps && node scripts/sentry-sourcemaps.mjs && changeset publish",
     "check:published-deps": "node --experimental-strip-types --no-warnings scripts/check-published-deps.ts",
     "smoke:json-report": "node --experimental-strip-types --no-warnings scripts/smoke-json-report.ts",
+    "benchmark:validate": "bash packages/benchmark/scripts/validate-all.sh",
     "smoke:tty-prompt": "python3 scripts/smoke-tty-prompt.py",
     "build:schema": "node --experimental-strip-types --no-warnings scripts/generate-config-schema.ts"
   },

diff --git a/packages/benchmark/README.md b/packages/benchmark/README.md
@@ -0,0 +1,149 @@
+# SlopBench
+
+A benchmark for measuring how good individual models are at **frontend
+engineering — and specifically how much React/TypeScript "slop" they produce**.
+
+Unlike correctness-only SWE benchmarks, SlopBench scores **two axes** per task:
+
+1. **Functional correctness** (gate) — hidden behavioral tests, exactly like
+   [DeepSWE](https://github.com/datacurve-ai/deep-swe). If the feature does not
+   work, the task fails and its reward is 0.
+2. **Slop score** (0–100, continuous) — how clean the code the model wrote is,
+   measured **offline** by [React Doctor](https://react.doctor) plus a strict
+   TypeScript pass, Vercel-derived composition checks, and deslop heuristics.
+
+A model can make the feature work and **still score poorly** for shipping slop
+(inline components, array-index keys, `any`, type casts, `@ts-ignore`,
+boolean-prop soup, …). The headline **reward** combines them:
+
+```
+reward = functional_pass × (slop_score / 100)
+```
+
+## Task format
+
+SlopBench uses the [Harbor](https://www.harborframework.com/docs/tasks) task
+format (so it runs under [Pier](https://github.com/datacurve-ai/pier) /
+Harbor unchanged):
+
+```text
+tasks/<id>/
+  task.toml          metadata: family, target_dimensions, base commit, image, limits
+  instruction.md     the prompt the agent sees (no mention of "slop" / quality)
+  seed/              the starting project (committed as the base commit)
+  environment/Dockerfile   reproduces the env (FROM slopbench-base)
+  tests/
+    test.sh          thin wrapper -> `slopbench-grade` (functional gate + slop scan)
+    test.patch       hidden tests, applied at grade time
+  solution/          reference clean solution (reviewer aid; never used at grading)
+  _authoring/        human-readable source for the patches (solved/ + hidden/)
+```
+
+The verifier writes `reward.txt` (the composite float) and a rich
+`slop-report.json` artifact (per-dimension scores + every violation).
+
+## Quickstart (Pier — swappable harness)
+
+The task format is harness-agnostic. Pier drives `mini-swe-agent` (model-agnostic)
+**and** the CLI agents directly — pass `--agent` to switch:
+
+```bash
+git clone https://github.com/millionco/react-doctor
+uv tool install datacurve-pier
+
+# Build the shared base image once (provides react-doctor + slop-verify + grader)
+docker build -t slopbench-base:latest -f packages/benchmark/tasks/_base/Dockerfile .
+
+# Claude Code as the harness
+export ANTHROPIC_API_KEY=...
+pier run -p packages/benchmark/tasks --agent claude-code --model anthropic/claude-opus-4-7
+
+# Codex
+export OPENAI_API_KEY=...
+pier run -p packages/benchmark/tasks --agent codex --model openai/gpt-5.5
+
+# Other harnesses Pier drives directly:
+pier run -p packages/benchmark/tasks --agent gemini-cli --model google/gemini-2.5-pro
+pier run -p packages/benchmark/tasks --agent opencode  --model anthropic/claude-opus-4-7
+
+# Model-agnostic harness (works with any provider)
+pier run -p packages/benchmark/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-7
+```
+
+Single task or a deterministic subset:
+
+```bash
+pier run -p packages/benchmark/tasks/notification-list --agent claude-code
+pier run -p packages/benchmark/tasks --agent mini-swe-agent --n-tasks 3 --sample-seed 0
+```
+
+## Aggregating results into a scorecard
+
+After a run, turn the per-task reports into one model scorecard:
+
+```bash
+node packages/benchmark/scripts/aggregate-results.mjs \
+  --logs <pier-logs-dir> --model claude-opus-4-7 \
+  --out packages/benchmark/results/claude-opus-4-7.json
+```
+
+It reports `functionalPassRate`, `meanSlopScore`, `meanReward`, and per-dimension
+means — the shape a (v2) leaderboard renders. A web leaderboard is intentionally
+out of scope for v1.
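+
+A rough sketch of that aggregation is shown below. The `TaskResult` shape is an
+assumption (the real `slop-report.json` fields may be named differently), and
+averaging the slop score over all tasks rather than only passing ones is also
+an assumption:
+
+```typescript
+// Assumed per-task shape; not the actual slop-report.json schema.
+interface TaskResult {
+  taskId: string;
+  functionalPass: 0 | 1;
+  slopScore: number; // 0-100, higher is cleaner
+}
+
+interface Scorecard {
+  functionalPassRate: number;
+  meanSlopScore: number;
+  meanReward: number;
+}
+
+const mean = (values: number[]): number =>
+  values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;
+
+function aggregate(results: TaskResult[]): Scorecard {
+  return {
+    functionalPassRate: mean(results.map((r) => r.functionalPass)),
+    meanSlopScore: mean(results.map((r) => r.slopScore)),
+    // Per-task reward follows reward = functional_pass x (slop_score / 100).
+    meanReward: mean(results.map((r) => r.functionalPass * (r.slopScore / 100))),
+  };
+}
+
+const scorecard = aggregate([
+  { taskId: "task-a", functionalPass: 1, slopScore: 50 },
+  { taskId: "task-b", functionalPass: 1, slopScore: 100 },
+  { taskId: "task-c", functionalPass: 0, slopScore: 75 },
+  { taskId: "task-d", functionalPass: 1, slopScore: 25 },
+]);
+// scorecard: functionalPassRate 0.75, meanSlopScore 62.5, meanReward 0.4375
+```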
+
+## Slop dimensions
+
+Each violation maps to exactly one dimension (no double-counting — see
+[`rule-overlap.md`](./rule-overlap.md)):
+
+| Dimension                                                                    | Owner                                                     |
+| ---------------------------------------------------------------------------- | --------------------------------------------------------- |
+| `react-correctness`, `react-performance`, `accessibility`, `maintainability` | React Doctor (+ deslop heuristic for `maintainability`)   |
+| `bundle`, `async-waterfall`                                                  | React Doctor (specific rules rerouted)                    |
+| `ts-strictness`                                                              | SlopBench TS checks (`any`, casts, `!`, `@ts-ignore`)     |
+| `composition`                                                                | SlopBench Vercel checks (boolean-prop soup, render props) |
+
+Weights live in [`scoring-profiles/default.json`](./scoring-profiles/default.json)
+(mirrored by `src/constants.ts`); the active scoring version is stamped into
+every report.
+
+## Authoring a new task
+
+```bash
+cd packages/benchmark
+# 1. scaffold boilerplate (task.toml, test.sh, Dockerfile, solve.sh)
+scripts/scaffold-task.sh my-task produce-clean "ts-strictness" \
+  "node --experimental-strip-types --test tests/my-task.test.ts" \
+  "My task title" "One-line description"
+# 2. author tasks/my-task/seed/, instruction.md,
+#    _authoring/solved/** (clean reference) and _authoring/hidden/** (hidden tests)
+# 3. format first, THEN generate the patches (patches embed seed context,
+#    so formatting the seed after generating would make them stale)
+pnpm format
+scripts/gen-task-patches.sh tasks/my-task
+# 4. validate end-to-end WITHOUT Docker (seed -> grade reference solution)
+scripts/validate-task.sh tasks/my-task --expect-pass
+```
+
+Validate the whole corpus (reference solutions must pass + score reward>0):
+
+```bash
+scripts/validate-all.sh        # from packages/benchmark
+pnpm benchmark:validate        # from the repo root (also run in CI)
+```
+
+Pure-TS tasks use Node's built-in test runner (`node --experimental-strip-types
+--test`) and need no dependency install. React tasks use `vitest` +
+`react-dom/server`: under Docker their dependencies are installed at image-build
+time, while the Docker-free validation path installs dev deps during grading.
+Inside the image, both run **air-gapped** at agent time.
+
+## The verifier CLI
+
+`slop-verify` scores a graded diff directly (used by the grader, handy in dev):
+
+```bash
+slop-verify --root <project> --base <git-ref> --json
+```
+
+See `slop-verify --help` for all flags (`--profile`, `--functional-pass`,
+`--out`, `--fail-under`, …).
diff --git a/packages/benchmark/bin/slop-verify.js b/packages/benchmark/bin/slop-verify.js
@@ -0,0 +1,4 @@
+#!/usr/bin/env node
+import { runCli } from "../dist/index.mjs";
+
+runCli(process.argv.slice(2));
diff --git a/packages/benchmark/package.json b/packages/benchmark/package.json
@@ -0,0 +1,40 @@
+{
+  "name": "@react-doctor/benchmark",
+  "version": "0.4.2",
+  "private": true,
+  "description": "Internal: SlopBench — a Harbor/Pier-compatible benchmark measuring how much React/TypeScript slop a model produces, scored through React Doctor plus a strict TypeScript pass, Vercel-derived AST checks, and deslop heuristics. Not published.",
+  "license": "MIT",
+  "bin": {
+    "slop-verify": "./bin/slop-verify.js"
+  },
+  "files": [
+    "bin/**",
+    "dist/**/*.mjs",
+    "dist/**/*.d.mts",
+    "scoring-profiles/**"
+  ],
+  "type": "module",
+  "sideEffects": false,
+  "exports": {
+    ".": {
+      "types": "./dist/index.d.mts",
+      "default": "./dist/index.mjs"
+    }
+  },
+  "scripts": {
+    "build": "node -e \"require('node:fs').rmSync('dist', { recursive: true, force: true })\" && cross-env NODE_ENV=production vp pack",
+    "test": "vp test run tests",
+    "typecheck": "tsc --noEmit"
+  },
+  "dependencies": {
+    "@react-doctor/core": "workspace:*",
+    "oxc-parser": "^0.132.0"
+  },
+  "devDependencies": {
+    "@types/node": "^25.6.0",
+    "react-doctor": "workspace:*"
+  },
+  "engines": {
+    "node": "^20.19.0 || >=22.12.0"
+  }
+}
diff --git a/packages/benchmark/rule-overlap.md b/packages/benchmark/rule-overlap.md
@@ -0,0 +1,71 @@
+# Rule overlap & ownership
+
+SlopBench scores slop from multiple scanners. To avoid **double-counting** the
+same defect, every slop signal has exactly one owner. This table is the single
+source of truth: when adding a check, confirm React Doctor does not already
+cover it — if it does, **defer** and (optionally) route its rule id into a finer
+dimension instead of re-implementing detection.
+
+## Ownership by dimension
+
+| Dimension           | Owner                           | How                                                                                             |
+| ------------------- | ------------------------------- | ----------------------------------------------------------------------------------------------- |
+| `react-correctness` | React Doctor                    | categories **Security**, **Bugs**                                                               |
+| `react-performance` | React Doctor                    | category **Performance** (minus the rules rerouted below)                                       |
+| `accessibility`     | React Doctor                    | category **Accessibility**                                                                      |
+| `maintainability`   | React Doctor + deslop heuristic | category **Maintainability** (incl. the `ln`/deslop dead-code plugin) + `deslop/nested-ternary` |
+| `bundle`            | React Doctor (rerouted)         | specific Performance-category rule ids → `bundle`                                               |
+| `async-waterfall`   | React Doctor (rerouted)         | specific Performance-category rule ids → `async-waterfall`                                      |
+| `ts-strictness`     | SlopBench TS checks             | React Doctor does **not** cover generic TS slop                                                 |
+| `composition`       | SlopBench Vercel checks         | boolean-prop proliferation / render props are not flagged by React Doctor                       |
+
+## React Doctor rules rerouted to finer dimensions
+
+React Doctor files these under the broad **Performance** category; SlopBench
+routes the exact rule ids into dedicated dimensions
+(`REACT_DOCTOR_RULE_TO_DIMENSION` in `src/constants.ts`) so the leaderboard can
+report them separately. Detection still belongs to React Doctor — we only
+relabel the dimension.
+
+- `react-doctor/no-barrel-import` → `bundle`
+- `react-doctor/no-full-lodash-import` → `bundle`
+- `react-doctor/no-moment` → `bundle`
+- `react-doctor/no-undeferred-third-party` → `bundle`
+- `react-doctor/prefer-dynamic-import` → `bundle`
+- `react-doctor/no-dynamic-import-path` → `bundle`
+- `react-doctor/use-lazy-motion` → `bundle`
+- `react-doctor/server-sequential-independent-await` → `async-waterfall`
+- `react-doctor/tanstack-start-loader-parallel-fetch` → `async-waterfall`
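+
+The relabeling can be pictured as a two-step lookup: an exact rule-id override
+first, then a category fallback. The sketch below copies the rule ids from the
+list above and the category mapping from the ownership table; the `Map` shape
+and the `dimensionFor` helper are assumptions, not the actual contents of
+`src/constants.ts`.
+
+```typescript
+type Dimension =
+  | "react-correctness"
+  | "react-performance"
+  | "accessibility"
+  | "maintainability"
+  | "bundle"
+  | "async-waterfall";
+
+// Rule ids taken from the list above; the data structure is assumed.
+const RULE_TO_DIMENSION = new Map<string, Dimension>([
+  ["react-doctor/no-barrel-import", "bundle"],
+  ["react-doctor/no-full-lodash-import", "bundle"],
+  ["react-doctor/no-moment", "bundle"],
+  ["react-doctor/no-undeferred-third-party", "bundle"],
+  ["react-doctor/prefer-dynamic-import", "bundle"],
+  ["react-doctor/no-dynamic-import-path", "bundle"],
+  ["react-doctor/use-lazy-motion", "bundle"],
+  ["react-doctor/server-sequential-independent-await", "async-waterfall"],
+  ["react-doctor/tanstack-start-loader-parallel-fetch", "async-waterfall"],
+]);
+
+// Category fallback taken from the ownership table.
+const CATEGORY_TO_DIMENSION = new Map<string, Dimension>([
+  ["Security", "react-correctness"],
+  ["Bugs", "react-correctness"],
+  ["Performance", "react-performance"],
+  ["Accessibility", "accessibility"],
+  ["Maintainability", "maintainability"],
+]);
+
+// Hypothetical helper: the exact rule-id override wins over the category.
+function dimensionFor(ruleId: string, category: string): Dimension | undefined {
+  return RULE_TO_DIMENSION.get(ruleId) ?? CATEGORY_TO_DIMENSION.get(category);
+}
+
+const rerouted = dimensionFor("react-doctor/no-moment", "Performance");
+const fallback = dimensionFor("react-doctor/some-other-rule", "Performance");
+```
+
+Here `rerouted` resolves to `bundle` and `fallback` to `react-performance`;
+`react-doctor/some-other-rule` is a made-up id used only to show the fallback.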
+
+## Vercel rules deliberately DEFERRED to React Doctor (no custom check)
+
+These Vercel best-practices map onto an existing React Doctor rule, so SlopBench
+does **not** add a duplicate detector:
+
+| Vercel rule                        | Covered by React Doctor                                                        |
+| ---------------------------------- | ------------------------------------------------------------------------------ |
+| `bundle-barrel-imports`            | `react-doctor/no-barrel-import`, `no-full-lodash-import`                       |
+| `bundle-dynamic-imports`           | `react-doctor/prefer-dynamic-import`, `no-dynamic-import-path`                 |
+| `async-parallel` / waterfalls      | `react-doctor/server-sequential-independent-await`                             |
+| `rerender-no-inline-components`    | `react-doctor/no-nested-component-definition`, `no-unstable-nested-components` |
+| `rerender-derived-state-no-effect` | React Doctor `state-and-effects` rules                                         |
+| `react19-no-forwardref`            | `react-doctor/forward-ref-uses-ref`, `no-react19-deprecated-apis`              |
+| `rendering-*` (img, etc.)          | `react-doctor/nextjs-no-img-element`, …                                        |
+
+## Signals SlopBench OWNS (custom checks — React Doctor gap)
+
+TypeScript strictness (`src/checks/ts-*.ts`, dimension `ts-strictness`):
+
+- `ts/no-explicit-any` — explicit `any` annotations
+- `ts/no-non-null-assertion` — the `!` operator
+- `ts/no-type-assertion` — `as Foo` / `<Foo>x` casts (`as const` exempt)
+- `ts/ban-ts-comment` — `@ts-ignore` / `@ts-nocheck` / `@ts-expect-error` (scored as error)
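+
+For illustration, here is a snippet that would presumably trip each of the four
+rules, next to a version that avoids them. Which AST nodes the checks actually
+flag is inferred from the rule descriptions above, and all names are
+hypothetical.
+
+```typescript
+interface User {
+  name: string;
+  email?: string;
+}
+
+// Sloppy: each annotated line matches one rule description above.
+function sloppyEmail(raw: any): string {        // ts/no-explicit-any
+  const user = raw as User;                     // ts/no-type-assertion
+  return user.email!.toLowerCase();             // ts/no-non-null-assertion
+}
+
+// @ts-ignore: hides a string-to-number mismatch (ts/ban-ts-comment)
+const retries: number = "3";
+
+// Cleaner: narrow `unknown` with a type guard instead of casting.
+function isUser(value: unknown): value is User {
+  if (typeof value !== "object" || value === null) return false;
+  if (!("name" in value) || typeof value.name !== "string") return false;
+  if ("email" in value && value.email !== undefined && typeof value.email !== "string") {
+    return false;
+  }
+  return true;
+}
+
+function emailLowercased(raw: unknown): string | undefined {
+  if (!isUser(raw)) return undefined;
+  return raw.email?.toLowerCase();
+}
+```
+
+The clean version returns `undefined` for malformed input, where the sloppy one
+would throw at runtime on a missing `email`.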
+
+Composition (`src/checks/vercel-*.ts`, dimension `composition`):
+
+- `vercel/architecture-boolean-prop-soup` — `*Props` types with ≥ `BOOLEAN_PROP_SOUP_THRESHOLD` boolean flags
+- `vercel/patterns-render-prop` — function-valued `render` / `renderX` props
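+
+A hedged example of the two patterns and one way to restructure them. The
+types are invented for illustration, and whether four flags meets
+`BOOLEAN_PROP_SOUP_THRESHOLD` is not stated here.
+
+```typescript
+// Sloppy: a *Props type dominated by boolean flags, plus a render prop.
+interface LegacyButtonProps {
+  label: string;
+  isPrimary?: boolean;
+  isDanger?: boolean;
+  isLarge?: boolean;
+  isLoading?: boolean;
+  renderIcon?: () => string;
+}
+
+// Cleaner: flags collapse into unions and the render prop becomes plain data.
+interface ButtonProps {
+  label: string;
+  variant: "primary" | "danger" | "default";
+  size: "medium" | "large";
+  state: "idle" | "loading";
+  icon?: string;
+}
+
+// Hypothetical migration helper showing how the flags map onto the unions.
+function toButtonProps(legacy: LegacyButtonProps): ButtonProps {
+  let variant: ButtonProps["variant"] = "default";
+  if (legacy.isDanger) variant = "danger";
+  else if (legacy.isPrimary) variant = "primary";
+  return {
+    label: legacy.label,
+    variant,
+    size: legacy.isLarge ? "large" : "medium",
+    state: legacy.isLoading ? "loading" : "idle",
+    icon: legacy.renderIcon?.(),
+  };
+}
+
+function buttonClassName(props: ButtonProps): string {
+  return ["btn", `btn-${props.variant}`, `btn-${props.size}`, `btn-${props.state}`].join(" ");
+}
+```
+
+The union-typed props make contradictory combinations such as "primary and
+danger at once" unrepresentable, which is the usual argument against flag soup.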
+
+deslop (`src/checks/deslop-*.ts`, dimension `maintainability`):
+
+- `deslop/nested-ternary` — nested conditional expressions (one finding per chain)
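+
+As an illustration (hypothetical code, not a fixture from the task corpus), the
+chain below is the kind of expression this rule describes, followed by a
+flatter equivalent:
+
+```typescript
+type Status = "loading" | "error" | "empty" | "ready";
+
+// Sloppy: one nested chain of conditional expressions.
+function sloppyLabel(status: Status): string {
+  return status === "loading"
+    ? "Loading"
+    : status === "error"
+      ? "Failed"
+      : status === "empty"
+        ? "No items"
+        : "Done";
+}
+
+// Flatter: a lookup table keyed by the same union type.
+const STATUS_LABELS: Record<Status, string> = {
+  loading: "Loading",
+  error: "Failed",
+  empty: "No items",
+  ready: "Done",
+};
+
+function statusLabel(status: Status): string {
+  return STATUS_LABELS[status];
+}
+```
+
+Because `STATUS_LABELS` is typed as `Record<Status, string>`, adding a new
+status without a label becomes a compile error instead of a silent fallthrough.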