SlopBench: a frontend React/TypeScript slop benchmark (packages/benchmark) by NisargIO · Pull Request #757 · millionco/react-doctor

NisargIO · 2026-06-09T04:49:00Z

What this is

SlopBench measures how good individual models are at frontend engineering — specifically how much React/TypeScript "slop" they produce or fail to clean up, scored through the lens of React Doctor (but not limited to it). It reuses the DeepSWE / Harbor task format so it runs under Pier with a swappable harness (Claude Code, Codex, Gemini CLI, opencode, or model-agnostic mini-swe-agent).

The novel idea vs. correctness-only SWE benchmarks: SlopBench scores two axes per task.

- Functional correctness (gate): hidden behavioral tests, exactly like DeepSWE.
- Slop score (0–100, continuous): computed offline by React Doctor + a strict TypeScript pass + Vercel-derived composition checks + deslop heuristics.

reward = functional_pass × (slop_score / 100)

A model can make the feature work and still score poorly for shipping slop (inline components, array-index keys, any, casts, @ts-ignore, boolean-prop soup, request waterfalls, full-library imports, non-interactive a11y handlers, …).
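To make the formula concrete, here is a minimal sketch of the reward computation. The names `TaskResult` and `computeReward` are hypothetical and are not taken from the package; only the formula itself comes from the description above.

```typescript
// Hypothetical shape of a graded task; field names are illustrative only.
interface TaskResult {
  functionalPass: boolean; // did the hidden behavioral tests pass?
  slopScore: number; // continuous quality score in [0, 100]
}

// reward = functional_pass × (slop_score / 100)
function computeReward({ functionalPass, slopScore }: TaskResult): number {
  if (!Number.isFinite(slopScore) || slopScore < 0 || slopScore > 100) {
    throw new RangeError(`slopScore must be within 0..100, got ${slopScore}`);
  }
  // The functional gate multiplies the reward by 0 or 1.
  return (functionalPass ? 1 : 0) * (slopScore / 100);
}

// A working but sloppy solution keeps most of its reward;
// a broken solution gets nothing, however clean the code is.
const sloppyButCorrect = computeReward({ functionalPass: true, slopScore: 94.56 });
const cleanButBroken = computeReward({ functionalPass: false, slopScore: 100 });
console.log({ sloppyButCorrect, cleanButBroken });
```

Because the gate is multiplicative, a high slop score can never compensate for failing the hidden tests.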

How it works

slop-verify (the new @react-doctor/benchmark package) runs over the agent's diff vs. the base commit, fully air-gapped:
- React Doctor (--json --no-score --no-dead-code), diff-scoped to changed files. Its five categories map to react-correctness / react-performance / accessibility / maintainability; specific bundle/waterfall rules are rerouted to dedicated bundle / async-waterfall dimensions.
- TypeScript strictness (AST, no type-checker): any, as casts, non-null !, @ts-ignore/@ts-nocheck/@ts-expect-error.
- Composition (AST, from Vercel composition-patterns): boolean-prop soup, function-valued render props.
- deslop: nested ternaries.
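As an illustration of the kind of signal the TypeScript strictness pass looks for, the sketch below flags `@ts-ignore`, `@ts-nocheck`, and `@ts-expect-error` comments. It is deliberately simplified: it uses a line-based regular expression rather than the AST walk the package is described as using, and every name in it (`findSuppressionDirectives`, `DirectiveFinding`) is hypothetical.

```typescript
type Directive = "ignore" | "nocheck" | "expect-error";

interface DirectiveFinding {
  line: number; // 1-based line number in the scanned source
  directive: Directive;
}

// Matches a directive only when it directly follows a comment opener,
// so the same text inside a string literal is not reported.
const DIRECTIVE_PATTERN = /(?:\/\/|\/\*)\s*@ts-(ignore|nocheck|expect-error)\b/;

// Narrow the captured text to the Directive union without a type assertion.
function toDirective(name: string | undefined): Directive | undefined {
  if (name === "ignore" || name === "nocheck" || name === "expect-error") {
    return name;
  }
  return undefined;
}

// Regex-based stand-in for an AST check (an assumption of this sketch).
function findSuppressionDirectives(source: string): DirectiveFinding[] {
  const findings: DirectiveFinding[] = [];
  source.split(/\r?\n/).forEach((text, index) => {
    const directive = toDirective(DIRECTIVE_PATTERN.exec(text)?.[1]);
    if (directive !== undefined) {
      findings.push({ line: index + 1, directive });
    }
  });
  return findings;
}
```

A real AST-based check would also handle directives that appear mid-line after code or inside JSX, which this sketch only partly covers.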
The canonical react.doctor score is a remote API, so grading uses a deterministic local scorer (severity × category × impact, size-normalized) — never the network. Weights live in scoring-profiles/default.json (mirrored by constants.ts); scoringVersion + doctorVersion are stamped into every report.
rule-overlap.md documents which tool owns each signal so nothing is double-counted (React Doctor owns React dims; SlopBench only fills the TS-strictness + composition gaps).
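The description of the local scorer (severity × category × impact, size-normalized) can be sketched roughly as follows. This is an assumption-laden illustration, not the package's scorer: the real weights live in scoring-profiles/default.json and are not shown in this PR description, so the `ScoringProfile` shape, the per-100-lines normalization, and the function name `localSlopScore` are all hypothetical.

```typescript
type Severity = "error" | "warning";

interface Finding {
  severity: Severity;
  dimension: string; // e.g. "ts-strictness" or "composition"
  impact: number; // hypothetical per-rule multiplier
}

// Hypothetical profile shape; the real one is defined by the package.
interface ScoringProfile {
  severityWeight: Record<Severity, number>;
  dimensionWeight: Record<string, number>;
  defaultDimensionWeight: number;
}

// Pure function of its inputs: no network access, no clock, no randomness,
// so the same diff always grades to the same score.
function localSlopScore(
  findings: readonly Finding[],
  changedLines: number,
  profile: ScoringProfile,
): number {
  const penalty = findings.reduce((sum, finding) => {
    const dimension =
      profile.dimensionWeight[finding.dimension] ?? profile.defaultDimensionWeight;
    return sum + profile.severityWeight[finding.severity] * dimension * finding.impact;
  }, 0);
  // Size normalization (assumed form): penalty per 100 changed lines.
  const perHundredLines = (penalty / Math.max(changedLines, 1)) * 100;
  return Math.min(100, Math.max(0, 100 - perHundredLines));
}
```

Under this assumed normalization, the same finding costs less in a larger diff, which is one plausible reading of "size-normalized".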

Reference influences

Grounded in React Doctor rules, the deslop skill, and Vercel's react-best-practices, composition-patterns, and next-best-practices skills.

Task corpus (20, all validated — every dimension covered)

Both task families (produce-clean and handle-slop) are represented, Next App Router-weighted, with at least one discriminating task per dimension:

| Dimension | Task(s) |
| --- | --- |
| react-correctness / react-performance | `notification-list`, `status-pill-variants`, `comment-thread-extend` |
| accessibility | `icon-button-a11y` |
| ts-strictness | `format-money-util`, `typed-storage-util`, `parse-query-util`, `route-handler-json`, `retry-async-util`, `unique-by-util` |
| composition | `status-pill-variants`, `format-list-extend` |
| async-waterfall | `dashboard-loader` (Next server loader) |
| bundle | `chunk-util` |
| maintainability | `format-duration-util`, `truncate-middle-util`, `slugify-util`, `avatar-initials-util`, `paginate-util`, `title-case-util` |

- handle-slop (the seed ships working-but-sloppy code that the agent extends): group-by-extend, format-list-extend, comment-thread-extend.
- Next App Router: dashboard-loader, route-handler-json.
- The rest are produce-clean.

Every task: in-tree seed/, instruction.md (never mentions quality), hidden tests/test.patch, a clean reference solution/, and _authoring/ sources for the patches. Pure-TS tasks run on Node's built-in test runner (zero install); React tasks use vitest + react-dom/server.

Evidence (local validation, no Docker)

pnpm benchmark:validate → all 20 reference solutions pass the functional gate and score reward > 0 (references are pristine, scoring 100, except for one unavoidable React Doctor opinion). The two-axis separation is shown by grading correct-but-sloppy variants that still pass the functional gate:

| Task | Clean (slop score / reward) | Sloppy (slop score) | Slop detected |
| --- | --- | --- | --- |
| `notification-list` | 100 / 1.0 | 94.56 | array-index key, nested component, `any` |
| `comment-thread-extend` (handle-slop) | 100 / 1.0 | 94.56 | kept seed's index key + inline `any` component |
| `route-handler-json` (Next) | 100 / 1.0 | 95.2 | `any`, type assertion in query parsing |
| `format-money-util` | 100 / 1.0 | 95.2 | `any`, type assertion |
| `dashboard-loader` (Next) | 100 / 1.0 | 98.65 | sequential independent awaits |
| `icon-button-a11y` | 100 / 1.0 | 99.32 | non-interactive element handlers |
| `chunk-util` | 100 / 1.0 | 99.47 | full `lodash` import |

Functional gate proven independently: a clean-but-broken solution scores slop_score ~100 yet reward 0.
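The "sequential independent awaits" finding for `dashboard-loader` refers to a request waterfall. The sketch below contrasts the two shapes; `loadSequential`, `loadParallel`, and the `Loader` type are hypothetical names for illustration and are not the task's actual code.

```typescript
type Loader<T> = () => Promise<T>;

// Waterfall: the second request does not start until the first resolves,
// even though it does not depend on the first result.
async function loadSequential<A, B>(
  loadA: Loader<A>,
  loadB: Loader<B>,
): Promise<[A, B]> {
  const a = await loadA();
  const b = await loadB();
  return [a, b];
}

// Parallel: both requests start immediately and are awaited together.
async function loadParallel<A, B>(
  loadA: Loader<A>,
  loadB: Loader<B>,
): Promise<[A, B]> {
  return Promise.all([loadA(), loadB()]);
}
```

Both functions return the same data, which is why a behavioral test alone cannot distinguish them; the difference only shows up as latency, or as a static finding on the diff.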

Engine: 25 unit tests pass; typecheck, lint, format:check clean. (Lint/fmt need Node ≥22.18 to load the TS vite config — verified with Node 22.22; the sandbox's pinned Node is 22.14.)
@react-doctor/benchmark is wired into the root test filter, and a CI job runs the task self-test (pnpm benchmark:validate) so the corpus can't silently drift as the verifier / scoring profile / React Doctor evolve.

Running it

docker build -t slopbench-base:latest -f packages/benchmark/tasks/_base/Dockerfile .
pier run -p packages/benchmark/tasks --agent claude-code --model anthropic/claude-opus-4-7
pier run -p packages/benchmark/tasks --agent codex --model openai/gpt-5.5
node packages/benchmark/scripts/aggregate-results.mjs --logs <pier-logs> --model <m> --out results/<m>.json

See packages/benchmark/README.md and docs/SLOPBENCH.md for methodology, authoring, and the full run/aggregation workflow.
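For orientation, the aggregation step can be pictured as a reduction over per-task results. The sketch below is not the contents of aggregate-results.mjs; the `TaskRun` and `Summary` shapes and the `summarize` function are assumptions made for illustration, and per-dimension breakdowns are omitted.

```typescript
// Hypothetical per-task record, as it might be read from run logs.
interface TaskRun {
  task: string;
  functionalPass: boolean;
  slopScore: number; // 0..100
}

interface Summary {
  tasks: number;
  passRate: number; // fraction of tasks that passed the functional gate
  meanSlopScore: number;
  meanReward: number; // mean of functional_pass × (slop_score / 100)
}

function summarize(runs: readonly TaskRun[]): Summary {
  // Empty logs produce an all-zero summary instead of NaN from 0 / 0.
  if (runs.length === 0) {
    return { tasks: 0, passRate: 0, meanSlopScore: 0, meanReward: 0 };
  }
  let passed = 0;
  let slopTotal = 0;
  let rewardTotal = 0;
  for (const run of runs) {
    if (run.functionalPass) {
      passed += 1;
      rewardTotal += run.slopScore / 100;
    }
    slopTotal += run.slopScore;
  }
  return {
    tasks: runs.length,
    passRate: passed / runs.length,
    meanSlopScore: slopTotal / runs.length,
    meanReward: rewardTotal / runs.length,
  };
}
```

Reporting the pass rate and mean slop score alongside the mean reward keeps the two axes visible separately rather than only as their product.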

Notes / scope

A web leaderboard is intentionally deferred to v2 (this PR ships the results-aggregation JSON only), per the agreed plan.
Authoring tooling (scaffold-task.sh, gen-task-patches.sh, validate-task.sh, validate-all.sh) makes the corpus easy to grow further.
No changes to action.yml or any shipped package; @react-doctor/benchmark is private/unpublished.

Co-authored-by: Nisarg Patel <NisargIO@users.noreply.github.com>

pkg-pr-new · 2026-06-09T04:49:57Z

npm i https://pkg.pr.new/eslint-plugin-react-doctor@757

npm i https://pkg.pr.new/oxlint-plugin-react-doctor@757

npm i https://pkg.pr.new/react-doctor@757

commit: aa59fb9

github-actions · 2026-06-09T04:50:09Z

No React Doctor issues found. 🎉

Reviewed by React Doctor for commit aa59fb9.

cursoragent and others added 16 commits June 9, 2026 03:06 (each commit carries the trailer Co-authored-by: Nisarg Patel <NisargIO@users.noreply.github.com>)

- 830a836 feat(benchmark): scaffold SlopBench package + report contracts
- e26c16a feat(benchmark): React Doctor scanner (offline, diff-scoped) + passing integration test
- c659625 feat(benchmark): deterministic local slop scorer + weights + golden tests
- 88ad71d feat(benchmark): TS-strictness + composition + deslop AST checks (gap-fillers) + rule-overlap doc
- 51270da feat(benchmark): slop-verify orchestrator + CLI + diff collection + e2e tests
- d292b64 feat(benchmark): Harbor/Pier task harness: base image, shared grader, task _template
- c24dae6 feat(benchmark): task format-money-util (validated) + --no-dead-code in verifier + task validator
- 92ae8a7 feat(benchmark): task scaffolding + patch generators; add typed-storage-util task (validated)
- 22af7e5 feat(benchmark): handle-slop task group-by-extend (validated)
- 5a9eb7c feat(benchmark): React/vitest task status-pill-variants (validated)
- 899060e feat(benchmark): notification-list task (validated; demonstrates react slop discrimination)
- bd4f842 feat(benchmark): format-duration-util task + validate-all script (all 6 tasks green)
- fd4d160 feat(benchmark): README + results aggregator + methodology docs; scope unit tests to tests/; format
- d11756f feat(benchmark): dashboard-loader async-waterfall task (Next server loader; validated)
- b1a3000 feat(benchmark): add slugify, parse-query, format-list-extend tasks (10 tasks; all validated)
- c214279 feat(benchmark): add truncate-middle + avatar-initials tasks (12 tasks; all validated)

cursoragent and others added 12 commits June 9, 2026 04:54 (each commit carries the trailer Co-authored-by: Nisarg Patel <NisargIO@users.noreply.github.com>)

- 85a8405 ci(benchmark): wire SlopBench task self-test into CI + add benchmark:validate script
- d39dbfe docs(benchmark): reference pnpm benchmark:validate in README
- 6d9b7ea feat(benchmark): icon-button-a11y task (covers accessibility dimension; 13 tasks; validated)
- 3bea864 feat(benchmark): chunk-util task (covers bundle dimension; 14 tasks; all validated)
- b73466a feat(benchmark): route-handler-json Next App Router task (query filtering; 15 tasks; validated)
- f8a7e9a feat(benchmark): comment-thread-extend handle-slop React task (16 tasks; validated)
- df32dbf feat(benchmark): add paginate-util + retry-async-util tasks (18 tasks; all validated, references pristine)
- 41fbcb0 feat(benchmark): add unique-by + title-case tasks (20 tasks; full agreed range; all references pristine)
- 00781f0 fix(benchmark): regenerate task patches after formatting so all 20 apply cleanly
- d8acdb4 docs(benchmark): note to format before generating task patches
- 885a13a docs(benchmark): clean up authoring note (pnpm format before gen-task-patches)
- 6a031d4 test(benchmark): cover results aggregator (pass-rate, means, per-dimension, empty logs)

cursoragent and others added 2 commits June 9, 2026 05:48 (each commit carries the trailer Co-authored-by: Nisarg Patel <NisargIO@users.noreply.github.com>)

- d9f356e fix(benchmark): raise vitest timeout for React Doctor-spawning integration tests (CI 5s default timed out)
- aa59fb9 fix(docs): format docs/SLOPBENCH.md (CI format:check)

NisargIO wants to merge 30 commits into main from cursor/bc-c6c9d152-d3f7-49c9-baac-82f1c5d76a97-0bbb
