SlopBench: a frontend React/TypeScript slop benchmark (packages/benchmark)#757

Draft
NisargIO wants to merge 30 commits into main from cursor/bc-c6c9d152-d3f7-49c9-baac-82f1c5d76a97-0bbb


@NisargIO NisargIO commented Jun 9, 2026

Member

What this is

SlopBench measures how good individual models are at frontend engineering — specifically how much React/TypeScript "slop" they produce or fail to clean up, scored through the lens of React Doctor (but not limited to it). It reuses the DeepSWE / Harbor task format so it runs under Pier with a swappable harness (Claude Code, Codex, Gemini CLI, opencode, or model-agnostic mini-swe-agent).

The novel idea vs. correctness-only SWE benchmarks: SlopBench scores two axes per task.

  1. Functional correctness (gate) — hidden behavioral tests, exactly like DeepSWE.
  2. Slop score (0–100, continuous) — computed offline by React Doctor + a strict TypeScript pass + Vercel-derived composition checks + deslop heuristics.
reward = functional_pass × (slop_score / 100)
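
The formula above can be sketched in a few lines of TypeScript. This is only an illustration of the arithmetic: `TaskResult` and `computeReward` are hypothetical names, not part of the package's API.

```typescript
// Sketch of the two-axis reward: functional gate multiplied by the normalized slop score.
// `TaskResult` and `computeReward` are illustrative names (assumptions).
interface TaskResult {
  functionalPass: boolean; // did the hidden behavioral tests pass?
  slopScore: number; // 0 to 100, higher means cleaner code
}

function computeReward({ functionalPass, slopScore }: TaskResult): number {
  if (slopScore < 0 || slopScore > 100) {
    throw new RangeError(`slopScore must be within 0..100, got ${slopScore}`);
  }
  // The gate is binary: a failing solution earns 0 no matter how clean it is.
  return (functionalPass ? 1 : 0) * (slopScore / 100);
}

console.log(computeReward({ functionalPass: true, slopScore: 100 })); // clean and correct
console.log(computeReward({ functionalPass: false, slopScore: 100 })); // clean but broken
```

Because the gate multiplies rather than adds, cleanliness can never compensate for a failing test suite.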

A model can make the feature work and still score poorly for shipping slop (inline components, array-index keys, any, casts, @ts-ignore, boolean-prop soup, request waterfalls, full-library imports, non-interactive a11y handlers, …).
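
To make the distinction concrete, here is a minimal sketch of two functions that return the same result for string inputs, so a behavioral test would accept either. The function names and the query shape are assumptions for illustration; they are not taken from any task in the corpus.

```typescript
// Both functions return the same page number for string inputs.
// `parsePageSloppy` and `parsePageClean` are hypothetical names.

// Sloppy: `any` parameter, an `as` cast, and a non-null assertion.
function parsePageSloppy(query: any): number {
  const raw = query.page as string;
  return Number.parseInt(raw!, 10) || 1;
}

// Clean: the value stays `unknown` until a runtime check narrows it.
function parsePageClean(query: Record<string, unknown>): number {
  const raw = query.page;
  if (typeof raw !== "string") return 1;
  // NaN and 0 are both falsy, so they fall back to page 1.
  return Number.parseInt(raw, 10) || 1;
}

console.log(parsePageSloppy({ page: "3" }), parsePageClean({ page: "3" }));
```

The sloppy variant contains three of the patterns listed above (`any`, a cast, a non-null assertion), while the clean variant contains none.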

How it works

  • slop-verify (the new @react-doctor/benchmark package) runs over the agent's diff vs. the base commit, fully air-gapped:
    • React Doctor (--json --no-score --no-dead-code), diff-scoped to changed files. Its five categories map to react-correctness / react-performance / accessibility / maintainability; specific bundle/waterfall rules are rerouted to dedicated bundle / async-waterfall dimensions.
    • TypeScript strictness (AST, no type-checker): any, as casts, non-null !, @ts-ignore/@ts-nocheck/@ts-expect-error.
    • Composition (AST, from Vercel composition-patterns): boolean-prop soup, function-valued render props.
    • deslop: nested ternaries.
  • The canonical react.doctor score is a remote API, so grading uses a deterministic local scorer (severity × category × impact, size-normalized) — never the network. Weights live in scoring-profiles/default.json (mirrored by constants.ts); scoringVersion + doctorVersion are stamped into every report.
  • rule-overlap.md documents which tool owns each signal so nothing is double-counted (React Doctor owns React dims; SlopBench only fills the TS-strictness + composition gaps).
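
As a rough illustration of what a deterministic, size-normalized local scorer could look like, consider the sketch below. Every weight, the `Finding` shape, and the normalization formula are assumptions made for this example; the actual values live in scoring-profiles/default.json and may be combined differently.

```typescript
// Illustrative scorer: severity x dimension x impact, normalized by diff size.
// All numbers and names here are hypothetical, not the shipped scoring profile.
type Severity = "error" | "warning";

interface Finding {
  severity: Severity;
  dimension: string; // e.g. "ts-strictness", "accessibility"
  impact: number; // hypothetical per-rule multiplier
}

const SEVERITY_WEIGHT: Record<Severity, number> = { error: 3, warning: 1 };
const DIMENSION_WEIGHT: Record<string, number> = {
  "react-correctness": 1.5,
  "ts-strictness": 1.0,
};
const DEFAULT_DIMENSION_WEIGHT = 1.0;

function slopScore(findings: Finding[], changedLines: number): number {
  const penalty = findings.reduce((sum, f) => {
    const dimensionWeight = DIMENSION_WEIGHT[f.dimension] ?? DEFAULT_DIMENSION_WEIGHT;
    return sum + SEVERITY_WEIGHT[f.severity] * dimensionWeight * f.impact;
  }, 0);
  // Size normalization: express the penalty per 100 changed lines.
  const normalized = (penalty * 100) / Math.max(changedLines, 1);
  // Clamp to 0..100 and round to two decimals so reports are stable.
  return Math.min(100, Math.max(0, Number((100 - normalized).toFixed(2))));
}

const findings: Finding[] = [{ severity: "error", dimension: "ts-strictness", impact: 1 }];
console.log(slopScore(findings, 100));
```

The function is pure: the same findings and diff size always yield the same score, which is the property that makes offline grading reproducible.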

Reference influences

Grounded in React Doctor rules, the deslop skill, and Vercel's react-best-practices, composition-patterns, and next-best-practices skills.

Task corpus (20, all validated — every dimension covered)

Both task families (produce-clean and handle-slop) are covered, Next App Router-weighted; each dimension has at least one discriminating task:

| Dimension | Task(s) |
| --- | --- |
| react-correctness / react-performance | notification-list, status-pill-variants, comment-thread-extend |
| accessibility | icon-button-a11y |
| ts-strictness | format-money-util, typed-storage-util, parse-query-util, route-handler-json, retry-async-util, unique-by-util |
| composition | status-pill-variants, format-list-extend |
| async-waterfall | dashboard-loader (Next server loader) |
| bundle | chunk-util |
| maintainability | format-duration-util, truncate-middle-util, slugify-util, avatar-initials-util, paginate-util, title-case-util |

  • handle-slop (seed ships working-but-sloppy code the agent extends): group-by-extend, format-list-extend, comment-thread-extend.
  • Next App Router: dashboard-loader, route-handler-json.
  • The rest are produce-clean.

Every task: in-tree seed/, instruction.md (never mentions quality), hidden tests/test.patch, a clean reference solution/, and _authoring/ sources for the patches. Pure-TS tasks run on Node's built-in test runner (zero install); React tasks use vitest + react-dom/server.
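
For a sense of what a zero-install behavioral test on Node's built-in runner might look like, here is a small sketch. The `slugify` implementation and the test cases are hypothetical stand-ins; the real hidden tests and reference solutions are not reproduced here.

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";

// Hypothetical unit under test; a real task would import it from the seed instead.
function slugify(input: string): string {
  return input
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-") // collapse runs of non-alphanumerics into one hyphen
    .replace(/^-+|-+$/g, ""); // strip leading and trailing hyphens
}

test("slugify lowercases and hyphenates", () => {
  assert.equal(slugify("Hello, World!"), "hello-world");
});

test("slugify returns an empty string for punctuation-only input", () => {
  assert.equal(slugify("  --  "), "");
});
```

Such a test checks behavior only; it says nothing about how the function is written, which is why the slop axis is graded separately.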

Evidence (local validation, no Docker)

pnpm benchmark:validate: all 20 reference solutions pass and score reward > 0 (references are pristine, scoring 100, except one unavoidable React-Doctor opinion). The two-axis separation shows up when grading correct-but-sloppy variants that still pass the functional gate:

| Task | Clean | Sloppy | Slop detected |
| --- | --- | --- | --- |
| notification-list | 100 / 1.0 | 94.56 | array-index key, nested component, any |
| comment-thread-extend (handle-slop) | 100 / 1.0 | 94.56 | kept seed's index key + inline any component |
| route-handler-json (Next) | 100 / 1.0 | 95.2 | any, type assertion in query parsing |
| format-money-util | 100 / 1.0 | 95.2 | any, type assertion |
| dashboard-loader (Next) | 100 / 1.0 | 98.65 | sequential independent awaits |
| icon-button-a11y | 100 / 1.0 | 99.32 | non-interactive element handlers |
| chunk-util | 100 / 1.0 | 99.47 | full lodash import |

Functional gate proven independently: a clean-but-broken solution scores slop_score ~100 yet reward 0.
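
The "sequential independent awaits" finding for dashboard-loader refers to a pattern like the one sketched below. The fetchers and the loader functions are hypothetical stand-ins written for this example, not code from the task.

```typescript
// Hypothetical fetchers standing in for a dashboard's independent queries.
const delay = <T>(value: T, ms: number): Promise<T> =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

const fetchUser = () => delay({ name: "Ada" }, 50);
const fetchStats = () => delay({ visits: 42 }, 50);

// Waterfall: the second request waits for the first although it does not depend on it.
async function loadDashboardSequential() {
  const user = await fetchUser();
  const stats = await fetchStats();
  return { user, stats };
}

// Parallel: both independent requests start before either is awaited.
async function loadDashboardParallel() {
  const [user, stats] = await Promise.all([fetchUser(), fetchStats()]);
  return { user, stats };
}

loadDashboardParallel().then((data) => console.log(data));
```

Both loaders return the same data and so pass the same functional tests; only the sequential one adds the latency of the two requests together.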

  • Engine: 25 unit tests pass; typecheck, lint, format:check clean. (Lint/fmt need Node ≥22.18 to load the TS vite config — verified with Node 22.22; the sandbox's pinned Node is 22.14.)
  • @react-doctor/benchmark is wired into the root test filter, and a CI job runs the task self-test (pnpm benchmark:validate) so the corpus can't silently drift as the verifier / scoring profile / React Doctor evolve.

Running it

docker build -t slopbench-base:latest -f packages/benchmark/tasks/_base/Dockerfile .
pier run -p packages/benchmark/tasks --agent claude-code --model anthropic/claude-opus-4-7
pier run -p packages/benchmark/tasks --agent codex --model openai/gpt-5.5
node packages/benchmark/scripts/aggregate-results.mjs --logs <pier-logs> --model <m> --out results/<m>.json

See packages/benchmark/README.md and docs/SLOPBENCH.md for methodology, authoring, and the full run/aggregation workflow.

Notes / scope

  • A web leaderboard is intentionally v2 (this ships the results-aggregation JSON only), per the agreed plan.
  • Authoring tooling (scaffold-task.sh, gen-task-patches.sh, validate-task.sh, validate-all.sh) makes the corpus easy to grow further.
  • No changes to action.yml or any shipped package; @react-doctor/benchmark is private/unpublished.

cursoragent and others added 16 commits June 9, 2026 03:06
Each of the 16 commits carries the trailer Co-authored-by: Nisarg Patel <NisargIO@users.noreply.github.com>. Truncated commit-title fragments shown on the page:
  • …g integration test
  • …ests
  • …-fillers) + rule-overlap doc
  • …2e tests
  • …, task _template
  • …in verifier + task validator
  • …ge-util task (validated)
  • …t slop discrimination)
  • … 6 tasks green)
  • …e unit tests to tests/; format
  • …oader; validated)
  • …10 tasks; all validated)
  • …s; all validated)


pkg-pr-new Bot commented Jun 9, 2026


npm i https://pkg.pr.new/eslint-plugin-react-doctor@757
npm i https://pkg.pr.new/oxlint-plugin-react-doctor@757
npm i https://pkg.pr.new/react-doctor@757

commit: aa59fb9


github-actions Bot commented Jun 9, 2026

Contributor

No React Doctor issues found. 🎉

Reviewed by React Doctor for commit aa59fb9.

cursoragent and others added 12 commits June 9, 2026 04:54
Each of the 12 commits carries the trailer Co-authored-by: Nisarg Patel <NisargIO@users.noreply.github.com>. Truncated commit-title fragments shown on the page:
  • …validate script
  • …n; 13 tasks; validated)
  • …all validated)
  • …ring; 15 tasks; validated)
  • …ks; validated)
  • …; all validated, references pristine)
  • …eed range; all references pristine)
  • …ply cleanly
  • …-patches)
  • …nsion, empty logs)

cursoragent and others added 2 commits June 9, 2026 05:48
Both commits carry the trailer Co-authored-by: Nisarg Patel <NisargIO@users.noreply.github.com>. Truncated commit-title fragment shown on the page:
  • …ation tests (CI 5s default timed out)


2 participants