SlopBench: a frontend React/TypeScript slop benchmark (packages/benchmark) #757
Draft
NisargIO wants to merge 30 commits into
No React Doctor issues found. 🎉 Reviewed by React Doctor.
What this is
SlopBench measures how good individual models are at frontend engineering: specifically, how much React/TypeScript "slop" they produce or fail to clean up, scored through the lens of React Doctor (but not limited to it). It reuses the DeepSWE / Harbor task format so it runs under Pier with a swappable harness (Claude Code, Codex, Gemini CLI, opencode, or the model-agnostic mini-swe-agent).

The novel idea vs. correctness-only SWE benchmarks: SlopBench scores two axes per task. A model can make the feature work and still score poorly for shipping slop (inline components, array-index keys, `any`, casts, `@ts-ignore`, boolean-prop soup, request waterfalls, full-library imports, non-interactive a11y handlers, …).

How it works
`slop-verify` (the new `@react-doctor/benchmark` package) runs over the agent's diff vs. the base commit, fully air-gapped:
- React Doctor (`--json --no-score --no-dead-code`), diff-scoped to changed files. Its five categories map to `react-correctness` / `react-performance` / `accessibility` / `maintainability`; specific bundle/waterfall rules are rerouted to dedicated `bundle` / `async-waterfall` dimensions.
- TS strictness: `any`, `as` casts, non-null `!`, `@ts-ignore` / `@ts-nocheck` / `@ts-expect-error`.
- Scoring profile: `scoring-profiles/default.json` (mirrored by `constants.ts`); `scoringVersion` + `doctorVersion` are stamped into every report.
- `rule-overlap.md` documents which tool owns each signal so nothing is double-counted (React Doctor owns React dims; SlopBench only fills the TS-strictness + composition gaps).

Reference influences
Grounded in React Doctor rules, the deslop skill, and Vercel's react-best-practices, composition-patterns, and next-best-practices skills.
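
As a rough illustration of the TS-strictness signals listed under "How it works", the sketch below counts them on the lines a unified diff adds. It is not the `slop-verify` implementation: the function name `countStrictnessSlop`, the `Signal` keys, and the regexes are assumptions made for this example. A regex scan will also over-count some cases (for example `import * as X`, or matches inside comments and strings) that an AST-based check would handle correctly.

```typescript
// Hypothetical sketch: count TS-strictness escape hatches on lines added by a diff.
type Signal = "any" | "asCast" | "nonNull" | "tsDirective";

const PATTERNS: Record<Signal, RegExp> = {
  // `: any`, `<any>`, and `as any` uses of the any type
  any: /(?::\s*|<|\bas\s+)any\b/g,
  // `as Foo` casts, excluding `as const` and `as any` (counted above)
  asCast: /\bas\s+(?!const\b|any\b)[A-Za-z_{[(]/g,
  // non-null assertion: `x!` followed by `.`, `[`, `)`, `,`, `;`, or end of line
  nonNull: /[\w)\]]!(?=[.\[),;]|\s*$)/g,
  // suppression directives
  tsDirective: /@ts-(?:ignore|nocheck|expect-error)\b/g,
};

export function countStrictnessSlop(unifiedDiff: string): Record<Signal, number> {
  const counts: Record<Signal, number> = { any: 0, asCast: 0, nonNull: 0, tsDirective: 0 };
  for (const line of unifiedDiff.split("\n")) {
    // Only added lines count; skip the `+++ b/file` header line.
    if (!line.startsWith("+") || line.startsWith("+++")) continue;
    const added = line.slice(1);
    for (const key of Object.keys(PATTERNS) as Signal[]) {
      counts[key] += (added.match(PATTERNS[key]) ?? []).length;
    }
  }
  return counts;
}
```

Scoping the scan to added lines mirrors the diff-scoped idea described above: removed and unchanged lines do not count against the agent.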
Task corpus (20, all validated — every dimension covered)
Both families, Next App Router-weighted; at least one discriminating task per dimension:
- `notification-list`, `status-pill-variants`, `comment-thread-extend`
- `icon-button-a11y`
- `format-money-util`, `typed-storage-util`, `parse-query-util`, `route-handler-json`, `retry-async-util`, `unique-by-util`
- `status-pill-variants`, `format-list-extend`
- `dashboard-loader` (Next server loader)
- `chunk-util`
- `format-duration-util`, `truncate-middle-util`, `slugify-util`, `avatar-initials-util`, `paginate-util`, `title-case-util`
- `group-by-extend`, `format-list-extend`, `comment-thread-extend`.
- `dashboard-loader`, `route-handler-json`.

Every task has an in-tree `seed/`, an `instruction.md` (which never mentions quality), a hidden `tests/test.patch`, a clean reference `solution/`, and `_authoring/` sources for the patches. Pure-TS tasks run on Node's built-in test runner (zero install); React tasks use vitest + `react-dom/server`.

Evidence (local validation, no Docker)
`pnpm benchmark:validate` → all 20 reference solutions pass and score reward > 0 (references are pristine, scoring 100, except for one unavoidable React Doctor opinion). The two-axis separation is shown by grading correct-but-sloppy variants that still pass the functional gate:
- `notification-list`: `any`
- `comment-thread-extend` (handle-slop): `any` component
- `route-handler-json` (Next): `any`, type assertion in query parsing
- `format-money-util`: `any`, type assertion
- `dashboard-loader` (Next)
- `icon-button-a11y`
- `chunk-util`: `lodash` import

Functional gate proven independently: a clean-but-broken solution scores `slop_score` ~100 yet reward 0.
- `typecheck`, `lint`, and `format:check` are clean. (Lint/fmt need Node ≥22.18 to load the TS vite config; this was verified with Node 22.22, while the sandbox's pinned Node is 22.14.)
- `@react-doctor/benchmark` is wired into the root `test` filter, and a CI job runs the task self-test (`pnpm benchmark:validate`) so the corpus can't silently drift as the verifier / scoring profile / React Doctor evolve.

Running it
See `packages/benchmark/README.md` and `docs/SLOPBENCH.md` for methodology, authoring, and the full run/aggregation workflow.

Notes / scope
- The authoring tooling (`scaffold-task.sh`, `gen-task-patches.sh`, `validate-task.sh`, `validate-all.sh`) makes the corpus easy to grow further.
- No changes to `action.yml` or any shipped package; `@react-doctor/benchmark` is private/unpublished.
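
The functional gate described under "Evidence" (a clean-but-broken solution scores `slop_score` ~100 yet reward 0) can be sketched as follows. This is only an illustration under stated assumptions: the `TaskResult` shape, the `reward` function, and the linear `slopScore / 100` mapping are hypothetical and are not taken from `scoring-profiles/default.json`.

```typescript
// Hypothetical sketch of a two-axis score: a functional gate plus a slop score.
interface TaskResult {
  testsPassed: boolean; // axis 1: did the hidden tests pass?
  slopScore: number; // axis 2: 0..100, where 100 means no slop was found
}

// Assumed mapping: reward is 0 when the gate fails,
// otherwise the slop score clamped to 0..100 and scaled to 0..1.
export function reward(result: TaskResult): number {
  if (!result.testsPassed) return 0;
  const clamped = Math.min(100, Math.max(0, result.slopScore));
  return clamped / 100;
}

// Illustrative inputs only; these are not measured results.
const cleanButBroken: TaskResult = { testsPassed: false, slopScore: 100 };
const correctButSloppy: TaskResult = { testsPassed: true, slopScore: 40 };

console.log(reward(cleanButBroken)); // 0
console.log(reward(correctButSloppy)); // 0.4
```

Gating on the tests first keeps the two axes separate: a pristine diff cannot earn reward unless the feature works, and a working feature still loses reward for the slop it ships.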