
LLM Benchmark: Sequential Upgrades Test#4817

Open
bradleyshep wants to merge 4 commits into master from bradley/llm-exhaustive-test

Conversation

@bradleyshep
Contributor

Description of Changes

AI app generation benchmark comparing SpacetimeDB vs PostgreSQL (Express + Socket.io + Drizzle ORM). Same AI model (Claude Sonnet 4.6), same prompts, same chat app, two backends. Each app was upgraded through 12 feature levels and manually graded at each level; bugs were fixed, and all costs were measured via OpenTelemetry.

Results viewable at: https://spacetimedb.com/llms-benchmark-sequential-upgrade

Benchmark harness (tools/llm-sequential-upgrade/)

  • run.sh: orchestrates headless Claude Code sessions for code generation, sequential upgrades, and bug fixes. Tracks all API costs via OTel. Supports --upgrade, --fix, --composed-prompt, --resume-session modes.
  • grade.sh / grade-agents.sh: grading harnesses for manual testing of generated apps.
  • docker-compose.otel.yaml: OTel collector + PostgreSQL services.
  • generate-report.mjs / parse-telemetry.mjs: aggregate per-session telemetry into cost reports.
  • Backend guidelines in backends/: SpacetimeDB SDK reference, config templates, server setup docs, PostgreSQL setup with Drizzle/Socket.io guidance.
    After LLM Benchmark Improvements + More Evals #4740 merges, we will likely want to update this so that it reads backend and SDK guidance from SKILLS.
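A minimal sketch of how the harness modes compose into a full run. The flag names (--composed-prompt, --upgrade, --fix, --resume-session) come from the list above; the prompt path, the level-argument shape, and the `run_sh` stub standing in for the real run.sh are illustrative assumptions:

```shell
# Stub so the sketch runs standalone; the real harness lives at
# tools/llm-sequential-upgrade/run.sh and drives headless Claude Code sessions.
run_sh() { echo "run.sh $*"; }

# Generate the L1 app from a composed prompt, then upgrade sequentially.
run_sh --composed-prompt prompts/L1.md
for level in $(seq 2 12); do
  run_sh --upgrade "L${level}" --resume-session
  # After manual grading at each level, filed bugs would be addressed via --fix.
done
```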

Two complete benchmark runs

Run 1 (20260403): Original methodology.
Run 2 (20260406): Refined methodology, with domain bias removed from the SpacetimeDB SDK docs and the PostgreSQL instructions made feature-spec-neutral.
Note: these changes produced no meaningful difference in results. Domain-familiarity bias was very small and almost certainly not the cause of SpacetimeDB's major gains over the PostgreSQL stack.

Each run contains full L1-L12 app source for both backends, level snapshots preserving state before each upgrade, and per-session OTel cost summaries.
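Aggregating the per-session summaries can be spot-checked directly from the shell. A sketch, assuming each session directory holds a cost-summary.json with a top-level "total_cost_usd" field (the field name and directory layout are assumptions, not taken from the repo):

```shell
# Create two fake session summaries so the sketch is self-contained.
mkdir -p /tmp/bench-demo/s1 /tmp/bench-demo/s2
echo '{"total_cost_usd": 1.25}' > /tmp/bench-demo/s1/cost-summary.json
echo '{"total_cost_usd": 2.50}' > /tmp/bench-demo/s2/cost-summary.json

# Extract each per-session cost and sum into a run total.
total=$(sed -n 's/.*"total_cost_usd": *\([0-9.]*\).*/\1/p' /tmp/bench-demo/*/cost-summary.json \
  | awk '{sum += $1} END {printf "%.2f", sum}')
echo "run total: \$${total}"
```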

12 feature levels

| Level | Feature |
| --- | --- |
| L1 | Basic Chat + Typing + Read Receipts + Unread Counts |
| L2 | Scheduled Messages |
| L3 | Ephemeral Messages |
| L4 | Message Reactions |
| L5 | Message Editing with History |
| L6 | Real-Time Permissions (kick, ban, promote) |
| L7 | Rich User Presence |
| L8 | Message Threading |
| L9 | Private Rooms + Direct Messages |
| L10 | Room Activity Indicators |
| L11 | Draft Sync |
| L12 | Anonymous to Registered Migration |

Results

| Metric | Run 1 (20260403) | Run 2 (20260406) |
| --- | --- | --- |
| SpacetimeDB total cost | $13.33 | $12.62 |
| PostgreSQL total cost | $17.80 | $19.68 |
| SpacetimeDB bugs | 5 | 2 |
| PostgreSQL bugs | 19 | 8 |
| SpacetimeDB fix sessions | 4 | 1 |
| PostgreSQL fix sessions | 17 | 10 |

Both runs agree: SpacetimeDB apps are cheaper to build, have fewer bugs, and require fewer fix iterations. The refined methodology (Run 2) widened the cost gap and confirmed the advantage is structural, not an artifact of domain-biased SDK docs.

Performance benchmark (perf-benchmark/)

A stress-throughput tool that fires concurrent writers at peak saturation against the AI-generated send_message handlers.

| Tier | SpacetimeDB (avg) | PostgreSQL (avg) | Ratio |
| --- | --- | --- | --- |
| AI-generated (as-shipped) | 5,267 msgs/sec | 694 msgs/sec | 7.6x |
| PG rate limit removed | 5,267 msgs/sec | 1,070 msgs/sec | 4.9x |
| Optimized (same features kept) | 25,278 msgs/sec | 1,139 msgs/sec | 22x |

The gap widens with optimization because SpacetimeDB's bottleneck is fixable code patterns in the reducer while PostgreSQL's bottleneck is architectural (sequential network round-trips to an external database).
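The Ratio column follows directly from the throughput averages; a quick awk check (one-decimal rounding, so the table's 22x appears as 22.2):

```shell
# Recompute Ratio = SpacetimeDB avg / PostgreSQL avg for each tier,
# using the msgs/sec figures from the table above.
for pair in 5267:694 5267:1070 25278:1139; do
  echo "$pair" | awk -F: '{printf "%s / %s = %.1fx\n", $1, $2, $1/$2}'
done
```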

Optimized reference code with all features preserved is in perf-benchmark/results/optimized-reference/.

Data handling

Per-session cost summaries (cost-summary.json, COST_REPORT.md, metadata.json) are committed. Raw OTel telemetry (raw-telemetry.jsonl) containing PII is excluded via .gitignore and stored privately.
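An entry of roughly this shape achieves the exclusion (the exact pattern used in the repo's .gitignore is an assumption):

```
# raw OTel telemetry contains PII; only the derived summaries are committed
**/raw-telemetry.jsonl
```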

API and ABI breaking changes

None. All changes are in tools/llm-sequential-upgrade/. No production code, library, or SDK changes.

Expected complexity level and risk

1 - Trivial. Self-contained benchmarking tooling and data. No interaction with production code.

Testing

  • L1-L12 upgrades completed on all 4 apps (2 backends x 2 runs) with OTel cost capture
  • All levels manually graded after each upgrade; bugs filed and fixed via the harness
  • Methodology refinement between runs validated (domain bias removal, feature-neutral instructions)
  • Stress benchmarks run across both runs x 3 tiers (as-shipped, rate-limit-removed, optimized)
  • Optimized benchmarks verified to preserve all original features
  • Sensitive data (PII in raw telemetry) removed from repo and gitignored
  • Reviewer: spot-check that METRICS_DATA.json / METRICS_REPORT.json numbers match the telemetry cost-summary.json files
