
LLM Benchmark: Sequential Upgrades Test#4817

Open
bradleyshep wants to merge 4 commits into master from bradley/llm-exhaustive-test

Conversation

@bradleyshep
Contributor

Description of Changes

AI app generation benchmark comparing SpacetimeDB vs PostgreSQL (Express + Socket.io + Drizzle ORM). Same AI model (Claude Sonnet 4.6), same prompts, same chat app, two backends. Each app was upgraded through 12 feature levels and manually graded at each level; bugs were fixed, and all costs were measured via OpenTelemetry.

Results viewable at: https://spacetimedb.com/llms-benchmark-sequential-upgrade

Benchmark harness (tools/llm-sequential-upgrade/)

  • run.sh: orchestrates headless Claude Code sessions for code generation, sequential upgrades, and bug fixes. Tracks all API costs via OTel. Supports --upgrade, --fix, --composed-prompt, --resume-session modes.
  • grade.sh / grade-agents.sh: grading harnesses for manual testing of generated apps.
  • docker-compose.otel.yaml: OTel collector + PostgreSQL services.
  • generate-report.mjs / parse-telemetry.mjs: aggregate per-session telemetry into cost reports.
  • Backend guidelines in backends/: SpacetimeDB SDK reference, config templates, server setup docs, PostgreSQL setup with Drizzle/Socket.io guidance.
    After LLM Benchmark Improvements + More Evals #4740 merges, we will likely want to update this so that it reads backend and SDK guidance from SKILLS.
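A minimal sketch of how the harness modes compose into a full run. The flag names (--composed-prompt, --upgrade, --fix, --resume-session) come from the list above; the prompt path, the level-argument shape, and the `run_sh` stub standing in for the real run.sh are illustrative assumptions:

```shell
# Stub so the sketch runs standalone; the real harness lives at
# tools/llm-sequential-upgrade/run.sh and drives headless Claude Code sessions.
run_sh() { echo "run.sh $*"; }

# Generate the L1 app from a composed prompt, then upgrade sequentially.
run_sh --composed-prompt prompts/L1.md
for level in $(seq 2 12); do
  run_sh --upgrade "L${level}" --resume-session
  # After manual grading at each level, filed bugs would be addressed via --fix.
done
```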

Two complete benchmark runs

Run 1 (20260403): Original methodology.
Run 2 (20260406): Refined methodology, with domain bias removed from the SpacetimeDB SDK docs and the PostgreSQL instructions made feature-spec-neutral.
Note: these changes produced no meaningful difference in results. Domain-familiarity bias was very small and almost certainly not the cause of SpacetimeDB's major gains over the PostgreSQL stack.

Each run contains full L1-L12 app source for both backends, level snapshots preserving state before each upgrade, and per-session OTel cost summaries.
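Aggregating the per-session summaries can be spot-checked directly from the shell. A sketch, assuming each session directory holds a cost-summary.json with a top-level "total_cost_usd" field (the field name and directory layout are assumptions, not taken from the repo):

```shell
# Create two fake session summaries so the sketch is self-contained.
mkdir -p /tmp/bench-demo/s1 /tmp/bench-demo/s2
echo '{"total_cost_usd": 1.25}' > /tmp/bench-demo/s1/cost-summary.json
echo '{"total_cost_usd": 2.50}' > /tmp/bench-demo/s2/cost-summary.json

# Extract each per-session cost and sum into a run total.
total=$(sed -n 's/.*"total_cost_usd": *\([0-9.]*\).*/\1/p' /tmp/bench-demo/*/cost-summary.json \
  | awk '{sum += $1} END {printf "%.2f", sum}')
echo "run total: \$${total}"
```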

12 feature levels

| Level | Feature |
| --- | --- |
| L1 | Basic Chat + Typing + Read Receipts + Unread Counts |
| L2 | Scheduled Messages |
| L3 | Ephemeral Messages |
| L4 | Message Reactions |
| L5 | Message Editing with History |
| L6 | Real-Time Permissions (kick, ban, promote) |
| L7 | Rich User Presence |
| L8 | Message Threading |
| L9 | Private Rooms + Direct Messages |
| L10 | Room Activity Indicators |
| L11 | Draft Sync |
| L12 | Anonymous to Registered Migration |

Results

| Metric | Run 1 (20260403) | Run 2 (20260406) |
| --- | --- | --- |
| SpacetimeDB total cost | $13.33 | $12.62 |
| PostgreSQL total cost | $17.80 | $19.68 |
| SpacetimeDB bugs | 5 | 2 |
| PostgreSQL bugs | 19 | 8 |
| SpacetimeDB fix sessions | 4 | 1 |
| PostgreSQL fix sessions | 17 | 10 |

Both runs agree: SpacetimeDB apps are cheaper to build, have fewer bugs, and require fewer fix iterations. The refined methodology (Run 2) widened the cost gap and confirmed the advantage is structural, not an artifact of domain-biased SDK docs.

Performance benchmark (perf-benchmark/)

A stress-throughput tool that fires concurrent writers at peak saturation against the AI-generated send_message handlers.

| Tier | SpacetimeDB (avg) | PostgreSQL (avg) | Ratio |
| --- | --- | --- | --- |
| AI-generated (as-shipped) | 5,267 msgs/sec | 694 msgs/sec | 7.6x |
| PG rate limit removed | 5,267 msgs/sec | 1,070 msgs/sec | 4.9x |
| Optimized (same features kept) | 25,278 msgs/sec | 1,139 msgs/sec | 22x |

The gap widens with optimization because SpacetimeDB's bottleneck is fixable code patterns in the reducer while PostgreSQL's bottleneck is architectural (sequential network round-trips to an external database).
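The Ratio column follows directly from the throughput averages; a quick awk check (one-decimal rounding, so the table's 22x appears as 22.2):

```shell
# Recompute Ratio = SpacetimeDB avg / PostgreSQL avg for each tier,
# using the msgs/sec figures from the table above.
for pair in 5267:694 5267:1070 25278:1139; do
  echo "$pair" | awk -F: '{printf "%s / %s = %.1fx\n", $1, $2, $1/$2}'
done
```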

Optimized reference code with all features preserved is in perf-benchmark/results/optimized-reference/.

Data handling

Per-session cost summaries (cost-summary.json, COST_REPORT.md, metadata.json) are committed. Raw OTel telemetry (raw-telemetry.jsonl) containing PII is excluded via .gitignore and stored privately.
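An entry of roughly this shape achieves the exclusion (the exact pattern used in the repo's .gitignore is an assumption):

```
# raw OTel telemetry contains PII; only the derived summaries are committed
**/raw-telemetry.jsonl
```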

API and ABI breaking changes

None. All changes are in tools/llm-sequential-upgrade/. No production code, library, or SDK changes.

Expected complexity level and risk

1 - Trivial. Self-contained benchmarking tooling and data. No interaction with production code.

Testing

  • L1-L12 upgrades completed on all 4 apps (2 backends x 2 runs) with OTel cost capture
  • All levels manually graded after each upgrade; bugs filed and fixed via the harness
  • Methodology refinement between runs validated (domain bias removal, feature-neutral instructions)
  • Stress benchmarks run across both runs x 3 tiers (as-shipped, rate-limit-removed, optimized)
  • Optimized benchmarks verified to preserve all original features
  • Sensitive data (PII in raw telemetry) removed from repo and gitignored
  • Reviewer: spot-check that METRICS_DATA.json / METRICS_REPORT.json numbers match the telemetry cost-summary.json files
