Bench concurrency by swelljoe · Pull Request #26 · swelljoe/nelson

swelljoe · 2026-05-31T05:41:57Z

Concurrent runs for benchmarking. Guarantees only one job per model at a time (to avoid rate limits, or overwhelming local models), but parallelizes across multiple models.

Adds a bunch more models.

The run matrix is I/O-bound (each raw-api-loop container is a stdlib ReAct loop blocked on the model API, using a fraction of its 4g/2cpu ceiling), so sequential execution wastes the network wait. Add a worker-thread executor behind a new --concurrency flag (default 1 = unchanged sequential path).

Design:

- One worker owns one competitor at a time and runs its cells sequentially, so no model ever has two runs in flight. This is the rate-limit guard for free OpenRouter endpoints. Distinct models run in parallel.
- Checkouts are pre-warmed once per case before the pool starts, so workers only read the :ro tree (prepare_checkout would otherwise race two threads into the same rmtree+refetch). Factored run_case's checkout into a new BenchRunner.prewarm_checkout.
- DB connections are now per-thread (threading.local) with busy_timeout, since a SQLite connection is single-thread-only; WAL (already on) serializes the concurrent writes.
- Safety rails share state under one lock. The auth breaker uses a concurrency-safe rule: trip on total auth-failures >= K while nothing has completed (catches a dead host login; "consecutive" is undefined when runs overlap). The sequential path keeps its original consecutive rule.

Tests: thread-local DB writes; cells run exactly once; never overlaps one model while overlapping distinct ones; idempotent; breaker trips on all-fail and stays silent once anything completes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
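The worker-per-competitor design described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `nelson/automate.py`: `run_matrix`, `run_cell`, and `prewarm` are hypothetical names, and cells are assumed to be `(competitor, case)` pairs.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_matrix(cells, run_cell, concurrency=1, prewarm=None):
    """Run (competitor, case) cells, never two cells of one competitor at once.

    Hypothetical sketch: run_cell(competitor, case) performs one run and
    prewarm(case) materializes a checkout before any worker starts.
    """
    cells = list(cells)

    # Pre-warm each distinct case once, before the pool exists, so that
    # workers only ever read the checkout and never race to create it.
    if prewarm is not None:
        for case in dict.fromkeys(case for _, case in cells):
            prewarm(case)

    # Group cells so that one worker owns all cells of one competitor.
    by_competitor = defaultdict(list)
    for competitor, case in cells:
        by_competitor[competitor].append(case)

    def run_competitor(competitor):
        # Cells of a single competitor run strictly one after another.
        return [run_cell(competitor, case) for case in by_competitor[competitor]]

    competitors = list(by_competitor)
    if concurrency <= 1:
        # Sequential path: no threads involved.
        return [r for c in competitors for r in run_competitor(c)]

    # Distinct competitors run in parallel, at most `concurrency` at a time.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        groups = pool.map(run_competitor, competitors)
        return [r for group in groups for r in group]
```

Because each competitor is submitted as a single task, the "one run in flight per model" guarantee falls out of the task granularity rather than from a per-model lock.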

compute_cost bills every input token at the full prompt rate, but the ReAct loop resends the whole context each turn, and providers bill those repeats as cache reads at a fraction (OpenRouter's gpt-5.5 input_cache_read is $0.50/M, 0.1x the $5/M first-read rate). So our cost was a cache-blind upper bound: gpt-5.5 metered $36 vs $10.09 actually billed.

Fix: on OpenRouter endpoints send `usage: {include: true}` and sum the real `usage.cost` the provider returns per call; main() prefers it over compute_cost (kept as the fallback for providers that don't return cost). Guarded on the openrouter.ai host so a strict OpenAI-compat server (self-hosted llama-server, a vendor API) never sees the unknown request field. run_loop now returns a 5th value (provider_cost: float|None).

Two new tests: sums per-call cost + sends usage.include on OpenRouter; omits it elsewhere.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
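The cost fix described in this commit message could look roughly like the sketch below. The helper names (`is_openrouter`, `build_payload`, `add_provider_cost`, `final_cost`) are hypothetical and are not taken from `nelson/raw_api_loop.py`. The only details taken from the PR are the `usage: {include: true}` request field, the `usage.cost` response field, the openrouter.ai host guard, and the fallback to a token-based estimate.

```python
from urllib.parse import urlparse


def is_openrouter(base_url):
    """True only when the endpoint host is openrouter.ai (or a subdomain)."""
    host = urlparse(base_url).hostname or ""
    return host == "openrouter.ai" or host.endswith(".openrouter.ai")


def build_payload(base_url, model, messages):
    """Build a chat request; add the usage field only for OpenRouter."""
    payload = {"model": model, "messages": messages}
    if is_openrouter(base_url):
        # Other OpenAI-compatible servers never see this extra field.
        payload["usage"] = {"include": True}
    return payload


def add_provider_cost(total, response):
    """Accumulate usage.cost across calls; stay None if never reported."""
    cost = (response.get("usage") or {}).get("cost")
    if cost is None:
        return total
    return (total or 0.0) + float(cost)


def final_cost(provider_cost, input_tokens, output_tokens, in_rate, out_rate):
    """Prefer the provider-reported cost; otherwise estimate from tokens.

    Rates are dollars per million tokens. The estimate is cache-blind, so
    it is an upper bound when the provider discounts cache reads.
    """
    if provider_cost is not None:
        return provider_cost
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

Keeping the running total as `None` until a provider actually reports a cost lets the caller distinguish "provider reported zero" from "provider reported nothing", which is what decides whether the fallback is used.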

Copilot

Pull request overview

Adds concurrent execution for the benchmark “bench loop” so runs parallelize across models while ensuring at most one in-flight run per model (rate-limit guard). Also extends raw-api-loop to request and sum provider-reported cost from OpenRouter, and expands example competitor configurations (OpenRouter cohort) plus related tooling tweaks.

Changes:

- Add --concurrency to the bench loop and a concurrent executor that runs one competitor per worker thread while pre-warming checkouts.
- Make Database thread-safe for the new worker threads via per-thread SQLite connections.
- Enhance raw_api_loop.run_loop() to optionally request OpenRouter usage.include and return summed provider cost; update tests accordingly and add an OpenRouter auth profile + example competitors.
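For readers unfamiliar with the per-thread connection pattern mentioned above, here is a minimal sketch. `ThreadLocalDatabase` is a hypothetical class and is not the `Database` in `nelson/db.py`; the 5000 ms busy timeout is an assumed value.

```python
import sqlite3
import threading


class ThreadLocalDatabase:
    """Hand each thread its own SQLite connection, opened lazily."""

    def __init__(self, path, busy_timeout_ms=5000):
        self.path = path
        self.busy_timeout_ms = busy_timeout_ms
        self._local = threading.local()

    @property
    def conn(self):
        conn = getattr(self._local, "conn", None)
        if conn is None:
            # First use on this thread: open a connection that only this
            # thread will ever touch.
            conn = sqlite3.connect(self.path)
            # WAL lets readers proceed while one writer holds the lock.
            conn.execute("PRAGMA journal_mode=WAL")
            # Wait for a competing writer instead of failing immediately.
            conn.execute(f"PRAGMA busy_timeout={int(self.busy_timeout_ms)}")
            self._local.conn = conn
        return conn
```

A worker would then call `db.conn.execute(...)` followed by `db.conn.commit()`, and the connection it receives is never shared with another thread.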

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.


| File | Description |
| --- | --- |
| `nelson/automate.py` | Adds concurrent execution path for planned run matrix cells with shared safety rails. |
| `nelson/db.py` | Switches SQLite access to per-thread connections to support threaded execution. |
| `nelson/cli.py` | Adds `--concurrency` option and plumbs it into `run_once`. |
| `nelson/runner.py` | Extracts checkout materialization into `prewarm_checkout()` for safe pre-warming. |
| `nelson/raw_api_loop.py` | Returns provider-reported cost (when available) and requests OpenRouter usage inclusion. |
| `nelson/auth.py` | Adds `openrouter-api-key` auth profile mapping to `OPENROUTER_API_KEY`. |
| `tests/test_automate.py` | Adds tests covering concurrent execution guarantees and breaker behavior under overlap. |
| `tests/test_raw_api_loop.py` | Updates run_loop return shape and adds tests for OpenRouter cost/usage requests. |
| `competitors.example.yaml` | Adds an OpenRouter competitor cohort with pricing/cutoff metadata. |
| `.gitignore` | Ignores DB backups and timestamped HTML benchmark reports. |
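The breaker behavior that `tests/test_automate.py` covers follows the rule stated in the PR description: trip once total auth failures reach K while nothing has completed. A minimal sketch of that rule is below. `AuthBreaker` and its outcome strings are hypothetical, and treating "completed" as "a run finished successfully" is an assumption rather than something the PR states.

```python
import threading


class AuthBreaker:
    """Trip when auth failures reach `threshold` before anything completes."""

    def __init__(self, threshold):
        self.threshold = threshold
        self._lock = threading.Lock()
        self._auth_failures = 0
        self._completed = 0
        self.tripped = False

    def record(self, outcome):
        """Record one run outcome and return whether the breaker is tripped.

        `outcome` is "ok", "auth_failure", or any other string, which is
        ignored for breaker purposes.
        """
        with self._lock:
            if outcome == "auth_failure":
                self._auth_failures += 1
            elif outcome == "ok":
                self._completed += 1
            # A total rather than a streak: "consecutive" has no clear
            # meaning when runs from different workers interleave.
            if self._completed == 0 and self._auth_failures >= self.threshold:
                self.tripped = True
            return self.tripped
```

Once any run has completed, later auth failures no longer trip the breaker in this sketch, which matches the "stays silent once anything completes" test described in the PR.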


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Joe Cooper and others added 2 commits May 30, 2026 19:36

swelljoe requested a review from Copilot May 31, 2026 05:42

Copilot started reviewing on behalf of swelljoe May 31, 2026 05:42

Format

33733fe

Copilot AI reviewed May 31, 2026


Comment thread nelson/automate.py

Comment thread nelson/db.py

swelljoe and others added 3 commits May 31, 2026 01:49

Potential fix for pull request finding

4923de1

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

3b31670

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

lint

2ce9d60

swelljoe merged commit 52d1e2e into main May 31, 2026
1 check passed

swelljoe merged 6 commits into main from bench-concurrency


2 participants
