Skip to content

bench: BROAD tighten expression_tokens 12000 -> 7000 (closes #73)#85

Merged
mbachaud merged 3 commits into
masterfrom
bench/broad-tighten
May 12, 2026
Merged

bench: BROAD tighten expression_tokens 12000 -> 7000 (closes #73)#85
mbachaud merged 3 commits into
masterfrom
bench/broad-tighten

Conversation

@mbachaud
Copy link
Copy Markdown
Owner

Summary

  • Flip helix.toml:72 expression_tokens from 12000 to 7000 per BROAD tighten: expression_tokens 12000 -> 6k-8k (bench-gated) #73's BROAD-tier tighten target of 6k-8k. Bench-gated at N=1000, gemma4:e4b, bench_needle_1000.py --axis blind, ASK_PROXY=0 retrieval-only.
  • PASS gate: retrieval_rate 11.10% -> 12.00% (delta +0.90 pp, gate <= 2 pp). Context p95 3.531s -> 3.428s (-103 ms, informational). avg_budget_utilization was 8.4% on both runs — the 12k cap was never the binding constraint.

Bench numbers (baseline 12k vs candidate 7k)

Metric 12k 7k Delta
retrieval_rate 11.10% 12.00% +0.90 pp
context p50 2.256s 2.250s -0.006s
context p95 3.531s 3.428s -0.103s
avg_injected 1138.81 1129.87 -8.94
avg_budget_util 8.4% 8.4% flat
errors 0 0 flat

Per-category retrieval all flat or up except helix (n=150, 8.0% -> 6.7%, -1.33 pp — well within seed noise at small n). education_public (n=300) carries the headline at +2.34 pp.

Artifacts (in overnight_logs/, force-added since dir is gitignored)

  • broad_tighten_2026-05-12_1422_report.md — full method + numbers + provenance + helix.toml diff
  • needle_1000_broad12k_2026-05-12_1422_blind.json — baseline bench output
  • needle_1000_broad7k_2026-05-12_1422_blind.json — candidate bench output
  • corresponding .monitor.jsonl health-monitor timelines

Method notes

  • Branch base: bench/broad-tighten @ 8ecfbab (the prior PR-prep commit that added the ASK_PROXY=0 guard around the bench's /v1/chat/completions dispatch, unblocking retrieval-only A/B).
  • Snapshot DB: genome-bench-2026-05-08-frozen.db sha256 AEAAF3AB...37C7 (a copy made before bench start). The bench's benchmark_monitor aborts on snapshot mtime/size change; helix's _background_checkpoint task fires every 60s and dirties the live DB. Pointing the bench's GENOME_DB env at the frozen copy isolates the integrity check from helix's WAL activity. Helix itself reads the original DB.
  • Wall time: 80.7 min (40.7 baseline + 40.0 candidate).

Test plan

  • N=1000 baseline 12k bench: 11.1% retrieval, 40.7 min wall, 0 errors
  • N=1000 candidate 7k bench: 12.0% retrieval, 40.0 min wall, 0 errors
  • PASS gate confirmed via overnight_logs/_compare_bench.py (|delta| <= 2pp)
  • No retrieval regression in any category > 2 pp at the run's n
  • proxy_p50 = proxy_p95 = 0 on both runs (ASK_PROXY=0 took effect, no /chat dispatch, no helix.learn background pollution)
  • Snapshot DB sha256 unchanged across both runs (frozen copy strategy worked)

Closes #73.

🤖 Generated with Claude Code

mbachaud and others added 3 commits May 12, 2026 13:35
Phase 1 (BROAD tighten 12000 -> 7000) was not executed this session.
bench_needle_1000.py does not honor ASK_PROXY=0 (4 grep hits, all
docstrings), and the project's own canonical wall-time estimate
(_run_overnight_e4b.sh line 7) is ~5.25h per run with /chat enabled at
N=1000 + gemma4:e4b. Four runs (2 phases x 2 configs) is ~21h, which
exceeds session budget and would mutate the snapshot DB via the proxy's
background helix.learn path (observed gene count grew 18934 -> 18936
across 2 partial-run needles).

Hard constraint "Don't touch Python source" blocks the obvious 3-line
guard that would honor the explicit ASK_PROXY=0 directive. Reporting
this back to the user instead of running a contaminated/incomplete bench.

Artifacts (all under overnight_logs/):
  BLOCKER_2026-05-12_wall-time.md - full root-cause + resolution paths
  _compare_bench.py               - PASS/FAIL gate helper, ready to use
  _snapshot_sha256.txt            - genome-bench-2026-05-08.db hash
  bench_broad_server_baseline.log - uvicorn startup confirming
                                    expression_budget=12000 from this
                                    worktree, ribosome=disabled,
                                    18934 genes at session start

helix.toml is unchanged at expression_tokens = 12000. The branch is
ready to receive the actual config flip + bench JSONs once the script
constraint is resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The script header has long documented an ASK_PROXY=0 retrieval-only
mode (`# ASK_PROXY 1 = full pipeline, 0 = retrieval-only`) and the
wrapper benchmarks/_run_n1000_blind.sh has exported the env since the
2026-04-14 N=1000 sweep — but the Python script unconditionally hit
/v1/chat/completions for every needle (line 462). So:

  - Every needle paid the full /chat 50-90s wall (~21h to do the
    4 runs required for the #73 BROAD-tighten A/B at N=1000) instead
    of the ~25-40 min the wrapper docs advertised.
  - The /chat path triggers the proxy's background helix.learn
    replication task, which ingests the (synthetic-but-not-marked-
    bench) query+response pair back into the genome. A "read-mostly"
    snapshot bench actually mutated the snapshot mid-run (observed
    18934 -> 18936 genes after 2 needles during a prior session).
    That violates the across-runs determinism the budget A/B depends
    on.

This commit wraps the /chat call in `if not ASK_PROXY:` (and sets
zero-shaped defaults for proxy_latency / answer_text / answered).
Default remains ASK_PROXY=1 so historical full-pipeline runs are
byte-for-byte unchanged.

Validation:
- env parse smoke: ASK_PROXY=0 -> False, ASK_PROXY=1 -> True,
  unset -> True (legacy default preserved).
- tests/test_bench_harvest.py: 18/18 pass.
- The retrieval-only fields (retrieved, context_latency_s,
  ellipticity, genes_expressed, agent_meta token counts) are
  unaffected — they come from /context, which is still called
  unconditionally above the guard.

Unblocks the #73 BROAD-tighten bench (see e0cc385 BLOCKER report on
this same branch). The PLR gate (#74) will use bench_packet.py
instead, since PLR only attaches signals to /context/packet — the
needle bench wouldn't exercise it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PASS gate at N=1000, gemma4:e4b, --axis blind, ASK_PROXY=0:
  retrieval_rate: 11.10% -> 12.00% (delta +0.90 pp, gate <= 2 pp)
  context p95:    3.531s -> 3.428s (delta -0.103 s, informational)
  avg_budget_util: 8.4% -> 8.4% (12k cap was never binding)

Per-category retrieval all flat or up except helix at n=150 (8.0% -> 6.7%,
-1.33 pp at small n; well within seed noise of the 2pp gate). education_public
n=300 carries the +0.9pp headline at +2.34 pp.

Artifacts in overnight_logs/:
  - broad_tighten_2026-05-12_1422_report.md  (full numbers, method, provenance)
  - needle_1000_broad12k_2026-05-12_1422_blind.json (baseline)
  - needle_1000_broad7k_2026-05-12_1422_blind.json  (candidate)
  - corresponding .monitor.jsonl files (in-bench health/abort timeline)

Provenance: branch bench/broad-tighten @ 8ecfbab (ASK_PROXY=0 retrieval-only
guard) + this commit. Snapshot DB: genome-bench-2026-05-08-frozen.db
sha256 AEAAF3AB8FDF9E6078BEFCEECA7A11F91F74EA8B20F9EA167292B7C3476B37C7
(frozen copy used to isolate the bench's mtime-based integrity check from
helix's 60s WAL checkpoint task; helix itself read the original DB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mbachaud mbachaud merged commit fe0c813 into master May 12, 2026
3 checks passed
@mbachaud mbachaud deleted the bench/broad-tighten branch May 12, 2026 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BROAD tighten: expression_tokens 12000 -> 6k-8k (bench-gated)

1 participant