bench: flip [plr] enabled = true (closes #74) #86
Merged
Conversation
Phase 2 (flip `[plr] enabled = false -> true`) was not executed this session. Same root cause as the parallel #73 BROAD blocker: bench_needle_1000.py does not honor ASK_PROXY=0, and the canonical wall-time estimate (_run_overnight_e4b.sh line 7) is ~5.25 h per run.

Phase 0 prep DID complete successfully for this phase:
- training/models/stacked_plr.joblib loads cleanly
- schema_version=1 (matches the MODEL_SCHEMA_VERSION expectation)
- label_set='t07', cos_threshold=0.7
- auc_mean=0.6313, auc_std reported
- classifier=GradientBoostingClassifier
- source_export=cwola_export_20260415_windowed.json
- trained_at=2026-04-22T07:23:03Z
- PLR retrain NOT needed; the artifact is the AUC=0.631 stacked head from the user's CWoLa Sprint 3 acceptance test.

helix.toml is unchanged at `[plr] enabled = false`. The branch is ready to receive the config flip + bench JSONs once the script constraint is resolved. See overnight_logs/BLOCKER_2026-05-12_wall-time.md for the resolution-path menu the user can pick from.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
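The Phase 0 artifact checks above can be sketched as a small validator. This is a hypothetical illustration, not the session's actual check code: the metadata key names and the `MODEL_SCHEMA_VERSION` constant mirror the commit message, but the exact metadata layout inside `stacked_plr.joblib` is an assumption.

```python
MODEL_SCHEMA_VERSION = 1  # assumed module-level constant, per the commit message


def validate_plr_artifact(meta: dict) -> list[str]:
    """Return a list of failure strings; an empty list means Phase 0 passes."""
    failures = []
    if meta.get("schema_version") != MODEL_SCHEMA_VERSION:
        failures.append(f"schema_version={meta.get('schema_version')!r}, "
                        f"expected {MODEL_SCHEMA_VERSION}")
    if meta.get("label_set") != "t07":
        failures.append(f"label_set={meta.get('label_set')!r}, expected 't07'")
    if meta.get("auc_mean", 0.0) <= 0.55:  # the §C2 training-AUC gate
        failures.append(f"auc_mean={meta.get('auc_mean')} fails the > 0.55 gate")
    return failures


# The metadata values reported in this session pass cleanly:
meta = {"schema_version": 1, "label_set": "t07", "cos_threshold": 0.7,
        "auc_mean": 0.6313, "classifier": "GradientBoostingClassifier"}
print(validate_plr_artifact(meta))  # -> []
```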
PASS gate on /context/packet smoke bench (N=50, retrieval-only HTTP):
off-side leakage: 0/50 (gate = 0)
on-side presence: 100% (50/50) (gate >= 90%)
p95 latency: 2613ms -> 2224ms (delta -389ms, gate < +50ms)
p50 latency: 1165ms -> 1186ms (delta +21ms, no degradation)
The stacked PLR query-confidence head (STATISTICAL_FUSION.md §C3) now
attaches plr_confidence to /context/packet responses. Sample payload:
{"prob_B": 0.91, "logit": 2.34, "score_A": 0.09,
"high_risk": true, "artifact_label_set": "t07"}
Artifacts in this commit:
- helix.toml: [plr] enabled false -> true
- benchmarks/bench_plr_smoke.py: new HTTP smoke bench (50-100 lines, no
new deps, uses httpx + bench_needle_1000's harvester for a realistic
KV-needle query corpus). Has a --summarize sub-mode that emits the
gate verdict.
- training/models/stacked_plr.joblib + .sha256 (force-added; training/
is gitignored). Pre-trained query-quality head, schema v1, label_set
t07, training AUC 0.6314 > 0.55 §C2 gate.
- overnight_logs/plr_smoke_off_2026-05-12_1549.json (baseline)
- overnight_logs/plr_smoke_on_2026-05-12_1549.json (candidate)
- overnight_logs/plr_gate_2026-05-12_1549_report.md (full numbers,
method, provenance, helix.toml diff)
Why HTTP-only bench: _compute_plr_confidence (server.py:453) is only
called from the /context/packet route handler (server.py:1634).
benchmarks/bench_packet.py calls build_context_packet directly and so
does NOT exercise PLR. The smoke bench hits the endpoint over real HTTP
so the live_cfg.plr.enabled gate and the PLR closure both fire.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `helix.toml:320` flips `[plr] enabled` from `false` to `true` per #74's PLR gate spec ("train artifact + bench gate before flipping `[plr] enabled = true`").
- The pre-trained `stacked_plr.joblib` query-quality head (schema v1, label_set `t07`, training AUC 0.6314 > 0.55 §C2 gate) now attaches `plr_confidence` to every `/context/packet` response.
- `training/models/stacked_plr.joblib` (force-add, since `training/models` is gitignored).
- `benchmarks/bench_plr_smoke.py` HTTP smoke bench (50 queries; ~64 s/side wall time) plus a `--summarize` mode that emits the PASS/FAIL gate verdict from the two output JSONs.

Gate (all PASS)
| Gate | Threshold | Result |
| --- | --- | --- |
| off-side leakage | = 0 | 0/50 |
| on-side presence | >= 90% | 100% |
| p95 latency delta | < +50 ms | -389 ms |

PLR-on improves p95 by 389 ms in this run. Likely seed noise within N=50, but the requirement was "no degradation" and we are well past that.
Bench numbers
Sample `plr_confidence` payload:

{"prob_B": 0.9123, "logit": 2.3423, "score_A": 0.0877, "high_risk": true, "artifact_label_set": "t07"}

Why an HTTP-only smoke bench
`_compute_plr_confidence` lives at `helix_context/server.py:453` and is only called from the `/context/packet` route at `server.py:1634`. The in-tree `benchmarks/bench_packet.py` calls `build_context_packet` directly and so does NOT exercise PLR. The new smoke bench hits the endpoint over real HTTP so the `live_cfg.plr.enabled` gate and the PLR closure both fire.

Artifacts (in `overnight_logs/`, force-added since the dir is gitignored)
- `plr_gate_2026-05-12_1549_report.md` — full method + numbers + provenance + diff
- `plr_smoke_off_2026-05-12_1549.json` — baseline PLR-off output
- `plr_smoke_on_2026-05-12_1549.json` — candidate PLR-on output

Test plan
- `training/models/stacked_plr.joblib`, committed in this PR)
- `plr_confidence`, p95 captured
- `python benchmarks/bench_plr_smoke.py --summarize <off> <on>`
- `curl /context/packet`) confirms the `plr_confidence` block has the expected schema (`prob_B`, `logit`, `score_A`, `high_risk`, `artifact_label_set`)
- `ok_count = 50/50` (both)

Closes #74.
🤖 Generated with Claude Code