feat: hot-swap .db endpoint + read_only guard#91
Merged
… plan
Three deliverables from the 3-week status audit:
1. **CLAUDE.md rewrite** — matches post-PR-90 package structure:
- "6-step" → "7-stage" pipeline with BGE-M3 dense recall, RRF fusion,
freshness gate, know/miss contract
- 10-module flat table → 16-package table with back-compat shim note
- Config: 5 sections → 17 sections with key settings per section
- Endpoints: grouped into 4 blocks (core, ingestion, identity, diag)
- expression_tokens corrected to 7k (was claiming 12k)
- Genome path corrected to genomes/main/genome.db
2. **OAuth bench harness recovered** from squashed intermediate commit
672ee13 (lost during PR #90 squash). Updated to post-PR-90 naming:
- oauth_fixtures.py: Genome → KnowledgeStore, upsert_gene → upsert_doc
- bench_oauth_scope.py: same + query_genes → query_docs
- bench_oauth_provider.py: no internal imports (pure HTTP + claude -p)
- oauth_task_set.py: task IDs/queries unchanged (bench contract)
All 4 files import cleanly; synthetic scope smoke test passes
(0 cross-party leaks, 100% own-party recall).
3. **README v3 plan** at docs/archive/plans/2026-05-13-readme-v3-plan.md.
Headroom-informed restructure: proof-first positioning, collapsible
details sections, ASCII pipeline diagram, unified agent-surfaces
section, two-column docs table.
Token baseline finding (from bench_rag_vs_sike_tokens.py run):
- README v2 claimed "5.4× median" — actual with ribosome disabled
(current default) is **2.9× median** (compressor OFF = Headroom
Kompress only, target=1000c/doc). The 5.4× was with Claude Haiku
splice active. README v3 plan flags this for honest reporting.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add POST /admin/swap-db to switch the active knowledge store without restarting the server. Supports a read_only flag that makes upsert_doc, link_coactivated, store_harmonic_weights, touch_genes, and log_health no-op silently, preventing bench runs from polluting target genomes.

Changes across 3 files + tests:
- knowledge_store.py: read_only param on __init__, 2-line guards on 5 methods
- server/routes_admin.py: POST /admin/swap-db endpoint (~50 LOC)
- mcp/mcp_server.py: helix_swap_db tool (HTTP proxy to the endpoint)
- tests/test_swap_db.py: 10 tests covering read_only no-ops + endpoint behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
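The read_only guard described in this commit can be pictured with a minimal sketch. The class below is a hypothetical in-memory stand-in, not the project's KnowledgeStore: only the `read_only` flag and the method names `upsert_doc` and `log_health` come from the commit message, and the dict-backed storage and `doc_count` helper are assumptions made for illustration.

```python
class GuardedStore:
    """Hypothetical stand-in for a store with a read_only guard (illustration only)."""

    def __init__(self, read_only: bool = False) -> None:
        self.read_only = read_only
        self._docs = {}          # assumed storage; the real store is a .db file
        self._health_log = []

    def upsert_doc(self, doc_id: str, body: str) -> None:
        # Two-line guard: return silently instead of writing when read-only.
        if self.read_only:
            return
        self._docs[doc_id] = body

    def log_health(self, event: str) -> None:
        # Same guard shape on every mutating method.
        if self.read_only:
            return
        self._health_log.append(event)

    def doc_count(self) -> int:
        # Reads are never guarded, so a bench can still query the fixture.
        return len(self._docs)
```

The guard returns silently rather than raising, which matches the "no-op silently" wording above: calling code written for a writable store keeps running unchanged against a frozen fixture.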
This was referenced May 14, 2026
mbachaud added a commit that referenced this pull request on May 14, 2026
…-grounded retrieval scoring

Bundles the agent-correctness bench infrastructure for the 6-fixture matrix (4 monolithic blobs + 2 sharded). Together these let a single invocation walk every fixture in the matrix, switch modes transparently, and score retrieval-only + agent-correctness side-by-side.

**New: `benchmarks/bench_orchestrator.py`**

`BenchServer` manages uvicorn lifecycle for the fixture matrix:
- Same-mode transitions (blob→blob, sharded→sharded) use the atomic `POST /admin/swap-db` endpoint (PR #91).
- Cross-mode transitions (blob↔sharded) require a full uvicorn restart with `HELIX_USE_SHARDS=1` flipped, because `open_read_source` in `helix_context/sharding.py` reads that env at store-construction time.

The orchestrator picks the right mechanism automatically: bench code calls `srv.switch(fixture)` without caring which path runs. Includes a CLI that loops a manifest and invokes per-fixture bench scripts as subprocesses (`bench_needle`, `bench_needle_1000`). Read-only by default (PR #91 guard) so benches cannot mutate fixtures.

**New: `benchmarks/fixtures.json`**

Canonical 6-fixture manifest derived from `genomes/bench/matrix/frozen.json`. Paths, mode, gene counts, SHA256. Tracked via explicit `.gitignore` exception since `benchmarks/*.json` is broadly excluded.

**New: `benchmarks/bench_claude_matrix.py`**

Originally an untracked harness at `scripts/bench_claude_matrix.py` with sonnet default. This copy adds:
- Default `--model haiku` (was sonnet): cloud speed prioritized over model quality for the retrieval-quality bench. See `memory/bench_default_haiku.md` for the rationale.
- Issue #101: prefers structured `agent.citations[].source` over the legacy `<GENE src="...">` regex; the regex stays as fallback since `context_manager.py:1367` still emits the wrapper.
- `BenchServer` integration: harness manages its own uvicorn by default, including cross-mode restarts. `--external-server` flag preserves the legacy "use a server you've started yourself" flow.
- `sources_via` and `n_citations` diagnostic fields per record so reviewers can see which extraction path produced the sources.

**Modified: `benchmarks/bench_needle.py`**

Issue #101 retrieval-source robustness. The legacy `<GENE src=` regex still works (the splice path emits it) but `agent.citations` is the structured contract; prefer that when available, fall back to the regex for archived JSONL or unusual response shapes. Adds a citations-only fallback (empty bodies but gold_delivered still computable) for cases where the renderer drops per-doc framing.

**Modified: `benchmarks/bench_needle_1000.py`**

Adds `gold_source_delivered`, a citation-grounded retrieval-rate metric that complements the existing permissive payload-word-boundary `retrieved` flag. The delta between them is the phantom-hit zone (value appears in payload but gold source isn't among citations). Aggregates in `summarize()`: new top-level `gold_source_delivery_rate`, new failure mode `phantom_hit`, per-category `gold_delivery_rate`. All additive: pre-2026-05 result rows aggregate cleanly (their missing fields collapse to zero, the honest representation).

**Modified: `CLAUDE.md`**

Stage 2 retrieval description was inaccurate about defaults. Updated: FTS5 + tag lookup + synonym + co-activation + cymatics 256-bin spectrum are all default-on (no neural inference at query time). BGE-M3 dense (`dense_embedding_enabled`) and SPLADE (`splade_enabled`) are default-off transformer-encoded options. Aligns docs with config.

**Modified: `.gitignore`**

Explicit exception for `benchmarks/fixtures.json` so the curated manifest stays tracked under the broad `benchmarks/*.json` exclusion.

**Verification:**
- Syntax-checked all .py files.
- Unit-tested helpers: `parse_delivered_genes` (modern + legacy + fallback), `_gold_source_in_citations` (path normalization), Fixture defaults, haiku default in `bench_claude_matrix`.
- `pytest tests/test_benchmark_monitor_preflight.py`: 8/8 pass (unchanged; preflight fix from PR #102 is baseline here).
- Live smoke run earlier confirmed end-to-end: orchestrator boots uvicorn, hot-swaps to small.db, runs 10-needle bench, terminates cleanly in ~50s.
- Live cross-mode run earlier confirmed: medium (blob) → medium-sharded transition via uvicorn restart works (17.6s); evidence in `benchmarks/results/claude_matrix_20260514T193211Z/`.

**Findings surfaced during validation (filed separately):**
- Issue #104: sharded retrieval returns empty `agent.citations` and surfaces different document sets vs blob mode for the same query (0/10 gold-delivered on medium-sharded vs 4/10 on medium blob). Sub-task spawned to investigate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
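The same-mode versus cross-mode rule in the orchestrator description above can be sketched as a small decision function. This is an illustration under stated assumptions, not the code in `benchmarks/bench_orchestrator.py`: `FixtureSpec`, `pick_transition`, and `restart_env` are hypothetical names, and removing `HELIX_USE_SHARDS` from the environment for blob mode is an assumption (the commit only says the variable is flipped to `1` for sharded mode).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FixtureSpec:
    """Hypothetical manifest entry; field names are assumptions."""
    name: str
    path: str
    mode: str  # "blob" or "sharded"


def pick_transition(current: Optional[FixtureSpec], target: FixtureSpec) -> str:
    """Return "swap" for a same-mode hot swap, "restart" otherwise."""
    if current is None:
        return "restart"  # no server running yet, so one must be started
    if current.mode == target.mode:
        return "swap"     # blob->blob or sharded->sharded: POST /admin/swap-db
    return "restart"      # blob<->sharded: the env var is read at store construction


def restart_env(target: FixtureSpec, base_env: Dict[str, str]) -> Dict[str, str]:
    """Build the environment for a restarted server process."""
    env = dict(base_env)
    if target.mode == "sharded":
        env["HELIX_USE_SHARDS"] = "1"
    else:
        # Assumption: blob mode is selected by leaving the variable unset.
        env.pop("HELIX_USE_SHARDS", None)
    return env
```

Keeping the decision in one pure function means the caller can ask for a fixture without knowing which mechanism runs, which is the property the commit attributes to `srv.switch(fixture)`.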
mbachaud added a commit that referenced this pull request on May 14, 2026
…-grounded retrieval scoring (#105)
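The relationship between the permissive `retrieved` flag and the citation-grounded `gold_source_delivered` flag in the commit message above can be illustrated with a short sketch. It is not the `summarize()` from `benchmarks/bench_needle_1000.py`: the function names, the `"miss"` label, and the `phantom_hit_rate` key are assumptions, while `retrieved`, `gold_source_delivered`, `gold_source_delivery_rate`, and `phantom_hit` come from the commit text.

```python
from typing import Dict, List


def classify(record: Dict) -> str:
    """Label one result row; missing fields are treated as False."""
    gold = bool(record.get("gold_source_delivered", False))
    retrieved = bool(record.get("retrieved", False))
    if gold:
        return "gold_delivered"
    if retrieved:
        # Value appears in the payload, but the gold source is not cited.
        return "phantom_hit"
    return "miss"


def summarize_sketch(records: List[Dict]) -> Dict[str, float]:
    """Aggregate rates over result rows, old and new alike."""
    n = len(records)
    if n == 0:
        return {"retrieval_rate": 0.0,
                "gold_source_delivery_rate": 0.0,
                "phantom_hit_rate": 0.0}
    labels = [classify(r) for r in records]
    retrieved = sum(1 for r in records if r.get("retrieved", False))
    return {
        "retrieval_rate": retrieved / n,
        "gold_source_delivery_rate": labels.count("gold_delivered") / n,
        "phantom_hit_rate": labels.count("phantom_hit") / n,
    }
```

Because absent fields default to `False`, a row written before the metric existed counts as not gold-delivered, which mirrors the "missing fields collapse to zero" behaviour the commit describes.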
Summary
- `POST /admin/swap-db`: hot-swap the knowledge store .db without restarting the server. Accepts `{"path": "genomes/bench/oauth/small.db", "read_only": true}`. Opens new store, rebuilds SEMA cache, atomic-swaps `helix.genome`, closes old connection.
- `read_only` flag on KnowledgeStore: when true, `upsert_doc`, `link_coactivated`, `store_harmonic_weights`, `touch_genes`, `log_health` all no-op silently. Prevents bench runs from polluting bench-target genomes.
- `helix_swap_db` MCP tool: pure HTTP proxy to `/admin/swap-db`, lets Claude Code say "swap to the bench genome" mid-conversation.

Motivation
Bench harnesses (oauth provider bench, `bench_rag_vs_sike_tokens.py`) need to target frozen fixture genomes without restarting the server. Multi-tenant exploration needs the same. The swap-db endpoint + read_only guard makes this a one-call operation.

Changes
- `knowledge_store.py`: `read_only` param + 5 method guards
- `server/routes_admin.py`: `POST /admin/swap-db` endpoint
- `mcp/mcp_server.py`: `helix_swap_db` MCP tool
- `tests/test_swap_db.py`

Test plan
- … (`tests/test_swap_db.py`)
- … `genomes/bench/oauth/small_helix_context_oauth.db`, run oauth bench, swap back

🤖 Generated with Claude Code
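The swap order given in the Summary (open new store, rebuild cache, swap the reference, close the old connection) can be sketched as follows. Everything here is a hypothetical stand-in: `FakeStore`, `Holder`, and `build_cache` are illustration-only names, the lock is an assumed way to make the reference swap atomic, and the cache is a placeholder since the PR does not describe what the SEMA cache contains.

```python
import threading


class FakeStore:
    """Hypothetical store handle; tracks only what the sketch needs."""

    def __init__(self, path: str, read_only: bool = False) -> None:
        self.path = path
        self.read_only = read_only
        self.closed = False

    def close(self) -> None:
        self.closed = True


def build_cache(store: FakeStore) -> dict:
    # Placeholder for the cache rebuild; contents are an assumption.
    return {"built_for": store.path}


class Holder:
    """Owns the active store, in the role the Summary gives to `helix.genome`."""

    def __init__(self, store: FakeStore) -> None:
        self._lock = threading.Lock()
        self.genome = store
        self.cache = build_cache(store)

    def swap_db(self, path: str, read_only: bool = False) -> dict:
        new_store = FakeStore(path, read_only)   # 1. open the new store
        new_cache = build_cache(new_store)       # 2. rebuild the cache for it
        with self._lock:                         # 3. swap both references together
            old_store = self.genome
            self.genome = new_store
            self.cache = new_cache
        old_store.close()                        # 4. close the old connection last
        return {"path": path, "read_only": read_only}
```

Doing the expensive work (opening and cache building) before taking the lock keeps the critical section to two assignments, and closing the old store only after the swap means a failure while opening the new one leaves the previous store active.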