feat: hot-swap .db endpoint + read_only guard by mbachaud · Pull Request #91 · mbachaud/helix-context

mbachaud · 2026-05-13T23:17:06Z

Summary

- POST /admin/swap-db: hot-swap the knowledge store .db without restarting the server. Accepts {"path": "genomes/bench/oauth/small.db", "read_only": true}. Opens the new store, rebuilds the SEMA cache, atomically swaps helix.genome, then closes the old connection.
- read_only flag on KnowledgeStore: when true, upsert_doc, link_coactivated, store_harmonic_weights, touch_genes, and log_health all no-op silently. Prevents bench runs from polluting bench-target genomes.
- helix_swap_db MCP tool: pure HTTP proxy to /admin/swap-db; lets Claude Code say "swap to the bench genome" mid-conversation.
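The swap sequence described above (open new, rebuild cache, swap, close old) can be sketched as follows. This is a minimal illustration and not the code from `server/routes_admin.py`: `FakeStore`, `App`, `swap_db`, and the `rebuild_cache` callback are hypothetical names, and the lock is an assumption about how the swap might be serialized.

```python
import threading


class FakeStore:
    """Hypothetical stand-in for the real knowledge store."""

    def __init__(self, path, read_only=False):
        self.path = path
        self.read_only = read_only
        self.closed = False

    def close(self):
        self.closed = True


class App:
    """Hypothetical holder for the active store (the role helix.genome plays)."""

    def __init__(self, store):
        self.genome = store
        self._swap_lock = threading.Lock()

    def swap_db(self, path, read_only=False, rebuild_cache=lambda store: None):
        # 1. Open the new store first, so a failure here leaves the old one active.
        new_store = FakeStore(path, read_only=read_only)
        # 2. Rebuild derived caches against the new store before exposing it.
        rebuild_cache(new_store)
        # 3. Replace the reference under a lock so concurrent swaps serialize.
        with self._swap_lock:
            old_store, self.genome = self.genome, new_store
        # 4. Close the old connection only after the reference has moved.
        old_store.close()
        return old_store
```

Ordering matters in this sketch: because the old store is closed last, a request that arrives after step 3 never sees a closed connection.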

Motivation

Bench harnesses (oauth provider bench, bench_rag_vs_sike_tokens.py) need to target frozen fixture genomes without restarting the server. Multi-tenant exploration needs the same. The swap-db endpoint + read_only guard makes this a one-call operation.

Changes

| File | LOC | What |
| --- | --- | --- |
| `knowledge_store.py` | +17 | `read_only` param + 5 method guards |
| `server/routes_admin.py` | +77 | `POST /admin/swap-db` endpoint |
| `mcp/mcp_server.py` | +29 | `helix_swap_db` MCP tool |
| `tests/test_swap_db.py` | +262 | 10 tests (5 read_only unit + 5 endpoint) |
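The `read_only` method guards can be illustrated with a small sketch. `GuardedStore` is a hypothetical class, and the `upsert_doc` and `query_docs` signatures here are assumptions made for the example; only the guard pattern itself (early return on writes, reads untouched) reflects what the PR describes.

```python
class GuardedStore:
    """Hypothetical store illustrating the read_only guard pattern."""

    def __init__(self, read_only=False):
        self.read_only = read_only
        self.docs = {}

    def upsert_doc(self, doc_id, text):
        # Guard: silently skip the write when the store is read-only.
        if self.read_only:
            return None
        self.docs[doc_id] = text
        return doc_id

    def query_docs(self, needle):
        # Reads are never guarded.
        return [doc_id for doc_id, text in self.docs.items() if needle in text]


store = GuardedStore()
store.upsert_doc("a", "hello world")   # written
store.read_only = True
store.upsert_doc("b", "hello again")   # silently ignored
```

A silent no-op keeps bench code unchanged between writable and frozen genomes, at the cost that a caller cannot tell a skipped write from a successful one unless it checks the return value.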

Test plan

- 10 new tests pass (tests/test_swap_db.py)
- Full suite: 1943 passed, 15 skipped, 2 xfailed (0 regressions)
- Live smoke test: swap to genomes/bench/oauth/small_helix_context_oauth.db, run oauth bench, swap back

🤖 Generated with Claude Code

… plan

Three deliverables from the 3-week status audit:

1. **CLAUDE.md rewrite**, matching the post-PR-90 package structure:
   - "6-step" → "7-stage" pipeline with BGE-M3 dense recall, RRF fusion, freshness gate, know/miss contract
   - 10-module flat table → 16-package table with back-compat shim note
   - Config: 5 sections → 17 sections with key settings per section
   - Endpoints: grouped into 4 blocks (core, ingestion, identity, diag)
   - expression_tokens corrected to 7k (was claiming 12k)
   - Genome path corrected to genomes/main/genome.db
2. **OAuth bench harness recovered** from squashed intermediate commit 672ee13 (lost during PR #90 squash). Updated to post-PR-90 naming:
   - oauth_fixtures.py: Genome → KnowledgeStore, upsert_gene → upsert_doc
   - bench_oauth_scope.py: same + query_genes → query_docs
   - bench_oauth_provider.py: no internal imports (pure HTTP + claude -p)
   - oauth_task_set.py: task IDs/queries unchanged (bench contract)

   All 4 files import cleanly; synthetic scope smoke test passes (0 cross-party leaks, 100% own-party recall).
3. **README v3 plan** at docs/archive/plans/2026-05-13-readme-v3-plan.md. Headroom-informed restructure: proof-first positioning, collapsible details sections, ASCII pipeline diagram, unified agent-surfaces section, two-column docs table.

Token baseline finding (from bench_rag_vs_sike_tokens.py run):
- README v2 claimed "5.4× median"; the actual figure with ribosome disabled (current default) is **2.9× median** (compressor OFF = Headroom Kompress only, target=1000c/doc). The 5.4× was with Claude Haiku splice active. README v3 plan flags this for honest reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add POST /admin/swap-db to switch the active knowledge store without restarting the server. Supports a read_only flag that makes upsert_doc, link_coactivated, store_harmonic_weights, touch_genes, and log_health no-op silently, preventing bench runs from polluting target genomes.

Changes across 3 files + tests:
- knowledge_store.py: read_only param on __init__, 2-line guards on 5 methods
- server/routes_admin.py: POST /admin/swap-db endpoint (~50 LOC)
- mcp/mcp_server.py: helix_swap_db tool (HTTP proxy to the endpoint)
- tests/test_swap_db.py: 10 tests covering read_only no-ops + endpoint behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
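For readers who want to call the endpoint directly rather than through the MCP tool, the request could be built along these lines. This is a sketch using only the standard library; the JSON body matches the payload shown in the PR summary, while the helper name `build_swap_request`, the base URL, and the port are assumptions. The response shape is not documented in this PR, so the sketch does not parse it.

```python
import json
import urllib.request


def build_swap_request(base_url, path, read_only=True):
    """Build (but do not send) a POST request for /admin/swap-db."""
    body = json.dumps({"path": path, "read_only": read_only}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/admin/swap-db",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed host and port; adjust to wherever the server is listening.
req = build_swap_request("http://127.0.0.1:8000", "genomes/bench/oauth/small.db")

# To send it against a running server:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```

Defaulting `read_only` to `True` in the helper is a choice made for this example, on the reasoning that bench callers are the main audience and should not mutate fixtures by accident.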

…-grounded retrieval scoring

Bundles the agent-correctness bench infrastructure for the 6-fixture matrix (4 monolithic blobs + 2 sharded). Together these let a single invocation walk every fixture in the matrix, switch modes transparently, and score retrieval-only + agent-correctness side-by-side.

**New: `benchmarks/bench_orchestrator.py`**

`BenchServer` manages uvicorn lifecycle for the fixture matrix:
- Same-mode transitions (blob→blob, sharded→sharded) use the atomic `POST /admin/swap-db` endpoint (PR #91).
- Cross-mode transitions (blob↔sharded) require a full uvicorn restart with `HELIX_USE_SHARDS=1` flipped, because `open_read_source` in `helix_context/sharding.py` reads that env at store-construction time.

The orchestrator picks the right mechanism automatically: bench code calls `srv.switch(fixture)` without caring which path runs. Includes a CLI that loops a manifest and invokes per-fixture bench scripts as subprocesses (`bench_needle`, `bench_needle_1000`). Read-only by default (PR #91 guard) so benches cannot mutate fixtures.

**New: `benchmarks/fixtures.json`**

Canonical 6-fixture manifest derived from `genomes/bench/matrix/frozen.json`. Paths, mode, gene counts, SHA256. Tracked via explicit `.gitignore` exception since `benchmarks/*.json` is broadly excluded.

**New: `benchmarks/bench_claude_matrix.py`**

Originally an untracked harness at `scripts/bench_claude_matrix.py` with sonnet default. This copy adds:
- Default `--model haiku` (was sonnet): cloud speed prioritized over model quality for the retrieval-quality bench. See `memory/bench_default_haiku.md` for the rationale.
- Issue #101: prefers structured `agent.citations[].source` over the legacy `<GENE src="...">` regex; the regex stays as fallback since `context_manager.py:1367` still emits the wrapper.
- `BenchServer` integration: harness manages its own uvicorn by default, including cross-mode restarts. `--external-server` flag preserves the legacy "use a server you've started yourself" flow.
- `sources_via` and `n_citations` diagnostic fields per record so reviewers can see which extraction path produced the sources.

**Modified: `benchmarks/bench_needle.py`**

Issue #101 retrieval-source robustness. The legacy `<GENE src=` regex still works (the splice path emits it) but `agent.citations` is the structured contract; prefer that when available, fall back to the regex for archived JSONL or unusual response shapes. Adds a citations-only fallback (empty bodies but gold_delivered still computable) for cases where the renderer drops per-doc framing.

**Modified: `benchmarks/bench_needle_1000.py`**

Adds `gold_source_delivered`, a citation-grounded retrieval-rate metric that complements the existing permissive payload-word-boundary `retrieved` flag. The delta between them is the phantom-hit zone (value appears in payload but gold source isn't among citations). Aggregates in `summarize()`: new top-level `gold_source_delivery_rate`, new failure mode `phantom_hit`, per-category `gold_delivery_rate`. All additive: pre-2026-05 result rows aggregate cleanly (their missing fields collapse to zero, the honest representation).

**Modified: `CLAUDE.md`**

Stage 2 retrieval description was inaccurate about defaults. Updated: FTS5 + tag lookup + synonym + co-activation + cymatics 256-bin spectrum are all default-on (no neural inference at query time). BGE-M3 dense (`dense_embedding_enabled`) and SPLADE (`splade_enabled`) are default-off transformer-encoded options. Aligns docs with config.

**Modified: `.gitignore`**

Explicit exception for `benchmarks/fixtures.json` so the curated manifest stays tracked under the broad `benchmarks/*.json` exclusion.

**Verification:**
- Syntax-checked all .py files.
- Unit-tested helpers: `parse_delivered_genes` (modern + legacy + fallback), `_gold_source_in_citations` (path normalization), Fixture defaults, haiku default in `bench_claude_matrix`.
- `pytest tests/test_benchmark_monitor_preflight.py`: 8/8 pass (unchanged; preflight fix from PR #102 is baseline here).
- Live smoke run earlier confirmed end-to-end: orchestrator boots uvicorn, hot-swaps to small.db, runs 10-needle bench, terminates cleanly in ~50s.
- Live cross-mode run earlier confirmed: medium (blob) → medium-sharded transition via uvicorn restart works (17.6s); evidence in `benchmarks/results/claude_matrix_20260514T193211Z/`.

**Findings surfaced during validation (filed separately):**
- Issue #104: sharded retrieval returns empty `agent.citations` and surfaces different document sets vs blob mode for the same query (0/10 gold-delivered on medium-sharded vs 4/10 on medium blob). Sub-task spawned to investigate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
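The same-mode versus cross-mode rule from the orchestrator description can be written as a small decision function. This is a hypothetical sketch: `pick_transition` and its string return values are illustrative names, not the `BenchServer` API, and only the rule itself (swap within a mode, restart across modes) comes from the commit message.

```python
VALID_MODES = {"blob", "sharded"}


def pick_transition(current_mode, target_mode):
    """Return 'swap' for same-mode transitions, 'restart' for cross-mode ones."""
    if current_mode not in VALID_MODES or target_mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {current_mode!r} -> {target_mode!r}")
    # Same mode: the running server can hot-swap via POST /admin/swap-db.
    # Different mode: HELIX_USE_SHARDS is read at store-construction time,
    # so the server has to be restarted with the flag flipped.
    return "swap" if current_mode == target_mode else "restart"


plan = [pick_transition(a, b) for a, b in [("blob", "blob"), ("blob", "sharded")]]
```

Ordering a fixture manifest so that same-mode fixtures run back to back would keep the number of restarts to a minimum, since each cross-mode transition costs a full server boot.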

mbachaud and others added 2 commits May 13, 2026 15:43

mbachaud merged commit ca0c44e into master May 13, 2026
3 checks passed

mbachaud deleted the feat/admin-swap-db branch May 13, 2026 23:20

This was referenced May 14, 2026

ShardedGenomeAdapter API has drifted behind current Genome surface — /context 500s + swap-db crashes #98

Closed

feat(bench): matrix orchestrator + claude -p Haiku harness + citation scoring #105

Merged
