feat: hot-swap .db endpoint + read_only guard #91

Merged: mbachaud merged 2 commits into master from feat/admin-swap-db on May 13, 2026

@mbachaud
Owner

Summary

  • POST /admin/swap-db — hot-swap the knowledge store .db without restarting the server. Accepts {"path": "genomes/bench/oauth/small.db", "read_only": true}. Opens new store, rebuilds SEMA cache, atomic-swaps helix.genome, closes old connection.
  • read_only flag on KnowledgeStore — when true, upsert_doc, link_coactivated, store_harmonic_weights, touch_genes, log_health all no-op silently. Prevents bench runs from polluting bench-target genomes.
  • helix_swap_db MCP tool — pure HTTP proxy to /admin/swap-db, lets Claude Code say "swap to the bench genome" mid-conversation.
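
The swap sequence in the first bullet (open the new store, rebuild the cache, swap the reference, close the old connection) can be sketched roughly as follows. This is a minimal illustration, not the code in `server/routes_admin.py`: `StubStore`, `Helix`, `swap_db`, and the dict used in place of the SEMA cache are all hypothetical names introduced here.

```python
import threading


class StubStore:
    """Hypothetical stand-in for KnowledgeStore; it only tracks open/closed state."""

    def __init__(self, path, read_only=False):
        self.path = path
        self.read_only = read_only
        self.closed = False

    def close(self):
        self.closed = True


class Helix:
    """Hypothetical holder for the active store (the `genome` attribute)."""

    def __init__(self, store):
        self.genome = store
        self.cache = {}
        self._swap_lock = threading.Lock()

    def swap_db(self, path, read_only=False, open_store=StubStore):
        # 1. Open the new store first, so a bad path leaves the old store active.
        new_store = open_store(path, read_only=read_only)
        # 2. Rebuild derived state for the new store (placeholder for the SEMA cache).
        new_cache = {"built_for": new_store.path}
        # 3. Replace both references under a lock so concurrent swaps cannot interleave.
        with self._swap_lock:
            old_store, self.genome = self.genome, new_store
            self.cache = new_cache
        # 4. Close the old connection only after the swap has completed.
        old_store.close()
        return old_store
```

Opening the new store before touching the active reference is the design choice worth noting: if the open step raises, the server keeps serving from the previous store.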

Motivation

Bench harnesses (oauth provider bench, bench_rag_vs_sike_tokens.py) need to target frozen fixture genomes without restarting the server. Multi-tenant exploration needs the same. The swap-db endpoint and the read_only guard together make this a one-call operation.

Changes

| File | LOC | What |
| --- | --- | --- |
| knowledge_store.py | +17 | read_only param + 5 method guards |
| server/routes_admin.py | +77 | POST /admin/swap-db endpoint |
| mcp/mcp_server.py | +29 | helix_swap_db MCP tool |
| tests/test_swap_db.py | +262 | 10 tests (5 read_only unit + 5 endpoint) |

Test plan

  • 10 new tests pass (tests/test_swap_db.py)
  • Full suite: 1943 passed, 15 skipped, 2 xfailed (0 regressions)
  • Live smoke test: swap to genomes/bench/oauth/small_helix_context_oauth.db, run oauth bench, swap back
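
The "swap, run the bench, swap back" flow from the smoke test could be wrapped in a small helper like the one below. It is a sketch under assumptions: the request body follows the `{"path": ..., "read_only": ...}` shape from the summary, while the base URL, the helper names `post_swap` and `swapped_genome`, and the choice to restore with `read_only` set to false are hypothetical.

```python
import json
import urllib.request
from contextlib import contextmanager


def post_swap(base_url, path, read_only):
    """POST the swap request to /admin/swap-db and return the HTTP status code."""
    body = json.dumps({"path": path, "read_only": read_only}).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/admin/swap-db",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


@contextmanager
def swapped_genome(base_url, bench_path, restore_path, post=post_swap):
    """Swap to a read-only bench genome, then swap back even if the bench fails."""
    post(base_url, bench_path, True)
    try:
        yield
    finally:
        # Assumption: the restored store should be writable again.
        post(base_url, restore_path, False)
```

Usage would look like `with swapped_genome("http://127.0.0.1:8000", "genomes/bench/oauth/small_helix_context_oauth.db", "genomes/main/genome.db"): run_bench()`, where the host, port, and `run_bench` are placeholders.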

🤖 Generated with Claude Code

mbachaud and others added 2 commits May 13, 2026 15:43
… plan

Three deliverables from the 3-week status audit:

1. **CLAUDE.md rewrite** — matches post-PR-90 package structure:
   - "6-step" → "7-stage" pipeline with BGE-M3 dense recall, RRF fusion,
     freshness gate, know/miss contract
   - 10-module flat table → 16-package table with back-compat shim note
   - Config: 5 sections → 17 sections with key settings per section
   - Endpoints: grouped into 4 blocks (core, ingestion, identity, diag)
   - expression_tokens corrected to 7k (was claiming 12k)
   - Genome path corrected to genomes/main/genome.db

2. **OAuth bench harness recovered** from squashed intermediate commit
   672ee13 (lost during PR #90 squash). Updated to post-PR-90 naming:
   - oauth_fixtures.py: Genome → KnowledgeStore, upsert_gene → upsert_doc
   - bench_oauth_scope.py: same + query_genes → query_docs
   - bench_oauth_provider.py: no internal imports (pure HTTP + claude -p)
   - oauth_task_set.py: task IDs/queries unchanged (bench contract)
   All 4 files import cleanly; synthetic scope smoke test passes
   (0 cross-party leaks, 100% own-party recall).

3. **README v3 plan** at docs/archive/plans/2026-05-13-readme-v3-plan.md.
   Headroom-informed restructure: proof-first positioning, collapsible
   details sections, ASCII pipeline diagram, unified agent-surfaces
   section, two-column docs table.

Token baseline finding (from bench_rag_vs_sike_tokens.py run):
  - README v2 claimed "5.4× median" — actual with ribosome disabled
    (current default) is **2.9× median** (compressor OFF = Headroom
    Kompress only, target=1000c/doc). The 5.4× was with Claude Haiku
    splice active. README v3 plan flags this for honest reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add POST /admin/swap-db to switch the active knowledge store without
restarting the server. Supports a read_only flag that makes upsert_doc,
link_coactivated, store_harmonic_weights, touch_genes, and log_health
no-op silently, preventing bench runs from polluting target genomes.

Changes across 3 files + tests:
- knowledge_store.py: read_only param on __init__, 2-line guards on 5 methods
- server/routes_admin.py: POST /admin/swap-db endpoint (~50 LOC)
- mcp/mcp_server.py: helix_swap_db tool (HTTP proxy to the endpoint)
- tests/test_swap_db.py: 10 tests covering read_only no-ops + endpoint behavior
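
The "2-line guard" pattern on the write methods might look like the toy class below. Only the `read_only` parameter and the `upsert_doc` method name come from this PR; the SQLite schema, the method body, and `count_docs` are illustrative assumptions, and the other four guarded methods would follow the same shape.

```python
import sqlite3


class KnowledgeStore:
    """Toy store; the schema and method bodies here are illustrative only."""

    def __init__(self, path=":memory:", read_only=False):
        self.read_only = read_only
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS docs (doc_id TEXT PRIMARY KEY, body TEXT)"
        )

    def upsert_doc(self, doc_id, body):
        if self.read_only:
            return  # guard: silently skip the write
        self.conn.execute(
            "INSERT OR REPLACE INTO docs (doc_id, body) VALUES (?, ?)",
            (doc_id, body),
        )
        self.conn.commit()

    def count_docs(self):
        # Hypothetical helper used only to observe the effect of the guard.
        return self.conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
```

Because the guard returns silently instead of raising, bench code that writes as a side effect (for example health logging) keeps running unchanged against a frozen fixture.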

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mbachaud merged commit ca0c44e into master on May 13, 2026
3 checks passed
mbachaud deleted the feat/admin-swap-db branch on May 13, 2026 23:20
mbachaud added a commit that referenced this pull request May 14, 2026
…-grounded retrieval scoring

Bundles the agent-correctness bench infrastructure for the 6-fixture
matrix (4 monolithic blobs + 2 sharded). Together these let a single
invocation walk every fixture in the matrix, switch modes
transparently, and score retrieval-only + agent-correctness side-by-side.

**New: `benchmarks/bench_orchestrator.py`**

`BenchServer` manages uvicorn lifecycle for the fixture matrix:
- Same-mode transitions (blob→blob, sharded→sharded) use the atomic
  `POST /admin/swap-db` endpoint (PR #91).
- Cross-mode transitions (blob↔sharded) require a full uvicorn restart
  with `HELIX_USE_SHARDS=1` flipped, because `open_read_source` in
  `helix_context/sharding.py` reads that env at store-construction time.

The orchestrator picks the right mechanism automatically — bench code
calls `srv.switch(fixture)` without caring which path runs. Includes
a CLI that loops a manifest and invokes per-fixture bench scripts as
subprocesses (`bench_needle`, `bench_needle_1000`).

Read-only by default (PR #91 guard) so benches cannot mutate fixtures.
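
The routing rule described above (hot-swap within a mode, restart across modes) reduces to a small decision function. The sketch below is an assumption about how `srv.switch(fixture)` could decide, not the actual `BenchServer` code: `choose_transition`, `server_env`, and the convention of unsetting `HELIX_USE_SHARDS` for blob mode are hypothetical.

```python
def choose_transition(current_mode, target_mode):
    """Return "swap" for same-mode moves, "restart" otherwise.

    Modes are "blob" or "sharded"; current_mode is None when no server is running.
    """
    if current_mode is None:
        return "restart"  # nothing is running yet, so the server must be booted
    return "swap" if current_mode == target_mode else "restart"


def server_env(base_env, target_mode):
    """Build the environment for a (re)started server process."""
    env = dict(base_env)
    if target_mode == "sharded":
        env["HELIX_USE_SHARDS"] = "1"
    else:
        # Assumption: blob mode corresponds to the variable being absent.
        env.pop("HELIX_USE_SHARDS", None)
    return env
```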

**New: `benchmarks/fixtures.json`**

Canonical 6-fixture manifest derived from
`genomes/bench/matrix/frozen.json`. Paths, mode, gene counts, SHA256.
Tracked via explicit `.gitignore` exception since `benchmarks/*.json`
is broadly excluded.
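
Since the manifest records a SHA256 per fixture, a bench could check that a fixture is still byte-identical before running against it. The sketch below assumes each manifest entry is a dict with `path` and `sha256` keys; the real key names in `benchmarks/fixtures.json` are not shown in this PR, so treat them as hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large .db fixtures are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixture(entry):
    """Return True if the file at entry["path"] matches entry["sha256"] (assumed keys)."""
    return sha256_of(entry["path"]) == entry["sha256"].lower()
```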

**New: `benchmarks/bench_claude_matrix.py`**

Originally an untracked harness at `scripts/bench_claude_matrix.py` with
sonnet default. This copy adds:

- Default `--model haiku` (was sonnet) — cloud speed prioritized over
  model quality for the retrieval-quality bench. See
  `memory/bench_default_haiku.md` for the rationale.
- Issue #101: prefers structured `agent.citations[].source` over the
  legacy `<GENE src="...">` regex; the regex stays as fallback since
  `context_manager.py:1367` still emits the wrapper.
- `BenchServer` integration: harness manages its own uvicorn by
  default, including cross-mode restarts. `--external-server` flag
  preserves the legacy "use a server you've started yourself" flow.
- `sources_via` and `n_citations` diagnostic fields per record so
  reviewers can see which extraction path produced the sources.
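
The "structured citations first, regex as fallback" extraction could be sketched as below. The `agent.citations[].source` path and the `<GENE src="...">` wrapper come from the description above; the function name `extract_sources`, the `rendered_text` argument, and the specific `sources_via` values (`"citations"`, `"regex"`, `"none"`) are assumptions made for illustration.

```python
import re

# Matches the legacy wrapper, e.g. <GENE src="docs/a.md">
GENE_SRC_RE = re.compile(r'<GENE src="([^"]*)"')


def extract_sources(response, rendered_text=""):
    """Prefer structured citations; fall back to the legacy regex on rendered text."""
    citations = (response.get("agent") or {}).get("citations") or []
    sources = [c["source"] for c in citations if c.get("source")]
    if sources:
        via = "citations"
    else:
        sources = GENE_SRC_RE.findall(rendered_text)
        via = "regex" if sources else "none"
    return {"sources": sources, "sources_via": via, "n_citations": len(citations)}
```

Recording which path produced the sources makes it possible to spot results that silently depended on the legacy wrapper.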

**Modified: `benchmarks/bench_needle.py`**

Issue #101 retrieval-source robustness. The legacy `<GENE src=`
regex still works (the splice path emits it) but `agent.citations`
is the structured contract; prefer that when available, fall back
to the regex for archived JSONL or unusual response shapes. Adds a
citations-only fallback (empty bodies but gold_delivered still
computable) for cases where the renderer drops per-doc framing.

**Modified: `benchmarks/bench_needle_1000.py`**

Adds `gold_source_delivered` — a citation-grounded retrieval-rate
metric that complements the existing permissive payload-word-boundary
`retrieved` flag. The delta between them is the phantom-hit zone (value
appears in payload but gold source isn't among citations).
Aggregates in `summarize()`: new top-level `gold_source_delivery_rate`,
new failure mode `phantom_hit`, per-category `gold_delivery_rate`.
All additive — pre-2026-05 result rows aggregate cleanly (their missing
fields collapse to zero, the honest representation).
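
The relationship between the two metrics can be illustrated with a simplified aggregation. This is not the real `summarize()`: the row shape is assumed, and the choice to count a phantom hit only when a row actually carries the `gold_source_delivered` field (so that older rows add nothing to either new number) is an assumption made for this sketch.

```python
def summarize(rows):
    """Aggregate permissive and citation-grounded retrieval over result rows."""
    n = len(rows)
    retrieved = sum(1 for r in rows if r.get("retrieved"))
    # Rows without the field count as not delivered, so they contribute zero.
    delivered = sum(1 for r in rows if r.get("gold_source_delivered"))
    # Phantom hit: value found in the payload, but gold source absent from citations.
    phantom = sum(
        1
        for r in rows
        if r.get("retrieved")
        and "gold_source_delivered" in r
        and not r["gold_source_delivered"]
    )
    return {
        "retrieval_rate": retrieved / n if n else 0.0,
        "gold_source_delivery_rate": delivered / n if n else 0.0,
        "phantom_hit": phantom,
    }
```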

**Modified: `CLAUDE.md`**

Stage 2 retrieval description was inaccurate about defaults. Updated:
FTS5 + tag lookup + synonym + co-activation + cymatics 256-bin spectrum
are all default-on (no neural inference at query time). BGE-M3 dense
(`dense_embedding_enabled`) and SPLADE (`splade_enabled`) are
default-off transformer-encoded options. Aligns docs with config.

**Modified: `.gitignore`**

Explicit exception for `benchmarks/fixtures.json` so the curated
manifest stays tracked under the broad `benchmarks/*.json` exclusion.

**Verification:**

- Syntax-checked all .py files.
- Unit-tested helpers: `parse_delivered_genes` (modern + legacy + fallback),
  `_gold_source_in_citations` (path normalization), Fixture defaults,
  haiku default in `bench_claude_matrix`.
- `pytest tests/test_benchmark_monitor_preflight.py` — 8/8 pass
  (unchanged; preflight fix from PR #102 is baseline here).
- Live smoke run earlier confirmed end-to-end: orchestrator boots
  uvicorn, hot-swaps to small.db, runs 10-needle bench, terminates
  cleanly in ~50s.
- Live cross-mode run earlier confirmed: medium (blob) → medium-sharded
  transition via uvicorn restart works (17.6s); evidence in
  `benchmarks/results/claude_matrix_20260514T193211Z/`.

**Findings surfaced during validation (filed separately):**

- Issue #104: sharded retrieval returns empty `agent.citations` and
  surfaces different document sets vs blob mode for the same query
  (0/10 gold-delivered on medium-sharded vs 4/10 on medium blob).
  Sub-task spawned to investigate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mbachaud added a commit that referenced this pull request May 14, 2026
…-grounded retrieval scoring (#105)
