feat(bench): parallel + sharded-pool ingest for build_fixture_matrix (#92) #94
Merged
Conversation
… plan
Three deliverables from the 3-week status audit:
1. **CLAUDE.md rewrite** — matches post-PR-90 package structure:
- "6-step" → "7-stage" pipeline with BGE-M3 dense recall, RRF fusion,
freshness gate, know/miss contract
- 10-module flat table → 16-package table with back-compat shim note
- Config: 5 sections → 17 sections with key settings per section
- Endpoints: grouped into 4 blocks (core, ingestion, identity, diag)
- expression_tokens corrected to 7k (was claiming 12k)
- Genome path corrected to genomes/main/genome.db
2. **OAuth bench harness recovered** from squashed intermediate commit
672ee13 (lost during PR #90 squash). Updated to post-PR-90 naming:
- oauth_fixtures.py: Genome → KnowledgeStore, upsert_gene → upsert_doc
- bench_oauth_scope.py: same + query_genes → query_docs
- bench_oauth_provider.py: no internal imports (pure HTTP + claude -p)
- oauth_task_set.py: task IDs/queries unchanged (bench contract)
All 4 files import cleanly; synthetic scope smoke test passes
(0 cross-party leaks, 100% own-party recall).
3. **README v3 plan** at docs/archive/plans/2026-05-13-readme-v3-plan.md.
Headroom-informed restructure: proof-first positioning, collapsible
details sections, ASCII pipeline diagram, unified agent-surfaces
section, two-column docs table.
Token baseline finding (from bench_rag_vs_sike_tokens.py run):
- README v2 claimed "5.4× median" — actual with ribosome disabled
(current default) is **2.9× median** (compressor OFF = Headroom
Kompress only, target=1000c/doc). The 5.4× was with Claude Haiku
splice active. README v3 plan flags this for honest reporting.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Headroom-informed rewrite. Major structural changes from v2:
- **Proof-first**: bench numbers table in first 20 lines (was buried in dense paragraph). Honestly reports 2.9× median with compressor OFF (current default), notes 5× with compressor ON, and the 37× session-delivery multiplier. Marked "WIP benchmark numbers" pending final overnight bench on post-Stage pipeline.
- **Progressive disclosure**: Configuration (17 sections), full endpoint reference (~30 endpoints), and package structure (16 packages) are in collapsible `<details>` blocks. Main flow stays scannable.
- **ASCII pipeline diagram**: renders identically on GitHub, PyPI, terminals — no Mermaid dependency. Shows all 7 stages including the know/miss output, legibility headers, session delivery, freshness gate.
- **Unified "Agent surfaces" section**: CLI / MCP / HTTP proxy in one decision table (was scattered across 3 sections).
- **Two-column "Documentation" table**: "Start here" (setup, API, config) vs "Go deeper" (architecture, dimensions, knowledge graph).
- **Time-boxed headers**: "Proof (30 seconds)", "Get started (60 seconds)", "Pipeline (2 minutes)" — reduces reader hesitation.
- **Updated gotchas**: session delivery, fusion mode, BGE-M3 backfill, naming lexicon, know/miss contract. Removed stale compressor-centric items that assumed ribosome is always on.

321 lines (was 304). Same density, much better scannability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bundles the previously-untracked scripts/build_fixture_matrix.py and docs/benchmarks/GENOME_FIXTURE_MATRIX.md (committed via 'git checkout stash^3' from a master-side WIP) together with the implementation plan for issue #92. Subsequent commits modify build_fixture_matrix.py per that plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of issue #92. Adds optional splade_sparse kwarg to sync_splade_index so callers can precompute SPLADE vectors (via splade_backend.encode_batch) and forward the dict instead of letting the indexer call splade_backend.encode inline per gene. Default is None - existing call sites are unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of issue #92. Adds optional splade_sparse kwarg to KnowledgeStore.upsert_doc, forwarded to sync_splade_index. When the parallel ingest path supplies a precomputed sparse dict, the indexer skips its inline splade_backend.encode call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds helix_context/parallel.py with two pool-size helpers used by the upcoming build_fixture_matrix --parallel / --shard-workers paths:
- auto_workers leaves 12.5% CPU headroom + reserves a writer core. On an 8-core 5800X this returns 6.
- auto_shard_workers further caps to vram_total_gb // 4 (each shard-worker holds its own SPLADE model). On a 3080 Ti + 5800X this returns 3.

Both honour explicit --workers / --shard-workers overrides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure function that takes (path, ext) and returns list of Gene model_dump dicts. Both the sequential and (upcoming) parallel ingest paths will call this. The mp.Pool workers serialise the return value across process boundaries, hence dicts not Gene instances. No behaviour change yet -- new code is unreferenced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
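The dicts-not-instances point can be sketched like this — `Gene` here is a plain dataclass stand-in for the real model, with `asdict` mimicking `model_dump()`:

```python
from dataclasses import dataclass, asdict


# Stand-in for the Gene model; asdict plays the role of model_dump().
@dataclass
class Gene:
    gene_id: str
    content: str


def chunk_and_tag_file(path: str, ext: str) -> list[dict]:
    # Return plain dicts rather than Gene instances: mp.Pool pickles every
    # worker return value across the process boundary, and dicts round-trip
    # cheaply without dragging the model class into the worker protocol.
    genes = [Gene(gene_id=f"{path}:{i}", content=f"chunk {i} of {path}")
             for i in range(2)]
    return [asdict(g) for g in genes]
```

Both the sequential path (same process, no pickling) and the parallel path (pickled across the Pool boundary) consume the same dict shape, which is what makes the later parity test meaningful.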
Adds three reusable functions in build_fixture_matrix.py:
- _iter_ingestable_files: walks roots, returns filtered [(path, ext)].
- _drain_with_batched_splade: buffers Gene instances, batch-encodes SPLADE every N (default 64), then calls upsert_doc with the precomputed sparse dict (Phase 1 plumbing).
- _parallel_ingest_to_genome: mp.Pool over files (chunk + tag) feeding the batched-SPLADE writer in the main process.

No call sites yet -- Task 7 wires these into build_profile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
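The batched-drain idea is worth a sketch, since it is the piece that turns Phase 1's kwarg into a throughput win. This is a hypothetical shape — the real `_drain_with_batched_splade` works on Gene instances and the actual backend, whereas here `encode_batch` and `upsert_doc` are injected callables:

```python
def drain_with_batched_splade(genes, encode_batch, upsert_doc, batch_size=64):
    # Buffer incoming gene dicts; every batch_size genes, batch-encode SPLADE
    # once and upsert each gene with its precomputed sparse dict, so the
    # indexer never falls back to per-gene inline encoding.
    buffer: list[dict] = []

    def flush() -> None:
        if not buffer:
            return
        sparse_dicts = encode_batch([g["content"] for g in buffer])
        for gene, sparse in zip(buffer, sparse_dicts):
            upsert_doc(gene, splade_sparse=sparse)
        buffer.clear()

    for gene in genes:
        buffer.append(gene)
        if len(buffer) >= batch_size:
            flush()
    flush()  # final partial batch
```

One model forward pass per batch instead of per document is the whole point; the writer stays single-threaded, so ordering and commit semantics are unchanged.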
build_profile now branches on a parallel kwarg: when True it discovers files via _iter_ingestable_files, dispatches chunk+tag work to an mp.Pool, and drains via the batched-SPLADE writer. When False (default) the original ingest_tree path is unchanged. mp.freeze_support() added to __main__ for Windows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
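A skeletal version of that branch, with hypothetical names — the real `build_profile` does far more (tagging, SPLADE drain, genome writes); this only shows the control flow and the Windows `freeze_support` placement:

```python
import multiprocessing as mp


def _chunk_worker(args: tuple[str, str]) -> list[dict]:
    # Module-level so mp.Pool can pickle it by reference.
    path, ext = args
    return [{"gene_id": f"{path}:0", "content": f"chunk of {path}"}]


def build_profile(files: list[tuple[str, str]],
                  parallel: bool = False, workers: int = 2) -> list[dict]:
    genes: list[dict] = []
    if parallel:
        # Parallel branch: the pool does the CPU-heavy chunk+tag work;
        # the parent process stays the single writer.
        with mp.Pool(workers) as pool:
            for dicts in pool.imap_unordered(_chunk_worker, files):
                genes.extend(dicts)
    else:
        # Default branch: the original sequential path, unchanged.
        for f in files:
            genes.extend(_chunk_worker(f))
    return genes


if __name__ == "__main__":
    mp.freeze_support()  # required on Windows before creating any Pool
    build_profile([("a.md", ".md")], parallel=False)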
Builds a tiny synthetic corpus twice (sequential + parallel) and asserts identical (gene_id, content_hash) sets. Confirms the core acceptance criterion for issue #92: parallel ingest produces a byte-equivalent gene set. Also registers a 'slow' pytest marker for end-to-end tests that load SPLADE / spaCy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
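The parity assertion reduces to set equality on `(gene_id, content_hash)` pairs, which is insensitive to ingest order — exactly what `imap_unordered` scrambles. A hedged sketch of that signature helper (the real test builds actual genomes; `sha256` here is just a plausible stand-in for the content hash):

```python
import hashlib


def gene_signature(genes: list[dict]) -> set[tuple[str, str]]:
    # Order-independent fingerprint of an ingest run: the parallel and
    # sequential paths must produce identical sets of these pairs.
    return {(g["gene_id"], hashlib.sha256(g["content"].encode()).hexdigest())
            for g in genes}
```

Comparing sets rather than lists is what lets the parallel run interleave freely while still proving byte-equivalence of the resulting gene population.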
Replaces _copy_fingerprint_indexes with a complete single-shard build function. _build_one_shard returns the per-shard stats and a fingerprint_payload (list of tuples ready for an executemany into main.db's fingerprint_index). The parent process writes those rows when collecting shard results. This shape supports both serial calls and an mp.Pool of subprocesses in Task 10. _shard_worker_entry is the Pool's per-task callable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
build_profile_sharded now accepts shard_workers and batch_size kwargs. When shard_workers > 1 the per-shard builds run in mp.Pool subprocesses; the parent process collects results via imap_unordered and writes register_shard + fingerprint_index rows under SQLite busy_timeout=30s. shard_workers == 1 keeps the serial behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
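The parent-side write path can be sketched as follows — table schemas and the function name are hypothetical; only the `busy_timeout` serialisation idea is taken from the commit message:

```python
import sqlite3


def write_shard_results(main_db: str, results) -> None:
    # Parent-process writer: all main.db writes funnel through this one
    # connection. busy_timeout=30s makes SQLite wait on a held lock instead
    # of raising "database is locked" while shard subprocesses are active.
    con = sqlite3.connect(main_db)
    try:
        con.execute("PRAGMA busy_timeout = 30000")
        con.execute("CREATE TABLE IF NOT EXISTS shards (name TEXT PRIMARY KEY)")
        con.execute("CREATE TABLE IF NOT EXISTS fingerprint_index"
                    " (fingerprint TEXT, shard TEXT)")
        for shard_name, fingerprint_payload in results:
            con.execute("INSERT OR REPLACE INTO shards VALUES (?)", (shard_name,))
            # fingerprint_payload is the list of tuples prepared per shard,
            # ready for a single executemany here.
            con.executemany("INSERT INTO fingerprint_index VALUES (?, ?)",
                            fingerprint_payload)
        con.commit()
    finally:
        con.close()
```

Shard builds return rows instead of writing them, so only one process ever holds a write transaction on main.db.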
Wires --shard-workers through main() for --mode sharded. When 0 (default), uses helix_context.parallel.auto_shard_workers which caps to min(vram_gb // 4, auto_workers()). Explicit positive values override. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds a 2-root sharded corpus twice (shard_workers=1 vs shard_workers=2) and asserts identical main.db shards table + fingerprint_index rows. Validates that the mp.Pool path produces the same routing state as the serial path under SQLite busy_timeout serialisation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents --parallel, --workers, --shard-workers, --batch-size with concrete examples in the module docstring (and thus argparse --help description). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
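For the docstring-as-help trick described above, an argparse sketch — flag names match the commit messages, but the help strings and `prog` are illustrative:

```python
import argparse

# Module docstring doubles as the --help description when passed through
# RawDescriptionHelpFormatter (preserves the docstring's line breaks).
parser = argparse.ArgumentParser(
    prog="build_fixture_matrix.py",
    description=__doc__,
    formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument("--parallel", action="store_true",
                    help="mp.Pool chunk+tag ingest for monolithic profiles")
parser.add_argument("--workers", type=int, default=0,
                    help="0 = auto_workers() (CPU headroom + writer core)")
parser.add_argument("--shard-workers", type=int, default=0,
                    help="0 = auto_shard_workers() (VRAM-capped); sharded mode only")
parser.add_argument("--batch-size", type=int, default=64,
                    help="SPLADE batch-encode size in the writer")
```

With the `0 = auto` convention, an explicit positive value overrides the sizer, matching the behaviour described for `--shard-workers`.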
Closes #92.
Summary
- `splade_sparse` kwarg on `sync_splade_index` and `upsert_doc` so callers can precompute SPLADE outside the per-document upsert.
- `--parallel` ingest mode for monolithic profiles — `mp.Pool` chunks+tags in parallel; main-process writer batches SPLADE in groups of 64 and commits.
- `--shard-workers N` mode for sharded profiles — N concurrent shard-builds, main.db writes serialised through SQLite `busy_timeout=30s`.
- Auto-sizers (`helix_context.parallel.auto_workers` / `auto_shard_workers`) so the CLI picks a sane default. On the 8-core 5800X + 3080 Ti dev box: `auto_workers() == 6`, `auto_shard_workers() == 3`.

Architecture
Existing default paths (sequential blob, shard_workers=1) are byte-equivalent to master.
Test plan
- `pytest tests/test_splade_precompute.py` — 5 passed (Phase 1 plumbing)
- `pytest tests/test_parallel_sizers.py` — 10 passed (auto-sizers)
- `pytest tests/test_build_fixture_matrix_parallel.py` — 2 passed (both parity tests; ~23s with SPLADE load)
- `pytest tests/ -m "not live and not slow"` — 1954 passed; 4 pre-existing failures in `test_observability_docs.py` (same on master, unrelated)
- `python scripts/build_fixture_matrix.py --profile xl --parallel` should drop wall time from ~3 hr to <45 min on the dev box

Files
- `helix_context/parallel.py`, `tests/test_parallel_sizers.py`, `tests/test_splade_precompute.py`, `tests/test_build_fixture_matrix_parallel.py`
- `scripts/build_fixture_matrix.py`, `docs/benchmarks/GENOME_FIXTURE_MATRIX.md` (previously untracked on master)
- `helix_context/storage/indexes.py`, `helix_context/knowledge_store.py`, `pyproject.toml` (`slow` marker)

Plan:
docs/superpowers/plans/2026-05-13-issue-92-parallel-ingest.md.🤖 Generated with Claude Code