
feat(bench): parallel + sharded-pool ingest for build_fixture_matrix (#92) #94

Merged

mbachaud merged 16 commits into master from feat/92-parallel-ingest on May 14, 2026

Conversation

@mbachaud (Owner)

Closes #92.

Summary

  • Phase 1: optional splade_sparse kwarg on sync_splade_index and upsert_doc so callers can precompute SPLADE outside the per-document upsert.
  • Phase 2: --parallel ingest mode for monolithic profiles — an mp.Pool chunks and tags files in parallel; the main-process writer batches SPLADE in groups of 64 and commits.
  • Phase 3: --shard-workers N mode for sharded profiles — N concurrent shard builds, with main.db writes serialised through SQLite busy_timeout=30s.
  • Auto-sizers (helix_context.parallel.auto_workers / auto_shard_workers) so the CLI picks a sane default. On the 8-core 5800X + 3080 Ti dev box: auto_workers() == 6, auto_shard_workers() == 3.
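The Phase 1 contract can be sketched as follows. This is a minimal stand-in, not the real helix_context signatures: `encode_batch` here is a toy stub for `splade_backend.encode_batch`, and `upsert_doc` is reduced to just the kwarg that matters.

```python
def encode_batch(texts):
    # Toy stand-in for splade_backend.encode_batch: one sparse
    # term -> weight dict per input text.
    return [{w: 1.0 for w in t.split()} for t in texts]

def upsert_doc(doc_id, text, splade_sparse=None):
    # When splade_sparse is supplied, use it as-is instead of encoding
    # inline; the default None keeps the pre-PR behaviour.
    sparse = splade_sparse if splade_sparse is not None else encode_batch([text])[0]
    return {"doc_id": doc_id, "sparse": sparse}

docs = [("d1", "alpha beta"), ("d2", "gamma")]
precomputed = encode_batch([t for _, t in docs])  # one batched call up front
rows = [upsert_doc(i, t, splade_sparse=s) for (i, t), s in zip(docs, precomputed)]
```

The point of the shape: the expensive encode happens once per batch in the caller, and the per-document upsert becomes a cheap forward.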

Architecture

--parallel (blob mode):
  N file workers (chunk + tag)  →  main process writer
                                     (batched SPLADE -> upsert_doc)

--shard-workers (sharded mode):
  M shard workers (each: discover -> chunk + tag -> batched SPLADE -> upsert)
                                  ↓ fingerprint_payload via Pool result
  main process writes shards + fingerprint_index (busy_timeout=30s)

Existing default paths (sequential blob, shard_workers=1) are byte-equivalent to master.

Test plan

  • pytest tests/test_splade_precompute.py — 5 passed (Phase 1 plumbing)
  • pytest tests/test_parallel_sizers.py — 10 passed (auto-sizers)
  • pytest tests/test_build_fixture_matrix_parallel.py — 2 passed (both parity tests; ~23s with SPLADE load)
  • pytest tests/ -m "not live and not slow" — 1954 passed; 4 pre-existing failures in test_observability_docs.py (same on master, unrelated)
  • Manual perf check: python scripts/build_fixture_matrix.py --profile xl --parallel should drop wall time from ~3 hr → <45 min on dev box

Files

  • New: helix_context/parallel.py, tests/test_parallel_sizers.py, tests/test_splade_precompute.py, tests/test_build_fixture_matrix_parallel.py
  • New (bundled baseline): scripts/build_fixture_matrix.py, docs/benchmarks/GENOME_FIXTURE_MATRIX.md (previously untracked on master)
  • Modified: helix_context/storage/indexes.py, helix_context/knowledge_store.py, pyproject.toml (slow marker)

Plan: docs/superpowers/plans/2026-05-13-issue-92-parallel-ingest.md.

🤖 Generated with Claude Code

mbachaud and others added 16 commits May 13, 2026 15:43
… plan

Three deliverables from the 3-week status audit:

1. **CLAUDE.md rewrite** — matches post-PR-90 package structure:
   - "6-step" → "7-stage" pipeline with BGE-M3 dense recall, RRF fusion,
     freshness gate, know/miss contract
   - 10-module flat table → 16-package table with back-compat shim note
   - Config: 5 sections → 17 sections with key settings per section
   - Endpoints: grouped into 4 blocks (core, ingestion, identity, diag)
   - expression_tokens corrected to 7k (was claiming 12k)
   - Genome path corrected to genomes/main/genome.db

2. **OAuth bench harness recovered** from squashed intermediate commit
   672ee13 (lost during PR #90 squash). Updated to post-PR-90 naming:
   - oauth_fixtures.py: Genome → KnowledgeStore, upsert_gene → upsert_doc
   - bench_oauth_scope.py: same + query_genes → query_docs
   - bench_oauth_provider.py: no internal imports (pure HTTP + claude -p)
   - oauth_task_set.py: task IDs/queries unchanged (bench contract)
   All 4 files import cleanly; synthetic scope smoke test passes
   (0 cross-party leaks, 100% own-party recall).

3. **README v3 plan** at docs/archive/plans/2026-05-13-readme-v3-plan.md.
   Headroom-informed restructure: proof-first positioning, collapsible
   details sections, ASCII pipeline diagram, unified agent-surfaces
   section, two-column docs table.

Token baseline finding (from bench_rag_vs_sike_tokens.py run):
  - README v2 claimed "5.4× median" — actual with ribosome disabled
    (current default) is **2.9× median** (compressor OFF = Headroom
    Kompress only, target=1000c/doc). The 5.4× was with Claude Haiku
    splice active. README v3 plan flags this for honest reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Headroom-informed rewrite. Major structural changes from v2:

- **Proof-first**: bench numbers table in first 20 lines (was buried in
  dense paragraph). Honestly reports 2.9× median with compressor OFF
  (current default), notes 5× with compressor ON, and the 37×
  session-delivery multiplier. Marked "WIP benchmark numbers" pending
  final overnight bench on post-Stage pipeline.

- **Progressive disclosure**: Configuration (17 sections), full endpoint
  reference (~30 endpoints), and package structure (16 packages) are in
  collapsible `<details>` blocks. Main flow stays scannable.

- **ASCII pipeline diagram**: renders identically on GitHub, PyPI,
  terminals — no Mermaid dependency. Shows all 7 stages including the
  know/miss output, legibility headers, session delivery, freshness gate.

- **Unified "Agent surfaces" section**: CLI / MCP / HTTP proxy in one
  decision table (was scattered across 3 sections).

- **Two-column "Documentation" table**: "Start here" (setup, API,
  config) vs "Go deeper" (architecture, dimensions, knowledge graph).

- **Time-boxed headers**: "Proof (30 seconds)", "Get started (60
  seconds)", "Pipeline (2 minutes)" — reduces reader hesitation.

- **Updated gotchas**: session delivery, fusion mode, BGE-M3 backfill,
  naming lexicon, know/miss contract. Removed stale compressor-centric
  items that assumed ribosome is always on.

321 lines (was 304). Same density, much better scannability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bundles the previously-untracked scripts/build_fixture_matrix.py and
docs/benchmarks/GENOME_FIXTURE_MATRIX.md (committed via 'git checkout
stash^3' from a master-side WIP) together with the implementation plan
for issue #92. Subsequent commits modify build_fixture_matrix.py per
that plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of issue #92. Adds optional splade_sparse kwarg to
sync_splade_index so callers can precompute SPLADE vectors (via
splade_backend.encode_batch) and forward the dict instead of letting
the indexer call splade_backend.encode inline per gene.

Default is None - existing call sites are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of issue #92. Adds optional splade_sparse kwarg to
KnowledgeStore.upsert_doc, forwarded to sync_splade_index. When the
parallel ingest path supplies a precomputed sparse dict, the indexer
skips its inline splade_backend.encode call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds helix_context/parallel.py with two pool-size helpers used by the
upcoming build_fixture_matrix --parallel / --shard-workers paths:

- auto_workers leaves 12.5% CPU headroom + reserves a writer core.
  On an 8-core 5800X this returns 6.
- auto_shard_workers further caps to vram_total_gb // 4 (each shard-
  worker holds its own SPLADE model). On a 3080 Ti + 5800X this returns 3.

Both honour explicit --workers / --shard-workers overrides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
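The sizing rules above reproduce the quoted numbers. A sketch, assuming the stated 12.5% headroom, one reserved writer core, and a ~4 GB VRAM budget per SPLADE model; the real helpers in helix_context/parallel.py may probe hardware differently.

```python
import os

def auto_workers(cpu_count=None):
    # Leave 12.5% CPU headroom, then reserve one core for the
    # main-process writer. On an 8-core box: int(8 * 0.875) - 1 == 6.
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, int(cpus * 0.875) - 1)

def auto_shard_workers(cpu_count=None, vram_total_gb=12):
    # Each shard worker holds its own SPLADE model, budgeted at ~4 GB
    # of VRAM. A 12 GB card (e.g. a 3080 Ti) caps this at 3.
    return max(1, min(vram_total_gb // 4, auto_workers(cpu_count)))
```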
Pure function that takes (path, ext) and returns list of Gene
model_dump dicts. Both the sequential and (upcoming) parallel
ingest paths will call this. The mp.Pool workers serialise the
return value across process boundaries, hence dicts not Gene
instances.

No behaviour change yet -- new code is unreferenced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
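Why dicts rather than Gene instances can be shown with a stand-in model; `Gene` and `chunk_and_tag` here are illustrative, not the project's real classes. The return value of each Pool task is pickled across the process boundary, and plain dicts do that cheaply without coupling child processes to the model class.

```python
import pickle
from dataclasses import dataclass, asdict

@dataclass
class Gene:
    # Stand-in for the project's Gene model; the real code calls
    # model_dump() to get the same dict shape.
    gene_id: str
    content: str

def chunk_and_tag(args):
    # Pure worker: (path, ext) in, plain dicts out.
    path, ext = args
    return [asdict(Gene(gene_id=f"{path}:0", content=f"stub {ext}"))]

payload = chunk_and_tag(("a.py", ".py"))
restored = pickle.loads(pickle.dumps(payload))  # what mp.Pool does per result
```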
Adds three reusable functions in build_fixture_matrix.py:

- _iter_ingestable_files: walks roots, returns filtered [(path, ext)].
- _drain_with_batched_splade: buffers Gene instances, batch-encodes
  SPLADE every N (default 64), then calls upsert_doc with the
  precomputed sparse dict (Phase 1 plumbing).
- _parallel_ingest_to_genome: mp.Pool over files (chunk + tag) feeding
  the batched-SPLADE writer in the main process.

No call sites yet -- Task 7 wires these into build_profile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
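The buffering in _drain_with_batched_splade can be sketched like this. `encode_batch` and `upsert_doc` are injected as parameters purely for illustration; the real function works on Gene instances and the actual backend.

```python
def drain_with_batched_splade(genes, encode_batch, upsert_doc, batch_size=64):
    # Buffer incoming genes; every batch_size, batch-encode SPLADE and
    # upsert each gene with its precomputed sparse dict (Phase 1 plumbing).
    buffer = []
    for gene in genes:
        buffer.append(gene)
        if len(buffer) >= batch_size:
            _flush(buffer, encode_batch, upsert_doc)
            buffer.clear()
    if buffer:  # final partial batch
        _flush(buffer, encode_batch, upsert_doc)

def _flush(buffer, encode_batch, upsert_doc):
    sparse_dicts = encode_batch([g["content"] for g in buffer])
    for gene, sparse in zip(buffer, sparse_dicts):
        upsert_doc(gene, splade_sparse=sparse)
```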
build_profile now branches on a parallel kwarg: when True it discovers
files via _iter_ingestable_files, dispatches chunk+tag work to an
mp.Pool, and drains via the batched-SPLADE writer. When False (default)
the original ingest_tree path is unchanged.

mp.freeze_support() added to __main__ for Windows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
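The branch shape, reduced to a stand-in: `ingest_one` stands for the chunk+tag worker, and the real parallel path drains through the batched-SPLADE writer rather than into a plain list.

```python
import multiprocessing as mp

def build_profile(files, ingest_one, parallel=False, workers=4):
    # parallel=False keeps the original sequential path untouched;
    # parallel=True fans the per-file work out to an mp.Pool and drains
    # results in the main process.
    if not parallel:
        return [ingest_one(f) for f in files]
    with mp.Pool(workers) as pool:
        return list(pool.imap(ingest_one, files))

if __name__ == "__main__":
    # Required before any Pool use in frozen Windows executables;
    # a no-op elsewhere.
    mp.freeze_support()
```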
Builds a tiny synthetic corpus twice (sequential + parallel) and
asserts identical (gene_id, content_hash) sets. Confirms the core
acceptance criterion for issue #92: parallel ingest produces a
byte-equivalent gene set.

Also registers a 'slow' pytest marker for end-to-end tests that
load SPLADE / spaCy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
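The parity test's shape, with a hypothetical `build` helper standing in for driving build_fixture_matrix twice: comparing sets of (gene_id, content_hash) makes the assertion insensitive to ingest order, which is exactly what the parallel path perturbs.

```python
import hashlib

def build(corpus, parallel):
    # Stand-in builder; the real test builds a tiny synthetic corpus
    # through the sequential and parallel ingest paths.
    order = reversed(corpus) if parallel else corpus  # order may differ
    return {(gid, hashlib.sha256(text.encode()).hexdigest())
            for gid, text in order}

def test_parallel_parity():
    corpus = [("g1", "alpha"), ("g2", "beta")]
    assert build(corpus, parallel=False) == build(corpus, parallel=True)
```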
Replaces _copy_fingerprint_indexes with a complete single-shard
build function. _build_one_shard returns the per-shard stats and
a fingerprint_payload (list of tuples ready for an executemany
into main.db's fingerprint_index). The parent process writes
those rows when collecting shard results.

This shape supports both serial calls and an mp.Pool of subprocesses
in Task 10. _shard_worker_entry is the Pool's per-task callable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
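The result shape can be sketched with sqlite3 directly; the tuple layout of fingerprint_payload here is illustrative, not the real schema.

```python
import sqlite3

def _build_one_shard(shard_id, docs):
    # Stand-in for the per-shard build: return stats plus rows ready
    # for an executemany into main.db's fingerprint_index. Returning
    # plain tuples keeps the value picklable across the Pool boundary.
    fingerprint_payload = [(doc_id, shard_id) for doc_id, _ in docs]
    return {"shard": shard_id, "docs": len(docs)}, fingerprint_payload

main_db = sqlite3.connect(":memory:")
main_db.execute("CREATE TABLE fingerprint_index (doc_id TEXT, shard_id TEXT)")
stats, payload = _build_one_shard("s0", [("d1", "x"), ("d2", "y")])
# Parent process writes the rows when collecting shard results:
main_db.executemany("INSERT INTO fingerprint_index VALUES (?, ?)", payload)
```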
build_profile_sharded now accepts shard_workers and batch_size kwargs.
When shard_workers > 1 the per-shard builds run in mp.Pool subprocesses;
the parent process collects results via imap_unordered and writes
register_shard + fingerprint_index rows under SQLite busy_timeout=30s.

shard_workers == 1 keeps the serial behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
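The parent-side write path can be sketched as follows. `collect_shard_results` is a hypothetical name, and `results` stands for whatever `pool.imap_unordered(_shard_worker_entry, tasks)` yields; the key move is funnelling all main.db writes through one connection with busy_timeout set.

```python
import sqlite3

def collect_shard_results(results, main_db_path):
    # Serialise all main.db writes through a single connection. With a
    # 30 s busy_timeout, a contending writer blocks and retries instead
    # of raising "database is locked".
    con = sqlite3.connect(main_db_path)
    con.execute("PRAGMA busy_timeout = 30000")  # milliseconds
    con.execute("CREATE TABLE IF NOT EXISTS shards (shard_id TEXT)")
    with con:  # one transaction for the collected results
        for stats, payload in results:
            con.execute("INSERT INTO shards VALUES (?)", (stats["shard"],))
    return con
```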
Wires --shard-workers through main() for --mode sharded. When 0
(default), uses helix_context.parallel.auto_shard_workers which
caps to min(vram_gb // 4, auto_workers()). Explicit positive values
override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds a 2-root sharded corpus twice (shard_workers=1 vs shard_workers=2)
and asserts identical main.db shards table + fingerprint_index rows.

Validates that the mp.Pool path produces the same routing state as
the serial path under SQLite busy_timeout serialisation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents --parallel, --workers, --shard-workers, --batch-size with
concrete examples in the module docstring (and thus argparse --help
description).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Parallel ingest path for build_fixture_matrix.py (auto-sized worker pool)
