
feat(bench): parallel + sharded-pool ingest for build_fixture_matrix (#92) #94

Merged

mbachaud merged 16 commits into master from feat/92-parallel-ingest on May 14, 2026

Conversation

@mbachaud (Owner)

Closes #92.

Summary

  • Phase 1: optional splade_sparse kwarg on sync_splade_index and upsert_doc so callers can precompute SPLADE outside the per-document upsert.
  • Phase 2: --parallel ingest mode for monolithic profiles — an mp.Pool chunks and tags files in parallel; the main-process writer batches SPLADE in groups of 64 and commits.
  • Phase 3: --shard-workers N mode for sharded profiles — N concurrent shard builds, with main.db writes serialised through SQLite busy_timeout=30s.
  • Auto-sizers (helix_context.parallel.auto_workers / auto_shard_workers) so the CLI picks a sane default. On the 8-core 5800X + 3080 Ti dev box: auto_workers() == 6, auto_shard_workers() == 3.
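The Phase 1 contract can be sketched as follows. This is a minimal stand-in, not the real helix_context signatures: `encode_batch` here is a toy stub for `splade_backend.encode_batch`, and `upsert_doc` is reduced to just the kwarg that matters.

```python
def encode_batch(texts):
    # Toy stand-in for splade_backend.encode_batch: one sparse
    # term -> weight dict per input text.
    return [{w: 1.0 for w in t.split()} for t in texts]

def upsert_doc(doc_id, text, splade_sparse=None):
    # When splade_sparse is supplied, use it as-is instead of encoding
    # inline; the default None keeps the pre-PR behaviour.
    sparse = splade_sparse if splade_sparse is not None else encode_batch([text])[0]
    return {"doc_id": doc_id, "sparse": sparse}

docs = [("d1", "alpha beta"), ("d2", "gamma")]
precomputed = encode_batch([t for _, t in docs])  # one batched call up front
rows = [upsert_doc(i, t, splade_sparse=s) for (i, t), s in zip(docs, precomputed)]
```

The point of the shape: the expensive encode happens once per batch in the caller, and the per-document upsert becomes a cheap forward.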

Architecture

--parallel (blob mode):
  N file workers (chunk + tag)  →  main process writer
                                     (batched SPLADE -> upsert_doc)

--shard-workers (sharded mode):
  M shard workers (each: discover -> chunk + tag -> batched SPLADE -> upsert)
                                  ↓ fingerprint_payload via Pool result
  main process writes shards + fingerprint_index (busy_timeout=30s)

Existing default paths (sequential blob, shard_workers=1) are byte-equivalent to master.

Test plan

  • pytest tests/test_splade_precompute.py — 5 passed (Phase 1 plumbing)
  • pytest tests/test_parallel_sizers.py — 10 passed (auto-sizers)
  • pytest tests/test_build_fixture_matrix_parallel.py — 2 passed (both parity tests; ~23s with SPLADE load)
  • pytest tests/ -m "not live and not slow" — 1954 passed; 4 pre-existing failures in test_observability_docs.py (same on master, unrelated)
  • Manual perf check: python scripts/build_fixture_matrix.py --profile xl --parallel should drop wall time from ~3 hr → <45 min on dev box

Files

  • New: helix_context/parallel.py, tests/test_parallel_sizers.py, tests/test_splade_precompute.py, tests/test_build_fixture_matrix_parallel.py
  • New (bundled baseline): scripts/build_fixture_matrix.py, docs/benchmarks/GENOME_FIXTURE_MATRIX.md (previously untracked on master)
  • Modified: helix_context/storage/indexes.py, helix_context/knowledge_store.py, pyproject.toml (slow marker)

Plan: docs/superpowers/plans/2026-05-13-issue-92-parallel-ingest.md.

🤖 Generated with Claude Code

mbachaud and others added 16 commits May 13, 2026 15:43
… plan

Three deliverables from the 3-week status audit:

1. **CLAUDE.md rewrite** — matches post-PR-90 package structure:
   - "6-step" → "7-stage" pipeline with BGE-M3 dense recall, RRF fusion,
     freshness gate, know/miss contract
   - 10-module flat table → 16-package table with back-compat shim note
   - Config: 5 sections → 17 sections with key settings per section
   - Endpoints: grouped into 4 blocks (core, ingestion, identity, diag)
   - expression_tokens corrected to 7k (was claiming 12k)
   - Genome path corrected to genomes/main/genome.db

2. **OAuth bench harness recovered** from squashed intermediate commit
   672ee13 (lost during PR #90 squash). Updated to post-PR-90 naming:
   - oauth_fixtures.py: Genome → KnowledgeStore, upsert_gene → upsert_doc
   - bench_oauth_scope.py: same + query_genes → query_docs
   - bench_oauth_provider.py: no internal imports (pure HTTP + claude -p)
   - oauth_task_set.py: task IDs/queries unchanged (bench contract)
   All 4 files import cleanly; synthetic scope smoke test passes
   (0 cross-party leaks, 100% own-party recall).

3. **README v3 plan** at docs/archive/plans/2026-05-13-readme-v3-plan.md.
   Headroom-informed restructure: proof-first positioning, collapsible
   details sections, ASCII pipeline diagram, unified agent-surfaces
   section, two-column docs table.

Token baseline finding (from bench_rag_vs_sike_tokens.py run):
  - README v2 claimed "5.4× median" — actual with ribosome disabled
    (current default) is **2.9× median** (compressor OFF = Headroom
    Kompress only, target=1000c/doc). The 5.4× was with Claude Haiku
    splice active. README v3 plan flags this for honest reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Headroom-informed rewrite. Major structural changes from v2:

- **Proof-first**: bench numbers table in first 20 lines (was buried in
  dense paragraph). Honestly reports 2.9× median with compressor OFF
  (current default), notes 5× with compressor ON, and the 37×
  session-delivery multiplier. Marked "WIP benchmark numbers" pending
  final overnight bench on post-Stage pipeline.

- **Progressive disclosure**: Configuration (17 sections), full endpoint
  reference (~30 endpoints), and package structure (16 packages) are in
  collapsible `<details>` blocks. Main flow stays scannable.

- **ASCII pipeline diagram**: renders identically on GitHub, PyPI,
  terminals — no Mermaid dependency. Shows all 7 stages including the
  know/miss output, legibility headers, session delivery, freshness gate.

- **Unified "Agent surfaces" section**: CLI / MCP / HTTP proxy in one
  decision table (was scattered across 3 sections).

- **Two-column "Documentation" table**: "Start here" (setup, API,
  config) vs "Go deeper" (architecture, dimensions, knowledge graph).

- **Time-boxed headers**: "Proof (30 seconds)", "Get started (60
  seconds)", "Pipeline (2 minutes)" — reduces reader hesitation.

- **Updated gotchas**: session delivery, fusion mode, BGE-M3 backfill,
  naming lexicon, know/miss contract. Removed stale compressor-centric
  items that assumed ribosome is always on.

321 lines (was 304). Same density, much better scannability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bundles the previously-untracked scripts/build_fixture_matrix.py and
docs/benchmarks/GENOME_FIXTURE_MATRIX.md (committed via 'git checkout
stash^3' from a master-side WIP) together with the implementation plan
for issue #92. Subsequent commits modify build_fixture_matrix.py per
that plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of issue #92. Adds optional splade_sparse kwarg to
sync_splade_index so callers can precompute SPLADE vectors (via
splade_backend.encode_batch) and forward the dict instead of letting
the indexer call splade_backend.encode inline per gene.

Default is None - existing call sites are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of issue #92. Adds optional splade_sparse kwarg to
KnowledgeStore.upsert_doc, forwarded to sync_splade_index. When the
parallel ingest path supplies a precomputed sparse dict, the indexer
skips its inline splade_backend.encode call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds helix_context/parallel.py with two pool-size helpers used by the
upcoming build_fixture_matrix --parallel / --shard-workers paths:

- auto_workers leaves 12.5% CPU headroom + reserves a writer core.
  On an 8-core 5800X this returns 6.
- auto_shard_workers further caps to vram_total_gb // 4 (each shard-
  worker holds its own SPLADE model). On a 3080 Ti + 5800X this returns 3.

Both honour explicit --workers / --shard-workers overrides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
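The sizing rules above reproduce the quoted numbers. A sketch, assuming the stated 12.5% headroom, one reserved writer core, and a ~4 GB VRAM budget per SPLADE model; the real helpers in helix_context/parallel.py may probe hardware differently.

```python
import os

def auto_workers(cpu_count=None):
    # Leave 12.5% CPU headroom, then reserve one core for the
    # main-process writer. On an 8-core box: int(8 * 0.875) - 1 == 6.
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, int(cpus * 0.875) - 1)

def auto_shard_workers(cpu_count=None, vram_total_gb=12):
    # Each shard worker holds its own SPLADE model, budgeted at ~4 GB
    # of VRAM. A 12 GB card (e.g. a 3080 Ti) caps this at 3.
    return max(1, min(vram_total_gb // 4, auto_workers(cpu_count)))
```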
Pure function that takes (path, ext) and returns list of Gene
model_dump dicts. Both the sequential and (upcoming) parallel
ingest paths will call this. The mp.Pool workers serialise the
return value across process boundaries, hence dicts not Gene
instances.

No behaviour change yet -- new code is unreferenced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
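Why dicts rather than Gene instances can be shown with a stand-in model; `Gene` and `chunk_and_tag` here are illustrative, not the project's real classes. The return value of each Pool task is pickled across the process boundary, and plain dicts do that cheaply without coupling child processes to the model class.

```python
import pickle
from dataclasses import dataclass, asdict

@dataclass
class Gene:
    # Stand-in for the project's Gene model; the real code calls
    # model_dump() to get the same dict shape.
    gene_id: str
    content: str

def chunk_and_tag(args):
    # Pure worker: (path, ext) in, plain dicts out.
    path, ext = args
    return [asdict(Gene(gene_id=f"{path}:0", content=f"stub {ext}"))]

payload = chunk_and_tag(("a.py", ".py"))
restored = pickle.loads(pickle.dumps(payload))  # what mp.Pool does per result
```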
Adds three reusable functions in build_fixture_matrix.py:

- _iter_ingestable_files: walks roots, returns filtered [(path, ext)].
- _drain_with_batched_splade: buffers Gene instances, batch-encodes
  SPLADE every N (default 64), then calls upsert_doc with the
  precomputed sparse dict (Phase 1 plumbing).
- _parallel_ingest_to_genome: mp.Pool over files (chunk + tag) feeding
  the batched-SPLADE writer in the main process.

No call sites yet -- Task 7 wires these into build_profile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
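The buffering in _drain_with_batched_splade can be sketched like this. `encode_batch` and `upsert_doc` are injected as parameters purely for illustration; the real function works on Gene instances and the actual backend.

```python
def drain_with_batched_splade(genes, encode_batch, upsert_doc, batch_size=64):
    # Buffer incoming genes; every batch_size, batch-encode SPLADE and
    # upsert each gene with its precomputed sparse dict (Phase 1 plumbing).
    buffer = []
    for gene in genes:
        buffer.append(gene)
        if len(buffer) >= batch_size:
            _flush(buffer, encode_batch, upsert_doc)
            buffer.clear()
    if buffer:  # final partial batch
        _flush(buffer, encode_batch, upsert_doc)

def _flush(buffer, encode_batch, upsert_doc):
    sparse_dicts = encode_batch([g["content"] for g in buffer])
    for gene, sparse in zip(buffer, sparse_dicts):
        upsert_doc(gene, splade_sparse=sparse)
```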
build_profile now branches on a parallel kwarg: when True it discovers
files via _iter_ingestable_files, dispatches chunk+tag work to an
mp.Pool, and drains via the batched-SPLADE writer. When False (default)
the original ingest_tree path is unchanged.

mp.freeze_support() added to __main__ for Windows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
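The branch shape, reduced to a stand-in: `ingest_one` stands for the chunk+tag worker, and the real parallel path drains through the batched-SPLADE writer rather than into a plain list.

```python
import multiprocessing as mp

def build_profile(files, ingest_one, parallel=False, workers=4):
    # parallel=False keeps the original sequential path untouched;
    # parallel=True fans the per-file work out to an mp.Pool and drains
    # results in the main process.
    if not parallel:
        return [ingest_one(f) for f in files]
    with mp.Pool(workers) as pool:
        return list(pool.imap(ingest_one, files))

if __name__ == "__main__":
    # Required before any Pool use in frozen Windows executables;
    # a no-op elsewhere.
    mp.freeze_support()
```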
Builds a tiny synthetic corpus twice (sequential + parallel) and
asserts identical (gene_id, content_hash) sets. Confirms the core
acceptance criterion for issue #92: parallel ingest produces a
byte-equivalent gene set.

Also registers a 'slow' pytest marker for end-to-end tests that
load SPLADE / spaCy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
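The parity test's shape, with a hypothetical `build` helper standing in for driving build_fixture_matrix twice: comparing sets of (gene_id, content_hash) makes the assertion insensitive to ingest order, which is exactly what the parallel path perturbs.

```python
import hashlib

def build(corpus, parallel):
    # Stand-in builder; the real test builds a tiny synthetic corpus
    # through the sequential and parallel ingest paths.
    order = reversed(corpus) if parallel else corpus  # order may differ
    return {(gid, hashlib.sha256(text.encode()).hexdigest())
            for gid, text in order}

def test_parallel_parity():
    corpus = [("g1", "alpha"), ("g2", "beta")]
    assert build(corpus, parallel=False) == build(corpus, parallel=True)
```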
Replaces _copy_fingerprint_indexes with a complete single-shard
build function. _build_one_shard returns the per-shard stats and
a fingerprint_payload (list of tuples ready for an executemany
into main.db's fingerprint_index). The parent process writes
those rows when collecting shard results.

This shape supports both serial calls and an mp.Pool of subprocesses
in Task 10. _shard_worker_entry is the Pool's per-task callable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
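The result shape can be sketched with sqlite3 directly; the tuple layout of fingerprint_payload here is illustrative, not the real schema.

```python
import sqlite3

def _build_one_shard(shard_id, docs):
    # Stand-in for the per-shard build: return stats plus rows ready
    # for an executemany into main.db's fingerprint_index. Returning
    # plain tuples keeps the value picklable across the Pool boundary.
    fingerprint_payload = [(doc_id, shard_id) for doc_id, _ in docs]
    return {"shard": shard_id, "docs": len(docs)}, fingerprint_payload

main_db = sqlite3.connect(":memory:")
main_db.execute("CREATE TABLE fingerprint_index (doc_id TEXT, shard_id TEXT)")
stats, payload = _build_one_shard("s0", [("d1", "x"), ("d2", "y")])
# Parent process writes the rows when collecting shard results:
main_db.executemany("INSERT INTO fingerprint_index VALUES (?, ?)", payload)
```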
build_profile_sharded now accepts shard_workers and batch_size kwargs.
When shard_workers > 1 the per-shard builds run in mp.Pool subprocesses;
the parent process collects results via imap_unordered and writes
register_shard + fingerprint_index rows under SQLite busy_timeout=30s.

shard_workers == 1 keeps the serial behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
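The parent-side write path can be sketched as follows. `collect_shard_results` is a hypothetical name, and `results` stands for whatever `pool.imap_unordered(_shard_worker_entry, tasks)` yields; the key move is funnelling all main.db writes through one connection with busy_timeout set.

```python
import sqlite3

def collect_shard_results(results, main_db_path):
    # Serialise all main.db writes through a single connection. With a
    # 30 s busy_timeout, a contending writer blocks and retries instead
    # of raising "database is locked".
    con = sqlite3.connect(main_db_path)
    con.execute("PRAGMA busy_timeout = 30000")  # milliseconds
    con.execute("CREATE TABLE IF NOT EXISTS shards (shard_id TEXT)")
    with con:  # one transaction for the collected results
        for stats, payload in results:
            con.execute("INSERT INTO shards VALUES (?)", (stats["shard"],))
    return con
```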
Wires --shard-workers through main() for --mode sharded. When 0
(default), uses helix_context.parallel.auto_shard_workers which
caps to min(vram_gb // 4, auto_workers()). Explicit positive values
override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds a 2-root sharded corpus twice (shard_workers=1 vs shard_workers=2)
and asserts identical main.db shards table + fingerprint_index rows.

Validates that the mp.Pool path produces the same routing state as
the serial path under SQLite busy_timeout serialisation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents --parallel, --workers, --shard-workers, --batch-size with
concrete examples in the module docstring (and thus argparse --help
description).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Parallel ingest path for build_fixture_matrix.py (auto-sized worker pool)
