[codex] add shard file workers #96

Merged: mbachaud merged 1 commit into master from codex/issue-95-shard-file-workers on May 14, 2026

Conversation

@mbachaud (Owner)

Summary

  • Add shard-local CPU file workers so each SPLADE-owning shard process can receive chunked/tagged genes from an inner CPU pool.
  • Add --shard-file-workers with auto-sizing based on the CPU budget per shard worker.
  • Make auto_shard_workers() tolerate VRAM totals that report fractionally under 12 GB, while still respecting live free VRAM so a running server is not overcommitted.
  • Keep Windows process-spawn safe by using ProcessPoolExecutor for the outer shard workers and replacing the profile filter lambdas with a picklable top-level filter (sketch after this list).
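
For the spawn-safety point, here is roughly the shape of the pattern (a minimal sketch, not the PR's actual code; profile_filter, chunk_and_tag, and run_shard are illustrative names):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def profile_filter(path: str, allowed_exts: frozenset) -> bool:
    # Module-level function: picklable under the Windows "spawn" start
    # method, unlike the inline lambdas this replaces.
    return path.rsplit(".", 1)[-1] in allowed_exts

def chunk_and_tag(path: str) -> tuple[str, int]:
    # Stand-in for the CPU-bound chunk/tag prep done by inner file workers.
    return (path, len(path))

def run_shard(paths: list[str], file_workers: int) -> list[tuple[str, int]]:
    # Runs inside one outer shard process (the SPLADE owner); the inner
    # CPU pool preps files and feeds results back to this process.
    keep = partial(profile_filter, allowed_exts=frozenset({"py", "md"}))
    with ProcessPoolExecutor(max_workers=file_workers) as inner:
        return list(inner.map(chunk_and_tag, filter(keep, paths)))

if __name__ == "__main__":  # guard required for spawn-based pools
    shards = [["a.py", "b.md"], ["c.py", "d.txt"]]
    with ProcessPoolExecutor(max_workers=2) as outer:  # outer shard workers
        for fut in [outer.submit(run_shard, s, 3) for s in shards]:
            print(fut.result())
```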

Why

Issue #95 found two follow-ups from the sharded ingest run: 3080 Ti VRAM was rounded down too aggressively, and each shard worker was sequential inside the shard. On a live system where free VRAM keeps the safe SPLADE count at 2, the higher-value path is feeding those 2 SPLADE workers with more CPU chunk/tag work.
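
The rounding fix is roughly this shape (a minimal sketch; the real auto_shard_workers() signature, the per-worker VRAM figure, and the tolerance value are all illustrative):

```python
def auto_shard_workers(total_vram_gb: float, free_vram_gb: float,
                       per_worker_gb: float = 5.5,
                       tolerance_gb: float = 0.5) -> int:
    # A 3080 Ti reports slightly under its nominal 12 GB; treat totals
    # within tolerance of the nearest whole GB as nominal so the budget
    # isn't rounded down by a whole worker.
    nominal = round(total_vram_gb)
    if nominal - total_vram_gb <= tolerance_gb:
        total_vram_gb = float(nominal)
    # Never commit more than what is actually free right now, so a live
    # server already holding VRAM isn't overcommitted.
    budget = min(total_vram_gb, free_vram_gb)
    return max(1, int(budget // per_worker_gb))
```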

Impact

A typical live run can now use:

```
python scripts/build_fixture_matrix.py --profile medium --mode sharded --shard-workers 2 --shard-file-workers 3
```

That keeps SPLADE VRAM bounded while using more CPU cores for file prep.
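
The --shard-file-workers auto-sizing is, in spirit, a per-shard CPU budget. A hypothetical reconstruction, not the verbatim formula from scripts/build_fixture_matrix.py:

```python
import os

def auto_shard_file_workers(shard_workers: int) -> int:
    # Divide logical CPUs evenly across the outer shard workers, then
    # subtract the SPLADE writer each shard already runs. Example: 16
    # threads with 4 shard workers -> 16 // 4 - 1 = 3 inner file workers.
    cpus = os.cpu_count() or 1
    return max(1, cpus // max(1, shard_workers) - 1)
```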

Validation

  • python -m pytest -m "not slow" tests/test_build_fixture_matrix_parallel.py tests/test_parallel_sizers.py -> 18 passed, 2 slow tests deselected
  • python -m py_compile helix_context/parallel.py scripts/build_fixture_matrix.py tests/test_parallel_sizers.py tests/test_build_fixture_matrix_parallel.py
  • git diff --check

Draft because the live medium/xl SPLADE benchmark has not been run in this PR branch yet.

@mbachaud (Owner, Author)

Live perf data on the dev box (5800X + 3080 Ti + 980 Pro)

This is the benchmark the draft PR description was waiting on. Two layers of numbers here: build-time (the headline this PR is for) and serve-time parity (proof that the sharded routing doesn't regress retrieval quality vs blob).

Build-time wall clock (this PR's primary win)

All 6 builds ran fresh from a cold corpus, with no warm cache between runs:

| Profile | Monolithic (blob, --parallel off) | Sharded (--shard-workers 3 --shard-file-workers 3) | Speedup |
| --- | --- | --- | --- |
| small (4 roots, 1.2k genes) | 3.4 min | n/a | n/a |
| medium (6 roots, 17k genes) | 47 min | 8.6 min | 5.5× |
| large (1 root, 27k genes) | 95.6 min | n/a | n/a |
| xl (13 roots, 46k genes) | 170 min | 26 min (--shard-workers 4 --shard-file-workers 3) | 6.5× |

Run from the dev box ~12 hours ago on the PR #96 branch. The xl-sharded case is the most interesting one because the corpus is uneven — F:/Projects is ~60% of xl's gene volume, and the other 12 game roots are tiny. The 26 min wall clock came from 4 outer SPLADE workers chasing 13 shards; the last ~13 min was just the F:/Projects shard alone on 1 SPLADE writer + 3 inner file workers (the tail effect, tracked separately in #97).

CPU utilization profile during xl-sharded 4×3:

  • 0-13 min: ~95% CPU (all 12 file workers + 4 SPLADE writers loaded)
  • 13-26 min: ~24% CPU (only F:/Projects shard remaining, ~4 cores busy)

Serve-time retrieval parity

10-needle bench (curated SIKE set from benchmarks/bench_needle.py) via claude -p --model sonnet, served from each fixture via /admin/swap-db:

| Profile | Mode | Helix gold-retrieval | Claude answer score (-1/0/+1) | Cost (10 needles) |
| --- | --- | --- | --- | --- |
| small | blob | 10% | +0.6 | $1.11 |
| medium | blob | 30% | +0.1 | $1.16 |
| large | blob | 20% | +0.4 | $1.11 |
| xl | blob | 20% | +0.4 | $1.16 |
| medium-sharded | sharded | 20% | +0.0 | $1.14 |
| xl-sharded | sharded | 20% | +0.6 | $1.38 |
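
For context, the swap-then-bench loop behind these rows is roughly the following (a sketch only; the /admin/swap-db payload shape and fixture names are assumptions, not documented API):

```python
import json
import subprocess
import urllib.request

FIXTURES = ["small", "medium", "large", "xl", "medium-sharded", "xl-sharded"]

def swap_db(fixture: str, base: str = "http://localhost:8000") -> None:
    # Assumed request shape for the admin endpoint; the real server may
    # take a db path or different JSON keys.
    body = json.dumps({"fixture": fixture}).encode()
    req = urllib.request.Request(f"{base}/admin/swap-db", data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).close()

for fixture in FIXTURES:
    swap_db(fixture)  # mount the fixture, then run the 10-needle bench
    subprocess.run(["python", "benchmarks/bench_needle.py"], check=True)
```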

Sharded retrieval matches blob retrieval at 20% gold-delivered. No quality regression from the routing layer on this needle set. The xl-sharded answer score being higher than xl-blob (+0.6 vs +0.4) is within N=10 noise — answer score is dominated by Claude's training-data fallback whenever Helix doesn't deliver the gold source, so it's not a clean signal on Helix per se.

Helix retrieval at 10-30% across the matrix is the real story for next quarter's work (separate issue #93 EnterpriseRAG-Bench prep), not a PR #96 concern.

Bugs caught by the bench (filed as #98)

Six API gaps in ShardedGenomeAdapter blocked the sharded fixtures from being usable through /context. All six were fixed locally to unblock the bench and will land as a follow-up PR per #98. Quick list (a sketch of the shim follows it):

  1. path attribute (str(helix.genome.path) in swap-db)
  2. _dense_embedding_enabled
  3. _entity_graph_retrieval_enabled
  4. _last_query_scores_lock
  5. query_docs method (had query_genes only — legacy name)
  6. Adapter's query_docs body calls self._router.query_docs which doesn't exist (router is still on query_genes)
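
Roughly the shape of the local fix for items 5-6 plus the missing attributes (a sketch, not the follow-up PR's code; the defaults shown are illustrative):

```python
import threading
from pathlib import Path

class ShardedGenomeAdapter:
    # Only the six gap fixes are sketched; the real adapter has more.
    def __init__(self, router, db_path: str):
        self._router = router
        self.path = Path(db_path)                        # gap 1
        self._dense_embedding_enabled = False            # gap 2 (default assumed)
        self._entity_graph_retrieval_enabled = False     # gap 3 (default assumed)
        self._last_query_scores_lock = threading.Lock()  # gap 4

    def query_docs(self, *args, **kwargs):               # gap 5: R3 name
        # gap 6: the router still speaks the legacy name, so the bridge
        # must call query_genes, not a nonexistent router.query_docs.
        return self._router.query_genes(*args, **kwargs)

    query_genes = query_docs  # keep the legacy alias working
```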

None of these touch the build path in this PR — they're all server-side adapter gaps that only fire when a sharded fixture is mounted via swap-db. The build itself was clean and the resulting .db files are correct.

Total bench cost

$7.07 for the full 60-needle blob+sharded sweep + $2.52 for the sharded re-run after the adapter patches. ~25 min wall-clock for the headline runs.

Recommendation

Ready to merge from a build-perf standpoint. The serve-side adapter bugs are separate concerns (#98) and shouldn't block this PR — they predate the changes here.

Logs + JSONLs at benchmarks/results/claude_matrix_20260514T071730Z/ (initial run) and benchmarks/results/claude_matrix_20260514T160500Z/ (sharded rerun).

@mbachaud mbachaud marked this pull request as ready for review May 14, 2026 16:24
@mbachaud mbachaud merged commit 245ad11 into master May 14, 2026
3 checks passed
@mbachaud mbachaud deleted the codex/issue-95-shard-file-workers branch May 14, 2026 16:24
mbachaud added a commit that referenced this pull request May 14, 2026
…d sort (#99)

Fixes ShardedGenomeAdapter drift caught during the post-#96 bench and adds
largest-first shard scheduling.

#98 — adapter mirrors the KnowledgeStore surface that context_manager and
routes_admin read off self.genome: path property, _dense_embedding_enabled /
_entity_graph_retrieval_enabled / _last_query_scores_lock defaults, R3
query_docs / get_doc / upsert_doc renames (with legacy aliases), and the
router bridge bug fix. Parity test catches future drift; swap-db round-trip
test covers A->B->A with HELIX_USE_SHARDS=1.

#97 A.1 — _estimate_eligible_bytes pre-scan; build_profile_sharded submits
shards to the worker pool largest-first by default. --no-shard-sort for
deterministic ordering. A.2 / B left as follow-up.

73 shard-adjacent tests green; CI green on linux/macos/windows.

Closes #97. Closes #98.
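
The largest-first scheduling in that follow-up is essentially this pattern (a sketch under assumed names; the real _estimate_eligible_bytes applies the actual eligibility filters, and build_one_shard stands in for the per-shard build):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def estimate_eligible_bytes(root: str) -> int:
    # Cheap pre-scan stand-in: sum file sizes under the shard root.
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        total += sum(os.path.getsize(os.path.join(dirpath, f)) for f in files)
    return total

def build_one_shard(root: str) -> str:
    # Placeholder for the real per-shard build (SPLADE writer + file pool).
    return root

def submit_shards(shard_roots: list[str], workers: int,
                  shard_sort: bool = True):
    # Largest shards go in first so the slowest one starts earliest,
    # shrinking the tail seen in the xl run; --no-shard-sort would map
    # to shard_sort=False for deterministic input ordering.
    if shard_sort:
        shard_roots = sorted(shard_roots, key=estimate_eligible_bytes,
                             reverse=True)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return [pool.submit(build_one_shard, r) for r in shard_roots]
```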