[codex] add shard file workers #96
Conversation
Live perf data on the dev box (5800X + 3080 Ti + 980 Pro)

The benchmark the draft PR description was waiting on. Two layers of numbers: build-time (the headline this PR is for) and serve-time parity (proof the sharded routing doesn't regress retrieval quality vs blob).

Build-time wall clock (this PR's primary win)

All 6 profiles built fresh from a cold corpus, no warm cache between runs:
Run from the dev box ~12 hours ago on the PR #96 branch. The xl-sharded case is the most interesting one because the corpus is uneven: F:/Projects is ~60% of the genome volume of xl, and the other 12 game roots are tiny. The 26 min wall clock came from 4 outer SPLADE workers chasing 13 shards; the last ~13 min was the F:/Projects shard alone on 1 SPLADE writer + 3 inner file workers (the tail effect, tracked separately in #97). CPU utilization profile during xl-sharded 4×3:
Serve-time retrieval parity

10-needle bench (curated SIKE set from
Sharded retrieval matches blob retrieval at 20% gold-delivered. No quality regression from the routing layer on this needle set. The xl-sharded answer score being higher than xl-blob (+0.6 vs +0.4) is within N=10 noise: the answer score is dominated by Claude's training-data fallback whenever Helix doesn't deliver the gold source, so it's not a clean signal on Helix per se. Helix retrieval at 10-30% across the matrix is the real story for next quarter's work (separate issue #93, EnterpriseRAG-Bench prep), not a PR #96 concern.

Bugs caught by the bench (filed as #98)

Six API gaps in
None of these touch the build path in this PR; they're all server-side adapter gaps that only fire when a sharded fixture is mounted via swap-db. The build itself was clean and the resulting

Total bench cost

$7.07 for the full 60-needle blob+sharded sweep + $2.52 for the sharded re-run after the adapter patches. ~25 min wall clock for the headline runs.

Recommendation

Ready to merge from a build-perf standpoint. The serve-side adapter bugs are separate concerns (#98) and shouldn't block this PR; they predate the changes here. Logs + JSONLs at
…d sort (#99)

Fixes ShardedGenomeAdapter drift caught during the post-#96 bench and adds largest-first shard scheduling.

#98: the adapter now mirrors the KnowledgeStore surface that context_manager and routes_admin read off self.genome: the path property, the _dense_embedding_enabled / _entity_graph_retrieval_enabled / _last_query_scores_lock defaults, the R3 query_docs / get_doc / upsert_doc renames (with legacy aliases), and the router bridge bug fix. A parity test catches future drift; a swap-db round-trip test covers A->B->A with HELIX_USE_SHARDS=1.

#97 A.1: _estimate_eligible_bytes pre-scan; build_profile_sharded submits shards to the worker pool largest-first by default, with --no-shard-sort for deterministic ordering. A.2 / B left as follow-up.

73 shard-adjacent tests green; CI green on linux/macos/windows. Closes #97. Closes #98.
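The scheduling change in #99 is easy to picture: pre-scan each shard's eligible bytes, then submit shards to the pool in descending size order so the biggest shard starts first instead of landing last and stretching the tail. A minimal sketch of that pattern, using hypothetical estimate_eligible_bytes and build_shard helpers rather than the actual _estimate_eligible_bytes / build_profile_sharded code:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def estimate_eligible_bytes(shard_root: Path) -> int:
    """Hypothetical pre-scan: total size of files the ingest would accept."""
    return sum(p.stat().st_size for p in shard_root.rglob("*") if p.is_file())

def build_shards_largest_first(shard_roots, build_shard, max_workers=4, sort_shards=True):
    """Submit shards largest-first so the biggest shard cannot become the tail job.
    sort_shards=False mirrors the idea behind --no-shard-sort: keep the caller's
    deterministic ordering instead of sorting by size. build_shard must be a
    picklable (module-level) callable because it crosses a process boundary."""
    ordered = list(shard_roots)
    if sort_shards:
        ordered.sort(key=estimate_eligible_bytes, reverse=True)
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(build_shard, root): root for root in ordered}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```

On a corpus shaped like the xl one above, this starts the dominant F:/Projects shard immediately rather than leaving it as the final ~13 min tail.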
Summary
- Adds --shard-file-workers with auto-sizing based on the CPU budget per shard worker.
- Makes auto_shard_workers() tolerate fractional 12 GB VRAM reports while respecting live free VRAM, to avoid overcommitting a running server.
- Uses a ProcessPoolExecutor for outer shard workers, replacing the profile filter lambdas with a picklable top-level filter.

Why
Issue #95 found two follow-ups from the sharded ingest run: 3080 Ti VRAM was rounded down too aggressively, and each shard worker was sequential inside the shard. On a live system where free VRAM keeps the safe SPLADE count at 2, the higher-value path is feeding those 2 SPLADE workers with more CPU chunk/tag work.
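As a rough illustration of that sizing trade-off (this is not the actual auto_shard_workers() implementation; the per-worker VRAM budget and reserve below are placeholder numbers), the shape of the computation could be:

```python
import os

def size_workers(free_vram_gb: float,
                 vram_per_splade_gb: float = 3.0,  # placeholder per-worker budget, not a measured figure
                 reserve_vram_gb: float = 1.0,     # head-room so a running server isn't overcommitted
                 cpu_count: int = 0) -> tuple:
    """Hypothetical sizing sketch: derive the SPLADE (GPU) worker count from live
    free VRAM without truncating a fractional report too early, then spend the
    remaining CPU cores on per-shard file workers (chunk/tag prep)."""
    cpu_count = cpu_count or os.cpu_count() or 1
    usable = max(free_vram_gb - reserve_vram_gb, 0.0)
    splade_workers = max(int(usable // vram_per_splade_gb), 1)
    # Assume one core is occupied per SPLADE worker; split the rest across shards.
    spare_cores = max(cpu_count - splade_workers, 1)
    file_workers_per_shard = max(spare_cores // splade_workers, 1)
    return splade_workers, file_workers_per_shard
```

The point is that the free-VRAM reading stays a float until the final integer division, so a fractional report is not rounded down prematurely, and whatever CPU budget remains per SPLADE worker goes to file workers inside the shard.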
Impact
A typical live run can now use:
That keeps SPLADE VRAM bounded while using more CPU cores for file prep.
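One mechanical detail behind the summary's ProcessPoolExecutor switch is worth spelling out: the executor pickles whatever it ships to worker processes, so the profile filter has to be a module-level function (or a functools.partial of one) rather than a lambda closed over local state. A minimal sketch with hypothetical names, not the filter actually used in scripts/build_fixture_matrix.py:

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def profile_filter(path: str, allowed_suffixes: tuple) -> bool:
    """Module-level filter: picklable, unlike a lambda defined inside the build function."""
    return path.endswith(allowed_suffixes)

def ingest_shard(shard_paths, keep):
    """Stand-in for the per-shard build step; runs inside a worker process."""
    return [p for p in shard_paths if keep(p)]

if __name__ == "__main__":  # guard required for process pools on Windows
    shards = [["a.md", "b.bin"], ["c.md", "d.py"]]
    keep = partial(profile_filter, allowed_suffixes=(".md", ".py"))  # partial of a top-level function pickles fine
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(list(pool.map(ingest_shard, shards, [keep] * len(shards))))
```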
Validation
- python -m pytest -m "not slow" tests/test_build_fixture_matrix_parallel.py tests/test_parallel_sizers.py -> 18 passed, 2 slow tests deselected
- python -m py_compile helix_context/parallel.py scripts/build_fixture_matrix.py tests/test_parallel_sizers.py tests/test_build_fixture_matrix_parallel.py
- git diff --check

Draft because the live medium/xl SPLADE benchmark has not been run in this PR branch yet.