[codex] add shard file workers #96

Merged: mbachaud merged 1 commit into master from codex/issue-95-shard-file-workers on May 14, 2026

Conversation

@mbachaud (Owner)

Summary

  • Add shard-local CPU file workers so each SPLADE-owning shard process can receive chunked/tagged genes from an inner CPU pool.
  • Add --shard-file-workers with auto-sizing based on the CPU budget per shard worker.
  • Make auto_shard_workers() tolerate VRAM totals that report fractionally under 12 GB, while still respecting live free VRAM so a running server is not overcommitted.
  • Keep Windows process-spawn safe by using ProcessPoolExecutor for the outer shard workers and replacing the profile filter lambdas with a picklable top-level filter (sketch after this list).
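
For the spawn-safety point, here is roughly the shape of the pattern (a minimal sketch, not the PR's actual code; profile_filter, chunk_and_tag, and run_shard are illustrative names):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def profile_filter(path: str, allowed_exts: frozenset) -> bool:
    # Module-level function: picklable under the Windows "spawn" start
    # method, unlike the inline lambdas this replaces.
    return path.rsplit(".", 1)[-1] in allowed_exts

def chunk_and_tag(path: str) -> tuple[str, int]:
    # Stand-in for the CPU-bound chunk/tag prep done by inner file workers.
    return (path, len(path))

def run_shard(paths: list[str], file_workers: int) -> list[tuple[str, int]]:
    # Runs inside one outer shard process (the SPLADE owner); the inner
    # CPU pool preps files and feeds results back to this process.
    keep = partial(profile_filter, allowed_exts=frozenset({"py", "md"}))
    with ProcessPoolExecutor(max_workers=file_workers) as inner:
        return list(inner.map(chunk_and_tag, filter(keep, paths)))

if __name__ == "__main__":  # guard required for spawn-based pools
    shards = [["a.py", "b.md"], ["c.py", "d.txt"]]
    with ProcessPoolExecutor(max_workers=2) as outer:  # outer shard workers
        for fut in [outer.submit(run_shard, s, 3) for s in shards]:
            print(fut.result())
```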

Why

Issue #95 found two follow-ups from the sharded ingest run: 3080 Ti VRAM was rounded down too aggressively, and each shard worker was sequential inside the shard. On a live system where free VRAM keeps the safe SPLADE count at 2, the higher-value path is feeding those 2 SPLADE workers with more CPU chunk/tag work.
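
The rounding fix is roughly this shape (a minimal sketch; the real auto_shard_workers() signature, the per-worker VRAM figure, and the tolerance value are all illustrative):

```python
def auto_shard_workers(total_vram_gb: float, free_vram_gb: float,
                       per_worker_gb: float = 5.5,
                       tolerance_gb: float = 0.5) -> int:
    # A 3080 Ti reports slightly under its nominal 12 GB; treat totals
    # within tolerance of the nearest whole GB as nominal so the budget
    # isn't rounded down by a whole worker.
    nominal = round(total_vram_gb)
    if nominal - total_vram_gb <= tolerance_gb:
        total_vram_gb = float(nominal)
    # Never commit more than what is actually free right now, so a live
    # server already holding VRAM isn't overcommitted.
    budget = min(total_vram_gb, free_vram_gb)
    return max(1, int(budget // per_worker_gb))
```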

Impact

A typical live run can now use:

```
python scripts/build_fixture_matrix.py --profile medium --mode sharded --shard-workers 2 --shard-file-workers 3
```

That keeps SPLADE VRAM bounded while using more CPU cores for file prep.
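
The --shard-file-workers auto-sizing is, in spirit, a per-shard CPU budget. A hypothetical reconstruction, not the verbatim formula from scripts/build_fixture_matrix.py:

```python
import os

def auto_shard_file_workers(shard_workers: int) -> int:
    # Divide logical CPUs evenly across the outer shard workers, then
    # subtract the SPLADE writer each shard already runs. Example: 16
    # threads with 4 shard workers -> 16 // 4 - 1 = 3 inner file workers.
    cpus = os.cpu_count() or 1
    return max(1, cpus // max(1, shard_workers) - 1)
```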

Validation

  • python -m pytest -m "not slow" tests/test_build_fixture_matrix_parallel.py tests/test_parallel_sizers.py -> 18 passed, 2 slow tests deselected
  • python -m py_compile helix_context/parallel.py scripts/build_fixture_matrix.py tests/test_parallel_sizers.py tests/test_build_fixture_matrix_parallel.py
  • git diff --check

Draft because the live medium/xl SPLADE benchmark has not been run in this PR branch yet.

@mbachaud (Owner, Author)

Live perf data on the dev box (5800X + 3080 Ti + 980 Pro)

This is the benchmark the draft PR description was waiting on. Two layers of numbers here: build-time (the headline this PR is for) and serve-time parity (proof that the sharded routing doesn't regress retrieval quality vs blob).

Build-time wall clock (this PR's primary win)

All 6 builds ran fresh from a cold corpus, with no warm cache between runs:

| Profile | Monolithic (blob, --parallel off) | Sharded (--shard-workers 3 --shard-file-workers 3) | Speedup |
| --- | --- | --- | --- |
| small (4 roots, 1.2k genes) | 3.4 min | n/a | n/a |
| medium (6 roots, 17k genes) | 47 min | 8.6 min | 5.5× |
| large (1 root, 27k genes) | 95.6 min | n/a | n/a |
| xl (13 roots, 46k genes) | 170 min | 26 min (--shard-workers 4 --shard-file-workers 3) | 6.5× |

Run from the dev box ~12 hours ago on the PR #96 branch. The xl-sharded case is the most interesting one because the corpus is uneven — F:/Projects is ~60% of xl's gene volume, and the other 12 game roots are tiny. The 26 min wall clock came from 4 outer SPLADE workers chasing 13 shards; the last ~13 min was just the F:/Projects shard alone on 1 SPLADE writer + 3 inner file workers (the tail effect, tracked separately in #97).

CPU utilization profile during xl-sharded 4×3:

  • 0-13 min: ~95% CPU (all 12 file workers + 4 SPLADE writers loaded)
  • 13-26 min: ~24% CPU (only F:/Projects shard remaining, ~4 cores busy)

Serve-time retrieval parity

10-needle bench (curated SIKE set from benchmarks/bench_needle.py) via claude -p --model sonnet, served from each fixture via /admin/swap-db:

| Profile | Mode | Helix gold-retrieval | Claude answer score (-1/0/+1) | Cost (10 needles) |
| --- | --- | --- | --- | --- |
| small | blob | 10% | +0.6 | $1.11 |
| medium | blob | 30% | +0.1 | $1.16 |
| large | blob | 20% | +0.4 | $1.11 |
| xl | blob | 20% | +0.4 | $1.16 |
| medium-sharded | sharded | 20% | +0.0 | $1.14 |
| xl-sharded | sharded | 20% | +0.6 | $1.38 |
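
For context, the swap-then-bench loop behind these rows is roughly the following (a sketch only; the /admin/swap-db payload shape and fixture names are assumptions, not documented API):

```python
import json
import subprocess
import urllib.request

FIXTURES = ["small", "medium", "large", "xl", "medium-sharded", "xl-sharded"]

def swap_db(fixture: str, base: str = "http://localhost:8000") -> None:
    # Assumed request shape for the admin endpoint; the real server may
    # take a db path or different JSON keys.
    body = json.dumps({"fixture": fixture}).encode()
    req = urllib.request.Request(f"{base}/admin/swap-db", data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).close()

for fixture in FIXTURES:
    swap_db(fixture)  # mount the fixture, then run the 10-needle bench
    subprocess.run(["python", "benchmarks/bench_needle.py"], check=True)
```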

Sharded retrieval matches blob retrieval at 20% gold-delivered. No quality regression from the routing layer on this needle set. The xl-sharded answer score being higher than xl-blob (+0.6 vs +0.4) is within N=10 noise — answer score is dominated by Claude's training-data fallback whenever Helix doesn't deliver the gold source, so it's not a clean signal on Helix per se.

Helix retrieval at 10-30% across the matrix is the real story for next quarter's work (separate issue #93 EnterpriseRAG-Bench prep), not a PR #96 concern.

Bugs caught by the bench (filed as #98)

Six API gaps in ShardedGenomeAdapter blocked the sharded fixtures from being usable through /context. All six were fixed locally to unblock the bench and will land as a follow-up PR per #98. Quick list (a sketch of the shim follows it):

  1. path attribute (str(helix.genome.path) in swap-db)
  2. _dense_embedding_enabled
  3. _entity_graph_retrieval_enabled
  4. _last_query_scores_lock
  5. query_docs method (had query_genes only — legacy name)
  6. Adapter's query_docs body calls self._router.query_docs which doesn't exist (router is still on query_genes)
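
Roughly the shape of the local fix for items 5-6 plus the missing attributes (a sketch, not the follow-up PR's code; the defaults shown are illustrative):

```python
import threading
from pathlib import Path

class ShardedGenomeAdapter:
    # Only the six gap fixes are sketched; the real adapter has more.
    def __init__(self, router, db_path: str):
        self._router = router
        self.path = Path(db_path)                        # gap 1
        self._dense_embedding_enabled = False            # gap 2 (default assumed)
        self._entity_graph_retrieval_enabled = False     # gap 3 (default assumed)
        self._last_query_scores_lock = threading.Lock()  # gap 4

    def query_docs(self, *args, **kwargs):               # gap 5: R3 name
        # gap 6: the router still speaks the legacy name, so the bridge
        # must call query_genes, not a nonexistent router.query_docs.
        return self._router.query_genes(*args, **kwargs)

    query_genes = query_docs  # keep the legacy alias working
```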

None of these touch the build path in this PR — they're all server-side adapter gaps that only fire when a sharded fixture is mounted via swap-db. The build itself was clean and the resulting .db files are correct.

Total bench cost

$7.07 for the full 60-needle blob+sharded sweep + $2.52 for the sharded re-run after the adapter patches. ~25 min wall-clock for the headline runs.

Recommendation

Ready to merge from a build-perf standpoint. The serve-side adapter bugs are separate concerns (#98) and shouldn't block this PR — they predate the changes here.

Logs + JSONLs at benchmarks/results/claude_matrix_20260514T071730Z/ (initial run) and benchmarks/results/claude_matrix_20260514T160500Z/ (sharded rerun).

@mbachaud mbachaud marked this pull request as ready for review May 14, 2026 16:24
@mbachaud mbachaud merged commit 245ad11 into master May 14, 2026
3 checks passed
@mbachaud mbachaud deleted the codex/issue-95-shard-file-workers branch May 14, 2026 16:24
mbachaud added a commit that referenced this pull request May 14, 2026
…d sort (#99)

Fixes ShardedGenomeAdapter drift caught during the post-#96 bench and adds
largest-first shard scheduling.

#98 — adapter mirrors the KnowledgeStore surface that context_manager and
routes_admin read off self.genome: path property, _dense_embedding_enabled /
_entity_graph_retrieval_enabled / _last_query_scores_lock defaults, R3
query_docs / get_doc / upsert_doc renames (with legacy aliases), and the
router bridge bug fix. Parity test catches future drift; swap-db round-trip
test covers A->B->A with HELIX_USE_SHARDS=1.

#97 A.1 — _estimate_eligible_bytes pre-scan; build_profile_sharded submits
shards to the worker pool largest-first by default. --no-shard-sort for
deterministic ordering. A.2 / B left as follow-up.

73 shard-adjacent tests green; CI green on linux/macos/windows.

Closes #97. Closes #98.
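
The largest-first scheduling in that follow-up is essentially this pattern (a sketch under assumed names; the real _estimate_eligible_bytes applies the actual eligibility filters, and build_one_shard stands in for the per-shard build):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def estimate_eligible_bytes(root: str) -> int:
    # Cheap pre-scan stand-in: sum file sizes under the shard root.
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        total += sum(os.path.getsize(os.path.join(dirpath, f)) for f in files)
    return total

def build_one_shard(root: str) -> str:
    # Placeholder for the real per-shard build (SPLADE writer + file pool).
    return root

def submit_shards(shard_roots: list[str], workers: int,
                  shard_sort: bool = True):
    # Largest shards go in first so the slowest one starts earliest,
    # shrinking the tail seen in the xl run; --no-shard-sort would map
    # to shard_sort=False for deterministic input ordering.
    if shard_sort:
        shard_roots = sorted(shard_roots, key=estimate_eligible_bytes,
                             reverse=True)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return [pool.submit(build_one_shard, r) for r in shard_roots]
```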