feat: parallel index build with leader-merges architecture (#188)
Conversation
Do we have a comparison between the old architecture (main branch) and this new one? From what I can see, the new architecture offers:
The main gist is to move write responsibilities to the leader node in order to make the writes mostly sequential (as you point out) and, especially, to avoid the heavy fragmentation that arose with the previous approach, where workers wrote their own segments with a follow-up compaction step. Since we are mostly CPU-bound during indexing (and mostly on the read path) due to the calls to to_tsvector and memtable-population operations, this turns out to give pretty good speedups as the number of workers grows. I had to experiment quite a bit to hit on this approach (several other PRs in flight that I'm now going to close) and add visualizations to see what was happening with the fragmentation. Ironically, I intended the viz for human consumption, but the robot started using them immediately and became their main consumer.
Force-pushed from 7dde6bc to 10fc78d
SteveLauC left a comment:
Did a brief review of the PR, sorry for the bunch of comments 🫠
Thanks much @SteveLauC for going through the PR, it's helpful. I'd like to chat IRL if you're game; ping me at tj@tigerdata.com.
Workers scan the heap in parallel and build memtables in shared DSA memory using double-buffering. The leader reads worker memtables and writes L0 segments, ensuring all disk I/O happens sequentially for contiguous page allocation.

Key implementation details:
- Workers use dshash tables for term storage with custom hash/compare functions that support length-based lookups (no allocation needed for TSVector terms, which aren't null-terminated)
- Leader pre-allocates a page pool to avoid smgr cache issues with concurrent page allocation
- LWLock tranches registered in _PG_init() for dshash synchronization
- Hash tables destroyed and recreated when buffers are reused to prevent duplicate document entries

Fixes during implementation:
- Added len field to TpStringKey for allocation-free lookups
- Fixed hash table clearing on buffer reuse (was causing 1.5x doc count)
- Registered custom LWLock tranches instead of using PARALLEL_HASH_JOIN
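The length-based lookup trick above can be sketched in plain C. This is an illustrative model, not the extension's actual code: `TpStringKeySketch` stands in for the PR's `TpStringKey` with its added `len` field, and the two functions stand in for the custom dshash hash/compare callbacks. Because TSVector lexemes are not null-terminated, hashing and comparing by explicit length lets lookups run without copying each term into a null-terminated buffer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical length-prefixed key, modeled on TpStringKey's added
 * `len` field.  The data pointer is NOT null-terminated. */
typedef struct TpStringKeySketch
{
    uint32_t    len;    /* byte length of the term */
    const char *data;   /* raw term bytes, no trailing NUL */
} TpStringKeySketch;

/* FNV-1a over an explicit length (stand-in for the custom dshash
 * hash function): never calls strlen(), so no copy is needed. */
static uint32_t
tp_key_hash(const TpStringKeySketch *k)
{
    uint32_t h = 2166136261u;
    for (uint32_t i = 0; i < k->len; i++)
    {
        h ^= (unsigned char) k->data[i];
        h *= 16777619u;
    }
    return h;
}

/* Length-aware equality (stand-in for the custom dshash compare):
 * lengths must match before memcmp touches the bytes. */
static int
tp_key_eq(const TpStringKeySketch *a, const TpStringKeySketch *b)
{
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```

The payoff is that a key can point directly into the TSVector's in-place lexeme storage during a lookup.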
The final truncation after parallel build was incorrectly computing max_used by only looking at segment data pages (seg + num_pages). This missed page index pages, which can be located at any block number. Use tp_segment_collect_pages() to get ALL pages (data + page index) for each segment when computing the maximum used block. This fixes "could not read blocks" errors when querying parallel-built indexes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of writing separate L0 segments for each worker buffer and then compacting them, the leader now:
1. Accumulates all worker buffer contents into a local memtable
2. Writes ONE merged segment when all workers are done

This produces a single compact segment similar to serial build, avoiding the 3x index size overhead that resulted from writing many small segments.

The new approach:
- tp_accumulate_worker_buffer_to_local(): copies DSA buffer data to a local memtable (palloc-based, private to the leader process)
- tp_write_segment_from_local_memtable(): writes the segment from merged data
- tp_leader_process_buffers(): orchestrates the accumulate-then-merge flow

Double-buffering is preserved: as worker buffers fill, the leader accumulates their contents and signals buffers consumed, allowing workers to continue.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace the intermediate local memtable accumulation approach with a direct streaming N-way merge from worker DSA buffers to the disk segment.

Key changes:
- Add TpWorkerMergeSource to track the position in each buffer's sorted term array during the N-way merge
- Add collect_buffer_term_postings() for N-way merge of postings
- Add tp_merge_worker_buffers_to_segment(), which streams directly from worker buffers to the segment without an intermediate copy
- Update tp_leader_process_buffers() to merge when nworkers buffers are ready (allowing fast workers to contribute both buffers while slow workers catch up)
- Remove the local memtable accumulation code and related functions

This eliminates the memory overhead from copying all worker data to a local memtable before writing the segment.
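The streaming merge can be sketched as follows. This is a simplified model, not the extension's code: `MergeSource` stands in for `TpWorkerMergeSource`, and plain ints stand in for (term, postings) pairs. Each source is a cursor into one worker buffer's sorted term array; on every step the leader picks the source with the smallest current value, emits it, and advances that cursor, so output streams out without building an intermediate copy.

```c
#include <stddef.h>

/* Illustrative cursor over one worker buffer's sorted term array
 * (stand-in for TpWorkerMergeSource; real entries are terms with
 * posting lists, simplified here to ints). */
typedef struct MergeSource
{
    const int *terms;   /* sorted term array for one worker buffer */
    size_t     nterms;  /* number of entries in terms[] */
    size_t     pos;     /* current cursor position */
} MergeSource;

/* Stream an N-way merge of `nsrc` sorted sources into `out`
 * (which must hold the combined length); returns items written. */
static size_t
nway_merge(MergeSource *src, int nsrc, int *out)
{
    size_t n = 0;
    for (;;)
    {
        int best = -1;
        /* Linear scan for the smallest head; real code with many
         * sources might use a heap, but worker counts are small. */
        for (int i = 0; i < nsrc; i++)
        {
            if (src[i].pos >= src[i].nterms)
                continue;       /* this source is exhausted */
            if (best < 0 ||
                src[i].terms[src[i].pos] < src[best].terms[src[best].pos])
                best = i;
        }
        if (best < 0)
            break;              /* all sources exhausted */
        out[n++] = src[best].terms[src[best].pos++];
    }
    return n;
}
```

In the real flow the "emit" step appends the term's merged postings to the segment writer rather than to an array.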
…el build

The dictionary entry writing code in tp_merge_worker_buffers_to_segment() assumed all entries fit on the first page. With large datasets (150K+ docs), dictionary entries span multiple pages, causing "Invalid logical page" errors.

Fixed by:
1. Flushing the segment writer before writing entries
2. Using page-boundary-aware code that calculates logical/physical pages
3. Handling entries that span two pages, like merge.c does
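The page-boundary arithmetic behind fix 2 and 3 can be sketched with a simplified model. The assumptions here are mine, not the extension's actual layout: a segment is addressed as a run of fixed-size pages by logical page number, and an entry written at logical byte offset `off` may straddle the boundary into the next page, in which case the writer splits it.

```c
#include <stdint.h>

/* Assumed page size for the sketch (Postgres's default BLCKSZ). */
#define SKETCH_PAGE_SIZE 8192

/* Logical page number holding byte offset `off`. */
static uint32_t
logical_page(uint64_t off)
{
    return (uint32_t) (off / SKETCH_PAGE_SIZE);
}

/* Bytes of an `entry_size`-byte entry that fit on the current page
 * starting at offset `off`; any remainder must be written at the
 * start of the next logical page (the two-page spanning case). */
static uint64_t
bytes_on_first_page(uint64_t off, uint64_t entry_size)
{
    uint64_t room = SKETCH_PAGE_SIZE - off % SKETCH_PAGE_SIZE;
    return entry_size < room ? entry_size : room;
}
```

The bug the commit fixes corresponds to always assuming `bytes_on_first_page(off, entry_size) == entry_size`, which only holds while everything fits on the first page.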
- Use a posting-count threshold (tp_memtable_spill_threshold) instead of a memory-based threshold for buffer switching. This produces the same number of segments as serial build (7 vs. 163 before).
- Add compression support for posting blocks, matching serial build's compressed format. This reduces index size from 2.5GB to 1.2GB.
- Track total_postings in worker buffers for threshold checking.

Results on 8.8M MS MARCO docs:
- Serial: 282s, 1273 MB, 7 segments
- Parallel: 247s, 1222 MB, 7 segments (13% faster, 4% smaller)
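The switch condition itself is simple; the point is what it counts. A minimal sketch, with an illustrative struct (only `total_postings` is named in the commit; the rest is assumed): by comparing accumulated postings against `tp_memtable_spill_threshold` rather than bytes used, parallel workers spill at the same points as a serial build, which is why the segment counts come out identical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative worker buffer state; the real struct lives in shared
 * DSA memory and carries much more than this. */
typedef struct WorkerBufferSketch
{
    uint64_t total_postings;    /* incremented as documents are added */
} WorkerBufferSketch;

/* Switch to the other double-buffer half once the posting count
 * reaches the spill threshold (a GUC in the extension, passed in
 * here as a plain argument for the sketch). */
static bool
should_switch_buffer(const WorkerBufferSketch *buf, uint64_t spill_threshold)
{
    return buf->total_postings >= spill_threshold;
}
```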
Postgres caps parallel workers based on maintenance_work_mem, requiring 32MB per participant (workers + leader). With the default 64MB, only 1 worker can be launched regardless of max_parallel_maintenance_workers. Update the NOTICE message to inform users about both GUCs.
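The resulting worker count can be illustrated with a back-of-envelope calculation. This is a simplification of the planner behavior described above (not a copy of Postgres's internal logic): with 32MB needed per participant, including the leader, the launchable worker count is bounded by `maintenance_work_mem / 32MB - 1`, then clamped to `max_parallel_maintenance_workers`.

```c
/* Simplified model of the cap: maintenance_work_mem is in KB, as
 * Postgres stores it.  Each participant (leader + workers) needs
 * 32MB, so subtract one slot for the leader. */
static int
max_launchable_workers(int maintenance_work_mem_kb,
                       int max_parallel_maintenance_workers)
{
    int by_mem = maintenance_work_mem_kb / (32 * 1024) - 1;
    if (by_mem < 0)
        by_mem = 0;
    return by_mem < max_parallel_maintenance_workers
               ? by_mem
               : max_parallel_maintenance_workers;
}
```

With the default 64MB this yields exactly 1 worker no matter how high `max_parallel_maintenance_workers` is set, which is what motivates mentioning both GUCs in the NOTICE.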
- Fall back to serial build when only 1 parallel worker is available (1 worker provides no benefit over serial)
- Skip writing recovery docid pages during bulk index build (CREATE INDEX is atomic; if it fails, the index is discarded)
- Lower default expansion factor from 1.0 to 0.5 (BM25 indexes are typically 30-50% of heap size)
- Increase minimum pool pages from 1024 to 8192 (64MB) to handle small tables with page index overhead
- Update parallel build notice to mention maintenance_work_mem
1 worker still helps with read/write parallelism (worker scans while leader writes). Reverted the nworkers > 1 check back to nworkers > 0. Updated test expected output after confirming tests pass with both debug and release builds.
Use P_NEW (dynamic relation extension) instead of pre-allocating a pool of pages. This removes the `parallel_build_expansion_factor` GUC and simplifies the code significantly.

The page pool was originally needed because workers could not safely extend the relation (stale smgr cache). With the leader-merges architecture, only the leader writes segments, so P_NEW works correctly.

Changes:
- Remove pool pre-allocation and tp_preallocate_page_pool()
- Remove tp_pool_get_page() and write_page_index_from_pool()
- Remove parallel_build_expansion_factor GUC from mod.c
- Remove pool-related fields from TpParallelBuildShared
- Update benchmark workflow to remove obsolete GUC settings
- Use tp_segment_writer_init(), which extends the relation with P_NEW
- Use write_page_index(), which also extends with P_NEW

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Also add comments explaining the maintenance_work_mem requirement for parallel builds (32MB per worker + leader). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add NOTICE message showing workers launched during parallel build
- Expand parallel_build test from 1 to 10 test cases:
  - Serial build (0 workers)
  - 1, 2, and 4 worker configurations
  - Short documents, duplicate terms, NULL content
  - Post-build inserts and VACUUM
  - Custom BM25 parameters (k1, b)
  - Below-threshold serial fallback
- Use larger GitHub runner for parallel scaling benchmarks
- Remove stray benchmark output files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Mark parallel index builds as complete (v0.5.0)
- Add write concurrency as a future optimization target
- Update version to v0.5.0 and date to 2026-01-30
- Fix clang-format in build_parallel.c

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
8+ term queries are currently slow due to BMW block-index iteration limitations (documented in v0.3.0 section). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
v1.0.0 criteria:
- Feature freeze and QA push
- Benchmark validation with blog post
- Concurrent stress tests
- Operational readiness (pg_dump/restore, pg_upgrade, VACUUM, replication)
- Documentation improvements
- Dogfooding and soak testing

Future work:
- Boolean queries, background compaction
- Potential: multi-tenant support, positional queries, faceted optimization

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add NULL check for DSA string allocation to prevent a crash if memory is exhausted
- Fix memory leak in clear_worker_buffer: properly free DSA allocations (term strings and posting lists) before destroying hash tables

This ensures DSA memory is recycled when worker buffers are cleared, preventing pool exhaustion on large parallel builds.
- Use DSHASH_HANDLE_INVALID instead of 0 for handle comparisons
- Wrap snapshot registration in #if PG_VERSION_NUM >= 180000 (PG18+ only)
- Use TP_TRANCHE_BUILD_DSA instead of LWTRANCHE_PARALLEL_HASH_JOIN
- Remove unused fields: num_terms, memory_used, memory_budget_per_worker
- Remove unused TP_MEMORY_SLOP_FACTOR define
- Add CHECK_FOR_INTERRUPTS() in the worker wait loop for Ctrl-C handling
- Update NOTICE message to mention the maintenance_work_mem requirement
- Remove unused shared parameter from tp_merge_worker_buffers_to_segment()

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…n indexes

- Add note about many-token query optimization opportunity (buffered union)
- Add expression index support to planned features

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The planner requires maintenance_work_mem >= 64MB to enable parallel index builds. Document this requirement in the README, since builds silently fall back to serial mode otherwise.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Force-pushed from df1b3d5 to 2929025
Summary
Architecture rationale
The key insight is moving all write responsibilities to the leader node:
- Workers scan the heap in parallel and build memtables; the leader performs all disk writes sequentially, avoiding the fragmentation seen when workers wrote their own segments
- Indexing is mostly CPU-bound on to_tsvector calls and memtable-population operations, so centralizing writes doesn't bottleneck throughput

This architecture emerged from experimentation with several alternatives, informed by visualizations of page layout fragmentation.
Key changes
- Uses the same tp_memtable_spill_threshold (32M postings) as serial build

Benchmark (MS MARCO 8.8M docs)
3.1x faster with proper configuration. Index sizes now match exactly.
Note: Postgres requires maintenance_work_mem >= 64MB for parallel builds (the planner won't assign workers otherwise).

Testing

- make installcheck passes