
feat: parallel index build with leader-merges architecture #188

Merged
tjgreen42 merged 26 commits into main from parallel-build-leader-merges on Jan 31, 2026

Conversation

@tjgreen42 (Collaborator) commented Jan 29, 2026

Summary

  • Implements a parallel index build in which workers scan the heap and build memtables while the leader merges them and writes segments
  • Workers use double-buffered memtables in shared DSA memory to avoid blocking
  • Leader performs all writes using dynamic relation extension (P_NEW)
  • Serial and parallel builds now produce identical index output (same size, same format)

Architecture rationale

The key insight is moving all write responsibilities to the leader process:

  1. Sequential writes: Leader writes segments sequentially, avoiding the heavy fragmentation that arose with earlier approaches where workers wrote their own segments
  2. CPU-bound workload: Index building is mostly CPU-bound on the read path due to to_tsvector calls and memtable-population operations, so centralizing writes doesn't bottleneck throughput
  3. No followup compaction: Previous approaches required a compaction step after workers finished; this design produces well-ordered segments directly

This architecture emerged from experimentation with several alternatives, informed by visualizations of page layout fragmentation.

Key changes

  • Architecture: Workers fill memtables → signal ready → leader merges N buffers into single segment
  • Threshold matching: Uses same tp_memtable_spill_threshold (32M postings) as serial build
  • Compression: Applies delta+bitpacking compression to posting blocks (same as serial)
  • Page boundary handling: Fixed dictionary entry writing across page boundaries
  • Simplified page allocation: Uses P_NEW instead of pre-allocated pool (no GUC needed)
  • Skip recovery docids: Removed unnecessary docid writes during CREATE INDEX (atomic operation doesn't need recovery). This applies to both serial and parallel builds, so indexes are now identical.

Benchmark (MS MARCO 8.8M docs)

Configuration          Time   Index Size
Serial                 4:27   1190 MB
Parallel (4 workers)   1:26   1190 MB

3.1x faster with proper configuration. Index sizes now match exactly.

Note: Postgres requires maintenance_work_mem >= 64MB for parallel builds (the planner won't assign workers otherwise).
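Hypothetically, opting into a parallel build might look like the session below. The GUC values and the 32MB-per-participant budgeting come from this PR's notes; the table and index definition are illustrative placeholders, not taken from the project:

```sql
-- maintenance_work_mem must be >= 64MB or the planner assigns no workers;
-- Postgres budgets 32MB per participant (leader + workers).
SET maintenance_work_mem = '256MB';
SET max_parallel_maintenance_workers = 4;
CREATE INDEX docs_idx ON docs USING bm25 (content);  -- hypothetical table/AM name
```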

Testing

  • make installcheck passes
  • Tested on MS MARCO with 0, 1, 4 workers
  • Index sizes match between serial and parallel builds

@SteveLauC (Collaborator) commented:
Do we have a comparison between the old architecture (main branch) and this new one? From what I can see, the new architecture offers:

  1. Less random I/O, as the leader is the only writer and it writes sequentially (mostly)
  2. Fewer small tail segments
  3. Fewer chances to trigger compaction

@tjgreen42 (Collaborator, Author) replied:

> Do we have a comparison between the old architecture (main branch) and this new one? From what I can see, the new architecture offers:
>
>   1. Less random I/O, as the leader is the only writer and it writes sequentially (mostly)
>   2. Fewer small tail segments
>   3. Fewer chances to trigger compaction

The main gist is to move write responsibilities to the leader process in order to make the writes mostly sequential (as you point out) and, especially, to avoid the heavy fragmentation that arose with the previous approach, where workers wrote their own segments with a follow-up compaction step. Since we are mostly CPU-bound during indexing (and mostly on the read path) due to the calls to to_tsvector and memtable-population operations, this turns out to give pretty good speedups as the number of workers grows. I had to experiment quite a bit to hit on this approach (several other PRs in flight that I'm now going to close) and add visualizations to see what was happening with the fragmentation. Ironically, I intended the viz for human consumption, but the robot started using them immediately and became the main consumer.

@tjgreen42 force-pushed the parallel-build-leader-merges branch from 7dde6bc to 10fc78d on January 30, 2026 20:10
@SteveLauC (Collaborator) left a comment:

Did a brief review of the PR, sorry for the bunch of comments 🫠

@tjgreen42 (Collaborator, Author) replied:

> Did a brief review of the PR, sorry for the bunch of comments 🫠

Thanks much @SteveLauC for going through the PR, it's helpful. I'd like to chat irl if you're game, ping me at tj@tigerdata.com.

tjgreen42 and others added 24 commits January 31, 2026 09:03
Workers scan the heap in parallel and build memtables in shared DSA
memory using double-buffering. The leader reads worker memtables and
writes L0 segments, ensuring all disk I/O happens sequentially for
contiguous page allocation.

Key implementation details:
- Workers use dshash tables for term storage with custom hash/compare
  functions that support length-based lookups (no allocation needed
  for TSVector terms which aren't null-terminated)
- Leader pre-allocates a page pool to avoid smgr cache issues with
  concurrent page allocation
- LWLock tranches registered in _PG_init() for dshash synchronization
- Hash tables destroyed and recreated when buffers are reused to
  prevent duplicate document entries

Fixes during implementation:
- Added len field to TpStringKey for allocation-free lookups
- Fixed hash table clearing on buffer reuse (was causing 1.5x doc count)
- Registered custom LWLock tranches instead of using PARALLEL_HASH_JOIN
The final truncation after parallel build was incorrectly computing
max_used by only looking at segment data pages (seg + num_pages). This
missed page index pages which can be located at any block number.

Use tp_segment_collect_pages() to get ALL pages (data + page index)
for each segment when computing the maximum used block. This fixes
"could not read blocks" errors when querying parallel-built indexes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of writing separate L0 segments for each worker buffer and then
compacting them, the leader now:

1. Accumulates all worker buffer contents into a local memtable
2. Writes ONE merged segment when all workers are done

This produces a single compact segment similar to serial build, avoiding
the 3x index size overhead that resulted from writing many small segments.

The new approach:
- tp_accumulate_worker_buffer_to_local(): copies DSA buffer data to local
  memtable (palloc-based, private to leader process)
- tp_write_segment_from_local_memtable(): writes segment from merged data
- tp_leader_process_buffers(): orchestrates the accumulate-then-merge flow

Double-buffering is preserved: as worker buffers fill, leader accumulates
their contents and signals buffers consumed, allowing workers to continue.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace the intermediate local memtable accumulation approach with
direct streaming N-way merge from worker DSA buffers to disk segment.

Key changes:
- Add TpWorkerMergeSource to track position in each buffer's sorted
  term array during N-way merge
- Add collect_buffer_term_postings() for N-way merge of postings
- Add tp_merge_worker_buffers_to_segment() that streams directly from
  worker buffers to segment without intermediate copy
- Update tp_leader_process_buffers() to merge when nworkers buffers
  are ready (allowing fast workers to contribute both buffers while
  slow workers catch up)
- Remove local memtable accumulation code and related functions

This eliminates memory overhead from copying all worker data to a
local memtable before writing the segment.
…el build

The dictionary entry writing code in tp_merge_worker_buffers_to_segment()
assumed all entries fit on the first page. With large datasets (150K+ docs),
dictionary entries span multiple pages, causing "Invalid logical page" errors.

Fixed by:
1. Flushing segment writer before writing entries
2. Using page-boundary-aware code that calculates logical/physical pages
3. Handling entries that span two pages like merge.c does
- Use posting-count threshold (tp_memtable_spill_threshold) instead of
  memory-based threshold for buffer switching. This produces the same
  number of segments as serial build (7 vs 163 before).

- Add compression support for posting blocks, matching serial build's
  compressed format. This reduces index size from 2.5GB to 1.2GB.

- Track total_postings in worker buffers for threshold checking.

Results on 8.8M MS MARCO docs:
- Serial:   282s, 1273 MB, 7 segments
- Parallel: 247s, 1222 MB, 7 segments (13% faster, 4% smaller)
Postgres caps parallel workers based on maintenance_work_mem, requiring
32MB per participant (workers + leader). With the default 64MB, only 1
worker can be launched regardless of max_parallel_maintenance_workers.

Update the NOTICE message to inform users about both GUCs.
- Fall back to serial build when only 1 parallel worker available
  (1 worker provides no benefit over serial)
- Skip writing recovery docid pages during bulk index build
  (CREATE INDEX is atomic - if it fails, index is discarded)
- Lower default expansion factor from 1.0 to 0.5 (BM25 indexes are
  typically 30-50% of heap size)
- Increase minimum pool pages from 1024 to 8192 (64MB) to handle
  small tables with page index overhead
- Update parallel build notice to mention maintenance_work_mem
1 worker still helps with read/write parallelism (worker scans while
leader writes). Reverted the nworkers > 1 check back to nworkers > 0.

Updated test expected output after confirming tests pass with both
debug and release builds.
Use P_NEW (dynamic relation extension) instead of pre-allocating a pool
of pages. This removes the `parallel_build_expansion_factor` GUC and
simplifies the code significantly.

The page pool was originally needed because workers could not safely
extend the relation (stale smgr cache). With the leader-merges architecture,
only the leader writes segments, so P_NEW works correctly.

Changes:
- Remove pool pre-allocation and tp_preallocate_page_pool()
- Remove tp_pool_get_page() and write_page_index_from_pool()
- Remove parallel_build_expansion_factor GUC from mod.c
- Remove pool-related fields from TpParallelBuildShared
- Update benchmark workflow to remove obsolete GUC settings
- Use tp_segment_writer_init() which extends relation with P_NEW
- Use write_page_index() which also extends with P_NEW

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Also add comments explaining the maintenance_work_mem requirement
for parallel builds (32MB per worker + leader).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add NOTICE message showing workers launched during parallel build
- Expand parallel_build test from 1 to 10 test cases:
  - Serial build (0 workers)
  - 1, 2, and 4 worker configurations
  - Short documents, duplicate terms, NULL content
  - Post-build inserts and VACUUM
  - Custom BM25 parameters (k1, b)
  - Below-threshold serial fallback
- Use larger GitHub runner for parallel scaling benchmarks
- Remove stray benchmark output files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Mark parallel index builds as complete (v0.5.0)
- Add write concurrency as future optimization target
- Update version to v0.5.0 and date to 2026-01-30
- Fix clang-format in build_parallel.c

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
8+ term queries are currently slow due to BMW block-index iteration
limitations (documented in v0.3.0 section).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
v1.0.0 criteria:
- Feature freeze and QA push
- Benchmark validation with blog post
- Concurrent stress tests
- Operational readiness (pg_dump/restore, pg_upgrade, VACUUM, replication)
- Documentation improvements
- Dogfooding and soak testing

Future work:
- Boolean queries, background compaction
- Potential: multi-tenant support, positional queries, faceted optimization

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add NULL check for DSA string allocation to prevent crash if memory
  exhausted
- Fix memory leak in clear_worker_buffer: properly free DSA allocations
  (term strings and posting lists) before destroying hash tables

This ensures DSA memory is recycled when worker buffers are cleared,
preventing pool exhaustion on large parallel builds.
- Use DSHASH_HANDLE_INVALID instead of 0 for handle comparisons
- Wrap snapshot registration in #if PG_VERSION_NUM >= 180000 (PG18+ only)
- Use TP_TRANCHE_BUILD_DSA instead of LWTRANCHE_PARALLEL_HASH_JOIN
- Remove unused fields: num_terms, memory_used, memory_budget_per_worker
- Remove unused TP_MEMORY_SLOP_FACTOR define
- Add CHECK_FOR_INTERRUPTS() in worker wait loop for Ctrl-C handling
- Update NOTICE message to mention maintenance_work_mem requirement
- Remove unused shared parameter from tp_merge_worker_buffers_to_segment()

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
tjgreen42 and others added 2 commits January 31, 2026 09:03
…n indexes

- Add note about many-token query optimization opportunity (buffered union)
- Add expression index support to planned features

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The planner requires maintenance_work_mem >= 64MB to enable parallel
index builds. Document this requirement in README since builds silently
fall back to serial mode otherwise.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@tjgreen42 force-pushed the parallel-build-leader-merges branch from df1b3d5 to 2929025 on January 31, 2026 17:04
@tjgreen42 tjgreen42 merged commit daa99ed into main Jan 31, 2026
12 checks passed
@tjgreen42 tjgreen42 deleted the parallel-build-leader-merges branch January 31, 2026 18:14