feat: parallel index build with leader-merges architecture (#188)
Conversation
Do we have a comparison between the old architecture (main branch) and this new one? From what I can see, the new architecture offers:
The main gist is to move write responsibilities to the leader node in order to make the writes mostly sequential (as you point out) and, especially, to avoid the heavy fragmentation that arose with the previous approach, where workers wrote their own segments with a follow-up compaction step. Since we are mostly CPU-bound during indexing (and mostly on the read path) due to the calls to to_tsvector and memtable-population operations, this turns out to give pretty good speedups as the number of workers grows. I had to experiment quite a bit to hit on this approach (several other PRs in flight that I'm now going to close) and add visualizations to see what was happening with the fragmentation. Ironically, I intended the viz for human consumption, but the robot started using them immediately and became their main consumer.
Force-pushed from 7dde6bc to 10fc78d
SteveLauC left a comment:
Did a brief review of the PR, sorry for the bunch of comments 🫠
Thanks much @SteveLauC for going through the PR, it's helpful. I'd like to chat IRL if you're game; ping me at tj@tigerdata.com.
Workers scan the heap in parallel and build memtables in shared DSA memory using double-buffering. The leader reads worker memtables and writes L0 segments, ensuring all disk I/O happens sequentially for contiguous page allocation.

Key implementation details:
- Workers use dshash tables for term storage with custom hash/compare functions that support length-based lookups (no allocation needed for TSVector terms, which aren't null-terminated)
- Leader pre-allocates a page pool to avoid smgr cache issues with concurrent page allocation
- LWLock tranches registered in _PG_init() for dshash synchronization
- Hash tables destroyed and recreated when buffers are reused to prevent duplicate document entries

Fixes during implementation:
- Added len field to TpStringKey for allocation-free lookups
- Fixed hash table clearing on buffer reuse (was causing 1.5x doc count)
- Registered custom LWLock tranches instead of using PARALLEL_HASH_JOIN
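The length-based lookup trick above can be sketched in plain C. This is an illustrative model, not the extension's actual code: `TpStringKeySketch` stands in for the PR's `TpStringKey` with its added `len` field, and the two functions stand in for the custom dshash hash/compare callbacks. Because TSVector lexemes are not null-terminated, hashing and comparing by explicit length lets lookups run without copying each term into a null-terminated buffer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical length-prefixed key, modeled on TpStringKey's added
 * `len` field.  The data pointer is NOT null-terminated. */
typedef struct TpStringKeySketch
{
    uint32_t    len;    /* byte length of the term */
    const char *data;   /* raw term bytes, no trailing NUL */
} TpStringKeySketch;

/* FNV-1a over an explicit length (stand-in for the custom dshash
 * hash function): never calls strlen(), so no copy is needed. */
static uint32_t
tp_key_hash(const TpStringKeySketch *k)
{
    uint32_t h = 2166136261u;
    for (uint32_t i = 0; i < k->len; i++)
    {
        h ^= (unsigned char) k->data[i];
        h *= 16777619u;
    }
    return h;
}

/* Length-aware equality (stand-in for the custom dshash compare):
 * lengths must match before memcmp touches the bytes. */
static int
tp_key_eq(const TpStringKeySketch *a, const TpStringKeySketch *b)
{
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```

The payoff is that a key can point directly into the TSVector's in-place lexeme storage during a lookup.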
The final truncation after parallel build was incorrectly computing max_used by only looking at segment data pages (seg + num_pages). This missed page index pages, which can be located at any block number. Use tp_segment_collect_pages() to get ALL pages (data + page index) for each segment when computing the maximum used block. This fixes "could not read blocks" errors when querying parallel-built indexes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of writing separate L0 segments for each worker buffer and then compacting them, the leader now:
1. Accumulates all worker buffer contents into a local memtable
2. Writes ONE merged segment when all workers are done

This produces a single compact segment similar to serial build, avoiding the 3x index size overhead that resulted from writing many small segments.

The new approach:
- tp_accumulate_worker_buffer_to_local(): copies DSA buffer data to a local memtable (palloc-based, private to the leader process)
- tp_write_segment_from_local_memtable(): writes the segment from merged data
- tp_leader_process_buffers(): orchestrates the accumulate-then-merge flow

Double-buffering is preserved: as worker buffers fill, the leader accumulates their contents and signals buffers consumed, allowing workers to continue.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace the intermediate local memtable accumulation approach with a direct streaming N-way merge from worker DSA buffers to the disk segment.

Key changes:
- Add TpWorkerMergeSource to track the position in each buffer's sorted term array during the N-way merge
- Add collect_buffer_term_postings() for N-way merge of postings
- Add tp_merge_worker_buffers_to_segment(), which streams directly from worker buffers to the segment without an intermediate copy
- Update tp_leader_process_buffers() to merge when nworkers buffers are ready (allowing fast workers to contribute both buffers while slow workers catch up)
- Remove the local memtable accumulation code and related functions

This eliminates the memory overhead from copying all worker data to a local memtable before writing the segment.
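The streaming merge can be sketched as follows. This is a simplified model, not the extension's code: `MergeSource` stands in for `TpWorkerMergeSource`, and plain ints stand in for (term, postings) pairs. Each source is a cursor into one worker buffer's sorted term array; on every step the leader picks the source with the smallest current value, emits it, and advances that cursor, so output streams out without building an intermediate copy.

```c
#include <stddef.h>

/* Illustrative cursor over one worker buffer's sorted term array
 * (stand-in for TpWorkerMergeSource; real entries are terms with
 * posting lists, simplified here to ints). */
typedef struct MergeSource
{
    const int *terms;   /* sorted term array for one worker buffer */
    size_t     nterms;  /* number of entries in terms[] */
    size_t     pos;     /* current cursor position */
} MergeSource;

/* Stream an N-way merge of `nsrc` sorted sources into `out`
 * (which must hold the combined length); returns items written. */
static size_t
nway_merge(MergeSource *src, int nsrc, int *out)
{
    size_t n = 0;
    for (;;)
    {
        int best = -1;
        /* Linear scan for the smallest head; real code with many
         * sources might use a heap, but worker counts are small. */
        for (int i = 0; i < nsrc; i++)
        {
            if (src[i].pos >= src[i].nterms)
                continue;       /* this source is exhausted */
            if (best < 0 ||
                src[i].terms[src[i].pos] < src[best].terms[src[best].pos])
                best = i;
        }
        if (best < 0)
            break;              /* all sources exhausted */
        out[n++] = src[best].terms[src[best].pos++];
    }
    return n;
}
```

In the real flow the "emit" step appends the term's merged postings to the segment writer rather than to an array.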
…el build

The dictionary entry writing code in tp_merge_worker_buffers_to_segment() assumed all entries fit on the first page. With large datasets (150K+ docs), dictionary entries span multiple pages, causing "Invalid logical page" errors.

Fixed by:
1. Flushing the segment writer before writing entries
2. Using page-boundary-aware code that calculates logical/physical pages
3. Handling entries that span two pages, like merge.c does
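The page-boundary arithmetic behind fix 2 and 3 can be sketched with a simplified model. The assumptions here are mine, not the extension's actual layout: a segment is addressed as a run of fixed-size pages by logical page number, and an entry written at logical byte offset `off` may straddle the boundary into the next page, in which case the writer splits it.

```c
#include <stdint.h>

/* Assumed page size for the sketch (Postgres's default BLCKSZ). */
#define SKETCH_PAGE_SIZE 8192

/* Logical page number holding byte offset `off`. */
static uint32_t
logical_page(uint64_t off)
{
    return (uint32_t) (off / SKETCH_PAGE_SIZE);
}

/* Bytes of an `entry_size`-byte entry that fit on the current page
 * starting at offset `off`; any remainder must be written at the
 * start of the next logical page (the two-page spanning case). */
static uint64_t
bytes_on_first_page(uint64_t off, uint64_t entry_size)
{
    uint64_t room = SKETCH_PAGE_SIZE - off % SKETCH_PAGE_SIZE;
    return entry_size < room ? entry_size : room;
}
```

The bug the commit fixes corresponds to always assuming `bytes_on_first_page(off, entry_size) == entry_size`, which only holds while everything fits on the first page.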
- Use a posting-count threshold (tp_memtable_spill_threshold) instead of a memory-based threshold for buffer switching. This produces the same number of segments as serial build (7 vs. 163 before).
- Add compression support for posting blocks, matching serial build's compressed format. This reduces index size from 2.5GB to 1.2GB.
- Track total_postings in worker buffers for threshold checking.

Results on 8.8M MS MARCO docs:
- Serial: 282s, 1273 MB, 7 segments
- Parallel: 247s, 1222 MB, 7 segments (13% faster, 4% smaller)
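The switch condition itself is simple; the point is what it counts. A minimal sketch, with an illustrative struct (only `total_postings` is named in the commit; the rest is assumed): by comparing accumulated postings against `tp_memtable_spill_threshold` rather than bytes used, parallel workers spill at the same points as a serial build, which is why the segment counts come out identical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative worker buffer state; the real struct lives in shared
 * DSA memory and carries much more than this. */
typedef struct WorkerBufferSketch
{
    uint64_t total_postings;    /* incremented as documents are added */
} WorkerBufferSketch;

/* Switch to the other double-buffer half once the posting count
 * reaches the spill threshold (a GUC in the extension, passed in
 * here as a plain argument for the sketch). */
static bool
should_switch_buffer(const WorkerBufferSketch *buf, uint64_t spill_threshold)
{
    return buf->total_postings >= spill_threshold;
}
```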
Postgres caps parallel workers based on maintenance_work_mem, requiring 32MB per participant (workers + leader). With the default 64MB, only 1 worker can be launched regardless of max_parallel_maintenance_workers. Update the NOTICE message to inform users about both GUCs.
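The resulting worker count can be illustrated with a back-of-envelope calculation. This is a simplification of the planner behavior described above (not a copy of Postgres's internal logic): with 32MB needed per participant, including the leader, the launchable worker count is bounded by `maintenance_work_mem / 32MB - 1`, then clamped to `max_parallel_maintenance_workers`.

```c
/* Simplified model of the cap: maintenance_work_mem is in KB, as
 * Postgres stores it.  Each participant (leader + workers) needs
 * 32MB, so subtract one slot for the leader. */
static int
max_launchable_workers(int maintenance_work_mem_kb,
                       int max_parallel_maintenance_workers)
{
    int by_mem = maintenance_work_mem_kb / (32 * 1024) - 1;
    if (by_mem < 0)
        by_mem = 0;
    return by_mem < max_parallel_maintenance_workers
               ? by_mem
               : max_parallel_maintenance_workers;
}
```

With the default 64MB this yields exactly 1 worker no matter how high `max_parallel_maintenance_workers` is set, which is what motivates mentioning both GUCs in the NOTICE.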
- Fall back to serial build when only 1 parallel worker is available (1 worker provides no benefit over serial)
- Skip writing recovery docid pages during bulk index build (CREATE INDEX is atomic; if it fails, the index is discarded)
- Lower default expansion factor from 1.0 to 0.5 (BM25 indexes are typically 30-50% of heap size)
- Increase minimum pool pages from 1024 to 8192 (64MB) to handle small tables with page index overhead
- Update parallel build notice to mention maintenance_work_mem
1 worker still helps with read/write parallelism (worker scans while leader writes). Reverted the nworkers > 1 check back to nworkers > 0. Updated test expected output after confirming tests pass with both debug and release builds.
Use P_NEW (dynamic relation extension) instead of pre-allocating a pool of pages. This removes the `parallel_build_expansion_factor` GUC and simplifies the code significantly.

The page pool was originally needed because workers could not safely extend the relation (stale smgr cache). With the leader-merges architecture, only the leader writes segments, so P_NEW works correctly.

Changes:
- Remove pool pre-allocation and tp_preallocate_page_pool()
- Remove tp_pool_get_page() and write_page_index_from_pool()
- Remove parallel_build_expansion_factor GUC from mod.c
- Remove pool-related fields from TpParallelBuildShared
- Update benchmark workflow to remove obsolete GUC settings
- Use tp_segment_writer_init(), which extends the relation with P_NEW
- Use write_page_index(), which also extends with P_NEW

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Also add comments explaining the maintenance_work_mem requirement for parallel builds (32MB per worker + leader). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add NOTICE message showing workers launched during parallel build
- Expand parallel_build test from 1 to 10 test cases:
  - Serial build (0 workers)
  - 1, 2, and 4 worker configurations
  - Short documents, duplicate terms, NULL content
  - Post-build inserts and VACUUM
  - Custom BM25 parameters (k1, b)
  - Below-threshold serial fallback
- Use larger GitHub runner for parallel scaling benchmarks
- Remove stray benchmark output files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Mark parallel index builds as complete (v0.5.0)
- Add write concurrency as a future optimization target
- Update version to v0.5.0 and date to 2026-01-30
- Fix clang-format in build_parallel.c

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
8+ term queries are currently slow due to BMW block-index iteration limitations (documented in v0.3.0 section). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
v1.0.0 criteria:
- Feature freeze and QA push
- Benchmark validation with blog post
- Concurrent stress tests
- Operational readiness (pg_dump/restore, pg_upgrade, VACUUM, replication)
- Documentation improvements
- Dogfooding and soak testing

Future work:
- Boolean queries, background compaction
- Potential: multi-tenant support, positional queries, faceted optimization

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add NULL check for DSA string allocation to prevent a crash if memory is exhausted
- Fix memory leak in clear_worker_buffer: properly free DSA allocations (term strings and posting lists) before destroying hash tables

This ensures DSA memory is recycled when worker buffers are cleared, preventing pool exhaustion on large parallel builds.
- Use DSHASH_HANDLE_INVALID instead of 0 for handle comparisons
- Wrap snapshot registration in #if PG_VERSION_NUM >= 180000 (PG18+ only)
- Use TP_TRANCHE_BUILD_DSA instead of LWTRANCHE_PARALLEL_HASH_JOIN
- Remove unused fields: num_terms, memory_used, memory_budget_per_worker
- Remove unused TP_MEMORY_SLOP_FACTOR define
- Add CHECK_FOR_INTERRUPTS() in the worker wait loop for Ctrl-C handling
- Update NOTICE message to mention the maintenance_work_mem requirement
- Remove unused shared parameter from tp_merge_worker_buffers_to_segment()

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…n indexes

- Add note about many-token query optimization opportunity (buffered union)
- Add expression index support to planned features

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The planner requires maintenance_work_mem >= 64MB to enable parallel index builds. Document this requirement in the README, since builds silently fall back to serial mode otherwise.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Force-pushed from df1b3d5 to 2929025
Summary
Architecture rationale
The key insight is moving all write responsibilities to the leader node:
- Workers scan the heap in parallel and build memtables; the leader performs all disk writes sequentially, avoiding the fragmentation seen when workers wrote their own segments
- Indexing is mostly CPU-bound on to_tsvector calls and memtable-population operations, so centralizing writes doesn't bottleneck throughput

This architecture emerged from experimentation with several alternatives, informed by visualizations of page layout fragmentation.
Key changes
- Uses the same tp_memtable_spill_threshold (32M postings) as serial build

Benchmark (MS MARCO 8.8M docs)
3.1x faster with proper configuration. Index sizes now match exactly.
Note: Postgres requires maintenance_work_mem >= 64MB for parallel builds (the planner won't assign workers otherwise).

Testing

- make installcheck passes