perf: parallel ISA-L compression and pwrite output pipeline by KimYannn · Pull Request #3 · KimYannn/fastp

KimYannn · 2026-03-16T05:56:44Z

Summary

Parallel ISA-L gzip compression: Worker threads compress output in parallel using ISA-L, with flight batch management to coordinate ordering, replacing the serial single-writer-thread bottleneck
Parallel pwrite output: Workers write directly via pwrite() using a lock-free offset ring buffer, eliminating the writer thread as a serialization point
Bounded mismatch counting: Early-exit overlap analysis rejects non-overlapping offsets after comparing just the first few dozen bytes instead of the full 150bp (~5-10x reduction in comparison work)
Adaptive timeout with spin backoff: EMA-based ingress rate estimation auto-tunes flush timeout; progressive backoff replaces tight spin-wait in pwrite offset coordination
Decoupled FASTQ reader/writer with zstd support: Reader and writer paths are decoupled, with zstd input/output support and seekable output with autotuned workers
Pipeline backpressure and writer flight control: Improved flow control between pipeline stages
AArch64 static linking fix: Removed full -static linking that caused ISA-L relocation overflow on ARM64
End-to-end benchmark suite (scripts/bench_e2e.py): PE/SE/stdin-stdout modes, hardware profiling, dual-mode (baseline vs optimized) comparison with JSON merge capability

Key Performance Improvements

Mode	Before	After	Speedup
PE fq→fq (-w 12)	8.21s	6.33s	1.30x
PE fq→gz (parallel ISA-L)	—	significant	eliminates writer bottleneck

Changed Files

Core: peprocessor.cpp/h, seprocessor.cpp/h, writerthread.cpp/h, writer.cpp/h
New: isal_compress.h, flight_batch_manager.cpp/h, trace_profiler.cpp/h
Build: Makefile, .github/workflows/ci.yml
Bench: scripts/bench_e2e.py (replaces bench_e2e.sh)
Docs: 3 RFCs in docs/rfc/, 2 benchmark results in docs/bench/

Test Plan

CI passes on Linux (x86_64) and macOS (ARM64)
E2E benchmark: python scripts/bench_e2e.py ./fastp_baseline ./fastp_opt
MD5 verification of output correctness across all modes (PE/SE, fq/gz/zst)
Memory usage (peak RSS) does not regress significantly
Verify pwrite gz output is byte-identical to serial writer output

🤖 Generated with Claude Code

…atching

Move gzip compression from single-threaded Writer (libdeflate) into N worker threads using ISA-L isal_deflate_stateless. Each worker accumulates raw output into a 512KB batch buffer before compressing, balancing compression ratio against latency. Writer thread receives pre-compressed data and performs raw fwrite only.

Key changes:
- Add isal_compress.h: thread-safe ISA-L gzip wrapper (levels 0-3)
- Writer: add preCompressed flag to skip libdeflate when data is already compressed
- WriterThread: auto-detect .gz output, compress in worker context with per-thread accumulation buffers, flush on input completion
- Add bench_gz_compress.sh for A/B benchmarking

Benchmarks (10M PE reads, Apple M4 Pro):
fq→gz 14t: 2.99x speedup (28.0s → 9.4s), +8.5% output size
gz→gz 14t: 1.93x speedup (3.0s → 1.6s), +8.5% output size
gz→gz 4t: 1.20x speedup (2.8s → 2.3s), +8.4% output size

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
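The per-worker accumulation described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in writerthread.cpp: `BatchAccumulator`, its `Compressor` and `Sink` callbacks, and the constructor parameters are hypothetical names, and the compressor is passed in as a callback so the sketch builds without ISA-L. In the PR, that role is filled by the isal_deflate_stateless wrapper in isal_compress.h.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical per-worker accumulator: buffers raw FASTQ text and hands it to
// a compressor once the batch reaches a size threshold (512KB in the commit).
class BatchAccumulator {
public:
    using Compressor = std::function<std::string(const std::string&)>;
    using Sink = std::function<void(std::string&&)>;

    BatchAccumulator(std::size_t threshold, Compressor compress, Sink sink)
        : threshold_(threshold), compress_(std::move(compress)), sink_(std::move(sink)) {
        assert(threshold_ > 0);
        buffer_.reserve(threshold_);
    }

    // Append raw output; compress and emit once the threshold is reached.
    void append(const std::string& raw) {
        buffer_.append(raw);
        if (buffer_.size() >= threshold_) flush();
    }

    // Emit whatever is buffered, e.g. on input completion.
    void flush() {
        if (buffer_.empty()) return;
        sink_(compress_(buffer_));  // one compressed block per batch
        buffer_.clear();
    }

private:
    std::size_t threshold_;
    Compressor compress_;
    Sink sink_;
    std::string buffer_;
};
```

Larger batches give the compressor more context per block, which helps the ratio, while smaller batches reach the writer sooner. That is the ratio-versus-latency trade-off the commit message refers to.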

Replace countMismatches with countMismatchesBounded in the overlap detection loop, enabling early termination when the mismatch count exceeds the threshold. Also removes the now-redundant prefix-only fallback check.

Most non-overlapping offsets are rejected after comparing just the first few dozen bytes instead of the full 150bp, reducing total comparison work by ~5-10x.

Benchmark (10M PE reads, fq→fq, -w 12): 8.21s → 6.33s (1.30x speedup, 23% less wall time)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
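A minimal sketch of the bounded count, assuming a plain byte-wise comparison. Only the function name comes from the commit; the signature and the return convention here are assumptions, and the version in src/overlapanalysis.cpp may differ (for example, by comparing several bytes at a time).

```cpp
#include <cassert>
#include <cstddef>

// Counts mismatching bytes between a and b over len bytes, but stops as soon
// as the count exceeds maxMismatches. The result is exact when it is
// <= maxMismatches; otherwise it is maxMismatches + 1, meaning "too many".
inline int countMismatchesBounded(const char* a, const char* b,
                                  std::size_t len, int maxMismatches) {
    assert(maxMismatches >= 0);
    int mismatches = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (a[i] != b[i] && ++mismatches > maxMismatches) {
            return mismatches;  // early exit: this offset cannot be an overlap
        }
    }
    return mismatches;
}
```

For a candidate offset where the reads do not overlap, roughly three in four bases differ by chance, so the threshold is crossed within a handful of positions and the rest of the read is never touched.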

Worker threads now write directly via pwrite() using a lock-free offset ring buffer for coordination, eliminating the serial writer thread as a bottleneck for multi-threaded output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
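One way such offset coordination could look is sketched below. This is an assumption-laden illustration, not the PR's implementation: `OffsetRing`, `pwriteAll`, `writePack`, and the slot count are hypothetical, and the sketch publishes the next pack's offset from the planned size before writing, which is only safe if a failed write is treated as fatal.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>
#include <unistd.h>

// Hypothetical ring of start offsets: slot[seq % kSlots] holds the file offset
// at which pack `seq` must be written, or -1 while that offset is unknown.
class OffsetRing {
public:
    static constexpr std::size_t kSlots = 64;

    OffsetRing() {
        for (auto& s : slots_) s.store(-1, std::memory_order_relaxed);
        slots_[0].store(0, std::memory_order_release);  // pack 0 starts at offset 0
    }

    // Wait for this pack's start offset, then publish the next pack's offset
    // so the following worker can proceed while this one is still writing.
    std::int64_t claim(std::uint64_t seq, std::size_t size) {
        auto& mine = slots_[seq % kSlots];
        std::int64_t start;
        while ((start = mine.load(std::memory_order_acquire)) < 0) {
            std::this_thread::yield();
        }
        mine.store(-1, std::memory_order_relaxed);  // free the slot for seq + kSlots
        slots_[(seq + 1) % kSlots].store(start + static_cast<std::int64_t>(size),
                                         std::memory_order_release);
        return start;
    }

private:
    std::atomic<std::int64_t> slots_[kSlots];
};

// Write the whole buffer at a fixed offset, retrying on EINTR and short writes.
inline bool pwriteAll(int fd, const char* data, std::size_t size, std::int64_t offset) {
    assert(fd >= 0);
    std::size_t done = 0;
    while (done < size) {
        ssize_t n = ::pwrite(fd, data + done, size - done,
                             static_cast<off_t>(offset + static_cast<std::int64_t>(done)));
        if (n <= 0) {
            if (n < 0 && errno == EINTR) continue;
            return false;
        }
        done += static_cast<std::size_t>(n);
    }
    return true;
}

// What one worker does for one pack: claim a byte range, then write into it.
inline bool writePack(int fd, OffsetRing& ring, std::uint64_t seq, const std::string& pack) {
    std::int64_t start = ring.claim(seq, pack.size());
    return pwriteAll(fd, pack.data(), pack.size(), start);
}
```

With interleaved numbering, worker w of N would call `writePack` for packs w, w+N, w+2N, and so on. Only the offset handoff is serialized; the `pwrite()` calls themselves run concurrently on disjoint byte ranges.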

…ckoff

Three changes to the pwrite parallel writer:

1. Fix gz output corruption in pwrite mode: cross-pack accumulation (flight batching) was incompatible with interleaved sequence numbering. Worker 0 accumulated packs 0,3,6,... into one batch, placing pack 0+3 data at pack 3's file offset, before packs 1 and 2. Now each pack is compressed individually in pwrite mode, preserving correct ordering.
2. Add adaptive timeout for legacy mode flight batching (single-thread gz): EMA-based ingress rate estimation auto-tunes a flush timeout (2ms–50ms) so partial buffers don't wait indefinitely for the 512KB size threshold.
3. Replace tight spin-wait with progressive backoff in pwrite offset coordination: 32 iterations of arch-specific pause/yield hints, then std::this_thread::yield(). Eliminates -w 12 fq→fq regression (was 0.7x baseline, now 1.24x) and doubles fq→gz throughput at high thread counts.

Also fixes silent pwrite error handling: EINTR retry, fatal exit on failure, and offset published by actual bytes written (not planned size).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
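Changes 2 and 3 can be sketched as below. The 2ms–50ms clamp, the 32-iteration spin phase, and the fallback to std::this_thread::yield() come from the commit message. Everything else is an assumption made for illustration: the class and function names, the smoothing factor `alpha`, and in particular the policy of setting the timeout to the time a full batch should take to arrive at the estimated rate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

constexpr double kMinTimeoutMs = 2.0;   // lower bound from the commit message
constexpr double kMaxTimeoutMs = 50.0;  // upper bound from the commit message

// Hypothetical estimator: tracks ingress rate (bytes per millisecond) with an
// exponential moving average and derives a flush timeout from it.
class AdaptiveFlushTimeout {
public:
    AdaptiveFlushTimeout(std::size_t batchBytes, double alpha)
        : batchBytes_(batchBytes), alpha_(alpha) {
        assert(alpha > 0.0 && alpha <= 1.0);
    }

    // Record that `bytes` arrived over `elapsedMs` milliseconds.
    void observe(std::size_t bytes, double elapsedMs) {
        if (elapsedMs <= 0.0) return;
        double rate = static_cast<double>(bytes) / elapsedMs;
        rateEma_ = seeded_ ? alpha_ * rate + (1.0 - alpha_) * rateEma_ : rate;
        seeded_ = true;
    }

    // Assumed policy: wait about as long as a full batch should take to
    // arrive at the estimated rate, clamped to [kMinTimeoutMs, kMaxTimeoutMs].
    double timeoutMs() const {
        if (!seeded_ || rateEma_ <= 0.0) return kMaxTimeoutMs;
        double fillMs = static_cast<double>(batchBytes_) / rateEma_;
        return std::min(std::max(fillMs, kMinTimeoutMs), kMaxTimeoutMs);
    }

private:
    std::size_t batchBytes_;
    double alpha_;
    double rateEma_ = 0.0;
    bool seeded_ = false;
};

// One architecture-specific "relax" hint for spin loops.
inline void cpuRelax() {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#elif defined(__aarch64__)
    __asm__ __volatile__("yield");
#else
    std::this_thread::yield();
#endif
}

// Progressive backoff: 32 cheap pause hints, then yield to the scheduler.
template <typename Ready>
void waitWithBackoff(Ready ready) {
    int spins = 0;
    while (!ready()) {
        if (spins < 32) {
            cpuRelax();
            ++spins;
        } else {
            std::this_thread::yield();
        }
    }
}
```

The spin phase keeps latency low when the previous pack's offset arrives almost immediately. The yield phase stops a waiting worker from occupying a core that another worker needs, which is a plausible cause of the 0.7x regression reported at -w 12.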

- E2E benchmark script: PE/SE/stdin-stdout modes, auto thread allocation
- No-baseline mode (--orig optional): compact table for opt-only runs
- System info: CPU topology (P/E cores), memory type, disk specs, load avg
- Peak RSS tracking via os.wait4(), cross-platform (macOS + Linux)
- MD5 verification of output correctness across orig/opt binaries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

KimYannn and others added 12 commits March 3, 2026 22:47

fix: remove -static linking for AArch64 ISA-L relocation compatibility

8b7f030

ISA-L AArch64 assembly uses LDR literal with ±1MB range that overflows under fully static linking. Use same .a-file linking strategy across all platforms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add --merge to combine two opt-only bench JSONs into comparison

7620ab4

Allows running baseline and optimized benchmarks separately, then merging the two result JSONs into a single comparison table. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Improve pipeline backpressure/autotune and writer flight control

e6d760d

Decouple FASTQ reader/writer and add zstd in/out support

6da98ec

Switch zst backend to seekable output with autotuned workers

a349763

Refine thread budget split and writer-path autotuning

0bf314f

Merge branch 'master' into feat/parallel-isal-compress

c55ed05

# Conflicts: # README.md # src/main.cpp # src/overlapanalysis.cpp

KimYannn closed this Mar 16, 2026

KimYannn wants to merge 12 commits into master from feat/parallel-isal-compress
