perf: parallel ISA-L compression and pwrite output pipeline #3

Closed
KimYannn wants to merge 12 commits into master from feat/parallel-isal-compress

Conversation

@KimYannn
Owner

Summary

  • Parallel ISA-L gzip compression: Worker threads compress output in parallel using ISA-L, with flight batch management to coordinate ordering, replacing the serial single-writer-thread bottleneck
  • Parallel pwrite output: Workers write directly via pwrite() using a lock-free offset ring buffer, eliminating the writer thread as a serialization point
  • Bounded mismatch counting: Early-exit overlap analysis rejects non-overlapping offsets after comparing just the first few dozen bytes instead of the full 150bp (~5-10x reduction in comparison work)
  • Adaptive timeout with spin backoff: EMA-based ingress rate estimation auto-tunes flush timeout; progressive backoff replaces tight spin-wait in pwrite offset coordination
  • Decoupled FASTQ reader/writer with zstd support: Reader and writer paths are decoupled, with zstd input/output support and seekable output with autotuned workers
  • Pipeline backpressure and writer flight control: Improved flow control between pipeline stages
  • AArch64 static linking fix: Removed full -static linking that caused ISA-L relocation overflow on ARM64
  • End-to-end benchmark suite (scripts/bench_e2e.py): PE/SE/stdin-stdout modes, hardware profiling, dual-mode (baseline vs optimized) comparison with JSON merge capability

Key Performance Improvements

Mode | Before | After | Speedup
PE fq→fq (-w 12) | 8.21s | 6.33s | 1.30x
PE fq→gz (parallel ISA-L) | - | - | significant (eliminates writer bottleneck)

Changed Files

  • Core: peprocessor.cpp/h, seprocessor.cpp/h, writerthread.cpp/h, writer.cpp/h
  • New: isal_compress.h, flight_batch_manager.cpp/h, trace_profiler.cpp/h
  • Build: Makefile, .github/workflows/ci.yml
  • Bench: scripts/bench_e2e.py (replaces bench_e2e.sh)
  • Docs: 3 RFCs in docs/rfc/, 2 benchmark results in docs/bench/

Test Plan

  • CI passes on Linux (x86_64) and macOS (ARM64)
  • E2E benchmark: python scripts/bench_e2e.py ./fastp_baseline ./fastp_opt
  • MD5 verification of output correctness across all modes (PE/SE, fq/gz/zst)
  • Memory usage (peak RSS) does not regress significantly
  • Verify pwrite gz output is byte-identical to serial writer output

🤖 Generated with Claude Code

KimYannn and others added 12 commits March 3, 2026 22:47
…atching

Move gzip compression from single-threaded Writer (libdeflate) into N
worker threads using ISA-L isal_deflate_stateless. Each worker accumulates
raw output into a 512KB batch buffer before compressing, balancing
compression ratio against latency. Writer thread receives pre-compressed
data and performs raw fwrite only.

Key changes:
- Add isal_compress.h: thread-safe ISA-L gzip wrapper (levels 0-3)
- Writer: add preCompressed flag to skip libdeflate when data is
  already compressed
- WriterThread: auto-detect .gz output, compress in worker context
  with per-thread accumulation buffers, flush on input completion
- Add bench_gz_compress.sh for A/B benchmarking

Benchmarks (10M PE reads, Apple M4 Pro):
  fq→gz 14t: 2.99x speedup (28.0s → 9.4s), +8.5% output size
  gz→gz 14t: 1.93x speedup (3.0s → 1.6s), +8.5% output size
  gz→gz  4t: 1.20x speedup (2.8s → 2.3s), +8.4% output size

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
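The per-worker batching described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class name BatchAccumulator and its Sink callback are hypothetical, and the sink stands in for whatever calls isal_deflate_stateless and hands the compressed bytes to the writer. In the PR the threshold would be 512KB.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical per-worker accumulator: collects raw output text and hands a
// full batch to a sink callback once the size threshold is reached. In the
// real pipeline the sink would compress the batch; here it is left abstract.
class BatchAccumulator {
public:
    using Sink = std::function<void(const std::string& rawBatch)>;

    BatchAccumulator(std::size_t threshold, Sink sink)
        : threshold_(threshold), sink_(std::move(sink)) {
        assert(threshold_ > 0);
        buffer_.reserve(threshold_);
    }

    // Append one chunk of raw output; flush when the batch is large enough.
    void append(const std::string& chunk) {
        buffer_ += chunk;
        if (buffer_.size() >= threshold_) flush();
    }

    // Emit whatever is buffered. Also called once at end of input so the
    // partial tail batch is not lost.
    void flush() {
        if (buffer_.empty()) return;
        sink_(buffer_);
        buffer_.clear();
    }

    std::size_t pending() const { return buffer_.size(); }

private:
    std::size_t threshold_;
    Sink sink_;
    std::string buffer_;
};
```

Each batch is compressed on its own, so the output is a sequence of independent gzip members. The gzip format allows members to be concatenated into one valid file, which is what lets workers compress without sharing state.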
Replace countMismatches with countMismatchesBounded in the overlap
detection loop, enabling early termination when mismatch count exceeds
the threshold. Also removes the now-redundant prefix-only fallback check.

Most non-overlapping offsets are rejected after comparing just the first
few dozen bytes instead of the full 150bp, reducing total comparison
work by ~5-10x.

Benchmark (10M PE reads, fq→fq, -w 12): 8.21s → 6.33s (1.30x speedup, 23% less wall time)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
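For readers unfamiliar with the technique, a bounded mismatch count might look like the sketch below. The signature is an assumption; the PR names countMismatchesBounded but does not show it, and the real implementation may be vectorized.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical signature, for illustration only.
// Returns the number of mismatching positions, but stops as soon as the
// count exceeds maxMismatches, so the result is at most maxMismatches + 1.
inline std::size_t countMismatchesBounded(const char* a, const char* b,
                                          std::size_t len,
                                          std::size_t maxMismatches) {
    assert(a != nullptr && b != nullptr);
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (a[i] != b[i] && ++mismatches > maxMismatches) {
            return mismatches;  // early exit: this offset is already rejected
        }
    }
    return mismatches;
}
```

A caller treats any return value greater than maxMismatches as "rejected". For two unrelated reads most positions differ, so the loop exits after a handful of bases rather than scanning the full read.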
Worker threads now write directly via pwrite() using a lock-free offset
ring buffer for coordination, eliminating the serial writer thread as a
bottleneck for multi-threaded output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
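One way to coordinate ordered pwrite() offsets without a lock is sketched below. It is an assumption about the general shape, not the PR's data structure: the OffsetRing name, the claim() method, and the slot layout are all hypothetical. Each pack waits for its own slot to be filled by the previous pack, then publishes the start offset of the next pack.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Hypothetical offset ring: slot (seq % N) holds the file offset at which
// pack `seq` must be written, or kUnset until the previous pack publishes it.
// Assumes fewer than N packs are in flight at once, so two waiting packs
// never share a slot.
template <std::size_t N>
class OffsetRing {
public:
    static constexpr std::int64_t kUnset = -1;

    OffsetRing() {
        for (auto& s : slots_) s.store(kUnset, std::memory_order_relaxed);
        slots_[0].store(0, std::memory_order_release);  // pack 0 starts at 0
    }

    // Returns the file offset for pack `seq` and publishes the offset of
    // pack seq + 1. Waits (yielding) until pack seq - 1 has published.
    std::int64_t claim(std::uint64_t seq, std::int64_t size) {
        assert(size >= 0);
        std::atomic<std::int64_t>& mine = slots_[seq % N];
        std::int64_t off;
        while ((off = mine.load(std::memory_order_acquire)) == kUnset) {
            std::this_thread::yield();
        }
        mine.store(kUnset, std::memory_order_relaxed);  // free slot for seq + N
        slots_[(seq + 1) % N].store(off + size, std::memory_order_release);
        return off;
    }

private:
    std::atomic<std::int64_t> slots_[N];
};
```

A worker would call claim(seq, compressedSize) and then pwrite() its buffer at the returned offset. Only the offset handoff is serialized; the writes themselves proceed in parallel.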
…ckoff

Three changes to the pwrite parallel writer:

1. Fix gz output corruption in pwrite mode: cross-pack accumulation
   (flight batching) was incompatible with interleaved sequence numbering.
   Worker 0 accumulated packs 0,3,6,... into one batch, placing pack 0+3
   data at pack 3's file offset — before packs 1 and 2. Now each pack is
   compressed individually in pwrite mode, preserving correct ordering.

2. Add adaptive timeout for legacy mode flight batching (single-thread gz):
   EMA-based ingress rate estimation auto-tunes a flush timeout (2ms–50ms)
   so partial buffers don't wait indefinitely for the 512KB size threshold.

3. Replace tight spin-wait with progressive backoff in pwrite offset
   coordination: 32 iterations of arch-specific pause/yield hints, then
   std::this_thread::yield(). Eliminates -w 12 fq→fq regression (was 0.7x
   baseline, now 1.24x) and doubles fq→gz throughput at high thread counts.
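Items 2 and 3 can be illustrated with the sketch below. Everything here is hypothetical except the 2ms and 50ms bounds and the 32-iteration spin limit, which come from the commit message. In particular, deriving the timeout as "time to fill the rest of the buffer at the estimated rate" is an assumption; the commit does not say how the rate maps to a timeout. The pause hints assume GCC or Clang.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Bounds quoted in the commit message.
constexpr double kMinFlushMs = 2.0;
constexpr double kMaxFlushMs = 50.0;

// Hypothetical EMA-based estimator of ingress rate (bytes per millisecond).
class AdaptiveFlushTimeout {
public:
    explicit AdaptiveFlushTimeout(double alpha = 0.2) : alpha_(alpha) {
        assert(alpha_ > 0.0 && alpha_ <= 1.0);
    }

    // Fold one observation (bytes received over elapsedMs) into the EMA.
    void observe(double bytes, double elapsedMs) {
        if (elapsedMs <= 0.0) return;
        const double rate = bytes / elapsedMs;
        rate_ = seeded_ ? alpha_ * rate + (1.0 - alpha_) * rate_ : rate;
        seeded_ = true;
    }

    // Assumed policy: expected time to fill the remaining buffer space,
    // clamped to [kMinFlushMs, kMaxFlushMs].
    double timeoutMs(double remainingBytes) const {
        if (!seeded_ || rate_ <= 0.0) return kMaxFlushMs;
        return std::min(kMaxFlushMs, std::max(kMinFlushMs, remainingBytes / rate_));
    }

private:
    double alpha_;
    double rate_ = 0.0;
    bool seeded_ = false;
};

// Architecture-specific spin hint (GCC/Clang builtins and inline asm).
inline void cpuRelax() {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#elif defined(__aarch64__)
    __asm__ __volatile__("yield");
#endif
}

// Spin with pause hints for spinLimit iterations, then fall back to yield().
template <typename Pred>
void waitWithBackoff(Pred ready, int spinLimit = 32) {
    int spins = 0;
    while (!ready()) {
        if (spins < spinLimit) {
            cpuRelax();
            ++spins;
        } else {
            std::this_thread::yield();
        }
    }
}
```

The short spin phase keeps latency low when the predecessor publishes almost immediately, while the yield fallback stops waiting workers from starving the thread they depend on when there are more threads than cores.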

Also fixes silent pwrite error handling: EINTR retry, fatal exit on
failure, and offset published by actual bytes written (not planned size).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
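The error-handling fix can be pictured with a helper like the one below. The name pwriteFully is hypothetical and the PR's actual code is not shown; this only illustrates retrying on EINTR, continuing after short writes, and reporting the bytes actually written so the caller can publish the real offset or exit on failure.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: writes `len` bytes at `offset`, retrying on EINTR and
// continuing after short writes. Returns the number of bytes actually
// written; a value below `len` means failure, with errno set by pwrite().
inline std::size_t pwriteFully(int fd, const char* buf, std::size_t len,
                               off_t offset) {
    assert(buf != nullptr || len == 0);
    std::size_t done = 0;
    while (done < len) {
        const ssize_t n = ::pwrite(fd, buf + done, len - done,
                                   offset + static_cast<off_t>(done));
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted: retry the same range
            break;                         // real error: let the caller decide
        }
        if (n == 0) break;                 // no progress: avoid looping forever
        done += static_cast<std::size_t>(n);
    }
    return done;
}
```

A caller would compare the return value with the planned size, treat a shortfall as fatal, and advance the shared offset by the returned count rather than by the planned size.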
- E2E benchmark script: PE/SE/stdin-stdout modes, auto thread allocation
- No-baseline mode (--orig optional): compact table for opt-only runs
- System info: CPU topology (P/E cores), memory type, disk specs, load avg
- Peak RSS tracking via os.wait4(), cross-platform (macOS + Linux)
- MD5 verification of output correctness across orig/opt binaries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ISA-L AArch64 assembly uses LDR literal with ±1MB range that overflows
under fully static linking. Use same .a-file linking strategy across
all platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Allows running baseline and optimized benchmarks separately, then
merging the two result JSONs into a single comparison table.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts:
#	README.md
#	src/main.cpp
#	src/overlapanalysis.cpp
@KimYannn KimYannn closed this Mar 16, 2026