perf: parallel ISA-L compression and pwrite output pipeline #3
Closed
…atching

Move gzip compression from the single-threaded Writer (libdeflate) into N worker threads using ISA-L isal_deflate_stateless. Each worker accumulates raw output into a 512KB batch buffer before compressing, balancing compression ratio against latency. The Writer thread receives pre-compressed data and performs raw fwrite only.

Key changes:
- Add isal_compress.h: thread-safe ISA-L gzip wrapper (levels 0-3)
- Writer: add preCompressed flag to skip libdeflate when data is already compressed
- WriterThread: auto-detect .gz output, compress in worker context with per-thread accumulation buffers, flush on input completion
- Add bench_gz_compress.sh for A/B benchmarking

Benchmarks (10M PE reads, Apple M4 Pro):
- fq→gz 14t: 2.99x speedup (28.0s → 9.4s), +8.5% output size
- gz→gz 14t: 1.93x speedup (3.0s → 1.6s), +8.5% output size
- gz→gz 4t: 1.20x speedup (2.8s → 2.3s), +8.4% output size

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
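The per-worker accumulation described in this commit can be sketched as below. This is a minimal illustration, not the code from the PR: `BatchAccumulator`, its method names, and the pluggable `Compressor` callback are all hypothetical, and the callback stands in for the ISA-L wrapper so that the sketch compiles without ISA-L. The 512KB threshold from the commit would be passed as `threshold`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-worker accumulator: collects raw output and hands it to a
// compression callback once the batch reaches a size threshold.
class BatchAccumulator {
public:
    using Compressor = std::function<std::string(const std::string&)>;

    BatchAccumulator(std::size_t threshold, Compressor compress)
        : threshold_(threshold), compress_(std::move(compress)) {
        assert(threshold_ > 0);
        raw_.reserve(threshold_);
    }

    // Append one chunk of raw output; compress as soon as the batch is full.
    void append(const std::string& chunk) {
        raw_ += chunk;
        if (raw_.size() >= threshold_) flush();
    }

    // Compress whatever is buffered. Also called once when input completes,
    // so a partially filled batch is not lost.
    void flush() {
        if (raw_.empty()) return;
        compressed_.push_back(compress_(raw_));
        raw_.clear();
    }

    const std::vector<std::string>& compressed() const { return compressed_; }
    std::size_t pending() const { return raw_.size(); }

private:
    std::size_t threshold_;
    Compressor compress_;
    std::string raw_;                      // raw bytes not yet compressed
    std::vector<std::string> compressed_;  // one entry per compressed batch
};
```

Each batch is compressed independently of the others. The gzip format allows a file to consist of several concatenated members, so independently compressed batches can be written back to back. A larger threshold gives the compressor more context per batch at the cost of holding data longer before it is emitted.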
Replace countMismatches with countMismatchesBounded in the overlap detection loop, enabling early termination when the mismatch count exceeds the threshold. Also removes the now-redundant prefix-only fallback check.

Most non-overlapping offsets are rejected after comparing just the first few dozen bytes instead of the full 150bp, reducing total comparison work by ~5-10x.

Benchmark (10M PE reads, fq→fq, -w 12): 8.21s → 6.33s (1.30x speedup, 23% less wall time)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
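A bounded mismatch count of the kind this commit describes could look like the sketch below. The function name `countMismatchesBoundedSketch` and its signature are assumptions made for illustration; the actual `countMismatchesBounded` in the PR may take different arguments or compare more than one byte per step.

```cpp
#include <cassert>
#include <cstddef>

// Count positions where a and b differ, but stop as soon as the count exceeds
// maxMismatches. A return value greater than maxMismatches means "rejected";
// the exact number of mismatches beyond that point is not computed.
inline int countMismatchesBoundedSketch(const char* a, const char* b,
                                        std::size_t len, int maxMismatches) {
    int mismatches = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (a[i] != b[i] && ++mismatches > maxMismatches) {
            return mismatches;  // early exit: this offset cannot be an overlap
        }
    }
    return mismatches;
}
```

For a wrong offset between two reads, mismatches appear almost immediately, so the loop exits after a short prefix instead of scanning the full read length.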
Worker threads now write directly via pwrite() using a lock-free offset ring buffer for coordination, eliminating the serial writer thread as a bottleneck for multi-threaded output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
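One way such offset coordination can work is sketched below, under stated assumptions: the class `OffsetRing`, its two methods, and the slot layout are hypothetical and not taken from the PR. Each pack has a sequence number; pack `seq` starts where pack `seq - 1` ended, so a worker waits for its predecessor's end offset and then publishes its own. The wait loop uses a short spin followed by `std::this_thread::yield()`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// One CPU-level relax hint; compiles to nothing on other architectures.
inline void cpuRelax() {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#elif defined(__aarch64__)
    __asm__ __volatile__("yield");
#endif
}

// Hypothetical ring of published end offsets, indexed by pack sequence number.
template <std::size_t N>
class OffsetRing {
    static_assert(N >= 2, "ring needs at least two slots");

public:
    // Block until pack (seq - 1) has published its end offset; that value is
    // the file offset where pack seq starts. Pack 0 starts at offset 0.
    std::uint64_t waitStart(std::uint64_t seq) {
        if (seq == 0) return 0;
        Slot& prev = slots_[(seq - 1) % N];
        for (unsigned spins = 0;
             prev.tag.load(std::memory_order_acquire) != seq; ++spins) {
            if (spins < 32) cpuRelax();        // brief spin first
            else std::this_thread::yield();    // then give up the time slice
        }
        return prev.end.load(std::memory_order_relaxed);
    }

    // Publish the end offset of pack seq so pack (seq + 1) can proceed.
    // Must be called only after waitStart(seq) has returned.
    void publishEnd(std::uint64_t seq, std::uint64_t endOffset) {
        Slot& slot = slots_[seq % N];
        slot.end.store(endOffset, std::memory_order_relaxed);
        slot.tag.store(seq + 1, std::memory_order_release);
    }

private:
    struct Slot {
        std::atomic<std::uint64_t> tag{0};  // seq + 1 of the last publisher
        std::atomic<std::uint64_t> end{0};  // end offset of that pack
    };
    Slot slots_[N];
};
```

A worker would call `waitStart(seq)`, write its data at the returned offset, and call `publishEnd(seq, start + bytesWritten)`. Because a pack can only publish after its predecessor has, a slot is never overwritten before its single reader has consumed it.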
…ckoff

Three changes to the pwrite parallel writer:

1. Fix gz output corruption in pwrite mode: cross-pack accumulation (flight batching) was incompatible with interleaved sequence numbering. Worker 0 accumulated packs 0,3,6,... into one batch, placing pack 0+3 data at pack 3's file offset, before packs 1 and 2. Now each pack is compressed individually in pwrite mode, preserving correct ordering.
2. Add adaptive timeout for legacy mode flight batching (single-thread gz): EMA-based ingress rate estimation auto-tunes a flush timeout (2ms–50ms) so partial buffers don't wait indefinitely for the 512KB size threshold.
3. Replace tight spin-wait with progressive backoff in pwrite offset coordination: 32 iterations of arch-specific pause/yield hints, then std::this_thread::yield(). Eliminates the -w 12 fq→fq regression (was 0.7x baseline, now 1.24x) and doubles fq→gz throughput at high thread counts.

Also fixes silent pwrite error handling: EINTR retry, fatal exit on failure, and offset published by actual bytes written (not planned size).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
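The pwrite error handling mentioned at the end of this commit (EINTR retry, fatal exit, reporting actual bytes written) follows a common POSIX pattern. The sketch below is an illustration under assumptions: the helper names `pwriteAll` and `pwriteOrDie` are hypothetical and do not come from the PR.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <sys/types.h>
#include <unistd.h>

// Write len bytes from buf at file offset `offset`, retrying on EINTR and
// continuing after short writes. Returns the number of bytes actually
// written, which is less than len only if pwrite() failed or made no progress.
inline std::size_t pwriteAll(int fd, const char* buf, std::size_t len,
                             off_t offset) {
    std::size_t done = 0;
    while (done < len) {
        ssize_t n = ::pwrite(fd, buf + done, len - done,
                             offset + static_cast<off_t>(done));
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal: retry
            break;                          // real error: report short count
        }
        if (n == 0) break;                  // no progress: avoid looping forever
        done += static_cast<std::size_t>(n);
    }
    return done;
}

// Fatal-exit wrapper: an unnoticed partial write would leave a hole in the
// output file, so the process stops instead of continuing silently.
inline void pwriteOrDie(int fd, const char* buf, std::size_t len,
                        off_t offset) {
    if (pwriteAll(fd, buf, len, offset) != len) {
        std::perror("pwrite");
        std::exit(EXIT_FAILURE);
    }
}
```

Returning the actual byte count lets the caller publish the next pack's offset from what really reached the file rather than from the planned size.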
- E2E benchmark script: PE/SE/stdin-stdout modes, auto thread allocation
- No-baseline mode (--orig optional): compact table for opt-only runs
- System info: CPU topology (P/E cores), memory type, disk specs, load avg
- Peak RSS tracking via os.wait4(), cross-platform (macOS + Linux)
- MD5 verification of output correctness across orig/opt binaries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ISA-L AArch64 assembly uses LDR literal with a ±1MB range that overflows under fully static linking. Use the same .a-file linking strategy across all platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Allows running baseline and optimized benchmarks separately, then merging the two result JSONs into a single comparison table.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	README.md
#	src/main.cpp
#	src/overlapanalysis.cpp
Summary
- … pwrite() using a lock-free offset ring buffer, eliminating the writer thread as a serialization point
- … -static linking that caused ISA-L relocation overflow on ARM64
- … (scripts/bench_e2e.py): PE/SE/stdin-stdout modes, hardware profiling, dual-mode (baseline vs optimized) comparison with JSON merge capability

Key Performance Improvements
Changed Files
- peprocessor.cpp/h, seprocessor.cpp/h, writerthread.cpp/h, writer.cpp/h
- isal_compress.h, flight_batch_manager.cpp/h, trace_profiler.cpp/h
- Makefile, .github/workflows/ci.yml
- scripts/bench_e2e.py (replaces bench_e2e.sh)
- … docs/rfc/, 2 benchmark results in docs/bench/

Test Plan
python scripts/bench_e2e.py ./fastp_baseline ./fastp_opt

🤖 Generated with Claude Code