diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index 30ec631..7c570b8 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -283,52 +283,71 @@ GPU Huffman with Blelloch prefix sum crosses over ~128KB. At 256KB the GPU scan
 GPU BWT radix sort is 7-14x faster than the old bitonic sort. Still slower than CPU SA-IS at small sizes but becoming competitive at 64KB+
 (CPU SA-IS ~1ms at 64KB vs GPU 4.1ms). The gap narrows at larger sizes where GPU parallelism helps more.
 
+## GPU/CPU strategy (settled)
+
+The optimal split for libpz is **GPU for LZ77 match-finding, CPU for entropy coding**,
+overlapped via the unified scheduler with ring-buffered streaming.
+
+**Why GPU wins on LZ77:** Match-finding is embarrassingly parallel — each position's
+search is independent. The cooperative-stitch kernel does 1,788 probes/position and
+is 2x faster than CPU at 256KB+. Ring-buffered batching (`find_matches_batched`)
+adds +7-17% throughput by amortizing buffer allocation and overlapping GPU compute
+with CPU readback.
+
+**Why CPU wins on entropy:** rANS/FSE/Huffman are serial state machines — each
+symbol depends on the previous state. GPU entropy has been tried extensively
+(500+ iterations: single-stream, independent blocks, Recoil checkpoints, batched
+cross-block) and is 0.77x CPU on encode, 0.54x on decode. The serial dependency
+limits GPU to ~300 useful threads when saturation needs ~8K-16K. PCIe transfer
+overhead dominates at typical block sizes (128KB-256KB).
+
+**Architecture:** The unified scheduler dispatches LZ77 to GPU and entropy to CPU
+workers in parallel. While CPU thread N entropy-encodes block K, the GPU is already
+match-finding block K+1. The `GPU_ENTROPY_THRESHOLD` (256KB) is deliberately set
+above `DEFAULT_GPU_BLOCK_SIZE` (128KB) to prevent routing entropy to the slower
+GPU path.
+
+See `docs/design-docs/gpu-strategy.md` for full analysis and `CLAUDE.md` "Known
+dead ends" for the complete list of GPU optimization attempts that failed.
+
 ## Remaining GPU bottlenecks
 
 1. **GPU BWT still slower than CPU SA-IS** — Radix sort improved 7-14x over bitonic
    sort, but CPU SA-IS (O(n)) remains faster at small/medium sizes. GPU catches up
    at 64KB+ but prefix-doubling's O(n log n) work is inherently more than SA-IS's O(n).
 
-2. **No shared memory usage** — LZ77 hash kernel uses only global memory.
-   Loading hash buckets into `__local` memory could help at larger sizes.
-
-3. **Hash bucket overflow** — Fixed BUCKET_CAP=64 means highly repetitive data
+2. **Hash bucket overflow** — Fixed BUCKET_CAP=64 means highly repetitive data
    may miss good matches. Adaptive bucket sizing could help.
 
-4. **Huffman WriteCodes atomic contention** — Per-bit atomic_or on the output
-   buffer limits scaling. Chunk-based packing could reduce contention.
-
-5. **LZ77 match array still downloaded for dedupe** — GPU match dedup is sequential
+3. **LZ77 match array still downloaded for dedupe** — GPU match dedup is sequential
    and runs on CPU. Keeping serialized LZ77 bytes on GPU for histogram+Huffman is
    already done (ByteHistogram optimization), but the match download is unavoidable.
 
 ## Next steps
 
-### Priority 0: rANS SIMD completion
-
-See `docs/exec-plans/tech-debt-tracker.md` for full details. Two items: SIMD decode paths (SSE2/AVX2 intrinsics for interleaved decode) and reciprocal multiplication (eliminate data-dependent division in encode).
+### Priority 0: Close the gzip compression ratio gap
 
-### Priority 1: Use local/shared memory in LZ77 hash kernel
-- Load hash buckets into `__local` memory for faster repeated access
-- Could improve GPU LZ77 performance at mid-range sizes (64KB-256KB)
-- May lower the GPU crossover point from 256KB toward 64-128KB
+LzSeqR is our best pipeline at 35.1% vs gzip's 28.6% (6.5pp gap). The gap is
+encoding efficiency, not match quality (see `CLAUDE.md` "Known dead ends"). The
+format is pre-release so all changes are free. Key opportunities:
 
-### Priority 3: Chunk-based Huffman bit packing
-- Replace per-bit `atomic_or` in WriteCodes with work-group-local packing
-- Each work-group packs its chunk into local memory (no atomics within WG)
-- Single copy from local → global per chunk
-- Could 5-10x GPU Huffman throughput
+- **Zstd-style literal-run sequences** — replace per-token flags with
+  `(literal_run_length, offset, match_length)` tuples, eliminating the flags
+  stream entirely. Highest ceiling.
+- **Larger repeat offset cache** (4→8) — each additional repeat saves all extra
+  bits for that match.
+- **Entropy-code the extra bits** — `offset_extra` and `length_extra` currently
+  bypass rANS; if values are skewed, 5-15% savings.
+- **Sparse frequency tables** — 512 bytes per rANS stream → ~61 bytes for
+  narrow-alphabet streams. Saves ~1.3KB/block.
 
-### Priority 4: Fuzz testing
-- Set up `cargo-fuzz` for round-trip correctness on all pipelines
-- Target edge cases in LZ77, Huffman, BWT decode paths
+### Priority 1: LzSeq-specific optimal parser
 
-### Priority 5: Auto-selection threshold tuning
-- Run all 3 pipelines on Canterbury + Silesia corpora
-- Measure actual compression ratios vs analysis metrics
-- Tune heuristic decision tree thresholds empirically
+The optimal parser currently uses LZ77's `match_cost` approximation. A dedicated
+LzSeq optimal parser that tracks repeat offset state through the backward DP would
+find more repeat matches, directly improving ratio.
 
-### Priority 6: aarch64 NEON/SVE SIMD implementation
+### Priority 2: aarch64 NEON/SVE SIMD implementation
 - Replace scalar stubs in `src/simd.rs` with actual NEON intrinsics
 - `compare_bytes`: `vceqq_u8` + `vmovn_u16` for 16-byte comparison
 - `byte_frequencies`: 4-bank unrolled (NEON lacks efficient gather/scatter)
diff --git a/CLAUDE.md b/CLAUDE.md
index eb35cdb..ce132d8 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -52,6 +52,25 @@ Specialized agents in `.claude/agents/` run on cheaper models and keep verbose o
 - `scripts/` — test, bench, profile, setup, and analysis tools
 - `docs/` — design docs, quality status, exec plans, references
 
+## Known dead ends
+
+Before optimizing GPU code paths, read this first — multiple agents have spent full sessions rediscovering these:
+
+- **GPU entropy (rANS/FSE) is slower than CPU** — 0.77x on encode, 0.54x on decode. This has been proven across 500+ optimization iterations. The serial state dependency in rANS limits GPU to ~300 threads; saturation needs ~8K-16K. Do not attempt to batch, parallelize, or "pipeline" GPU entropy encoding.
+- **`gpu_fused_span()` returning `Some((0,1))` is counterproductive** — it routes entropy to GPU (slower). It exists as architectural prep for if GPU entropy ever becomes competitive. The `GPU_ENTROPY_THRESHOLD` (256KB > default GPU block size 128KB) intentionally prevents this path from activating.
+- **The CLI uses `streaming::compress_stream`, not `pipeline::compress_with_options`** — the parallel scheduler's GPU coordinator is not invoked by the CLI. The streaming path deliberately uses CPU rANS for entropy.
+- **The real GPU win (ring-buffered LZ77 batching) is already shipped** — delivers +7-17% throughput. See `docs/design-docs/gpu-strategy.md`.
+- **GPU device init time skews throughput benchmarks** — first-call GPU init adds significant overhead that `bench.sh` captures but Criterion amortizes across iterations. When comparing GPU vs CPU throughput, use Criterion (`cargo bench`) for apples-to-apples; `bench.sh` reflects real-world cold-start cost. Don't chase "GPU is slower" regressions that are really just init time.
+- **Compression ratio is limited by 5-byte match encoding, not match quality** — the LZ match-finder finds good matches, but the 5-byte serialized match format creates overhead on short matches. Improving ratio means fixing the encoding format, not tuning the matcher.
+- **GPU Huffman is a dead end** — Huffman coding requires bit-level alignment, but GPU throughput depends on byte-aligned memory access patterns. This is a fundamental architectural mismatch; do not attempt to port Huffman to GPU.
+- **GPU hash tables for LZ matching don't work** — GPU atomics don't preserve insertion order, so hash chains lose recency information. Match quality collapses to ~6% vs CPU's 99.6% on repetitive data. Tried twice (global atomics + shared-memory variant), both catastrophically failed. See `docs/design-docs/experiments.md`.
+- **SSE2 rANS decode is 32% slower than scalar** — scalar 4-lane decode gets good ILP from out-of-order execution. SSE2 extract operations serialize and lose that parallelism. Proper SIMD rANS would need merged slot-indexed tables and SSE4.1+. The dispatch is disabled; don't re-enable it.
+- **Fully parallel GPU LZ parsing (ParlZ) has unacceptable ratio loss** — 37.6% compression gap vs serial parsing. Forward-max-propagation conflict resolution is too aggressive. Hybrid GPU match-finding + CPU serial parsing is the correct architecture.
+- **Iterative GPU algorithms have quadratic host overhead** — Repair grammar compression hit 0.4 MB/s due to 100+ rounds of buffer alloc + readback. Avoid per-round GPU↔CPU synchronization; prefer single-dispatch or persistent-buffer designs.
+- **Window-capped suffix sorts break BWT invertibility** — FWST produced 433% ratio (massive expansion). Full suffix sort is structurally required for LF-mapping; there's no shortcut.
+
+For detailed history of all failed experiments, see `docs/design-docs/gpu-experiments-wave2-conclusions.md` and `docs/design-docs/experiments.md`.
+
 ## Key conventions
 
 See **docs/DESIGN.md** for full design principles and **docs/design-docs/core-beliefs.md** for agent-first operating principles.
diff --git a/scripts/fetch-silesia.sh b/scripts/fetch-silesia.sh
old mode 100644
new mode 100755
index d36dec2..ec18f38
--- a/scripts/fetch-silesia.sh
+++ b/scripts/fetch-silesia.sh
@@ -1,37 +1,37 @@
-#!/usr/bin/env bash
-# Download and extract the Silesia compression corpus.
-# https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
-#
-# Usage: ./scripts/fetch-silesia.sh [--force]
-#
-# Downloads to samples/silesia/. Skips if already present unless --force.
-
-set -euo pipefail
-
-SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
-PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
-
-DEST="${PROJECT_DIR}/samples/silesia"
-URL="https://sun.aei.polsl.pl/~sdeor/corpus/silesia.zip"
-EXPECTED_FILES=12
-
-if [[ -d "$DEST" ]] && [[ "$(ls "$DEST" | wc -l)" -ge "$EXPECTED_FILES" ]] && [[ "${1:-}" != "--force" ]]; then
-    echo "Silesia corpus already present in $DEST ($(ls "$DEST" | wc -l) files)"
-    echo "Use --force to re-download."
-    exit 0
-fi
-
-TMPFILE="$(mktemp /tmp/silesia-XXXXXX.zip)"
-trap 'rm -f "$TMPFILE"' EXIT
-
-echo "Downloading Silesia corpus (66 MB)..."
-curl -fSL "$URL" -o "$TMPFILE"
-
-mkdir -p "$DEST"
-echo "Extracting to $DEST..."
-unzip -o "$TMPFILE" -d "$DEST"
-
-# Clean up any .pz artifacts from previous runs
-rm -f "$DEST"/*.pz
-
-echo "Done: $(ls "$DEST" | wc -l) files, $(du -sh "$DEST" | cut -f1) total"
+#!/usr/bin/env bash
+# Download and extract the Silesia compression corpus.
+# https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
+#
+# Usage: ./scripts/fetch-silesia.sh [--force]
+#
+# Downloads to samples/silesia/. Skips if already present unless --force.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
+
+DEST="${PROJECT_DIR}/samples/silesia"
+URL="https://sun.aei.polsl.pl/~sdeor/corpus/silesia.zip"
+EXPECTED_FILES=12
+
+if [[ -d "$DEST" ]] && [[ "$(ls "$DEST" | wc -l)" -ge "$EXPECTED_FILES" ]] && [[ "${1:-}" != "--force" ]]; then
+    echo "Silesia corpus already present in $DEST ($(ls "$DEST" | wc -l) files)"
+    echo "Use --force to re-download."
+    exit 0
+fi
+
+TMPFILE="$(mktemp /tmp/silesia-XXXXXX.zip)"
+trap 'rm -f "$TMPFILE"' EXIT
+
+echo "Downloading Silesia corpus (66 MB)..."
+curl -fSL "$URL" -o "$TMPFILE"
+
+mkdir -p "$DEST"
+echo "Extracting to $DEST..."
+unzip -o "$TMPFILE" -d "$DEST"
+
+# Clean up any .pz artifacts from previous runs
+rm -f "$DEST"/*.pz
+
+echo "Done: $(ls "$DEST" | wc -l) files, $(du -sh "$DEST" | cut -f1) total"
diff --git a/src/pipeline/mod.rs b/src/pipeline/mod.rs
index 3266eee..aa2724c 100644
--- a/src/pipeline/mod.rs
+++ b/src/pipeline/mod.rs
@@ -143,8 +143,11 @@ const DEFAULT_BW_BLOCK_SIZE: usize = 512 * 1024;
 /// which also improves throughput on streaming GPU paths.
 const DEFAULT_GPU_BLOCK_SIZE: usize = 128 * 1024;
 
-/// Minimum block size for GPU entropy to win over CPU (empirical from Phase 4).
-/// Applies to both individual blocks and total stream byte count.
+/// Gate that **prevents** routing entropy to the GPU, which is slower than CPU
+/// (0.77x encode, 0.54x decode — see `docs/design-docs/gpu-strategy.md`).
+/// Set deliberately above `DEFAULT_GPU_BLOCK_SIZE` (128KB) so GPU blocks never
+/// qualify for the fused entropy path. Do NOT lower this threshold — it will
+/// regress throughput by ~10-15x. See "Known dead ends" in CLAUDE.md.
 /// 256KB = 262144 bytes (aligns with AC3.2 threshold).
 pub const GPU_ENTROPY_THRESHOLD: usize = 256 * 1024;
 
diff --git a/src/pipeline/parallel.rs b/src/pipeline/parallel.rs
index 8a43794..d3395ab 100644
--- a/src/pipeline/parallel.rs
+++ b/src/pipeline/parallel.rs
@@ -191,12 +191,23 @@ fn should_route_block_to_gpu_stage0(
 /// Returns `Some((start, end))` when every stage in `start..=end` has a GPU
 /// implementation, allowing the GPU coordinator to run them sequentially
 /// without intermediate queue round-trips or CPU readback.
+///
+/// **WARNING — GPU entropy is currently SLOWER than CPU** (0.77x encode, 0.54x decode).
+/// This function exists as architectural preparation for when GPU entropy becomes
+/// competitive. In practice, `GPU_ENTROPY_THRESHOLD` (256KB) is deliberately set above
+/// `DEFAULT_GPU_BLOCK_SIZE` (128KB) so the fused path is never activated for GPU blocks.
+/// Lowering the threshold or bypassing it will REGRESS throughput by ~10-15x because
+/// it serializes all work onto the single coordinator thread AND uses the slower GPU
+/// entropy path. See `docs/design-docs/gpu-strategy.md` and
+/// `.claude/feedback/2026-03-01-gpu-wins-on-lz77-loses-on-entropy.md`.
 #[cfg(feature = "webgpu")]
 fn gpu_fused_span(pipeline: Pipeline) -> Option<(usize, usize)> {
     match pipeline {
         // Both stages have GPU paths: LZ77 match-finding + rANS encode
+        // NOTE: GPU rANS is slower than CPU — fused path is gated by GPU_ENTROPY_THRESHOLD
         Pipeline::Lzr => Some((0, 1)),
         // Both stages have GPU paths: LzSeq fused match+demux + rANS encode
+        // NOTE: GPU rANS is slower than CPU — fused path is gated by GPU_ENTROPY_THRESHOLD
         Pipeline::LzSeqR => Some((0, 1)),
         _ => None,
     }
diff --git a/src/pipeline/tests.rs b/src/pipeline/tests.rs
index 202bc8d..aab7f1f 100644
--- a/src/pipeline/tests.rs
+++ b/src/pipeline/tests.rs
@@ -899,6 +899,59 @@ mod gpu_batched_tests {
         let decompressed = decompress(&compressed).unwrap();
         assert_eq!(decompressed, input);
     }
+
+    /// Multi-block pipeline pipelining regression test.
+    ///
+    /// Creates input large enough to produce many blocks (>= 8), then
+    /// compresses with GPU and verifies the round-trip. This exercises
+    /// the full GPU coordinator pipeline: ring-buffered LZ77 stage 0
+    /// batching, StageN entropy encoding, and result assembly.
+    ///
+    /// Correctness failures here indicate pipelining bugs (e.g., staging
+    /// buffer reuse before readback, ring slot cross-contamination, or
+    /// incomplete synchronization between GPU compute and CPU readback).
+    #[test]
+    fn test_gpu_pipeline_multiblock_correctness() {
+        let engine = match crate::webgpu::WebGpuEngine::new() {
+            Ok(e) => std::sync::Arc::new(e),
+            Err(_) => return,
+        };
+        let opts = CompressOptions {
+            backend: Backend::WebGpu,
+            webgpu_engine: Some(engine),
+            block_size: 64 * 1024, // 64KB blocks → 8 blocks for 512KB input
+            threads: 2,
+            ..CompressOptions::default()
+        };
+
+        // 512KB input with distinct pattern per 64KB region to detect
+        // cross-contamination between blocks.
+        let mut input = Vec::with_capacity(512 * 1024);
+        let patterns: Vec<&[u8]> = vec![
+            b"alpha pattern block one data here. ",
+            b"beta different content for block two. ",
+            b"gamma third block uses gamma pattern. ",
+            b"delta fourth block with delta stuff. ",
+            b"epsilon five five five five five. ",
+            b"zeta block six zeta zeta zeta. ",
+            b"eta seventh block eta eta eta. ",
+            b"theta eighth block theta theta. ",
+        ];
+        for p in &patterns {
+            let block: Vec<u8> = p.iter().cycle().take(64 * 1024).copied().collect();
+            input.extend_from_slice(&block);
+        }
+
+        for pipeline in [Pipeline::Deflate, Pipeline::Lzr, Pipeline::LzSeqR] {
+            let compressed = compress_with_options(&input, pipeline, &opts).unwrap();
+            let decompressed = decompress(&compressed).unwrap();
+            assert_eq!(
+                decompressed, input,
+                "{:?} multi-block GPU pipeline round-trip failed",
+                pipeline
+            );
+        }
+    }
 }
 
 // --- LzSeqR optimal parsing tests (Task 6) ---
diff --git a/src/webgpu/tests/lz77.rs b/src/webgpu/tests/lz77.rs
index 1920620..b49e306 100644
--- a/src/webgpu/tests/lz77.rs
+++ b/src/webgpu/tests/lz77.rs
@@ -933,3 +933,127 @@ fn test_lz77_lazy_kernel_round_trip() {
     let decompressed = crate::lz77::decompress(&compressed).unwrap();
     assert_eq!(decompressed, input, "lazy kernel round-trip failed");
 }
+
+// ---------------------------------------------------------------------------
+// GPU pipelining validation tests
+// ---------------------------------------------------------------------------
+
+/// Validates that batched GPU LZ77 match-finding produces correct results
+/// across many blocks, exercising the ring-buffered submission pipeline.
+///
+/// This is a regression guard against pipelining bugs: if ring-based
+/// overlap breaks (e.g., reading a staging buffer before GPU compute
+/// finishes, or reusing a slot before readback completes), the round-trip
+/// will fail with corrupted match data.
+#[test]
+fn test_lz77_batched_pipeline_correctness() {
+    let engine = match WebGpuEngine::new() {
+        Ok(e) => e,
+        Err(PzError::Unsupported) => return,
+        Err(e) => panic!("unexpected error: {:?}", e),
+    };
+
+    // Create 8 distinct blocks (each > MIN_GPU_INPUT_SIZE) with different
+    // patterns to ensure the ring buffer doesn't cross-contaminate results.
+    let block_size = MIN_GPU_INPUT_SIZE + 4096;
+    let patterns: Vec<&[u8]> = vec![
+        b"the quick brown fox jumps over the lazy dog. ",
+        b"pack my box with five dozen liquor jugs! ",
+        b"how vexingly quick daft zebras jump. ",
+        b"sphinx of black quartz judge my vow! ",
+        b"abcdefghijklmnopqrstuvwxyz0123456789 ",
+        b"compression test data with repeating patterns here. ",
+        b"gpu pipelining validation block number seven. ",
+        b"the final block uses yet another distinct pattern. ",
+    ];
+
+    let blocks: Vec<Vec<u8>> = patterns
+        .iter()
+        .map(|p| p.iter().cycle().take(block_size).copied().collect())
+        .collect();
+    let block_refs: Vec<&[u8]> = blocks.iter().map(|b| b.as_slice()).collect();
+
+    // Batched GPU match-finding (exercises ring pipeline)
+    let batched_results = engine.find_matches_batched(&block_refs).unwrap();
+    assert_eq!(batched_results.len(), blocks.len());
+
+    // Verify each block round-trips correctly
+    for (i, (matches, block)) in batched_results.iter().zip(&blocks).enumerate() {
+        assert!(!matches.is_empty(), "block {i} should have matches");
+        let mut compressed =
+            Vec::with_capacity(matches.len() * crate::lz77::Match::SERIALIZED_SIZE);
+        for m in matches {
+            compressed.extend_from_slice(&m.to_bytes());
+        }
+        let decompressed = crate::lz77::decompress(&compressed).unwrap();
+        assert_eq!(
+            decompressed, *block,
+            "block {i} round-trip failed (pattern cross-contamination?)"
+        );
+    }
+
+    // Cross-validate: batched results should match individual GPU results
+    for (i, block) in blocks.iter().enumerate() {
+        let single_matches = engine.find_matches(block).unwrap();
+        assert_eq!(
+            batched_results[i].len(),
+            single_matches.len(),
+            "block {i}: batched vs single match count mismatch"
+        );
+    }
+}
+
+/// Validates that batched GPU LZ77 doesn't serialize excessively by
+/// timing batched vs serial execution. Batched should be no slower than
+/// serial (and ideally faster due to amortized setup + ring overlap).
+#[test]
+fn test_lz77_batched_pipeline_not_slower_than_serial() {
+    let engine = match WebGpuEngine::new() {
+        Ok(e) => e,
+        Err(PzError::Unsupported) => return,
+        Err(e) => panic!("unexpected error: {:?}", e),
+    };
+
+    let block_size = MIN_GPU_INPUT_SIZE + 8192;
+    let pattern = b"the quick brown fox jumps over the lazy dog. ";
+    let blocks: Vec<Vec<u8>> = (0..6)
+        .map(|i| {
+            // Vary each block slightly to avoid GPU caching effects
+            let mut block: Vec<u8> = pattern.iter().cycle().take(block_size).copied().collect();
+            block[0] = b'A' + (i as u8);
+            block
+        })
+        .collect();
+    let block_refs: Vec<&[u8]> = blocks.iter().map(|b| b.as_slice()).collect();
+
+    // Warmup: run once to prime GPU pipelines
+    let _ = engine.find_matches_batched(&block_refs);
+
+    // Time serial (one at a time)
+    let serial_start = std::time::Instant::now();
+    for block in &blocks {
+        let _ = engine.find_matches(block).unwrap();
+    }
+    let serial_elapsed = serial_start.elapsed();
+
+    // Time batched
+    let batched_start = std::time::Instant::now();
+    let _ = engine.find_matches_batched(&block_refs).unwrap();
+    let batched_elapsed = batched_start.elapsed();
+
+    // Batched should NOT be more than 1.5x slower than serial.
+    // (In practice, batched is typically faster due to amortized buffer
+    // allocation with the ring and overlapped GPU/CPU execution.)
+    let ratio = batched_elapsed.as_secs_f64() / serial_elapsed.as_secs_f64();
+    eprintln!(
+        "[pz-gpu] pipelining test: serial={:.1}ms, batched={:.1}ms, ratio={:.2}x",
+        serial_elapsed.as_secs_f64() * 1000.0,
+        batched_elapsed.as_secs_f64() * 1000.0,
+        ratio,
+    );
+    assert!(
+        ratio <= 1.5,
+        "batched GPU LZ77 is {ratio:.2}x slower than serial — \
+         pipelining regression (serial={serial_elapsed:?}, batched={batched_elapsed:?})"
+    );
+}
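Reviewer sketch (not part of the patch): the `src/pipeline/mod.rs` hunk relies on `GPU_ENTROPY_THRESHOLD` staying above `DEFAULT_GPU_BLOCK_SIZE`, but only comments enforce that. A minimal illustration of how the invariant could be checked at compile time is below. The two constant values are copied from the diff; the helper `qualifies_for_fused_gpu_entropy` and its `>=` comparison are assumptions made for illustration, since the real gating code is not shown in this diff.

```rust
// Values quoted from the src/pipeline/mod.rs hunk.
const DEFAULT_GPU_BLOCK_SIZE: usize = 128 * 1024;
const GPU_ENTROPY_THRESHOLD: usize = 256 * 1024;

// Compile-time guard: the build fails if the threshold is ever lowered to or
// below the default GPU block size, which would re-enable the slower path.
const _: () = assert!(GPU_ENTROPY_THRESHOLD > DEFAULT_GPU_BLOCK_SIZE);

/// Hypothetical helper (not a libpz function): assumes a block takes the
/// fused GPU entropy path only when it is at least the threshold in size.
fn qualifies_for_fused_gpu_entropy(block_len: usize) -> bool {
    block_len >= GPU_ENTROPY_THRESHOLD
}
```

Under these assumptions a default-sized GPU block (128KB) never qualifies, which is the behavior the doc comments describe. A constant assertion like this would turn the "Do NOT lower this threshold" warning into a build error.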