ChrisLundquist · Mar 10, 2026 · Mar 9, 2026
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
@@ -283,52 +283,71 @@ GPU Huffman with Blelloch prefix sum crosses over ~128KB. At 256KB the GPU scan
 
 GPU BWT radix sort is 7-14x faster than the old bitonic sort. Still slower than CPU SA-IS at small sizes but becoming competitive at 64KB+ (CPU SA-IS ~1ms at 64KB vs GPU 4.1ms). The gap narrows at larger sizes where GPU parallelism helps more.
 
+## GPU/CPU strategy (settled)
+
+The optimal split for libpz is **GPU for LZ77 match-finding, CPU for entropy coding**,
+overlapped via the unified scheduler with ring-buffered streaming.
+
+**Why GPU wins on LZ77:** Match-finding is embarrassingly parallel — each position's
+search is independent. The cooperative-stitch kernel does 1,788 probes/position and
+is 2x faster than CPU at 256KB+. Ring-buffered batching (`find_matches_batched`)
+improves throughput by 7-17% by amortizing buffer allocation and overlapping GPU
+compute with CPU readback.
+
+**Why CPU wins on entropy:** rANS/FSE/Huffman are serial state machines — each
+symbol depends on the previous state. GPU entropy has been tried extensively
+(500+ iterations: single-stream, independent blocks, Recoil checkpoints, batched
+cross-block) and is 0.77x CPU on encode, 0.54x on decode. The serial dependency
+limits GPU to ~300 useful threads when saturation needs ~8K-16K. PCIe transfer
+overhead dominates at typical block sizes (128KB-256KB).
+
+**Architecture:** The unified scheduler dispatches LZ77 to GPU and entropy to CPU
+workers in parallel. While CPU thread N entropy-encodes block K, the GPU is already
+match-finding block K+1. The `GPU_ENTROPY_THRESHOLD` (256KB) is deliberately set
+above `DEFAULT_GPU_BLOCK_SIZE` (128KB) to prevent routing entropy to the slower
+GPU path.
+
+See `docs/design-docs/gpu-strategy.md` for full analysis and `CLAUDE.md` "Known
+dead ends" for the complete list of GPU optimization attempts that failed.
+
 ## Remaining GPU bottlenecks
 
 1. **GPU BWT still slower than CPU SA-IS** — Radix sort improved 7-14x over bitonic
    sort, but CPU SA-IS (O(n)) remains faster at small/medium sizes. GPU catches up
    at 64KB+ but prefix-doubling's O(n log n) work is inherently more than SA-IS's O(n).
 
-2. **No shared memory usage** — LZ77 hash kernel uses only global memory.
-   Loading hash buckets into `__local` memory could help at larger sizes.
-
-3. **Hash bucket overflow** — Fixed BUCKET_CAP=64 means highly repetitive data
+2. **Hash bucket overflow** — Fixed BUCKET_CAP=64 means highly repetitive data
    may miss good matches. Adaptive bucket sizing could help.
 
-4. **Huffman WriteCodes atomic contention** — Per-bit atomic_or on the output
-   buffer limits scaling. Chunk-based packing could reduce contention.
-
-5. **LZ77 match array still downloaded for dedupe** — GPU match dedup is sequential
+3. **LZ77 match array still downloaded for dedupe** — GPU match dedup is sequential
    and runs on CPU. Keeping serialized LZ77 bytes on GPU for histogram+Huffman
    is already done (ByteHistogram optimization), but the match download is unavoidable.
 
 ## Next steps
 
-### Priority 0: rANS SIMD completion
-
-See `docs/exec-plans/tech-debt-tracker.md` for full details. Two items: SIMD decode paths (SSE2/AVX2 intrinsics for interleaved decode) and reciprocal multiplication (eliminate data-dependent division in encode).
+### Priority 0: Close the gzip compression ratio gap
 
-### Priority 1: Use local/shared memory in LZ77 hash kernel
-- Load hash buckets into `__local` memory for faster repeated access
-- Could improve GPU LZ77 performance at mid-range sizes (64KB-256KB)
-- May lower the GPU crossover point from 256KB toward 64-128KB
+LzSeqR is our best pipeline at 35.1% vs gzip's 28.6% (6.5pp gap). The gap is
+encoding efficiency, not match quality (see `CLAUDE.md` "Known dead ends"). The
+format is pre-release, so format changes carry no compatibility cost. Key opportunities:
 
-### Priority 3: Chunk-based Huffman bit packing
-- Replace per-bit `atomic_or` in WriteCodes with work-group-local packing
-- Each work-group packs its chunk into local memory (no atomics within WG)
-- Single copy from local → global per chunk
-- Could 5-10x GPU Huffman throughput
+- **Zstd-style literal-run sequences** — replace per-token flags with
+  `(literal_run_length, offset, match_length)` tuples, eliminating the flags
+  stream entirely. Highest ceiling.
+- **Larger repeat offset cache** (4→8) — each additional repeat saves all extra
+  bits for that match.
+- **Entropy-code the extra bits** — `offset_extra` and `length_extra` currently
+  bypass rANS; if their values are skewed, entropy-coding them could save 5-15%.
+- **Sparse frequency tables** — 512 bytes per rANS stream → ~61 bytes for
+  narrow-alphabet streams. Saves ~1.3KB/block.
 
-### Priority 4: Fuzz testing
-- Set up `cargo-fuzz` for round-trip correctness on all pipelines
-- Target edge cases in LZ77, Huffman, BWT decode paths
+### Priority 1: LzSeq-specific optimal parser
 
-### Priority 5: Auto-selection threshold tuning
-- Run all 3 pipelines on Canterbury + Silesia corpora
-- Measure actual compression ratios vs analysis metrics
-- Tune heuristic decision tree thresholds empirically
+The optimal parser currently uses LZ77's `match_cost` approximation. A dedicated
+LzSeq optimal parser that tracks repeat offset state through the backward DP would
+find more repeat matches, directly improving ratio.
 
-### Priority 6: aarch64 NEON/SVE SIMD implementation
+### Priority 2: aarch64 NEON/SVE SIMD implementation
 - Replace scalar stubs in `src/simd.rs` with actual NEON intrinsics
 - `compare_bytes`: `vceqq_u8` + `vmovn_u16` for 16-byte comparison
 - `byte_frequencies`: 4-bank unrolled (NEON lacks efficient gather/scatter)

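A minimal sketch of the Zstd-style literal-run idea listed under Priority 0, assuming a hypothetical `Token` type as the parser's output. `Token`, `Sequence`, and `to_sequences` are illustrative names, not libpz APIs, and the real LzSeq stream layout may differ. The point is only that once each match carries the length of the literal run before it, no per-token flag is needed to tell literals from matches.

```rust
/// Hypothetical parser output: one flag-distinguished token per step.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Literal(u8),
    Match { offset: u32, length: u32 },
}

/// One sequence: emit `literal_run` bytes from the literal buffer, then
/// copy `match_length` bytes starting `offset` bytes back.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Sequence {
    literal_run: u32,
    offset: u32,
    match_length: u32,
}

/// Converts a token stream into (literal bytes, sequences, trailing literal count).
/// The trailing count covers literals that follow the last match.
fn to_sequences(tokens: &[Token]) -> (Vec<u8>, Vec<Sequence>, u32) {
    let mut literals = Vec::new();
    let mut sequences = Vec::new();
    let mut run = 0u32;
    for &token in tokens {
        match token {
            Token::Literal(byte) => {
                literals.push(byte);
                run += 1;
            }
            Token::Match { offset, length } => {
                sequences.push(Sequence {
                    literal_run: run,
                    offset,
                    match_length: length,
                });
                run = 0; // the next sequence starts a fresh literal run
            }
        }
    }
    (literals, sequences, run)
}
```

For the input `abcabcx`, parsed as three literals, one match, and one literal, this yields the literal buffer `abcx`, a single sequence `(3, 3, 3)`, and one trailing literal.
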
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -52,6 +52,25 @@ Specialized agents in `.claude/agents/` run on cheaper models and keep verbose o
 - `scripts/` — test, bench, profile, setup, and analysis tools
 - `docs/` — design docs, quality status, exec plans, references
 
+## Known dead ends
+
+Before optimizing GPU code paths, read this first — multiple agents have spent full sessions rediscovering these:
+
+- **GPU entropy (rANS/FSE) is slower than CPU** — 0.77x on encode, 0.54x on decode. This has been proven across 500+ optimization iterations. The serial state dependency in rANS limits GPU to ~300 threads; saturation needs ~8K-16K. Do not attempt to batch, parallelize, or "pipeline" GPU entropy encoding.
+- **`gpu_fused_span()` returning `Some((0,1))` is counterproductive** — it routes entropy to GPU (slower). It exists as architectural prep for if GPU entropy ever becomes competitive. The `GPU_ENTROPY_THRESHOLD` (256KB > default GPU block size 128KB) intentionally prevents this path from activating.
+- **The CLI uses `streaming::compress_stream`, not `pipeline::compress_with_options`** — the parallel scheduler's GPU coordinator is not invoked by the CLI. The streaming path deliberately uses CPU rANS for entropy.
+- **The real GPU win (ring-buffered LZ77 batching) is already shipped** — delivers +7-17% throughput. See `docs/design-docs/gpu-strategy.md`.
+- **GPU device init time skews throughput benchmarks** — first-call GPU init adds significant overhead that `bench.sh` captures but Criterion amortizes across iterations. When comparing GPU vs CPU throughput, use Criterion (`cargo bench`) for apples-to-apples; `bench.sh` reflects real-world cold-start cost. Don't chase "GPU is slower" regressions that are really just init time.
+- **Compression ratio is limited by 5-byte match encoding, not match quality** — the LZ match-finder finds good matches, but the 5-byte serialized match format creates overhead on short matches. Improving ratio means fixing the encoding format, not tuning the matcher.
+- **GPU Huffman is a dead end** — Huffman coding requires bit-level alignment, but GPU throughput depends on byte-aligned memory access patterns. This is a fundamental architectural mismatch; do not attempt to port Huffman to GPU.
+- **GPU hash tables for LZ matching don't work** — GPU atomics don't preserve insertion order, so hash chains lose recency information. Match quality collapses to ~6% vs CPU's 99.6% on repetitive data. Tried twice (global atomics + shared-memory variant), both catastrophically failed. See `docs/design-docs/experiments.md`.
+- **SSE2 rANS decode is 32% slower than scalar** — scalar 4-lane decode gets good ILP from out-of-order execution. SSE2 extract operations serialize and lose that parallelism. Proper SIMD rANS would need merged slot-indexed tables and SSE4.1+. The dispatch is disabled; don't re-enable it.
+- **Fully parallel GPU LZ parsing (ParlZ) has unacceptable ratio loss** — 37.6% compression gap vs serial parsing. Forward-max-propagation conflict resolution is too aggressive. Hybrid GPU match-finding + CPU serial parsing is the correct architecture.
+- **Iterative GPU algorithms have quadratic host overhead** — Repair grammar compression hit 0.4 MB/s due to 100+ rounds of buffer alloc + readback. Avoid per-round GPU↔CPU synchronization; prefer single-dispatch or persistent-buffer designs.
+- **Window-capped suffix sorts break BWT invertibility** — FWST produced 433% ratio (massive expansion). Full suffix sort is structurally required for LF-mapping; there's no shortcut.
+
+For detailed history of all failed experiments, see `docs/design-docs/gpu-experiments-wave2-conclusions.md` and `docs/design-docs/experiments.md`.
+
 ## Key conventions
 
 See **docs/DESIGN.md** for full design principles and **docs/design-docs/core-beliefs.md** for agent-first operating principles.

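To illustrate the serial state dependency cited in the GPU entropy entry above, here is a toy byte-wise rANS coder. It is a sketch under stated assumptions: the 12-bit scale, the `1 << 23` lower bound, and the names `Table`, `rans_encode`, and `rans_decode` are illustrative choices, not libpz's implementation. Note that each loop iteration reads and rewrites the single state `x`, so symbol `i` cannot be coded before symbol `i + 1` (encode) or `i - 1` (decode) has finished.

```rust
const SCALE_BITS: u32 = 12; // frequencies must sum to 1 << SCALE_BITS
const RANS_L: u32 = 1 << 23; // lower bound of the normalized state interval

/// Toy model: `freq[s]` is the symbol frequency, `cum[s]` its cumulative start.
struct Table {
    freq: Vec<u32>,
    cum: Vec<u32>,
}

impl Table {
    fn new(freq: Vec<u32>) -> Self {
        assert_eq!(freq.iter().sum::<u32>(), 1 << SCALE_BITS);
        let mut cum = Vec::with_capacity(freq.len());
        let mut acc = 0;
        for &f in &freq {
            cum.push(acc);
            acc += f;
        }
        Table { freq, cum }
    }
}

/// Encodes symbols in reverse so the decoder can emit them in forward order.
/// Every encoded symbol must have a nonzero frequency.
fn rans_encode(symbols: &[usize], t: &Table) -> Vec<u8> {
    let mut out = Vec::new();
    let mut x = RANS_L;
    for &s in symbols.iter().rev() {
        let (f, c) = (t.freq[s], t.cum[s]);
        // Renormalize: shift bytes out until the update below cannot overflow.
        let x_max = ((RANS_L >> SCALE_BITS) << 8) * f;
        while x >= x_max {
            out.push(x as u8);
            x >>= 8;
        }
        // The new state depends on the previous state: this is the serial chain.
        x = ((x / f) << SCALE_BITS) + (x % f) + c;
    }
    out.extend_from_slice(&x.to_le_bytes());
    out.reverse(); // decoder reads the final state first, then bytes newest-first
    out
}

/// Decodes `n` symbols from a stream produced by `rans_encode`.
fn rans_decode(bytes: &[u8], n: usize, t: &Table) -> Vec<usize> {
    let mut x = u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    let mut pos = 4;
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        let slot = x & ((1 << SCALE_BITS) - 1);
        // Linear search for the symbol whose cumulative range contains `slot`.
        let s = (0..t.freq.len())
            .find(|&s| slot < t.cum[s] + t.freq[s])
            .unwrap();
        x = t.freq[s] * (x >> SCALE_BITS) + slot - t.cum[s];
        while x < RANS_L {
            x = (x << 8) | bytes[pos] as u32;
            pos += 1;
        }
        out.push(s);
    }
    out
}
```

Interleaving several independent states recovers some parallelism, but each lane remains a chain of this shape.
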
diff --git a/scripts/fetch-silesia.sh b/scripts/fetch-silesia.sh
@@ -1,37 +1,37 @@
-#!/usr/bin/env bash
-# Download and extract the Silesia compression corpus.
-# https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
-#
-# Usage: ./scripts/fetch-silesia.sh [--force]
-#
-# Downloads to samples/silesia/. Skips if already present unless --force.
-
-set -euo pipefail
-
-SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
-PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
-
-DEST="${PROJECT_DIR}/samples/silesia"
-URL="https://sun.aei.polsl.pl/~sdeor/corpus/silesia.zip"
-EXPECTED_FILES=12
-
-if [[ -d "$DEST" ]] && [[ "$(ls "$DEST" | wc -l)" -ge "$EXPECTED_FILES" ]] && [[ "${1:-}" != "--force" ]]; then
-    echo "Silesia corpus already present in $DEST ($(ls "$DEST" | wc -l) files)"
-    echo "Use --force to re-download."
-    exit 0
-fi
-
-TMPFILE="$(mktemp /tmp/silesia-XXXXXX.zip)"
-trap 'rm -f "$TMPFILE"' EXIT
-
-echo "Downloading Silesia corpus (66 MB)..."
-curl -fSL "$URL" -o "$TMPFILE"
-
-mkdir -p "$DEST"
-echo "Extracting to $DEST..."
-unzip -o "$TMPFILE" -d "$DEST"
-
-# Clean up any .pz artifacts from previous runs
-rm -f "$DEST"/*.pz
-
-echo "Done: $(ls "$DEST" | wc -l) files, $(du -sh "$DEST" | cut -f1) total"
+#!/usr/bin/env bash
+# Download and extract the Silesia compression corpus.
+# https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
+#
+# Usage: ./scripts/fetch-silesia.sh [--force]
+#
+# Downloads to samples/silesia/. Skips if already present unless --force.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
+
+DEST="${PROJECT_DIR}/samples/silesia"
+URL="https://sun.aei.polsl.pl/~sdeor/corpus/silesia.zip"
+EXPECTED_FILES=12
+
+if [[ -d "$DEST" ]] && [[ "$(ls "$DEST" | wc -l)" -ge "$EXPECTED_FILES" ]] && [[ "${1:-}" != "--force" ]]; then
+    echo "Silesia corpus already present in $DEST ($(ls "$DEST" | wc -l) files)"
+    echo "Use --force to re-download."
+    exit 0
+fi
+
+TMPFILE="$(mktemp /tmp/silesia-XXXXXX.zip)"
+trap 'rm -f "$TMPFILE"' EXIT
+
+echo "Downloading Silesia corpus (66 MB)..."
+curl -fSL "$URL" -o "$TMPFILE"
+
+mkdir -p "$DEST"
+echo "Extracting to $DEST..."
+unzip -o "$TMPFILE" -d "$DEST"
+
+# Clean up any .pz artifacts from previous runs
+rm -f "$DEST"/*.pz
+
+echo "Done: $(ls "$DEST" | wc -l) files, $(du -sh "$DEST" | cut -f1) total"
diff --git a/src/pipeline/mod.rs b/src/pipeline/mod.rs
@@ -143,8 +143,11 @@ const DEFAULT_BW_BLOCK_SIZE: usize = 512 * 1024;
 /// which also improves throughput on streaming GPU paths.
 const DEFAULT_GPU_BLOCK_SIZE: usize = 128 * 1024;
 
-/// Minimum block size for GPU entropy to win over CPU (empirical from Phase 4).
-/// Applies to both individual blocks and total stream byte count.
+/// Gate that **prevents** routing entropy to the GPU, which is slower than CPU
+/// (0.77x encode, 0.54x decode — see `docs/design-docs/gpu-strategy.md`).
+/// Set deliberately above `DEFAULT_GPU_BLOCK_SIZE` (128KB) so GPU blocks never
+/// qualify for the fused entropy path. Do NOT lower this threshold — it will
+/// regress throughput by ~10-15x. See "Known dead ends" in CLAUDE.md.
 /// 256KB = 262144 bytes (aligns with AC3.2 threshold).
 pub const GPU_ENTROPY_THRESHOLD: usize = 256 * 1024;
 

diff --git a/src/pipeline/parallel.rs b/src/pipeline/parallel.rs
@@ -191,12 +191,23 @@ fn should_route_block_to_gpu_stage0(
 /// Returns `Some((start, end))` when every stage in `start..=end` has a GPU
 /// implementation, allowing the GPU coordinator to run them sequentially
 /// without intermediate queue round-trips or CPU readback.
+///
+/// **WARNING — GPU entropy is currently SLOWER than CPU** (0.77x encode, 0.54x decode).
+/// This function exists as architectural preparation for when GPU entropy becomes
+/// competitive. In practice, `GPU_ENTROPY_THRESHOLD` (256KB) is deliberately set above
+/// `DEFAULT_GPU_BLOCK_SIZE` (128KB) so the fused path is never activated for GPU blocks.
+/// Lowering the threshold or bypassing it will REGRESS throughput by ~10-15x because
+/// it serializes all work onto the single coordinator thread AND uses the slower GPU
+/// entropy path. See `docs/design-docs/gpu-strategy.md` and
+/// `.claude/feedback/2026-03-01-gpu-wins-on-lz77-loses-on-entropy.md`.
 #[cfg(feature = "webgpu")]
 fn gpu_fused_span(pipeline: Pipeline) -> Option<(usize, usize)> {
     match pipeline {
         // Both stages have GPU paths: LZ77 match-finding + rANS encode
+        // NOTE: GPU rANS is slower than CPU — fused path is gated by GPU_ENTROPY_THRESHOLD
         Pipeline::Lzr => Some((0, 1)),
         // Both stages have GPU paths: LzSeq fused match+demux + rANS encode
+        // NOTE: GPU rANS is slower than CPU — fused path is gated by GPU_ENTROPY_THRESHOLD
         Pipeline::LzSeqR => Some((0, 1)),
         _ => None,
     }

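The gating relationship described in the doc comments above can be sketched as follows. This is a standalone illustration, not the code in `parallel.rs`: the `Pipeline` enum is trimmed to three variants, `use_fused_gpu_path` is a hypothetical helper, and the `>=` comparison is an assumption about how the threshold is applied. The compile-time assertion is a suggestion for making the "do not lower this threshold" rule fail the build instead of relying on comments.

```rust
const DEFAULT_GPU_BLOCK_SIZE: usize = 128 * 1024;
const GPU_ENTROPY_THRESHOLD: usize = 256 * 1024;

// Suggested guard (not part of this diff): compilation fails if the
// threshold is ever lowered to or below the default GPU block size.
const _: () = assert!(GPU_ENTROPY_THRESHOLD > DEFAULT_GPU_BLOCK_SIZE);

/// Trimmed stand-in for the real `Pipeline` enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Pipeline {
    Deflate,
    Lzr,
    LzSeqR,
}

/// Mirrors the shape of `gpu_fused_span`: stages 0..=1 both have GPU paths.
fn gpu_fused_span(pipeline: Pipeline) -> Option<(usize, usize)> {
    match pipeline {
        Pipeline::Lzr | Pipeline::LzSeqR => Some((0, 1)),
        _ => None,
    }
}

/// Hypothetical gate: the fused GPU path is taken only when a span exists
/// AND the block is at least as large as the entropy threshold.
fn use_fused_gpu_path(pipeline: Pipeline, block_len: usize) -> bool {
    gpu_fused_span(pipeline).is_some() && block_len >= GPU_ENTROPY_THRESHOLD
}
```

Under these assumptions a default-sized 128KB GPU block never takes the fused path, which matches the intent stated in the warning.
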
diff --git a/src/pipeline/tests.rs b/src/pipeline/tests.rs
@@ -899,6 +899,59 @@ mod gpu_batched_tests {
         let decompressed = decompress(&compressed).unwrap();
         assert_eq!(decompressed, input);
     }
+
+    /// Multi-block GPU pipelining regression test.
+    ///
+    /// Creates input large enough to produce many blocks (>= 8), then
+    /// compresses with GPU and verifies the round-trip. This exercises
+    /// the full GPU coordinator pipeline: ring-buffered LZ77 stage 0
+    /// batching, StageN entropy encoding, and result assembly.
+    ///
+    /// Correctness failures here indicate pipelining bugs (e.g., staging
+    /// buffer reuse before readback, ring slot cross-contamination, or
+    /// incomplete synchronization between GPU compute and CPU readback).
+    #[test]
+    fn test_gpu_pipeline_multiblock_correctness() {
+        let engine = match crate::webgpu::WebGpuEngine::new() {
+            Ok(e) => std::sync::Arc::new(e),
+            Err(_) => return, // engine init failed (e.g. no GPU adapter): skip the test
+        };
+        let opts = CompressOptions {
+            backend: Backend::WebGpu,
+            webgpu_engine: Some(engine),
+            block_size: 64 * 1024, // 64KB blocks → 8 blocks for 512KB input
+            threads: 2,
+            ..CompressOptions::default()
+        };
+
+        // 512KB input with distinct pattern per 64KB region to detect
+        // cross-contamination between blocks.
+        let mut input = Vec::with_capacity(512 * 1024);
+        let patterns: Vec<&[u8]> = vec![
+            b"alpha pattern block one data here. ",
+            b"beta different content for block two. ",
+            b"gamma third block uses gamma pattern. ",
+            b"delta fourth block with delta stuff. ",
+            b"epsilon five five five five five. ",
+            b"zeta block six zeta zeta zeta. ",
+            b"eta seventh block eta eta eta. ",
+            b"theta eighth block theta theta. ",
+        ];
+        for p in &patterns {
+            let block: Vec<u8> = p.iter().cycle().take(64 * 1024).copied().collect();
+            input.extend_from_slice(&block);
+        }
+
+        for pipeline in [Pipeline::Deflate, Pipeline::Lzr, Pipeline::LzSeqR] {
+            let compressed = compress_with_options(&input, pipeline, &opts).unwrap();
+            let decompressed = decompress(&compressed).unwrap();
+            assert_eq!(
+                decompressed, input,
+                "{:?} multi-block GPU pipeline round-trip failed",
+                pipeline
+            );
+        }
+    }
 }
 
 // --- LzSeqR optimal parsing tests (Task 6) ---
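Because the new test gives each 64KB region a distinct pattern, a failure can be localized to a block index. The helper below is a hypothetical diagnostic, not part of this diff; `first_mismatched_block` is an illustrative name. It could be called before `assert_eq!` to report which block diverged instead of dumping two 512KB buffers.

```rust
/// Returns the index of the first `block_size`-sized block where `actual`
/// differs from `expected`, or `None` if the buffers are identical.
/// `block_size` must be nonzero.
fn first_mismatched_block(expected: &[u8], actual: &[u8], block_size: usize) -> Option<usize> {
    expected
        .chunks(block_size)
        .zip(actual.chunks(block_size))
        .position(|(e, a)| e != a)
        .or_else(|| {
            // All shared blocks match; a length difference means the first
            // missing or extra block is the one right after the shorter buffer.
            if expected.len() != actual.len() {
                Some(expected.len().min(actual.len()) / block_size)
            } else {
                None
            }
        })
}
```

A mismatch confined to one block points toward ring slot or staging buffer reuse for that block, while a mismatch at a block boundary suggests an assembly or ordering problem.
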