ChrisLundquist · Mar 10, 2026
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
@@ -5,13 +5,13 @@ For day-to-day development instructions, see `CLAUDE.md`.
 
 ## Completed milestones (12/12)
 - **Algorithms:** LZ77 (brute, hashchain, lazy, parallel), LzSeq (code+extra-bits, repeat offsets, 128KB window), Huffman, BWT (SA-IS), MTF, RLE, FSE, rANS
-- **Pipelines:** Deflate (LZ77+Huffman), Bw (BWT+MTF+RLE+FSE), Lzr (LZ77+rANS), Lzf (LZ77+FSE), LzSeqR (LzSeq+rANS), LzSeqH (LzSeq+Huffman), SortLz (sort-LZ77+FSE) — Deflate, Lzr, and Lzf use multi-stream entropy coding for ~16-18% better compression; LzSeqR/LzSeqH use zstd-style code+extra-bits encoding with 6-stream demux; SortLz uses sort-based match finding (GPU-accelerated)
+- **Pipelines:** Bw (BWT+MTF+RLE+FSE), Lzf (LzSeq+FSE), LzSeqR (LzSeq+rANS), LzSeqH (LzSeq+Huffman), SortLz (sort-LZ77+FSE) — Lzf and LzSeqR/LzSeqH use multi-stream entropy coding for ~16-18% better compression; LzSeqR/LzSeqH use zstd-style code+extra-bits encoding with 6-stream demux; SortLz uses sort-based match finding (GPU-accelerated)
 - **Auto-selection:** Heuristic (`select_pipeline`) and trial-based (`select_pipeline_trial`) pipeline selection using data analysis (entropy, match density, run ratio, autocorrelation); LzSeqR included in trial candidates
 - **Data analysis:** `src/analysis.rs` — statistical profiling (Shannon entropy, autocorrelation, run ratio, match density, distribution shape) with sampling support
 - **Optimal parsing:** GPU top-K match table → CPU backward DP (4-6% better compression)
 - **Multi-threading:** Block-parallel and pipeline-parallel via V2 container format; within-block parallel LZ77 match finding (`compress_lazy_parallel`)
-- **SortLZ:** Sort-based match finder — standalone pipeline (ID 10) and pluggable `MatchFinder::SortLz` for Deflate/Lzr/Lzf/LzSeqR/LzSeqH; GPU radix sort batched (single submit); adaptive `select_match_finder()` heuristic; u64-optimized `extend_match`; 39.6% ratio (beats Deflate 43.4%)
-- **GPU kernels:** LZ77 hash-table (fast), LZ77 batch/per-position (legacy), LZ77 top-K, BWT radix sort + parallel rank assignment, SortLZ radix sort + match verification, Huffman encode (two-pass with Blelloch prefix sum), GPU Deflate chaining (LZ77→Huffman on device)
+- **SortLZ:** Sort-based match finder — standalone pipeline (ID 10) and pluggable `MatchFinder::SortLz` for Lzf/LzSeqR/LzSeqH; GPU radix sort batched (single submit); adaptive `select_match_finder()` heuristic; u64-optimized `extend_match`; 39.6% ratio
+- **GPU kernels:** LZ77 hash-table (fast), LZ77 batch/per-position (legacy), LZ77 top-K, BWT radix sort + parallel rank assignment, SortLZ radix sort + match verification, Huffman encode (two-pass with Blelloch prefix sum)
 - **Tooling:** CLI (`pz` with `-a`/`--auto` and `--trial` flags), C FFI, Criterion benchmarks, CI (3 OS)
 - **Fuzz testing (M5.3):** `cargo-fuzz` infrastructure with 12 targets covering all algorithms and pipelines (roundtrip + crash resistance)
 
@@ -23,9 +23,9 @@ For day-to-day development instructions, see `CLAUDE.md`.
 - **CPU:** Uses SA-IS (Suffix Array by Induced Sorting) — O(n) linear time via doubled-text-with-sentinel strategy.
 - **GPU:** Uses LSB-first 8-bit radix sort with prefix-doubling for suffix array construction. Replaced earlier bitonic sort (PR #21). Features adaptive key width (skip zero-digit radix passes) and event chain batching (one host sync per doubling step). Rank assignment runs on GPU via Blelloch prefix sum + scatter. Still slower than CPU SA-IS at all sizes but dramatically improved from bitonic sort (7-14x faster). The GPU uses circular comparison `(sa[i]+k) % n` vs CPU SA-IS's doubled-text approach — both produce valid BWTs that round-trip correctly.
 
-## Multi-stream Deflate
+## Multi-stream entropy coding
 
-The Deflate, Lzr, and Lzf pipelines use **multi-stream entropy coding** to improve
+The Lzf and LzSeqR pipelines use **multi-stream entropy coding** to improve
 compression ratio by separating LZ77 output into independent byte streams with
 tighter symbol distributions. Instead of feeding one mixed stream to the entropy
 coder, the encoder deinterleaves tokens into three streams:
@@ -36,7 +36,7 @@ coder, the encoder deinterleaves tokens into three streams:
 | **Lengths** | Match lengths (capped to u8) | Length distribution is highly skewed (short matches dominate) |
 | **Literals** | Literal bytes + low offset bytes + next bytes | Natural-language / binary byte distribution |
 
-Each stream gets its own Huffman tree (Deflate), FSE table (Lzf), or rANS context (Lzr),
+Each stream gets its own FSE table (Lzf) or rANS context (LzSeqR),
 yielding lower per-stream entropy than a single combined stream.
 
 ### Encoding format
@@ -65,14 +65,12 @@ Comparison on Canterbury + Large corpus (14 files, 13.3 MB total), averaged over
 
 | Pipeline | Before (bytes) | After (bytes) | Size delta | Throughput delta |
 |----------|---------------|--------------|------------|-----------------|
-| Deflate  | 6,319,168     | 5,301,184    | **-16.1%** | +1.6% faster    |
 | Lzf      | 6,199,044     | 5,107,601    | **-17.6%** | +2.8% faster    |
 
 **Decompression throughput:**
 
 | Pipeline | Throughput delta |
 |----------|-----------------|
-| Deflate  | **+8.4%** faster |
 | Lzf      | **+2.4%** faster |
 
 Multi-stream is a pure win: better compression **and** faster speed. The largest
@@ -132,16 +130,9 @@ Every symbol present in the input gets at least frequency 1. Excess is trimmed
 from the most-frequent symbol; deficit is added to it. The normalization code
 is shared conceptually with `src/fse.rs` (both operate on power-of-2 tables).
 
-### Pipeline: Lzr (LZ77 + rANS)
-
-`Pipeline::Lzr` (ID 3) reuses the existing multi-stream LZ77 architecture
-(offsets, lengths, literals) with rANS as the entropy coder instead of Huffman
-(Deflate) or FSE (Lzf). It participates in auto-selection via
-`select_pipeline_trial()`.
-
 ### Forward TODOs
 
-See `docs/exec-plans/tech-debt-tracker.md` for rANS SIMD decode and reciprocal multiplication work items. Benchmark integration (rANS/Lzr in `benches/throughput.rs` and `benches/stages.rs`) is also pending.
+See `docs/exec-plans/tech-debt-tracker.md` for rANS SIMD decode and reciprocal multiplication work items.
 
 ## SIMD acceleration
 `src/simd.rs` provides runtime-dispatched SIMD for CPU hot paths:
@@ -170,8 +161,7 @@ match verification. Zero atomics, fully deterministic — ideal for GPU executio
 | **`MatchFinder::SortLz`** | Pluggable match finder for other pipelines | Host pipeline's format |
 
 When used as a `MatchFinder`, SortLZ is transparent to the wire format — the
-output is 100% compatible with the host pipeline (Deflate, Lzr, Lzf, LzSeqR,
-LzSeqH). The consumer and decompressor see no difference.
+output is 100% compatible with the host pipeline (Lzf, LzSeqR, LzSeqH). The consumer and decompressor see no difference.
 
 ### Pipeline::SortLz wire format (per block)
 
@@ -221,19 +211,7 @@ Uses GPU radix sort (same kernels as BWT) + GPU match verification:
 | 256KB| 131 MB/s     | 31 MB/s   | 53 MB/s   | **1.7x faster** |
 | 4MB  | 142 MB/s     | 8 MB/s    | 89 MB/s   | **10.6x faster** |
 
-SortLZ compression ratio: **39.6%** (vs hashchain+Deflate 43.4%, BWT 32.7%).
-
-## GPU stage chaining
-The Deflate GPU path chains LZ77 → Huffman on the GPU with minimized transfers:
-1. GPU: LZ77 hash-table kernel → download match array → CPU dedupe + serialize
-2. GPU: upload LZ77 bytes once → `ByteHistogram` kernel → download only 256×u32 (1KB)
-3. CPU: build Huffman tree from histogram, produce code LUT
-4. GPU: Huffman encode (reusing LZ77 buffer) with Blelloch prefix sum
-5. GPU: download final encoded bitstream
-
-The `ByteHistogram` kernel eliminates the need to scan LZ77 data on CPU for frequency counting — only 1KB of histogram data is transferred instead of the full LZ77 stream.
-
-This is activated automatically when a GPU backend is selected and input ≥ `MIN_GPU_INPUT_SIZE`.
+SortLZ compression ratio: **39.6%** (vs BWT 32.7%).
 
 ## Parallel LZ77
 `compress_lazy_parallel(input, num_threads)` pre-computes matches in parallel (each thread builds its own hash chain), then serializes sequentially with lazy evaluation. Thresholds:
@@ -262,14 +240,6 @@ This is activated automatically when a GPU backend is selected and input ≥ `MI
 
 GPU Huffman with Blelloch prefix sum crosses over ~128KB. At 256KB the GPU scan path is 3.4x faster than CPU.
 
-### Deflate chained (GPU LZ77 → GPU Huffman)
-
-| Size | CPU 1-thread | GPU chained | Speedup |
-|------|-------------|-------------|---------|
-| 64KB | 1.63ms (38 MiB/s) | 3.01ms (21 MiB/s) | CPU 1.8x faster |
-| 256KB | 6.06ms (41 MiB/s) | 4.93ms (51 MiB/s) | **GPU 1.2x faster** |
-| 1MB | 23.5ms (43 MiB/s) | 18.3ms (55 MiB/s) | **GPU 1.3x faster** |
-
 ### BWT GPU (radix sort)
 
 | Size | GPU radix | Throughput | Old bitonic | Speedup vs bitonic |

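Note on the rANS normalization context in the ARCHITECTURE.md hunk above ("every symbol present in the input gets at least frequency 1; excess is trimmed from the most-frequent symbol; deficit is added to it"): the rule can be sketched in a few lines of Rust. This is a minimal illustration, not the code in `src/rans.rs` or `src/fse.rs`. The function name `normalize_freqs`, its signature, and the `Option` return are assumptions made for the example. The sketch returns `None` when the trim would leave the most-frequent symbol without a slot, a case the diff does not describe.

```rust
/// Scale raw symbol counts so they sum to exactly `1 << table_log`.
/// Hypothetical helper for illustration; not the pz implementation.
fn normalize_freqs(counts: &[u32], table_log: u32) -> Option<Vec<u32>> {
    let target = 1u64 << table_log;
    let total: u64 = counts.iter().map(|&c| u64::from(c)).sum();
    if total == 0 {
        return None; // nothing to normalize
    }

    // Proportional share, rounded down, but never below 1 for a present symbol.
    let mut norm: Vec<u64> = counts
        .iter()
        .map(|&c| {
            if c == 0 {
                0
            } else {
                (u64::from(c) * target / total).max(1)
            }
        })
        .collect();

    // The most-frequent symbol absorbs the rounding error in either direction.
    let max_idx = (0..counts.len()).max_by_key(|&i| counts[i])?;
    let sum: u64 = norm.iter().sum();
    if sum > target {
        let excess = sum - target;
        if norm[max_idx] <= excess {
            return None; // trimming would drop the symbol to zero (assumed failure mode)
        }
        norm[max_idx] -= excess;
    } else {
        norm[max_idx] += target - sum;
    }

    Some(norm.into_iter().map(|f| f as u32).collect())
}
```

Calling `normalize_freqs(&[100, 1, 1, 1], 2)` exercises the trim path: the three rare symbols are bumped to 1, and the dominant symbol gives up the excess so the table still sums to 4.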
diff --git a/Cargo.toml b/Cargo.toml
@@ -21,10 +21,6 @@ path = "src/bin/pz.rs"
 [dev-dependencies]
 criterion = { version = "0.5", features = ["html_reports"] }
 
-[[bench]]
-name = "throughput_deflate"
-harness = false
-
 [[bench]]
 name = "throughput_bw"
 harness = false
@@ -101,10 +97,6 @@ harness = false
 name = "stages_auto_select"
 harness = false
 
-[[bench]]
-name = "stages_deflate_webgpu_chained"
-harness = false
-
 [[bench]]
 name = "rans_decode_bench"
 harness = false

diff --git a/README.md b/README.md
@@ -6,12 +6,11 @@ Lossless data compression library with GPU acceleration, written in Rust.
 
 | Pipeline | Stages | Similar to |
 |----------|--------|------------|
-| **Deflate** | LZ77 + Multi-stream Huffman | gzip |
 | **BW** | BWT + MTF + RLE + FSE | bzip2 |
 | **LZR** | LZ77 + Multi-stream rANS | — |
 | **LZF** | LZ77 + Multi-stream FSE | zstd-like |
 
-Deflate, Lzr, and Lzf use **multi-stream entropy coding**: LZ77 output is split into separate offset, length, and literal streams, each with its own entropy coder. This yields ~16-18% better compression than single-stream encoding with no speed penalty. LZR uses rANS (range ANS) — a multiply-shift entropy coder designed for SIMD and GPU parallelism via interleaved decode states. LZF uses FSE (Finite State Entropy) — a fast table-driven tANS coder similar to zstd.
+Lzr and Lzf use **multi-stream entropy coding**: LZ77 output is split into separate offset, length, and literal streams, each with its own entropy coder. This yields ~16-18% better compression than single-stream encoding with no speed penalty. LZR uses rANS (range ANS) — a multiply-shift entropy coder designed for SIMD and GPU parallelism via interleaved decode states. LZF uses FSE (Finite State Entropy) — a fast table-driven tANS coder similar to zstd.
 
 Optional WebGPU support for GPU-accelerated LZ77 match finding and BWT suffix array construction.
 
@@ -77,7 +76,7 @@ Uses [criterion](https://github.com/bheisler/criterion.rs) for statistical bench
 ```
 cargo bench                         # CPU benchmarks
 cargo bench --features webgpu       # includes GPU benchmarks
-cargo bench --bench throughput_deflate
+cargo bench --bench throughput_lzf
 cargo bench --bench stages_lz77
 ```
 

diff --git a/benches/stages_deflate_webgpu_chained.rs b/benches/stages_deflate_webgpu_chained.rs
diff --git a/benches/stages_lz77.rs b/benches/stages_lz77.rs
@@ -91,7 +91,7 @@ fn bench_lz77_webgpu_batched(c: &mut Criterion) {
 
         let eng = engine.clone();
         group.bench_with_input(
-            BenchmarkId::new("gpu_batched_deflate", size),
+            BenchmarkId::new("gpu_batched_lzf", size),
             &data,
             move |b, data| {
                 let opts = CompressOptions {
@@ -100,23 +100,19 @@ fn bench_lz77_webgpu_batched(c: &mut Criterion) {
                     threads: 4,
                     ..Default::default()
                 };
-                b.iter(|| {
-                    pz::pipeline::compress_with_options(data, Pipeline::Deflate, &opts).unwrap()
-                });
+                b.iter(|| pz::pipeline::compress_with_options(data, Pipeline::Lzf, &opts).unwrap());
             },
         );
 
         group.bench_with_input(
-            BenchmarkId::new("cpu_parallel_deflate", size),
+            BenchmarkId::new("cpu_parallel_lzf", size),
             &data,
             |b, data| {
                 let opts = CompressOptions {
                     threads: 4,
                     ..Default::default()
                 };
-                b.iter(|| {
-                    pz::pipeline::compress_with_options(data, Pipeline::Deflate, &opts).unwrap()
-                });
+                b.iter(|| pz::pipeline::compress_with_options(data, Pipeline::Lzf, &opts).unwrap());
             },
         );
     }

diff --git a/benches/stages_match_finders.rs b/benches/stages_match_finders.rs
@@ -203,19 +203,15 @@ fn bench_gpu_match_finding(c: &mut Criterion) {
     }
     group.finish();
 
-    // --- Part 3: Cross-pipeline GPU sortlz (Deflate, LzSeqR, Lzf) at 256K ---
+    // --- Part 3: Cross-pipeline GPU sortlz (LzSeqR, Lzf) at 256K ---
     let mut group = c.benchmark_group("gpu_sortlz_pipelines");
     cap(&mut group);
 
     let size = 262_144;
     let data = get_test_data(size);
     group.throughput(Throughput::Bytes(size as u64));
 
-    for (name, pipeline) in [
-        ("deflate", Pipeline::Deflate),
-        ("lzseqr", Pipeline::LzSeqR),
-        ("lzf", Pipeline::Lzf),
-    ] {
+    for (name, pipeline) in [("lzseqr", Pipeline::LzSeqR), ("lzf", Pipeline::Lzf)] {
         // CPU hashchain baseline
         let opts = CompressOptions {
             parse_strategy: ParseStrategy::Lazy,

diff --git a/benches/throughput_deflate.rs b/benches/throughput_deflate.rs
diff --git a/docs/DESIGN.md b/docs/DESIGN.md
@@ -19,15 +19,15 @@ Each algorithm (`bwt`, `huffman`, `lz77`, `fse`, `rans`, etc.) is:
 - Usable standalone or in pipelines
 - Not tied to a specific compression format
 
-Pipelines (`deflate`, `bw`, `lzr`, `lzf`) combine algorithms via the demuxer pattern:
+Pipelines (`bw`, `lzf`, `lzseqr`) combine algorithms via the demuxer pattern:
 ```rust
 // Pipeline = sequence of stages
-LZ77 → demux into streams → Huffman encode → merge streams
+LZ77 → demux into streams → FSE/rANS encode → merge streams
 ```
 
 **Why:** Flexibility. New pipelines reuse existing algorithms. New algorithms enhance all pipelines.
 
-**Example:** The same LZ77 implementation works in Deflate (LZ77+Huffman), Lzf (LZ77+FSE), and Lzr (LZ77+rANS) pipelines.
+**Example:** The same LzSeq implementation works in the Lzf (LzSeq+FSE) and LzSeqR (LzSeq+rANS) pipelines.
 
 ### 2. GPU-Friendly Design Patterns
 
@@ -210,7 +210,7 @@ Below these thresholds, kernel launch overhead dominates.
 
 ### Multi-Stream Entropy Coding
 
-LZ-based pipelines (Deflate, Lzf, Lzr) use multi-stream encoding (3 independent streams per block). See `ARCHITECTURE.md` for benchmark results (16-18% better compression, 2-8% faster decompression) and `design-docs/pipeline-architecture.md` for the stream layout.
+LZ-based pipelines (Lzf, LzSeqR) use multi-stream encoding (3 independent streams per block). See `ARCHITECTURE.md` for benchmark results (Lzf: 17.6% smaller output, 2.4% faster decompression) and `design-docs/pipeline-architecture.md` for the stream layout.
 
 ### Memory Management
 

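Note on the multi-stream wording in the DESIGN.md and ARCHITECTURE.md hunks: the claim that separate streams have "tighter symbol distributions" can be checked with an order-0 entropy estimate. The sketch below is illustrative only. The `Token` layout, the choice of the high offset byte for the offsets stream, and all function names are assumptions, not the pz wire format. It compares the Shannon cost of one interleaved byte stream against the summed cost of three demuxed streams. It does not reproduce the benchmark figures quoted in the docs.

```rust
/// Hypothetical LZ77 token layout, for illustration only.
#[derive(Clone, Copy)]
struct Token {
    offset: u16,
    length: u8,
    next: u8,
}

/// Order-0 Shannon cost of a byte stream, in bits.
fn entropy_bits(data: &[u8]) -> f64 {
    let mut hist = [0usize; 256];
    for &b in data {
        hist[b as usize] += 1;
    }
    let n = data.len() as f64;
    hist.iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let c = c as f64;
            -c * (c / n).log2()
        })
        .sum()
}

/// Single mixed stream: each token serialized as 4 consecutive bytes.
fn interleave(tokens: &[Token]) -> Vec<u8> {
    let mut out = Vec::with_capacity(tokens.len() * 4);
    for t in tokens {
        out.extend_from_slice(&t.offset.to_le_bytes());
        out.push(t.length);
        out.push(t.next);
    }
    out
}

/// Three streams: (offsets, lengths, literals).
/// Assumed split: high offset byte | length | low offset byte + next byte.
fn demux(tokens: &[Token]) -> (Vec<u8>, Vec<u8>, Vec<u8>) {
    let mut offsets = Vec::with_capacity(tokens.len());
    let mut lengths = Vec::with_capacity(tokens.len());
    let mut literals = Vec::with_capacity(tokens.len() * 2);
    for t in tokens {
        let [lo, hi] = t.offset.to_le_bytes();
        offsets.push(hi);
        lengths.push(t.length);
        literals.push(lo);
        literals.push(t.next);
    }
    (offsets, lengths, literals)
}
```

Because the mixed stream's histogram is a weighted mixture of the three per-stream histograms, concavity of entropy guarantees that `entropy_bits` summed over the demuxed streams never exceeds `entropy_bits(&interleave(..))` for the same tokens. How much is actually saved depends on the data and on the entropy coder's table overhead, which this estimate ignores.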
diff --git a/examples/block_size_experiment.rs b/examples/block_size_experiment.rs
@@ -35,7 +35,7 @@ const BLOCK_SIZES: &[usize] = &[
 ];
 
 /// Pipelines to test (LZ77-based, GPU-eligible).
-const PIPELINES: &[Pipeline] = &[Pipeline::Deflate, Pipeline::Lzf];
+const PIPELINES: &[Pipeline] = &[Pipeline::Lzf];
 
 fn load_data(size: usize) -> Vec<u8> {
     let manifest = Path::new(env!("CARGO_MANIFEST_DIR"));

diff --git a/examples/decode_profile.rs b/examples/decode_profile.rs
@@ -36,11 +36,7 @@ fn main() {
 
     let iters = 20;
 
-    for &(name, pipeline) in &[
-        ("LzSeqR", Pipeline::LzSeqR),
-        ("LzSeqH", Pipeline::LzSeqH),
-        ("Deflate", Pipeline::Deflate),
-    ] {
+    for &(name, pipeline) in &[("LzSeqR", Pipeline::LzSeqR), ("LzSeqH", Pipeline::LzSeqH)] {
         let options = CompressOptions::default();
         let compressed = pipeline::compress_with_options(&data, pipeline, &options).unwrap();
         let ratio = compressed.len() as f64 / data.len() as f64 * 100.0;

diff --git a/examples/explore_pipelines.rs b/examples/explore_pipelines.rs
@@ -307,16 +307,13 @@ fn main() {
     #[allow(unused_mut)]
     let mut pz_configs: Vec<(Pipeline, ParseStrategy, usize, Backend)> = vec![
         // Single-threaded CPU variants
-        (Pipeline::Deflate, ParseStrategy::Lazy, 1, Backend::Cpu),
-        (Pipeline::Deflate, ParseStrategy::Optimal, 1, Backend::Cpu),
         (Pipeline::Lzf, ParseStrategy::Lazy, 1, Backend::Cpu),
         (Pipeline::Lzf, ParseStrategy::Optimal, 1, Backend::Cpu),
         (Pipeline::Bw, ParseStrategy::Auto, 1, Backend::Cpu),
         (Pipeline::Bbw, ParseStrategy::Auto, 1, Backend::Cpu),
         // Experimental: LZSS pipeline
         (Pipeline::LzssR, ParseStrategy::Auto, 1, Backend::Cpu),
         // Multi-threaded CPU
-        (Pipeline::Deflate, ParseStrategy::Lazy, 0, Backend::Cpu),
         (Pipeline::Lzf, ParseStrategy::Lazy, 0, Backend::Cpu),
         (Pipeline::Bw, ParseStrategy::Auto, 0, Backend::Cpu),
         (Pipeline::Bbw, ParseStrategy::Auto, 0, Backend::Cpu),
@@ -327,7 +324,6 @@ fn main() {
     #[cfg(feature = "webgpu")]
     if has_webgpu {
         pz_configs.extend([
-            (Pipeline::Deflate, ParseStrategy::Auto, 1, Backend::WebGpu),
             (Pipeline::Lzf, ParseStrategy::Auto, 1, Backend::WebGpu),
             (Pipeline::Bw, ParseStrategy::Auto, 1, Backend::WebGpu),
         ]);

diff --git a/examples/pipeline_comparison.rs b/examples/pipeline_comparison.rs
@@ -17,7 +17,6 @@ fn test_pipeline_comparison(name: &str, data: Vec<u8>) {
     let pipelines = vec![
         ("Lzf (LzSeq+FSE)", Pipeline::Lzf),
         ("LzSeqR (LzSeq+rANS)", Pipeline::LzSeqR),
-        ("Deflate (LZ77+Huffman)", Pipeline::Deflate),
     ];
 
     println!(