ChrisLundquist · ChrisLundquist · Mar 12, 2026 · Mar 11, 2026 · Mar 12, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -38,14 +38,57 @@ Specialized agents in `.claude/agents/` run on cheaper models and keep verbose o
 - **tooling** — build scripts and workflow automation (Sonnet; consumes `.claude/friction/`)
 - **maintainer** — review feedback backlog, update CLAUDE.md, delegate improvements (Opus; consumes `.claude/feedback/`)
 
+## Architecture overview
+
+All LZ-based pipelines share a unified token architecture (since PR #118):
+
+```
+input → tokenize() → Vec<LzToken> → TokenEncoder::encode() → multi-stream → entropy_encode()
+```
+
+**Three pluggable wire encoders** select how tokens map to byte streams:
+- **LzSeqEncoder** (6 streams) — log2-coded offsets/lengths with repeat-offset tracking. Used by Lzf, LzSeqR, LzSeqH, SortLz. Best ratio.
+- **LzssEncoder** (4 streams) — flag-bit based, raw u16 offsets+lengths. Used by Lzfi, LzssR. Faster decode, worse ratio.
+- **Lz77Encoder** (3 streams) — DEFLATE-style. Legacy, no active pipeline uses it.
+
+**Active pipelines:**
+
+| Pipeline | Encoder | Entropy | Notes |
+|----------|---------|---------|-------|
+| **Lzf** (default) | LzSeq | FSE | General-purpose, good ratio |
+| **LzSeqR** | LzSeq | rANS | Fastest end-to-end in the CLI benchmark, ratio matches Lzf |
+| **LzSeqH** | LzSeq | Huffman | Fast decode |
+| **Lzfi** | LZSS | interleaved FSE | Fastest raw algorithm but poor ratio (46% vs 34%) |
+| **LzssR** | LZSS | rANS | Dominated by Lzfi, removal candidate |
+| **SortLz** | LzSeq (internal) | FSE | Deterministic GPU radix-sort matching |
+| **Bw** / **Bbw** | — | FSE | BWT-based, no LZ tokens |
+
+**Removed pipelines:** Deflate (#117), Lzr (#118), Lz78R (#116), Parlz (ratio loss)
+
+**CLI path:** `pz` always uses `streaming::compress_stream`, not `pipeline::compress_with_options`. The streaming path uses block-by-block parallelism with bounded memory.
+
+### Silesia corpus benchmarks (211MB, CLI end-to-end, verified round-trip)
+
+| Method | Compress | Ratio | Comp MB/s | Decomp MB/s |
+|---------|----------|-------|-----------|-------------|
+| gzip | 4.69s | 32.2% | 45 | 225 |
+| pz lzf | 1.90s | 34.6% | 111 | 133 |
+| pz lzseqr | 1.62s | 34.4% | 131 | 154 |
+| pz lzfi | 1.70s | 46.0% | 125 | 143 |
+
+pz compresses about 2.5–2.9x faster than gzip in the table above (4.69s vs 1.62–1.90s), with a ~2pp ratio gap for lzf and lzseqr (lzfi trails by ~14pp). Decompress is faster per-file but slower on many small files due to ~90ms per-invocation startup overhead. Criterion (pure algorithm, no I/O) measures 333–543 MB/s, so the 3–4x CLI gap is in the streaming path, not the compressor.
+
+**Benchmark corpus:** `./scripts/fetch-silesia.sh` downloads the 211MB Silesia corpus to `samples/silesia/`.
+
 ## Project layout
 
 - `src/lib.rs` — crate root, `PzError`/`PzResult` types
-- `src/{algorithm}.rs` — one file per composable algorithm (bwt, crc32, deflate, fse, huffman, lz77, lz78, lzseq, lzss, mtf, rans, rle)
+- `src/lz_token.rs` — universal `LzToken` type, `TokenEncoder` trait, three encoder implementations
+- `src/{algorithm}.rs` — one file per composable algorithm (bwt, crc32, fse, huffman, lz77, lzseq, lzss, mtf, rans, rle)
 - `src/analysis.rs` — data profiling (entropy, match density, run ratio, autocorrelation)
 - `src/optimal.rs` — optimal parsing (GPU top-K + backward DP)
 - `src/simd.rs` — SIMD decode paths for rANS
-- `src/streaming.rs` — streaming compression interface
+- `src/streaming.rs` — streaming compression interface (CLI entry point)
 - `src/ffi.rs` — C FFI bindings
 - `src/pipeline/` — multi-stage compression pipelines, auto-selection, block parallelism, demux
 - `src/bin/pz.rs` — CLI binary (`pz` with `-a`/`--auto` and `--trial` flags)
@@ -63,14 +106,17 @@ Before optimizing GPU code paths, read this first — multiple agents have spent
 - **The CLI uses `streaming::compress_stream`, not `pipeline::compress_with_options`** — the parallel scheduler's GPU coordinator is not invoked by the CLI. The streaming path deliberately uses CPU rANS for entropy.
 - **The real GPU win (ring-buffered LZ77 batching) is already shipped** — delivers +7-17% throughput. See `docs/design-docs/gpu-strategy.md`.
 - **GPU device init time skews throughput benchmarks** — first-call GPU init adds significant overhead that `bench.sh` captures but Criterion amortizes across iterations. When comparing GPU vs CPU throughput, use Criterion (`cargo bench`) for apples-to-apples; `bench.sh` reflects real-world cold-start cost. Don't chase "GPU is slower" regressions that are really just init time.
-- **Compression ratio is limited by 5-byte match encoding, not match quality** — the LZ match-finder finds good matches, but the 5-byte serialized match format creates overhead on short matches. Improving ratio means fixing the encoding format, not tuning the matcher.
+- **Compression ratio is limited by wire encoding overhead, not match quality** — the LZ match-finder finds good matches. The legacy Lz77Encoder (5 bytes per match) was the worst offender; LzSeqEncoder (log2-coded, 6 streams) is much better but still ~2pp behind gzip on Silesia (34.4% vs 32.2%). Further ratio gains require encoding format work, not matcher tuning.
 - **GPU Huffman is a dead end** — Huffman coding requires bit-level alignment, but GPU throughput depends on byte-aligned memory access patterns. This is a fundamental architectural mismatch; do not attempt to port Huffman to GPU.
 - **GPU hash tables for LZ matching don't work** — GPU atomics don't preserve insertion order, so hash chains lose recency information. Match quality collapses to ~6% vs CPU's 99.6% on repetitive data. Tried twice (global atomics + shared-memory variant), both catastrophically failed. See `docs/design-docs/experiments.md`.
 - **SSE2 rANS decode is 32% slower than scalar** — scalar 4-lane decode gets good ILP from out-of-order execution. SSE2 extract operations serialize and lose that parallelism. Proper SIMD rANS would need merged slot-indexed tables and SSE4.1+. The dispatch is disabled; don't re-enable it.
 - **Fully parallel GPU LZ parsing (ParlZ) has unacceptable ratio loss** — 37.6% compression gap vs serial parsing. Forward-max-propagation conflict resolution is too aggressive. Hybrid GPU match-finding + CPU serial parsing is the correct architecture.
 - **Iterative GPU algorithms have quadratic host overhead** — Repair grammar compression hit 0.4 MB/s due to 100+ rounds of buffer alloc + readback. Avoid per-round GPU↔CPU synchronization; prefer single-dispatch or persistent-buffer designs.
 - **Window-capped suffix sorts break BWT invertibility** — FWST produced 433% ratio (massive expansion). Full suffix sort is structurally required for LF-mapping; there's no shortcut.
 
+- **Streaming path is the CLI bottleneck, not the compressor** — Criterion measures 333–543 MB/s for raw algorithms, but CLI delivers 111–131 MB/s on the same data. The 3–4x gap is in `streaming::compress_stream`, not the encoder. Pipeline-level algorithmic speed differences (e.g., Lzfi vs LzSeqR) are largely invisible at the CLI level because streaming overhead dominates.
+- **LzSeqR parallel encode used to route to incompatible GPU rANS** — `run_compress_stage` in `stages.rs` sent LzSeqR entropy to `stage_rans_encode_webgpu` (GPU chunked payload format), while the single-block path used standard CPU rANS. The chunked format was incompatible with all decoders. Fixed in PR #120 by routing to CPU rANS. Don't re-enable GPU rANS for LzSeqR without first making its wire format compatible with the decoders.
+
 For detailed history of all failed experiments, see `docs/design-docs/gpu-experiments-wave2-conclusions.md` and `docs/design-docs/experiments.md`.
 
 ## Key conventions
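
Review note on the Silesia table in the CLAUDE.md hunk above: the throughput and speedup figures can be cross-checked from the listed compress times. The sketch below is a standalone calculation, not code from this repository. The helper names `throughput_mb_s` and `speedup` are hypothetical, and the 211 MB corpus size is the rounded value from the table heading, so results may differ from the table by about 1 MB/s.

```rust
/// Throughput in MB/s for `size_mb` megabytes processed in `seconds`.
/// (Hypothetical helper, for illustration only.)
fn throughput_mb_s(size_mb: f64, seconds: f64) -> f64 {
    size_mb / seconds
}

/// How many times faster a run taking `candidate_s` seconds is than a
/// baseline run taking `baseline_s` seconds.
fn speedup(baseline_s: f64, candidate_s: f64) -> f64 {
    baseline_s / candidate_s
}

fn main() {
    // Rounded corpus size from the table heading (assumption: MB, not MiB).
    const CORPUS_MB: f64 = 211.0;
    // gzip compress time copied from the table.
    const GZIP_S: f64 = 4.69;
    // (method, compress seconds) copied from the table.
    let rows = [("pz lzf", 1.90), ("pz lzseqr", 1.62), ("pz lzfi", 1.70)];

    for &(name, secs) in rows.iter() {
        println!(
            "{}: {:.0} MB/s, {:.2}x vs gzip",
            name,
            throughput_mb_s(CORPUS_MB, secs),
            speedup(GZIP_S, secs)
        );
    }
}
```

Running the same arithmetic against the Criterion range quoted in the hunk (333–543 MB/s) gives the CLI-versus-algorithm gap that the streaming-path bullet refers to.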

diff --git a/benches/throughput_lzfi.rs b/benches/throughput_lzfi.rs
@@ -8,11 +8,11 @@ use throughput_common::{run_throughput_benches, ThroughputBenchSpec};
 const SPEC: ThroughputBenchSpec = ThroughputBenchSpec {
     id: "lzfi",
     pipeline: Pipeline::Lzfi,
-    parallel: false,
-    large: false,
-    decompress_large: false,
-    webgpu: false,
-    webgpu_large: false,
+    parallel: true,
+    large: true,
+    decompress_large: true,
+    webgpu: true,
+    webgpu_large: true,
 };
 
 fn bench(c: &mut Criterion) {