diff --git a/CLAUDE.md b/CLAUDE.md
index 3c2eefc..d79ca2b 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -38,14 +38,57 @@ Specialized agents in `.claude/agents/` run on cheaper models and keep verbose o
 - **tooling** — build scripts and workflow automation (Sonnet; consumes `.claude/friction/`)
 - **maintainer** — review feedback backlog, update CLAUDE.md, delegate improvements (Opus; consumes `.claude/feedback/`)
 
+## Architecture overview
+
+All LZ-based pipelines share a unified token architecture (since PR #118):
+
+```
+input → tokenize() → Vec → TokenEncoder::encode() → multi-stream → entropy_encode()
+```
+
+**Three pluggable wire encoders** select how tokens map to byte streams:
+- **LzSeqEncoder** (6 streams) — log2-coded offsets/lengths with repeat-offset tracking. Used by Lzf, LzSeqR, LzSeqH, SortLz. Best ratio.
+- **LzssEncoder** (4 streams) — flag-bit based, raw u16 offsets+lengths. Used by Lzfi, LzssR. Faster decode, worse ratio.
+- **Lz77Encoder** (3 streams) — DEFLATE-style. Legacy, no active pipeline uses it.
+
+**Active pipelines:**
+
+| Pipeline | Encoder | Entropy | Notes |
+|----------|---------|---------|-------|
+| **Lzf** (default) | LzSeq | FSE | General-purpose, good ratio |
+| **LzSeqR** | LzSeq | rANS | Fastest overall, ratio matches Lzf |
+| **LzSeqH** | LzSeq | Huffman | Fast decode |
+| **Lzfi** | LZSS | interleaved FSE | Fastest algorithm but poor ratio (46% vs 34%) |
+| **LzssR** | LZSS | rANS | Dominated by Lzfi, removal candidate |
+| **SortLz** | LzSeq (internal) | FSE | Deterministic GPU radix-sort matching |
+| **Bw** / **Bbw** | — | FSE | BWT-based, no LZ tokens |
+
+**Removed pipelines:** Deflate (#117), Lzr (#118), Lz78R (#116), Parlz (ratio loss)
+
+**CLI path:** `pz` always uses `streaming::compress_stream`, not `pipeline::compress_with_options`. The streaming path uses block-by-block parallelism with bounded memory.
+
+### Silesia corpus benchmarks (211MB, CLI end-to-end, verified round-trip)
+
+| Method  | Compress | Ratio | Comp MB/s | Decomp MB/s |
+|---------|----------|-------|-----------|-------------|
+| gzip    | 4.69s    | 32.2% | 45        | 225         |
+| pz lzf  | 1.90s    | 34.6% | 111       | 133         |
+| pz lzseqr | 1.62s  | 34.4% | 131       | 154         |
+| pz lzfi | 1.70s    | 46.0% | 125       | 143         |
+
+pz compresses 2.5–7x faster than gzip with ~2pp ratio gap. Decompress is faster per-file but slower on many small files due to ~90ms per-invocation startup overhead. Criterion (pure algorithm, no I/O) measures 333–543 MB/s — the 3–4x CLI gap is in the streaming path, not the compressor.
+
+**Benchmark corpus:** `./scripts/fetch-silesia.sh` downloads the 211MB Silesia corpus to `samples/silesia/`.
+
 ## Project layout
 
 - `src/lib.rs` — crate root, `PzError`/`PzResult` types
-- `src/{algorithm}.rs` — one file per composable algorithm (bwt, crc32, deflate, fse, huffman, lz77, lz78, lzseq, lzss, mtf, rans, rle)
+- `src/lz_token.rs` — universal `LzToken` type, `TokenEncoder` trait, three encoder implementations
+- `src/{algorithm}.rs` — one file per composable algorithm (bwt, crc32, fse, huffman, lz77, lzseq, lzss, mtf, rans, rle)
 - `src/analysis.rs` — data profiling (entropy, match density, run ratio, autocorrelation)
 - `src/optimal.rs` — optimal parsing (GPU top-K + backward DP)
 - `src/simd.rs` — SIMD decode paths for rANS
-- `src/streaming.rs` — streaming compression interface
+- `src/streaming.rs` — streaming compression interface (CLI entry point)
 - `src/ffi.rs` — C FFI bindings
 - `src/pipeline/` — multi-stage compression pipelines, auto-selection, block parallelism, demux
 - `src/bin/pz.rs` — CLI binary (`pz` with `-a`/`--auto` and `--trial` flags)
@@ -63,7 +106,7 @@ Before optimizing GPU code paths, read this first — multiple agents have spent
 - **The CLI uses `streaming::compress_stream`, not `pipeline::compress_with_options`** — the parallel scheduler's GPU coordinator is not invoked by the CLI. The streaming path deliberately uses CPU rANS for entropy.
 - **The real GPU win (ring-buffered LZ77 batching) is already shipped** — delivers +7-17% throughput. See `docs/design-docs/gpu-strategy.md`.
 - **GPU device init time skews throughput benchmarks** — first-call GPU init adds significant overhead that `bench.sh` captures but Criterion amortizes across iterations. When comparing GPU vs CPU throughput, use Criterion (`cargo bench`) for apples-to-apples; `bench.sh` reflects real-world cold-start cost. Don't chase "GPU is slower" regressions that are really just init time.
-- **Compression ratio is limited by 5-byte match encoding, not match quality** — the LZ match-finder finds good matches, but the 5-byte serialized match format creates overhead on short matches. Improving ratio means fixing the encoding format, not tuning the matcher.
+- **Compression ratio is limited by wire encoding overhead, not match quality** — the LZ match-finder finds good matches. The legacy Lz77Encoder (5-byte per match) was the worst offender; LzSeqEncoder (log2-coded, 6 streams) is much better but still ~2pp behind gzip on Silesia (34.4% vs 32.2%). Further ratio gains require encoding format work, not matcher tuning.
 - **GPU Huffman is a dead end** — Huffman coding requires bit-level alignment, but GPU throughput depends on byte-aligned memory access patterns. This is a fundamental architectural mismatch; do not attempt to port Huffman to GPU.
 - **GPU hash tables for LZ matching don't work** — GPU atomics don't preserve insertion order, so hash chains lose recency information. Match quality collapses to ~6% vs CPU's 99.6% on repetitive data. Tried twice (global atomics + shared-memory variant), both catastrophically failed. See `docs/design-docs/experiments.md`.
 - **SSE2 rANS decode is 32% slower than scalar** — scalar 4-lane decode gets good ILP from out-of-order execution. SSE2 extract operations serialize and lose that parallelism. Proper SIMD rANS would need merged slot-indexed tables and SSE4.1+. The dispatch is disabled; don't re-enable it.
@@ -71,6 +114,9 @@ Before optimizing GPU code paths, read this first — multiple agents have spent
 - **Iterative GPU algorithms have quadratic host overhead** — Repair grammar compression hit 0.4 MB/s due to 100+ rounds of buffer alloc + readback. Avoid per-round GPU↔CPU synchronization; prefer single-dispatch or persistent-buffer designs.
 - **Window-capped suffix sorts break BWT invertibility** — FWST produced 433% ratio (massive expansion). Full suffix sort is structurally required for LF-mapping; there's no shortcut.
 
+- **Streaming path is the CLI bottleneck, not the compressor** — Criterion measures 333–543 MB/s for raw algorithms, but CLI delivers 111–131 MB/s on the same data. The 3–4x gap is in `streaming::compress_stream`, not the encoder. Pipeline-level algorithmic speed differences (e.g., Lzfi vs LzSeqR) are largely invisible at the CLI level because streaming overhead dominates.
+- **LzSeqR parallel encode used to route to incompatible GPU rANS** — `run_compress_stage` in `stages.rs` sent LzSeqR entropy to `stage_rans_encode_webgpu` (GPU chunked payload format), while the single-block path used standard CPU rANS. The chunked format was incompatible with all decoders. Fixed in PR #120 by routing to CPU rANS. Don't re-enable GPU rANS for LzSeqR without fixing the wire format compatibility.
+
 For detailed history of all failed experiments, see `docs/design-docs/gpu-experiments-wave2-conclusions.md` and `docs/design-docs/experiments.md`.
 
 ## Key conventions
diff --git a/benches/throughput_lzfi.rs b/benches/throughput_lzfi.rs
index 0f0a4f3..f475bd6 100644
--- a/benches/throughput_lzfi.rs
+++ b/benches/throughput_lzfi.rs
@@ -8,11 +8,11 @@ use throughput_common::{run_throughput_benches, ThroughputBenchSpec};
 const SPEC: ThroughputBenchSpec = ThroughputBenchSpec {
     id: "lzfi",
     pipeline: Pipeline::Lzfi,
-    parallel: false,
-    large: false,
-    decompress_large: false,
-    webgpu: false,
-    webgpu_large: false,
+    parallel: true,
+    large: true,
+    decompress_large: true,
+    webgpu: true,
+    webgpu_large: true,
 };
 
 fn bench(c: &mut Criterion) {
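
Note on the "Architecture overview" hunk in CLAUDE.md: the token flow it describes (tokens handed to a pluggable encoder that splits them into several byte streams for a later entropy stage) might look roughly like the sketch below. Only the names `LzToken` and `TokenEncoder` come from the diff. The enum fields, the trait signature, and `ToyLzssEncoder` are assumptions made for illustration and are not taken from `src/lz_token.rs`. The toy encoder loosely follows the "flag-bit based, raw u16 offsets+lengths" description of the 4-stream LZSS layout, and the real stream order and bit packing may differ.

```rust
// Illustrative sketch only. `LzToken` and `TokenEncoder` are named in the diff;
// every field, signature, and the `ToyLzssEncoder` type are assumptions.

/// A universal LZ token: a literal byte or a back-reference.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LzToken {
    Literal(u8),
    Match { offset: u16, length: u16 },
}

/// Maps a token sequence onto independent byte streams.
trait TokenEncoder {
    fn stream_count(&self) -> usize;
    fn encode(&self, tokens: &[LzToken]) -> Vec<Vec<u8>>;
}

/// Hypothetical flag-bit encoder with four streams:
/// flags, literals, offsets, lengths.
struct ToyLzssEncoder;

impl TokenEncoder for ToyLzssEncoder {
    fn stream_count(&self) -> usize {
        4
    }

    fn encode(&self, tokens: &[LzToken]) -> Vec<Vec<u8>> {
        // One flag bit per token, LSB first: 0 = literal, 1 = match.
        let mut flags = vec![0u8; (tokens.len() + 7) / 8];
        let mut literals = Vec::new();
        let mut offsets = Vec::new();
        let mut lengths = Vec::new();

        for (i, token) in tokens.iter().enumerate() {
            match *token {
                LzToken::Literal(byte) => literals.push(byte),
                LzToken::Match { offset, length } => {
                    flags[i / 8] |= 1 << (i % 8);
                    // Raw little-endian u16 values, no log2 coding.
                    offsets.extend_from_slice(&offset.to_le_bytes());
                    lengths.extend_from_slice(&length.to_le_bytes());
                }
            }
        }
        vec![flags, literals, offsets, lengths]
    }
}

fn main() {
    // Hand-built tokens for "ab" followed by a 4-byte match at distance 2.
    let tokens = [
        LzToken::Literal(b'a'),
        LzToken::Literal(b'b'),
        LzToken::Match { offset: 2, length: 4 },
    ];
    let encoder = ToyLzssEncoder;
    let streams = encoder.encode(&tokens);
    assert_eq!(streams.len(), encoder.stream_count());
    for (i, stream) in streams.iter().enumerate() {
        println!("stream {i}: {stream:?}");
    }
}
```

Keeping each field in its own stream means a later entropy stage can model flags, literals, offsets, and lengths separately, which is presumably why the encoders in the diff are characterized by their stream counts.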