Merged
52 changes: 49 additions & 3 deletions CLAUDE.md
@@ -38,14 +38,57 @@ Specialized agents in `.claude/agents/` run on cheaper models and keep verbose o
- **tooling** — build scripts and workflow automation (Sonnet; consumes `.claude/friction/`)
- **maintainer** — review feedback backlog, update CLAUDE.md, delegate improvements (Opus; consumes `.claude/feedback/`)

## Architecture overview

All LZ-based pipelines share a unified token architecture (since PR #118):

```
input → tokenize() → Vec<LzToken> → TokenEncoder::encode() → multi-stream → entropy_encode()
```
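
The flow above can be pictured with a minimal sketch. Everything here is an assumption for illustration: the variant fields, the `encode` signature, and `ThreeStreamEncoder` are hypothetical and are not taken from `src/lz_token.rs`. It only shows the idea of one token type feeding a pluggable encoder that fans tokens out into separate byte streams before entropy coding.

```rust
// Hypothetical sketch: names and layouts are illustrative, not the crate's real API.

/// One LZ parse result: a raw byte or a back-reference.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LzToken {
    Literal(u8),
    Match { offset: u32, length: u32 },
}

/// Maps a token sequence onto independent byte streams.
pub trait TokenEncoder {
    fn encode(&self, tokens: &[LzToken]) -> Vec<Vec<u8>>;
}

/// Toy encoder with three streams:
/// stream 0 = one flag byte per token (0 = literal, 1 = match),
/// stream 1 = literal bytes,
/// stream 2 = little-endian (offset, length) pairs.
pub struct ThreeStreamEncoder;

impl TokenEncoder for ThreeStreamEncoder {
    fn encode(&self, tokens: &[LzToken]) -> Vec<Vec<u8>> {
        let mut flags = Vec::with_capacity(tokens.len());
        let mut literals = Vec::new();
        let mut matches = Vec::new();
        for token in tokens {
            match *token {
                LzToken::Literal(byte) => {
                    flags.push(0);
                    literals.push(byte);
                }
                LzToken::Match { offset, length } => {
                    flags.push(1);
                    matches.extend_from_slice(&offset.to_le_bytes());
                    matches.extend_from_slice(&length.to_le_bytes());
                }
            }
        }
        vec![flags, literals, matches]
    }
}
```

Separating the streams lets each one be entropy-coded with statistics of its own, which is the usual motivation for a multi-stream layout.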

**Three pluggable wire encoders** select how tokens map to byte streams:
- **LzSeqEncoder** (6 streams) — log2-coded offsets/lengths with repeat-offset tracking. Used by Lzf, LzSeqR, LzSeqH, SortLz. Best ratio.
- **LzssEncoder** (4 streams) — flag-bit based, raw u16 offsets+lengths. Used by Lzfi, LzssR. Faster decode, worse ratio.
- **Lz77Encoder** (3 streams) — DEFLATE-style. Legacy, no active pipeline uses it.
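
As a rough illustration of what "log2-coded" means for offsets and lengths, the sketch below splits a nonzero value into a small bucket code plus extra bits. This is a generic scheme under stated assumptions; `log2_split` and `log2_join` are hypothetical helpers, and the actual code tables used by LzSeqEncoder are not shown in this diff.

```rust
// Hypothetical helpers: a generic log2 bucket split, not LzSeqEncoder's real tables.

/// Splits a nonzero value into (bucket, extra), where
/// bucket = floor(log2(value)) and extra = value - 2^bucket.
/// The bucket is a small symbol that entropy-codes well; `extra`
/// would be stored raw in exactly `bucket` bits.
pub fn log2_split(value: u32) -> (u8, u32) {
    assert!(value > 0, "log2 coding is defined for nonzero values only");
    let bucket = 31 - value.leading_zeros(); // floor(log2(value))
    let extra = value - (1u32 << bucket);
    (bucket as u8, extra)
}

/// Inverse of `log2_split`; `bucket` must be at most 31.
pub fn log2_join(bucket: u8, extra: u32) -> u32 {
    (1u32 << bucket) + extra
}
```

For example, an offset of 1000 falls in bucket 9 (512..=1023) with 488 as its 9 extra bits, so nearby offsets share one entropy-coded symbol.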

**Active pipelines:**

| Pipeline | Encoder | Entropy | Notes |
|----------|---------|---------|-------|
| **Lzf** (default) | LzSeq | FSE | General-purpose, good ratio |
| **LzSeqR** | LzSeq | rANS | Fastest overall, ratio matches Lzf |
| **LzSeqH** | LzSeq | Huffman | Fast decode |
| **Lzfi** | LZSS | interleaved FSE | Fastest algorithm but poor ratio (46% vs 34%) |
| **LzssR** | LZSS | rANS | Dominated by Lzfi, removal candidate |
| **SortLz** | LzSeq (internal) | FSE | Deterministic GPU radix-sort matching |
| **Bw** / **Bbw** | — | FSE | BWT-based, no LZ tokens |

**Removed pipelines:** Deflate (#117), Lzr (#118), Lz78R (#116), Parlz (ratio loss)

**CLI path:** `pz` always uses `streaming::compress_stream`, not `pipeline::compress_with_options`. The streaming path uses block-by-block parallelism with bounded memory.

### Silesia corpus benchmarks (211MB, CLI end-to-end, verified round-trip)

| Method | Compress | Ratio | Comp MB/s | Decomp MB/s |
|---------|----------|-------|-----------|-------------|
| gzip | 4.69s | 32.2% | 45 | 225 |
| pz lzf | 1.90s | 34.6% | 111 | 133 |
| pz lzseqr | 1.62s | 34.4% | 131 | 154 |
| pz lzfi | 1.70s | 46.0% | 125 | 143 |

pz compresses 2.5–7x faster than gzip with ~2pp ratio gap. Decompress is faster per-file but slower on many small files due to ~90ms per-invocation startup overhead. Criterion (pure algorithm, no I/O) measures 333–543 MB/s — the 3–4x CLI gap is in the streaming path, not the compressor.
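
The throughput and speedup figures can be recomputed from the table's Compress column. The helpers below are a hypothetical sketch, assuming the corpus size is taken as a round 211 MB; small differences from the MB/s column are expected from rounding of the corpus size.

```rust
// Hypothetical helpers for recomputing table figures; not part of the pz codebase.

/// Throughput in MB/s for `size_mb` megabytes processed in `seconds`.
pub fn throughput_mb_per_s(size_mb: f64, seconds: f64) -> f64 {
    size_mb / seconds
}

/// Wall-clock speedup of a candidate relative to a baseline.
pub fn speedup(baseline_seconds: f64, candidate_seconds: f64) -> f64 {
    baseline_seconds / candidate_seconds
}
```

With the table rows as inputs, `throughput_mb_per_s(211.0, 1.90)` gives about 111 MB/s for `pz lzf`, and `speedup(4.69, 1.62)` gives about 2.9 for `pz lzseqr` against gzip.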

**Benchmark corpus:** `./scripts/fetch-silesia.sh` downloads the 211MB Silesia corpus to `samples/silesia/`.

## Project layout

- `src/lib.rs` — crate root, `PzError`/`PzResult` types
- `src/{algorithm}.rs` — one file per composable algorithm (bwt, crc32, deflate, fse, huffman, lz77, lz78, lzseq, lzss, mtf, rans, rle)
- `src/lz_token.rs` — universal `LzToken` type, `TokenEncoder` trait, three encoder implementations
- `src/{algorithm}.rs` — one file per composable algorithm (bwt, crc32, fse, huffman, lz77, lzseq, lzss, mtf, rans, rle)
- `src/analysis.rs` — data profiling (entropy, match density, run ratio, autocorrelation)
- `src/optimal.rs` — optimal parsing (GPU top-K + backward DP)
- `src/simd.rs` — SIMD decode paths for rANS
- `src/streaming.rs` — streaming compression interface
- `src/streaming.rs` — streaming compression interface (CLI entry point)
- `src/ffi.rs` — C FFI bindings
- `src/pipeline/` — multi-stage compression pipelines, auto-selection, block parallelism, demux
- `src/bin/pz.rs` — CLI binary (`pz` with `-a`/`--auto` and `--trial` flags)
@@ -63,14 +106,17 @@ Before optimizing GPU code paths, read this first — multiple agents have spent
- **The CLI uses `streaming::compress_stream`, not `pipeline::compress_with_options`** — the parallel scheduler's GPU coordinator is not invoked by the CLI. The streaming path deliberately uses CPU rANS for entropy.
- **The real GPU win (ring-buffered LZ77 batching) is already shipped** — delivers +7-17% throughput. See `docs/design-docs/gpu-strategy.md`.
- **GPU device init time skews throughput benchmarks** — first-call GPU init adds significant overhead that `bench.sh` captures but Criterion amortizes across iterations. When comparing GPU vs CPU throughput, use Criterion (`cargo bench`) for apples-to-apples; `bench.sh` reflects real-world cold-start cost. Don't chase "GPU is slower" regressions that are really just init time.
- **Compression ratio is limited by 5-byte match encoding, not match quality** — the LZ match-finder finds good matches, but the 5-byte serialized match format creates overhead on short matches. Improving ratio means fixing the encoding format, not tuning the matcher.
- **Compression ratio is limited by wire encoding overhead, not match quality** — the LZ match-finder finds good matches. The legacy Lz77Encoder (5-byte per match) was the worst offender; LzSeqEncoder (log2-coded, 6 streams) is much better but still ~2pp behind gzip on Silesia (34.4% vs 32.2%). Further ratio gains require encoding format work, not matcher tuning.
- **GPU Huffman is a dead end** — Huffman coding requires bit-level alignment, but GPU throughput depends on byte-aligned memory access patterns. This is a fundamental architectural mismatch; do not attempt to port Huffman to GPU.
- **GPU hash tables for LZ matching don't work** — GPU atomics don't preserve insertion order, so hash chains lose recency information. Match quality collapses to ~6% vs CPU's 99.6% on repetitive data. Tried twice (global atomics + shared-memory variant), both catastrophically failed. See `docs/design-docs/experiments.md`.
- **SSE2 rANS decode is 32% slower than scalar** — scalar 4-lane decode gets good ILP from out-of-order execution. SSE2 extract operations serialize and lose that parallelism. Proper SIMD rANS would need merged slot-indexed tables and SSE4.1+. The dispatch is disabled; don't re-enable it.
- **Fully parallel GPU LZ parsing (ParlZ) has unacceptable ratio loss** — 37.6% compression gap vs serial parsing. Forward-max-propagation conflict resolution is too aggressive. Hybrid GPU match-finding + CPU serial parsing is the correct architecture.
- **Iterative GPU algorithms have quadratic host overhead** — Repair grammar compression hit 0.4 MB/s due to 100+ rounds of buffer alloc + readback. Avoid per-round GPU↔CPU synchronization; prefer single-dispatch or persistent-buffer designs.
- **Window-capped suffix sorts break BWT invertibility** — FWST produced 433% ratio (massive expansion). Full suffix sort is structurally required for LF-mapping; there's no shortcut.

- **Streaming path is the CLI bottleneck, not the compressor** — Criterion measures 333–543 MB/s for raw algorithms, but CLI delivers 111–131 MB/s on the same data. The 3–4x gap is in `streaming::compress_stream`, not the encoder. Pipeline-level algorithmic speed differences (e.g., Lzfi vs LzSeqR) are largely invisible at the CLI level because streaming overhead dominates.
- **LzSeqR parallel encode used to route to incompatible GPU rANS** — `run_compress_stage` in `stages.rs` sent LzSeqR entropy to `stage_rans_encode_webgpu` (GPU chunked payload format), while the single-block path used standard CPU rANS. The chunked format was incompatible with all decoders. Fixed in PR #120 by routing to CPU rANS. Don't re-enable GPU rANS for LzSeqR without fixing the wire format compatibility.

For detailed history of all failed experiments, see `docs/design-docs/gpu-experiments-wave2-conclusions.md` and `docs/design-docs/experiments.md`.
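
The point about wire encoding overhead can be made concrete with a small worked calculation. This is a simplified model, assuming each literal costs 1 byte and each match costs a fixed number of bytes before entropy coding; `match_savings` is a hypothetical helper, not a function in the crate.

```rust
// Hypothetical model: fixed per-match cost, 1 byte per literal, before entropy coding.

/// Net bytes saved by emitting one match instead of `match_len` literals.
/// A negative result means the match expands the output.
pub fn match_savings(match_len: u32, match_cost_bytes: u32) -> i64 {
    i64::from(match_len) - i64::from(match_cost_bytes)
}
```

Under this model a 5-byte match format loses 2 bytes on a length-3 match, breaks even at length 5, and only pays off on longer matches, which is why a better matcher alone cannot close the gap on short matches.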

## Key conventions
10 changes: 5 additions & 5 deletions benches/throughput_lzfi.rs
@@ -8,11 +8,11 @@ use throughput_common::{run_throughput_benches, ThroughputBenchSpec};
const SPEC: ThroughputBenchSpec = ThroughputBenchSpec {
id: "lzfi",
pipeline: Pipeline::Lzfi,
parallel: false,
large: false,
decompress_large: false,
webgpu: false,
webgpu_large: false,
parallel: true,
large: true,
decompress_large: true,
webgpu: true,
webgpu_large: true,
};

fn bench(c: &mut Criterion) {