48 changes: 9 additions & 39 deletions ARCHITECTURE.md
@@ -5,13 +5,13 @@ For day-to-day development instructions, see `CLAUDE.md`.

## Completed milestones (12/12)
- **Algorithms:** LZ77 (brute, hashchain, lazy, parallel), LzSeq (code+extra-bits, repeat offsets, 128KB window), Huffman, BWT (SA-IS), MTF, RLE, FSE, rANS
- **Pipelines:** Deflate (LZ77+Huffman), Bw (BWT+MTF+RLE+FSE), Lzr (LZ77+rANS), Lzf (LZ77+FSE), LzSeqR (LzSeq+rANS), LzSeqH (LzSeq+Huffman), SortLz (sort-LZ77+FSE) — Deflate, Lzr, and Lzf use multi-stream entropy coding for ~16-18% better compression; LzSeqR/LzSeqH use zstd-style code+extra-bits encoding with 6-stream demux; SortLz uses sort-based match finding (GPU-accelerated)
- **Pipelines:** Bw (BWT+MTF+RLE+FSE), Lzf (LzSeq+FSE), LzSeqR (LzSeq+rANS), LzSeqH (LzSeq+Huffman), SortLz (sort-LZ77+FSE) — Lzf and LzSeqR/LzSeqH use multi-stream entropy coding for ~16-18% better compression; LzSeqR/LzSeqH use zstd-style code+extra-bits encoding with 6-stream demux; SortLz uses sort-based match finding (GPU-accelerated)
- **Auto-selection:** Heuristic (`select_pipeline`) and trial-based (`select_pipeline_trial`) pipeline selection using data analysis (entropy, match density, run ratio, autocorrelation); LzSeqR included in trial candidates
- **Data analysis:** `src/analysis.rs` — statistical profiling (Shannon entropy, autocorrelation, run ratio, match density, distribution shape) with sampling support
- **Optimal parsing:** GPU top-K match table → CPU backward DP (4-6% better compression)
- **Multi-threading:** Block-parallel and pipeline-parallel via V2 container format; within-block parallel LZ77 match finding (`compress_lazy_parallel`)
- **SortLZ:** Sort-based match finder — standalone pipeline (ID 10) and pluggable `MatchFinder::SortLz` for Deflate/Lzr/Lzf/LzSeqR/LzSeqH; GPU radix sort batched (single submit); adaptive `select_match_finder()` heuristic; u64-optimized `extend_match`; 39.6% ratio (beats Deflate 43.4%)
- **GPU kernels:** LZ77 hash-table (fast), LZ77 batch/per-position (legacy), LZ77 top-K, BWT radix sort + parallel rank assignment, SortLZ radix sort + match verification, Huffman encode (two-pass with Blelloch prefix sum), GPU Deflate chaining (LZ77→Huffman on device)
- **SortLZ:** Sort-based match finder — standalone pipeline (ID 10) and pluggable `MatchFinder::SortLz` for Lzf/LzSeqR/LzSeqH; GPU radix sort batched (single submit); adaptive `select_match_finder()` heuristic; u64-optimized `extend_match`; 39.6% ratio
- **GPU kernels:** LZ77 hash-table (fast), LZ77 batch/per-position (legacy), LZ77 top-K, BWT radix sort + parallel rank assignment, SortLZ radix sort + match verification, Huffman encode (two-pass with Blelloch prefix sum)
- **Tooling:** CLI (`pz` with `-a`/`--auto` and `--trial` flags), C FFI, Criterion benchmarks, CI (3 OS)
- **Fuzz testing (M5.3):** `cargo-fuzz` infrastructure with 12 targets covering all algorithms and pipelines (roundtrip + crash resistance)

@@ -23,9 +23,9 @@ For day-to-day development instructions, see `CLAUDE.md`.
- **CPU:** Uses SA-IS (Suffix Array by Induced Sorting) — O(n) linear time via doubled-text-with-sentinel strategy.
- **GPU:** Uses LSB-first 8-bit radix sort with prefix-doubling for suffix array construction. Replaced earlier bitonic sort (PR #21). Features adaptive key width (skip zero-digit radix passes) and event chain batching (one host sync per doubling step). Rank assignment runs on GPU via Blelloch prefix sum + scatter. Still slower than CPU SA-IS at all sizes but dramatically improved from bitonic sort (7-14x faster). The GPU uses circular comparison `(sa[i]+k) % n` vs CPU SA-IS's doubled-text approach — both produce valid BWTs that round-trip correctly.

## Multi-stream Deflate
## Multi-stream entropy coding

The Deflate, Lzr, and Lzf pipelines use **multi-stream entropy coding** to improve
The Lzf and LzSeqR pipelines use **multi-stream entropy coding** to improve
compression ratio by separating LZ77 output into independent byte streams with
tighter symbol distributions. Instead of feeding one mixed stream to the entropy
coder, the encoder deinterleaves tokens into three streams:
@@ -36,7 +36,7 @@ coder, the encoder deinterleaves tokens into three streams:
| **Lengths** | Match lengths (capped to u8) | Length distribution is highly skewed (short matches dominate) |
| **Literals** | Literal bytes + low offset bytes + next bytes | Natural-language / binary byte distribution |

Each stream gets its own Huffman tree (Deflate), FSE table (Lzf), or rANS context (Lzr),
Each stream gets its own FSE table (Lzf) or rANS context (LzSeqR),
yielding lower per-stream entropy than a single combined stream.
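
As an illustration of the stream split described above, here is a minimal sketch. The `Token` layout, the `Streams` struct, and the `demux`/`mux` names are hypothetical and are not taken from the `pz` source. Placing the high offset byte in the offsets stream is an assumption, since the Offsets row is not visible in this hunk.

```rust
/// Hypothetical LZ77 token, used only for this illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Token {
    pub offset: u16, // distance back to the match (0 = no match)
    pub length: u8,  // match length, capped to u8
    pub next: u8,    // literal byte that follows the match
}

/// The three independent byte streams handed to the entropy coder.
#[derive(Debug, Default, PartialEq)]
pub struct Streams {
    pub offsets: Vec<u8>,  // assumed: high offset bytes
    pub lengths: Vec<u8>,  // match lengths
    pub literals: Vec<u8>, // low offset byte followed by the next byte
}

/// Deinterleave tokens so each stream has a tighter symbol distribution.
pub fn demux(tokens: &[Token]) -> Streams {
    let mut s = Streams::default();
    for t in tokens {
        s.offsets.push((t.offset >> 8) as u8);
        s.lengths.push(t.length);
        s.literals.push((t.offset & 0xFF) as u8);
        s.literals.push(t.next);
    }
    s
}

/// Inverse of `demux`: rebuild the token sequence from the three streams.
pub fn mux(s: &Streams) -> Vec<Token> {
    s.offsets
        .iter()
        .zip(&s.lengths)
        .zip(s.literals.chunks_exact(2))
        .map(|((&hi, &len), lo_next)| Token {
            offset: (u16::from(hi) << 8) | u16::from(lo_next[0]),
            length: len,
            next: lo_next[1],
        })
        .collect()
}
```

Each of the three vectors would then be passed to its own FSE table or rANS context. The round trip through `mux` only shows that the split loses no information.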

### Encoding format
@@ -65,14 +65,12 @@ Comparison on Canterbury + Large corpus (14 files, 13.3 MB total), averaged over

| Pipeline | Before (bytes) | After (bytes) | Size delta | Throughput delta |
|----------|---------------|--------------|------------|-----------------|
| Deflate | 6,319,168 | 5,301,184 | **-16.1%** | +1.6% faster |
| Lzf | 6,199,044 | 5,107,601 | **-17.6%** | +2.8% faster |

**Decompression throughput:**

| Pipeline | Throughput delta |
|----------|-----------------|
| Deflate | **+8.4%** faster |
| Lzf | **+2.4%** faster |

Multi-stream is a pure win: better compression **and** faster speed. The largest
@@ -132,16 +130,9 @@ Every symbol present in the input gets at least frequency 1. Excess is trimmed
from the most-frequent symbol; deficit is added to it. The normalization code
is shared conceptually with `src/fse.rs` (both operate on power-of-2 tables).
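
The normalization rule above can be sketched as follows. This is a simplified illustration and not the code in `src/rans.rs` or `src/fse.rs`. The function name `normalize_freqs`, the `table_log` parameter, and the `Option` return for the case where the most frequent symbol cannot absorb the excess are all assumptions.

```rust
/// Scale raw byte counts so they sum to exactly `1 << table_log`.
/// Returns `None` for empty input or when the excess cannot be trimmed
/// from the most frequent symbol without dropping it below 1.
pub fn normalize_freqs(counts: &[u32; 256], table_log: u32) -> Option<[u32; 256]> {
    let target: u64 = 1u64 << table_log;
    let total: u64 = counts.iter().map(|&c| u64::from(c)).sum();
    if total == 0 {
        return None;
    }

    let mut norm = [0u32; 256];
    let mut sum: u64 = 0;
    let mut max_idx = 0usize;
    for (i, &c) in counts.iter().enumerate() {
        if c == 0 {
            continue;
        }
        // Proportional scaling, but every present symbol keeps frequency >= 1.
        let scaled = (u64::from(c) * target / total).max(1);
        norm[i] = scaled as u32;
        sum += scaled;
        if c > counts[max_idx] {
            max_idx = i;
        }
    }

    if sum > target {
        // Excess is trimmed from the most frequent symbol.
        let excess = (sum - target) as u32;
        if excess >= norm[max_idx] {
            return None;
        }
        norm[max_idx] -= excess;
    } else {
        // Deficit is added to the most frequent symbol.
        norm[max_idx] += (target - sum) as u32;
    }
    Some(norm)
}
```

Absorbing all rounding error in a single symbol keeps the pass simple. The most frequent symbol is the one whose code length changes least when its frequency shifts by a few units.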

### Pipeline: Lzr (LZ77 + rANS)

`Pipeline::Lzr` (ID 3) reuses the existing multi-stream LZ77 architecture
(offsets, lengths, literals) with rANS as the entropy coder instead of Huffman
(Deflate) or FSE (Lzf). It participates in auto-selection via
`select_pipeline_trial()`.

### Forward TODOs

See `docs/exec-plans/tech-debt-tracker.md` for rANS SIMD decode and reciprocal multiplication work items. Benchmark integration (rANS/Lzr in `benches/throughput.rs` and `benches/stages.rs`) is also pending.
See `docs/exec-plans/tech-debt-tracker.md` for rANS SIMD decode and reciprocal multiplication work items.

## SIMD acceleration
`src/simd.rs` provides runtime-dispatched SIMD for CPU hot paths:
@@ -170,8 +161,7 @@ match verification. Zero atomics, fully deterministic — ideal for GPU executio
| **`MatchFinder::SortLz`** | Pluggable match finder for other pipelines | Host pipeline's format |

When used as a `MatchFinder`, SortLZ is transparent to the wire format — the
output is 100% compatible with the host pipeline (Deflate, Lzr, Lzf, LzSeqR,
LzSeqH). The consumer and decompressor see no difference.
output is 100% compatible with the host pipeline (Lzf, LzSeqR, LzSeqH). The consumer and decompressor see no difference.

### Pipeline::SortLz wire format (per block)

@@ -221,19 +211,7 @@ Uses GPU radix sort (same kernels as BWT) + GPU match verification:
| 256KB| 131 MB/s | 31 MB/s | 53 MB/s | **1.7x faster** |
| 4MB | 142 MB/s | 8 MB/s | 89 MB/s | **10.6x faster** |

SortLZ compression ratio: **39.6%** (vs hashchain+Deflate 43.4%, BWT 32.7%).

## GPU stage chaining
The Deflate GPU path chains LZ77 → Huffman on the GPU with minimized transfers:
1. GPU: LZ77 hash-table kernel → download match array → CPU dedupe + serialize
2. GPU: upload LZ77 bytes once → `ByteHistogram` kernel → download only 256×u32 (1KB)
3. CPU: build Huffman tree from histogram, produce code LUT
4. GPU: Huffman encode (reusing LZ77 buffer) with Blelloch prefix sum
5. GPU: download final encoded bitstream

The `ByteHistogram` kernel eliminates the need to scan LZ77 data on CPU for frequency counting — only 1KB of histogram data is transferred instead of the full LZ77 stream.

This is activated automatically when a GPU backend is selected and input ≥ `MIN_GPU_INPUT_SIZE`.
SortLZ compression ratio: **39.6%** (vs BWT 32.7%).

## Parallel LZ77
`compress_lazy_parallel(input, num_threads)` pre-computes matches in parallel (each thread builds its own hash chain), then serializes sequentially with lazy evaluation. Thresholds:
@@ -262,14 +240,6 @@ This is activated automatically when a GPU backend is selected and input ≥ `MI

GPU Huffman with Blelloch prefix sum crosses over ~128KB. At 256KB the GPU scan path is 3.4x faster than CPU.

### Deflate chained (GPU LZ77 → GPU Huffman)

| Size | CPU 1-thread | GPU chained | Speedup |
|------|-------------|-------------|---------|
| 64KB | 1.63ms (38 MiB/s) | 3.01ms (21 MiB/s) | CPU 1.8x faster |
| 256KB | 6.06ms (41 MiB/s) | 4.93ms (51 MiB/s) | **GPU 1.2x faster** |
| 1MB | 23.5ms (43 MiB/s) | 18.3ms (55 MiB/s) | **GPU 1.3x faster** |

### BWT GPU (radix sort)

| Size | GPU radix | Throughput | Old bitonic | Speedup vs bitonic |
8 changes: 0 additions & 8 deletions Cargo.toml
@@ -21,10 +21,6 @@ path = "src/bin/pz.rs"
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[[bench]]
name = "throughput_deflate"
harness = false

[[bench]]
name = "throughput_bw"
harness = false
@@ -101,10 +97,6 @@ harness = false
name = "stages_auto_select"
harness = false

[[bench]]
name = "stages_deflate_webgpu_chained"
harness = false

[[bench]]
name = "rans_decode_bench"
harness = false
5 changes: 2 additions & 3 deletions README.md
@@ -6,12 +6,11 @@ Lossless data compression library with GPU acceleration, written in Rust.

| Pipeline | Stages | Similar to |
|----------|--------|------------|
| **Deflate** | LZ77 + Multi-stream Huffman | gzip |
| **BW** | BWT + MTF + RLE + FSE | bzip2 |
| **LZR** | LZ77 + Multi-stream rANS | — |
| **LZF** | LZ77 + Multi-stream FSE | zstd-like |

Deflate, Lzr, and Lzf use **multi-stream entropy coding**: LZ77 output is split into separate offset, length, and literal streams, each with its own entropy coder. This yields ~16-18% better compression than single-stream encoding with no speed penalty. LZR uses rANS (range ANS) — a multiply-shift entropy coder designed for SIMD and GPU parallelism via interleaved decode states. LZF uses FSE (Finite State Entropy) — a fast table-driven tANS coder similar to zstd.
Lzr and Lzf use **multi-stream entropy coding**: LZ77 output is split into separate offset, length, and literal streams, each with its own entropy coder. This yields ~16-18% better compression than single-stream encoding with no speed penalty. LZR uses rANS (range ANS) — a multiply-shift entropy coder designed for SIMD and GPU parallelism via interleaved decode states. LZF uses FSE (Finite State Entropy) — a fast table-driven tANS coder similar to zstd.

Optional WebGPU support for GPU-accelerated LZ77 match finding and BWT suffix array construction.

@@ -77,7 +76,7 @@ Uses [criterion](https://github.com/bheisler/criterion.rs) for statistical bench
```
cargo bench # CPU benchmarks
cargo bench --features webgpu # includes GPU benchmarks
cargo bench --bench throughput_deflate
cargo bench --bench throughput_lzf
cargo bench --bench stages_lz77
```

70 changes: 0 additions & 70 deletions benches/stages_deflate_webgpu_chained.rs

This file was deleted.

12 changes: 4 additions & 8 deletions benches/stages_lz77.rs
@@ -91,7 +91,7 @@ fn bench_lz77_webgpu_batched(c: &mut Criterion) {

let eng = engine.clone();
group.bench_with_input(
BenchmarkId::new("gpu_batched_deflate", size),
BenchmarkId::new("gpu_batched_lzf", size),
&data,
move |b, data| {
let opts = CompressOptions {
@@ -100,23 +100,19 @@ fn bench_lz77_webgpu_batched(c: &mut Criterion) {
threads: 4,
..Default::default()
};
b.iter(|| {
pz::pipeline::compress_with_options(data, Pipeline::Deflate, &opts).unwrap()
});
b.iter(|| pz::pipeline::compress_with_options(data, Pipeline::Lzf, &opts).unwrap());
},
);

group.bench_with_input(
BenchmarkId::new("cpu_parallel_deflate", size),
BenchmarkId::new("cpu_parallel_lzf", size),
&data,
|b, data| {
let opts = CompressOptions {
threads: 4,
..Default::default()
};
b.iter(|| {
pz::pipeline::compress_with_options(data, Pipeline::Deflate, &opts).unwrap()
});
b.iter(|| pz::pipeline::compress_with_options(data, Pipeline::Lzf, &opts).unwrap());
},
);
}
8 changes: 2 additions & 6 deletions benches/stages_match_finders.rs
@@ -203,19 +203,15 @@ fn bench_gpu_match_finding(c: &mut Criterion) {
}
group.finish();

// --- Part 3: Cross-pipeline GPU sortlz (Deflate, LzSeqR, Lzf) at 256K ---
// --- Part 3: Cross-pipeline GPU sortlz (LzSeqR, Lzf) at 256K ---
let mut group = c.benchmark_group("gpu_sortlz_pipelines");
cap(&mut group);

let size = 262_144;
let data = get_test_data(size);
group.throughput(Throughput::Bytes(size as u64));

for (name, pipeline) in [
("deflate", Pipeline::Deflate),
("lzseqr", Pipeline::LzSeqR),
("lzf", Pipeline::Lzf),
] {
for (name, pipeline) in [("lzseqr", Pipeline::LzSeqR), ("lzf", Pipeline::Lzf)] {
// CPU hashchain baseline
let opts = CompressOptions {
parse_strategy: ParseStrategy::Lazy,
23 changes: 0 additions & 23 deletions benches/throughput_deflate.rs

This file was deleted.

8 changes: 4 additions & 4 deletions docs/DESIGN.md
@@ -19,15 +19,15 @@ Each algorithm (`bwt`, `huffman`, `lz77`, `fse`, `rans`, etc.) is:
- Usable standalone or in pipelines
- Not tied to a specific compression format

Pipelines (`deflate`, `bw`, `lzr`, `lzf`) combine algorithms via the demuxer pattern:
Pipelines (`bw`, `lzf`, `lzseqr`) combine algorithms via the demuxer pattern:
```rust
// Pipeline = sequence of stages
LZ77 → demux into streams → Huffman encode → merge streams
LZ77 → demux into streams → FSE/rANS encode → merge streams
```

**Why:** Flexibility. New pipelines reuse existing algorithms. New algorithms enhance all pipelines.

**Example:** The same LZ77 implementation works in Deflate (LZ77+Huffman), Lzf (LZ77+FSE), and Lzr (LZ77+rANS) pipelines.
**Example:** The same LZ77 implementation works in Lzf (LZ77+FSE) and Lzr (LZ77+rANS) pipelines.

### 2. GPU-Friendly Design Patterns

@@ -210,7 +210,7 @@ Below these thresholds, kernel launch overhead dominates.

### Multi-Stream Entropy Coding

LZ-based pipelines (Deflate, Lzf, Lzr) use multi-stream encoding (3 independent streams per block). See `ARCHITECTURE.md` for benchmark results (16-18% better compression, 2-8% faster decompression) and `design-docs/pipeline-architecture.md` for the stream layout.
LZ-based pipelines (Lzf, Lzr) use multi-stream encoding (3 independent streams per block). See `ARCHITECTURE.md` for benchmark results (16-18% better compression, 2-8% faster decompression) and `design-docs/pipeline-architecture.md` for the stream layout.

### Memory Management

2 changes: 1 addition & 1 deletion examples/block_size_experiment.rs
@@ -35,7 +35,7 @@ const BLOCK_SIZES: &[usize] = &[
];

/// Pipelines to test (LZ77-based, GPU-eligible).
const PIPELINES: &[Pipeline] = &[Pipeline::Deflate, Pipeline::Lzf];
const PIPELINES: &[Pipeline] = &[Pipeline::Lzf];

fn load_data(size: usize) -> Vec<u8> {
let manifest = Path::new(env!("CARGO_MANIFEST_DIR"));
6 changes: 1 addition & 5 deletions examples/decode_profile.rs
@@ -36,11 +36,7 @@ fn main() {

let iters = 20;

for &(name, pipeline) in &[
("LzSeqR", Pipeline::LzSeqR),
("LzSeqH", Pipeline::LzSeqH),
("Deflate", Pipeline::Deflate),
] {
for &(name, pipeline) in &[("LzSeqR", Pipeline::LzSeqR), ("LzSeqH", Pipeline::LzSeqH)] {
let options = CompressOptions::default();
let compressed = pipeline::compress_with_options(&data, pipeline, &options).unwrap();
let ratio = compressed.len() as f64 / data.len() as f64 * 100.0;
4 changes: 0 additions & 4 deletions examples/explore_pipelines.rs
@@ -307,16 +307,13 @@ fn main() {
#[allow(unused_mut)]
let mut pz_configs: Vec<(Pipeline, ParseStrategy, usize, Backend)> = vec![
// Single-threaded CPU variants
(Pipeline::Deflate, ParseStrategy::Lazy, 1, Backend::Cpu),
(Pipeline::Deflate, ParseStrategy::Optimal, 1, Backend::Cpu),
(Pipeline::Lzf, ParseStrategy::Lazy, 1, Backend::Cpu),
(Pipeline::Lzf, ParseStrategy::Optimal, 1, Backend::Cpu),
(Pipeline::Bw, ParseStrategy::Auto, 1, Backend::Cpu),
(Pipeline::Bbw, ParseStrategy::Auto, 1, Backend::Cpu),
// Experimental: LZSS pipeline
(Pipeline::LzssR, ParseStrategy::Auto, 1, Backend::Cpu),
// Multi-threaded CPU
(Pipeline::Deflate, ParseStrategy::Lazy, 0, Backend::Cpu),
(Pipeline::Lzf, ParseStrategy::Lazy, 0, Backend::Cpu),
(Pipeline::Bw, ParseStrategy::Auto, 0, Backend::Cpu),
(Pipeline::Bbw, ParseStrategy::Auto, 0, Backend::Cpu),
@@ -327,7 +324,6 @@ fn main() {
#[cfg(feature = "webgpu")]
if has_webgpu {
pz_configs.extend([
(Pipeline::Deflate, ParseStrategy::Auto, 1, Backend::WebGpu),
(Pipeline::Lzf, ParseStrategy::Auto, 1, Backend::WebGpu),
(Pipeline::Bw, ParseStrategy::Auto, 1, Backend::WebGpu),
]);
1 change: 0 additions & 1 deletion examples/pipeline_comparison.rs
@@ -17,7 +17,6 @@ fn test_pipeline_comparison(name: &str, data: Vec<u8>) {
let pipelines = vec![
("Lzf (LzSeq+FSE)", Pipeline::Lzf),
("LzSeqR (LzSeq+rANS)", Pipeline::LzSeqR),
("Deflate (LZ77+Huffman)", Pipeline::Deflate),
];

println!(