Merged
75 changes: 47 additions & 28 deletions ARCHITECTURE.md
@@ -283,52 +283,71 @@ GPU Huffman with Blelloch prefix sum crosses over ~128KB. At 256KB the GPU scan

GPU BWT radix sort is 7-14x faster than the old bitonic sort. Still slower than CPU SA-IS at small sizes but becoming competitive at 64KB+ (CPU SA-IS ~1ms at 64KB vs GPU 4.1ms). The gap narrows at larger sizes where GPU parallelism helps more.

## GPU/CPU strategy (settled)

The optimal split for libpz is **GPU for LZ77 match-finding, CPU for entropy coding**,
overlapped via the unified scheduler with ring-buffered streaming.

**Why GPU wins on LZ77:** Match-finding is embarrassingly parallel — each position's
search is independent. The cooperative-stitch kernel does 1,788 probes/position and
is 2x faster than CPU at 256KB+. Ring-buffered batching (`find_matches_batched`)
adds 7-17% throughput by amortizing buffer allocation and overlapping GPU compute
with CPU readback.

**Why CPU wins on entropy:** rANS/FSE/Huffman are serial state machines — each
symbol depends on the previous state. GPU entropy has been tried extensively
(500+ iterations: single-stream, independent blocks, Recoil checkpoints, batched
cross-block) and is 0.77x CPU on encode, 0.54x on decode. The serial dependency
limits GPU to ~300 useful threads when saturation needs ~8K-16K. PCIe transfer
overhead dominates at typical block sizes (128KB-256KB).
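
The serial dependency is easiest to see in a toy rANS coder. The sketch below is not libpz's implementation: it assumes a fixed 4-symbol frequency table, a single `u64` state, and no renormalization, so it is only valid for short inputs. All names are hypothetical. Each `encode_step` consumes the state produced by the previous one, which is why a single stream cannot simply be split across thousands of threads.

```rust
/// Toy rANS coder (hypothetical, not libpz code): fixed frequencies, one u64
/// state, no renormalization, so it is only valid for short inputs.
const FREQ: [u64; 4] = [8, 4, 2, 2]; // symbol frequencies, sum = TOTAL
const CUM: [u64; 4] = [0, 8, 12, 14]; // exclusive prefix sums of FREQ
const TOTAL: u64 = 16;

/// Fold one symbol into the state. The new state depends on the old one.
fn encode_step(x: u64, s: usize) -> u64 {
    (x / FREQ[s]) * TOTAL + CUM[s] + (x % FREQ[s])
}

/// Undo one encode step: returns (previous state, decoded symbol).
fn decode_step(x: u64) -> (u64, usize) {
    let slot = x % TOTAL;
    // Last symbol whose cumulative start is <= slot; CUM[0] == 0 always matches.
    let s = (0..4).rfind(|&i| CUM[i] <= slot).unwrap();
    (FREQ[s] * (x / TOTAL) + slot - CUM[s], s)
}

/// Encode a short message by threading the state through every symbol.
fn encode(symbols: &[usize]) -> u64 {
    symbols.iter().fold(TOTAL, |x, &s| encode_step(x, s))
}

/// Symbols come out last-in, first-out, so reverse at the end.
fn decode(mut x: u64, n: usize) -> Vec<usize> {
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        let (prev, s) = decode_step(x);
        out.push(s);
        x = prev;
    }
    out.reverse();
    out
}
```

Interleaved or block-split variants work around this by running several independent states, but each state is still a serial chain.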

**Architecture:** The unified scheduler dispatches LZ77 to GPU and entropy to CPU
workers in parallel. While CPU thread N entropy-encodes block K, the GPU is already
match-finding block K+1. The `GPU_ENTROPY_THRESHOLD` (256KB) is deliberately set
above `DEFAULT_GPU_BLOCK_SIZE` (128KB) to prevent routing entropy to the slower
GPU path.
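
The overlap described above can be sketched with two threads and a bounded channel. This is a minimal illustration, not the unified scheduler: `find_matches` and `entropy_encode` are hypothetical stand-ins for the GPU and CPU stages, and the channel capacity plays the role of the ring depth.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stand-in for GPU LZ77 match-finding (hypothetical): copies the block.
fn find_matches(block: &[u8]) -> Vec<u8> {
    block.to_vec()
}

/// Stand-in for CPU entropy coding (hypothetical): returns the byte sum.
fn entropy_encode(matches: &[u8]) -> u64 {
    matches.iter().map(|&b| b as u64).sum()
}

/// Run the two stages on separate threads. The bounded channel acts as the
/// ring buffer: the producer may run at most `ring_depth` blocks ahead.
fn compress_overlapped(blocks: Vec<Vec<u8>>, ring_depth: usize) -> Vec<u64> {
    let (tx, rx) = sync_channel::<(usize, Vec<u8>)>(ring_depth);

    // "GPU" stage: already working on block K+1 while the consumer handles K.
    let producer = thread::spawn(move || {
        for (index, block) in blocks.iter().enumerate() {
            tx.send((index, find_matches(block))).unwrap();
        }
        // `tx` is dropped here, which ends the consumer loop below.
    });

    // "CPU" stage: a single worker keeps the example deterministic.
    let mut out = Vec::new();
    for (index, matches) in rx {
        assert_eq!(index, out.len()); // blocks arrive in submission order
        out.push(entropy_encode(&matches));
    }
    producer.join().unwrap();
    out
}
```

A real scheduler would fan the consumer side out to several CPU workers and reassemble results by block index; the single consumer here only keeps the sketch short.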

See `docs/design-docs/gpu-strategy.md` for full analysis and `CLAUDE.md` "Known
dead ends" for the complete list of GPU optimization attempts that failed.

## Remaining GPU bottlenecks

1. **GPU BWT still slower than CPU SA-IS** — Radix sort improved 7-14x over bitonic
sort, but CPU SA-IS (O(n)) remains faster at small/medium sizes. GPU catches up
at 64KB+ but prefix-doubling's O(n log n) work is inherently more than SA-IS's O(n).

-2. **No shared memory usage** — LZ77 hash kernel uses only global memory.
-Loading hash buckets into `__local` memory could help at larger sizes.
-
-3. **Hash bucket overflow** — Fixed BUCKET_CAP=64 means highly repetitive data
+2. **Hash bucket overflow** — Fixed BUCKET_CAP=64 means highly repetitive data
may miss good matches. Adaptive bucket sizing could help.

-4. **Huffman WriteCodes atomic contention** — Per-bit atomic_or on the output
-buffer limits scaling. Chunk-based packing could reduce contention.
-
-5. **LZ77 match array still downloaded for dedupe** — GPU match dedup is sequential
+3. **LZ77 match array still downloaded for dedupe** — GPU match dedup is sequential
and runs on CPU. Keeping serialized LZ77 bytes on GPU for histogram+Huffman
is already done (ByteHistogram optimization), but the match download is unavoidable.

## Next steps

-### Priority 0: rANS SIMD completion
-
-See `docs/exec-plans/tech-debt-tracker.md` for full details. Two items: SIMD decode paths (SSE2/AVX2 intrinsics for interleaved decode) and reciprocal multiplication (eliminate data-dependent division in encode).
+### Priority 0: Close the gzip compression ratio gap

-### Priority 1: Use local/shared memory in LZ77 hash kernel
-- Load hash buckets into `__local` memory for faster repeated access
-- Could improve GPU LZ77 performance at mid-range sizes (64KB-256KB)
-- May lower the GPU crossover point from 256KB toward 64-128KB
+LzSeqR is our best pipeline at 35.1% vs gzip's 28.6% (6.5pp gap). The gap is
+encoding efficiency, not match quality (see `CLAUDE.md` "Known dead ends"). The
+format is pre-release so all changes are free. Key opportunities:

-### Priority 3: Chunk-based Huffman bit packing
-- Replace per-bit `atomic_or` in WriteCodes with work-group-local packing
-- Each work-group packs its chunk into local memory (no atomics within WG)
-- Single copy from local → global per chunk
-- Could 5-10x GPU Huffman throughput
+- **Zstd-style literal-run sequences** — replace per-token flags with
+`(literal_run_length, offset, match_length)` tuples, eliminating the flags
+stream entirely. Highest ceiling.
+- **Larger repeat offset cache** (4→8) — each additional repeat saves all extra
+bits for that match.
+- **Entropy-code the extra bits** — `offset_extra` and `length_extra` currently
+bypass rANS; if values are skewed, 5-15% savings.
+- **Sparse frequency tables** — 512 bytes per rANS stream → ~61 bytes for
+narrow-alphabet streams. Saves ~1.3KB/block.
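
To make the first opportunity concrete, here is a rough sketch of regrouping a per-token stream into `(literal_run_length, offset, match_length)` tuples. The `Token` and `Sequence` types and the `to_sequences` function are hypothetical and do not reflect the actual LzSeq format; the point is only that run lengths replace the per-token flag.

```rust
/// Per-token stream as it might look today (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Literal(u8),
    Match { offset: u32, length: u32 },
}

/// One Zstd-style sequence: a run of literals followed by one match.
#[derive(Debug, Clone, PartialEq)]
struct Sequence {
    literal_run_length: u32,
    offset: u32,
    match_length: u32,
}

/// Regroup tokens into sequences. Literal bytes go to one flat buffer; the
/// run lengths say how many to take before each match, so no per-token flag
/// is needed. Returns (literals, sequences, trailing literal count).
fn to_sequences(tokens: &[Token]) -> (Vec<u8>, Vec<Sequence>, u32) {
    let mut literals = Vec::new();
    let mut sequences = Vec::new();
    let mut run = 0u32;
    for token in tokens {
        match *token {
            Token::Literal(byte) => {
                literals.push(byte);
                run += 1;
            }
            Token::Match { offset, length } => {
                sequences.push(Sequence {
                    literal_run_length: run,
                    offset,
                    match_length: length,
                });
                run = 0;
            }
        }
    }
    (literals, sequences, run)
}
```

Back-to-back matches simply produce a sequence with `literal_run_length == 0`, and literals after the last match are reported as the trailing count.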

-### Priority 4: Fuzz testing
-- Set up `cargo-fuzz` for round-trip correctness on all pipelines
-- Target edge cases in LZ77, Huffman, BWT decode paths
+### Priority 1: LzSeq-specific optimal parser

-### Priority 5: Auto-selection threshold tuning
-- Run all 3 pipelines on Canterbury + Silesia corpora
-- Measure actual compression ratios vs analysis metrics
-- Tune heuristic decision tree thresholds empirically
+The optimal parser currently uses LZ77's `match_cost` approximation. A dedicated
+LzSeq optimal parser that tracks repeat offset state through the backward DP would
+find more repeat matches, directly improving ratio.

-### Priority 6: aarch64 NEON/SVE SIMD implementation
+### Priority 2: aarch64 NEON/SVE SIMD implementation
- Replace scalar stubs in `src/simd.rs` with actual NEON intrinsics
- `compare_bytes`: `vceqq_u8` + `vmovn_u16` for 16-byte comparison
- `byte_frequencies`: 4-bank unrolled (NEON lacks efficient gather/scatter)
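
For reference, the 4-bank idea can be written portably before any intrinsics are involved. This is a sketch under assumptions: the signature is hypothetical and the real `byte_frequencies` in `src/simd.rs` may differ.

```rust
/// Portable sketch of a 4-bank byte histogram (hypothetical signature; the
/// real `byte_frequencies` in `src/simd.rs` may differ).
fn byte_frequencies(data: &[u8]) -> [u32; 256] {
    // Four independent tables: consecutive bytes update different banks, so a
    // run of equal bytes does not serialize on one counter.
    let mut banks = [[0u32; 256]; 4];
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        banks[0][chunk[0] as usize] += 1;
        banks[1][chunk[1] as usize] += 1;
        banks[2][chunk[2] as usize] += 1;
        banks[3][chunk[3] as usize] += 1;
    }
    // Tail of up to three bytes goes to bank 0.
    for &byte in chunks.remainder() {
        banks[0][byte as usize] += 1;
    }
    // Merge the banks into the final table.
    let mut total = [0u32; 256];
    for (i, slot) in total.iter_mut().enumerate() {
        *slot = banks[0][i] + banks[1][i] + banks[2][i] + banks[3][i];
    }
    total
}
```

The banks cost 4 KB of stack, which is the usual trade for breaking the store-to-load dependency on repeated bytes.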
19 changes: 19 additions & 0 deletions CLAUDE.md
@@ -52,6 +52,25 @@ Specialized agents in `.claude/agents/` run on cheaper models and keep verbose o
- `scripts/` — test, bench, profile, setup, and analysis tools
- `docs/` — design docs, quality status, exec plans, references

## Known dead ends

Before optimizing GPU code paths, read this first — multiple agents have spent full sessions rediscovering these:

- **GPU entropy (rANS/FSE) is slower than CPU** — 0.77x on encode, 0.54x on decode. This has been proven across 500+ optimization iterations. The serial state dependency in rANS limits GPU to ~300 threads; saturation needs ~8K-16K. Do not attempt to batch, parallelize, or "pipeline" GPU entropy encoding.
- **`gpu_fused_span()` returning `Some((0,1))` is counterproductive** — it routes entropy to the GPU, which is slower. It exists as architectural preparation in case GPU entropy ever becomes competitive. `GPU_ENTROPY_THRESHOLD` (256KB, above the 128KB default GPU block size) intentionally prevents this path from activating.
- **The CLI uses `streaming::compress_stream`, not `pipeline::compress_with_options`** — the parallel scheduler's GPU coordinator is not invoked by the CLI. The streaming path deliberately uses CPU rANS for entropy.
- **The real GPU win (ring-buffered LZ77 batching) is already shipped** — delivers +7-17% throughput. See `docs/design-docs/gpu-strategy.md`.
- **GPU device init time skews throughput benchmarks** — first-call GPU init adds significant overhead that `bench.sh` captures but Criterion amortizes across iterations. When comparing GPU vs CPU throughput, use Criterion (`cargo bench`) for apples-to-apples; `bench.sh` reflects real-world cold-start cost. Don't chase "GPU is slower" regressions that are really just init time.
- **Compression ratio is limited by 5-byte match encoding, not match quality** — the LZ match-finder finds good matches, but the 5-byte serialized match format creates overhead on short matches. Improving ratio means fixing the encoding format, not tuning the matcher.
- **GPU Huffman is a dead end** — Huffman coding requires bit-level alignment, but GPU throughput depends on byte-aligned memory access patterns. This is a fundamental architectural mismatch; do not attempt to port Huffman to GPU.
- **GPU hash tables for LZ matching don't work** — GPU atomics don't preserve insertion order, so hash chains lose recency information. Match quality collapses to ~6% vs CPU's 99.6% on repetitive data. Tried twice (global atomics + shared-memory variant), both catastrophically failed. See `docs/design-docs/experiments.md`.
- **SSE2 rANS decode is 32% slower than scalar** — scalar 4-lane decode gets good ILP from out-of-order execution. SSE2 extract operations serialize and lose that parallelism. Proper SIMD rANS would need merged slot-indexed tables and SSE4.1+. The dispatch is disabled; don't re-enable it.
- **Fully parallel GPU LZ parsing (ParlZ) has unacceptable ratio loss** — 37.6% compression gap vs serial parsing. Forward-max-propagation conflict resolution is too aggressive. Hybrid GPU match-finding + CPU serial parsing is the correct architecture.
- **Iterative GPU algorithms have quadratic host overhead** — Repair grammar compression hit 0.4 MB/s due to 100+ rounds of buffer alloc + readback. Avoid per-round GPU↔CPU synchronization; prefer single-dispatch or persistent-buffer designs.
- **Window-capped suffix sorts break BWT invertibility** — FWST produced 433% ratio (massive expansion). Full suffix sort is structurally required for LF-mapping; there's no shortcut.

For detailed history of all failed experiments, see `docs/design-docs/gpu-experiments-wave2-conclusions.md` and `docs/design-docs/experiments.md`.
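
As a quick illustration of the threshold gate mentioned above, the sketch below copies the two constants from `src/pipeline/mod.rs` and adds a hypothetical predicate plus a compile-time guard. `fused_entropy_allowed` is an assumed name; the real check lives in the scheduler and may take more inputs.

```rust
/// Values copied from `src/pipeline/mod.rs` in this change.
const DEFAULT_GPU_BLOCK_SIZE: usize = 128 * 1024;
const GPU_ENTROPY_THRESHOLD: usize = 256 * 1024;

/// Compile-time guard (an assumption, not present in the source): the build
/// fails if someone lowers the threshold to or below the GPU block size.
const _: () = assert!(DEFAULT_GPU_BLOCK_SIZE < GPU_ENTROPY_THRESHOLD);

/// Hypothetical predicate: the real check lives in the scheduler and may
/// take more inputs than the block length.
fn fused_entropy_allowed(block_len: usize) -> bool {
    block_len >= GPU_ENTROPY_THRESHOLD
}
```

With these values a default-sized GPU block never qualifies for the fused entropy path, which is the intended behavior.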

## Key conventions

See **docs/DESIGN.md** for full design principles and **docs/design-docs/core-beliefs.md** for agent-first operating principles.
74 changes: 37 additions & 37 deletions scripts/fetch-silesia.sh
100644 → 100755
@@ -1,37 +1,37 @@
#!/usr/bin/env bash
# Download and extract the Silesia compression corpus.
# https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
#
# Usage: ./scripts/fetch-silesia.sh [--force]
#
# Downloads to samples/silesia/. Skips if already present unless --force.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"

DEST="${PROJECT_DIR}/samples/silesia"
URL="https://sun.aei.polsl.pl/~sdeor/corpus/silesia.zip"
EXPECTED_FILES=12

if [[ -d "$DEST" ]] && [[ "$(ls "$DEST" | wc -l)" -ge "$EXPECTED_FILES" ]] && [[ "${1:-}" != "--force" ]]; then
    echo "Silesia corpus already present in $DEST ($(ls "$DEST" | wc -l) files)"
    echo "Use --force to re-download."
    exit 0
fi

TMPFILE="$(mktemp /tmp/silesia-XXXXXX.zip)"
trap 'rm -f "$TMPFILE"' EXIT

echo "Downloading Silesia corpus (66 MB)..."
curl -fSL "$URL" -o "$TMPFILE"

mkdir -p "$DEST"
echo "Extracting to $DEST..."
unzip -o "$TMPFILE" -d "$DEST"

# Clean up any .pz artifacts from previous runs
rm -f "$DEST"/*.pz

echo "Done: $(ls "$DEST" | wc -l) files, $(du -sh "$DEST" | cut -f1) total"
7 changes: 5 additions & 2 deletions src/pipeline/mod.rs
@@ -143,8 +143,11 @@ const DEFAULT_BW_BLOCK_SIZE: usize = 512 * 1024;
/// which also improves throughput on streaming GPU paths.
const DEFAULT_GPU_BLOCK_SIZE: usize = 128 * 1024;

-/// Minimum block size for GPU entropy to win over CPU (empirical from Phase 4).
-/// Applies to both individual blocks and total stream byte count.
+/// Gate that **prevents** routing entropy to the GPU, which is slower than CPU
+/// (0.77x encode, 0.54x decode — see `docs/design-docs/gpu-strategy.md`).
+/// Set deliberately above `DEFAULT_GPU_BLOCK_SIZE` (128KB) so GPU blocks never
+/// qualify for the fused entropy path. Do NOT lower this threshold — it will
+/// regress throughput by ~10-15x. See "Known dead ends" in CLAUDE.md.
/// 256KB = 262144 bytes (aligns with AC3.2 threshold).
pub const GPU_ENTROPY_THRESHOLD: usize = 256 * 1024;

11 changes: 11 additions & 0 deletions src/pipeline/parallel.rs
@@ -191,12 +191,23 @@ fn should_route_block_to_gpu_stage0(
/// Returns `Some((start, end))` when every stage in `start..=end` has a GPU
/// implementation, allowing the GPU coordinator to run them sequentially
/// without intermediate queue round-trips or CPU readback.
///
/// **WARNING — GPU entropy is currently SLOWER than CPU** (0.77x encode, 0.54x decode).
/// This function exists as architectural preparation for when GPU entropy becomes
/// competitive. In practice, `GPU_ENTROPY_THRESHOLD` (256KB) is deliberately set above
/// `DEFAULT_GPU_BLOCK_SIZE` (128KB) so the fused path is never activated for GPU blocks.
/// Lowering the threshold or bypassing it will REGRESS throughput by ~10-15x because
/// it serializes all work onto the single coordinator thread AND uses the slower GPU
/// entropy path. See `docs/design-docs/gpu-strategy.md` and
/// `.claude/feedback/2026-03-01-gpu-wins-on-lz77-loses-on-entropy.md`.
#[cfg(feature = "webgpu")]
fn gpu_fused_span(pipeline: Pipeline) -> Option<(usize, usize)> {
    match pipeline {
        // Both stages have GPU paths: LZ77 match-finding + rANS encode
        // NOTE: GPU rANS is slower than CPU — fused path is gated by GPU_ENTROPY_THRESHOLD
        Pipeline::Lzr => Some((0, 1)),
        // Both stages have GPU paths: LzSeq fused match+demux + rANS encode
        // NOTE: GPU rANS is slower than CPU — fused path is gated by GPU_ENTROPY_THRESHOLD
        Pipeline::LzSeqR => Some((0, 1)),
        _ => None,
    }
53 changes: 53 additions & 0 deletions src/pipeline/tests.rs
@@ -899,6 +899,59 @@ mod gpu_batched_tests {
        let decompressed = decompress(&compressed).unwrap();
        assert_eq!(decompressed, input);
    }

    /// Multi-block GPU pipelining regression test.
    ///
    /// Creates input large enough to produce many blocks (>= 8), then
    /// compresses with GPU and verifies the round-trip. This exercises
    /// the full GPU coordinator pipeline: ring-buffered LZ77 stage 0
    /// batching, StageN entropy encoding, and result assembly.
    ///
    /// Correctness failures here indicate pipelining bugs (e.g., staging
    /// buffer reuse before readback, ring slot cross-contamination, or
    /// incomplete synchronization between GPU compute and CPU readback).
    #[test]
    fn test_gpu_pipeline_multiblock_correctness() {
        let engine = match crate::webgpu::WebGpuEngine::new() {
            Ok(e) => std::sync::Arc::new(e),
            Err(_) => return,
        };
        let opts = CompressOptions {
            backend: Backend::WebGpu,
            webgpu_engine: Some(engine),
            block_size: 64 * 1024, // 64KB blocks → 8 blocks for 512KB input
            threads: 2,
            ..CompressOptions::default()
        };

        // 512KB input with distinct pattern per 64KB region to detect
        // cross-contamination between blocks.
        let mut input = Vec::with_capacity(512 * 1024);
        let patterns: Vec<&[u8]> = vec![
            b"alpha pattern block one data here. ",
            b"beta different content for block two. ",
            b"gamma third block uses gamma pattern. ",
            b"delta fourth block with delta stuff. ",
            b"epsilon five five five five five. ",
            b"zeta block six zeta zeta zeta. ",
            b"eta seventh block eta eta eta. ",
            b"theta eighth block theta theta. ",
        ];
        for p in &patterns {
            let block: Vec<u8> = p.iter().cycle().take(64 * 1024).copied().collect();
            input.extend_from_slice(&block);
        }

        for pipeline in [Pipeline::Deflate, Pipeline::Lzr, Pipeline::LzSeqR] {
            let compressed = compress_with_options(&input, pipeline, &opts).unwrap();
            let decompressed = decompress(&compressed).unwrap();
            assert_eq!(
                decompressed, input,
                "{:?} multi-block GPU pipeline round-trip failed",
                pipeline
            );
        }
    }
}

// --- LzSeqR optimal parsing tests (Task 6) ---