
feat(encoding): numeric compression levels (1-22) API #63

Open
polaz wants to merge 10 commits into main from feat/#21-feat-numeric-compression-levels-1-22-api

Conversation

@polaz
Member

@polaz polaz commented Apr 4, 2026

Summary

  • Add CompressionLevel::Level(i32) variant and from_level() constructor for a C zstd-compatible numeric-level API
  • Port level parameter table (levels 1–22) with per-level window size, hash/chain configuration, search depth, and lazy depth
  • Support negative levels for ultra-fast compression mode
  • Source-size-aware selection: set_source_size_hint() on the Matcher trait, FrameCompressor, and StreamingEncoder caps matcher/window parameters at ceil(log2(src_size)), with a MIN_WINDOW_LOG floor
  • Update CLI to accept zstd-compatible numeric levels with --store flag for uncompressed frames

Technical Details

Each numeric level resolves to a LevelParams struct containing the matcher backend, window log, hash fill step, lazy depth, and hash-chain config. Named variants map exactly to their numeric equivalents: Fastest=1, Default=3, Better=7, Best=11. Level 0 = default (level 3), matching C zstd semantics.
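The canonical mapping can be sketched as below; variant names and the `from_level()` signature follow the PR text, but this is an illustrative stand-in, not the crate's actual implementation (which resolves through the full `LevelParams` table):

```rust
// Sketch of the numeric-to-named level mapping described above.
#[derive(Debug, PartialEq)]
enum CompressionLevel {
    Fastest, // numeric level 1
    Default, // numeric level 3
    Better,  // numeric level 7
    Best,    // numeric level 11
    Level(i32),
}

impl CompressionLevel {
    // Level 0 falls back to the default level (3), matching C zstd semantics.
    const fn from_level(n: i32) -> Self {
        match n {
            0 | 3 => CompressionLevel::Default,
            1 => CompressionLevel::Fastest,
            7 => CompressionLevel::Better,
            11 => CompressionLevel::Best,
            other => CompressionLevel::Level(other),
        }
    }
}

fn main() {
    assert_eq!(CompressionLevel::from_level(0), CompressionLevel::Default);
    assert_eq!(CompressionLevel::from_level(7), CompressionLevel::Better);
    assert_eq!(CompressionLevel::from_level(15), CompressionLevel::Level(15));
    println!("mapping ok");
}
```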

For small inputs, set_source_size_hint() adjusts window_log and HC table sizes proportionally to source size, including src_size == 0 (clamped to the minimum window), avoiding multi-MiB allocations for tiny payloads. StreamingEncoder::set_pledged_content_size() also forwards the pledged size as a matcher hint.

Window sizes are capped at 64 MiB (window_log=26) to stay within the decoder's 100 MiB limit.
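The capping arithmetic can be sketched as follows. The function name and the concrete MIN_WINDOW_LOG value are illustrative assumptions; the PR only states the ceil(log2(src_size)) rule, the MIN_WINDOW_LOG floor, and the window_log = 26 (64 MiB) cap:

```rust
const MIN_WINDOW_LOG: u32 = 10; // assumed floor for illustration; the crate's constant may differ
const MAX_WINDOW_LOG: u32 = 26; // 64 MiB cap, per the PR description

// Hypothetical helper: cap a level's window_log from the source-size hint.
fn capped_window_log(level_window_log: u32, src_size: u64) -> u32 {
    // src_size == 0 clamps to the minimum window; sizes above the cap are capped anyway
    let s = src_size.clamp(1, 1u64 << MAX_WINDOW_LOG);
    // ceil(log2(s)) computed via next_power_of_two
    let needed = 64 - s.next_power_of_two().leading_zeros() - 1;
    needed.max(MIN_WINDOW_LOG).min(level_window_log).min(MAX_WINDOW_LOG)
}

fn main() {
    assert_eq!(capped_window_log(24, 1_000), 10); // tiny payload: minimum window
    assert_eq!(capped_window_log(24, 1 << 20), 20); // 1 MiB source: 1 MiB window
    assert_eq!(capped_window_log(27, u64::MAX), 26); // never above the 64 MiB cap
    println!("ok");
}
```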

Known Limitations

  • Levels 4 (greedy) and 13+ (btopt/btultra) are approximated with the closest available backend, so compression ratios won't match C zstd at those levels
  • Dfast backend parameters (hash bits, search depth) are not yet per-level configurable — levels 2–4 share the same hash table size

Test Plan

  • from_level(n) matches named variants for n=1,3,7,11
  • All 22 levels produce valid output and roundtrip correctly (with source-size hint applied in sampled tiny-input loop)
  • Negative levels roundtrip correctly
  • Sampled level validity: non-empty output + roundtrip for [1,3,7,11]
  • Source-size hint: small input roundtrip, window reduction, streaming+pledged, all levels with tiny input
  • Streaming encoder works with numeric levels
  • Out-of-range levels are clamped safely on the default matcher path (MatchGeneratorDriver)
  • cargo nextest run --workspace, cargo build --workspace, cargo fmt --all
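The roundtrip checks above share one shape, sketched below. `compress`/`decompress` here are identity stand-ins so the sketch is self-contained; they are NOT the crate's codec, which is what the real tests exercise:

```rust
// Shape of the per-level roundtrip assertion used across the test plan.
fn roundtrip(data: &[u8], level: i32) -> bool {
    let frame = compress(data, level);
    decompress(&frame) == data
}

// Toy stand-ins so this sketch compiles on its own (identity "codec").
fn compress(data: &[u8], _level: i32) -> Vec<u8> {
    data.to_vec()
}
fn decompress(frame: &[u8]) -> Vec<u8> {
    frame.to_vec()
}

fn main() {
    // sampled levels from the plan, including a negative ultra-fast level
    for level in [-5, 1, 3, 7, 11, 22] {
        assert!(roundtrip(b"hello hello hello", level));
    }
    println!("roundtrips ok");
}
```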

Closes #21

Summary by CodeRabbit

  • New Features

    • Numeric compression levels (0, 1–22 and negative ultra‑fast) with canonical mappings; CLI accepts negative levels, validates range, and adds a --store flag for uncompressed output.
    • Encoders and streaming tools accept per‑frame/source size hints and pledged content size to influence windowing and compression behavior.
  • Documentation

    • README updated with numeric‑level semantics, examples, and C‑compatible numbering notes.
  • Tests

    • Expanded mapping, range/edge‑case, round‑trip, and source‑size‑hint/windowing tests.

Compatibility Note

  • CompressionLevel::Level(i32) expands the public enum shape and is semver-breaking for downstream crates that exhaustively match on CompressionLevel without a wildcard arm.
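To make the breakage concrete (a sketch using the variant names from this PR, not the crate's full definition): a downstream exhaustive match stops compiling once Level(i32) exists, unless it gains a new arm or a wildcard:

```rust
// Sketch only: the real enum lives in zstd/src/encoding/mod.rs.
#[allow(dead_code)]
enum CompressionLevel {
    Fastest,
    Default,
    Better,
    Best,
    Level(i32), // new in this PR: pre-existing exhaustive matches now fail to compile
}

fn describe(level: &CompressionLevel) -> &'static str {
    match level {
        CompressionLevel::Fastest => "fastest",
        CompressionLevel::Default => "default",
        CompressionLevel::Better => "better",
        CompressionLevel::Best => "best",
        // downstream crates must add an arm like this (or `_ => ...`)
        CompressionLevel::Level(_) => "numeric",
    }
}

fn main() {
    assert_eq!(describe(&CompressionLevel::Level(15)), "numeric");
    println!("ok");
}
```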

Note

- add CompressionLevel::Level(i32) and canonical from_level mapping

- port level table, source-size hinting, and matcher sizing behavior

- align CLI parsing/range validation and update docs/tests
Copilot AI review requested due to automatic review settings April 4, 2026 21:12
@coderabbitai

coderabbitai bot commented Apr 4, 2026

📝 Walkthrough


Adds C zstd–compatible numeric compression levels via CompressionLevel::Level(i32) and from_level(), propagates set_source_size_hint() through Matcher/FrameCompressor/StreamingEncoder, makes matcher/table sizing source-size-aware, routes numeric levels into the compressed path, updates CLI to accept numeric --level and --store, and expands tests/docs.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Public API & Level Mapping: zstd/src/encoding/mod.rs | Adds CompressionLevel::Level(i32) and Debug; introduces MIN_LEVEL/MAX_LEVEL/DEFAULT_LEVEL and pub const fn from_level(i32) mapping numeric inputs to named presets or Level(n); adds Matcher::set_source_size_hint hook. |
| Match generation & dynamic params: zstd/src/encoding/match_generator.rs | Replaces fixed per-level constants with table-driven LevelParams (LEVEL_TABLE); adds set_source_size_hint to driver and clamps/resolves window_log/slice sizes from source hint; makes Dfast hash-table sizing dynamic and adjusts pooled buffer reuse; updates tests and backend config flow. |
| Frame compressor: zstd/src/encoding/frame_compressor.rs | Adds source_size_hint: Option<u64> field and pub fn set_source_size_hint(&mut self, size: u64); consumes hint before compress() and forwards to matcher; routes CompressionLevel::Level(_) into compressed block path; test updates. |
| Streaming encoder: zstd/src/encoding/streaming_encoder.rs | Forwards pledged content size into matcher via set_pledged_content_size; adds pub fn set_source_size_hint(u64) -> Result<_,_> (error if frame started); treats Level(_) as compressed mode and requests matcher space for compressed levels; adds tests for hinted behavior and late-hint errors. |
| Encoding wiring & logic: zstd/src/encoding/... | Multiple encoding components updated to accept matcher hints, consistently route numeric Level(_) to the compressed matcher-backed path, and adjust allocation/parameter selection based on resolved LevelParams. |
| CLI & UX: cli/src/main.rs | --level LEVEL parsed as i32 with clap range validation using CompressionLevel::{MIN_LEVEL,MAX_LEVEL} (allows negatives via allow_hyphen_values); adds --store flag that forces uncompressed output while still validating --level; compress() signature updated and encoder configured with source-size hint. |
| Docs & Tests: README.md, zstd/src/tests/roundtrip_integrity.rs | README updated to document numeric levels (0 = default, 1–22, negative ultra-fast) and show CompressionLevel::from_level(7) example; extensive tests added for numeric mappings, roundtrips across levels (including negatives), clamping/out-of-range handling, streaming encoder, and source-size-hint effects. |
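The CLI changes above can be illustrated with hypothetical invocations; the binary name, file names, and output-flag shape are placeholders, while --level (with negatives) and --store come from this PR:

```shell
# numeric level, C zstd-compatible numbering
zstd-cli --level 15 -o out.zst input.bin
# negative level = ultra-fast mode (allow_hyphen_values permits the leading '-')
zstd-cli --level -5 -o fast.zst input.bin
# --store still validates --level but emits an uncompressed frame
zstd-cli --level 3 --store -o raw.zst input.bin
```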

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/User
    participant Encoder as StreamingEncoder
    participant Matcher as MatchGenerator
    participant Compressor as FrameCompressor

    CLI->>Encoder: set_source_size_hint(size)
    activate Encoder
    Encoder->>Matcher: set_source_size_hint(size)
    activate Matcher
    Matcher->>Matcher: store hint for next reset
    deactivate Matcher
    deactivate Encoder

    CLI->>Encoder: compress(data, level, store)
    activate Encoder
    Encoder->>Matcher: reset(level)
    activate Matcher
    Matcher->>Matcher: resolve LevelParams\napply source_size_hint\nclamp window_log
    deactivate Matcher
    Encoder->>Compressor: compress_block_encoded(...)
    activate Compressor
    Compressor->>Matcher: get_next_space()
    Matcher-->>Compressor: buffer sized by adjusted window
    deactivate Compressor
    deactivate Encoder

    Encoder-->>CLI: emit compressed frame (header window adjusted)
sequenceDiagram
    participant Caller as Caller
    participant API as CompressionLevel::from_level
    participant Table as Level Parameter Table

    Caller->>API: from_level(7)
    activate API
    API->>Table: lookup level 7
    activate Table
    Table-->>API: LevelParams{matcher: Lazy2, window_log: X, ...}
    deactivate Table
    API-->>Caller: CompressionLevel::Better
    deactivate API

    Caller->>API: from_level(15)
    activate API
    API->>Table: lookup level 15
    activate Table
    Table-->>API: LevelParams{matcher: BtLazy2, window_log: Y, ...}
    deactivate Table
    API-->>Caller: CompressionLevel::Level(15)
    deactivate API

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰
I nibble levels from one up to twenty-two,
Tiny hints tuck windows small and true,
Numeric crumbs from C’s burrow I find,
Matchers hum and encoders fall in line,
Tests clap paws — the bytes hop back, refined!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly summarizes the main change: adding a numeric compression levels API supporting levels 1-22, which is the primary objective of this PR. |
| Linked Issues Check | ✅ Passed | The PR comprehensively implements all objectives from linked issues #21 and #61: numeric level API via CompressionLevel::Level(i32) and from_level(), parameter table porting, source-size-aware selection, negative level support, CLI updates, and extensive testing. |
| Out of Scope Changes Check | ✅ Passed | All code changes are directly aligned with implementing numeric compression levels: CompressionLevel enum extension, parameter tables, source-size hints, matcher updates, CLI enhancements, and comprehensive tests; no unrelated changes detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |



@codecov

codecov bot commented Apr 4, 2026

Codecov Report

❌ Patch coverage is 99.16667% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| zstd/src/encoding/match_generator.rs | 99.00% | 2 Missing ⚠️ |
| zstd/src/encoding/mod.rs | 88.88% | 1 Missing ⚠️ |


@polaz
Member Author

polaz commented Apr 4, 2026

@coderabbitai re-review

@coderabbitai

coderabbitai bot commented Apr 4, 2026

@polaz Sure, kicking off a full re-review of PR #63 now!


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/encoding/frame_compressor.rs:280-287
Timestamp: 2026-04-02T16:31:58.167Z
Learning: In `structured-zstd`, `CompressionLevel::Best` intentionally uses `MatcherBackend::HashChain` (not a binary-tree/DUBT backend) due to memory constraints: a BT matcher for a 16 MiB window requires ~128–256 MB of tables (C zstd level 11 uses chainLog=25 → 256 MB), which is impractical. Best is distinguished from Better by deeper HC search (32 vs 16 candidates), larger hash/chain tables (2M/1M vs 1M/512K), and higher target match length (128 vs 48), keeping memory under 20 MB. It achieves a proportional compression gain (~0.27% better than Better on corpus) comparable to C zstd's L7→L11 delta (~0.14%). A proper BT backend is deferred as a future optimization. Do not flag Best's use of HashChain as missing a BT implementation.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:479-523
Timestamp: 2026-04-02T19:19:44.750Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), the per-level roundtrip test suites (Better, Best) share a common `level_roundtrip_suite!` macro that expands the same 7-test matrix (compressible, random, multi-block, streaming, edge cases, repeat offsets, large literals) for each level via module-scoped macro invocations. Do not suggest re-inlining these as separate flat test functions — the macro was added intentionally to eliminate cross-level drift.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:40.012Z
Learning: In the structured-zstd codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` for lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These are distinct backends with separate lazy-seeding paths.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:34.233Z
Learning: In the `structured-zstd` codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These two backends are independent; review comments about Dfast internals (e.g. position seeding order in `pick_lazy_match`) do not apply to the Better level's implementation path.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:498-509
Timestamp: 2026-04-02T22:26:07.979Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), `best_level_does_not_regress_vs_better` uses a `<=` (not strict `<`) assertion because the `repeat_offset_fixture(b"HelloWorld", ...)` input is simple enough that HC saturates at both Better (16 candidates) and Best (32 candidates) search depths, producing identical compressed sizes (~30243 bytes). Strict `<` would be a false positive on this fixture. The strict `Best < Better` quality assertion lives in `cross_validation::best_level_beats_better_on_corpus_proxy` on the decodecorpus sample. Do not re-flag the `<=` as a weakened guard.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1554-1559
Timestamp: 2026-04-02T12:03:04.619Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` has an early-return guard at `abs_pos >= u32::MAX` that disables all further HC insertions for streams >4 GiB, degrading `CompressionLevel::Better` to repcodes-only. This is a known limitation tracked in issue `#51`; the proper fix (table rebasing) is deferred. A TODO comment and a limitation note on `CompressionLevel::Better` doc were added in PR `#48`. Do not re-flag this as a new bug.
✅ Actions performed

Full review triggered.


Copilot AI left a comment


Pull request overview

Adds a C zstd-compatible numeric compression level API (including negative “ultra-fast” levels) and introduces source-size-aware tuning so small inputs can avoid oversized windows/tables, with corresponding CLI and test updates.

Changes:

  • Introduces CompressionLevel::Level(i32) + CompressionLevel::from_level(i32) and a 1–22 level parameter table (plus negative levels).
  • Adds set_source_size_hint() plumbing through Matcher, FrameCompressor, and StreamingEncoder, and forwards pledged content size as a matcher hint.
  • Updates CLI to accept numeric levels (including negative) and adds --store; expands roundtrip/tests for new behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Summary per file:

| File | Description |
| --- | --- |
| zstd/src/encoding/mod.rs | Extends CompressionLevel and Matcher with numeric levels + source-size hint hook. |
| zstd/src/encoding/match_generator.rs | Implements level parameter resolution, source-size-based parameter capping, and negative-level handling. |
| zstd/src/encoding/frame_compressor.rs | Exposes set_source_size_hint() and supports Level(_) in the encoded compression path. |
| zstd/src/encoding/streaming_encoder.rs | Forwards pledged size/hints to matcher; supports Level(_) across streaming paths; adds tests. |
| zstd/src/tests/roundtrip_integrity.rs | Adds comprehensive roundtrip coverage for numeric levels, clamping, negative levels, and size-hint behavior. |
| cli/src/main.rs | Switches --level to i32 numeric levels with bounds checking; adds --store; forwards source size hint. |
| README.md | Documents numeric levels and usage examples. |

@polaz polaz requested a review from Copilot April 4, 2026 21:26
@polaz
Member Author

polaz commented Apr 4, 2026

@coderabbitai re-review

@coderabbitai

coderabbitai bot commented Apr 4, 2026

@polaz Sure, kicking off another full re-review of PR #63 now!


🧠 Learnings used

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1470-1490
Timestamp: 2026-04-02T07:27:15.870Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::pick_lazy_match` intentionally queries `find_best_match(abs_pos + 1, ...)` and `find_best_match(abs_pos + 2, ...)` *before* those positions are inserted into the hash-chain tables. This matches C zstd lazy evaluation semantics: pre-seeding those positions would allow a position to match against itself, producing invalid self-matches. Do not flag the absent pre-seeding as a bug.

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Applies to **/*.rs : Code must pass `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings` (excluding `rustc-dep-of-std` and `fuzz_exports` features)

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1531-1535
Timestamp: 2026-04-01T15:33:17.229Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` intentionally casts `abs_pos: usize` to `u32` (wrapping on streams > 4 GiB). This is safe by design: (1) the chain table slot is computed as `(pos_u32 as usize) & chain_mask`, so only the low bits matter and wrapping leaves the addressed slot unchanged; (2) `chain_candidates` filters any recovered `cur as usize` value against `history_abs_start` — a wrapped (small) value is always less than the current sliding-window start and is discarded. Do not flag this cast as a bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 58
File: .github/workflows/ci.yml:31-32
Timestamp: 2026-04-03T11:24:17.021Z
Learning: In `structured-world/structured-zstd`, the CI Clippy lint job (`.github/workflows/ci.yml`) uses two separate steps: (1) the required gate: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`, and (2) a separate bench-internals step: `cargo clippy -p structured-zstd --features hash,std,dict_builder,bench_internals -- -D warnings`. The `bench_internals` feature must NOT replace the primary gate — it runs as an additional step only.
✅ Actions performed

Full review triggered.


Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.



@sw-release-bot sw-release-bot bot left a comment


⚠️ Performance Alert ⚠️

A possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
This commit's benchmark results are worse than the previous results by more than the 1.15 threshold.

| Benchmark suite | Current: 88f0576 | Previous: bbd826e | Ratio |
| --- | --- | --- | --- |
| compress/default/small-1k-random/matrix/pure_rust | 6.186 ms | 4.409 ms | 1.40 |
| compress/default/small-10k-random/matrix/pure_rust | 7.582 ms | 5.932 ms | 1.28 |
| compress/default/low-entropy-1m/matrix/pure_rust | 10.553 ms | 8.754 ms | 1.21 |
| compress/default/large-log-stream/matrix/c_ffi | 3.905 ms | 3.324 ms | 1.17 |
| compress/best/large-log-stream/matrix/c_ffi | 12.561 ms | 10.602 ms | 1.18 |
| decompress/fastest/high-entropy-1m/c_stream/matrix/c_ffi | 0.035 ms | 0.026 ms | 1.35 |
| decompress/default/high-entropy-1m/c_stream/matrix/c_ffi | 0.035 ms | 0.027 ms | 1.30 |
| decompress/better/high-entropy-1m/c_stream/matrix/c_ffi | 0.035 ms | 0.026 ms | 1.35 |

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

@polaz
Member Author

polaz commented Apr 4, 2026

@coderabbitai re-review

@polaz polaz requested a review from Copilot April 4, 2026 22:50
@coderabbitai

coderabbitai bot commented Apr 4, 2026

@polaz Sure, kicking off another full re-review of PR #63 now!


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 33
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T00:05:14.600Z
Learning: In structured-world/structured-zstd, the Clippy requirement is: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`. The `rustc-dep-of-std` feature is excluded (internal feature for Rust stdlib builds only) and `fuzz_exports` is excluded (fuzzing-specific entry points validated separately from the regular lint gate). The `--all-features` flag should NOT be used.


Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Avoid suggesting architectural changes that diverge too far from upstream, as this is a fork of ruzstd (KillingSpark/zstd-rs)

✅ Actions performed

Full review triggered.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Around line 49-50: Remove the duplicated checklist entry by consolidating the
two bullets into one: keep the existing "Numeric levels `0` (default), `1–22`,
and negative ultra-fast levels via `CompressionLevel::from_level(n)` (C zstd
compatible numbering)" bullet and delete the separate "Negative levels for
ultra-fast compression" bullet so the README checklist only mentions negative
levels once.

In `@zstd/src/encoding/match_generator.rs`:
- Around line 402-403: The Simple backend's reuse of pooled SuffixStore buffers
doesn't validate that a reused store's slot count matches the current block
size: when self.slice_size is reduced (set by
self.base_slice_size.min(max_window_size) and self.reported_window_size =
max_window_size) a previously pooled SuffixStore may be too small for the new
space.len(), causing high collision rates; update the reuse logic that pulls
from suffix_pool (and the analogous code at the other occurrence) to check the
chosen pool entry's capacity/slot count against space.len() and recreate a fresh
SuffixStore when the pooled store's slot count is smaller than space.len(),
ensuring we only reuse pools that are at least as large as the current space.
- Around line 76-79: The reset() path currently derives each level's hash_bits
via dfast_hash_bits_for_window(max_window_size), which changes defaults when no
source-size hint is provided; instead, only shrink the Dfast table when a hint
exists. In reset(), gate the call to dfast_hash_bits_for_window behind
hint.is_some() (or otherwise only apply the smaller value when hint.is_some()),
and otherwise set hash_bits to the documented default DFAST_HASH_BITS
(respecting MIN_WINDOW_LOG where applicable). Apply the same guard to the other
occurrences noted (around the lines referenced) so unhinted resets preserve the
original DFAST defaults.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6d342997-4603-4877-a6a5-1ca2e012f54e

📥 Commits

Reviewing files that changed from the base of the PR and between 5ec3b5a and f051085.

📒 Files selected for processing (2)
  • README.md
  • zstd/src/encoding/match_generator.rs

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
zstd/src/encoding/match_generator.rs (2)

563-566: ⚠️ Potential issue | 🟠 Major

Reject undersized pooled SuffixStores when the slice grows again.

get_next_space() now resizes pooled Vec<u8> buffers correctly, but Lines 563-566 still reuse whatever suffix_pool.pop() returns. After a 1 KiB hinted frame, the next full-size Fastest/negative-level frame can index a much larger block with that 1 KiB suffix table, which stays correct but tanks match quality. Recreate the store when store.slots.len() < space.len().

🔧 Suggested capacity check for `suffix_pool` reuse
-                let suffixes = self
-                    .suffix_pool
-                    .pop()
-                    .unwrap_or_else(|| SuffixStore::with_capacity(space.len()));
+                let suffixes = match self.suffix_pool.pop() {
+                    Some(store) if store.slots.len() >= space.len() => store,
+                    _ => SuffixStore::with_capacity(space.len()),
+                };
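The capacity guard above generalises to any pooled-buffer reuse: never hand back a table smaller than the block it must index. A standalone sketch of the pattern (the `SuffixStore`/`slots` names mirror the review, but this struct is a simplified stand-in, not the crate's actual type):

```rust
// Minimal pooled store: a flat slot table whose size must cover the block.
struct SuffixStore {
    slots: Vec<u32>,
}

impl SuffixStore {
    fn with_capacity(n: usize) -> Self {
        SuffixStore { slots: vec![0; n] }
    }
}

/// Pop a pooled store only if it can cover `needed` slots; an undersized
/// store is discarded and replaced, preventing high collision rates when a
/// small hinted frame is followed by a full-size one.
fn take_store(pool: &mut Vec<SuffixStore>, needed: usize) -> SuffixStore {
    match pool.pop() {
        Some(store) if store.slots.len() >= needed => store,
        _ => SuffixStore::with_capacity(needed),
    }
}

fn main() {
    // Left over from a 1 KiB hinted frame:
    let mut pool = vec![SuffixStore::with_capacity(1024)];
    // Next frame needs a 128 KiB table; the undersized store is rejected.
    let store = take_store(&mut pool, 1 << 17);
    assert!(store.slots.len() >= 1 << 17);
}
```

A store that is merely *larger* than needed is still reused, so the guard only costs an allocation when the slice genuinely grows.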
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/encoding/match_generator.rs` around lines 563 - 566, The pooled
SuffixStore popped from self.suffix_pool may be undersized for the new space;
update the reuse logic around the suffixes assignment (the pop/unwrap_or_else
block that currently constructs suffixes) to check the popped store's
capacity/slot count (e.g., store.slots.len()) against space.len() and if
store.slots.len() < space.len() create a new
SuffixStore::with_capacity(space.len()) instead of reusing the undersized store;
keep using self.suffix_pool.pop() when the capacity is sufficient so we only
recreate when necessary.

359-360: ⚠️ Potential issue | 🟠 Major

Keep unhinted Dfast levels on the documented default table size.

Line 424 still derives hash_bits from max_window_size even when Line 359 consumed no hint, so plain CompressionLevel::Level(2) now gets a 19-bit Dfast table while levels 3/4 get 20 bits. That changes the default Dfast profile on the non-hinted path; if levels 2–4 are still supposed to share sizing until explicit per-level tuning exists, keep DFAST_HASH_BITS unless a source-size hint was supplied, or make hash_bits an explicit LevelParams field.

🔧 One safe way to keep shrinking on the hint path only
     fn reset(&mut self, level: CompressionLevel) {
         let hint = self.source_size_hint.take();
+        let hinted = hint.is_some();
         let params = Self::level_params(level, hint);
         let max_window_size = 1usize << params.window_log;
@@
                 let dfast = self
                     .dfast_match_generator
                     .get_or_insert_with(|| DfastMatchGenerator::new(max_window_size));
                 dfast.max_window_size = max_window_size;
                 dfast.lazy_depth = params.lazy_depth;
-                dfast.set_hash_bits(dfast_hash_bits_for_window(max_window_size));
+                dfast.set_hash_bits(if hinted {
+                    dfast_hash_bits_for_window(max_window_size)
+                } else {
+                    DFAST_HASH_BITS
+                });
                 let vec_pool = &mut self.vec_pool;
                 dfast.reset(|mut data| {

Also applies to: 423-424
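The hint-gated selection can be sketched as follows. The body of `dfast_hash_bits_for_window` is not shown in this review, so its scaling and both constant values below are assumptions for illustration; only the `hinted` gating mirrors the suggested diff.

```rust
const DFAST_HASH_BITS: u32 = 20; // documented default (value assumed here)
const MIN_DFAST_HASH_BITS: u32 = 12; // assumed floor for tiny windows

/// Hypothetical stand-in for `dfast_hash_bits_for_window`: derive table
/// bits from the window log, clamped between a floor and the default.
fn dfast_hash_bits_for_window(max_window_size: usize) -> u32 {
    let window_log = (usize::BITS - 1).saturating_sub(max_window_size.leading_zeros());
    window_log.clamp(MIN_DFAST_HASH_BITS, DFAST_HASH_BITS)
}

/// The gating from the suggested diff: only shrink the Dfast table when a
/// source-size hint was actually supplied.
fn hash_bits(hinted: bool, max_window_size: usize) -> u32 {
    if hinted {
        dfast_hash_bits_for_window(max_window_size)
    } else {
        DFAST_HASH_BITS // unhinted resets keep the default table size
    }
}

fn main() {
    // 1 KiB hinted source: table shrinks well below the default.
    assert_eq!(hash_bits(true, 1 << 10), MIN_DFAST_HASH_BITS);
    // No hint: the default is preserved even for a small configured window.
    assert_eq!(hash_bits(false, 1 << 19), DFAST_HASH_BITS);
}
```

Making `hash_bits` an explicit `LevelParams` field, as the comment also suggests, would express the same contract declaratively instead of via the boolean.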

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/encoding/match_generator.rs` around lines 359 - 360, The code
currently derives hash_bits from max_window_size in Self::level_params even when
no source_size_hint was provided (because self.source_size_hint was consumed
earlier), causing unhinted CompressionLevel::Level(2) to get an unexpectedly
larger Dfast table; change level parameter logic so that when no hint was
supplied (i.e., source_size_hint is None) level_params uses the documented
default DFAST_HASH_BITS instead of computing from max_window_size—either by
adding an explicit flag/field to LevelParams indicating whether a hint was
present or by making hash_bits an explicit LevelParams field and only
compute/override it when source_size_hint.is_some(); update functions
referencing hash_bits (e.g., level_params and any callers that use
LevelParams.hash_bits) to follow this contract so unhinted levels retain the
default DFAST_HASH_BITS.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/mod.rs`:
- Around line 91-110: Update the CompressionLevel docs to explain that
Level(i32) values above 11 (e.g., Level(12..=22)) exceed the named preset
hierarchy so “Best” is only the best named preset, not the absolute
highest-ratio mode, and add the same HC/hash-chain caveat already present on
Better/Best: large numeric levels rely on the lazy2 hash-chain backend and are
subject to the ~4 GiB + window limitation caused by the early-return guard in
HcMatchGenerator::insert_position (abs_pos >= u32::MAX) until table rebasing is
implemented; mirror the wording and placement used for Better/Best so readers
understand Level(n) may be stronger but not safer for very large streams.

---

Duplicate comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 563-566: The pooled SuffixStore popped from self.suffix_pool may
be undersized for the new space; update the reuse logic around the suffixes
assignment (the pop/unwrap_or_else block that currently constructs suffixes) to
check the popped store's capacity/slot count (e.g., store.slots.len()) against
space.len() and if store.slots.len() < space.len() create a new
SuffixStore::with_capacity(space.len()) instead of reusing the undersized store;
keep using self.suffix_pool.pop() when the capacity is sufficient so we only
recreate when necessary.
- Around line 359-360: The code currently derives hash_bits from max_window_size
in Self::level_params even when no source_size_hint was provided (because
self.source_size_hint was consumed earlier), causing unhinted
CompressionLevel::Level(2) to get an unexpectedly larger Dfast table; change
level parameter logic so that when no hint was supplied (i.e., source_size_hint
is None) level_params uses the documented default DFAST_HASH_BITS instead of
computing from max_window_size—either by adding an explicit flag/field to
LevelParams indicating whether a hint was present or by making hash_bits an
explicit LevelParams field and only compute/override it when
source_size_hint.is_some(); update functions referencing hash_bits (e.g.,
level_params and any callers that use LevelParams.hash_bits) to follow this
contract so unhinted levels retain the default DFAST_HASH_BITS.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bd1c5550-fd80-4298-9c10-c14dacdad625

📥 Commits

Reviewing files that changed from the base of the PR and between bbd826e and f051085.

📒 Files selected for processing (7)
  • README.md
  • cli/src/main.rs
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/encoding/streaming_encoder.rs
  • zstd/src/tests/roundtrip_integrity.rs

@polaz polaz requested a review from Copilot April 5, 2026 00:09
@polaz
Member Author

polaz commented Apr 5, 2026

@coderabbitai full review

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 968-993: The test
source_size_hint_does_not_shrink_window_below_dictionary_history should also
perform a decode roundtrip to ensure the produced frame is actually decodable;
after using FrameCompressor (set_dictionary, set_source, set_drain, compress)
and reading the frame header via crate::decoding::frame::read_frame_header,
invoke the crate's decoder with the same crate::decoding::Dictionary used
earlier to decompress output.as_slice() and assert the decompressed bytes equal
the original payload; update the test to call the existing project decode helper
(the same approach used in the companion test) or the appropriate decode API so
the test both verifies the advertised window and that the frame roundtrips
correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cb5c60f9-d99a-46a9-bdfb-a41d8cf6add8

📥 Commits

Reviewing files that changed from the base of the PR and between 8265e69 and 61ea66a.

📒 Files selected for processing (1)
  • zstd/src/encoding/frame_compressor.rs

@polaz
Member Author

polaz commented Apr 5, 2026

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Apr 5, 2026

✅ Actions performed

Full review triggered.

@polaz
Member Author

polaz commented Apr 5, 2026

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Apr 5, 2026

✅ Actions performed

Full review triggered.

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

@polaz
Member Author

polaz commented Apr 5, 2026

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Apr 5, 2026

✅ Actions performed

Full review triggered.

@polaz
Member Author

polaz commented Apr 5, 2026

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Apr 5, 2026

✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cli/src/main.rs (1)

132-148: ⚠️ Potential issue | 🟠 Major

Use metadata length as a hint, not a hard pledge.

Line 133 reads an advisory size from filesystem metadata, and line 147 turns it into a fatal contract. This breaks readable pseudo-files (e.g., /proc/self/status reports size 0 but reads 64+ bytes) and any input that changes while being compressed: the first extra byte fails with `write would exceed pledged content size`. For a path-based CLI, `set_source_size_hint()` is safer since it only optimises matcher parameters without enforcing an exact byte count.

🔧 Minimal fix
-        encoder
-            .set_pledged_content_size(source_size as u64)
-            .wrap_err("failed to configure pledged content size")?;
+        encoder
+            .set_source_size_hint(source_size as u64)
+            .wrap_err("failed to configure source size hint")?;
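The behavioural difference between the two setters can be modelled with a toy encoder. The method names mirror `StreamingEncoder`, but the struct and its enforcement logic are a deliberately simplified assumption, not the crate's implementation (the real encoder also tunes matcher/window parameters from the hint).

```rust
use std::io;

#[derive(Default)]
struct ToyEncoder {
    written: u64,
    pledged: Option<u64>, // hard contract: exceeding it is an error
    hint: Option<u64>,    // advisory: recorded but never enforced
}

impl ToyEncoder {
    fn set_pledged_content_size(&mut self, n: u64) {
        self.pledged = Some(n);
    }
    fn set_source_size_hint(&mut self, n: u64) {
        self.hint = Some(n);
    }
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.written += buf.len() as u64;
        if let Some(p) = self.pledged {
            if self.written > p {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidInput,
                    "write would exceed pledged content size",
                ));
            }
        }
        Ok(buf.len())
    }
}

fn main() {
    // A /proc-style pseudo-file: metadata reports 0 bytes but reads succeed.
    let mut pledging = ToyEncoder::default();
    pledging.set_pledged_content_size(0);
    assert!(pledging.write(b"Name:\tcat\n").is_err()); // hard contract violated

    let mut hinting = ToyEncoder::default();
    hinting.set_source_size_hint(0);
    assert!(hinting.write(b"Name:\tcat\n").is_ok()); // hint stays advisory
    assert_eq!(hinting.hint, Some(0));
}
```

This is why a pledge belongs only where the caller truly knows the byte count, while a metadata-derived size should stay a hint.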
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/src/main.rs` around lines 132 - 148, The code currently reads file
metadata into source_size and then calls
StreamingEncoder::set_pledged_content_size(...), which enforces an exact byte
count and breaks on pseudo-files or growing inputs; change that to use
StreamingEncoder::set_source_size_hint(source_size as u64) so the size is only a
performance hint (remove the wrap_err on the pledged call). Locate the creation
of encoder in the closure (structured_zstd::encoding::StreamingEncoder::new) and
replace the set_pledged_content_size call with set_source_size_hint on the same
encoder variable; keep source_file, source_size, encoder_input and
create_temporary_output_file usage the same. Ensure any error handling around
the old pledged call is removed or adjusted since the hint setter does not need
to be treated as a fatal error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1525-1528: Rename the method hash_bits(&self, value: u64) to a
clearer name (e.g., compute_hash_bits or hash_index) to avoid collision with the
field hash_bits: usize; change the function signature accordingly and update
every call site that invokes the method (calls like self.hash_bits(value) or
hash_bits(self, value)) to the new name, and ensure the method body still
references self.hash_bits (the field) for the shift amount.

In `@zstd/src/encoding/streaming_encoder.rs`:
- Around line 1019-1052: Add a small regression case after writing to the
StreamingEncoder in the existing source_size_hint_directly_reduces_window_header
test that calls StreamingEncoder::set_source_size_hint() once data has been
written and asserts it returns an error with io::ErrorKind::InvalidInput
(matching the existing late-call regression for set_pledged_content_size); place
this check after with_hint.write_all(...) and before finish(), referencing
StreamingEncoder::set_source_size_hint, write_all, and set_pledged_content_size
so the post-write rejection behavior is enforced and cannot drift.

---

Outside diff comments:
In `@cli/src/main.rs`:
- Around line 132-148: The code currently reads file metadata into source_size
and then calls StreamingEncoder::set_pledged_content_size(...), which enforces
an exact byte count and breaks on pseudo-files or growing inputs; change that to
use StreamingEncoder::set_source_size_hint(source_size as u64) so the size is
only a performance hint (remove the wrap_err on the pledged call). Locate the
creation of encoder in the closure
(structured_zstd::encoding::StreamingEncoder::new) and replace the
set_pledged_content_size call with set_source_size_hint on the same encoder
variable; keep source_file, source_size, encoder_input and
create_temporary_output_file usage the same. Ensure any error handling around
the old pledged call is removed or adjusted since the hint setter does not need
to be treated as a fatal error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3133f5b6-c129-493a-a61c-8929ccc2b8f8

📥 Commits

Reviewing files that changed from the base of the PR and between bbd826e and 8f5e59b.

📒 Files selected for processing (7)
  • README.md
  • cli/src/main.rs
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/encoding/streaming_encoder.rs
  • zstd/src/tests/roundtrip_integrity.rs

- use source_size_hint in CLI instead of hard pledged size from metadata

- rename dfast hash helper to hash_index for clearer field/method separation

- pin late set_source_size_hint InvalidInput behavior after write
@polaz
Member Author

polaz commented Apr 5, 2026

@coderabbitai Fixed outside-diff note in cli/src/main.rs: path-based compression now calls set_source_size_hint(source_size as u64) instead of set_pledged_content_size(...), so filesystem metadata stays advisory and no longer enforces a hard byte contract for pseudo-files or growing inputs.

@coderabbitai

coderabbitai bot commented Apr 5, 2026


@polaz Thanks for the fix! Using set_source_size_hint(source_size as u64) is the right call for path-based compression — filesystem metadata().len() is unreliable for pseudo-files (/proc/*), FIFOs, or inputs that grow between stat and read. Keeping it advisory means the encoder only uses the hint to scale down matcher/window parameters for small inputs, without binding itself to a hard byte contract that could cause mid-stream errors.
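A minimal sketch of the size-aware scaling described here, assuming the PR's `ceil(log2(src_size))` cap with a `MIN_WINDOW_LOG` floor (the floor's value below is invented for illustration; the crate defines its own constant):

```rust
const MIN_WINDOW_LOG: u32 = 10; // assumed floor, not the crate's real value

/// ceil(log2(n)) for n >= 1; defined as 0 for n <= 1.
fn ceil_log2(n: u64) -> u32 {
    64 - n.saturating_sub(1).leading_zeros()
}

/// Effective window log once a source-size hint is applied: never larger
/// than the level's configured window log, never below the minimum.
fn hinted_window_log(level_window_log: u32, src_size_hint: u64) -> u32 {
    ceil_log2(src_size_hint).clamp(MIN_WINDOW_LOG, level_window_log)
}

fn main() {
    // 4 KiB payload at a level configured for a 16 MiB window (log 24):
    // the hint drops the window to 4 KiB instead of multi-MiB tables.
    assert_eq!(hinted_window_log(24, 4096), 12);
    // Zero-size hint (empty pseudo-file) clamps to the minimum window.
    assert_eq!(hinted_window_log(24, 0), MIN_WINDOW_LOG);
    // A huge hint never exceeds the level's own window log.
    assert_eq!(hinted_window_log(24, 1 << 40), 24);
}
```

Because the result only ever shrinks parameters, an inaccurate hint can cost compression ratio but never correctness, which is what makes it safe for unreliable metadata sizes.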


✏️ Learnings added
Learnt from: polaz
URL: https://github.com/structured-world/structured-zstd/pull/63

Timestamp: 2026-04-05T07:19:26.142Z
Learning: In `cli/src/main.rs` (structured-world/structured-zstd), path-based compression calls `set_source_size_hint(source_size as u64)` (NOT `set_pledged_content_size`) for the filesystem metadata size. This is intentional: `metadata().len()` is unreliable for pseudo-files (`/proc/*`), FIFOs, and growing inputs, so the hint must remain advisory (only scaling matcher/window parameters) and must not impose a hard byte contract on the encoder. Do not flag the absence of `set_pledged_content_size` in path-based compression as a missing feature.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:498-509
Timestamp: 2026-04-02T22:26:07.979Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), `best_level_does_not_regress_vs_better` uses a `<=` (not strict `<`) assertion because the `repeat_offset_fixture(b"HelloWorld", ...)` input is simple enough that HC saturates at both Better (16 candidates) and Best (32 candidates) search depths, producing identical compressed sizes (~30243 bytes). Strict `<` would be a false positive on this fixture. The strict `Best < Better` quality assertion lives in `cross_validation::best_level_beats_better_on_corpus_proxy` on the decodecorpus sample. Do not re-flag the `<=` as a weakened guard.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1554-1559
Timestamp: 2026-04-02T12:03:04.619Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` has an early-return guard at `abs_pos >= u32::MAX` that disables all further HC insertions for streams >4 GiB, degrading `CompressionLevel::Better` to repcodes-only. This is a known limitation tracked in issue `#51`; the proper fix (table rebasing) is deferred. A TODO comment and a limitation note on `CompressionLevel::Better` doc were added in PR `#48`. Do not re-flag this as a new bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:479-523
Timestamp: 2026-04-02T19:19:44.750Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), the per-level roundtrip test suites (Better, Best) share a common `level_roundtrip_suite!` macro that expands the same 7-test matrix (compressible, random, multi-block, streaming, edge cases, repeat offsets, large literals) for each level via module-scoped macro invocations. Do not suggest re-inlining these as separate flat test functions — the macro was added intentionally to eliminate cross-level drift.

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Applies to **/*.rs : Code must pass `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings` (excluding `rustc-dep-of-std` and `fuzz_exports` features)

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/encoding/frame_compressor.rs:280-287
Timestamp: 2026-04-02T16:31:58.167Z
Learning: In `structured-zstd`, `CompressionLevel::Best` intentionally uses `MatcherBackend::HashChain` (not a binary-tree/DUBT backend) due to memory constraints: a BT matcher for a 16 MiB window requires ~128–256 MB of tables (C zstd level 11 uses chainLog=25 → 256 MB), which is impractical. Best is distinguished from Better by deeper HC search (32 vs 16 candidates), larger hash/chain tables (2M/1M vs 1M/512K), and higher target match length (128 vs 48), keeping memory under 20 MB. It achieves a proportional compression gain (~0.27% better than Better on corpus) comparable to C zstd's L7→L11 delta (~0.14%). A proper BT backend is deferred as a future optimization. Do not flag Best's use of HashChain as missing a BT implementation.

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Avoid suggesting architectural changes that diverge too far from upstream, as this is a fork of ruzstd (KillingSpark/zstd-rs)

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 58
File: .github/workflows/ci.yml:31-32
Timestamp: 2026-04-03T11:24:17.021Z
Learning: In `structured-world/structured-zstd`, the CI Clippy lint job (`.github/workflows/ci.yml`) uses two separate steps: (1) the required gate: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`, and (2) a separate bench-internals step: `cargo clippy -p structured-zstd --features hash,std,dict_builder,bench_internals -- -D warnings`. The `bench_internals` feature must NOT replace the primary gate — it runs as an additional step only.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 61
File: zstd/src/encoding/mod.rs:169-179
Timestamp: 2026-04-04T20:45:27.472Z
Learning: In `zstd/src/encoding/mod.rs`, `Matcher::set_source_size_hint()` has a default no-op implementation intentionally. Only the built-in `MatchGeneratorDriver` overrides it to apply the hint during level resolution. `NoDictionaryMatcher` and `TinyMatcher` are test stubs and are explicitly exempt from implementing this hook. Do not flag the absent override in non-driver matchers as a bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 45
File: cli/src/main.rs:195-202
Timestamp: 2026-03-31T12:34:08.051Z
Learning: In `cli/src/main.rs` (structured-world/structured-zstd), the Windows `std::os::windows::fs::MetadataExt` trait exposes `file_index() -> Option<u64>` (a single 64-bit value stabilized in Rust 1.83) — NOT `file_index_high()` / `file_index_low()` split methods, which do not exist. The correct same-file check on Windows uses `volume_serial_number()` and `file_index()`.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-03T17:39:58.627Z
Learning: In `zstd/src/decoding/frame_decoder.rs`, `FrameDecoderState::reset()` intentionally does NOT reserve the frame-content-size (FCS) delta beyond `window_size`. Two cases cover the design space: (1) multi-segment frames where FCS >> window_size — the delta can reach ~99 MiB, and eagerly reserving it would penalize streaming callers; (2) single-segment frames where window_size == FCS — the delta is always 0 so there is nothing to reserve. Incremental growth during decoding is handled by the per-block `MAX_BLOCK_SIZE` (128 KiB) reserve in `execute_sequences`. Do not flag the absence of an FCS-delta reserve in `reset()` as a missing pre-allocation.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-03T16:36:03.248Z
Learning: In `zstd/src/decoding/frame_decoder.rs`, `FrameDecoderState::reset()` appears asymmetric to `new()` in its pre-allocation, but the asymmetry is intentional: `decoder_scratch.reset(window_size)` internally calls `DecodeBuffer::reset()` which already reserves `window_size`. Therefore `reset()` only needs to call `scratch.buffer.reserve(frame_content_size)` on top of that for the FCS delta, rather than explicitly reserving `window_size` again as `new()` does. Do not re-flag this as a bug or asymmetry.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 44
File: zstd/src/encoding/frame_compressor.rs:0-0
Timestamp: 2026-03-28T22:55:01.577Z
Learning: In `zstd/src/encoding/frame_compressor.rs`, `FrameCompressor::set_dictionary` returns `Result<Option<Dictionary>, DictionaryDecodeError>` and enforces two fail-fast programmer-error contracts via `Err(...)` rather than panics: (1) `dictionary.id == 0` returns `DictionaryDecodeError::ZeroDictionaryId`; (2) any zero entry in `dictionary.offset_hist` returns `DictionaryDecodeError::ZeroRepeatOffsetInDictionary`. The fallible parsing paths (`Dictionary::decode_dict`, `Dictionary::from_raw_content`) apply the same rejections at input boundaries. Do not flag these as inconsistencies — the uniform `Result`-based contract is intentional.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:34.233Z
Learning: In the `structured-zstd` codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These two backends are independent; review comments about Dfast internals (e.g. position seeding order in `pick_lazy_match`) do not apply to the Better level's implementation path.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:40.012Z
Learning: In the structured-zstd codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` for lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These are distinct backends with separate lazy-seeding paths.

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 2196-2202: Replace the loose inequality check with a deterministic
assertion pinning the hinted Dfast table size: instead of asserting
hinted_tables < full_tables, assert that driver.dfast_matcher().short_hash.len()
equals the expected size computed by calling
dfast_hash_bits_for_window(MIN_WINDOW_LOG) (or the equivalent constant value for
MIN_WINDOW_LOG) so regressions in dfast_hash_bits_for_window() are caught.

In `@zstd/src/encoding/streaming_encoder.rs`:
- Around line 104-109: Update the doc comment for set_source_size_hint to
describe its visible effects: state that besides tuning matcher parameters it
can also reduce the emitted frame window and influence block sizing (thus
changing the layout of emitted blocks/frames), and note it must be called before
the first write and that it is different from set_pledged_content_size;
reference set_source_size_hint, set_pledged_content_size, and write so reviewers
can locate the docs and the related behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a253bb40-d7e1-4ccf-9b54-bbb91762387f

📥 Commits

Reviewing files that changed from the base of the PR and between 8f5e59b and eb362d0.

📒 Files selected for processing (3)
  • cli/src/main.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/streaming_encoder.rs

- keep dictionary source-size hint scoped to payload bytes in FrameCompressor

- relax dictionary window assertions to roundtrip + nonincreasing-window guarantees

- pin dfast hinted table size and document visible set_source_size_hint effects
@polaz
Member Author

polaz commented Apr 5, 2026

@coderabbitai re-review

@polaz polaz requested a review from Copilot April 5, 2026 08:01
@coderabbitai

coderabbitai bot commented Apr 5, 2026

@polaz Sure, kicking off another full re-review of PR #63 now!


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:498-509
Timestamp: 2026-04-02T22:26:07.979Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), `best_level_does_not_regress_vs_better` uses a `<=` (not strict `<`) assertion because the `repeat_offset_fixture(b"HelloWorld", ...)` input is simple enough that HC saturates at both Better (16 candidates) and Best (32 candidates) search depths, producing identical compressed sizes (~30243 bytes). Strict `<` would be a false positive on this fixture. The strict `Best < Better` quality assertion lives in `cross_validation::best_level_beats_better_on_corpus_proxy` on the decodecorpus sample. Do not re-flag the `<=` as a weakened guard.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/encoding/frame_compressor.rs:280-287
Timestamp: 2026-04-02T16:31:58.167Z
Learning: In `structured-zstd`, `CompressionLevel::Best` intentionally uses `MatcherBackend::HashChain` (not a binary-tree/DUBT backend) due to memory constraints: a BT matcher for a 16 MiB window requires ~128–256 MB of tables (C zstd level 11 uses chainLog=25 → 256 MB), which is impractical. Best is distinguished from Better by deeper HC search (32 vs 16 candidates), larger hash/chain tables (2M/1M vs 1M/512K), and higher target match length (128 vs 48), keeping memory under 20 MB. It achieves a proportional compression gain (~0.27% better than Better on corpus) comparable to C zstd's L7→L11 delta (~0.14%). A proper BT backend is deferred as a future optimization. Do not flag Best's use of HashChain as missing a BT implementation.
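The memory argument above can be checked with a back-of-envelope sketch. This is an illustrative calculation only, not the crate's actual allocation code: a binary-tree matcher keeps two u32 entries (left/right child) per table slot, while a hash-chain matcher keeps one u32 per hash or chain slot. Plugging in the chainLog=25 figure and Best's 2M/1M table sizes from the note reproduces the 256 MB vs ~12 MB comparison.

```rust
// Illustrative memory estimate behind keeping Best on a hash-chain backend.
// Formulas and constants are taken from the review note above; the function
// names are hypothetical, not the crate's API.
fn bt_table_bytes(chain_log: u32) -> u64 {
    // Binary-tree matcher: two u32 child pointers per position slot.
    2 * 4 * (1u64 << chain_log)
}

fn hc_table_bytes(hash_entries: u64, chain_entries: u64) -> u64 {
    // Hash-chain matcher: one u32 per hash slot and per chain slot.
    4 * (hash_entries + chain_entries)
}

fn main() {
    // C zstd level 11 uses chainLog=25 -> 256 MB of BT tables.
    println!("{} MB", bt_table_bytes(25) >> 20);
    // Best's HC tables (2M hash + 1M chain entries) stay around 12 MB.
    println!("{} MB", hc_table_bytes(2 << 20, 1 << 20) >> 20);
}
```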

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-05T07:19:26.142Z
Learning: In `cli/src/main.rs` (structured-world/structured-zstd), path-based compression calls `set_source_size_hint(source_size as u64)` (NOT `set_pledged_content_size`) for the filesystem metadata size. This is intentional: `metadata().len()` is unreliable for pseudo-files (`/proc/*`), FIFOs, and growing inputs, so the hint must remain advisory (only scaling matcher/window parameters) and must not impose a hard byte contract on the encoder. Do not flag the absence of `set_pledged_content_size` in path-based compression as a missing feature.
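The advisory-hint behavior described above (scaling the window from `ceil(log2(src_size))` with a minimum-window floor and the PR's 64 MiB cap) can be sketched in a few lines. This is a standalone illustration under assumed constants, not the crate's implementation; `MIN_WINDOW_LOG` here is a placeholder value and `hinted_window_log` is a hypothetical helper name.

```rust
// Sketch of advisory source-size scaling: the hint only caps the window,
// it never imposes a byte contract on the encoder.
const MIN_WINDOW_LOG: u32 = 10; // assumed floor (1 KiB), illustrative only
const MAX_WINDOW_LOG: u32 = 26; // 64 MiB cap from the PR description

fn hinted_window_log(src_size: u64, level_window_log: u32) -> u32 {
    // ceil(log2(src_size)); src_size == 0 or 1 clamps to the minimum window.
    let needed = if src_size <= 1 {
        MIN_WINDOW_LOG
    } else {
        64 - (src_size - 1).leading_zeros()
    };
    // Callers are assumed to pass level_window_log >= MIN_WINDOW_LOG.
    needed.clamp(MIN_WINDOW_LOG, level_window_log.min(MAX_WINDOW_LOG))
}

fn main() {
    // A 4 KiB payload shrinks a level's 16 MiB window to 4 KiB.
    println!("{}", hinted_window_log(4096, 24));
    // An empty source clamps to the minimum window.
    println!("{}", hinted_window_log(0, 24));
    // A huge source is capped by the level's own window log.
    println!("{}", hinted_window_log(1 << 40, 24));
}
```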

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:34.233Z
Learning: In the `structured-zstd` codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These two backends are independent; review comments about Dfast internals (e.g. position seeding order in `pick_lazy_match`) do not apply to the Better level's implementation path.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1554-1559
Timestamp: 2026-04-02T12:03:04.619Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` has an early-return guard at `abs_pos >= u32::MAX` that disables all further HC insertions for streams >4 GiB, degrading `CompressionLevel::Better` to repcodes-only. This is a known limitation tracked in issue `#51`; the proper fix (table rebasing) is deferred. A TODO comment and a limitation note on `CompressionLevel::Better` doc were added in PR `#48`. Do not re-flag this as a new bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:479-523
Timestamp: 2026-04-02T19:19:44.750Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), the per-level roundtrip test suites (Better, Best) share a common `level_roundtrip_suite!` macro that expands the same 7-test matrix (compressible, random, multi-block, streaming, edge cases, repeat offsets, large literals) for each level via module-scoped macro invocations. Do not suggest re-inlining these as separate flat test functions — the macro was added intentionally to eliminate cross-level drift.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:40.012Z
Learning: In the structured-zstd codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` for lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These are distinct backends with separate lazy-seeding paths.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 58
File: .github/workflows/ci.yml:31-32
Timestamp: 2026-04-03T11:24:17.021Z
Learning: In `structured-world/structured-zstd`, the CI Clippy lint job (`.github/workflows/ci.yml`) uses two separate steps: (1) the required gate: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`, and (2) a separate bench-internals step: `cargo clippy -p structured-zstd --features hash,std,dict_builder,bench_internals -- -D warnings`. The `bench_internals` feature must NOT replace the primary gate — it runs as an additional step only.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1470-1490
Timestamp: 2026-04-02T07:27:15.870Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::pick_lazy_match` intentionally queries `find_best_match(abs_pos + 1, ...)` and `find_best_match(abs_pos + 2, ...)` *before* those positions are inserted into the hash-chain tables. This matches C zstd lazy evaluation semantics: pre-seeding those positions would allow a position to match against itself, producing invalid self-matches. Do not flag the absent pre-seeding as a bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 61
File: zstd/src/encoding/mod.rs:169-179
Timestamp: 2026-04-04T20:45:27.472Z
Learning: In `zstd/src/encoding/mod.rs`, `Matcher::set_source_size_hint()` has a default no-op implementation intentionally. Only the built-in `MatchGeneratorDriver` overrides it to apply the hint during level resolution. `NoDictionaryMatcher` and `TinyMatcher` are test stubs and are explicitly exempt from implementing this hook. Do not flag the absent override in non-driver matchers as a bug.

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Avoid suggesting architectural changes that diverge too far from upstream, as this is a fork of ruzstd (KillingSpark/zstd-rs)

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Applies to **/*.rs : Code must pass `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings` (excluding `rustc-dep-of-std` and `fuzz_exports` features)

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 44
File: zstd/src/encoding/frame_compressor.rs:0-0
Timestamp: 2026-03-28T22:55:01.577Z
Learning: In `zstd/src/encoding/frame_compressor.rs`, `FrameCompressor::set_dictionary` returns `Result<Option<Dictionary>, DictionaryDecodeError>` and enforces two fail-fast programmer-error contracts via `Err(...)` rather than panics: (1) `dictionary.id == 0` returns `DictionaryDecodeError::ZeroDictionaryId`; (2) any zero entry in `dictionary.offset_hist` returns `DictionaryDecodeError::ZeroRepeatOffsetInDictionary`. The fallible parsing paths (`Dictionary::decode_dict`, `Dictionary::from_raw_content`) apply the same rejections at input boundaries. Do not flag these as inconsistencies — the uniform `Result`-based contract is intentional.
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
zstd/src/encoding/frame_compressor.rs (1)

163-170: ⚠️ Potential issue | 🟡 Minor

Update the doc comment to match the new payload-only hint semantics.

Lines 168-170 still describe the old dict + payload behavior, but Lines 190-193 now forward only size_hint, and the new dictionary tests expect a non-increasing advertised window. As written, this public contract promises a larger window than callers will actually get.

📝 Suggested doc fix
-    /// When dictionary priming is active, the effective matcher hint also
-    /// includes primed dictionary history bytes in addition to frame payload
-    /// bytes so the advertised window can cover both.
+    /// When dictionary priming is active, this hint still refers to frame
+    /// payload bytes only. Dictionary history is applied separately during
+    /// priming and does not expand the advertised frame window.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/encoding/frame_compressor.rs` around lines 163 - 170, The doc
comment for the frame size hint still claims the hint includes primed dictionary
bytes, but the implementation now forwards only size_hint; update the comment
above the Provide a hint... block to state that the hint is payload-only
(referring to size_hint) and does NOT include any primed dictionary history, and
that the advertised window is computed from that payload-only size_hint (so
callers may observe a non-increasing advertised window when a dictionary is
present). Mention the symbol size_hint and the frame compressor behavior
(advertised window selection) so readers know the contract matches the new
forwarding logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@zstd/src/encoding/frame_compressor.rs`:
- Around line 163-170: The doc comment for the frame size hint still claims the
hint includes primed dictionary bytes, but the implementation now forwards only
size_hint; update the comment above the Provide a hint... block to state that
the hint is payload-only (referring to size_hint) and does NOT include any
primed dictionary history, and that the advertised window is computed from that
payload-only size_hint (so callers may observe a non-increasing advertised
window when a dictionary is present). Mention the symbol size_hint and the frame
compressor behavior (advertised window selection) so readers know the contract
matches the new forwarding logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6ab25ddc-cfba-4e85-898b-9d03286e0989

📥 Commits

Reviewing files that changed from the base of the PR and between eb362d0 and 88f0576.

📒 Files selected for processing (3)
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/streaming_encoder.rs

@polaz
Member Author

polaz commented Apr 5, 2026

@coderabbitai Fixed duplicate note in zstd/src/encoding/frame_compressor.rs: updated set_source_size_hint docs to payload-only semantics, explicitly stating dictionary priming is separate and does not inflate the advertised frame window.

@coderabbitai

coderabbitai bot commented Apr 5, 2026

Tip

For best results, initiate chat on the files or code changes.

@polaz Thanks for the clarification! Scoping set_source_size_hint to payload bytes only is the correct semantic — dictionary priming bytes operate on pre-existing shared history and should have no bearing on the advertised frame window size. The updated doc comment makes that contract explicit and prevents callers from mistakenly including dictionary bytes in the hint.

(ミ・ᴥ・ミ) 🐇 noted!


✏️ Learnings added
Learnt from: polaz
URL: https://github.com/structured-world/structured-zstd/pull/63

Timestamp: 2026-04-05T08:17:53.828Z
Learning: In `zstd/src/encoding/frame_compressor.rs` (structured-world/structured-zstd), `set_source_size_hint` applies only to the payload (uncompressed source) bytes, not to dictionary priming. Dictionary priming is separate and does not inflate the advertised frame window size. Do not suggest including dictionary bytes in the size hint passed to `set_source_size_hint`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-03T16:36:03.248Z
Learning: In `zstd/src/decoding/frame_decoder.rs`, `FrameDecoderState::reset()` appears asymmetric to `new()` in its pre-allocation, but the asymmetry is intentional: `decoder_scratch.reset(window_size)` internally calls `DecodeBuffer::reset()` which already reserves `window_size`. Therefore `reset()` only needs to call `scratch.buffer.reserve(frame_content_size)` on top of that for the FCS delta, rather than explicitly reserving `window_size` again as `new()` does. Do not re-flag this as a bug or asymmetry.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-03T17:39:58.627Z
Learning: In `zstd/src/decoding/frame_decoder.rs`, `FrameDecoderState::reset()` intentionally does NOT reserve the frame-content-size (FCS) delta beyond `window_size`. Two cases cover the design space: (1) multi-segment frames where FCS >> window_size — the delta can reach ~99 MiB, and eagerly reserving it would penalize streaming callers; (2) single-segment frames where window_size == FCS — the delta is always 0 so there is nothing to reserve. Incremental growth during decoding is handled by the per-block `MAX_BLOCK_SIZE` (128 KiB) reserve in `execute_sequences`. Do not flag the absence of an FCS-delta reserve in `reset()` as a missing pre-allocation.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:336-339
Timestamp: 2026-04-02T12:03:03.856Z
Learning: In `zstd/src/encoding/match_generator.rs`, `min_primed_tail = 4` is correct for both `MatcherBackend::Dfast` and `MatcherBackend::HashChain` because `HcMatchGenerator::insert_position` unconditionally requires 4 bytes of hash lookahead (`if idx + 4 > concat.len() { return; }`). `backfill_boundary_positions` only re-inserts positions already fully within extended history; it cannot make a <4-byte dictionary tail fragment hashable. Do not suggest lowering `min_primed_tail` below 4 for the `HashChain` backend.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1531-1535
Timestamp: 2026-04-01T15:33:17.229Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` intentionally casts `abs_pos: usize` to `u32` (wrapping on streams > 4 GiB). This is safe by design: (1) the chain table slot is computed as `(pos_u32 as usize) & chain_mask`, so only the low bits matter and wrapping leaves the addressed slot unchanged; (2) `chain_candidates` filters any recovered `cur as usize` value against `history_abs_start` — a wrapped (small) value is always less than the current sliding-window start and is discarded. Do not flag this cast as a bug.
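Point (1) of the learning above can be demonstrated in isolation: truncating a position to u32 before masking with the chain mask addresses the same slot as masking the full value, whenever the mask fits in 32 bits. The sketch below is a standalone illustration (the function name and table size are assumed, and it requires a 64-bit target for the >4 GiB position).

```rust
// Why the wrapping usize -> u32 cast is harmless: the chain-table slot keeps
// only the mask's low bits, and wrapping at 2^32 preserves those bits as long
// as chain_mask < 2^32. `chain_slot` is a hypothetical helper, not crate API.
fn chain_slot(abs_pos: usize, chain_mask: usize) -> usize {
    ((abs_pos as u32) as usize) & chain_mask
}

fn main() {
    let chain_mask = (1usize << 20) - 1; // assumed 1M-entry chain table
    let pos = (1usize << 33) + 12345;    // a position past the 4 GiB boundary
    // Wrapped and unwrapped addressing agree on the low bits.
    assert_eq!(chain_slot(pos, chain_mask), pos & chain_mask);
    println!("slot = {}", chain_slot(pos, chain_mask));
}
```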



Development

Successfully merging this pull request may close these issues.

feat: numeric compression levels (1-22) API
