
Fix Snappy decompressor buffer corruption and add ClickBench tests #36

Merged
unexge merged 1 commit into main from push-uwuzsurnykvt
Feb 5, 2026


unexge commented Feb 5, 2026

Summary

Fixes buffer corruption in the Snappy decompressor that occurred when decompressing large files containing copy (back-reference) operations, and adds ClickBench conformance tests that exercise the fix.

Changes

Snappy Decompressor Fix

  • Root cause: The decompressor resolved copy back-references against the writer's buffer, but that buffer could be invalidated or flushed between writes, causing corruption
  • Fix: Introduced a dedicated 128KB circular window buffer that retains the most recently written bytes for copy operations, independent of the writer's state
  • Added bounds checking for copy offsets: copy.offset == 0 and copy.offset > d.total_written are treated as invalid
  • Added EOF handling when remaining == 0 in both the literal and copy states
  • Added a buffer length assertion in init()
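
The window-based copy described in the list above can be sketched as follows. This is a minimal Python illustration, not the project's code: the `Window` class, its method names, and the 8-byte size used in the usage lines are assumptions made for the example. Only the two offset checks (`offset == 0` and `offset > total_written`) come from the description; the extra check against the window size is an assumption about what a fixed-size window would also need.

```python
class Window:
    """Fixed-size circular buffer holding the most recently emitted bytes.

    Hypothetical sketch: names and structure are illustrative only.
    """

    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)
        self.size = size
        self.pos = 0            # next write index inside buf
        self.total_written = 0  # bytes emitted since the start of the stream

    def push(self, data: bytes) -> None:
        """Record literal bytes that were just emitted."""
        for b in data:
            self.buf[self.pos] = b
            self.pos = (self.pos + 1) % self.size
        self.total_written += len(data)

    def copy(self, offset: int, length: int) -> bytes:
        """Resolve a back-reference of `length` bytes starting `offset` bytes back."""
        # Checks named in the PR description.
        if offset == 0 or offset > self.total_written:
            raise ValueError("invalid copy offset")
        # Assumption: bytes older than the window size are no longer available.
        if offset > self.size:
            raise ValueError("copy offset exceeds window size")
        out = bytearray()
        # Byte-by-byte so that overlapping copies (offset < length) repeat
        # the bytes produced earlier in the same copy.
        for _ in range(length):
            b = self.buf[(self.pos - offset) % self.size]
            out.append(b)
            self.buf[self.pos] = b
            self.pos = (self.pos + 1) % self.size
            self.total_written += 1
        return bytes(out)


w = Window(8)
w.push(b"ab")
repeated = w.copy(2, 6)  # overlapping copy: repeats "ab" three times
```

Because the window owns its storage, nothing the writer does to its own buffer (flushing, reusing, reallocating) can change the bytes a later copy reads back.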

ClickBench Conformance Tests

  • Added download script for ClickBench partitioned Parquet files (~450MB total, CI only)
  • Added conformance test that reads all 3 ClickBench files (105 columns each)
  • Updated README with ClickBench dataset documentation

Testing

  • All existing Snappy tests pass
  • ClickBench tests exercise the fix with real-world large Snappy-compressed data

Snappy fix:
- The decompressor shared one buffer between the circular window
  (for back-references) and the Reader's internal buffer.
- When the Reader framework used the buffer, it overwrote window
  contents and corrupted later back-references.
- Fix: Use separate buffer regions - first 128KB for the window,
  remaining 4KB for the Reader's output buffer.
- Increased the window to 128KB to support the 4-byte-offset copy
  test (offset 65540).
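
The region split and the size arithmetic behind it can be illustrated with a short Python sketch. The function name `split_buffer` and the use of `memoryview` are assumptions for the example; the two sizes and the offset 65540 are taken from the commit message. The length assertion is one plausible reading of the buffer length check mentioned for `init()`, not a confirmed detail.

```python
WINDOW_SIZE = 128 * 1024  # 131072 bytes, window region (from the commit message)
OUTPUT_SIZE = 4 * 1024    # 4096 bytes, Reader output region (from the commit message)


def split_buffer(backing: bytearray):
    """Return (window, output) views over disjoint regions of one buffer."""
    # Assumed analogue of a buffer length assertion: both regions must fit.
    assert len(backing) >= WINDOW_SIZE + OUTPUT_SIZE, "backing buffer too small"
    view = memoryview(backing)
    window = view[:WINDOW_SIZE]
    output = view[WINDOW_SIZE:WINDOW_SIZE + OUTPUT_SIZE]
    return window, output


backing = bytearray(WINDOW_SIZE + OUTPUT_SIZE)
window, output = split_buffer(backing)

# The reader may overwrite its own region freely...
output[:] = b"\xff" * OUTPUT_SIZE
# ...without touching any byte of the window region.
window_untouched = bytes(window) == bytes(WINDOW_SIZE)

# A back-reference can only reach bytes that are still held in the window.
fits_in_64k = 65540 <= 64 * 1024     # 65536 < 65540, so a 64KB window is too small
fits_in_128k = 65540 <= WINDOW_SIZE  # 131072 >= 65540
```

An offset of 65540 is four bytes beyond what a 64KB (65536-byte) window could hold, which is consistent with the move to 128KB.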

ClickBench dataset:
- Added ClickBench web analytics dataset (3 parquet files, ~450MB total)
- 105 columns with diverse types, real-world 'dirty' data
- CI-only tests that read all columns from all row groups
- Updated download script and documentation
unexge enabled auto-merge (squash) February 5, 2026 09:25
unexge merged commit 59251e1 into main Feb 5, 2026
1 check passed
unexge deleted the push-uwuzsurnykvt branch February 5, 2026 10:24