Skip to content

fix: use BitVec<u8, Lsb0> for platform-independent bit ordering#7

Closed
eywalker wants to merge 1 commit intomainfrom
fix/bitvec-lsb0-platform-independent
Closed

fix: use BitVec<u8, Lsb0> for platform-independent bit ordering#7
eywalker wants to merge 1 commit intomainfrom
fix/bitvec-lsb0-platform-independent

Conversation

@eywalker
Copy link
Copy Markdown
Contributor

@eywalker eywalker commented Mar 7, 2026

Summary

  • Validity bitmap: Changed from default BitVec (BitVec<usize, Lsb0>) to BitVec<u8, Lsb0> — fixes platform-dependent word size (usize is 4 bytes on 32-bit, 8 bytes on 64-bit)
  • Boolean value packing: Changed from BitVec<u8, Msb0> to BitVec<u8, Lsb0> — consistent with Arrow's native bit ordering
  • Bitmap length serialization: Cast len() to u64 before to_le_bytes() for stable 8-byte output regardless of platform

All bit vectors now use BitVec<u8, Lsb0>, matching Arrow's native validity bitmap layout.

Test plan

  • All 32 lib tests pass (including updated byte-level boolean packing tests)
  • All 19 integration tests pass (golden hash values updated)
  • 13 ignored tests (pre-existing, for known unfixed bugs) unchanged

🤖 Generated with Claude Code

…ordering

The validity bitmap and boolean value packing previously used inconsistent
BitVec parameterizations:

- Validity bitmap: default BitVec (BitVec<usize, Lsb0>) — platform-dependent
  word size (4 bytes on 32-bit, 8 bytes on 64-bit)
- Boolean packing: BitVec<u8, Msb0> — fixed size but non-Arrow bit order

This changes both to BitVec<u8, Lsb0>, which:
1. Matches Arrow's native validity bitmap layout (LSB-first, u8 words)
2. Is platform-independent (u8 words are always 1 byte)
3. Uses consistent bit ordering throughout the codebase
4. Casts bitmap length to u64 before to_le_bytes() for stable serialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 7, 2026 02:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes Arrow digest hashing deterministic across platforms by standardizing bitmap storage to BitVec<u8, Lsb0>, aligning boolean bit-packing with Arrow’s native LSB-first ordering, and ensuring bitmap length serialization is stable.

Changes:

  • Switch validity bitmaps from default BitVec (usize-backed) to BitVec<u8, Lsb0> for platform-independent raw storage.
  • Change boolean value packing from Msb0 to Lsb0 to match Arrow’s bit ordering.
  • Serialize bitmap bit-length as u64 before hashing to ensure consistent 8-byte output.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/arrow_digester_core.rs Standardizes bitmap representation and boolean packing; updates bitmap-length hashing for cross-platform stability.
tests/arrow_digester.rs Updates golden digests to reflect the new stable bit ordering/serialization.
Comments suppressed due to low confidence (1)

src/arrow_digester_core.rs:338

  • Boolean hashing for non-nullable arrays only feeds the packed bytes into the digest. Because the last byte is padded, different boolean lengths can produce identical byte streams (e.g., len=1 and len=8 with the same first bit), leading to guaranteed hash collisions. Prefix the boolean bit-length (or element count) into the digest before updating with bit_vec.as_raw_slice() so arrays with different lengths cannot collide.
                        for i in 0..bool_array.len() {
                            bit_vec.push(bool_array.value(i));
                        }

                        data_digest.update(bit_vec.as_raw_slice());

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +730 to +731


Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two extra blank lines before the use crate::arrow_digester_core::{...}; import; please remove them to keep the test module formatting consistent.

Suggested change

Copilot uses AI. Check for mistakes.
Comment on lines 171 to 175
DigestBufferType::Nullable(null_bit_digest, data_digest) => {
final_digest.update(null_bit_digest.len().to_le_bytes());
final_digest.update((null_bit_digest.len() as u64).to_le_bytes());
for &word in null_bit_digest.as_raw_slice() {
final_digest.update(word.to_be_bytes());
}
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the validity bitmap uses BitVec<u8, Lsb0>, iterating each byte and calling to_be_bytes() is redundant and adds per-byte update overhead. You can update the digest in one call with the raw slice (and drop the clippy::big_endian_bytes expect) to keep this path simpler and faster.

Copilot uses AI. Check for mistakes.
@eywalker
Copy link
Copy Markdown
Contributor Author

eywalker commented Mar 7, 2026

This is superseded by PR #9

@eywalker eywalker closed this Mar 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants