fix: use BitVec<u8, Lsb0> for platform-independent bit ordering#7
fix: use BitVec<u8, Lsb0> for platform-independent bit ordering#7
Conversation
…ordering The validity bitmap and boolean value packing previously used inconsistent BitVec parameterizations: - Validity bitmap: default BitVec (BitVec<usize, Lsb0>) — platform-dependent word size (4 bytes on 32-bit, 8 bytes on 64-bit) - Boolean packing: BitVec<u8, Msb0> — fixed size but non-Arrow bit order This changes both to BitVec<u8, Lsb0>, which: 1. Matches Arrow's native validity bitmap layout (LSB-first, u8 words) 2. Is platform-independent (u8 words are always 1 byte) 3. Uses consistent bit ordering throughout the codebase 4. Casts bitmap length to u64 before to_le_bytes() for stable serialization Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR makes Arrow digest hashing deterministic across platforms by standardizing bitmap storage to BitVec<u8, Lsb0>, aligning boolean bit-packing with Arrow’s native LSB-first ordering, and ensuring bitmap length serialization is stable.
Changes:
- Switch validity bitmaps from default
BitVec(usize-backed) toBitVec<u8, Lsb0>for platform-independent raw storage. - Change boolean value packing from
Msb0toLsb0to match Arrow’s bit ordering. - Serialize bitmap bit-length as
u64before hashing to ensure consistent 8-byte output.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/arrow_digester_core.rs |
Standardizes bitmap representation and boolean packing; updates bitmap-length hashing for cross-platform stability. |
tests/arrow_digester.rs |
Updates golden digests to reflect the new stable bit ordering/serialization. |
Comments suppressed due to low confidence (1)
src/arrow_digester_core.rs:338
- Boolean hashing for non-nullable arrays only feeds the packed bytes into the digest. Because the last byte is padded, different boolean lengths can produce identical byte streams (e.g., len=1 and len=8 with the same first bit), leading to guaranteed hash collisions. Prefix the boolean bit-length (or element count) into the digest before updating with
bit_vec.as_raw_slice()so arrays with different lengths cannot collide.
for i in 0..bool_array.len() {
bit_vec.push(bool_array.value(i));
}
data_digest.update(bit_vec.as_raw_slice());
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
|
|
There was a problem hiding this comment.
There are two extra blank lines before the use crate::arrow_digester_core::{...}; import; please remove them to keep the test module formatting consistent.
| DigestBufferType::Nullable(null_bit_digest, data_digest) => { | ||
| final_digest.update(null_bit_digest.len().to_le_bytes()); | ||
| final_digest.update((null_bit_digest.len() as u64).to_le_bytes()); | ||
| for &word in null_bit_digest.as_raw_slice() { | ||
| final_digest.update(word.to_be_bytes()); | ||
| } |
There was a problem hiding this comment.
Now that the validity bitmap uses BitVec<u8, Lsb0>, iterating each byte and calling to_be_bytes() is redundant and adds per-byte update overhead. You can update the digest in one call with the raw slice (and drop the clippy::big_endian_bytes expect) to keep this path simpler and faster.
|
This is superseded by PR #9 |
Summary
BitVec(BitVec<usize, Lsb0>) toBitVec<u8, Lsb0>— fixes platform-dependent word size (usize is 4 bytes on 32-bit, 8 bytes on 64-bit)BitVec<u8, Msb0>toBitVec<u8, Lsb0>— consistent with Arrow's native bit orderinglen()tou64beforeto_le_bytes()for stable 8-byte output regardless of platformAll bit vectors now use
BitVec<u8, Lsb0>, matching Arrow's native validity bitmap layout.Test plan
🤖 Generated with Claude Code