Skip to content

feat: complete logical hashing implementation with platform-independent encoding and type normalization#9

Closed
eywalker wants to merge 17 commits intomainfrom
fix/bitvec-lsb0-platform-independent-v2
Closed

feat: complete logical hashing implementation with platform-independent encoding and type normalization#9
eywalker wants to merge 17 commits intomainfrom
fix/bitvec-lsb0-platform-independent-v2

Conversation

@eywalker
Copy link
Copy Markdown
Contributor

@eywalker eywalker commented Mar 7, 2026

Summary

Complete rewrite of the Arrow logical hashing implementation, delivering all guarantees described in the design spec. Supersedes #6.

Logical hashing core (arrow_digester_core.rs)

  • Struct field-order independence: Struct children are sorted alphabetically by name during both schema serialization (data_type_to_value) and data hashing (array_digest_update). Composite struct hashing via finalize_child_into_data — each child is independently hashed then finalized into the parent's data stream.
  • Struct-level null propagation: When a struct is nullable, struct-level nulls are ANDed with child validity bitmaps so child digests reflect the combined nullability.
  • Dictionary resolution: Dictionary-encoded arrays are cast to their plain value type before hashing, so DictionaryArray<Int32, Utf8> hashes identically to StringArray.
  • List structural hashing: List element counts are written to a separate structural digest, separating structure from leaf data. This prevents collisions between differently-partitioned lists (e.g., [[1,2],[3]] vs [[1],[2,3]]).
  • Recursive type normalization: normalize_data_type() / normalize_schema() recursively normalize Utf8→LargeUtf8, Binary→LargeBinary, List→LargeList, Dictionary→value_type at all nesting levels. Applied at schema storage (new()), schema serialization (data_type_to_value), and the data path (array_digest_update casts arrays to large equivalents).
  • Platform-independent bit ordering: Validity bitmaps use BitVec<u8, Lsb0> (1-byte words) instead of BitVec<usize, Lsb0> (platform-dependent word size). Boolean packing uses Lsb0 to match Arrow's native layout. Bitmap lengths are cast to u64 before serialization.
  • Logical schema comparison: update() compares schemas via serialized_schema() (canonical JSON), allowing batches with different column orders or equivalent type variants.
  • Canonical schema JSON: element_type_to_value omits Arrow-internal field names (e.g., "item") for list/fixed-size-list elements. sort_json_value recursively sorts all JSON object keys.

API changes

  • ArrowDigester::new() and ArrowDigesterCore::new() now take &Schema instead of Schema.

Documentation

  • docs/design-spec.md: Rewritten to match actual implementation (Lsb0, structural digest, composite struct hashing, element_type_to_value, all known issues resolved).
  • docs/byte-layout-spec.md: Comprehensive byte-level specification with 14 manually verified worked examples.
  • CLAUDE.md: Project instructions for cargo formatting and TDD workflow.

Tests

  • 85 total tests (32 unit + 39 integration + 14 byte-level verification)
  • Type equivalence: Utf8/LargeUtf8, Binary/LargeBinary, List/LargeList at array, schema, and record-batch levels
  • Deeply nested normalization: List(Struct({name: Utf8})) vs LargeList(Struct({name: LargeUtf8}))
  • Streaming with type-equivalent and column-reordered schemas
  • Dictionary equivalence with and without nulls
  • Struct field-order independence at schema and record-batch levels
  • Collision resistance for binary, string, and list partitioning

Test plan

  • cargo test — 85 tests pass
  • cargo clippy --tests — no errors
  • cargo fmt --check — formatted

🤖 Generated with Claude Code

claude and others added 8 commits March 5, 2026 12:05
Address all 7 design-spec issues to make starfix produce identical
hashes for logically equivalent Arrow tables regardless of column order,
struct field order, encoding, or type variant.

Core implementation changes (src/arrow_digester_core.rs):
- Issue 1: Sort struct fields alphabetically in data_type_to_value
- Issue 2: Apply sort_json_value recursively for deterministic JSON
- Issue 3: Use u64 (not usize) for binary length prefixes
- Issue 4: Remove NULL_BYTES sentinel from binary/string nullable paths
- Issue 5: Canonicalize Binary→LargeBinary, Utf8→LargeUtf8, List→LargeList
- Issue 6: Resolve dictionary arrays to plain arrays before hashing
- Issue 7: Use logical schema comparison in update() (canonical serialization)

Also improved schema JSON format for cross-language stability by dropping
Arrow-internal field names (e.g. "item") from List element serialization.

All 13 previously-ignored tests now pass. Updated golden hash values and
golden schema JSON to reflect the new canonical serialization.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
Add docs/byte-layout-spec.md describing the exact byte-level serialization
for schema JSON, fixed-size types, booleans, variable-length types, lists,
validity bitmaps, and the final combining digest. Every byte fed into
SHA-256 is specified, making cross-language reimplementation possible.

Add 10 verification tests in tests/digest_bytes.rs that manually construct
the expected SHA-256 hash from raw bytes and assert equality with the
library output. Covers:
- Example A: two-column record batch (Int32 + nullable LargeUtf8)
- Example B: boolean array with nulls (Msb0 bit packing)
- Example C: non-nullable Int32 array
- Example D: binary array with type canonicalization (Binary→LargeBinary)
- Example E: column-order independence proof
- Example F: Utf8/LargeUtf8 type equivalence proof
- Example G: nullable Int32 with nulls
- Example H: nullable string array with nulls and type canonicalization
- Example I: empty table (schema only, no data)
- Example J: multi-batch streaming equals single combined batch

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
…t examples

Implement DataType::Struct in array_digest_update for composite hashing
of struct arrays (previously todo!()). Struct children are sorted
alphabetically, each gets an independent digest that is finalized into
the parent's data stream. Struct-level nulls propagate to children via
combined validity buffers to avoid hashing undefined data.

Add finalize_child_into_data helper for writing child digest bytes into
a parent's data stream. Add four new manual verification tests (Examples
K-N) covering struct columns in record batches, hash_array on structs
with and without nulls, and list-of-struct columns. Update byte-layout
spec with corresponding worked examples and updated Section 3.5.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
Refactor DigestBufferType from enum to struct with optional `structural`
digest field. For list columns, element counts (sizes) now accumulate in
a separate SHA-256 stream from leaf data, producing: null_bits ||
structural_digest || leaf_digest at finalization. This cleanly separates
structure from data, making collision prevention easier to reason about
while preserving streaming compatibility. Non-list types are unchanged.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
List types now separate element counts into a dedicated structural SHA-256
digest stream, while leaf data flows into the data digest. This ensures
differently-grouped lists (e.g. [[1,2],[3]] vs [[1],[2,3]]) produce
different hashes even when their leaf values are identical.

Updated sections: field digest buffer description (Section 3), list types
(Section 3.4), struct composite children (Section 3.5), finalization
(Section 4), hash_array API (Section 6), and Example N.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
Add clippy expects for similar_names, redundant_clone, and absolute_paths
in digest_bytes tests. Run cargo fmt to fix all formatting issues across
source and test files.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
Add four examples that had tests but were missing from the spec:
- Example G: Nullable Int32 array with nulls (hash_array API)
- Example H: Nullable String array with nulls and type canonicalization
- Example I: Empty table with no data batches
- Example J: Multi-batch streaming batch-split independence

All 14 byte-level spec tests (A-N) now have corresponding worked examples
in the documentation.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
…ordering

- Change validity bitmap from `BitVec` (default `BitVec<usize, Lsb0>`,
  platform-dependent word size) to `BitVec<u8, Lsb0>` (1-byte words,
  platform-independent)
- Change boolean value packing from `BitVec<u8, Msb0>` to
  `BitVec<u8, Lsb0>` to match Arrow's native bit layout
- Cast `null_bit_vec.len()` to `u64` before `to_le_bytes()` in both
  `finalize_digest` and `finalize_child_into_data` for consistent
  8-byte length encoding across platforms

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 7, 2026 03:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Starfix’s Arrow hashing to be platform-independent by standardizing validity bitmap storage/serialization and aligning boolean bit-packing with Arrow’s native LSB-first layout.

Changes:

  • Switch validity bitmap storage from default BitVec (platform word-sized) to BitVec<u8, Lsb0> for deterministic cross-platform encoding.
  • Change boolean value packing from Msb0 to Lsb0 to match Arrow’s boolean buffer bit layout.
  • Encode validity bit-length as u64 before to_le_bytes() to ensure an 8-byte length prefix on all platforms.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
src/arrow_digester_core.rs Standardizes bitmap storage and length serialization; updates boolean packing order.
tests/arrow_digester.rs Updates golden hash expectations to match the new canonical encoding.
tests/digest_bytes.rs Updates expected outputs, but currently hardcodes digests in places that used to validate the manual byte-level construction.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/digest_bytes.rs
Comment thread tests/digest_bytes.rs
Comment thread tests/digest_bytes.rs
Comment thread tests/digest_bytes.rs Outdated
Comment thread tests/digest_bytes.rs Outdated
Comment thread tests/digest_bytes.rs Outdated
Comment thread tests/digest_bytes.rs Outdated
Comment thread tests/digest_bytes.rs Outdated
eywalker and others added 3 commits March 6, 2026 19:59
Cast Utf8→LargeUtf8, Binary→LargeBinary, List→LargeList at the top of
array_digest_update so every code path goes through a single canonical
representation.  Inner element types are normalized recursively when
hash_list_array re-enters array_digest_update for each sub-array.

Also updates the design spec to match the current implementation
(Lsb0 booleans, structural digest for lists, composite struct hashing,
element_type_to_value, resolved known issues) and adds equivalence tests
for List/LargeList arrays and record batches, plus Utf8/LargeUtf8
record batches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sive type normalization

Introduce normalize_data_type(), normalize_field(), and normalize_schema()
as reusable functions that recursively normalize Arrow types to their
canonical large equivalents (Utf8→LargeUtf8, Binary→LargeBinary,
List→LargeList, Dictionary→value type) at all nesting levels including
struct children, list elements, and map entries.

Apply normalization at every boundary:
- Schema is normalized at ArrowDigesterCore::new() so all stored state
  uses canonical types
- data_type_to_value() uses normalize_data_type before serialization
- hash_array() normalizes the effective type for metadata
- array_digest_update() casts arrays to large equivalents in the data path

API change: ArrowDigester::new() and ArrowDigesterCore::new() now take
&Schema instead of Schema by value, since the input is normalized
internally and the original is not consumed.

Add deeply nested normalization tests:
- List(Utf8) vs LargeList(LargeUtf8) array and schema equivalence
- Struct({items: List(Utf8), name: Utf8}) vs Struct({items: LargeList(LargeUtf8), name: LargeUtf8}) record batch
- Streaming with type-equivalent schemas (Utf8 digester accepting LargeUtf8 batch)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@eywalker eywalker changed the title fix: use BitVec<u8, Lsb0> for platform-independent bit ordering fix: platform-independent bit ordering and recursive type normalization Mar 7, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/digest_bytes.rs Outdated
Comment thread tests/digest_bytes.rs Outdated
Comment thread tests/digest_bytes.rs Outdated
Comment thread src/arrow_digester_core.rs
Comment thread tests/digest_bytes.rs Outdated
Comment thread tests/digest_bytes.rs Outdated
Comment thread tests/digest_bytes.rs Outdated
Comment thread tests/digest_bytes.rs Outdated
Comment thread tests/digest_bytes.rs Outdated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@eywalker eywalker changed the title fix: platform-independent bit ordering and recursive type normalization feat: complete logical hashing implementation with platform-independent encoding and type normalization Mar 7, 2026
@eywalker eywalker changed the base branch from claude/fix-logical-hashing-tests-Z52tP to main March 7, 2026 04:28
@eywalker eywalker requested review from Synicix and Copilot March 7, 2026 04:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/byte-layout-spec.md Outdated
Comment thread docs/byte-layout-spec.md
Comment thread docs/byte-layout-spec.md Outdated
Comment thread tests/digest_bytes.rs
Comment thread tests/digest_bytes.rs Outdated
Comment thread tests/digest_bytes.rs Outdated
Comment thread src/arrow_digester_core.rs
- Cache serialized_schema in ArrowDigesterCore to avoid re-serializing
  on every update() call; remove now-unused schema field
- Add clarifying comment on the (normalized_type, cast_array) lifetime
  extension pattern in array_digest_update
- Fix 8 digest_bytes tests: change validity types from usize to u8/u64,
  fix boolean packing from Msb0 to Lsb0, rename _expected to expected
  and assert against manual computation instead of hardcoded byte vectors
- Update byte-layout-spec.md: BitVec<u8, Lsb0> throughout, u8 validity
  words (1 byte) instead of usize (8 bytes), Lsb0 boolean packing,
  platform-independent hashes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +438 to +462
// These variables extend the lifetime of cast results. They are only
// initialized (and read) in branches that perform a cast; the default
// branch never touches them, which Rust's initialization analysis accepts.
let (normalized_type, cast_array);
let (effective_type, effective_array): (&DataType, &dyn Array) = match data_type {
DataType::Utf8 => {
normalized_type = DataType::LargeUtf8;
cast_array =
cast(array, &normalized_type).expect("Failed to cast Utf8 to LargeUtf8");
(&normalized_type, cast_array.as_ref())
}
DataType::Binary => {
normalized_type = DataType::LargeBinary;
cast_array =
cast(array, &normalized_type).expect("Failed to cast Binary to LargeBinary");
(&normalized_type, cast_array.as_ref())
}
DataType::List(field) => {
normalized_type = DataType::LargeList(Arc::clone(field));
cast_array =
cast(array, &normalized_type).expect("Failed to cast List to LargeList");
(&normalized_type, cast_array.as_ref())
}
_ => (data_type, array),
};
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array_digest_update declares normalized_type/cast_array without initializing them on the _ => (data_type, array) path. In Rust, locals with Drop must be definitely initialized before the end of scope, so this pattern will fail to compile. Refactor the normalization block to avoid partially-initialized locals (e.g., have the match return an Option<ArrayRef> cast-guard plus a DataType, or scope each cast inside its branch and return (effective_type, effective_array, cast_guard) so all locals are initialized on every path).

Suggested change
// These variables extend the lifetime of cast results. They are only
// initialized (and read) in branches that perform a cast; the default
// branch never touches them, which Rust's initialization analysis accepts.
let (normalized_type, cast_array);
let (effective_type, effective_array): (&DataType, &dyn Array) = match data_type {
DataType::Utf8 => {
normalized_type = DataType::LargeUtf8;
cast_array =
cast(array, &normalized_type).expect("Failed to cast Utf8 to LargeUtf8");
(&normalized_type, cast_array.as_ref())
}
DataType::Binary => {
normalized_type = DataType::LargeBinary;
cast_array =
cast(array, &normalized_type).expect("Failed to cast Binary to LargeBinary");
(&normalized_type, cast_array.as_ref())
}
DataType::List(field) => {
normalized_type = DataType::LargeList(Arc::clone(field));
cast_array =
cast(array, &normalized_type).expect("Failed to cast List to LargeList");
(&normalized_type, cast_array.as_ref())
}
_ => (data_type, array),
};
// These variables extend the lifetime of cast results by keeping them
// owned in a guard tuple returned from the match. This avoids partially
// initialized locals while still providing a unified view for hashing.
let (effective_type, effective_array, _cast_guard): (&DataType, &dyn Array, Option<(DataType, arrow::array::ArrayRef)>) =
match data_type {
DataType::Utf8 => {
let normalized_type = DataType::LargeUtf8;
let cast_array =
cast(array, &normalized_type).expect("Failed to cast Utf8 to LargeUtf8");
(&normalized_type, cast_array.as_ref(), Some((normalized_type, cast_array)))
}
DataType::Binary => {
let normalized_type = DataType::LargeBinary;
let cast_array = cast(array, &normalized_type)
.expect("Failed to cast Binary to LargeBinary");
(&normalized_type, cast_array.as_ref(), Some((normalized_type, cast_array)))
}
DataType::List(field) => {
let normalized_type = DataType::LargeList(Arc::clone(field));
let cast_array =
cast(array, &normalized_type).expect("Failed to cast List to LargeList");
(&normalized_type, cast_array.as_ref(), Some((normalized_type, cast_array)))
}
_ => (data_type, array, None),
};

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern compiles and works correctly — Rust's definite initialization analysis accepts it because normalized_type and cast_array are only read in branches that initialize them. The default branch returns (data_type, array) without touching either variable. A clarifying comment was added in 49c61c5 explaining the lifetime-extension intent. The code compiles cleanly under cargo clippy --tests -- -D warnings.

Comment thread tests/digest_bytes.rs Outdated

// Field "name" (nullable → bit_count + validity words + data digest)
final_digest.update(bit_count.to_le_bytes()); // 02 00 00 00 00 00 00 00
final_digest.update(validity_word.to_be_bytes()); // 00 00 00 00 00 00 00 01
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inline byte comment for validity_word.to_be_bytes() is incorrect: validity_word is a u8, so to_be_bytes() writes a single byte (0x01 here), not an 8-byte big-endian word. This comment is misleading for a byte-level spec test and should be updated to reflect the actual emitted bytes.

Suggested change
final_digest.update(validity_word.to_be_bytes()); // 00 00 00 00 00 00 00 01
final_digest.update(validity_word.to_be_bytes()); // 01

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in c8272c9 — inline comment now correctly shows // 01 for u8 to_be_bytes().

eywalker and others added 2 commits March 6, 2026 22:29
Covers all identified gaps: unimplemented data types (Timestamp, Duration,
Interval, FixedSizeList, Map, Null, Union, RunEndEncoded, View types),
missing test coverage (multi-word validity bitmaps, nullable list elements),
and documentation gaps (metadata exclusion, platform considerations).

Organized into three tiers with flagged design decisions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The comment on validity_word.to_be_bytes() still showed the old 8-byte
usize representation (00 00 00 00 00 00 00 01). Since validity_word is
now u8, to_be_bytes() produces a single byte (01).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@eywalker
Copy link
Copy Markdown
Contributor Author

eywalker commented Mar 9, 2026

This PR is superseded by #10

@eywalker eywalker closed this Mar 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants