Implement Starfix byte-level specification with type canonicalization#6
Closed
Conversation
Address all 7 design-spec issues to make starfix produce identical hashes for logically equivalent Arrow tables regardless of column order, struct field order, encoding, or type variant.

Core implementation changes (src/arrow_digester_core.rs):
- Issue 1: Sort struct fields alphabetically in data_type_to_value
- Issue 2: Apply sort_json_value recursively for deterministic JSON
- Issue 3: Use u64 (not usize) for binary length prefixes
- Issue 4: Remove NULL_BYTES sentinel from binary/string nullable paths
- Issue 5: Canonicalize Binary→LargeBinary, Utf8→LargeUtf8, List→LargeList
- Issue 6: Resolve dictionary arrays to plain arrays before hashing
- Issue 7: Use logical schema comparison in update() (canonical serialization)

Also improved the schema JSON format for cross-language stability by dropping Arrow-internal field names (e.g. "item") from List element serialization.

All 13 previously-ignored tests now pass. Updated golden hash values and golden schema JSON to reflect the new canonical serialization.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
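The canonicalization rules behind Issues 1 and 5 can be sketched on a simplified type model. The `Ty` enum and `canonicalize` function below are illustrative stand-ins, not the crate's actual Arrow types or API:

```rust
use std::collections::BTreeMap;

// Hypothetical mini model of Arrow logical types, for illustration only.
#[derive(Clone, Debug, PartialEq)]
enum Ty {
    Int32,
    Binary,
    LargeBinary,
    Utf8,
    LargeUtf8,
    List(Box<Ty>),
    LargeList(Box<Ty>),
    Struct(Vec<(String, Ty)>),
}

// Issue 5: collapse each equivalence class to one canonical representative.
// Issue 1: sort struct fields alphabetically via a BTreeMap.
fn canonicalize(ty: &Ty) -> Ty {
    match ty {
        Ty::Binary | Ty::LargeBinary => Ty::LargeBinary,
        Ty::Utf8 | Ty::LargeUtf8 => Ty::LargeUtf8,
        Ty::List(inner) | Ty::LargeList(inner) => {
            Ty::LargeList(Box::new(canonicalize(inner)))
        }
        Ty::Struct(fields) => {
            let sorted: BTreeMap<String, Ty> = fields
                .iter()
                .map(|(name, t)| (name.clone(), canonicalize(t)))
                .collect();
            Ty::Struct(sorted.into_iter().collect())
        }
        other => other.clone(),
    }
}

fn main() {
    // Same logical type, different field order and string variants.
    let a = Ty::Struct(vec![
        ("zip".to_string(), Ty::Utf8),
        ("city".to_string(), Ty::LargeUtf8),
    ]);
    let b = Ty::Struct(vec![
        ("city".to_string(), Ty::Utf8),
        ("zip".to_string(), Ty::LargeUtf8),
    ]);
    // Logically equivalent types canonicalize identically.
    assert_eq!(canonicalize(&a), canonicalize(&b));
    println!("canonical: {:?}", canonicalize(&a));
}
```

Because both struct values normalize to the same field order and the same `LargeUtf8` variant, their canonical forms compare equal, which is the property the digester relies on.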
Add docs/byte-layout-spec.md describing the exact byte-level serialization for schema JSON, fixed-size types, booleans, variable-length types, lists, validity bitmaps, and the final combining digest. Every byte fed into SHA-256 is specified, making cross-language reimplementation possible.

Add 10 verification tests in tests/digest_bytes.rs that manually construct the expected SHA-256 hash from raw bytes and assert equality with the library output. Covers:
- Example A: two-column record batch (Int32 + nullable LargeUtf8)
- Example B: boolean array with nulls (Msb0 bit packing)
- Example C: non-nullable Int32 array
- Example D: binary array with type canonicalization (Binary→LargeBinary)
- Example E: column-order independence proof
- Example F: Utf8/LargeUtf8 type equivalence proof
- Example G: nullable Int32 with nulls
- Example H: nullable string array with nulls and type canonicalization
- Example I: empty table (schema only, no data)
- Example J: multi-batch streaming equals single combined batch

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
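Two of the byte-layout rules (u64 little-endian length prefixes and Msb0 boolean packing) can be illustrated with a std-only sketch. The helper names here are hypothetical, not the spec's:

```rust
// u64 little-endian length prefix (8 bytes) followed by the raw UTF-8 bytes,
// regardless of whether the Arrow offsets were i32 (Utf8) or i64 (LargeUtf8).
fn serialize_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u64).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

// Msb0 packing: the first value occupies bit 7 of the first byte.
fn pack_bools_msb0(values: &[bool]) -> Vec<u8> {
    let mut bytes = vec![0u8; (values.len() + 7) / 8];
    for (i, &v) in values.iter().enumerate() {
        if v {
            bytes[i / 8] |= 1 << (7 - (i % 8));
        }
    }
    bytes
}

fn main() {
    let mut data = Vec::new();
    serialize_str(&mut data, "hi");
    // 8-byte little-endian length prefix, then the UTF-8 bytes.
    assert_eq!(data, vec![2, 0, 0, 0, 0, 0, 0, 0, b'h', b'i']);

    // true, false, true, true land in the high bits: 0b1011_0000.
    assert_eq!(pack_bools_msb0(&[true, false, true, true]), vec![0b1011_0000]);
    println!("ok");
}
```

These are the kinds of byte sequences the tests in tests/digest_bytes.rs feed directly into SHA-256 before comparing against the library output.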
…t examples

Implement DataType::Struct in array_digest_update for composite hashing of struct arrays (previously todo!()). Struct children are sorted alphabetically; each gets an independent digest that is finalized into the parent's data stream. Struct-level nulls propagate to children via combined validity buffers to avoid hashing undefined data.

Add finalize_child_into_data helper for writing child digest bytes into a parent's data stream.

Add four new manual verification tests (Examples K-N) covering struct columns in record batches, hash_array on structs with and without nulls, and list-of-struct columns. Update byte-layout spec with corresponding worked examples and updated Section 3.5.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
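A rough sketch of the finalize_child_into_data idea: each struct child accumulates its own digest stream, and at finalization the child's digest bytes are appended to the parent's data stream. Here std's `DefaultHasher` stands in for SHA-256, so digest widths are 8 bytes rather than the spec's 32:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Finalize a child's accumulated byte stream and write the resulting digest
// bytes into the parent's data stream. DefaultHasher is a stand-in only.
fn finalize_child_into_data(parent_data: &mut Vec<u8>, child_stream: &[u8]) {
    let mut h = DefaultHasher::new();
    h.write(child_stream);
    parent_data.extend_from_slice(&h.finish().to_le_bytes());
}

fn main() {
    // Children are processed in alphabetical order ("city" before "zip").
    let mut parent = Vec::new();
    finalize_child_into_data(&mut parent, b"city-bytes");
    finalize_child_into_data(&mut parent, b"zip-bytes");
    assert_eq!(parent.len(), 16); // two 8-byte child digests
    println!("parent stream: {:02x?}", parent);
}
```

Because each child is digested independently before entering the parent stream, reordering child processing (after the alphabetical sort) cannot change the parent's final bytes.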
Refactor DigestBufferType from enum to struct with optional `structural` digest field. For list columns, element counts (sizes) now accumulate in a separate SHA-256 stream from leaf data, producing: null_bits || structural_digest || leaf_digest at finalization. This cleanly separates structure from data, making collision prevention easier to reason about while preserving streaming compatibility. Non-list types are unchanged. https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
List types now separate element counts into a dedicated structural SHA-256 digest stream, while leaf data flows into the data digest. This ensures differently-grouped lists (e.g. [[1,2],[3]] vs [[1],[2,3]]) produce different hashes even when their leaf values are identical.

Updated sections:
- field digest buffer description (Section 3)
- list types (Section 3.4)
- struct composite children (Section 3.5)
- finalization (Section 4)
- hash_array API (Section 6)
- Example N

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
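The structural/data split can be demonstrated without any hashing: stream the element counts and the leaf bytes into separate buffers, and the two groupings from the example above diverge only in the structural stream. The function name below is illustrative:

```rust
// Accumulate the two byte streams a list column would feed into its
// structural and data SHA-256 digests (sketch; counts as u64 little-endian,
// leaves as i32 little-endian).
fn digest_streams(lists: &[Vec<i32>]) -> (Vec<u8>, Vec<u8>) {
    let mut structural = Vec::new(); // would feed the structural digest
    let mut data = Vec::new();       // would feed the leaf data digest
    for list in lists {
        structural.extend_from_slice(&(list.len() as u64).to_le_bytes());
        for v in list {
            data.extend_from_slice(&v.to_le_bytes());
        }
    }
    (structural, data)
}

fn main() {
    let (s1, d1) = digest_streams(&[vec![1, 2], vec![3]]);
    let (s2, d2) = digest_streams(&[vec![1], vec![2, 3]]);
    assert_eq!(d1, d2); // identical leaf bytes...
    assert_ne!(s1, s2); // ...but different structure, hence different hashes
    println!("ok");
}
```

Since the final digest combines both streams, equal leaf data with different grouping can no longer collide.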
Add clippy expects for similar_names, redundant_clone, and absolute_paths in digest_bytes tests. Run cargo fmt to fix all formatting issues across source and test files. https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
Add four examples that had tests but were missing from the spec:
- Example G: Nullable Int32 array with nulls (hash_array API)
- Example H: Nullable String array with nulls and type canonicalization
- Example I: Empty table with no data batches
- Example J: Multi-batch streaming batch-split independence

All 14 byte-level spec tests (A-N) now have corresponding worked examples in the documentation.

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX
Closing this PR as it has been superseded by PR #9, which includes all changes from this PR plus additional improvements.
Summary
This PR implements the complete Starfix byte-level hashing specification as documented in
docs/byte-layout-spec.md. The changes ensure deterministic, language-agnostic hashing of Apache Arrow schemas and record batches by precisely specifying every byte fed into SHA-256.

Key Changes
Type Canonicalization: Implement logical type equivalence classes:
- `Binary` and `LargeBinary` both canonicalize to `"LargeBinary"`
- `Utf8` and `LargeUtf8` both canonicalize to `"LargeUtf8"`
- `List` and `LargeList` both canonicalize to `{"LargeList": ...}`

Schema Serialization: Generate canonical JSON with fields in sorted order (via `BTreeMap`) and only the `name`, `data_type`, and `nullable` keys per field.

Field Digest Finalization: Implement proper nullable field serialization:
- length as `usize` little-endian (8 bytes)
- `usize` big-endian (8 bytes per word)

Dictionary Array Handling: Cast dictionary arrays to their value type before hashing to ensure consistent results regardless of encoding.

Comprehensive Test Suite: Add 10 worked examples (`example_a` through `example_j`) that manually compute expected SHA-256 hashes and verify library conformance.

Schema Validation: Update record batch schema validation to compare canonical serializations rather than exact object equality, allowing batches with reordered columns to be accepted.
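The canonical-serialization comparison can be sketched with a `BTreeMap`-backed schema: insertion order stops mattering, so reordered columns serialize to identical bytes. The `schema_json` helper below is hypothetical, not the crate's API:

```rust
use std::collections::BTreeMap;

// Serialize a schema as canonical JSON: fields iterate in sorted order, and
// each field carries only the name / data_type / nullable keys (sketch only).
fn schema_json(fields: &BTreeMap<String, (&str, bool)>) -> String {
    let parts: Vec<String> = fields
        .iter()
        .map(|(name, (dt, nullable))| {
            format!(
                "{{\"data_type\":\"{}\",\"name\":\"{}\",\"nullable\":{}}}",
                dt, name, nullable
            )
        })
        .collect();
    format!("[{}]", parts.join(","))
}

fn main() {
    let mut a = BTreeMap::new();
    a.insert("id".to_string(), ("Int32", false));
    a.insert("name".to_string(), ("LargeUtf8", true));

    let mut b = BTreeMap::new();
    b.insert("name".to_string(), ("LargeUtf8", true)); // inserted first
    b.insert("id".to_string(), ("Int32", false));

    // Column order no longer affects the serialized schema bytes.
    assert_eq!(schema_json(&a), schema_json(&b));
    println!("{}", schema_json(&a));
}
```

Comparing these canonical strings (rather than exact schema object equality) is what lets batches with reordered columns be accepted.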
Notable Implementation Details

- `BitVec` uses `Lsb0` (least significant bit first) for storage and `Msb0` (most significant bit first) for Boolean data packing
- Lengths are serialized as `u64` little-endian (8 bytes), regardless of Arrow's offset type
- Struct fields are sorted alphabetically at every nesting level (e.g. `address/city` before `address/zip`)
- Child digests are finalized as validity bytes (e.g. `0x00 0x00 0x01`) + 32-byte SHA-256 digest

https://claude.ai/code/session_01FdWd9bkZjS3c7oUuo8QSPX