
feat: normalize logically equivalent types (Utf8→LargeUtf8, etc.)#8

Closed
eywalker wants to merge 1 commit into fix/bitvec-lsb0-platform-independent from fix/normalize-logical-types

Conversation

Contributor

@eywalker eywalker commented Mar 7, 2026

Summary

  • Add normalize_data_type() that canonicalizes Arrow types differing only in offset size: Utf8→LargeUtf8, Binary→LargeBinary, List→LargeList
  • Applied in both schema serialization (data_type_to_value) and array hashing (hash_array), so schemas and arrays using either variant produce identical hashes
  • Un-ignores all Issue 5 tests, which now pass (6 tests)
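To illustrate the intended behavior, here is a minimal sketch assuming the ArrowDigester::hash_array API referenced in the review suggestion below; the example values are hypothetical:

```rust
use arrow::array::{LargeStringArray, StringArray};

// Sketch only: after normalization, a Utf8 array and a LargeUtf8 array
// holding the same logical values should yield identical digests.
let utf8 = StringArray::from(vec![Some("a"), None, Some("b")]);
let large_utf8 = LargeStringArray::from(vec![Some("a"), None, Some("b")]);
assert_eq!(
    ArrowDigester::hash_array(&utf8),
    ArrowDigester::hash_array(&large_utf8)
);
```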

Test plan

  • All 32 unit tests pass
  • 25 integration tests pass (7 remaining ignored are Issues 1, 6, 7)
  • Cross-language golden values verified against nauticalab/starfix-python

🤖 Generated with Claude Code

Add `normalize_data_type()` that canonicalizes types differing only in
offset size: Utf8→LargeUtf8, Binary→LargeBinary, List→LargeList.

Applied in both `data_type_to_value` (schema serialization) and
`hash_array` (array type prefix), so schemas and arrays using either
variant produce identical hashes.

Un-ignores all Issue 5 tests which now pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 7, 2026 03:23

Copilot AI left a comment


Pull request overview

This PR introduces canonicalization of Arrow DataTypes that are logically equivalent but differ by offset size (e.g., Utf8 vs LargeUtf8) so that schema serialization and array hashing produce identical digests across these variants.

Changes:

  • Added normalize_data_type() to canonicalize Utf8→LargeUtf8, Binary→LargeBinary, and List→LargeList.
  • Applied normalization to schema serialization (data_type_to_value) and array hashing (hash_array) to align hashes across equivalent types.
  • Updated golden outputs and un-ignored Issue 5 tests, updating expected hash vectors accordingly.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • src/arrow_digester_core.rs: Adds type normalization and applies it in schema serialization and hash_array() metadata hashing.
  • tests/arrow_digester.rs: Updates expected digests and enables Issue 5 equality tests for normalized type variants.
  • tests/golden_files/schema_serialization_pretty.json: Updates the schema serialization golden file to reflect canonical Large* forms.


Comment on lines +208 to +214
fn normalize_data_type(data_type: &DataType) -> DataType {
    match data_type {
        DataType::Utf8 => DataType::LargeUtf8,
        DataType::Binary => DataType::LargeBinary,
        DataType::List(field) => DataType::LargeList(field.clone()),
        _ => data_type.clone(),
    }
}

Copilot AI Mar 7, 2026


normalize_data_type() only normalizes the top-level DataType. In hash_array(), this means nested logically-equivalent types can still hash differently (e.g., List<Utf8> vs LargeList<LargeUtf8> will serialize different DataType metadata even though the underlying values are equivalent). Consider making normalization recursive for list element fields (and also normalizing the child field type for LargeList), or reusing data_type_to_value()/canonical JSON for array type metadata so nested offset-size variants are canonicalized too.
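For illustration, a recursive variant could look like the sketch below. This is not the PR's code; the helper name and the Field/Arc handling are assumptions based on common arrow-rs conventions:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field};

// Sketch of recursive normalization: list element types are normalized too,
// so List<Utf8> and LargeList<LargeUtf8> canonicalize to the same DataType.
fn normalize_data_type_recursive(data_type: &DataType) -> DataType {
    match data_type {
        DataType::Utf8 => DataType::LargeUtf8,
        DataType::Binary => DataType::LargeBinary,
        DataType::List(field) | DataType::LargeList(field) => {
            let element = Field::new(
                field.name().to_string(),
                normalize_data_type_recursive(field.data_type()),
                field.is_nullable(),
            );
            DataType::LargeList(Arc::new(element))
        }
        _ => data_type.clone(),
    }
}
```

A real implementation would also preserve field metadata and cover other nested types (Struct, Map, FixedSizeList); the sketch only handles the cases the comment raises.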

Comment thread on tests/arrow_digester.rs (diff context for the comment below):
"00000105fc3ecc3e20fea732e2a4bedbbd58ab40b5d1f19ca324b5f3d8116b21c0d649"
"00000186ba22789af5e728982c0ed5c78dcc382e8cc9124bcdbd794638ee05a79f6796"
);


Copilot AI Mar 7, 2026


Type normalization is now covered for Binary/LargeBinary and Utf8/LargeUtf8 arrays, but there isn’t an analogous assertion for List vs LargeList arrays producing the same hash. Adding a small LargeListArray case here (with the same logical values as list_array) would prevent regressions in the new List→LargeList normalization behavior.

Suggested change
// LargeList array with the same logical values should produce the same hash
let large_list_array = LargeListArray::from_iter_primitive::<Int32Type, _, _>(vec![
    Some(vec![Some(1), Some(2), Some(3)]),
    None,
    Some(vec![Some(4), Some(5)]),
    Some(vec![Some(6)]),
]);
let hash_large = hex::encode(ArrowDigester::hash_array(&large_list_array));
assert_eq!(
    hash_large, hash,
    "ListArray and LargeListArray with the same logical values should hash identically"
);

Contributor Author

eywalker commented Mar 7, 2026

superseded by #9

@eywalker eywalker closed this Mar 7, 2026
