feat: normalize logically equivalent types (Utf8→LargeUtf8, etc.) #8

eywalker wants to merge 1 commit into `fix/bitvec-lsb0-platform-independent` from
Conversation
Add `normalize_data_type()` that canonicalizes types differing only in offset size: Utf8→LargeUtf8, Binary→LargeBinary, List→LargeList. Applied in both `data_type_to_value` (schema serialization) and `hash_array` (array type prefix), so schemas and arrays using either variant produce identical hashes. Un-ignores all Issue 5 tests which now pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This PR introduces canonicalization of Arrow DataTypes that are logically equivalent but differ by offset size (e.g., Utf8 vs LargeUtf8) so that schema serialization and array hashing produce identical digests across these variants.
Changes:
- Added `normalize_data_type()` to canonicalize `Utf8`→`LargeUtf8`, `Binary`→`LargeBinary`, and `List`→`LargeList`.
- Applied normalization to schema serialization (`data_type_to_value`) and array hashing (`hash_array`) to align hashes across equivalent types.
- Updated golden outputs and un-ignored Issue 5 tests, updating expected hash vectors accordingly.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/arrow_digester_core.rs` | Adds type normalization and applies it in schema serialization and `hash_array()` metadata hashing. |
| `tests/arrow_digester.rs` | Updates expected digests and enables Issue 5 equality tests for normalized type variants. |
| `tests/golden_files/schema_serialization_pretty.json` | Updates schema serialization golden file to reflect canonical `Large*` forms. |
```rust
fn normalize_data_type(data_type: &DataType) -> DataType {
    match data_type {
        DataType::Utf8 => DataType::LargeUtf8,
        DataType::Binary => DataType::LargeBinary,
        DataType::List(field) => DataType::LargeList(field.clone()),
        _ => data_type.clone(),
    }
}
```
normalize_data_type() only normalizes the top-level DataType. In hash_array(), this means nested logically-equivalent types can still hash differently (e.g., List<Utf8> vs LargeList<LargeUtf8> will serialize different DataType metadata even though the underlying values are equivalent). Consider making normalization recursive for list element fields (and also normalizing the child field type for LargeList), or reusing data_type_to_value()/canonical JSON for array type metadata so nested offset-size variants are canonicalized too.
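The recursive fix the review suggests could look like the following. This is a minimal sketch, not the PR's code: `DataType` here is a toy stand-in enum (element types boxed directly, no `Field` wrapper) so the example is self-contained; the real implementation would match on `arrow_schema::DataType` and rebuild the child `Field`.

```rust
// Toy stand-in for arrow's DataType enum, just to illustrate recursion.
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Binary,
    LargeBinary,
    List(Box<DataType>),
    LargeList(Box<DataType>),
}

fn normalize_data_type(dt: &DataType) -> DataType {
    match dt {
        DataType::Utf8 => DataType::LargeUtf8,
        DataType::Binary => DataType::LargeBinary,
        // Match BOTH list variants and recurse into the element type,
        // so List<Utf8> and LargeList<LargeUtf8> canonicalize identically.
        DataType::List(elem) | DataType::LargeList(elem) => {
            DataType::LargeList(Box::new(normalize_data_type(elem)))
        }
        other => other.clone(),
    }
}

fn main() {
    let a = DataType::List(Box::new(DataType::Utf8));
    let b = DataType::LargeList(Box::new(DataType::LargeUtf8));
    // Nested offset-size variants now normalize to the same canonical form.
    assert_eq!(normalize_data_type(&a), normalize_data_type(&b));
    println!("nested variants normalize identically");
}
```

The key difference from the PR's version is that both `List` and `LargeList` arms recurse, so normalization is a fixed point: applying it to an already-large type is a no-op.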
| "00000105fc3ecc3e20fea732e2a4bedbbd58ab40b5d1f19ca324b5f3d8116b21c0d649" | ||
| "00000186ba22789af5e728982c0ed5c78dcc382e8cc9124bcdbd794638ee05a79f6796" | ||
| ); | ||
|
|
Type normalization is now covered for Binary/LargeBinary and Utf8/LargeUtf8 arrays, but there isn’t an analogous assertion for List vs LargeList arrays producing the same hash. Adding a small LargeListArray case here (with the same logical values as list_array) would prevent regressions in the new List→LargeList normalization behavior.
```rust
// LargeList array with the same logical values should produce the same hash
let large_list_array = LargeListArray::from_iter_primitive::<Int32Type, _, _>(vec![
    Some(vec![Some(1), Some(2), Some(3)]),
    None,
    Some(vec![Some(4), Some(5)]),
    Some(vec![Some(6)]),
]);
let hash_large = hex::encode(ArrowDigester::hash_array(&large_list_array));
assert_eq!(
    hash_large, hash,
    "ListArray and LargeListArray with the same logical values should hash identically"
);
```
superseded by #9 |
Summary

- Added `normalize_data_type()` that canonicalizes Arrow types differing only in offset size: `Utf8`→`LargeUtf8`, `Binary`→`LargeBinary`, `List`→`LargeList`
- Applied in both schema serialization (`data_type_to_value`) and array hashing (`hash_array`), so schemas and arrays using either variant produce identical hashes

Test plan
🤖 Generated with Claude Code