
feat: normalize logically equivalent types (Utf8→LargeUtf8, etc.)#8

Closed
eywalker wants to merge 1 commit into fix/bitvec-lsb0-platform-independent from fix/normalize-logical-types

Conversation

Contributor

@eywalker eywalker commented Mar 7, 2026

Summary

  • Add normalize_data_type() that canonicalizes Arrow types differing only in offset size: Utf8→LargeUtf8, Binary→LargeBinary, List→LargeList
  • Applied in both schema serialization (data_type_to_value) and array hashing (hash_array), so schemas and arrays using either variant produce identical hashes
  • Un-ignores all Issue 5 tests, which now pass (6 tests)
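To illustrate the intended behavior, here is a minimal sketch assuming the ArrowDigester::hash_array API referenced in the review suggestion below; the example values are hypothetical:

```rust
use arrow::array::{LargeStringArray, StringArray};

// Sketch only: after normalization, a Utf8 array and a LargeUtf8 array
// holding the same logical values should yield identical digests.
let utf8 = StringArray::from(vec![Some("a"), None, Some("b")]);
let large_utf8 = LargeStringArray::from(vec![Some("a"), None, Some("b")]);
assert_eq!(
    ArrowDigester::hash_array(&utf8),
    ArrowDigester::hash_array(&large_utf8)
);
```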

Test plan

  • All 32 unit tests pass
  • 25 integration tests pass (7 remaining ignored are Issues 1, 6, 7)
  • Cross-language golden values verified against nauticalab/starfix-python

🤖 Generated with Claude Code

Add `normalize_data_type()` that canonicalizes types differing only in
offset size: Utf8→LargeUtf8, Binary→LargeBinary, List→LargeList.

Applied in both `data_type_to_value` (schema serialization) and
`hash_array` (array type prefix), so schemas and arrays using either
variant produce identical hashes.

Un-ignores all Issue 5 tests which now pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 7, 2026 03:23

Copilot AI left a comment


Pull request overview

This PR introduces canonicalization of Arrow DataTypes that are logically equivalent but differ by offset size (e.g., Utf8 vs LargeUtf8) so that schema serialization and array hashing produce identical digests across these variants.

Changes:

  • Added normalize_data_type() to canonicalize Utf8→LargeUtf8, Binary→LargeBinary, and List→LargeList.
  • Applied normalization to schema serialization (data_type_to_value) and array hashing (hash_array) to align hashes across equivalent types.
  • Updated golden outputs and un-ignored Issue 5 tests, updating expected hash vectors accordingly.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • src/arrow_digester_core.rs: Adds type normalization and applies it in schema serialization and hash_array() metadata hashing.
  • tests/arrow_digester.rs: Updates expected digests and enables Issue 5 equality tests for normalized type variants.
  • tests/golden_files/schema_serialization_pretty.json: Updates the schema serialization golden file to reflect canonical Large* forms.


Comment on lines +208 to +214
fn normalize_data_type(data_type: &DataType) -> DataType {
    match data_type {
        DataType::Utf8 => DataType::LargeUtf8,
        DataType::Binary => DataType::LargeBinary,
        DataType::List(field) => DataType::LargeList(field.clone()),
        _ => data_type.clone(),
    }
}

Copilot AI Mar 7, 2026


normalize_data_type() only normalizes the top-level DataType. In hash_array(), this means nested logically-equivalent types can still hash differently (e.g., List<Utf8> vs LargeList<LargeUtf8> will serialize different DataType metadata even though the underlying values are equivalent). Consider making normalization recursive for list element fields (and also normalizing the child field type for LargeList), or reusing data_type_to_value()/canonical JSON for array type metadata so nested offset-size variants are canonicalized too.
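For illustration, a recursive variant could look like the sketch below. This is not the PR's code; the helper name and the Field/Arc handling are assumptions based on common arrow-rs conventions:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field};

// Sketch of recursive normalization: list element types are normalized too,
// so List<Utf8> and LargeList<LargeUtf8> canonicalize to the same DataType.
fn normalize_data_type_recursive(data_type: &DataType) -> DataType {
    match data_type {
        DataType::Utf8 => DataType::LargeUtf8,
        DataType::Binary => DataType::LargeBinary,
        DataType::List(field) | DataType::LargeList(field) => {
            let element = Field::new(
                field.name().to_string(),
                normalize_data_type_recursive(field.data_type()),
                field.is_nullable(),
            );
            DataType::LargeList(Arc::new(element))
        }
        _ => data_type.clone(),
    }
}
```

A real implementation would also preserve field metadata and cover other nested types (Struct, Map, FixedSizeList); the sketch only handles the cases the comment raises.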

Comment thread on tests/arrow_digester.rs (diff context for the comment below):
"00000105fc3ecc3e20fea732e2a4bedbbd58ab40b5d1f19ca324b5f3d8116b21c0d649"
"00000186ba22789af5e728982c0ed5c78dcc382e8cc9124bcdbd794638ee05a79f6796"
);


Copilot AI Mar 7, 2026


Type normalization is now covered for Binary/LargeBinary and Utf8/LargeUtf8 arrays, but there isn’t an analogous assertion for List vs LargeList arrays producing the same hash. Adding a small LargeListArray case here (with the same logical values as list_array) would prevent regressions in the new List→LargeList normalization behavior.

Suggested change
// LargeList array with the same logical values should produce the same hash
let large_list_array = LargeListArray::from_iter_primitive::<Int32Type, _, _>(vec![
    Some(vec![Some(1), Some(2), Some(3)]),
    None,
    Some(vec![Some(4), Some(5)]),
    Some(vec![Some(6)]),
]);
let hash_large = hex::encode(ArrowDigester::hash_array(&large_list_array));
assert_eq!(
    hash_large, hash,
    "ListArray and LargeListArray with the same logical values should hash identically"
);

Contributor Author

eywalker commented Mar 7, 2026

superseded by #9

@eywalker eywalker closed this Mar 7, 2026
