diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..44548be --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,14 @@ +# Project Instructions + +## Formatting + +Always run `cargo fmt` before committing. Formatting is enforced in CI via GitHub Actions. + +## Test-Driven Development + +When implementing new features or fixing bugs: + +1. Write tests first that check the desired behavior. +2. Verify the new tests fail (confirming they catch the issue / check the right thing). +3. Implement the fix or feature. +4. Verify all previously failing tests now pass. diff --git a/docs/byte-layout-spec.md b/docs/byte-layout-spec.md new file mode 100644 index 0000000..40abf27 --- /dev/null +++ b/docs/byte-layout-spec.md @@ -0,0 +1,995 @@ +# Starfix Byte Layout Specification + +This document describes the **exact byte-level serialization** used by Starfix to compute deterministic hashes of Apache Arrow schemas and record batches. Every byte fed into SHA-256 is specified here, making it possible to implement a compatible hasher in any language. + +All multi-byte integers use **little-endian** byte order unless explicitly stated otherwise. + +--- + +## 1. Output Format + +Every Starfix hash is **35 bytes**: + +``` +[version: 3 bytes] [SHA-256 digest: 32 bytes] +``` + +The version prefix is currently `0x00 0x00 0x01` (version 0.0.1). + +When displayed as hex, a hash looks like: + +``` +000001 <64 hex chars of SHA-256> +``` + +--- + +## 2. Schema Serialization + +### 2.1 Canonical JSON String + +The schema is serialized as a **compact JSON string** (no whitespace) of an object where: + +- **Keys** are field names, sorted alphabetically (via `BTreeMap`). +- **Values** are objects with keys `"data_type"` and `"nullable"`, with JSON keys sorted alphabetically within every nested object (recursively). 
Because all JSON object keys are sorted recursively, the key order is always `"data_type"` before `"nullable"` (and `"data_type"` before `"name"` before `"nullable"` for struct children).

#### Type Canonicalization

Before serialization, these logical equivalence classes are collapsed:

| Arrow type(s) | Canonical JSON form |
|----------------------------|-------------------------------|
| `Binary`, `LargeBinary` | `"LargeBinary"` |
| `Utf8`, `LargeUtf8` | `"LargeUtf8"` |
| `List(f)`, `LargeList(f)` | `{"LargeList": <element>}` |
| `Dictionary(k, v)` | canonical form of `v` |

#### Nested Type Serialization

**Struct fields** are serialized as:
```json
{"Struct": [<child>, <child>, ...]}
```
Each child object: `{"data_type": ..., "name": "<field name>", "nullable": <bool>}`.

**List / LargeList elements** are serialized as:
```json
{"LargeList": {"data_type": ..., "nullable": <bool>}}
```
Note: the Arrow-internal field name (typically `"item"`) is **omitted** — only `data_type` and `nullable` are included.

**Primitive types** use Arrow's built-in serde:
- `"Int32"`, `"Boolean"`, `"Float64"`, `"LargeBinary"`, `"LargeUtf8"`, etc.
- `{"Decimal128": [38, 5]}`, `{"Time32": "Second"}`, etc.

### 2.2 Schema Digest

```
schema_digest = SHA-256(canonical_json_string_bytes)
```

The UTF-8 bytes of the JSON string are fed directly into SHA-256. The result is 32 bytes.

### 2.3 Concrete Example

Schema: `{name: LargeUtf8 nullable, age: Int32 non-nullable}`

Canonical JSON string (compact, keys sorted):
```
{"age":{"data_type":"Int32","nullable":false},"name":{"data_type":"LargeUtf8","nullable":true}}
```

Note: `"age"` comes before `"name"` alphabetically, and `"data_type"` comes before `"nullable"`.

```
schema_digest = SHA-256(b'{"age":{"data_type":"Int32","nullable":false},"name":{"data_type":"LargeUtf8","nullable":true}}')
```

---

## 3. Field Data Serialization

The schema is recursively decomposed into a `BTreeMap` of entries.
**Leaf fields** and **list intermediate nodes** get their own entries. **Struct fields are transparent** — they do not create entries themselves; instead, their null validity is AND-propagated to descendant entries, and their children are recursively traversed.

Each entry has a **digest buffer** containing up to three **optional** components:

| Component | Present when | Purpose |
|-----------|-------------|---------|
| `null_bits` (BitVec) | field is nullable | Tracks which elements are valid vs null |
| `structural` (SHA-256) | entry is a list type (`List` or `LargeList`) | Accumulates element counts (structure) |
| `data` (SHA-256) | leaf fields and list-leaf entries | Accumulates leaf data bytes |

There are five entry types:

| Entry type | `null_bits` | `structural` | `data` | Example |
|------------|:-----------:|:------------:|:------:|---------|
| **data-only** | — | — | yes | Non-nullable leaf field (e.g., `Int32`) |
| **validity + data** | yes | — | yes | Nullable leaf field |
| **validity-only** | yes | — | — | Nullable parent whose descendants have their own entries |
| **structural-only** | — | yes | — | Non-nullable list whose value type is a struct or nested list |
| **list_leaf** | optional | yes | yes | List whose value type is a leaf (e.g., `List<Int32>`) |

**Naming convention**: Struct adds `/fieldname` to the path. List adds a trailing `/`. Nested lists add `//`, etc.

This separation of structural information from leaf data ensures that list element boundaries are hashed independently from the values they contain. For example, `[[1,2],[3]]` and `[[1],[2,3]]` differ in their structural digest (element counts `[2,1]` vs `[1,2]`) even though their leaf data digest is identical (`[1,2,3]`).
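This separation can be cross-checked with a short sketch. The helper below is illustrative only (it is not Starfix code); it hashes a list-of-Int32 column's element counts and leaf values into two separate digests, using the byte rules detailed later (counts as u64 LE, values as i32 LE):

```python
import hashlib
import struct

def list_digests(rows):
    """Hash element counts (structural) and leaf values (data) separately."""
    structural = hashlib.sha256()
    data = hashlib.sha256()
    for row in rows:
        structural.update(struct.pack("<Q", len(row)))  # count as u64 LE
        for v in row:
            data.update(struct.pack("<i", v))           # value as i32 LE
    return structural.hexdigest(), data.hexdigest()

s1, d1 = list_digests([[1, 2], [3]])
s2, d2 = list_digests([[1], [2, 3]])
assert d1 == d2   # identical leaf data: [1, 2, 3] in both cases
assert s1 != s2   # different element counts: [2, 1] vs [1, 2]
```

Because the data digests collide by construction, only the structural digest distinguishes the two layouts — which is exactly why both components are kept.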
+ +### 3.1 Fixed-Size Types + +**Types**: `Int8`, `UInt8`, `Int16`, `UInt16`, `Int32`, `UInt32`, `Int64`, `UInt64`, `Float16`, `Float32`, `Float64`, `Date32`, `Date64`, `Time32(*)`, `Time64(*)`, `Decimal32`, `Decimal64`, `Decimal128`, `Decimal256`, `FixedSizeBinary(n)`. + +| Type | Bytes per element | +|------|-------------------| +| Int8 / UInt8 | 1 | +| Int16 / UInt16 / Float16 | 2 | +| Int32 / UInt32 / Float32 / Date32 / Decimal32 / Time32 | 4 | +| Int64 / UInt64 / Float64 / Date64 / Decimal64 / Time64 | 8 | +| Decimal128 | 16 | +| Decimal256 | 32 | +| FixedSizeBinary(n) | n | + +**Non-nullable path**: The entire contiguous byte buffer (all elements concatenated, little-endian) is fed into the data digest in a single update. + +**Nullable path**: +1. For each element `i`, push `is_valid(i)` (true=1, false=0) into the validity `BitVec`. +2. For each **valid** element, feed its little-endian bytes into the data digest. +3. **Null elements are skipped entirely** — no data bytes are fed. + +If a nullable field has no actual nulls (null buffer absent), all elements are marked valid and the entire buffer is fed in one update (same as non-nullable data path). + +### 3.2 Boolean Type + +Boolean values are **bit-packed** using **LSB-first** (`Lsb0`) ordering into bytes. + +**Non-nullable**: All values are packed sequentially into a `BitVec`, then the raw bytes are fed into the data digest. + +**Nullable**: +1. Extend the validity `BitVec` as usual. +2. Only **valid** values are packed (nulls are skipped). +3. The packed bytes are fed into the data digest. + +**Example**: `[true, NULL, false, true]` (nullable, 4 elements) +- Validity bits: `[1, 0, 1, 1]` +- Data bits (valid only): `[true, false, true]` → Lsb0 packed: `00000_1_0_1` = `0x05` +- Bytes fed to data digest: `[0x05]` + +### 3.3 Variable-Length Types (Binary, String) + +**Types**: `Binary`, `LargeBinary`, `Utf8`, `LargeUtf8`. 
Each element is serialized as:
```
[length as u64 little-endian: 8 bytes] [raw bytes: length bytes]
```

The length prefix is **always `u64`** (8 bytes, little-endian) regardless of the Arrow offset type.

**Non-nullable**: For each element, feed `(len as u64).to_le_bytes()` then the raw bytes.

**Nullable**:
1. Extend the validity `BitVec`.
2. For valid elements: feed length prefix + raw bytes.
3. For null elements: **skip entirely** — no bytes fed to data digest.

### 3.4 List Types (Record-Batch Path)

**Types**: `List(field)`, `LargeList(field)`.

List columns are **recursively decomposed** into separate BTreeMap entries. A list creates an intermediate entry at `path/` (path + delimiter). The value type is then recursively traversed to create further entries.

**Decomposition by value type:**

- **`List<leaf>`** (e.g., `List<Int32>`): The entry at `path/` is a **list-leaf** with both structural and data digests. List lengths go to structural; leaf values go to data.
- **`List<Struct<...>>`**: The entry at `path/` is **structural-only** (list lengths). The struct is transparent, and each struct child creates its own entry at `path//childname`.
- **`List<List<...>>`**: The entry at `path/` is structural-only. The inner list creates another entry at `path//`, and so on recursively.

**Nullable list columns**: The column-level entry at `path` (without trailing `/`) is **validity-only**, recording which rows are null vs valid. Null list elements are not traversed — no structural or data bytes are written for them.

**Traversal**: For each non-null list element, write the sub-array length (u64 LE) to the structural digest at `path/`, then recurse into the sub-array using the value type.
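The path rules for nested lists can be sketched as a tiny generator (a hypothetical illustration, not crate code): each list level appends one `/`, intermediate levels are structural-only, and the innermost level is a list-leaf when the value type is a leaf:

```python
def list_entry_paths(depth, path="col"):
    """Entry paths for a non-nullable list nested `depth` levels deep
    whose innermost value type is a leaf (e.g., Int32)."""
    entries = []
    for level in range(depth):
        path += "/"  # each list level adds a trailing delimiter
        kind = "list-leaf" if level == depth - 1 else "structural-only"
        entries.append((path, kind))
    return entries

# List<Int32>       -> [("col/", "list-leaf")]
# List<List<Int32>> -> [("col/", "structural-only"), ("col//", "list-leaf")]
```

A nullable column would additionally contribute a validity-only entry at the bare path `col`, as described above.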
+ +#### Concrete Example: Structural vs Leaf Separation + +For `LargeList` (non-nullable) with data `[[1,2],[3]]`: + +The single entry at `col/` is a list-leaf: + +``` +structural digest receives: + 02 00 00 00 00 00 00 00 (element 0: 2 items, u64 LE) + 01 00 00 00 00 00 00 00 (element 1: 1 item, u64 LE) + +data digest receives: + 01 00 00 00 (1 as i32 LE) + 02 00 00 00 (2 as i32 LE) + 03 00 00 00 (3 as i32 LE) +``` + +Compare with `[[1],[2,3]]`: same data digest but different structural digest — so the final hashes differ. + +### 3.5 Struct Types (Record-Batch Path) + +Struct fields are **transparent** in the record-batch path — they do not create a BTreeMap entry. Instead: + +1. **Children are traversed** in alphabetical order by field name. +2. **Struct-level nulls are AND-propagated** to all descendant entries. If a struct row is null, none of its children's data is hashed for that row, and the null is reflected in each descendant's effective validity. +3. Each child is recursively decomposed (leaf → data entry, list → structural entry, nested struct → recurse further). + +**Example**: A struct field `address` with children `city` (LargeUtf8) and `zip` (Int32) creates two leaf entries: `address/city` and `address/zip`. No entry exists for `address` itself. + +### 3.6 Dictionary-Encoded Arrays + +Dictionary arrays are **resolved to their plain equivalent** before hashing. The dictionary is unpacked so that the data stream is identical to a non-dictionary array with the same logical values. + +--- + +## 4. Field Digest Finalization + +After all record batches have been fed, each entry's digest buffer is finalized and fed into the **final combining digest**. Each entry may have up to three optional components, written in this fixed order (skipping absent components): + +``` +1. null_bits (if present — nullable entries only) +2. structural (if present — list entries only) +3. 
data (if present — leaf and list-leaf entries only) +``` + +### 4.1 Data-Only Entry + +``` +final_digest.update( SHA-256(data_bytes).finalize() ) // 32 bytes +``` + +### 4.2 Validity + Data Entry (Nullable Leaf) + +``` +final_digest.update( bit_count.to_le_bytes() ) // 8 bytes (u64 LE) +for each word in validity_bitvec.as_raw_slice(): // each word is u8 (1 byte) + final_digest.update( word.to_le_bytes() ) // 1 byte per word (u8, LE is trivial) +final_digest.update( SHA-256(data_bytes).finalize() ) // 32 bytes +``` + +### 4.3 Validity-Only Entry + +``` +final_digest.update( bit_count.to_le_bytes() ) // 8 bytes (u64 LE) +for each word in validity_bitvec.as_raw_slice(): + final_digest.update( word.to_le_bytes() ) // 1 byte per word (u8) +``` + +No structural or data digest is written. + +### 4.4 Structural-Only Entry + +``` +final_digest.update( SHA-256(structural_bytes).finalize() ) // 32 bytes (element counts) +``` + +### 4.5 List-Leaf Entry (Structural + Data) + +``` +final_digest.update( SHA-256(structural_bytes).finalize() ) // 32 bytes (element counts) +final_digest.update( SHA-256(data_bytes).finalize() ) // 32 bytes (leaf values) +``` + +If nullable, prepend null_bits before structural: + +``` +final_digest.update( bit_count.to_le_bytes() ) // 8 bytes (u64 LE) +for each word in validity_bitvec.as_raw_slice(): + final_digest.update( word.to_le_bytes() ) // 1 byte per word (u8) +final_digest.update( SHA-256(structural_bytes).finalize() ) // 32 bytes +final_digest.update( SHA-256(data_bytes).finalize() ) // 32 bytes +``` + +**Validity BitVec details** (applies to all entries with `null_bits`): +- Storage type: `u8` (1 byte per word). +- Bit order: `Lsb0` (least significant bit first within each word). +- `bit_count` = total number of elements (valid + null), serialized as `u64` little-endian (8 bytes). +- Each storage word is serialized as `u8` little-endian (trivially 1 byte). +- The last word may have unused high bits (zero-padded). + +--- + +## 5. 
Final Combining Digest + +The final hash is computed by feeding into a fresh SHA-256: + +``` +final_digest = SHA-256() + +// 1. Schema digest (32 bytes) +final_digest.update( schema_digest ) + +// 2. Field digests in alphabetical order of field path +for field_path in sorted(field_paths): + finalize field's DigestBufferType into final_digest (see Section 4) + +raw_hash = final_digest.finalize() // 32 bytes +output = [0x00, 0x00, 0x01] ++ raw_hash // 35 bytes +``` + +--- + +## 6. `hash_array` API + +The `hash_array` function hashes a single array (without a schema context). It uses the **same recursive decomposition** as the record-batch path, ensuring consistent hashing regardless of which API is used: + +``` +final_digest = SHA-256() + +// 1. Type metadata (canonical JSON string) +canonical_type = data_type_to_value(effective_data_type) +json_string = JSON.serialize(canonical_type) // compact, keys sorted +final_digest.update( json_string.as_bytes() ) + +// 2. Build BTreeMap entries from the type tree (same as record-batch path) +fields = extract_type_entries(effective_data_type, nullable, root_path="") + +// 3. Traverse and populate entries +traverse_and_update(effective_data_type, nullable, effective_array, "", fields) + +// 4. Finalize all entries into the digest (same order as record-batch finalize) +for (_, entry) in fields: + finalize_digest(final_digest, entry) // see Section 4 + +raw_hash = final_digest.finalize() // 32 bytes +output = [0x00, 0x00, 0x01] ++ raw_hash // 35 bytes +``` + +Dictionary arrays are resolved to their value type before hashing. + +--- + +## 7. 
Worked Examples + +### Example A: Simple Two-Column Table + +**Schema**: `{age: Int32 non-nullable, name: LargeUtf8 nullable}` + +**Data** (1 record batch, 2 rows): + +| age | name | +|-----|---------| +| 25 | "Alice" | +| 30 | NULL | + +#### Step 1: Schema Digest + +Canonical JSON (compact): +``` +{"age":{"data_type":"Int32","nullable":false},"name":{"data_type":"LargeUtf8","nullable":true}} +``` + +``` +schema_digest = SHA-256("{"age":{"data_type":"Int32","nullable":false},"name":{"data_type":"LargeUtf8","nullable":true}}") +``` + +#### Step 2: Field "age" (Int32, non-nullable) + +Values: `[25, 30]` + +Little-endian bytes: +- 25 as i32 LE: `19 00 00 00` +- 30 as i32 LE: `1e 00 00 00` + +Data fed to digest: `19 00 00 00 1e 00 00 00` (8 bytes, one contiguous slice) + +``` +age_data_digest = SHA-256(0x19000000_1e000000) +``` + +Finalization into final_digest (non-nullable): +``` +final_digest.update( age_data_digest.finalize() ) // 32 bytes +``` + +#### Step 3: Field "name" (LargeUtf8, nullable) + +Values: `["Alice", NULL]` + +**Validity bits** (Lsb0 in u8 words): +- Element 0 ("Alice"): valid → bit = 1 +- Element 1 (NULL): null → bit = 0 +- BitVec contents: bits `[1, 0]`, bit_count = 2 +- As u8 (Lsb0): bit 0 = 1, bit 1 = 0 → binary `0000_0001` = 1 +- `as_raw_slice()` = `[1_u8]` + +Validity serialization: +``` +bit_count LE: 02 00 00 00 00 00 00 00 (2 as u64 little-endian) +word 0 LE: 01 (1 as u8) +``` + +**Data bytes** (only valid elements): +- "Alice": length 5 as u64 LE = `05 00 00 00 00 00 00 00`, then UTF-8 bytes `41 6c 69 63 65` +- NULL: skipped entirely + +``` +name_data_digest = SHA-256(0x0500000000000000_416c696365) +``` + +Finalization into final_digest (nullable): +``` +final_digest.update( 0x0200000000000000 ) // bit count (u64 LE) +final_digest.update( 0x01 ) // word 0 (u8) +final_digest.update( name_data_digest.finalize() ) // 32 bytes +``` + +#### Step 4: Final Combination + +Fields in alphabetical order: `age`, then `name`. 
```
final_digest = SHA-256()
final_digest.update( schema_digest ) // 32 bytes
final_digest.update( age_data_digest.finalize() ) // 32 bytes (non-nullable)
final_digest.update( 0x0200000000000000 ) // name bit count (u64 LE)
final_digest.update( 0x01 ) // name validity word (u8)
final_digest.update( name_data_digest.finalize() ) // 32 bytes
raw_hash = final_digest.finalize()
output = 0x000001 ++ raw_hash
```

---

### Example B: Boolean Array with Nulls (hash_array API)

**Array**: `BooleanArray [true, NULL, false, true]` (nullable)

#### Step 1: Type Metadata

Canonical type JSON: `"Boolean"` (9 bytes as UTF-8, including the surrounding quotes)

```
final_digest.update(b'"Boolean"')
```

Note: `serde_json::to_string` of a JSON string value includes the surrounding quotes.

#### Step 2: Data

**Validity bits** (Lsb0 in u8):
- `[1, 0, 1, 1]` → bits: b0=1, b1=0, b2=1, b3=1
- As u8 (Lsb0): binary `0000_1101` = 13
- `as_raw_slice()` = `[13_u8]`

**Data bits** (Lsb0 packed, valid values only):
- Valid values: `[true, false, true]` (3 values)
- Lsb0 packing: bit0=true(1), bit1=false(0), bit2=true(1), bits3-7=0
- Byte: `00000101` = `0x05`

```
data_digest = SHA-256(0x05)
```

#### Step 3: Finalization

```
final_digest = SHA-256()
final_digest.update(b'"Boolean"') // type metadata
final_digest.update( 0x0400000000000000 ) // 4 bits (bit count as u64 LE)
final_digest.update( 0x0D ) // 13 as u8
final_digest.update( data_digest.finalize() ) // 32 bytes
raw_hash = final_digest.finalize()
output = 0x000001 ++ raw_hash
```

---

### Example C: Non-Nullable Int32 Array (hash_array API)

**Array**: `Int32Array [1, 2, 3]` (non-nullable)

#### Step 1: Type Metadata

Canonical type JSON: `"Int32"` (7 bytes: `22 49 6e 74 33 32 22`).

`data_type_to_value` for `Int32` produces the JSON string value `"Int32"`; serializing that value with `serde_json::to_string` includes the surrounding quotes, yielding the 7-byte sequence above.

```
final_digest.update(b'"Int32"') // 7 bytes: 22 49 6e 74 33 32 22
```

#### Step 2: Data

Values as i32 LE bytes:
- 1: `01 00 00 00`
- 2: `02 00 00 00`
- 3: `03 00 00 00`

Entire buffer fed as one slice: `01 00 00 00 02 00 00 00 03 00 00 00` (12 bytes)

```
data_digest = SHA-256(0x010000000200000003000000)
```

#### Step 3: Finalization (non-nullable)

```
final_digest = SHA-256()
final_digest.update(b'"Int32"') // 7 bytes
final_digest.update( data_digest.finalize() ) // 32 bytes
raw_hash = final_digest.finalize()
output = 0x000001 ++ raw_hash
```

---

### Example D: Binary Array (hash_array API)

**Array**: `BinaryArray [b"hi", b""]` (non-nullable)

#### Step 1: Type Metadata

`Binary` is canonicalized to `LargeBinary`.

```
final_digest.update(b'"LargeBinary"') // 13 bytes
```

#### Step 2: Data

Each element: `[u64 LE length] [raw bytes]`

- `b"hi"`: length 2 → `02 00 00 00 00 00 00 00` + `68 69`
- `b""`: length 0 → `00 00 00 00 00 00 00 00` (no raw bytes)

```
data_digest = SHA-256(0x0200000000000000_6869_0000000000000000)
```

#### Step 3: Finalization (non-nullable)

```
final_digest = SHA-256()
final_digest.update(b'"LargeBinary"')
final_digest.update( data_digest.finalize() )
raw_hash = final_digest.finalize()
output = 0x000001 ++ raw_hash
```

---

### Example E: Column-Order Independence

Two record batches with the same logical data but different column orders must produce identical hashes.
+ +**Batch 1** (columns: x, y): +``` +Schema: {x: Int32 non-nullable, y: Boolean nullable} +x: [10] +y: [true] +``` + +**Batch 2** (columns: y, x): +``` +Schema: {y: Boolean nullable, x: Int32 non-nullable} +y: [true] +x: [10] +``` + +Both produce the same canonical schema JSON: +``` +{"x":{"data_type":"Int32","nullable":false},"y":{"data_type":"Boolean","nullable":true}} +``` + +Both produce the same field digests (fields processed alphabetically: `x` then `y`): +- Field `x`: `SHA-256(0x0a000000)` (10 as i32 LE) +- Field `y`: validity `[1]` (1 bit, 1 word), data `0x01` (true packed Lsb0) + +Therefore `hash_record_batch(batch1) == hash_record_batch(batch2)`. + +--- + +### Example F: Type Equivalence (Utf8 vs LargeUtf8) + +**Array 1**: `StringArray ["ab"]` (non-nullable, Arrow type `Utf8`) +**Array 2**: `LargeStringArray ["ab"]` (non-nullable, Arrow type `LargeUtf8`) + +Both produce the same type metadata: `"LargeUtf8"` (after canonicalization). + +Both produce the same data bytes: +``` +02 00 00 00 00 00 00 00 (length 2 as u64 LE) +61 62 ("ab" as UTF-8) +``` + +Therefore `hash_array(array1) == hash_array(array2)`. 
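The non-nullable string path from Examples C, D, and F can be reproduced in a few lines of Python. This is a sketch (hashlib-based, not the reference implementation) covering only the non-nullable leaf case; both `Utf8` and `LargeUtf8` inputs reach it with the same canonical type string, which is what makes Example F's equality hold:

```python
import hashlib
import struct

def hash_string_array(values):
    """Non-nullable Utf8/LargeUtf8 array via the hash_array layout:
    type metadata, then the 32-byte data digest, under a version prefix."""
    data = hashlib.sha256()
    for s in values:
        raw = s.encode("utf-8")
        data.update(struct.pack("<Q", len(raw)))  # length prefix as u64 LE
        data.update(raw)                          # raw UTF-8 bytes
    final = hashlib.sha256()
    final.update(b'"LargeUtf8"')   # canonical type JSON, quotes included
    final.update(data.digest())    # 32-byte data digest
    return bytes([0x00, 0x00, 0x01]) + final.digest()

h = hash_string_array(["ab"])
assert len(h) == 35  # 3-byte version prefix + 32-byte SHA-256
```

Because the `Utf8`/`LargeUtf8` distinction disappears before any byte is hashed, a caller cannot observe it in the output.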
+ +--- + +### Example G: Nullable Int32 Array with Nulls (hash_array API) + +**Array**: `Int32Array [Some(42), None, Some(-7), Some(0)]` (nullable) + +#### Step 1: Type Metadata + +``` +final_digest.update(b'"Int32"') // 7 bytes +``` + +#### Step 2: Data + +**Validity bits** (Lsb0 in u8): +- `[1, 0, 1, 1]` → bits: b0=1, b1=0, b2=1, b3=1 +- As u8 (Lsb0): binary `0000_1101` = 13 +- bit_count = 4 + +**Data bytes** (only valid elements): +- 42 as i32 LE: `2a 00 00 00` +- -7 as i32 LE: `f9 ff ff ff` +- 0 as i32 LE: `00 00 00 00` + +``` +data_digest = SHA-256(0x2a000000_f9ffffff_00000000) +``` + +#### Step 3: Finalization (nullable) + +``` +final_digest = SHA-256() +final_digest.update(b'"Int32"') // type metadata +final_digest.update( 0x0400000000000000 ) // 4 bits (bit count as u64 LE) +final_digest.update( 0x0D ) // 13 as u8 +final_digest.update( data_digest.finalize() ) // 32 bytes +raw_hash = final_digest.finalize() +output = 0x000001 ++ raw_hash +``` + +--- + +### Example H: Nullable String Array with Nulls (hash_array API) + +**Array**: `StringArray [Some("hello"), None, Some("world"), Some("")]` (nullable, Arrow type `Utf8`) + +#### Step 1: Type Metadata + +`Utf8` is canonicalized to `LargeUtf8`. 
```
final_digest.update(b'"LargeUtf8"') // 11 bytes
```

#### Step 2: Data

**Validity bits** (Lsb0 in u8):
- `[1, 0, 1, 1]` → 0b1101 = 13
- bit_count = 4

**Data bytes** (only valid elements, null skipped entirely):
- `"hello"`: `05 00 00 00 00 00 00 00` (len=5 as u64 LE) + `68 65 6c 6c 6f`
- `"world"`: `05 00 00 00 00 00 00 00` (len=5 as u64 LE) + `77 6f 72 6c 64`
- `""`: `00 00 00 00 00 00 00 00` (len=0 as u64 LE, no raw bytes)

```
data_digest = SHA-256(len+"hello" + len+"world" + len+"")
```

#### Step 3: Finalization (nullable)

```
final_digest = SHA-256()
final_digest.update(b'"LargeUtf8"')
final_digest.update( 0x0400000000000000 ) // bit_count=4 as u64 LE
final_digest.update( 0x0D ) // validity=13 as u8
final_digest.update( data_digest.finalize() ) // 32 bytes
raw_hash = final_digest.finalize()
output = 0x000001 ++ raw_hash
```

---

### Example I: Empty Table (no data, schema only)

**Schema**: `{a: Int32 non-nullable, b: Boolean nullable}`

When no record batches are fed (i.e., `finalize()` is called immediately after construction), the field digests still exist — they just contain no data.
+ +#### Schema Digest + +``` +schema_json = '{"a":{"data_type":"Int32","nullable":false},"b":{"data_type":"Boolean","nullable":true}}' +schema_digest = SHA-256(schema_json) +``` + +#### Field "a" (Int32, non-nullable) + +No data was fed, so: +``` +a_data_digest = SHA-256("") // SHA-256 of empty input +``` + +#### Field "b" (Boolean, nullable) + +No data was fed: +- `bit_count` = 0 (no elements, BitVec is empty) +- `as_raw_slice()` = `[]` (no words) +- Data digest = SHA-256 of empty input + +#### Final Combination + +``` +final_digest = SHA-256() +final_digest.update( schema_digest ) // 32 bytes +final_digest.update( SHA-256("").finalize() ) // field "a" (non-nullable, 32 bytes) +final_digest.update( 0x0000000000000000 ) // field "b" bit_count=0 (u64 LE) +// no validity words (raw_slice is empty for 0-length BitVec) +final_digest.update( SHA-256("").finalize() ) // field "b" data (32 bytes) +output = 0x000001 ++ final_digest.finalize() +``` + +--- + +### Example J: Multi-Batch Streaming (batch-split independence) + +**Schema**: `{v: Int32 non-nullable}` + +Feeding two batches must produce the same hash as feeding one combined batch: + +- **Batch 1**: `v = [1, 2]` +- **Batch 2**: `v = [3]` +- **Combined**: `v = [1, 2, 3]` + +Because the internal SHA-256 state is incremental: +``` +update(01 00 00 00 02 00 00 00) // from batch 1 +update(03 00 00 00) // from batch 2 +``` +is identical to: +``` +update(01 00 00 00 02 00 00 00 03 00 00 00) // single combined batch +``` + +#### Manual Computation + +``` +schema_json = '{"v":{"data_type":"Int32","nullable":false}}' +schema_digest = SHA-256(schema_json) + +v_data_digest = SHA-256(0x010000000200000003000000) + +final_digest = SHA-256() +final_digest.update( schema_digest ) +final_digest.update( v_data_digest.finalize() ) +output = 0x000001 ++ final_digest.finalize() +``` + +Therefore `hash(batch1 + batch2) == hash(combined)`. 
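Example J's split-independence claim reduces to a property of incremental hashing, which is easy to check directly (illustrative sketch):

```python
import hashlib
import struct

def int32_digest(batches):
    """Feed i32 LE values batch by batch into one running SHA-256."""
    d = hashlib.sha256()
    for batch in batches:
        for v in batch:
            d.update(struct.pack("<i", v))  # 4 bytes per value, LE
    return d.hexdigest()

# Two batches vs one combined batch: same byte stream, same digest.
assert int32_digest([[1, 2], [3]]) == int32_digest([[1, 2, 3]])
```

Note that only the batch *boundaries* are invisible; reordering the values themselves changes the byte stream and therefore the digest.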
---

### Example K: Struct Column in a Record Batch

**Schema**: `{person: Struct<age: Int32, name: LargeUtf8> non-nullable}`

**Data** (2 rows):

| person.age | person.name |
|------------|-------------|
| 25 | "Alice" |
| 30 | "Bob" |

In the record-batch path, the struct is **decomposed into leaf fields**: `person/age` and `person/name`. Each is hashed independently.

#### Step 1: Schema Digest

Canonical JSON:
```
{"person":{"data_type":{"Struct":[{"data_type":"Int32","name":"age","nullable":false},{"data_type":"LargeUtf8","name":"name","nullable":false}]},"nullable":false}}
```

#### Step 2: Leaf field "person/age" (Int32, non-nullable)

```
age_data_digest = SHA-256(0x19000000_1e000000) // [25, 30] as i32 LE
```

#### Step 3: Leaf field "person/name" (LargeUtf8, non-nullable)

```
name_data_digest = SHA-256(
  0x0500000000000000 "Alice" // len=5 u64 LE + UTF-8
  0x0300000000000000 "Bob"   // len=3 u64 LE + UTF-8
)
```

#### Step 4: Final Combination

Fields alphabetically: `person/age`, `person/name`.

```
final_digest = SHA-256()
final_digest.update( schema_digest ) // 32 bytes
final_digest.update( age_data_digest.finalize() ) // 32 bytes (non-nullable)
final_digest.update( name_data_digest.finalize() ) // 32 bytes (non-nullable)
output = 0x000001 ++ final_digest.finalize()
```

---

### Example L: Struct Array via hash_array (non-nullable, decomposed)

**Array**: `StructArray [{a: 1, b: true}, {a: 2, b: false}]`

Children: `a: Int32 non-null`, `b: Boolean non-null`. Struct is non-nullable.

`hash_array` uses the same recursive decomposition as the record-batch path. Struct is transparent — no BTreeMap entry for the struct itself. Children become separate entries.
+ +#### Step 1: Type Metadata + +Canonical type JSON (struct fields sorted alphabetically, keys sorted): +``` +{"Struct":[{"data_type":"Int32","name":"a","nullable":false},{"data_type":"Boolean","name":"b","nullable":false}]} +``` + +#### Step 2: Decomposed Entries + +BTreeMap entries (sorted by key): `"a"`, `"b"` + +**Entry "a"** (Int32, non-nullable → data-only): +``` +data_a = SHA-256(0x01000000_02000000) // [1, 2] as i32 LE +``` + +**Entry "b"** (Boolean, non-nullable → data-only): +``` +// [true, false] → Lsb0: bit0=1, bit1=0 → 0x01 +data_b = SHA-256(0x01) +``` + +#### Step 3: Finalization + +Each entry is non-nullable → no null_bits, no structural, just data.finalize(). + +``` +final_digest = SHA-256() +final_digest.update( type_json_bytes ) // type metadata +final_digest.update( data_a.finalize() ) // entry "a": 32 bytes +final_digest.update( data_b.finalize() ) // entry "b": 32 bytes +output = 0x000001 ++ final_digest.finalize() +``` + +--- + +### Example M: Nullable Struct Array via hash_array (struct-level nulls, decomposed) + +**Array**: `StructArray [Some({a: 10, b: "x"}), None, Some({a: 30, b: "z"})]` + +Children: `a: Int32 non-null`, `b: LargeUtf8 non-null`. Struct is **nullable**. + +Row 1 is a null struct. Struct is transparent — its null is AND-propagated to children for data hashing. Since children are non-nullable per their Field definitions, their entries have no null_bits — but null rows are skipped in the data stream. 
#### Step 1: Type Metadata

```
{"Struct":[{"data_type":"Int32","name":"a","nullable":false},{"data_type":"LargeUtf8","name":"b","nullable":false}]}
```

#### Step 2: Decomposed Entries (with struct-null propagation)

BTreeMap entries (sorted by key): `"a"`, `"b"`

**Entry "a"** (Int32, non-nullable → data-only):
- Struct nulls propagated: rows 0, 2 valid → data: `[10, 30]`

```
data_a = SHA-256(0x0a000000_1e000000) // [10, 30] as i32 LE
```

**Entry "b"** (LargeUtf8, non-nullable → data-only):
- Struct nulls propagated: rows 0, 2 valid → data: `"x"`, `"z"`

```
data_b = SHA-256(
  0x0100000000000000 "x" // len=1 + "x"
  0x0100000000000000 "z" // len=1 + "z"
)
```

#### Step 3: Finalization

Each entry is non-nullable → no null_bits, no structural, just data.finalize().

```
final_digest = SHA-256()
final_digest.update( type_json_bytes ) // type metadata
final_digest.update( data_a.finalize() ) // entry "a": 32 bytes
final_digest.update( data_b.finalize() ) // entry "b": 32 bytes
output = 0x000001 ++ final_digest.finalize()
```

---

### Example N: List-of-Struct in a Record Batch (Recursive Decomposition)

**Schema**: `{items: LargeList<Struct<id: Int32, label: LargeUtf8>> nullable}`

**Data** (2 rows):

| items |
|-------|
| `[{id: 1, label: "a"}, {id: 2, label: "b"}]` |
| `[{id: 3, label: "c"}]` |

The list-of-struct column is **recursively decomposed** into four BTreeMap entries:

| Path | Entry type | Components |
|------|-----------|------------|
| `items` | validity-only | null_bits: `[V, V]` (2 bits) |
| `items/` | structural-only | list lengths: `[2, 1]` |
| `items//id` | data-only | leaf values: `[1, 2, 3]` as i32 LE |
| `items//label` | data-only | leaf values: `len+"a"`, `len+"b"`, `len+"c"` |

Note the path naming: `items` (column) → `items/` (list adds `/`) → `items//id` (struct adds `/id`, producing `//` because parent ends in `/`).
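That path chain can be spelled out mechanically (illustrative snippet, not crate code):

```python
# Column path, then list delimiter, then struct child names.
col = "items"
lst = col + "/"               # list appends a bare "/"
id_path = lst + "/" + "id"    # struct appends "/" + child name
label_path = lst + "/" + "label"

assert (lst, id_path, label_path) == ("items/", "items//id", "items//label")
```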
+ +#### Step 1: Schema Digest + +Canonical JSON (element type omits Arrow-internal field name "item"): +``` +{"items":{"data_type":{"LargeList":{"data_type":{"Struct":[{"data_type":"Int32","name":"id","nullable":false},{"data_type":"LargeUtf8","name":"label","nullable":false}]},"nullable":false}},"nullable":true}} +``` + +#### Step 2: Traversal + +The top-down recursive traversal processes each row: + +**Row 0** (valid list, 2 elements): +- `items` entry: push `valid` to null_bits +- `items/` entry: write `2_u64.to_le_bytes()` to structural +- Recurse into sub-array `[{id:1, label:"a"}, {id:2, label:"b"}]`: + - Struct is transparent — recurse into children (sorted: "id", "label"): + - `items//id` entry: write `1_i32.to_le_bytes()`, `2_i32.to_le_bytes()` to data + - `items//label` entry: write `len+"a"`, `len+"b"` to data + +**Row 1** (valid list, 1 element): +- `items` entry: push `valid` to null_bits +- `items/` entry: write `1_u64.to_le_bytes()` to structural +- Recurse into sub-array `[{id:3, label:"c"}]`: + - `items//id` entry: write `3_i32.to_le_bytes()` to data + - `items//label` entry: write `len+"c"` to data + +#### Step 3: Final Combination + +Entries are finalized in BTreeMap (alphabetical) order: + +``` +final_digest = SHA-256() +final_digest.update( schema_digest ) // 32 bytes + +// Entry "items" (validity-only) +final_digest.update( 0x0200000000000000 ) // bit_count=2 (u64 LE) +final_digest.update( 0x03 ) // validity word: 0b11 = 3 (u8) + +// Entry "items/" (structural-only) +items_structural = SHA-256( + 0x0200000000000000 // row 0: 2 elements + 0x0100000000000000 // row 1: 1 element +) +final_digest.update( items_structural.finalize() ) // 32 bytes + +// Entry "items//id" (data-only) +id_data = SHA-256( + 0x01000000 // 1 as i32 LE + 0x02000000 // 2 as i32 LE + 0x03000000 // 3 as i32 LE +) +final_digest.update( id_data.finalize() ) // 32 bytes + +// Entry "items//label" (data-only) +label_data = SHA-256( + 0x0100000000000000 0x61 // len=1 + "a" + 
0x0100000000000000 0x62 // len=1 + "b" + 0x0100000000000000 0x63 // len=1 + "c" +) +final_digest.update( label_data.finalize() ) // 32 bytes + +output = 0x000001 ++ final_digest.finalize() +``` + +--- + +## 8. Platform Considerations + +- **Integer sizes**: All length prefixes use `u64` (8 bytes, LE). Validity bitmaps use `BitVec` (1 byte per word). Bit counts use `u64` (8 bytes, LE). Hashes are **platform-independent**. +- **Byte order**: All values use little-endian. Validity words are `u8` (1 byte, so endianness is trivial). Bit counts use little-endian. +- **Floating point**: IEEE 754 representation is hashed directly. `NaN` values with different bit patterns produce different hashes. `+0.0` and `-0.0` produce different hashes. diff --git a/docs/design-spec.md b/docs/design-spec.md index 5ad83c6..1f809b4 100644 --- a/docs/design-spec.md +++ b/docs/design-spec.md @@ -21,7 +21,8 @@ The hash algorithm is parameterized via Rust's `digest::Digest` trait. The publi |------|-----------| | **Logical equivalence** | Two Arrow structures represent the same data regardless of physical layout choices (encoding, column order, batch splits). | | **Validity bitmap** | A bit vector where `1` = valid, `0` = null, tracked per nullable field. | -| **Data digest** | A running hash of the non-null data bytes for a single field. | +| **Data digest** | A running hash of the non-null leaf data bytes for a single field. | +| **Structural digest** | A running hash of element counts for list-type fields, separating structure from leaf data. | | **Schema digest** | A hash of the canonicalized JSON representation of the schema. | | **Field path** | A `/`-separated path for nested struct fields (e.g., `address/city`). 
|
@@ -69,35 +70,38 @@ Because the top-level is a `BTreeMap`, field names are automatica
```json
{
  "age": {"data_type": "Int32", "nullable": false},
- "name": {"data_type": "Utf8", "nullable": true}
+ "name": {"data_type": "LargeUtf8", "nullable": true}
}
```
-### 4.2 Data Type Serialization
+### 4.2 Data Type Serialization (`data_type_to_value`)
+
+All data type serialization goes through `data_type_to_value`, which produces a canonical JSON representation. The output is recursively key-sorted via `sort_json_value` before returning.
#### Primitive types
Serialized using Arrow's built-in serde, producing strings like `"Int32"`, `"Boolean"`, `"Float64"`, or objects like `{"Decimal128": [38, 5]}`, `{"Time32": "Second"}`.
#### Logical type equivalence classes
-For fully logical hashing, certain types that differ only in physical representation are canonicalized to a single form in the schema:
+Certain types that differ only in physical representation (offset width) are canonicalized to a single form:
| Types in equivalence class | Canonical form in schema |
|---|---|
| `Binary`, `LargeBinary` | `"LargeBinary"` |
| `Utf8`, `LargeUtf8` | `"LargeUtf8"` |
-| `List(field)`, `LargeList(field)` | `{"LargeList": <element>}` |
+| `List(field)`, `LargeList(field)` | `{"LargeList": <element>}` |
+| `Dictionary(key_type, value_type)` | Recursive `data_type_to_value(value_type)` |
The "large" variant is always the canonical form because it is the superset representation.
#### Nested types
- **Struct**: `{"Struct": [<children>]}` — inner fields are **sorted alphabetically by field name** before serialization.
-- **List / LargeList**: `{"LargeList": <element>}` (canonicalized to large variant).
-- **FixedSizeList**: `{"FixedSizeList": [<element>, <size>]}`.
+- **List / LargeList**: `{"LargeList": <element>}` (canonicalized to large variant). The element type uses `element_type_to_value` which omits the Arrow-internal field name (e.g., `"item"`), including only `data_type` and `nullable`.
+- **FixedSizeList**: `{"FixedSizeList": [<element>, <size>]}`.
Also uses `element_type_to_value` (no field name).
- **Map**: `{"Map": [<entries>, <sorted>]}`.
-Each inner field object has the form:
+**Inner field object** (for struct children, map entries):
```json
{
  "data_type": <type>,
@@ -106,6 +110,14 @@ Each inner field object has the form:
}
```
+**Element type object** (for list/fixed-size-list items):
+```json
+{
+  "data_type": <type>,
+  "nullable": <bool>
+}
+```
+
All JSON objects have their keys sorted recursively via `sort_json_value` to ensure deterministic serialization.
### 4.3 Schema Digest Computation
@@ -116,13 +128,33 @@ schema_digest = SHA256(canonical_json_string)
---
-## 5. Data Serialization (Byte Layout)
+## 5. DigestBufferType
+
+Each entry in the BTreeMap has a `DigestBufferType` struct with three **optional** components:
+
+```rust
+struct DigestBufferType<D: Digest> {
+    null_bits: Option<BitVec<u8, Lsb0>>, // Present for nullable entries
+    structural: Option<D>,               // Present for list-type entries
+    data: Option<D>,                     // Present for leaf and list-leaf entries
+}
+```
+
+- **`null_bits`**: Validity bitmap. Present for nullable fields, absent for non-nullable.
+- **`structural`**: A separate running digest for list element counts. Present for list-type entries. Separates structure (how elements are partitioned into lists) from leaf data.
+- **`data`**: The running digest for actual data bytes (leaf values). Present for leaf and list-leaf entries, absent for validity-only and structural-only entries.
+
+There are four entry types, constructed via dedicated constructors:
+- **`new_data_only(nullable)`**: Leaf field (e.g., `Int32`). Has `data`, optionally `null_bits`.
+- **`new_structural_only(nullable)`**: List intermediate node above a struct or nested list. Has `structural`, optionally `null_bits`.
+- **`new_list_leaf(nullable)`**: List whose value type is a leaf (e.g., `List<Int32>`). Has `structural` + `data`, optionally `null_bits`.
+- **`new_validity_only()`**: Nullable parent whose descendants have their own entries. Has `null_bits` only.
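As a cross-language illustration, the four constructors can be sketched in Python. This is a hypothetical mirror of the Rust struct above, not Starfix code; `hashlib.sha256()` stands in for the running digest `D`:

```python
from dataclasses import dataclass
from typing import Any, Optional
import hashlib

@dataclass
class DigestBuffer:
    """Sketch of DigestBufferType: three optional components per entry."""
    null_bits: Optional[list]  # validity bits (present iff nullable)
    structural: Optional[Any]  # running digest of list element counts
    data: Optional[Any]        # running digest of leaf data bytes

def new_data_only(nullable: bool) -> DigestBuffer:
    # Leaf field: data digest, plus null bits if nullable.
    return DigestBuffer([] if nullable else None, None, hashlib.sha256())

def new_structural_only(nullable: bool) -> DigestBuffer:
    # List node above a struct or nested list: structural digest only.
    return DigestBuffer([] if nullable else None, hashlib.sha256(), None)

def new_list_leaf(nullable: bool) -> DigestBuffer:
    # List whose value type is a leaf: structural + data digests.
    return DigestBuffer([] if nullable else None, hashlib.sha256(), hashlib.sha256())

def new_validity_only() -> DigestBuffer:
    # Nullable parent whose descendants have their own entries.
    return DigestBuffer([], None, None)
```

Each constructor leaves absent components as `None`, matching the "written only if present" finalization rule.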
+
+---
-Each field is hashed independently. The field's digest buffer is one of:
-- `NonNullable(D)` — a single running digest for data bytes.
-- `Nullable(BitVec, D)` — a validity bitmap (`BitVec`) plus a running data digest.
+## 6. Data Serialization (Byte Layout)
-### 5.1 Fixed-Size Types
+### 6.1 Fixed-Size Types
**Types:** `Int8`, `UInt8`, `Int16`, `UInt16`, `Int32`, `UInt32`, `Int64`, `UInt64`, `Float16`, `Float32`, `Float64`, `Date32`, `Date64`, `Time32(*)`, `Time64(*)`, `Decimal32`, `Decimal64`, `Decimal128`, `Decimal256`, `FixedSizeBinary(n)`.
@@ -138,18 +170,18 @@ Each field is hashed independently. The field's digest buffer is one of:
| Decimal256 | 32 | Little-endian |
| FixedSizeBinary(n) | n | Raw bytes |
-**Non-nullable path:** The entire buffer slice (accounting for offset) is fed into the digest in one call.
+**Non-nullable path:** The entire buffer slice (accounting for offset) is fed into the data digest in one call.
**Nullable path:**
1. Extend the validity bitmap with `is_valid(i)` for each element.
2. For each valid element, feed its little-endian bytes into the data digest.
3. Null elements are **skipped** — no data bytes are fed (null information is captured solely by the validity bitmap).
-### 5.2 Boolean Type
+### 6.2 Boolean Type
-Boolean values are **bit-packed** using MSB-first (`Msb0`) ordering into bytes.
+Boolean values are **bit-packed** into bytes via `BitVec` using LSB-first (`Lsb0`) ordering with `u8` storage words.
-**Non-nullable path:** All values are packed sequentially.
+**Non-nullable path:** All values are packed sequentially into a `BitVec`, and the raw backing bytes are fed into the data digest.
**Nullable path:**
1. Extend the validity bitmap.
2.
**Example:** `[true, NULL, false, true]` (nullable)
- Validity bitmap: `[1, 0, 1, 1]`
-- Data bits (valid only): `[true, false, true]` → Msb0 packed: `1010_0000` = `0xA0`
+- Data bits (valid only): `[true, false, true]` → Lsb0 packed: bit0=1, bit1=0, bit2=1 → `0000_0101` = `0x05`
+
+**Example:** `[true, false, true]` (non-nullable)
+- Lsb0 packed: bit0=1, bit1=0, bit2=1 → `0000_0101` = `0x05`
-### 5.3 Variable-Length Types (Binary, String)
+### 6.3 Variable-Length Types (Binary, String)
**Types:** `Binary`, `LargeBinary`, `Utf8`, `LargeUtf8`.
@@ -171,67 +206,67 @@ Each element is serialized as:
The length prefix is **always u64** (8 bytes, little-endian) regardless of the offset type (`i32` for `Binary`/`Utf8`, `i64` for `LargeBinary`/`LargeUtf8`). This ensures cross-platform stability and logical equivalence between small/large variants.
-**Non-nullable path:** For each element, feed `len.to_le_bytes()` (u64) then the raw bytes.
+**Non-nullable path:** For each element, feed `(value.len() as u64).to_le_bytes()` then the raw bytes.
**Nullable path:**
1. Extend the validity bitmap.
2. For valid elements: feed length prefix + raw bytes.
3. For null elements: **skip entirely** — no sentinel bytes. Null information is captured by the validity bitmap.
-### 5.4 List Types
+### 6.4 List Types (Record-Batch Path)
**Types:** `List(field)`, `LargeList(field)`.
-Each list element (a sub-array) is serialized as:
-```
-[sub-array length as u64 little-endian (8 bytes)] [recursive serialization of sub-array elements]
-```
+List columns are **recursively decomposed** into separate BTreeMap entries. A list creates an intermediate entry at `path/` (path + delimiter). The value type is then recursively traversed.
-The sub-array length prefix prevents collisions between differently-partitioned lists (e.g., `[[1,2],[3]]` vs `[[1],[2,3]]`).
+**Decomposition by value type:**
+- **`List<leaf>`** (e.g., `List<Int32>`): Entry at `path/` is a **list-leaf** with both structural and data digests.
+- **`List<Struct<...>>`**: Entry at `path/` is **structural-only**. The struct is transparent, and each struct child creates its own entry at `path//childname`.
+- **`List<List<...>>`**: Entry at `path/` is structural-only. The inner list creates another entry at `path//`.
-**Nullable path:** Same as other types — extend validity bitmap, skip null list entries.
+**Nullable list columns:** A **validity-only** entry is created at `path` (without trailing `/`), recording which rows are null vs valid. Null list elements are not traversed.
-The sub-array elements are hashed recursively using the same `array_digest_update` dispatch, so nested lists and nested structs within lists follow the same rules.
+**Traversal:** For each non-null list element, write the sub-array length (u64 LE) to the structural digest at `path/`, then recurse into the sub-array.
-### 5.5 Struct Types
+### 6.5 Struct Types (Record-Batch Path)
-Struct fields are **not hashed as a composite** — instead, each leaf field within the struct is extracted and hashed independently under its own field path (e.g., `address/city`, `address/zip`). The field paths are stored in a `BTreeMap`, so they are always processed in alphabetical order.
+Struct fields are **transparent** — they do not create a BTreeMap entry. Instead:
-This design means:
-- Struct field order in the Arrow schema does not affect the hash.
-- Each leaf field maintains its own independent validity bitmap and data digest.
+1. **Children are traversed** in alphabetical order by field name.
+2. **Struct-level nulls are AND-propagated** to all descendant entries via `combine_nulls`. If a struct row is null, none of its children's data is hashed for that row.
+3. Each child is recursively decomposed (leaf → data entry, list → structural entry, nested struct → recurse further).
-### 5.6 Dictionary-Encoded Arrays
+**Path naming:** Struct adds `/fieldname` to the path. Combined with list's trailing `/`, this produces paths like `items//id` (list `/` + struct `/id`).
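The AND-propagation in step 2 amounts to a per-row conjunction of parent and child validity. A minimal sketch (the name mirrors `combine_nulls`; the list-of-bools signature is illustrative):

```python
def combine_nulls(parent_valid: list[bool], child_valid: list[bool]) -> list[bool]:
    """A child row is effectively valid only if the struct row is also valid."""
    return [p and c for p, c in zip(parent_valid, child_valid)]
```

For example, a struct validity of `[True, False, True]` combined with a child validity of `[True, True, False]` leaves only row 0 valid, so only row 0's data bytes reach that child's digest.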
-Dictionary-encoded arrays are **resolved to their plain equivalent** before hashing. The dictionary is unpacked so that the resulting data stream is identical to what a non-dictionary-encoded array with the same logical values would produce. +### 6.6 Dictionary-Encoded Arrays + +Dictionary-encoded arrays are **resolved to their plain equivalent** before hashing. The dictionary is unpacked using Arrow's `cast` kernel so that the resulting data stream is identical to what a non-dictionary-encoded array with the same logical values would produce. This ensures that `DictionaryArray(indices=[0,1,0], dict=["a","b"])` produces the same hash as `StringArray(["a","b","a"])`. --- -## 6. Final Digest Assembly +## 7. Final Digest Assembly -### 6.1 Field Digest Finalization +### 7.1 Field Digest Finalization -Each field's digest buffer is finalized and fed into the combined final digest: +Each entry's `DigestBufferType` is finalized and fed into the combined final digest via `finalize_digest`. Each component is written only if present: -**Non-nullable field:** -``` -feed: SHA256_finalize(data_digest) // 32 bytes ``` +// If nullable (null_bits is Some): +feed: validity_bitmap_length as u64 LE // 8 bytes (number of bits) +feed: validity_bitmap raw bytes (LE) // ceil(length/8) bytes (u8 words, to_le_bytes is identity for u8) -**Nullable field:** -``` -feed: validity_bitmap_length as u64 LE // 8 bytes (number of bits) -feed: validity_bitmap words (BE bytes) // ceil(length/8) bytes, each u8 word in big-endian -feed: SHA256_finalize(data_digest) // 32 bytes +// If list type (structural is Some): +feed: SHA256_finalize(structural_digest) // 32 bytes + +// If leaf/list-leaf (data is Some): +feed: SHA256_finalize(data_digest) // 32 bytes ``` -The validity bitmap is serialized as: -1. The bit count (number of elements seen) as `u64` little-endian. -2. The raw backing storage words, each converted to big-endian bytes. +The validity bitmap uses `BitVec` storage. 
Each `u8` word is serialized via `to_le_bytes()` (identity for single-byte words). The bit count (not byte count) is written as the length prefix. -### 6.2 Combined Final Digest +### 7.2 Combined Final Digest ``` final_digest = SHA256( @@ -244,7 +279,7 @@ final_digest = SHA256( Fields are iterated from the `BTreeMap` which maintains alphabetical ordering by field path. -### 6.3 Version Prefix +### 7.3 Version Prefix The public `ArrowDigester` prepends a 3-byte version prefix to the final digest: @@ -254,139 +289,86 @@ output = [0x00, 0x00, 0x01] || final_digest // 3 + 32 = 35 bytes total --- -## 7. Standalone `hash_array` Function +## 8. Standalone `hash_array` Function -`hash_array` hashes a single array without a full schema context. Its digest is: +`hash_array` hashes a single array without a full schema context. It uses the **same recursive decomposition** as the record-batch path (`extract_type_entries` + `traverse_and_update`), ensuring consistent hashing regardless of which API is used. ``` final = SHA256( - canonical_json(data_type) // data type metadata - || finalized_field_digest // nullable or non-nullable, same rules as above + serde_json::to_string(data_type_to_value(effective_type)) // canonical type JSON string + || for each BTreeMap entry: finalize_digest(entry) // same decomposition as record-batch ) ``` -The data type is serialized using the same `data_type_to_value` logic (with type canonicalization) and then `serde_json::to_string`. +If the input is a dictionary array, it is first resolved to its plain value type via `cast`. The effective type is then serialized using `data_type_to_value` (with type canonicalization and recursive key sorting), converted to a JSON string, and fed into the digest before the decomposed field entries. --- -## 8. Invariants and Guarantees +## 9. 
Schema Equality in `update()`
+
+When `update(record_batch)` is called, the record batch's schema is compared against the digester's schema **logically** — both schemas are serialized via `serialized_schema()` (which uses `data_type_to_value` with type canonicalization) and the resulting strings are compared. This means:
+- Column order doesn't matter (both are sorted by `BTreeMap`).
+- `Utf8` vs `LargeUtf8`, `Binary` vs `LargeBinary`, `List` vs `LargeList` are treated as equivalent.
+- Dictionary types are canonicalized to their value types.
+
+---
+
+## 10. Invariants and Guarantees
1. **Column-order independence:** Top-level fields are sorted alphabetically via `BTreeMap`.
-2. **Struct field-order independence:** Struct children are sorted by name during schema serialization and field extraction.
+2. **Struct field-order independence:** Struct children are sorted by name during schema serialization and during recursive decomposition into BTreeMap entries.
3. **Batch-split independence:** Streaming `update()` calls produce the same hash as a single combined batch.
4. **Encoding independence:** Dictionary-encoded arrays are resolved before hashing.
5. **Physical type independence:** `Binary`/`LargeBinary`, `Utf8`/`LargeUtf8`, `List`/`LargeList` are canonicalized to their large variants in the schema and use identical data serialization.
-6. **Platform independence:** All length prefixes use `u64` (8 bytes LE), all numeric values use little-endian byte order.
+6. **Platform independence:** All length prefixes use `u64` (8 bytes LE), all numeric values use little-endian byte order, validity bitmaps use `BitVec` (u8-width words, not platform-dependent `usize`).
7. **Null handling consistency:** Null values are tracked solely via the validity bitmap. No sentinel bytes are fed into the data digest for any type.
-8.
**Non-null arrays with/without validity bitmap:** An array with all valid values produces the same data digest whether or not a validity bitmap is present (nulls simply mean bits are not pushed and values are not fed, and all-valid arrays feed the same bytes). - ---- - -## 9. Known Issues and Required Fixes - -The following issues have been identified in the current implementation that must be fixed to achieve the guarantees above: - -### 9.1 Struct Fields Not Sorted in Schema Serialization - -**File:** `arrow_digester_core.rs`, `data_type_to_value()` (line ~206) - -**Issue:** Struct inner fields are collected into a `Vec` in their original order. Two schemas with the same struct fields in different order will produce different schema hashes. - -**Fix:** Sort the fields iterator by field name before collecting into the Vec. - -### 9.2 `inner_field_to_value` Not Recursively Sorted - -**File:** `arrow_digester_core.rs`, `inner_field_to_value()` (line ~232) - -**Issue:** The JSON object produced by `serde_json::json!` has non-deterministic key order. While `sort_json_value` is applied at the top level in `serialized_schema`, it is NOT applied to the output of `data_type_to_value`/`inner_field_to_value`. - -**Fix:** Apply `sort_json_value` recursively in `data_type_to_value` before returning. - -### 9.3 Binary Length Prefix Uses Platform-Dependent `usize` - -**File:** `arrow_digester_core.rs`, `hash_binary_array()` (line ~518) - -**Issue:** `value.len().to_le_bytes()` produces 4 bytes on 32-bit and 8 bytes on 64-bit platforms. - -**Fix:** Cast to `u64` before calling `to_le_bytes()`: `(value.len() as u64).to_le_bytes()`. - -### 9.4 `NULL_BYTES` Sentinel in Binary/String Nullable Paths - -**File:** `arrow_digester_core.rs`, `hash_binary_array()` (line ~536), `hash_string_array()` (line ~579) - -**Issue:** Null values feed `b"NULL"` into the data digest, but `hash_fixed_size_array` skips nulls entirely. 
Since null information is already captured in the validity bitmap, the sentinel is redundant and inconsistent. - -**Fix:** Remove `data_digest.update(NULL_BYTES)` from the null branches. Skip null values entirely, matching the fixed-size type behavior. - -### 9.5 No Type Canonicalization for Binary/Utf8/List Variants - -**File:** `arrow_digester_core.rs`, `data_type_to_value()` and `serialized_schema()` - -**Issue:** `Binary` and `LargeBinary` serialize to different JSON strings, causing logically equivalent schemas to hash differently. - -**Fix:** In `data_type_to_value`, map `Binary` → `LargeBinary`, `Utf8` → `LargeUtf8`, `List` → `LargeList` before serialization. - -### 9.6 Dictionary-Encoded Arrays Not Supported - -**File:** `arrow_digester_core.rs`, `array_digest_update()` (line ~437) - -**Issue:** Dictionary-encoded arrays hit `todo!()` and panic. - -**Fix:** Resolve dictionary arrays to their plain value arrays using Arrow's `take` kernel or equivalent, then recursively hash the result. - -### 9.7 Schema Equality Check in `update()` Too Strict - -**File:** `arrow_digester_core.rs`, `update()` (line ~61) - -**Issue:** `*record_batch.schema() == self.schema` uses strict Arrow schema equality which includes column order. This prevents streaming batches with different column orders. - -**Fix:** Compare schemas logically (same set of fields with same types and nullability, regardless of order). +8. **Non-null arrays with/without validity bitmap:** An array with all valid values produces the same data digest whether or not a validity bitmap is present. --- -## 10. Comprehensive Test Plan +## 11. Comprehensive Test Plan -### 10.1 Column-Order Independence Tests +### 11.1 Column-Order Independence Tests - **Top-level column reorder:** Two record batches with columns `[a, b, c]` vs `[c, a, b]` with same data produce identical hashes. - **Schema-only column reorder:** Two schemas with same fields in different order produce identical schema hashes. 
- **Streaming with reordered batches:** Feed batch1 with order `[a, b]`, batch2 with order `[b, a]` — should produce same hash as feeding both in order `[a, b]`. -### 10.2 Struct Field-Order Independence Tests +### 11.2 Struct Field-Order Independence Tests - **Flat struct reorder:** `Struct({x: Int32, y: Utf8})` vs `Struct({y: Utf8, x: Int32})` with same data produce identical hashes. - **Nested struct reorder:** Deeply nested structs with shuffled field orders at every level. - **Schema hash with reordered struct fields:** Verify schema digest is identical. -### 10.3 Dictionary Encoding Equivalence Tests +### 11.3 Dictionary Encoding Equivalence Tests - **String dictionary vs plain:** `DictionaryArray` vs `StringArray` with same logical values. - **Integer dictionary vs plain:** Dictionary-encoded integers vs plain integer array. - **Dictionary with nulls:** Dictionary arrays containing null entries match plain arrays with same nulls. - **Nested dictionary:** List of dictionary-encoded strings vs list of plain strings. -### 10.4 Binary/Utf8/List Size Variant Equivalence Tests +### 11.4 Binary/Utf8/List Size Variant Equivalence Tests - **Binary vs LargeBinary:** Same byte data in both produces identical hash. - **Utf8 vs LargeUtf8:** Same string data produces identical hash. - **List vs LargeList:** Same list data produces identical hash. - **Schema equivalence:** Schema with `Binary` field hashes same as schema with `LargeBinary` field (same name, same nullability). -### 10.5 Null Handling Tests +### 11.5 Null Handling Tests -- **No sentinel bytes:** Verify that null values in binary/string arrays don't feed any extra bytes into the data digest (after fix). +- **No sentinel bytes:** Verify that null values in binary/string arrays don't feed any extra bytes into the data digest. - **All-null array:** Array of all nulls produces a hash that depends only on the validity bitmap. 
- **All-valid nullable vs non-nullable:** Array with all valid values produces same data digest whether schema says nullable or not. - **Mixed nulls across batches:** First batch all nulls, second batch all valid — same as single combined batch. - **Null at different positions:** `[1, NULL, 3]` vs `[NULL, 1, 3]` produce different hashes. -### 10.6 Batch Splitting Independence Tests +### 11.6 Batch Splitting Independence Tests - **Two batches vs one:** Already tested, but extend to more types and edge cases. - **Many small batches:** Split into single-row batches vs one large batch. - **Empty batches:** Inserting empty batches between data batches doesn't change the hash. -### 10.7 Edge Cases +### 11.7 Edge Cases - **Empty table:** Schema-only hash (no data). - **Zero-length arrays:** Arrays with length 0 for each type. @@ -398,7 +380,7 @@ The following issues have been identified in the current implementation that mus - **Unicode strings:** Strings with multi-byte UTF-8 characters. - **Sliced arrays:** Arrays created via `array.slice(offset, length)` should hash the same as a fresh array with the same values. -### 10.8 Collision Resistance Tests +### 11.8 Collision Resistance Tests - **Binary partition collision:** `[[0x01, 0x02], [0x03]]` vs `[[0x01], [0x02, 0x03]]` (already tested). - **String partition collision:** `["ab", "c"]` vs `["a", "bc"]` (already tested). @@ -406,12 +388,12 @@ The following issues have been identified in the current implementation that mus - **Null vs zero:** `[NULL]` vs `[0]` produce different hashes. - **Empty vs null:** `[Some("")]` vs `[None]` for string type. -### 10.9 Regression / Golden Value Tests +### 11.9 Regression / Golden Value Tests - Maintain golden hash values for a comprehensive schema with data, verified against manually computed expected bytes. - Byte-level verification tests (already partially present) for each data type confirming exact bytes fed into the digest. 
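A golden-value harness only needs the output shape the byte-layout spec defines: SHA-256 over the schema digest followed by each finalized entry, prefixed with the 3-byte version. A minimal Python sketch (the helper name `combine` is illustrative, not a Starfix API):

```python
import hashlib

VERSION_PREFIX = bytes([0x00, 0x00, 0x01])  # version 0.0.1

def combine(schema_digest: bytes, entry_digests: list[bytes]) -> bytes:
    """Assemble the 35-byte output: version prefix + SHA-256 of
    (schema digest || finalized entry digests, in BTreeMap order)."""
    h = hashlib.sha256()
    h.update(schema_digest)
    for d in entry_digests:
        h.update(d)
    return VERSION_PREFIX + h.digest()
```

Golden tests can then pin the full 35-byte value for a fixed schema and batch, and re-derive it from manually computed digests.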
-### 10.10 Cross-Type Distinction Tests +### 11.10 Cross-Type Distinction Tests - **Float32 vs Float64:** Same numeric value (e.g., `1.5`) in different float types produces different hashes (schema distinguishes them). - **Int32 vs Int64:** Same integer value in different integer types produces different hashes. diff --git a/docs/implementation-plan.md b/docs/implementation-plan.md new file mode 100644 index 0000000..bc9a4e9 --- /dev/null +++ b/docs/implementation-plan.md @@ -0,0 +1,434 @@ +# Implementation Plan: Complete Stable Logical Hashing + +This plan addresses all identified gaps in the Starfix hashing implementation, organized into tiers by priority. Each item follows the project's TDD workflow: write failing tests first, then implement. + +**Files primarily affected:** +- `src/arrow_digester_core.rs` — core implementation +- `tests/arrow_digester.rs` — integration tests +- `tests/digest_bytes.rs` — byte-level specification conformance tests +- `docs/byte-layout-spec.md` — specification updates + +--- + +## Tier 1 — Blocks Production Use + +### 1.1 Implement `Timestamp` data hashing + +**Current state:** `todo!()` in `array_digest_update` for `DataType::Timestamp`. Schema serialization already works (falls through to Arrow serde: `{"Timestamp":["Nanosecond","UTC"]}`). + +**Implementation:** Timestamp is always `i64` (8 bytes LE), regardless of unit or timezone. + +```rust +DataType::Timestamp(_, _) => Self::hash_fixed_size_array(effective_array, digest, 8), +``` + +**Design decision — Timezone equivalence:** +Arrow's serde serializes `Timestamp(Nanosecond, Some("UTC"))` as `{"Timestamp":["Nanosecond","UTC"]}` and `Timestamp(Nanosecond, None)` as `{"Timestamp":["Nanosecond",null]}`. These naturally produce different schema hashes, which means **two columns with the same epoch values but different timezone annotations will hash differently** (because their schemas differ). This is the correct behavior — timezone is part of the logical type identity. 
**No special handling needed.** + +However, there is a subtler question: should `Timestamp(Nanosecond, Some("UTC"))` and `Timestamp(Nanosecond, Some("Etc/UTC"))` hash the same? They refer to the same timezone but have different string representations. **Recommended decision: do NOT normalize timezone strings.** Timezone alias resolution is complex, locale-dependent, and outside Starfix's scope. Document this as a known limitation. + +**Tests:** +- `Timestamp(Nanosecond, Some("UTC"))` basic hashing (hash_array) +- `Timestamp(Microsecond, None)` with nulls +- Different units with same raw value produce different schema hashes (schema difference) +- Same unit, same data, different timezone strings produce different hashes +- Byte-level test in `digest_bytes.rs` + +**Spec update:** Add Section 3.7 for Timestamp, or extend Section 3.1 with a note that Timestamp/Duration are 8-byte fixed-size types. + +--- + +### 1.2 Implement `Duration` data hashing + +**Current state:** `todo!()` in `array_digest_update` for `DataType::Duration`. Schema serialization works (`{"Duration":"Millisecond"}`). + +**Implementation:** Duration is always `i64` (8 bytes LE). + +```rust +DataType::Duration(_) => Self::hash_fixed_size_array(effective_array, digest, 8), +``` + +**Design decision:** None needed. The unit is encoded in the schema JSON, so different Duration units produce different schema hashes. Data is just raw i64 bytes. + +**Tests:** +- `Duration(Millisecond)` basic hashing +- Different units produce different schema hashes +- Byte-level test + +--- + +### 1.3 Implement `Interval` data hashing + +**Current state:** `todo!()` in `array_digest_update` for `DataType::Interval`. 
+
+**Implementation:** Element size depends on the IntervalUnit variant:
+
+```rust
+DataType::Interval(unit) => {
+    let size = match unit {
+        IntervalUnit::YearMonth => 4,    // i32
+        IntervalUnit::DayTime => 8,      // i32 + i32 packed as i64
+        IntervalUnit::MonthDayNano => 16, // i32 + i32 + i64
+    };
+    Self::hash_fixed_size_array(effective_array, digest, size);
+}
+```
+
+**Design decision:** None needed. Schema serialization (`{"Interval":"MonthDayNano"}`) already differentiates variants. Each variant has a fixed physical size, so `hash_fixed_size_array` works directly.
+
+**Tests:**
+- One test per IntervalUnit variant
+- `MonthDayNano` with nulls
+- Different interval units produce different schema hashes
+- Byte-level test for `YearMonth` (simplest, 4-byte)
+
+---
+
+### 1.4 Implement `FixedSizeList` data hashing
+
+**Current state:** `todo!()` in `array_digest_update` for `DataType::FixedSizeList`. Schema normalization and serialization already work correctly (`{"FixedSizeList": [<element>, size]}`). Normalization recurses into the inner field but does **not** collapse `FixedSizeList` → `LargeList`.
+
+**Design decision — Should `FixedSizeList(Int32, 3)` be equivalent to `LargeList(Int32)`?**
+**Recommended: No.** They are semantically different types (fixed-length vs variable-length). A `FixedSizeList` guarantees every element has exactly N items; a `LargeList` does not. Keep them as distinct types in the hash. This is consistent with how FixedSizeBinary is already handled (kept separate from LargeBinary).
+
+**Implementation:** `FixedSizeList` is conceptually a list where every element has exactly `size` items. For hashing, we can treat it like `LargeList` but without structural size prefixes (since all sizes are identical and encoded in the schema).
+
+However, for consistency with `LargeList`, we should still use structural hashing with the fixed size.
This ensures that if a user ever needs to compare a `FixedSizeList` hash against a manually reconstructed one, the logic is consistent.
+
+**Alternative (simpler):** Treat `FixedSizeList(field, n)` as a flat buffer of `n * element_size` bytes per row. This only works for fixed-size inner types. For variable-size inner types (e.g., `FixedSizeList(Utf8, 3)`), we must recurse.
+
+**Recommended approach:** Reuse `hash_list_array` logic by casting `FixedSizeListArray` to `LargeListArray`. Arrow's `cast` supports this. This is the simplest and most consistent approach.
+
+```rust
+DataType::FixedSizeList(field, _) => {
+    let as_large_list = cast(effective_array, &DataType::LargeList(Arc::clone(field)))
+        .expect("Failed to cast FixedSizeList to LargeList");
+    Self::hash_list_array(
+        as_large_list.as_any().downcast_ref::<LargeListArray>()
+            .expect("Failed to downcast to LargeListArray"),
+        field.data_type(),
+        digest,
+    );
+}
+```
+
+**Design decision — Normalization update needed?** If we cast at hash time, we should also normalize `FixedSizeList` → `LargeList` in `normalize_data_type` to keep schema and data hashing consistent. But then `FixedSizeList` and `LargeList` with the same element type would be logically equivalent (same hash), which loses the fixed-size guarantee in the hash. **Decision needed from project owner:**
+- **(A)** Normalize `FixedSizeList(f, n)` → `LargeList(f)` — treats them as equivalent (like Utf8/LargeUtf8)
+- **(B)** Keep separate — `FixedSizeList` and `LargeList` always hash differently (different schema JSON)
+- **(C)** Keep schema separate but use same data hashing logic (cast at data time, don't normalize schema) — this is the recommended approach
+
+If **(C)**: schema JSON stays as `{"FixedSizeList":[..., n]}` (preserving the size), but data hashing uses LargeList logic internally. This means two arrays with identical data but different types (`FixedSizeList` vs `LargeList`) produce different hashes (because their schemas differ), which is correct.

**Tests:**
- `FixedSizeList(Int32, 2)` basic hashing
- `FixedSizeList(LargeUtf8, 3)` with variable-length inner type
- Nullable `FixedSizeList` with null elements
- Verify `FixedSizeList(Int32, 2)` ≠ `LargeList(Int32)` (if option B/C chosen)
- Byte-level test

---

### 1.5 Implement `Map` data hashing

**Current state:** `todo!()` in `array_digest_update` for `DataType::Map`. Schema normalization and serialization work (`{"Map":[<entries>, sorted]}`).

**Background:** A `Map` in Arrow is physically stored as a list of key-value structs (`List<Struct<key, value>>`). The Arrow `MapArray` wraps a `ListArray` of `StructArray` entries.

**Design decision — Should `Map` be normalized to `LargeList<Struct<key, value>>`?**
**Recommended: No.** `Map` has semantic meaning (key-value pairs, optional sort guarantee) that `LargeList` does not. The `sorted` flag is part of the schema JSON and should affect the hash. Keep `Map` as a distinct type.

**Implementation:** Treat `Map` as a list of structs. Use the same approach as `LargeList`:

```rust
DataType::Map(field, _sorted) => {
    // Map is physically stored as a list of key-value structs
    let map_array = effective_array.as_any()
        .downcast_ref::<MapArray>()
        .expect("Failed to downcast to MapArray");
    // Reinterpret as list of entries
    // MapArray provides .entries() as StructArray and offsets
    // Hash like a LargeList<Struct<key, value>>
    // ...
}
```

Concretely, `MapArray` exposes `keys()`, `values()`, and offsets. The cleanest path is to extract the underlying `ListArray` and hash it:

```rust
DataType::Map(field, _) => {
    // MapArray is backed by a ListArray of Struct entries
    let map_array = effective_array.as_any()
        .downcast_ref::<MapArray>()
        .expect("Failed to downcast to MapArray");
    Self::hash_list_array(
        // MapArray derefs to its inner ListArray representation
        // We may need to access the underlying storage
        ...,
        field.data_type(),
        digest,
    );
}
```

**Note:** The exact API depends on Arrow's `MapArray` internals.
May need to construct a `LargeListArray` from the Map's offsets and entries struct. Check `arrow::array::MapArray` API.

**Tests:**
- Simple `Map` with 2 rows
- Nullable Map with null entries
- Verify `Map` ≠ `LargeList<Struct<key, value>>` (different schema hashes)
- Byte-level test

---

### 1.6 Add multi-word validity bitmap test

**Current state:** All existing tests use arrays with ≤ 8 elements, so validity bitmaps always fit in a single `u8` word. No test verifies correct behavior across word boundaries.

**Implementation:** No code change needed — just add tests.

**Tests:**
- Array with 9 elements (null at position 8 → triggers second u8 word)
- Array with 16 elements (nulls spanning exactly 2 full words)
- Array with 20 elements (partial third word, verifying zero-padding of unused high bits)
- All three as byte-level tests in `digest_bytes.rs` to verify exact word serialization

---

## Tier 2 — Robustness

### 2.1 Implement `Null` type

**Current state:** `todo!()` in `array_digest_update` for `DataType::Null`.

**Design decision:** A `Null` column has no data — every element is null. The only information to hash is the validity bitmap (all zeros) and the count.

**Implementation:**
```rust
DataType::Null => {
    // Null type: no data bytes. Only push null bits (all false).
    if let Some(ref mut null_bits) = digest.null_bits {
        null_bits.extend(repeat_n(false, effective_array.len()));
    }
    // No data to feed into digest.data — intentionally empty.
}
```

**Tests:**
- `NullArray` with 3 elements via hash_array
- Nullable vs non-nullable Null column in record batch
- Byte-level test: verify only validity bits (all 0s) and empty data digest

---

### 2.2 Add nullable list element tests

**Current state:** No test creates a `LargeListArray` where some list entries themselves are NULL (not list *values* being null, but entire list entries absent).
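
For the `[[1,2], NULL, [3]]` case, the spec routes bytes into three separated streams. A std-only sketch (the helper names are illustrative; assumptions: validity bits pack LSB-first into `u8` words prefixed by the bit count as u64 LE, structural sizes are u64 LE, leaf data is i32 LE):

```rust
/// Bit-count prefix (u64 LE) followed by LSB-first packed u8 words.
fn validity_stream(bits: &[bool]) -> Vec<u8> {
    let mut out = (bits.len() as u64).to_le_bytes().to_vec();
    for chunk in bits.chunks(8) {
        let mut word = 0u8;
        for (i, &bit) in chunk.iter().enumerate() {
            if bit {
                word |= 1 << i;
            }
        }
        out.push(word);
    }
    out
}

/// One u64 LE length per NON-null row, plus the flattened i32 LE values.
fn list_streams(rows: &[Option<&[i32]>]) -> (Vec<u8>, Vec<u8>) {
    let mut structural = Vec::new();
    let mut data = Vec::new();
    for row in rows.iter().flatten() {
        structural.extend_from_slice(&(row.len() as u64).to_le_bytes());
        for v in *row {
            data.extend_from_slice(&v.to_le_bytes());
        }
    }
    (structural, data)
}

fn main() {
    let rows: [Option<&[i32]>; 3] = [Some(&[1, 2]), None, Some(&[3])];
    let validity: Vec<bool> = rows.iter().map(Option::is_some).collect();

    // validity = [1, 0, 1] -> count 3 (u64 LE), then one word 0b0000_0101.
    assert_eq!(validity_stream(&validity), vec![3, 0, 0, 0, 0, 0, 0, 0, 5]);

    let (structural, data) = list_streams(&rows);
    // The null row contributes neither a size nor data: only sizes 2 and 1.
    assert_eq!(structural.len(), 16);
    // The data stream holds only [1, 2, 3].
    assert_eq!(data, vec![1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0]);
}
```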

**Tests:**
- `LargeList` with data `[[1,2], NULL, [3]]` — verify null list entry is skipped (no structural size, no data)
- Byte-level test verifying exact bytes: validity = `[1, 0, 1]`, structural receives only 2 sizes, data receives only `[1,2,3]`

---

### 2.3 Document metadata exclusion in spec

**Current state:** Arrow Field/Schema metadata (`HashMap<String, String>`) is silently ignored. `normalize_field()` drops metadata. This is correct but undocumented.

**Changes:**
- Add to `docs/byte-layout-spec.md` Section 2.1: "Arrow field metadata and schema metadata are **excluded** from the hash. Only field names, data types (recursively), and nullability are included. This means two schemas that differ only in metadata produce identical hashes."
- Add a test: two schemas identical except for metadata → same hash

---

### 2.4 Add property-based test: column reorder invariance

**Current state:** Column order independence is tested with 2 fixed examples. A property test would strengthen this.

**Design decision:** Use `proptest` or `quickcheck` crate? **Recommend `proptest`** — more flexible, better shrinking.

**Tests:**
- Generate random schemas with 2-10 fields of supported types
- Generate random data matching schema
- Shuffle column order → hash must be identical
- This would also serve as a crash test for unsupported types (should not panic for supported types)

**Note:** This is a `dev-dependency` addition. Keep it behind a feature flag if desired.

---

## Tier 3 — Completeness

### 3.1 Implement `Union` types (Dense and Sparse)

**Current state:** `todo!()` in `array_digest_update` for `DataType::Union`.

**Design decision — This is the hardest type to hash correctly:**

A Union contains multiple child arrays and a type_ids buffer that says which child each row comes from. Dense unions also have an offsets buffer.

Options:
- **(A) Resolve to concrete values:** For each row, look up the active child + offset, extract the value, hash it. This is like dictionary resolution. Simple but loses the "which variant" information.
- **(B) Hash type_ids + child data separately:** Feed `type_ids` as a fixed-size array, then hash each child independently. This preserves variant identity.
- **(C) Hash compositely:** For each row, hash `(type_id, value_bytes)`. This is the most collision-resistant.

**Recommended: (C)** — hash `type_id` byte followed by value bytes for each row. This ensures that a union value `Int32(5)` hashes differently from `Float32(5.0)` even if they happen to have similar byte representations.

**Implementation sketch:**
```rust
DataType::Union(fields, mode) => {
    let union_array = effective_array.as_any()
        .downcast_ref::<UnionArray>()
        .expect("Failed to downcast to UnionArray");
    for i in 0..union_array.len() {
        let type_id = union_array.type_id(i);
        digest.data_mut().update(type_id.to_le_bytes());
        let child = union_array.value(i);
        // Hash the single-element child value
        // Need a way to hash a single scalar — possibly slice the child array
        ...
    }
}
```

**Complexity:** High. Union hashing requires per-element dispatch. Defer if not needed for initial production use.

**Tests:**
- SparseUnion with Int32 and Utf8 children
- DenseUnion with nulls (if Union supports nulls — it depends on Arrow version)
- Byte-level test

---

### 3.2 Implement `RunEndEncoded`

**Current state:** `todo!()` in `array_digest_update` for `DataType::RunEndEncoded`.

**Design decision:** RunEndEncoded is a compression format. Like Dictionary, the logical values are what matter.

**Recommended:** Resolve/decode to the plain array equivalent and hash that. Arrow should support `cast()` from REE to plain arrays.
+ +```rust +DataType::RunEndEncoded(_, values_field) => { + let plain = cast(effective_array, values_field.data_type()) + .expect("Failed to decode RunEndEncoded"); + Self::array_digest_update(values_field.data_type(), plain.as_ref(), digest); +} +``` + +**Design decision:** Should REE normalize in the schema? **Recommended: Yes** — normalize `RunEndEncoded(run_ends, values)` → `normalize_data_type(values.data_type())`. This treats REE as a pure encoding optimization, like Dictionary. + +**Tests:** +- REE Int32 array hashes same as plain Int32 array +- REE with runs of different lengths + +--- + +### 3.3 Implement View types (`BinaryView`, `Utf8View`) + +**Current state:** `todo!()` at lines 533, 541. + +**Implementation:** View types are logically equivalent to their non-view counterparts. Normalize in both schema and data: + +**Schema normalization** (add to `normalize_data_type`): +```rust +DataType::Utf8View => DataType::LargeUtf8, +DataType::BinaryView => DataType::LargeBinary, +``` + +**Data hashing** (add to normalization block at top of `array_digest_update`): +```rust +DataType::Utf8View => { + normalized_type = DataType::LargeUtf8; + cast_array = cast(array, &normalized_type).expect("Failed to cast Utf8View to LargeUtf8"); + (&normalized_type, cast_array.as_ref()) +} +DataType::BinaryView => { + normalized_type = DataType::LargeBinary; + cast_array = cast(array, &normalized_type).expect("Failed to cast BinaryView to LargeBinary"); + (&normalized_type, cast_array.as_ref()) +} +``` + +**Tests:** +- `Utf8View ["hello"]` hashes same as `LargeUtf8 ["hello"]` +- `BinaryView` hashes same as `LargeBinary` +- Schema equivalence test + +--- + +### 3.4 Implement `ListView` / `LargeListView` + +**Current state:** `todo!()` at lines 542, 554. 

**Implementation:** Normalize to `LargeList` (same logical semantics, different physical layout):

**Schema normalization:**
```rust
DataType::ListView(field) | DataType::LargeListView(field) => {
    DataType::LargeList(Arc::new(normalize_field(field)))
}
```

**Data hashing:** Cast to `LargeList` at the normalization block in `array_digest_update`.

**Tests:**
- `ListView` hashes same as `LargeList`
- With nulls

---

### 3.5 Add fuzz testing for panic detection

**Implementation:** Add a fuzz target that generates random `RecordBatch` instances from random schemas (using only supported types) and ensures `hash_record_batch` never panics.

**Tool:** `cargo-fuzz` (libFuzzer-based); `cargo-afl` is an alternative.

**Scope:** Generate schemas with 1-20 fields, types drawn from supported set, 0-100 rows, random null patterns.

---

## Execution Order

Recommended implementation sequence (respecting dependencies):

1. **1.1–1.3** (Timestamp, Duration, Interval) — independent, trivial implementations
2. **1.6** (multi-word validity test) — test-only, no code changes
3. **2.1** (Null type) — trivial
4. **2.2** (nullable list test) — test-only
5. **2.3** (document metadata exclusion) — docs-only
6. **3.3** (View types) — simple normalization + cast
7. **3.4** (ListView) — simple normalization + cast
8. **1.4** (FixedSizeList) — needs design decision on normalization
9. **1.5** (Map) — moderate complexity, needs Arrow API exploration
10. **3.2** (RunEndEncoded) — needs design decision on normalization
11. **3.1** (Union) — highest complexity
12. **2.4** (property tests) — after all types implemented
13. **3.5** (fuzz testing) — after all types implemented

Items 1-7 can likely be done in a single PR. Items 8-11 may warrant individual PRs due to design decisions. Items 12-13 are infrastructure additions.
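
The collision-resistance argument behind option (C) for Union hashing (item 3.1) can be demonstrated without Arrow: an `Int32` and a `Float32` value can share identical little-endian bytes, and only the per-row `type_id` prefix keeps their hashed streams apart. A std-only sketch (`encode_row` and the type ids 0/1 are illustrative, not the crate's API):

```rust
// Sketch: per-row (type_id, value_bytes) encoding, as recommended in (C).
fn encode_row(type_id: i8, value_bytes: &[u8]) -> Vec<u8> {
    let mut out = type_id.to_le_bytes().to_vec();
    out.extend_from_slice(value_bytes);
    out
}

fn main() {
    // 5.0f32 has bit pattern 0x40A00000, i.e. the i32 value 1_084_227_584:
    // their little-endian byte representations are identical.
    let f = 5.0f32.to_le_bytes();
    let i = 1_084_227_584i32.to_le_bytes();
    assert_eq!(f, i);

    // Without the prefix the two union rows would feed identical bytes
    // into the digest; the type_id prefix separates them.
    assert_ne!(encode_row(0, &f), encode_row(1, &i));
}
```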
+ +--- + +## Python Bindings + +The Python interface should be provided via **PyO3 bindings** to the Rust library (not a parallel pure-Python implementation). This lives in the separate `nauticalab/starfix-python` repository. + +**TODO:** +- Configure PyO3/maturin build for the starfix crate +- Expose `ArrowDigester`, `hash_array`, `hash_record_batch`, `hash_table` to Python +- Use `arrow-rs` ↔ `pyarrow` interop via `arrow::pyarrow` feature or `pyo3-arrow` +- Publish to PyPI as `starfix` + +--- + +## Open Design Decisions Summary + +| # | Question | Recommendation | Impact | +|---|----------|---------------|--------| +| 1 | Should timezone strings be normalized (e.g., "UTC" == "Etc/UTC")? | **No** — document as known limitation | Low risk | +| 2 | Should `FixedSizeList` normalize to `LargeList`? | **No** — keep schema separate, use same data hashing logic (option C) | Affects schema equivalence | +| 3 | Should `Map` normalize to `LargeList`? | **No** — keep as distinct type | Affects schema equivalence | +| 4 | Should `RunEndEncoded` normalize to its value type? | **Yes** — treat as encoding optimization like Dictionary | Affects schema equivalence | +| 5 | Should View types normalize to Large equivalents? | **Yes** — `Utf8View`→`LargeUtf8`, etc. | Affects schema equivalence | +| 6 | How should Union be hashed? | **(C)** — type_id + value bytes per row | Affects hash format | +| 7 | Should metadata affect the hash? | **No** — current behavior is correct, just document it | Documentation only | diff --git a/src/arrow_digester_core.rs b/src/arrow_digester_core.rs index 5dde5a6..d834a99 100644 --- a/src/arrow_digester_core.rs +++ b/src/arrow_digester_core.rs @@ -3,41 +3,144 @@ clippy::todo, reason = "First iteration of code, will add proper error handling later. 
Allow for unsupported data types for now" )] -use std::{collections::BTreeMap, iter::repeat_n}; +use std::{collections::BTreeMap, iter::repeat_n, sync::Arc}; use arrow::{ array::{ - Array, BinaryArray, BooleanArray, GenericBinaryArray, GenericListArray, GenericStringArray, - LargeBinaryArray, LargeListArray, LargeStringArray, ListArray, OffsetSizeTrait, - RecordBatch, StringArray, StructArray, + make_array, Array, BooleanArray, GenericBinaryArray, GenericStringArray, LargeBinaryArray, + LargeListArray, LargeStringArray, OffsetSizeTrait, RecordBatch, StructArray, }, + buffer::NullBuffer, + compute::cast, datatypes::{DataType, Schema}, }; use arrow_schema::Field; use bitvec::prelude::*; use digest::Digest; -const NULL_BYTES: &[u8] = b"NULL"; - const DELIMITER_FOR_NESTED_FIELD: &str = "/"; #[derive(Clone)] -enum DigestBufferType { - NonNullable(D), - Nullable(BitVec, D), // Where first digest is for the bull bits, while the second is for the actual data +struct DigestBufferType { + null_bits: Option>, + structural: Option, + data: Option, +} + +impl DigestBufferType { + /// Create a buffer for a leaf field (data + optional `null_bits`). + fn new_data_only(nullable: bool) -> Self { + Self { + null_bits: nullable.then(BitVec::::new), + structural: None, + data: Some(D::new()), + } + } + + /// Create a buffer for a list-level-only entry (structural + optional `null_bits`, no data). + fn new_structural_only(nullable: bool) -> Self { + Self { + null_bits: nullable.then(BitVec::::new), + structural: Some(D::new()), + data: None, + } + } + + /// Create a buffer for a leaf that is itself a list type (structural + data + optional `null_bits`). + fn new_list_leaf(nullable: bool) -> Self { + Self { + null_bits: nullable.then(BitVec::::new), + structural: Some(D::new()), + data: Some(D::new()), + } + } + + /// Create a buffer for a column-level nullable entry (`null_bits` only). 
+ fn new_validity_only() -> Self { + Self { + null_bits: Some(BitVec::::new()), + structural: None, + data: None, + } + } + + /// Get a mutable reference to the data digest, panicking if absent. + #[expect(clippy::panic, reason = "Const fn cannot use expect/unwrap")] + const fn data_mut(&mut self) -> &mut D { + match &mut self.data { + Some(d) => d, + None => panic!("data digest not present on this entry"), + } + } +} + +/// Recursively normalize a `DataType` to its canonical large equivalent. +/// +/// - `Utf8` → `LargeUtf8` +/// - `Binary` → `LargeBinary` +/// - `List(field)` → `LargeList(normalized_field)` +/// - `Dictionary(_, value_type)` → `normalize_data_type(value_type)` +/// - `Struct`, `LargeList`, `FixedSizeList`, `Map` have their inner fields normalized recursively. +fn normalize_data_type(data_type: &DataType) -> DataType { + match data_type { + DataType::Utf8 => DataType::LargeUtf8, + DataType::Binary => DataType::LargeBinary, + DataType::List(field) | DataType::LargeList(field) => { + DataType::LargeList(Arc::new(normalize_field(field))) + } + DataType::Struct(fields) => DataType::Struct( + fields + .iter() + .map(|f| Arc::new(normalize_field(f))) + .collect(), + ), + DataType::FixedSizeList(field, size) => { + DataType::FixedSizeList(Arc::new(normalize_field(field)), *size) + } + DataType::Map(field, sorted) => DataType::Map(Arc::new(normalize_field(field)), *sorted), + DataType::Dictionary(_, value_type) => normalize_data_type(value_type), + other => other.clone(), + } +} + +/// Normalize a single field: keep name and nullability, normalize the data type recursively. +fn normalize_field(field: &Field) -> Field { + Field::new( + field.name(), + normalize_data_type(field.data_type()), + field.is_nullable(), + ) +} + +/// Normalize all fields in a schema to their canonical large equivalents. 
+fn normalize_schema(schema: &Schema) -> Schema { + Schema::new( + schema + .fields() + .iter() + .map(|f| Arc::new(normalize_field(f))) + .collect::>(), + ) } #[derive(Clone)] pub struct ArrowDigesterCore { - schema: Schema, schema_digest: Vec, + serialized_schema: String, fields_digest_buffer: BTreeMap>, } impl ArrowDigesterCore { - /// Create a new instance of `ArrowDigesterCore` with the schema which will be enforce through each update. - pub fn new(schema: Schema) -> Self { - // Hash the schema first + /// Create a new instance of `ArrowDigesterCore` with the schema, which will be enforced through each update. + #[expect( + clippy::shadow_reuse, + reason = "Intentional: shadow input with normalized version so all downstream code uses canonical types" + )] + pub fn new(schema: &Schema) -> Self { + // Normalize the schema so all internal state uses canonical large types + let schema = normalize_schema(schema); + + // Hash the normalized schema let schema_digest = Self::hash_schema(&schema); // Flatten all nested fields into a single map, this allows us to hash each field individually and efficiently @@ -46,96 +149,109 @@ impl ArrowDigesterCore { Self::extract_fields_name(field, "", &mut fields_digest_buffer); }); + let serialized_schema = Self::serialized_schema(&schema); + // Store it in the new struct for now Self { - schema, schema_digest, + serialized_schema, fields_digest_buffer, } } /// Hash a record batch and update the internal digests. 
pub fn update(&mut self, record_batch: &RecordBatch) { - // Verify schema matches assert!( - *record_batch.schema() == self.schema, + Self::serialized_schema(record_batch.schema().as_ref()) == self.serialized_schema, "Record batch schema does not match ArrowDigester schema" ); - // Iterate through each field and update its digest - self.fields_digest_buffer - .iter_mut() - .for_each(|(field_name, digest)| { - // Determine if field name is nested - let field_name_hierarchy = field_name - .split(DELIMITER_FOR_NESTED_FIELD) - .collect::>(); - - if field_name_hierarchy.len() == 1 { - Self::array_digest_update( - record_batch - .schema() - .field_with_name(field_name) - .expect("Failed to get field with name") - .data_type(), - record_batch - .column_by_name(field_name) - .expect("Failed to get column by name"), - digest, - ); - } else { - Self::update_nested_field( - &field_name_hierarchy, - 0, - record_batch - .column_by_name( - field_name_hierarchy - .first() - .expect("Failed to get field name at idx 0, list is empty!"), - ) - .expect("Failed to get column by name") - .as_any() - .downcast_ref::() - .expect("Failed to downcast to StructArray"), - digest, - ); - } - }); + let schema = record_batch.schema(); + for col_idx in 0..record_batch.num_columns() { + let field = schema.field(col_idx); + let array = record_batch.column(col_idx); + let path = field.name().to_owned(); + + Self::traverse_and_update( + field.data_type(), + field.is_nullable(), + array.as_ref(), + &path, + None, // no ancestor struct nulls at top level + &mut self.fields_digest_buffer, + ); + } } - /// Hash an array directly without needing to create an `ArrowDigester` instance on the user side - /// For hash array, we don't have a schema to hash, however we do have field data type. - /// So similar to schema, we will hash based on datatype to encode the metadata information into the digest..... + /// Hash an array directly without needing to create an `ArrowDigester` instance on the user side. 
+ /// Unlike full table hashing, we don't have a schema to hash; however, we do have the field data type. + /// Similar to schema hashing, we hash based on the data type to encode metadata information into the digest. + /// + /// Uses the same recursive decomposition as the record-batch path so that data hashing + /// is consistent regardless of which API is used. /// /// # Panics /// /// This function will panic if JSON serialization of the data type fails. /// pub fn hash_array(array: &dyn Array) -> Vec { + // Resolve dictionary arrays to their plain value type + let (effective_type, resolved_array); + let effective_array: &dyn Array = + if let DataType::Dictionary(_, value_type) = array.data_type() { + resolved_array = cast(array, value_type.as_ref()) + .expect("Failed to cast dictionary to plain array"); + effective_type = value_type.as_ref().clone(); + resolved_array.as_ref() + } else { + effective_type = array.data_type().clone(); + array + }; + + // Normalize to canonical large types + let normalized_type = normalize_data_type(&effective_type); + let mut final_digest = D::new(); - let data_type_serialized = serde_json::to_string(&array.data_type()) + // Use canonical type serialization for metadata (data_type_to_value also normalizes, + // but we pass the already-normalized type for consistency) + let canonical_type = Self::data_type_to_value(&normalized_type); + let data_type_serialized = serde_json::to_string(&canonical_type) .expect("Failed to serialize data type to string"); - // Update the digest buffer with the array metadata and field data + // Update the digest with array metadata final_digest.update(data_type_serialized); - // Now we update it with the actual array data - let mut digest_buffer = if array.is_nullable() { - DigestBufferType::Nullable(BitVec::new(), D::new()) - } else { - DigestBufferType::NonNullable(D::new()) - }; - Self::array_digest_update(array.data_type(), array, &mut digest_buffer); - Self::finalize_digest(&mut final_digest, 
digest_buffer); + // Build BTreeMap entries from the type tree (same decomposition as record-batch path) + let mut fields = BTreeMap::new(); + Self::extract_type_entries( + &effective_type, + effective_array.is_nullable(), + "", + &mut fields, + ); + + // Traverse and populate entries + Self::traverse_and_update( + &effective_type, + effective_array.is_nullable(), + effective_array, + "", + None, + &mut fields, + ); + + // Finalize all entries into the digest (same order as record-batch finalize) + for (_, digest) in fields { + Self::finalize_digest(&mut final_digest, digest); + } - // Finalize and return the digest final_digest.finalize().to_vec() } - /// Hash record batch directly without needing to create an `ArrowDigester` instance on the user side. + /// Hash a record batch directly without needing to create an `ArrowDigester` instance on the user side. pub fn hash_record_batch(record_batch: &RecordBatch) -> Vec { - let mut digester = Self::new(record_batch.schema().as_ref().clone()); + let mut digester = Self::new(record_batch.schema().as_ref()); digester.update(record_batch); digester.finalize() } @@ -157,28 +273,27 @@ impl ArrowDigesterCore { final_digest.finalize().to_vec() } - #[expect( - clippy::big_endian_bytes, - reason = "Use for bit packing the null_bit_values" - )] /// Finalize a single field digest into the final digest. - /// Helpers to reduce code duplication. + /// Helper to reduce code duplication. 
fn finalize_digest(final_digest: &mut D, digest: DigestBufferType) { - match digest { - DigestBufferType::NonNullable(data_digest) => { - final_digest.update(data_digest.finalize()); - } - DigestBufferType::Nullable(null_bit_digest, data_digest) => { - final_digest.update(null_bit_digest.len().to_le_bytes()); - for &word in null_bit_digest.as_raw_slice() { - final_digest.update(word.to_be_bytes()); - } - final_digest.update(data_digest.finalize()); + // Null bits first (if nullable) + if let Some(null_bit_vec) = &digest.null_bits { + final_digest.update((null_bit_vec.len() as u64).to_le_bytes()); + for &word in null_bit_vec.as_raw_slice() { + final_digest.update(word.to_le_bytes()); } } + // Structural digest (if list type) — sizes separated from leaf data + if let Some(structural) = digest.structural { + final_digest.update(structural.finalize()); + } + // Data/leaf digest (if present) + if let Some(data) = digest.data { + final_digest.update(data.finalize()); + } } - /// Serialize the schema into a `BTreeMap` for field name and its digest. + /// Serialize the schema into a canonical JSON string keyed by field name. /// /// # Panics /// This function will panic if JSON serialization of the schema fails. @@ -200,34 +315,41 @@ impl ArrowDigesterCore { /// Convert a `DataType` to a JSON value, recursively converting any inner `Field` /// references to only include `name`, `data_type`, and `nullable`. + /// + /// Types are first normalized via `normalize_data_type` (Utf8→LargeUtf8, Binary→LargeBinary, + /// List→LargeList, Dictionary→value type) so the JSON always reflects canonical forms. 
fn data_type_to_value(data_type: &DataType) -> serde_json::Value { - match data_type { + // Normalize first so all downstream serialization uses canonical types + let canonical = normalize_data_type(data_type); + let value = match &canonical { DataType::Struct(fields) => { - let fields_json: Vec = fields + let mut sorted_fields: Vec<_> = fields.iter().collect(); + sorted_fields.sort_by_key(|f| f.name().clone()); + let fields_json: Vec = sorted_fields .iter() .map(|f| Self::inner_field_to_value(f)) .collect(); serde_json::json!({ "Struct": fields_json }) } - DataType::List(field) => { - serde_json::json!({ "List": Self::inner_field_to_value(field) }) - } + // After normalization, all list types are LargeList DataType::LargeList(field) => { - serde_json::json!({ "LargeList": Self::inner_field_to_value(field) }) + serde_json::json!({ "LargeList": Self::element_type_to_value(field) }) } DataType::FixedSizeList(field, size) => { - serde_json::json!({ "FixedSizeList": [Self::inner_field_to_value(field), size] }) + serde_json::json!({ "FixedSizeList": [Self::element_type_to_value(field), size] }) } DataType::Map(field, sorted) => { serde_json::json!({ "Map": [Self::inner_field_to_value(field), sorted] }) } - // For all non-nested types, Arrow's default serde is sufficient + // For all non-nested types (including LargeUtf8, LargeBinary after normalization), + // Arrow's default serde is sufficient other => serde_json::to_value(other).expect("Failed to serialize data type"), - } + }; + Self::sort_json_value(value) } - /// Convert an inner field (e.g., list item, struct child) to a JSON value - /// with only `name`, `data_type`, and `nullable`. + /// Convert an inner field (e.g., struct child) to a JSON value + /// with `name`, `data_type`, and `nullable`. 
fn inner_field_to_value(field: &Field) -> serde_json::Value { serde_json::json!({ "name": field.name(), @@ -236,6 +358,15 @@ impl ArrowDigesterCore { }) } + /// Convert a container element field (e.g., list item) to a JSON value + /// with only `data_type` and `nullable`, omitting the Arrow-internal field name. + fn element_type_to_value(field: &Field) -> serde_json::Value { + serde_json::json!({ + "data_type": Self::data_type_to_value(field.data_type()), + "nullable": field.is_nullable(), + }) + } + /// Recursively sort all JSON object keys for deterministic serialization. fn sort_json_value(value: serde_json::Value) -> serde_json::Value { match value { @@ -255,191 +386,407 @@ impl ArrowDigesterCore { } } - /// Serialize the schema into a `BTreeMap` for field name and its digest. + /// Hash the schema by serializing it to a canonical JSON string and computing its digest. pub fn hash_schema(schema: &Schema) -> Vec { // Hash the entire thing to the digest D::digest(Self::serialized_schema(schema)).to_vec() } - /// Recursive function to update nested field digests (structs within structs). - fn update_nested_field( - field_name_hierarchy: &[&str], - current_level: usize, - array: &StructArray, - digest: &mut DigestBufferType, + /// Top-down recursive traversal that routes data to `BTreeMap` entries. 
+ fn traverse_and_update( + data_type: &DataType, + nullable: bool, + array: &dyn Array, + path: &str, + ancestor_struct_nulls: Option<&NullBuffer>, + fields: &mut BTreeMap>, ) { - let current_level_plus_one = current_level - .checked_add(1) - .expect("Field nesting level overflow"); - - if field_name_hierarchy - .len() - .checked_sub(1) - .expect("field_name_hierarchy underflow") - == current_level_plus_one - { - let array_data = array - .column_by_name( - field_name_hierarchy - .last() - .expect("Failed to get field name at idx 0, list is empty!"), - ) - .expect("Failed to get column by name"); - // Base case, it should be a non-struct field - Self::array_digest_update(array_data.data_type(), array_data.as_ref(), digest); + // Normalize small variants + let (normalized_type, cast_array); + let (effective_type, effective_array): (&DataType, &dyn Array) = match data_type { + DataType::Utf8 => { + normalized_type = DataType::LargeUtf8; + cast_array = cast(array, &normalized_type).expect("cast Utf8"); + (&normalized_type, cast_array.as_ref()) + } + DataType::Binary => { + normalized_type = DataType::LargeBinary; + cast_array = cast(array, &normalized_type).expect("cast Binary"); + (&normalized_type, cast_array.as_ref()) + } + DataType::List(field) => { + normalized_type = DataType::LargeList(Arc::clone(field)); + cast_array = cast(array, &normalized_type).expect("cast List"); + (&normalized_type, cast_array.as_ref()) + } + DataType::Dictionary(_, value_type) => { + cast_array = cast(array, value_type.as_ref()).expect("cast Dict"); + (value_type.as_ref(), cast_array.as_ref()) + } + _ => (data_type, array), + }; + + let canonical = normalize_data_type(effective_type); + + match &canonical { + DataType::LargeList(value_field) => { + Self::traverse_list( + effective_array, + value_field, + nullable, + path, + ancestor_struct_nulls, + fields, + ); + } + DataType::Struct(struct_fields) => { + Self::traverse_struct( + effective_array, + struct_fields, + nullable, + path, + 
ancestor_struct_nulls, + fields, + ); + } + _ => { + Self::traverse_leaf( + effective_type, + effective_array, + path, + ancestor_struct_nulls, + fields, + ); + } + } + } + + fn traverse_list( + array: &dyn Array, + value_field: &Field, + nullable: bool, + path: &str, + ancestor_struct_nulls: Option<&NullBuffer>, + fields: &mut BTreeMap>, + ) { + let list_array = array + .as_any() + .downcast_ref::() + .expect("downcast to LargeListArray"); + + // If the field is nullable, record column/field-level validity at `path` + if nullable { + if let Some(entry) = fields.get_mut(path) { + if let Some(ref mut null_bits) = entry.null_bits { + let effective_nulls = + Self::combine_nulls(list_array.nulls(), ancestor_struct_nulls); + match &effective_nulls { + Some(nb) => { + for i in 0..list_array.len() { + null_bits.push(nb.is_valid(i)); + } + } + None => null_bits.extend(repeat_n(true, list_array.len())), + } + } + } + } + + let list_path = format!("{path}{DELIMITER_FOR_NESTED_FIELD}"); + + // Determine effective null buffer (field null AND ancestor struct null) + let effective_nulls = Self::combine_nulls(list_array.nulls(), ancestor_struct_nulls); + + // For each row, write structural info and recurse into non-null elements + for i in 0..list_array.len() { + let is_valid = effective_nulls.as_ref().is_none_or(|nb| nb.is_valid(i)); + if is_valid { + let sub_array = list_array.value(i); + let sub_len = sub_array.len() as u64; + + // Write list length to structural digest at list_path + if let Some(entry) = fields.get_mut(&list_path) { + if let Some(ref mut structural) = entry.structural { + structural.update(sub_len.to_le_bytes()); + } + } + + // Recurse into the sub-array using the ORIGINAL value type + // (not canonical) so traverse_and_update can normalize internally. 
+                let original_value_type = sub_array.data_type();
+                Self::traverse_and_update(
+                    original_value_type,
+                    value_field.is_nullable(),
+                    sub_array.as_ref(),
+                    &list_path,
+                    None, // list elements don't have ancestor struct nulls
+                    fields,
+                );
+            }
+        }
+    }
+
+    fn traverse_struct(
+        array: &dyn Array,
+        _struct_fields: &arrow_schema::Fields,
+        nullable: bool,
+        path: &str,
+        ancestor_struct_nulls: Option<&NullBuffer>,
+        fields: &mut BTreeMap<String, DigestBufferType<D>>,
+    ) {
+        let struct_array = array
+            .as_any()
+            .downcast_ref::<StructArray>()
+            .expect("downcast to StructArray");
+
+        // Combine struct's own nulls with ancestor nulls (AND propagation)
+        let combined_nulls = if nullable {
+            Self::combine_nulls(struct_array.nulls(), ancestor_struct_nulls)
         } else {
-            // Recursive case, it should be a struct field
-            let next_array = array
-                .column_by_name(
-                    field_name_hierarchy
-                        .get(current_level_plus_one)
-                        .expect("Failed to get field name at current level"),
-                )
-                .expect("Failed to get column by name")
-                .as_any()
-                .downcast_ref::<StructArray>()
-                .expect("Failed to downcast to StructArray");
-
-            Self::update_nested_field(
-                field_name_hierarchy,
-                current_level_plus_one,
-                next_array,
-                digest,
+            ancestor_struct_nulls.cloned()
+        };
+
+        // Use the ORIGINAL struct array's fields (not the canonical ones from
+        // the type tree) so that data_type matches the actual child array.
+        // traverse_and_update will normalize types internally.
+        let original_fields = struct_array.fields();
+        let mut sorted_children: Vec<(usize, &Field)> = original_fields
+            .iter()
+            .enumerate()
+            .map(|(i, f)| (i, f.as_ref()))
+            .collect();
+        sorted_children.sort_by_key(|(_, f)| f.name().clone());
+
+        for (idx, child_field) in sorted_children {
+            let child_array = struct_array.column(idx);
+            let child_path = Self::construct_field_name_hierarchy(path, child_field.name());
+
+            Self::traverse_and_update(
+                child_field.data_type(),
+                child_field.is_nullable(),
+                child_array.as_ref(),
+                &child_path,
+                combined_nulls.as_ref(),
+                fields,
             );
         }
     }
+    fn traverse_leaf(
+        data_type: &DataType,
+        array: &dyn Array,
+        path: &str,
+        ancestor_struct_nulls: Option<&NullBuffer>,
+        fields: &mut BTreeMap<String, DigestBufferType<D>>,
+    ) {
+        let entry = fields
+            .get_mut(path)
+            .expect("entry must exist for leaf path");
+
+        // Compute effective validity (own nulls AND ancestor struct nulls)
+        let effective_nulls = Self::combine_nulls(array.nulls(), ancestor_struct_nulls);
+
+        // Handle null_bits
+        if let Some(ref mut null_bits) = entry.null_bits {
+            match &effective_nulls {
+                Some(nb) => {
+                    for i in 0..array.len() {
+                        null_bits.push(nb.is_valid(i));
+                    }
+                }
+                None => null_bits.extend(repeat_n(true, array.len())),
+            }
+        }
+
+        // Hash leaf data with combined null buffer
+        if let Some(effective) = &effective_nulls {
+            let child_data = array.to_data();
+            let null_count = effective.null_count();
+            let new_data = child_data
+                .into_builder()
+                .null_count(null_count)
+                .null_bit_buffer(Some(effective.clone().into_inner().into_inner()))
+                .build()
+                .expect("rebuild array with combined null buffer");
+            let combined_array = make_array(new_data);
+            Self::hash_leaf_data(data_type, combined_array.as_ref(), entry);
+        } else {
+            Self::hash_leaf_data(data_type, array, entry);
+        }
+    }
+
+    /// Hash leaf data into the entry's data digest, without modifying `null_bits`
+    /// (which are already handled by `traverse_leaf`).
+    fn hash_leaf_data(data_type: &DataType, array: &dyn Array, entry: &mut DigestBufferType<D>) {
+        // Save and restore null_bits so array_digest_update's handle_null_bits
+        // pushes don't pollute the real null_bits (which traverse_leaf manages).
+        // We keep null_bits in place during the call so hash functions use
+        // the null-aware code path (checking array.nulls() to skip null values).
+        let saved = entry.null_bits.take();
+        // Put a temporary empty bitvec so hash functions use the null-aware path
+        // when the array actually has nulls
+        if array.nulls().is_some() {
+            entry.null_bits = Some(BitVec::<u8, Lsb0>::new());
+        }
+        Self::array_digest_update(data_type, array, entry);
+        // Restore the real null_bits
+        entry.null_bits = saved;
+    }
+
+    fn combine_nulls(
+        own_nulls: Option<&NullBuffer>,
+        ancestor_nulls: Option<&NullBuffer>,
+    ) -> Option<NullBuffer> {
+        match (own_nulls, ancestor_nulls) {
+            (Some(own), Some(ancestor)) => Some(NullBuffer::new(own.inner() & ancestor.inner())),
+            (Some(own), None) => Some(own.clone()),
+            (None, Some(ancestor)) => Some(ancestor.clone()),
+            (None, None) => None,
+        }
+    }
+
     #[expect(
         clippy::too_many_lines,
         reason = "Comprehensive match on all data types"
     )]
+    #[expect(
+        clippy::unreachable,
+        reason = "Small types are normalized to large equivalents; List/Struct are handled by traverse_and_update"
+    )]
     fn array_digest_update(
         data_type: &DataType,
         array: &dyn Array,
         digest: &mut DigestBufferType<D>,
     ) {
-        match data_type {
+        // Normalize small variants to their large equivalents so every code path
+        // goes through a single canonical representation. The cast only widens
+        // offsets (i32 → i64). These variables extend the lifetime of cast
+        // results. They are only initialized (and read) in branches that perform
+        // a cast; the default branch never touches them.
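The AND-propagation performed by `combine_nulls` can be sanity-checked with plain `Vec<bool>` standing in for Arrow's `NullBuffer` — a simplified model, not the crate's API:

```rust
// Simplified model of combine_nulls: None means "no null buffer, all valid";
// otherwise a slot is valid only if BOTH bitmaps say it is.
fn combine(own: Option<&[bool]>, ancestor: Option<&[bool]>) -> Option<Vec<bool>> {
    match (own, ancestor) {
        (Some(o), Some(a)) => Some(o.iter().zip(a).map(|(x, y)| *x && *y).collect()),
        (Some(o), None) => Some(o.to_vec()),
        (None, Some(a)) => Some(a.to_vec()),
        (None, None) => None,
    }
}

fn main() {
    // Row 1 is null at the ancestor struct level, so the child value at row 1
    // is treated as null even though its own bitmap marks it valid.
    let combined = combine(Some(&[true, true, false]), Some(&[true, false, true]));
    assert_eq!(combined, Some(vec![true, false, false]));
    // Neither side has a bitmap: every slot stays valid.
    assert!(combine(None, None).is_none());
}
```

This is why a struct-level null forces the leaf's `null_bits` entry to record `false` regardless of the child array's own validity.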
+        let (normalized_type, cast_array);
+        let (effective_type, effective_array): (&DataType, &dyn Array) = match data_type {
+            DataType::Utf8 => {
+                normalized_type = DataType::LargeUtf8;
+                cast_array =
+                    cast(array, &normalized_type).expect("Failed to cast Utf8 to LargeUtf8");
+                (&normalized_type, cast_array.as_ref())
+            }
+            DataType::Binary => {
+                normalized_type = DataType::LargeBinary;
+                cast_array =
+                    cast(array, &normalized_type).expect("Failed to cast Binary to LargeBinary");
+                (&normalized_type, cast_array.as_ref())
+            }
+            DataType::List(field) => {
+                normalized_type = DataType::LargeList(Arc::clone(field));
+                cast_array =
+                    cast(array, &normalized_type).expect("Failed to cast List to LargeList");
+                (&normalized_type, cast_array.as_ref())
+            }
+            _ => (data_type, array),
+        };
+
+        match effective_type {
             DataType::Null => todo!(),
             DataType::Boolean => {
                 // Bool Array is stored a bit differently, so we can't use the standard fixed buffer approach
-                let bool_array = array
+                let bool_array = effective_array
                     .as_any()
                     .downcast_ref::<BooleanArray>()
                     .expect("Failed to downcast to BooleanArray");
-                match digest {
-                    DigestBufferType::NonNullable(data_digest) => {
-                        // We want to bit pack the boolean values into bytes for hashing
-                        let mut bit_vec = BitVec::<u8, Msb0>::with_capacity(bool_array.len());
-                        for i in 0..bool_array.len() {
+                if let Some(ref mut null_bits) = digest.null_bits {
+                    // Handle null bits first
+                    Self::handle_null_bits(bool_array, null_bits);
+
+                    // Handle the data — only valid bits
+                    let mut bit_vec = BitVec::<u8, Lsb0>::with_capacity(bool_array.len());
+                    for i in 0..bool_array.len() {
+                        if bool_array.is_valid(i) {
                             bit_vec.push(bool_array.value(i));
                         }
-
-                        data_digest.update(bit_vec.as_raw_slice());
                     }
-                    DigestBufferType::Nullable(null_bit_vec, data_digest) => {
-                        // Handle null bits first
-                        Self::handle_null_bits(bool_array, null_bit_vec);
-
-                        // Handle the data
-                        let mut bit_vec = BitVec::<u8, Msb0>::with_capacity(bool_array.len());
-                        for i in 0..bool_array.len() {
-                            // We only want the valid bits, for null we will discard from the hash since that is already capture by null_bits
-                            if bool_array.is_valid(i) {
-                                bit_vec.push(bool_array.value(i));
-                            }
-                        }
-                        data_digest.update(bit_vec.as_raw_slice());
+                    digest.data_mut().update(bit_vec.as_raw_slice());
+                } else {
+                    // Non-nullable: pack all boolean values
+                    let mut bit_vec = BitVec::<u8, Lsb0>::with_capacity(bool_array.len());
+                    for i in 0..bool_array.len() {
+                        bit_vec.push(bool_array.value(i));
                     }
+                    digest.data_mut().update(bit_vec.as_raw_slice());
                 }
             }
-            DataType::Int8 | DataType::UInt8 => Self::hash_fixed_size_array(array, digest, 1),
+            DataType::Int8 | DataType::UInt8 => {
+                Self::hash_fixed_size_array(effective_array, digest, 1);
+            }
             DataType::Int16 | DataType::UInt16 | DataType::Float16 => {
-                Self::hash_fixed_size_array(array, digest, 2);
+                Self::hash_fixed_size_array(effective_array, digest, 2);
             }
             DataType::Int32
             | DataType::UInt32
             | DataType::Float32
             | DataType::Date32
             | DataType::Decimal32(_, _) => {
-                Self::hash_fixed_size_array(array, digest, 4);
+                Self::hash_fixed_size_array(effective_array, digest, 4);
             }
             DataType::Int64
             | DataType::UInt64
             | DataType::Float64
             | DataType::Date64
             | DataType::Decimal64(_, _) => {
-                Self::hash_fixed_size_array(array, digest, 8);
+                Self::hash_fixed_size_array(effective_array, digest, 8);
             }
             DataType::Timestamp(_, _) => todo!(),
-            DataType::Time32(_) => Self::hash_fixed_size_array(array, digest, 4),
-            DataType::Time64(_) => Self::hash_fixed_size_array(array, digest, 8),
+            DataType::Time32(_) => Self::hash_fixed_size_array(effective_array, digest, 4),
+            DataType::Time64(_) => Self::hash_fixed_size_array(effective_array, digest, 8),
             DataType::Duration(_) => todo!(),
             DataType::Interval(_) => todo!(),
-            DataType::Binary => Self::hash_binary_array(
-                array
-                    .as_any()
-                    .downcast_ref::<BinaryArray>()
-                    .expect("Failed to downcast to BinaryArray"),
-                digest,
-            ),
+            // Small variants are normalized above — these arms are unreachable
+            DataType::Binary | DataType::Utf8 | DataType::List(_) => {
+
unreachable!("Normalized to Large variant at the top of array_digest_update") + } DataType::FixedSizeBinary(element_size) => { - Self::hash_fixed_size_array(array, digest, *element_size); + Self::hash_fixed_size_array(effective_array, digest, *element_size); } DataType::LargeBinary => Self::hash_binary_array( - array + effective_array .as_any() .downcast_ref::() .expect("Failed to downcast to LargeBinaryArray"), digest, ), DataType::BinaryView => todo!(), - DataType::Utf8 => Self::hash_string_array( - array - .as_any() - .downcast_ref::() - .expect("Failed to downcast to StringArray"), - digest, - ), DataType::LargeUtf8 => Self::hash_string_array( - array + effective_array .as_any() .downcast_ref::() .expect("Failed to downcast to LargeStringArray"), digest, ), DataType::Utf8View => todo!(), - DataType::List(field) => { - Self::hash_list_array( - array - .as_any() - .downcast_ref::() - .expect("Failed to downcast to ListArray"), - field.data_type(), - digest, - ); - } DataType::ListView(_) => todo!(), DataType::FixedSizeList(_, _) => todo!(), - DataType::LargeList(field) => { - Self::hash_list_array( - array - .as_any() - .downcast_ref::() - .expect("Failed to downcast to LargeListArray"), - field.data_type(), - digest, - ); + // List and Struct types are handled by the recursive decomposition path + // (traverse_and_update → traverse_list / traverse_struct). They should + // never reach array_digest_update directly. 
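The `Lsb0` bit-packing that the `Boolean` arm relies on (and that the tests below assert, e.g. `[true, false, true]` hashing as `0x05`) can be reproduced without the `bitvec` crate. An illustrative packer:

```rust
// Pack booleans least-significant-bit first, as BitVec<u8, Lsb0> does:
// element i lands in bit (i % 8) of byte (i / 8).
fn pack_lsb0(bits: &[bool]) -> Vec<u8> {
    let mut out = vec![0u8; bits.len().div_ceil(8)];
    for (i, &b) in bits.iter().enumerate() {
        if b {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}

fn main() {
    // Valid values of the nullable test column [true, None, false, true]
    // with the null skipped: [true, false, true] -> 0b0000_0101.
    assert_eq!(pack_lsb0(&[true, false, true]), vec![0x05]);
    // Non-nullable test column [false, true, false] -> 0b0000_0010.
    assert_eq!(pack_lsb0(&[false, true, false]), vec![0x02]);
}
```

Note the bit order is the opposite of the earlier `Msb0` behavior (where the same inputs packed to `0xA0` and `0x40`), which is why the expected digests in the tests changed.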
+ DataType::LargeList(_) | DataType::Struct(_) => { + unreachable!( + "List and Struct types are decomposed by traverse_and_update; \ + they should not reach array_digest_update" + ) } DataType::LargeListView(_) => todo!(), - DataType::Struct(_) => todo!(), DataType::Union(_, _) => todo!(), - DataType::Dictionary(_, _) => todo!(), + DataType::Dictionary(_, value_type) => { + let resolved = cast(effective_array, value_type.as_ref()) + .expect("Failed to cast dictionary to plain array"); + Self::array_digest_update(value_type.as_ref(), resolved.as_ref(), digest); + } DataType::Decimal128(_, _) => { - Self::hash_fixed_size_array(array, digest, 16); + Self::hash_fixed_size_array(effective_array, digest, 16); } DataType::Decimal256(_, _) => { - Self::hash_fixed_size_array(array, digest, 32); + Self::hash_fixed_size_array(effective_array, digest, 32); } DataType::Map(_, _) => todo!(), DataType::RunEndEncoded(_, _) => todo!(), @@ -455,55 +802,59 @@ impl ArrowDigesterCore { let array_data = array.to_data(); let element_size_usize = element_size as usize; - // Get the slice with offset accounted for if there is any + // Get the slice with offset and length accounted for + let start = array_data + .offset() + .checked_mul(element_size_usize) + .expect("Offset multiplication overflow"); + let end = start + .checked_add( + array_data + .len() + .checked_mul(element_size_usize) + .expect("Length multiplication overflow"), + ) + .expect("End position overflow"); let slice = array_data .buffers() .first() .expect("Unable to get first buffer to determine offset") .as_slice() - .get( - array_data - .offset() - .checked_mul(element_size_usize) - .expect("Offset multiplication overflow").., - ) + .get(start..end) .expect("Failed to get buffer slice for FixedSizeBinaryArray"); - match digest_buffer { - DigestBufferType::NonNullable(data_digest) => { - // No nulls, we can hash the entire buffer directly - data_digest.update(slice); - } - DigestBufferType::Nullable(null_bits, 
data_digest) => { - // Handle null bits first - Self::handle_null_bits(array, null_bits); - - match array_data.nulls() { - Some(null_buffer) => { - // There are nulls, so we need to incrementally hash each value - for i in 0..array_data.len() { - if null_buffer.is_valid(i) { - let data_pos = i - .checked_mul(element_size_usize) - .expect("Data position multiplication overflow"); - let end_pos = data_pos - .checked_add(element_size_usize) - .expect("End position addition overflow"); - - data_digest.update( - slice - .get(data_pos..end_pos) - .expect("Failed to get data_slice"), - ); - } + if let Some(ref mut null_bits) = digest_buffer.null_bits { + // Handle null bits first + Self::handle_null_bits(array, null_bits); + + match array_data.nulls() { + Some(null_buffer) => { + // There are nulls, so we need to incrementally hash each value + for i in 0..array_data.len() { + if null_buffer.is_valid(i) { + let data_pos = i + .checked_mul(element_size_usize) + .expect("Data position multiplication overflow"); + let end_pos = data_pos + .checked_add(element_size_usize) + .expect("End position addition overflow"); + + digest_buffer.data_mut().update( + slice + .get(data_pos..end_pos) + .expect("Failed to get data_slice"), + ); } } - None => { - // No nulls, we can hash the entire buffer directly - data_digest.update(slice); - } + } + None => { + // No nulls, we can hash the entire buffer directly + digest_buffer.data_mut().update(slice); } } + } else { + // No nulls, we can hash the entire buffer directly + digest_buffer.data_mut().update(slice); } } @@ -511,42 +862,16 @@ impl ArrowDigesterCore { array: &GenericBinaryArray, digest: &mut DigestBufferType, ) { - match digest { - DigestBufferType::NonNullable(data_digest) => { - for i in 0..array.len() { - let value = array.value(i); - data_digest.update(value.len().to_le_bytes()); - data_digest.update(value); - } - } - DigestBufferType::Nullable(null_bit_vec, data_digest) => { - // Deal with the null bits first - if let 
Some(null_buf) = array.nulls() { - // We would need to iterate through the null buffer and push it into the null_bit_vec - for i in 0..array.len() { - null_bit_vec.push(null_buf.is_valid(i)); - } - - for i in 0..array.len() { - if null_buf.is_valid(i) { - let value = array.value(i); - data_digest.update(value.len().to_le_bytes()); - data_digest.update(value); - } else { - data_digest.update(NULL_BYTES); - } - } - } else { - // All valid, therefore we can extend the bit vector with all true values - null_bit_vec.extend(repeat_n(true, array.len())); + if let Some(ref mut null_bits) = digest.null_bits { + Self::handle_null_bits(array, null_bits); + } - // Deal with the data - for i in 0..array.len() { - let value = array.value(i); - data_digest.update(value.len().to_le_bytes()); - data_digest.update(value); - } - } + let null_buf = array.nulls(); + for i in 0..array.len() { + if null_buf.is_none_or(|nb| nb.is_valid(i)) { + let value = array.value(i); + digest.data_mut().update((value.len() as u64).to_le_bytes()); + digest.data_mut().update(value); } } } @@ -555,112 +880,124 @@ impl ArrowDigesterCore { array: &GenericStringArray, digest: &mut DigestBufferType, ) { - match digest { - DigestBufferType::NonNullable(data_digest) => { - for i in 0..array.len() { - let value = array.value(i); - data_digest.update((value.len() as u64).to_le_bytes()); - data_digest.update(value.as_bytes()); - } - } - DigestBufferType::Nullable(null_bit_vec, data_digest) => { - // Deal with the null bits first - Self::handle_null_bits(array, null_bit_vec); - - match array.nulls() { - Some(null_buf) => { - for i in 0..array.len() { - if null_buf.is_valid(i) { - let value = array.value(i); - data_digest.update((value.len() as u64).to_le_bytes()); - data_digest.update(value.as_bytes()); - } else { - data_digest.update(NULL_BYTES); - } - } - } - None => { - for i in 0..array.len() { - let value = array.value(i); - data_digest.update((value.len() as u64).to_le_bytes()); - 
data_digest.update(value.as_bytes()); - } - } - } - } + if let Some(ref mut null_bits) = digest.null_bits { + Self::handle_null_bits(array, null_bits); } - } - fn hash_list_array( - array: &GenericListArray, - field_data_type: &DataType, - digest: &mut DigestBufferType, - ) { - match digest { - // Wildcard `_` avoids binding so `digest` remains usable below - DigestBufferType::NonNullable(_) => { - for i in 0..array.len() { - let sub = array.value(i); - // Prefix sub-array element count to prevent cross-boundary collisions. - // Without this [[1,2],[3]] and [[1],[2,3]] produce identical byte streams. - // sub.len() returns usize, avoiding the non-primitive OffsetSizeTrait cast. - Self::update_data_digest(digest, (sub.len() as u64).to_le_bytes()); - Self::array_digest_update(field_data_type, sub.as_ref(), digest); - } - } - DigestBufferType::Nullable(bit_vec, _) => { - // Deal with null bits first; NLL ends bit_vec borrow after this call - Self::handle_null_bits(array, bit_vec); - - match array.nulls() { - Some(null_buf) => { - for i in 0..array.len() { - if null_buf.is_valid(i) { - let sub = array.value(i); - Self::update_data_digest(digest, (sub.len() as u64).to_le_bytes()); - Self::array_digest_update(field_data_type, sub.as_ref(), digest); - } - } - } - None => { - for i in 0..array.len() { - let sub = array.value(i); - Self::update_data_digest(digest, (sub.len() as u64).to_le_bytes()); - Self::array_digest_update(field_data_type, sub.as_ref(), digest); - } - } - } + let null_buf = array.nulls(); + for i in 0..array.len() { + if null_buf.is_none_or(|nb| nb.is_valid(i)) { + let value = array.value(i); + digest.data_mut().update((value.len() as u64).to_le_bytes()); + digest.data_mut().update(value.as_bytes()); } } } - /// Internal recursive function to extract field names from nested structs effectively flattening the schema. - /// The format is `parent__child__grandchild__etc`... for nested fields and will be stored in `fields_digest_buffer`. 
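The cross-boundary collision that the removed `hash_list_array` comment warned about ("[[1,2],[3]] and [[1],[2,3]] produce identical byte streams") is the reason every variable-length value above is written with a `u64` little-endian length prefix. A self-contained demonstration (illustrative helpers, not the crate's API):

```rust
// Serialize a list of byte strings two ways and show that only the
// length-prefixed form distinguishes different groupings of the same bytes.
fn concat_only(lists: &[&[u8]]) -> Vec<u8> {
    lists.concat()
}

fn length_prefixed(lists: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for l in lists {
        // u64 little-endian length prefix, as in the digester.
        out.extend_from_slice(&(l.len() as u64).to_le_bytes());
        out.extend_from_slice(l);
    }
    out
}

fn main() {
    let a: [&[u8]; 2] = [&[1, 2], &[3]];
    let b: [&[u8]; 2] = [&[1], &[2, 3]];
    // Naive concatenation collides: both flatten to [1, 2, 3]...
    assert_eq!(concat_only(&a), concat_only(&b));
    // ...while length prefixes keep the byte streams distinct.
    assert_ne!(length_prefixed(&a), length_prefixed(&b));
}
```

The same reasoning applies to strings and binary values: hashing `value.len()` before `value` makes `["ab", "c"]` and `["a", "bc"]` produce different digests.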
+ /// Recursively extract field entries from the type tree. + /// + /// - **List**: creates a structural-only entry at `path/`, then recurses into + /// the value type. If the column field is nullable, also creates a + /// validity-only entry at the field path (before the `/`). + /// - **Struct**: transparent — recurses into each child field with `path/childname`. + /// No entry for the struct itself. Struct null propagation is handled at + /// traversal time. + /// - **Leaf (non-list, non-struct)**: creates a data entry at the current path. fn extract_fields_name( field: &Field, parent_field_name: &str, fields_digest_buffer: &mut BTreeMap>, ) { - // Check if field is a nested type of struct - if let DataType::Struct(fields) = field.data_type() { - // We will add fields in alphabetical order - fields.into_iter().for_each(|field_inner| { - Self::extract_fields_name( - field_inner, - Self::construct_field_name_hierarchy(parent_field_name, field.name()).as_str(), - fields_digest_buffer, - ); - }); - } else { - // Base case, just add the the combine field name to the map - fields_digest_buffer.insert( - Self::construct_field_name_hierarchy(parent_field_name, field.name()), - if field.is_nullable() { - DigestBufferType::Nullable(BitVec::new(), D::new()) - } else { - DigestBufferType::NonNullable(D::new()) - }, - ); + let path = Self::construct_field_name_hierarchy(parent_field_name, field.name()); + Self::extract_type_entries( + field.data_type(), + field.is_nullable(), + &path, + fields_digest_buffer, + ); + } + + /// Core recursive type walker — creates `BTreeMap` entries based on the type tree. + /// + /// `nullable` reflects whether the current position is nullable (from the `Field`). 
+    fn extract_type_entries(
+        data_type: &DataType,
+        nullable: bool,
+        path: &str,
+        fields_digest_buffer: &mut BTreeMap<String, DigestBufferType<D>>,
+    ) {
+        let canonical = normalize_data_type(data_type);
+
+        match &canonical {
+            DataType::Struct(fields) => {
+                // Struct is transparent — no entry, just recurse into children.
+                for child_field in fields {
+                    let child_path = Self::construct_field_name_hierarchy(path, child_field.name());
+                    Self::extract_type_entries(
+                        child_field.data_type(),
+                        child_field.is_nullable(),
+                        &child_path,
+                        fields_digest_buffer,
+                    );
+                }
+            }
+            DataType::LargeList(value_field) | DataType::List(value_field) => {
+                // For a nullable field that is a list, create a validity-only entry
+                // at the field path (column-level or field-level null tracking).
+                if nullable {
+                    fields_digest_buffer
+                        .insert(path.to_owned(), DigestBufferType::new_validity_only());
+                }
+
+                // List level: create entry at path + "/"
+                let list_path = format!("{path}{DELIMITER_FOR_NESTED_FIELD}");
+                let inner_type = value_field.data_type();
+                let inner_canonical = normalize_data_type(inner_type);
+
+                match &inner_canonical {
+                    DataType::Struct(_) => {
+                        // List<Struct<...>>: list entry is structural-only,
+                        // struct children become separate entries
+                        fields_digest_buffer.insert(
+                            list_path.clone(),
+                            DigestBufferType::new_structural_only(value_field.is_nullable()),
+                        );
+                        // Recurse into the struct's children
+                        Self::extract_type_entries(
+                            inner_type,
+                            value_field.is_nullable(),
+                            &list_path,
+                            fields_digest_buffer,
+                        );
+                    }
+                    DataType::LargeList(_) | DataType::List(_) => {
+                        // List<List<...>>: list entry is structural-only,
+                        // recurse into the inner list
+                        fields_digest_buffer.insert(
+                            list_path.clone(),
+                            DigestBufferType::new_structural_only(value_field.is_nullable()),
+                        );
+                        Self::extract_type_entries(
+                            inner_type,
+                            value_field.is_nullable(),
+                            &list_path,
+                            fields_digest_buffer,
+                        );
+                    }
+                    _ => {
+                        // List<primitive>: list entry is both structural + data (leaf)
+                        fields_digest_buffer.insert(
+                            list_path,
+
DigestBufferType::new_list_leaf(value_field.is_nullable()), + ); + } + } + } + _ => { + // Leaf type (non-struct, non-list): create data entry + fields_digest_buffer + .insert(path.to_owned(), DigestBufferType::new_data_only(nullable)); + } } } @@ -672,15 +1009,7 @@ impl ArrowDigesterCore { } } - /// Write bytes directly into the data digest portion of the buffer, bypassing null-bit tracking. - /// Used to write length prefixes that sit in the data stream but are not nullable values. - fn update_data_digest(digest: &mut DigestBufferType, data: impl AsRef<[u8]>) { - match digest { - DigestBufferType::NonNullable(d) | DigestBufferType::Nullable(_, d) => d.update(data), - } - } - - fn handle_null_bits(array: &dyn Array, null_bit_vec: &mut BitVec) { + fn handle_null_bits(array: &dyn Array, null_bit_vec: &mut BitVec) { match array.nulls() { Some(null_buf) => { // We would need to iterate through the null buffer and push it into the null_bit_vec @@ -714,10 +1043,10 @@ mod tests { array::{ ArrayRef, BinaryArray, BooleanArray, Date32Array, Date64Array, Decimal128Array, Decimal32Array, FixedSizeBinaryBuilder, Float16Array, Float32Array, Float64Array, - Int16Array, Int32Array, Int64Array, Int8Array, LargeBinaryArray, LargeListBuilder, - LargeStringArray, ListBuilder, PrimitiveBuilder, RecordBatch, StringArray, StructArray, - Time32SecondArray, Time64MicrosecondArray, UInt16Array, UInt32Array, UInt64Array, - UInt8Array, + Int16Array, Int32Array, Int64Array, Int8Array, LargeBinaryArray, LargeListArray, + LargeListBuilder, LargeStringArray, ListBuilder, PrimitiveBuilder, RecordBatch, + StringArray, StructArray, Time32SecondArray, Time64MicrosecondArray, UInt16Array, + UInt32Array, UInt64Array, UInt8Array, }, datatypes::Int32Type, }; @@ -727,8 +1056,9 @@ mod tests { use pretty_assertions::assert_eq; use sha2::{Digest as _, Sha256}; - use crate::arrow_digester_core::{ArrowDigesterCore, DigestBufferType}; + use crate::arrow_digester_core::ArrowDigesterCore; use 
arrow::array::{Decimal256Array, Decimal64Array}; + use arrow::buffer::OffsetBuffer; use arrow_buffer::i256; #[expect( @@ -870,7 +1200,7 @@ mod tests { ), ]); - let mut digester = ArrowDigesterCore::::new(schema.clone()); + let mut digester = ArrowDigesterCore::::new(&schema); let field_names: Vec<&String> = digester.fields_digest_buffer.keys().collect(); assert_eq!(field_names.len(), 3); @@ -920,7 +1250,7 @@ mod tests { // Check the digest assert_eq!( encode(digester.finalize()), - "9841aab2dfeb637872d41422d33fca1e939f06b8fa0dcec66ff3782592cf9565" + "9b52ad7430dea81b35f14a04d828b2424080fbc210570081c6e6cb62b6566c42" ); } @@ -928,10 +1258,10 @@ mod tests { #[test] fn digest_bool_nullable_bytes() { - // [true, None, false, true] — valid values bit-packed Msb0, null skipped + // [true, None, false, true] — valid values bit-packed Lsb0, null skipped let array = BooleanArray::from(vec![Some(true), None, Some(false), Some(true)]); let schema = Schema::new(vec![Field::new("col", DataType::Boolean, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -944,11 +1274,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 4); assert!(null_bit_vec[0], "index 0 (true) should be valid"); @@ -956,10 +1284,10 @@ mod tests { assert!(null_bit_vec[2], "index 2 (false) should be valid"); assert!(null_bit_vec[3], "index 3 (true) should be valid"); - // Valid values [true, false, true] packed Msb0 into one byte: - // bit0=1, bit1=0, bit2=1 → 1010_0000 = 0xA0 + // Valid values [true, false, true] 
packed Lsb0 into one byte: + // bit0=1, bit1=0, bit2=1 → 0000_0101 = 0x05 let mut manual = Sha256::new(); - manual.update([0xA0_u8]); + manual.update([0x05_u8]); assert_eq!(data_digest.clone().finalize(), manual.finalize()); } @@ -968,7 +1296,7 @@ mod tests { // [false, true, false] — all values bit-packed, no nulls let array = BooleanArray::from(vec![false, true, false]); let schema = Schema::new(vec![Field::new("col", DataType::Boolean, false)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -981,14 +1309,13 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); - // [false, true, false] packed Msb0: bit0=0, bit1=1, bit2=0 → 0100_0000 = 0x40 + // [false, true, false] packed Lsb0: bit0=0, bit1=1, bit2=0 → 0000_0010 = 0x02 let mut manual = Sha256::new(); - manual.update([0x40_u8]); + manual.update([0x02_u8]); assert_eq!(data_digest.clone().finalize(), manual.finalize()); } @@ -999,7 +1326,7 @@ mod tests { // [10, None, -3] — valid bytes: 0x0A, 0xFD let array = Int8Array::from(vec![Some(10_i8), None, Some(-3_i8)]); let schema = Schema::new(vec![Field::new("col", DataType::Int8, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new("col", DataType::Int8, true)])), @@ -1008,11 +1335,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = 
&digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1030,7 +1355,7 @@ mod tests { // [1, 2, 255] let array = UInt8Array::from(vec![1_u8, 2_u8, 255_u8]); let schema = Schema::new(vec![Field::new("col", DataType::UInt8, false)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new("col", DataType::UInt8, false)])), @@ -1039,10 +1364,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); let mut manual = Sha256::new(); manual.update([0x01_u8, 0x02_u8, 0xFF_u8]); @@ -1058,7 +1382,7 @@ mod tests { // -512 LE = 00 fe let array = Int16Array::from(vec![Some(1000_i16), None, Some(-512_i16)]); let schema = Schema::new(vec![Field::new("col", DataType::Int16, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new("col", DataType::Int16, true)])), @@ -1067,11 +1391,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1089,7 +1411,7 @@ 
mod tests { // [100, 200, 65535] let array = UInt16Array::from(vec![100_u16, 200_u16, 0xFFFF_u16]); let schema = Schema::new(vec![Field::new("col", DataType::UInt16, false)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1102,10 +1424,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); let mut manual = Sha256::new(); manual.update(100_u16.to_le_bytes()); @@ -1125,7 +1446,7 @@ mod tests { half::f16::from_f32(-0.5), ]); let schema = Schema::new(vec![Field::new("col", DataType::Float16, false)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1138,10 +1459,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); let mut manual = Sha256::new(); manual.update(half::f16::from_f32(1.0).to_le_bytes()); @@ -1162,7 +1482,7 @@ mod tests { let schema = Schema::new(vec![Field::new("int32_col", DataType::Int32, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( @@ -1176,13 +1496,12 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = digester + let buf 
= digester .fields_digest_buffer .get("int32_col") - .expect("int32_col field should exist in digest buffer") - else { - panic!("Expected a Nullable digest buffer for int32_col"); - }; + .expect("int32_col field should exist in digest buffer"); + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); // The null bit vector should be [true, false, true, true] for [Some(42), None, Some(-7), Some(0)] assert_eq!(null_bit_vec.len(), 4); @@ -1208,7 +1527,7 @@ mod tests { // [0, None, u32::MAX] let array = UInt32Array::from(vec![Some(0_u32), None, Some(u32::MAX)]); let schema = Schema::new(vec![Field::new("col", DataType::UInt32, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new("col", DataType::UInt32, true)])), @@ -1217,11 +1536,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1243,7 +1560,7 @@ mod tests { // 2.5f32 LE: 00 00 20 40 let array = Float32Array::from(vec![Some(1.0_f32), None, Some(2.5_f32)]); let schema = Schema::new(vec![Field::new("col", DataType::Float32, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1256,11 +1573,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable 
buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1283,7 +1598,7 @@ mod tests { .with_precision_and_scale(9, 2) .unwrap(); let schema = Schema::new(vec![Field::new("col", DataType::Decimal32(9, 2), true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1296,11 +1611,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1320,7 +1633,7 @@ mod tests { .with_precision_and_scale(9, 2) .unwrap(); let schema = Schema::new(vec![Field::new("col", DataType::Decimal32(9, 2), false)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1333,10 +1646,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); let mut manual = Sha256::new(); manual.update(0_i32.to_le_bytes()); @@ -1352,7 +1664,7 @@ mod tests { // [i64::MIN, None, 9_876_543_210] let array = Int64Array::from(vec![Some(i64::MIN), None, 
Some(9_876_543_210_i64)]); let schema = Schema::new(vec![Field::new("col", DataType::Int64, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new("col", DataType::Int64, true)])), @@ -1361,11 +1673,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1383,7 +1693,7 @@ mod tests { // [0, None, u64::MAX] let array = UInt64Array::from(vec![Some(0_u64), None, Some(u64::MAX)]); let schema = Schema::new(vec![Field::new("col", DataType::UInt64, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new("col", DataType::UInt64, true)])), @@ -1392,11 +1702,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1416,7 +1724,7 @@ mod tests { // [1.0, -0.5, π] let array = Float64Array::from(vec![1.0_f64, -0.5_f64, f64::consts::PI]); let schema = Schema::new(vec![Field::new("col", DataType::Float64, false)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( 
&RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1429,10 +1737,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); let mut manual = Sha256::new(); manual.update(1.0_f64.to_le_bytes()); @@ -1451,7 +1758,7 @@ mod tests { .with_precision_and_scale(18, 3) .unwrap(); let schema = Schema::new(vec![Field::new("col", DataType::Decimal64(18, 3), true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1464,11 +1771,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1488,7 +1793,7 @@ mod tests { .with_precision_and_scale(18, 3) .unwrap(); let schema = Schema::new(vec![Field::new("col", DataType::Decimal64(18, 3), false)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1501,10 +1806,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = 
buf.data.as_ref().expect("Expected data digest"); let mut manual = Sha256::new(); manual.update(0_i64.to_le_bytes()); @@ -1520,7 +1824,7 @@ mod tests { // Days since Unix epoch: [0, None, 19000] let array = Date32Array::from(vec![Some(0_i32), None, Some(19000_i32)]); let schema = Schema::new(vec![Field::new("col", DataType::Date32, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new("col", DataType::Date32, true)])), @@ -1529,11 +1833,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1551,7 +1853,7 @@ mod tests { // Milliseconds since Unix epoch: [0, None, 1_000_000] let array = Date64Array::from(vec![Some(0_i64), None, Some(1_000_000_i64)]); let schema = Schema::new(vec![Field::new("col", DataType::Date64, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new("col", DataType::Date64, true)])), @@ -1560,11 +1862,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1588,7 +1888,7 @@ mod tests { 
DataType::Time32(TimeUnit::Second), true, )]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1601,11 +1901,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1627,7 +1925,7 @@ mod tests { DataType::Time64(TimeUnit::Microsecond), true, )]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1640,11 +1938,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1667,7 +1963,7 @@ mod tests { .with_precision_and_scale(38, 5) .unwrap(); let schema = Schema::new(vec![Field::new("col", DataType::Decimal128(38, 5), true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1680,11 +1976,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = 
&digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1711,7 +2005,7 @@ mod tests { .with_precision_and_scale(76, 10) .unwrap(); let schema = Schema::new(vec![Field::new("col", DataType::Decimal256(76, 10), true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1724,11 +2018,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1753,7 +2045,7 @@ mod tests { let array = builder.finish(); let schema = Schema::new(vec![Field::new("col", DataType::FixedSizeBinary(4), true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1766,11 +2058,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1789,11 +2079,11 @@ mod tests { #[test] fn digest_binary_nullable_bytes() { // [b"hello", None, b"world"] - // Valid entries: (length as 
usize LE) ++ bytes. - // Null entries contribute the sentinel b"NULL" to the data digest. + // Valid entries: (length as u64 LE) ++ bytes. + // Null entries are skipped entirely in the data digest. let array = BinaryArray::from(vec![Some(b"hello".as_ref()), None, Some(b"world".as_ref())]); let schema = Schema::new(vec![Field::new("col", DataType::Binary, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new("col", DataType::Binary, true)])), @@ -1802,11 +2092,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1814,10 +2102,10 @@ mod tests { assert!(null_bit_vec[2]); let mut manual = Sha256::new(); - manual.update(5_usize.to_le_bytes()); // len("hello") + manual.update(5_u64.to_le_bytes()); // len("hello") manual.update(b"hello"); - manual.update(b"NULL"); // null sentinel - manual.update(5_usize.to_le_bytes()); // len("world") + // null entry skipped — no sentinel bytes + manual.update(5_u64.to_le_bytes()); // len("world") manual.update(b"world"); assert_eq!(data_digest.clone().finalize(), manual.finalize()); } @@ -1827,7 +2115,7 @@ mod tests { // [b"ab", b"cde"] — all valid, length prefix is usize LE let array = LargeBinaryArray::from(vec![b"ab".as_ref(), b"cde".as_ref()]); let schema = Schema::new(vec![Field::new("col", DataType::LargeBinary, false)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( 
@@ -1840,15 +2128,14 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); let mut manual = Sha256::new(); - manual.update(2_usize.to_le_bytes()); + manual.update(2_u64.to_le_bytes()); manual.update(b"ab"); - manual.update(3_usize.to_le_bytes()); + manual.update(3_u64.to_le_bytes()); manual.update(b"cde"); assert_eq!(data_digest.clone().finalize(), manual.finalize()); } @@ -1859,10 +2146,10 @@ mod tests { fn digest_utf8_nullable_bytes() { // ["foo", None, "ba"] // Valid entries: (length as u64 LE) ++ UTF-8 bytes. - // Null entries contribute the sentinel b"NULL" to the data digest. + // Null entries are skipped entirely in the data digest. let array = StringArray::from(vec![Some("foo"), None, Some("ba")]); let schema = Schema::new(vec![Field::new("col", DataType::Utf8, true)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, true)])), @@ -1871,11 +2158,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1885,7 +2170,7 @@ mod tests { let mut manual = Sha256::new(); manual.update(3_u64.to_le_bytes()); // len("foo") manual.update(b"foo"); - manual.update(b"NULL"); // null sentinel + // null entry skipped — no sentinel bytes 
manual.update(2_u64.to_le_bytes()); // len("ba") manual.update(b"ba"); assert_eq!(data_digest.clone().finalize(), manual.finalize()); @@ -1896,7 +2181,7 @@ mod tests { // ["x", "yz"] — all valid, length prefix is u64 LE let array = LargeStringArray::from(vec!["x", "yz"]); let schema = Schema::new(vec![Field::new("col", DataType::LargeUtf8, false)]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1909,10 +2194,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); let mut manual = Sha256::new(); manual.update(1_u64.to_le_bytes()); @@ -1924,10 +2208,9 @@ mod tests { // ── List / LargeList ───────────────────────────────────── // - // Each outer element is prefixed by its inner element count (u64 LE), then the - // raw bytes of the inner array (no length limit — the implementation hashes from - // the element's offset to the end of the shared child buffer). - // Using a single outer element avoids buffer-bleed from preceding elements. + // With recursive decomposition, a non-nullable List column + // creates a single entry at "col/" (list_leaf) with structural (element counts), + // data (leaf values), and null_bits (item nullability). 
#[test] fn digest_list_non_nullable_bytes() { @@ -1945,7 +2228,7 @@ mod tests { DataType::List(Arc::clone(&item_field)), false, )]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -1958,18 +2241,33 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + // Non-nullable column → no "col" entry; list_leaf entry at "col/" + let buf = &digester.fields_digest_buffer["col/"]; + // Items are nullable → null_bits present (all valid in this case) + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable items"); + assert_eq!(null_bit_vec.len(), 3); + assert!(null_bit_vec.iter().all(|b| *b), "All items should be valid"); - // sub-array has 3 elements at offset 0 → raw buffer slice from byte 0 - let mut manual = Sha256::new(); - manual.update(3_u64.to_le_bytes()); // element count prefix - manual.update(10_i32.to_le_bytes()); - manual.update(20_i32.to_le_bytes()); - manual.update(30_i32.to_le_bytes()); - assert_eq!(data_digest.clone().finalize(), manual.finalize()); + let structural_digest = buf + .structural + .as_ref() + .expect("Expected structural digest for list"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); + + // Structural digest: element count (sizes separated from leaf data) + let mut manual_structural = Sha256::new(); + manual_structural.update(3_u64.to_le_bytes()); + assert_eq!( + structural_digest.clone().finalize(), + manual_structural.finalize() + ); + + // Data/leaf digest: only the raw leaf values + let mut manual_data = Sha256::new(); + manual_data.update(10_i32.to_le_bytes()); + manual_data.update(20_i32.to_le_bytes()); + manual_data.update(30_i32.to_le_bytes()); + assert_eq!(data_digest.clone().finalize(), manual_data.finalize()); } #[test] @@ -1988,7 +2286,7 
@@ mod tests { DataType::LargeList(Arc::clone(&item_field)), false, )]); - let mut digester = ArrowDigesterCore::::new(schema); + let mut digester = ArrowDigesterCore::::new(&schema); digester.update( &RecordBatch::try_new( Arc::new(Schema::new(vec![Field::new( @@ -2001,16 +2299,481 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + // Non-nullable column → no "col" entry; list_leaf entry at "col/" + let buf = &digester.fields_digest_buffer["col/"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable items"); + assert_eq!(null_bit_vec.len(), 3); + assert!(null_bit_vec.iter().all(|b| *b), "All items should be valid"); - let mut manual = Sha256::new(); - manual.update(3_u64.to_le_bytes()); - manual.update(1_i32.to_le_bytes()); - manual.update(2_i32.to_le_bytes()); - manual.update(3_i32.to_le_bytes()); - assert_eq!(data_digest.clone().finalize(), manual.finalize()); + let structural_digest = buf + .structural + .as_ref() + .expect("Expected structural digest for list"); + let data_digest = buf.data.as_ref().expect("Expected data digest"); + + // Structural digest: element count (sizes separated from leaf data) + let mut manual_structural = Sha256::new(); + manual_structural.update(3_u64.to_le_bytes()); + assert_eq!( + structural_digest.clone().finalize(), + manual_structural.finalize() + ); + + // Data/leaf digest: only the raw leaf values + let mut manual_data = Sha256::new(); + manual_data.update(1_i32.to_le_bytes()); + manual_data.update(2_i32.to_le_bytes()); + manual_data.update(3_i32.to_le_bytes()); + assert_eq!(data_digest.clone().finalize(), manual_data.finalize()); + } + + #[test] + fn digest_buffer_type_structural_only() { + let buf = super::DigestBufferType::::new_structural_only(true); + assert!(buf.null_bits.is_some()); + assert!(buf.structural.is_some()); + assert!(buf.data.is_none()); + } + + #[test] + fn 
digest_buffer_type_data_only() { + let buf = super::DigestBufferType::::new_data_only(false); + assert!(buf.null_bits.is_none()); + assert!(buf.structural.is_none()); + assert!(buf.data.is_some()); + } + + #[test] + fn digest_buffer_type_list_leaf() { + let buf = super::DigestBufferType::::new_list_leaf(true); + assert!(buf.null_bits.is_some()); + assert!(buf.structural.is_some()); + assert!(buf.data.is_some()); + } + + #[test] + fn digest_buffer_type_validity_only() { + let buf = super::DigestBufferType::::new_validity_only(); + assert!(buf.null_bits.is_some()); + assert!(buf.structural.is_none()); + assert!(buf.data.is_none()); + } + + #[test] + fn extract_fields_list_of_struct() { + // List> + let schema = Schema::new(vec![Field::new( + "x", + DataType::LargeList(Arc::new(Field::new( + "item", + DataType::Struct( + vec![ + Field::new("a", DataType::Int32, false), + Field::new("b", DataType::LargeUtf8, false), + ] + .into(), + ), + false, + ))), + true, // column is nullable + )]); + + let digester = ArrowDigesterCore::::new(&schema); + let field_names: Vec<&String> = digester.fields_digest_buffer.keys().collect(); + + // Should have: "x" (validity-only), "x/" (structural), "x//a" (data), "x//b" (data) + assert_eq!( + field_names.len(), + 4, + "Expected 4 entries, got: {field_names:?}" + ); + assert!(field_names.contains(&&"x".to_owned())); + assert!(field_names.contains(&&"x/".to_owned())); + assert!(field_names.contains(&&"x//a".to_owned())); + assert!(field_names.contains(&&"x//b".to_owned())); + } + + #[test] + fn extract_fields_nested_list_struct_list() { + // x: Nullable, b: Struct>, h: Int32>>>> + let schema = Schema::new(vec![Field::new( + "x", + DataType::LargeList(Arc::new(Field::new( + "item", + DataType::Struct( + vec![ + Field::new("a", DataType::Int32, true), + Field::new( + "b", + DataType::Struct( + vec![ + Field::new( + "g", + DataType::LargeList(Arc::new(Field::new( + "item", + DataType::Int32, + false, + ))), + true, + ), + Field::new("h", 
DataType::Int32, false), + ] + .into(), + ), + false, + ), + ] + .into(), + ), + false, + ))), + true, + )]); + + let digester = ArrowDigesterCore::::new(&schema); + let field_names: Vec<&String> = digester.fields_digest_buffer.keys().collect(); + + // Expected entries: "x", "x/", "x//a", "x//b/g", "x//b/g/", "x//b/h" + assert_eq!( + field_names.len(), + 6, + "Expected 6 entries, got: {field_names:?}" + ); + assert!(field_names.contains(&&"x".to_owned())); + assert!(field_names.contains(&&"x/".to_owned())); + assert!(field_names.contains(&&"x//a".to_owned())); + assert!(field_names.contains(&&"x//b/g".to_owned())); + assert!(field_names.contains(&&"x//b/g/".to_owned())); + assert!(field_names.contains(&&"x//b/h".to_owned())); + } + + #[test] + fn recursive_list_struct_decomposition() { + use crate::arrow_digester_core::normalize_schema; + + // Schema: x: Nullable, + // b: Struct< + // g: Nullable>, + // h: Int32 + // > + // >>> + let g_field = Field::new( + "g", + DataType::LargeList(Arc::new(Field::new("item", DataType::Int32, false))), + true, // g is nullable + ); + let h_field = Field::new("h", DataType::Int32, false); + let b_field = Field::new( + "b", + DataType::Struct(vec![g_field.clone(), h_field.clone()].into()), + false, // b is non-nullable + ); + let a_field = Field::new("a", DataType::Int32, true); // a is nullable + let struct_type = DataType::Struct(vec![a_field.clone(), b_field.clone()].into()); + let item_field = Field::new("item", struct_type, false); + let x_field = Field::new( + "x", + DataType::LargeList(Arc::new(item_field.clone())), + true, // column is nullable + ); + let schema = Schema::new(vec![x_field]); + + // Build the data: + // Row 0: [{a: 1, b: {g: [10, 20], h: 100}}, {a: null, b: {g: [30], h: 200}}] + // Row 1: null + // Row 2: [{a: 3, b: {g: null, h: 300}}, {a: 4, b: {g: [], h: 400}}, {a: 5, b: {g: [50], h: 500}}] + + // Inner g values: [10, 20, 30, 50] (across all non-null g lists) + let g_values = Int32Array::from(vec![10, 20, 
30, 50]); + // g list offsets: elem0=[10,20](len2), elem1=[30](len1), elem2=null, elem3=[](len0), elem4=[50](len1) + // For 5 struct elements, g has offsets [0, 2, 3, 3, 3, 4] + // with validity [true, true, false, true, true] + let g_list = LargeListArray::new( + Arc::new(Field::new("item", DataType::Int32, false)), + OffsetBuffer::new(vec![0_i64, 2, 3, 3, 3, 4].into()), + Arc::new(g_values) as ArrayRef, + Some(vec![true, true, false, true, true].into()), // g null at struct element 2 + ); + + let h_values = Int32Array::from(vec![100, 200, 300, 400, 500]); + + let b_struct = StructArray::from(vec![ + (Arc::new(g_field), Arc::new(g_list) as ArrayRef), + (Arc::new(h_field), Arc::new(h_values) as ArrayRef), + ]); + + let a_values = Int32Array::from(vec![Some(1), None, Some(3), Some(4), Some(5)]); + + let inner_struct = StructArray::from(vec![ + (Arc::new(a_field), Arc::new(a_values) as ArrayRef), + (Arc::new(b_field), Arc::new(b_struct) as ArrayRef), + ]); + + // Outer list: Row 0 has 2 elements, Row 1 is null, Row 2 has 3 elements + // Offsets: [0, 2, 2, 5] (row 1 is null but offset still present) + let outer_list = LargeListArray::new( + Arc::new(item_field), + OffsetBuffer::new(vec![0_i64, 2, 2, 5].into()), + Arc::new(inner_struct) as ArrayRef, + Some(vec![true, false, true].into()), // row 1 is null + ); + + let batch = RecordBatch::try_new( + Arc::new(schema.clone()), + vec![Arc::new(outer_list) as ArrayRef], + ) + .unwrap(); + + // ── Compute expected hash manually ── + // BTreeMap entries (in sorted order): + // "x" → null_bits: V,I,V (3 bits) + // "x/" → structural: [2, 3] + // "x//a" → null_bits: V,I,V,V,V (5 bits), data: [1, 3, 4, 5] as i32 LE + // "x//b/g" → null_bits: V,V,I,V,V (5 bits) + // "x//b/g/" → structural: [2, 1, 0, 1], data: [10, 20, 30, 50] as i32 LE + // "x//b/h" → data: [100, 200, 300, 400, 500] as i32 LE + + let schema_digest = Sha256::digest( + ArrowDigesterCore::::serialized_schema(&normalize_schema(&schema)).as_bytes(), + ); + + let mut 
final_digest = Sha256::new(); + final_digest.update(schema_digest); + + // Entry "x": null_bits V,I,V → bit_count=3, validity=0b101=5 + final_digest.update(3_u64.to_le_bytes()); + final_digest.update(5_u8.to_le_bytes()); + + // Entry "x/": structural only [2, 3] + let mut x_structural = Sha256::new(); + x_structural.update(2_u64.to_le_bytes()); + x_structural.update(3_u64.to_le_bytes()); + final_digest.update(x_structural.finalize()); + + // Entry "x//a": null_bits V,I,V,V,V → bit_count=5, validity=0b11101=29 + // data: [1, 3, 4, 5] as i32 LE + final_digest.update(5_u64.to_le_bytes()); + final_digest.update(29_u8.to_le_bytes()); + let mut xa_data = Sha256::new(); + xa_data.update(1_i32.to_le_bytes()); + xa_data.update(3_i32.to_le_bytes()); + xa_data.update(4_i32.to_le_bytes()); + xa_data.update(5_i32.to_le_bytes()); + final_digest.update(xa_data.finalize()); + + // Entry "x//b/g": null_bits V,V,I,V,V → bit_count=5, validity=0b11011=27 + final_digest.update(5_u64.to_le_bytes()); + final_digest.update(27_u8.to_le_bytes()); + + // Entry "x//b/g/": structural [2, 1, 0, 1], data [10, 20, 30, 50] as i32 LE + let mut xbg_structural = Sha256::new(); + xbg_structural.update(2_u64.to_le_bytes()); + xbg_structural.update(1_u64.to_le_bytes()); + xbg_structural.update(0_u64.to_le_bytes()); + xbg_structural.update(1_u64.to_le_bytes()); + final_digest.update(xbg_structural.finalize()); + let mut xbg_data = Sha256::new(); + xbg_data.update(10_i32.to_le_bytes()); + xbg_data.update(20_i32.to_le_bytes()); + xbg_data.update(30_i32.to_le_bytes()); + xbg_data.update(50_i32.to_le_bytes()); + final_digest.update(xbg_data.finalize()); + + // Entry "x//b/h": data only [100, 200, 300, 400, 500] as i32 LE + let mut h_leaf_data = Sha256::new(); + h_leaf_data.update(100_i32.to_le_bytes()); + h_leaf_data.update(200_i32.to_le_bytes()); + h_leaf_data.update(300_i32.to_le_bytes()); + h_leaf_data.update(400_i32.to_le_bytes()); + h_leaf_data.update(500_i32.to_le_bytes()); + 
final_digest.update(h_leaf_data.finalize()); + + let expected_hash = final_digest.finalize().to_vec(); + + let mut digester = ArrowDigesterCore::::new(&schema); + digester.update(&batch); + + let actual_hash = digester.finalize(); + + assert_eq!( + encode(&actual_hash), + encode(&expected_hash), + "Recursive list/struct decomposition hash mismatch" + ); + } + + #[expect( + clippy::too_many_lines, + reason = "Test builds multiple complex batches for batch-split independence verification" + )] + #[test] + fn recursive_list_struct_batch_split_independence() { + // Same schema and data as recursive_list_struct_decomposition, + // split into two batches: rows 0-1 and row 2. + // Verify: hash(batch1 + batch2) == hash(combined) + + let g_field = Field::new( + "g", + DataType::LargeList(Arc::new(Field::new("item", DataType::Int32, false))), + true, + ); + let h_field = Field::new("h", DataType::Int32, false); + let b_field = Field::new( + "b", + DataType::Struct(vec![g_field.clone(), h_field.clone()].into()), + false, + ); + let a_field = Field::new("a", DataType::Int32, true); + let struct_type = DataType::Struct(vec![a_field.clone(), b_field.clone()].into()); + let item_field = Field::new("item", struct_type, false); + let x_field = Field::new("x", DataType::LargeList(Arc::new(item_field.clone())), true); + let schema = Arc::new(Schema::new(vec![x_field])); + + // ── Build combined batch (all 3 rows) ── + let g_values = Int32Array::from(vec![10, 20, 30, 50]); + let g_list = LargeListArray::new( + Arc::new(Field::new("item", DataType::Int32, false)), + OffsetBuffer::new(vec![0_i64, 2, 3, 3, 3, 4].into()), + Arc::new(g_values) as ArrayRef, + Some(vec![true, true, false, true, true].into()), + ); + let h_values = Int32Array::from(vec![100, 200, 300, 400, 500]); + let b_struct = StructArray::from(vec![ + (Arc::new(g_field.clone()), Arc::new(g_list) as ArrayRef), + (Arc::new(h_field.clone()), Arc::new(h_values) as ArrayRef), + ]); + let a_values = 
Int32Array::from(vec![Some(1), None, Some(3), Some(4), Some(5)]); + let inner_struct = StructArray::from(vec![ + (Arc::new(a_field.clone()), Arc::new(a_values) as ArrayRef), + (Arc::new(b_field.clone()), Arc::new(b_struct) as ArrayRef), + ]); + let outer_list = LargeListArray::new( + Arc::new(item_field.clone()), + OffsetBuffer::new(vec![0_i64, 2, 2, 5].into()), + Arc::new(inner_struct) as ArrayRef, + Some(vec![true, false, true].into()), + ); + let combined_batch = + RecordBatch::try_new(Arc::clone(&schema), vec![Arc::new(outer_list) as ArrayRef]) + .unwrap(); + + // ── Build batch 1: rows 0-1 ── + let g_values_1 = Int32Array::from(vec![10, 20, 30]); + let g_list_1 = LargeListArray::new( + Arc::new(Field::new("item", DataType::Int32, false)), + OffsetBuffer::new(vec![0_i64, 2, 3].into()), + Arc::new(g_values_1) as ArrayRef, + Some(vec![true, true].into()), + ); + let h_values_1 = Int32Array::from(vec![100, 200]); + let b_struct_1 = StructArray::from(vec![ + (Arc::new(g_field.clone()), Arc::new(g_list_1) as ArrayRef), + (Arc::new(h_field.clone()), Arc::new(h_values_1) as ArrayRef), + ]); + let a_values_1 = Int32Array::from(vec![Some(1), None]); + let inner_struct_1 = StructArray::from(vec![ + (Arc::new(a_field.clone()), Arc::new(a_values_1) as ArrayRef), + (Arc::new(b_field.clone()), Arc::new(b_struct_1) as ArrayRef), + ]); + let outer_list_1 = LargeListArray::new( + Arc::new(item_field.clone()), + OffsetBuffer::new(vec![0_i64, 2, 2].into()), + Arc::new(inner_struct_1) as ArrayRef, + Some(vec![true, false].into()), + ); + let batch1 = RecordBatch::try_new( + Arc::clone(&schema), + vec![Arc::new(outer_list_1) as ArrayRef], + ) + .unwrap(); + + // ── Build batch 2: row 2 ── + let g_values_2 = Int32Array::from(vec![50]); + let g_list_2 = LargeListArray::new( + Arc::new(Field::new("item", DataType::Int32, false)), + OffsetBuffer::new(vec![0_i64, 0, 0, 1].into()), + Arc::new(g_values_2) as ArrayRef, + Some(vec![false, true, true].into()), + ); + let h_values_2 = 
Int32Array::from(vec![300, 400, 500]); + let b_struct_2 = StructArray::from(vec![ + (Arc::new(g_field), Arc::new(g_list_2) as ArrayRef), + (Arc::new(h_field), Arc::new(h_values_2) as ArrayRef), + ]); + let a_values_2 = Int32Array::from(vec![Some(3), Some(4), Some(5)]); + let inner_struct_2 = StructArray::from(vec![ + (Arc::new(a_field), Arc::new(a_values_2) as ArrayRef), + (Arc::new(b_field), Arc::new(b_struct_2) as ArrayRef), + ]); + let outer_list_2 = LargeListArray::new( + Arc::new(item_field), + OffsetBuffer::new(vec![0_i64, 3].into()), + Arc::new(inner_struct_2) as ArrayRef, + Some(vec![true].into()), + ); + let batch2 = RecordBatch::try_new( + Arc::clone(&schema), + vec![Arc::new(outer_list_2) as ArrayRef], + ) + .unwrap(); + + // ── Compare ── + let mut single = ArrowDigesterCore::<Sha256>::new(schema.as_ref()); + single.update(&combined_batch); + let single_hash = single.finalize(); + + let mut split = ArrowDigesterCore::<Sha256>::new(schema.as_ref()); + split.update(&batch1); + split.update(&batch2); + let split_hash = split.finalize(); + + assert_eq!( + encode(&single_hash), + encode(&split_hash), + "Batch split independence failed for recursive list/struct decomposition" + ); + } + + #[test] + fn hash_array_list_of_struct() { + // Verify hash_array works with List<Struct> using the same recursive + // decomposition as the record-batch path. 
+ let inner_struct = StructArray::from(vec![ + ( + Arc::new(Field::new("a", DataType::Int32, false)), + Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef, + ), + ( + Arc::new(Field::new("b", DataType::Int32, false)), + Arc::new(Int32Array::from(vec![10, 20, 30])) as ArrayRef, + ), + ]); + + let list_array = LargeListArray::new( + Arc::new(Field::new( + "item", + DataType::Struct( + vec![ + Field::new("a", DataType::Int32, false), + Field::new("b", DataType::Int32, false), + ] + .into(), + ), + false, + )), + OffsetBuffer::new(vec![0_i64, 2, 3].into()), + Arc::new(inner_struct) as ArrayRef, + Some(vec![true, true].into()), + ); + + let hash1 = ArrowDigesterCore::<Sha256>::hash_array(&list_array); + let hash2 = ArrowDigesterCore::<Sha256>::hash_array(&list_array); + assert_eq!(hash1, hash2, "hash_array should be deterministic"); + assert_eq!( + hash1.len(), + 32, + "core hash_array should return 32 bytes (SHA-256)" + ); } } diff --git a/src/lib.rs b/src/lib.rs index a3745ff..685bcaf 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -10,15 +10,15 @@ use crate::arrow_digester_core::ArrowDigesterCore; const VERSION_BYTES: [u8; 3] = [0_u8, 0_u8, 1_u8]; // Version 0.0.1 -/// Maps `ArrowDigesterCore` to a `sha_256` digester + versioning. +/// Maps `ArrowDigesterCore` to a SHA-256 digester with version prefix. #[derive(Clone)] pub struct ArrowDigester { digester: ArrowDigesterCore<Sha256>, } impl ArrowDigester { - /// Create a new instance of `ArrowDigester` with SHA256 as the digester with the schema which will be enforce through each update. - pub fn new(schema: Schema) -> Self { + /// Create a new instance of `ArrowDigester` with SHA-256 as the digest algorithm. The schema will be enforced on each update. + pub fn new(schema: &Schema) -> Self { Self { digester: ArrowDigesterCore::<Sha256>::new(schema), } @@ -34,17 +34,17 @@ impl ArrowDigester { Self::prepend_version_bytes(self.digester.finalize()) } - /// Function to hash an Array in one go. + /// Hash an array in one go. 
pub fn hash_array(array: &dyn Array) -> Vec<u8> { Self::prepend_version_bytes(ArrowDigesterCore::<Sha256>::hash_array(array)) } - /// Function to hash a complete `RecordBatch` in one go. + /// Hash a complete `RecordBatch` in one go. pub fn hash_record_batch(record_batch: &RecordBatch) -> Vec<u8> { Self::prepend_version_bytes(ArrowDigesterCore::<Sha256>::hash_record_batch(record_batch)) } - /// Function to hash schema only. + /// Hash a schema only. pub fn hash_schema(schema: &Schema) -> Vec<u8> { Self::prepend_version_bytes(ArrowDigesterCore::<Sha256>::hash_schema(schema)) } diff --git a/src/pyarrow.rs b/src/pyarrow.rs index 03277ba..0477b65 100644 --- a/src/pyarrow.rs +++ b/src/pyarrow.rs @@ -67,10 +67,10 @@ pub struct InternalPyArrowDigester { #[uniffi::export] impl InternalPyArrowDigester { - /// Create a new instance of `PyArrowDigester` with SHA256 as the digester with the schema which will be enforce through each update + /// Create a new instance of `PyArrowDigester` with SHA-256 as the digest algorithm. The schema will be enforced on each update. /// /// # Panics - /// The pointer must be a valid Arrow schema from Python's pyarrow, if failed to convert, it will panic + /// The pointer must be a valid Arrow schema from Python's pyarrow. Panics if conversion fails. #[uniffi::constructor] pub fn new(schema_ptr: u64) -> Self { @@ -81,7 +81,7 @@ impl InternalPyArrowDigester { Schema::try_from(&ffi_schema).expect("Failed to convert FFI schema to Arrow schema") }; Self { - digester: Arc::new(Mutex::new(ArrowDigester::new(schema))), + digester: Arc::new(Mutex::new(ArrowDigester::new(&schema))), } } @@ -117,7 +117,7 @@ impl InternalPyArrowDigester { /// Consume the digester and finalize the hash computation /// /// # Panics - /// If failed to acquire lock on digester + /// Panics if it fails to acquire the lock on the digester. 
pub fn finalize(&self) -> Vec<u8> { self.digester .lock() diff --git a/tests/arrow_digester.rs b/tests/arrow_digester.rs index 303e258..602ac26 100644 --- a/tests/arrow_digester.rs +++ b/tests/arrow_digester.rs @@ -7,8 +7,9 @@ mod tests { array::{ ArrayRef, BinaryArray, BooleanArray, Date32Array, Date64Array, Decimal32Array, Decimal64Array, DictionaryArray, Float32Array, Float64Array, Int16Array, Int32Array, - Int64Array, Int8Array, LargeBinaryArray, LargeListArray, LargeStringArray, ListArray, - RecordBatch, StringArray, StructArray, Time32MillisecondArray, Time32SecondArray, + Int64Array, Int8Array, LargeBinaryArray, LargeListArray, LargeListBuilder, + LargeStringArray, LargeStringBuilder, ListArray, ListBuilder, RecordBatch, StringArray, + StringBuilder, StructArray, Time32MillisecondArray, Time32SecondArray, Time64MicrosecondArray, Time64NanosecondArray, UInt16Array, UInt32Array, UInt64Array, UInt8Array, }, @@ -72,8 +73,8 @@ mod tests { // Empty Table Hashing Check assert_eq!( - encode(ArrowDigester::new(schema.clone()).finalize()), - "0000019c75bd0c40bd2fb15e878418c151c0b792c966476b35ded7d0f6fd1922cf5a00" + encode(ArrowDigester::new(&schema).finalize()), + "0000015955baf5303c8545360b2f0a253065e9d83d91cd44f0bc947c1904dfd9d09aac" ); let batch = RecordBatch::try_new( @@ -129,7 +130,7 @@ mod tests { // Hash the record batch assert_eq!( encode(ArrowDigester::hash_record_batch(&batch)), - "00000199f7ba7f6c7ec30ad487996c2b3eb6f0e1c750c318a32b09afcdfdce7de8c08e" + "000001487059003be1a84dbe29ba6e90ea50798a76d22e46e221b6a0c332421dc4062e" ); } @@ -139,7 +140,7 @@ mod tests { let hash = hex::encode(ArrowDigester::hash_array(&bool_array)); assert_eq!( hash, - "000001f9abeb37d9395f359b48a379f0a8467c572b19ecc6cae9fa85e1bf627a52a8f3" + "00000185a9c99eba7bcfd9b14fd529b9534f2289319779270aa4a072f117cf90a6ac8b" ); } @@ -150,7 +151,7 @@ mod tests { let hash = hex::encode(ArrowDigester::hash_array(&int_array)); assert_eq!( hash, - 
"00000127f2411e6839eb1e3fe706ac3f01e704c7b46357360fb2ddb8a08ec98e8ba4fa" + "0000018330f9b8796b9434cbf7bc028c18c58a2a739b980acf9995ce1e5d60b43b0138" ); } @@ -161,7 +162,7 @@ mod tests { let hash = hex::encode(ArrowDigester::hash_array(&time_array)); assert_eq!( hash, - "0000019000b74aa80f685103a8cafc7e113aa8f33ccc0c94ea3713318d2cc2f3436baa" + "000001aba70469e596c735ec13c3d60a9db2d0e5515eb864f07ad5d24572b35f23eacc" ); } @@ -172,7 +173,7 @@ mod tests { let hash = hex::encode(ArrowDigester::hash_array(&time_array)); assert_eq!( hash, - "00000195f12143d789f364a3ed52f7300f8f91dc21fbe00c34aed798ca8fd54182dea3" + "000001c96d705b1278f9ffe1b31fb307408768f14d961c44028a1d0f778dd61786ee26" ); } @@ -199,10 +200,10 @@ mod tests { let hash = hex::encode(ArrowDigester::hash_array(&binary_array)); assert_eq!( hash, - "000001466801efd880d2acecd6c78915b5c2a51476870f9116912834d79de43a000071" + "0000018dc3a0e479d1335553546c8f23c36d75335cbd34805a6f96c5d5225b347fbc57" ); - // Test large binary array with same data to ensure consistency + // Large binary array with same data should produce identical hash (type canonicalization) let large_binary_array = LargeBinaryArray::from(vec![ Some(b"hello".as_ref()), None, @@ -210,7 +211,7 @@ mod tests { Some(b"".as_ref()), ]); - assert_ne!( + assert_eq!( hex::encode(ArrowDigester::hash_array(&large_binary_array)), hash ); @@ -263,14 +264,14 @@ mod tests { let hash = hex::encode(ArrowDigester::hash_array(&string_array)); assert_eq!( hash, - "000001811f2407a0d2e90ef9688514d37cd92225242e7614f02ef5ef36abcae73ca374" + "0000016255bde0141ebf26e08c31c96f6112e5e21d101ab8bb90d77f2c3eec02c62d3c" ); - // Test large string array with same data to ensure consistency + // Large string array with same data should produce identical hash (type canonicalization) let large_string_array = LargeStringArray::from(vec![Some("hello"), None, Some("world"), Some("")]); - assert_ne!( + assert_eq!( hex::encode(ArrowDigester::hash_array(&large_string_array)), hash ); @@ -289,7 
+290,7 @@ mod tests { let hash = hex::encode(ArrowDigester::hash_array(&list_array)); assert_eq!( hash, - "00000114b8faee7c56d2a94d77095db599152df41aaf4d11e485035eebc94e8981f769" + "000001dc359d563a1ed210eb271b314612ea8343f0a0b0955b9053a9eb47962d27163c" ); // Collision test: [[1, 2], [3]] vs [[1], [2, 3]] @@ -324,7 +325,7 @@ mod tests { assert_eq!( encode(ArrowDigester::hash_array(&decimal32_array)), - "000001ef29250615f9d6ab34672c3b11dfa2dcda6e8e6164bc55899c13887f17705f5d" + "0000014f015bd5c4b6ce6e939a8c890333f3e110c2c28ef8014aafd352f8373791e547" ); // Test Decimal64 (precision 10-18) @@ -338,7 +339,7 @@ mod tests { .unwrap(); assert_eq!( encode(ArrowDigester::hash_array(&decimal64_array)), - "000001efa4ed72641051233889c07775366cbf2e56eb4b0fcfd46653f5741e81786f08" + "000001dc08c7b9c583edecec36bc5dee21cd2edec9f402a651014fea5f8834d16ad737" ); // Test Decimal128 (precision 19-38) @@ -352,7 +353,7 @@ mod tests { .unwrap(); assert_eq!( hex::encode(ArrowDigester::hash_array(&decimal128_array)), - "00000155cc4d81a048dbca001ca8581673a5a6c93efd870d358df211a545c2af9b658d" + "0000011e3b33d28771b3593fd5dc4b68af8091a1ba9cd493ade374e7368e213bef244e" ); } @@ -424,12 +425,12 @@ mod tests { let batch2 = RecordBatch::try_new(Arc::clone(&schema), vec![uids2, fake_data2]).unwrap(); // Hash both record batches - let mut digester = ArrowDigester::new((*schema).clone()); + let mut digester = ArrowDigester::new(schema.as_ref()); digester.update(&batch1); digester.update(&batch2); assert_eq!( encode(digester.finalize()), - "0000018aa41f456395dc1d26c8d82895d6c81ed9453c1bb3f401fee637131baa60553e" + "0000019f5fa370d315a4b4f2314be7b7284a0549b70ad4e21e584fdebf441ad02f44f0" ); } @@ -507,7 +508,7 @@ mod tests { .unwrap(); // Hash batches incrementally - let mut digester_batches = ArrowDigester::new((*schema).clone()); + let mut digester_batches = ArrowDigester::new(schema.as_ref()); digester_batches.update(&batch1); digester_batches.update(&batch2); let hash_batches = 
encode(digester_batches.finalize()); @@ -522,7 +523,7 @@ mod tests { ) .unwrap(); - let mut digester_single = ArrowDigester::new((*schema).clone()); + let mut digester_single = ArrowDigester::new(schema.as_ref()); digester_single.update(&combined_batch); let hash_single = encode(digester_single.finalize()); @@ -559,7 +560,7 @@ mod tests { .unwrap(); // Hash batches incrementally - let mut digester_batches = ArrowDigester::new((*schema).clone()); + let mut digester_batches = ArrowDigester::new(schema.as_ref()); digester_batches.update(&batch1); digester_batches.update(&batch2); let hash_batches = encode(digester_batches.finalize()); @@ -588,7 +589,7 @@ mod tests { ) .unwrap(); - let mut digester_single = ArrowDigester::new((*schema).clone()); + let mut digester_single = ArrowDigester::new(schema.as_ref()); digester_single.update(&combined_batch); let hash_single = encode(digester_single.finalize()); @@ -603,7 +604,7 @@ mod tests { /// Two schemas with the same struct fields in different order should produce identical schema hashes. /// Bug: `data_type_to_value()` preserves struct field insertion order in the JSON Vec. #[test] - #[ignore = "Bug: struct fields not sorted in data_type_to_value (Issue 1)"] + fn struct_field_order_in_schema_should_not_affect_hash() { let schema1 = Schema::new(vec![Field::new( "my_struct", @@ -640,7 +641,7 @@ mod tests { /// Record batches with struct columns whose inner fields are reordered should produce identical hashes. 
#[test] - #[ignore = "Bug: struct fields not sorted in data_type_to_value (Issue 1)"] + fn struct_field_order_in_record_batch_should_not_affect_hash() { let schema1 = Arc::new(Schema::new(vec![Field::new( "s", @@ -667,8 +668,7 @@ mod tests { )])); let ints = Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef; - let bools = - Arc::new(BooleanArray::from(vec![Some(true), Some(false), None])) as ArrayRef; + let bools = Arc::new(BooleanArray::from(vec![Some(true), Some(false), None])) as ArrayRef; let struct1 = StructArray::from(vec![ ( @@ -692,10 +692,8 @@ mod tests { ), ]); - let batch1 = - RecordBatch::try_new(schema1, vec![Arc::new(struct1) as ArrayRef]).unwrap(); - let batch2 = - RecordBatch::try_new(schema2, vec![Arc::new(struct2) as ArrayRef]).unwrap(); + let batch1 = RecordBatch::try_new(schema1, vec![Arc::new(struct1) as ArrayRef]).unwrap(); + let batch2 = RecordBatch::try_new(schema2, vec![Arc::new(struct2) as ArrayRef]).unwrap(); assert_eq!( encode(ArrowDigester::hash_record_batch(&batch1)), @@ -707,7 +705,7 @@ mod tests { // ── Issue 5: Type canonicalization (Binary/LargeBinary, Utf8/LargeUtf8, List/LargeList) ── #[test] - #[ignore = "Bug: no type canonicalization for Binary vs LargeBinary (Issue 5)"] + fn binary_and_large_binary_schema_should_hash_equal() { let schema1 = Schema::new(vec![Field::new("col", DataType::Binary, true)]); let schema2 = Schema::new(vec![Field::new("col", DataType::LargeBinary, true)]); @@ -720,7 +718,7 @@ mod tests { } #[test] - #[ignore = "Bug: no type canonicalization for Utf8 vs LargeUtf8 (Issue 5)"] + fn utf8_and_large_utf8_schema_should_hash_equal() { let schema1 = Schema::new(vec![Field::new("col", DataType::Utf8, true)]); let schema2 = Schema::new(vec![Field::new("col", DataType::LargeUtf8, true)]); @@ -733,7 +731,7 @@ mod tests { } #[test] - #[ignore = "Bug: no type canonicalization for List vs LargeList (Issue 5)"] + fn list_and_large_list_schema_should_hash_equal() { let list_field = Field::new("item", 
DataType::Int32, true); let schema1 = Schema::new(vec![Field::new( @@ -755,19 +753,74 @@ mod tests { } #[test] - #[ignore = "Bug: no type canonicalization for Binary vs LargeBinary in hash_array (Issue 5)"] - fn binary_and_large_binary_array_should_hash_equal() { - let bin = BinaryArray::from(vec![ - Some(b"hello".as_ref()), + fn list_and_large_list_array_should_hash_equal() { + let list = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![ + Some(vec![Some(1), Some(2)]), None, - Some(b"world".as_ref()), + Some(vec![Some(3)]), ]); - let large_bin = LargeBinaryArray::from(vec![ - Some(b"hello".as_ref()), + let large_list = LargeListArray::from_iter_primitive::<Int32Type, _, _>(vec![ + Some(vec![Some(1), Some(2)]), None, - Some(b"world".as_ref()), + Some(vec![Some(3)]), ]); + assert_eq!( + encode(ArrowDigester::hash_array(&list)), + encode(ArrowDigester::hash_array(&large_list)), + "List and LargeList arrays with same data should produce same hash" + ); + } + + #[test] + fn list_and_large_list_record_batch_should_hash_equal() { + let list_field = Field::new("item", DataType::Int32, true); + let schema1 = Arc::new(Schema::new(vec![Field::new( + "col", + DataType::List(Box::new(list_field.clone()).into()), + true, + )])); + let schema2 = Arc::new(Schema::new(vec![Field::new( + "col", + DataType::LargeList(Box::new(list_field).into()), + true, + )])); + + let batch1 = RecordBatch::try_new( + schema1, + vec![ + Arc::new(ListArray::from_iter_primitive::<Int32Type, _, _>(vec![ + Some(vec![Some(10), Some(20)]), + None, + ])) as ArrayRef, + ], + ) + .unwrap(); + + let batch2 = RecordBatch::try_new( + schema2, + vec![ + Arc::new(LargeListArray::from_iter_primitive::<Int32Type, _, _>( + vec![Some(vec![Some(10), Some(20)]), None], + )) as ArrayRef, + ], + ) + .unwrap(); + + assert_eq!( + encode(ArrowDigester::hash_record_batch(&batch1)), + encode(ArrowDigester::hash_record_batch(&batch2)), + "List and LargeList record batches with same data should produce same hash" + ); + } + + #[test] + fn 
binary_and_large_binary_array_should_hash_equal() { + let bin = BinaryArray::from(vec![Some(b"hello".as_ref()), None, Some(b"world".as_ref())]); + let large_bin = + LargeBinaryArray::from(vec![Some(b"hello".as_ref()), None, Some(b"world".as_ref())]); + assert_eq!( encode(ArrowDigester::hash_array(&bin)), encode(ArrowDigester::hash_array(&large_bin)), @@ -776,7 +829,7 @@ mod tests { } #[test] - #[ignore = "Bug: no type canonicalization for Utf8 vs LargeUtf8 in hash_array (Issue 5)"] + fn utf8_and_large_utf8_array_should_hash_equal() { let arr = StringArray::from(vec![Some("hello"), None, Some("world")]); let large_arr = LargeStringArray::from(vec![Some("hello"), None, Some("world")]); @@ -789,7 +842,35 @@ mod tests { } #[test] - #[ignore = "Bug: no type canonicalization for Binary vs LargeBinary in hash_record_batch (Issue 5)"] + fn utf8_and_large_utf8_record_batch_should_hash_equal() { + let schema1 = Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, true)])); + let schema2 = Arc::new(Schema::new(vec![Field::new( + "col", + DataType::LargeUtf8, + true, + )])); + + let batch1 = RecordBatch::try_new( + schema1, + vec![Arc::new(StringArray::from(vec![Some("abc"), None])) as ArrayRef], + ) + .unwrap(); + + let batch2 = RecordBatch::try_new( + schema2, + vec![Arc::new(LargeStringArray::from(vec![Some("abc"), None])) as ArrayRef], + ) + .unwrap(); + + assert_eq!( + encode(ArrowDigester::hash_record_batch(&batch1)), + encode(ArrowDigester::hash_record_batch(&batch2)), + "Utf8 and LargeUtf8 record batches with same data should produce same hash" + ); + } + + #[test] + fn binary_and_large_binary_record_batch_should_hash_equal() { let schema1 = Arc::new(Schema::new(vec![Field::new("col", DataType::Binary, true)])); let schema2 = Arc::new(Schema::new(vec![Field::new( @@ -800,19 +881,13 @@ mod tests { let batch1 = RecordBatch::try_new( schema1, - vec![Arc::new(BinaryArray::from(vec![ - Some(b"abc".as_ref()), - None, - ])) as ArrayRef], + 
vec![Arc::new(BinaryArray::from(vec![Some(b"abc".as_ref()), None])) as ArrayRef], ) .unwrap(); let batch2 = RecordBatch::try_new( schema2, - vec![Arc::new(LargeBinaryArray::from(vec![ - Some(b"abc".as_ref()), - None, - ])) as ArrayRef], + vec![Arc::new(LargeBinaryArray::from(vec![Some(b"abc".as_ref()), None])) as ArrayRef], ) .unwrap(); @@ -823,10 +898,184 @@ mod tests { ); } + // ── Deep nested type normalization ────────────────────────────────── + + #[test] + fn list_of_utf8_vs_large_list_of_large_utf8_array_should_hash_equal() { + // List(Utf8) vs LargeList(LargeUtf8) — normalization must be recursive + let list = { + let mut builder = ListBuilder::new(StringBuilder::new()); + builder.values().append_value("hello"); + builder.values().append_value("world"); + builder.append(true); + builder.values().append_value("foo"); + builder.append(true); + builder.finish() + }; + + let large_list = { + let mut builder = LargeListBuilder::new(LargeStringBuilder::new()); + builder.values().append_value("hello"); + builder.values().append_value("world"); + builder.append(true); + builder.values().append_value("foo"); + builder.append(true); + builder.finish() + }; + + assert_eq!( + encode(ArrowDigester::hash_array(&list)), + encode(ArrowDigester::hash_array(&large_list)), + "List(Utf8) and LargeList(LargeUtf8) should produce same hash" + ); + } + + #[test] + fn list_of_utf8_vs_large_list_of_large_utf8_schema_should_hash_equal() { + let schema1 = Schema::new(vec![Field::new( + "col", + DataType::List(Arc::new(Field::new("item", DataType::Utf8, true))), + true, + )]); + let schema2 = Schema::new(vec![Field::new( + "col", + DataType::LargeList(Arc::new(Field::new("item", DataType::LargeUtf8, true))), + true, + )]); + + assert_eq!( + encode(ArrowDigester::hash_schema(&schema1)), + encode(ArrowDigester::hash_schema(&schema2)), + "List(Utf8) and LargeList(LargeUtf8) schemas should be logically equivalent" + ); + } + + #[test] + fn 
struct_with_list_utf8_vs_large_variants_record_batch_should_hash_equal() { + // Struct({items: List(Utf8), name: Utf8}) vs Struct({items: LargeList(LargeUtf8), name: LargeUtf8}) + let schema1 = Arc::new(Schema::new(vec![Field::new( + "s", + DataType::Struct( + vec![ + Field::new( + "items", + DataType::List(Arc::new(Field::new("item", DataType::Utf8, true))), + true, + ), + Field::new("name", DataType::Utf8, true), + ] + .into(), + ), + false, + )])); + + let schema2 = Arc::new(Schema::new(vec![Field::new( + "s", + DataType::Struct( + vec![ + Field::new( + "items", + DataType::LargeList(Arc::new(Field::new( + "item", + DataType::LargeUtf8, + true, + ))), + true, + ), + Field::new("name", DataType::LargeUtf8, true), + ] + .into(), + ), + false, + )])); + + // Build struct with List(Utf8) + let list1 = { + let mut builder = ListBuilder::new(StringBuilder::new()); + builder.values().append_value("a"); + builder.values().append_value("b"); + builder.append(true); + builder.values().append_value("c"); + builder.append(true); + builder.finish() + }; + let names1 = StringArray::from(vec![Some("Alice"), Some("Bob")]); + let struct1 = StructArray::from(vec![ + ( + Arc::new(Field::new( + "items", + DataType::List(Arc::new(Field::new("item", DataType::Utf8, true))), + true, + )), + Arc::new(list1) as ArrayRef, + ), + ( + Arc::new(Field::new("name", DataType::Utf8, true)), + Arc::new(names1) as ArrayRef, + ), + ]); + + // Build struct with LargeList(LargeUtf8) + let list2 = { + let mut builder = LargeListBuilder::new(LargeStringBuilder::new()); + builder.values().append_value("a"); + builder.values().append_value("b"); + builder.append(true); + builder.values().append_value("c"); + builder.append(true); + builder.finish() + }; + let names2 = LargeStringArray::from(vec![Some("Alice"), Some("Bob")]); + let struct2 = StructArray::from(vec![ + ( + Arc::new(Field::new( + "items", + DataType::LargeList(Arc::new(Field::new("item", DataType::LargeUtf8, true))), + true, + )), + 
Arc::new(list2) as ArrayRef, + ), + ( + Arc::new(Field::new("name", DataType::LargeUtf8, true)), + Arc::new(names2) as ArrayRef, + ), + ]); + + let batch1 = RecordBatch::try_new(schema1, vec![Arc::new(struct1) as ArrayRef]).unwrap(); + let batch2 = RecordBatch::try_new(schema2, vec![Arc::new(struct2) as ArrayRef]).unwrap(); + + assert_eq!( + encode(ArrowDigester::hash_record_batch(&batch1)), + encode(ArrowDigester::hash_record_batch(&batch2)), + "Struct with List(Utf8) should hash same as Struct with LargeList(LargeUtf8)" + ); + } + + #[test] + fn streaming_with_type_equivalent_schemas_should_succeed() { + // Create digester with Utf8 schema, feed batch with LargeUtf8 schema + let schema_utf8 = Schema::new(vec![Field::new("col", DataType::Utf8, true)]); + + let mut digester = ArrowDigester::new(&schema_utf8); + + let batch = RecordBatch::try_new( + Arc::new(Schema::new(vec![Field::new( + "col", + DataType::LargeUtf8, + true, + )])), + vec![Arc::new(LargeStringArray::from(vec![Some("hello"), None])) as ArrayRef], + ) + .unwrap(); + + digester.update(&batch); // Should NOT panic — schemas are logically equivalent + let _hash = encode(digester.finalize()); + } + // ── Issue 6: Dictionary-encoded array equivalence ─────────────────── #[test] - #[ignore = "Bug: Dictionary arrays hit todo!() panic (Issue 6)"] + fn dictionary_utf8_should_hash_same_as_plain_string() { let plain = StringArray::from(vec![Some("apple"), Some("banana"), Some("apple")]); @@ -842,13 +1091,12 @@ mod tests { } #[test] - #[ignore = "Bug: Dictionary arrays hit todo!() panic (Issue 6)"] + fn dictionary_int_values_should_hash_same_as_plain() { let plain = StringArray::from(vec![Some("x"), Some("y"), Some("x")]); - let dict: DictionaryArray = vec![Some("x"), Some("y"), Some("x")] - .into_iter() - .collect(); + let dict: DictionaryArray = + vec![Some("x"), Some("y"), Some("x")].into_iter().collect(); assert_eq!( encode(ArrowDigester::hash_array(&plain)), @@ -858,13 +1106,12 @@ mod tests { } #[test] - 
#[ignore = "Bug: Dictionary arrays hit todo!() panic (Issue 6)"] + fn dictionary_with_nulls_should_hash_same_as_plain() { let plain = StringArray::from(vec![Some("a"), None, Some("b"), None]); - let dict: DictionaryArray = vec![Some("a"), None, Some("b"), None] - .into_iter() - .collect(); + let dict: DictionaryArray = + vec![Some("a"), None, Some("b"), None].into_iter().collect(); assert_eq!( encode(ArrowDigester::hash_array(&plain)), @@ -877,14 +1124,14 @@ mod tests { /// Feeding a batch with reordered columns into a digester should not panic. #[test] - #[ignore = "Bug: update() uses strict schema equality including column order (Issue 7)"] + fn streaming_update_with_reordered_columns_should_succeed() { let schema = Schema::new(vec![ Field::new("a", DataType::Int32, false), Field::new("b", DataType::Boolean, true), ]); - let mut digester = ArrowDigester::new(schema); + let mut digester = ArrowDigester::new(&schema); // Batch with columns in DIFFERENT order: [b, a] let reordered_schema = Arc::new(Schema::new(vec![ @@ -908,7 +1155,7 @@ mod tests { /// A digester fed batches with different column orders should produce the same hash /// as one fed batches in the original order. 
#[test] - #[ignore = "Bug: update() uses strict schema equality including column order (Issue 7)"] + fn streaming_reordered_columns_produce_same_hash() { let schema_ab = Schema::new(vec![ Field::new("a", DataType::Int32, false), @@ -934,12 +1181,12 @@ mod tests { .unwrap(); // Digester fed batch in original order [a, b] - let mut digester1 = ArrowDigester::new(schema_ab.clone()); + let mut digester1 = ArrowDigester::new(&schema_ab); digester1.update(&batch_ab); let hash1 = encode(digester1.finalize()); // Digester fed batch in reversed order [b, a] - let mut digester2 = ArrowDigester::new(schema_ab); + let mut digester2 = ArrowDigester::new(&schema_ab); digester2.update(&batch_ba); let hash2 = encode(digester2.finalize()); diff --git a/tests/digest_bytes.rs b/tests/digest_bytes.rs index 5c6016f..f64a8b6 100644 --- a/tests/digest_bytes.rs +++ b/tests/digest_bytes.rs @@ -1,2 +1,912 @@ +/// Manual byte-level verification tests for the Starfix hashing specification. +/// +/// Each test in this module manually computes the expected SHA-256 hash by +/// feeding the exact bytes described in `docs/byte-layout-spec.md` into a +/// fresh SHA-256 hasher, then asserts that the library produces the identical +/// result. This serves as both a conformance check and a reference +/// implementation for anyone porting Starfix to another language. 
#[cfg(test)] -mod tests {} +mod tests { + #![expect(clippy::unwrap_used, reason = "Okay in test")] + #![expect(clippy::redundant_clone, reason = "Clones for clarity in test setup")] + #![expect(clippy::absolute_paths, reason = "One-off use in test")] + + use std::sync::Arc; + + use arrow::array::{ + ArrayRef, BinaryArray, BooleanArray, Int32Array, LargeListArray, LargeStringArray, + RecordBatch, StringArray, StructArray, + }; + use arrow::buffer::NullBuffer; + use arrow_schema::{DataType, Field, Schema}; + use sha2::{Digest as _, Sha256}; + use starfix::ArrowDigester; + + const VERSION: [u8; 3] = [0x00, 0x00, 0x01]; + + // ── Helper ─────────────────────────────────────────────────────────── + + /// Prepend the 3-byte version prefix to a 32-byte SHA-256 digest, + /// returning the full 35-byte Starfix hash. + fn with_version(digest: Vec<u8>) -> Vec<u8> { + let mut out = VERSION.to_vec(); + out.extend(digest); + out + } + + // ══════════════════════════════════════════════════════════════════════ + // Example A: Simple Two-Column Table (record batch) + // Schema: {age: Int32 non-nullable, name: LargeUtf8 nullable} + // Row 0: age=25, name="Alice" + // Row 1: age=30, name=NULL + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_a_two_column_table() { + // ── Build the table ────────────────────────────────────────────── + let schema = Schema::new(vec![ + Field::new("age", DataType::Int32, false), + Field::new("name", DataType::LargeUtf8, true), + ]); + let batch = RecordBatch::try_new( + Arc::new(schema.clone()), + vec![ + Arc::new(Int32Array::from(vec![25_i32, 30])) as ArrayRef, + Arc::new(LargeStringArray::from(vec![Some("Alice"), None])) as ArrayRef, + ], + ) + .unwrap(); + + // ── Step 1: Schema digest ──────────────────────────────────────── + let schema_json = r#"{"age":{"data_type":"Int32","nullable":false},"name":{"data_type":"LargeUtf8","nullable":true}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); 
+ + // Verify the library agrees on schema hash + assert_eq!( + ArrowDigester::hash_schema(&schema), + with_version(schema_digest.to_vec()), + "Schema hash mismatch — canonical JSON may differ" + ); + + // ── Step 2: Field "age" (Int32, non-nullable) ──────────────────── + // Values: [25, 30] → little-endian bytes + let mut age_data = Sha256::new(); + age_data.update(25_i32.to_le_bytes()); // 19 00 00 00 + age_data.update(30_i32.to_le_bytes()); // 1e 00 00 00 + let age_data_finalized = age_data.finalize(); + + // ── Step 3: Field "name" (LargeUtf8, nullable) ─────────────────── + // Values: ["Alice", NULL] + // + // Validity BitVec (Lsb0, u8 storage): + // bit 0 = 1 (valid), bit 1 = 0 (null) + // → u8 word = 0b01 = 1 + // bit_count = 2 + let bit_count: u64 = 2; + let validity_word: u8 = 1; // bits: [1, 0] in Lsb0 + + // Data bytes (only valid elements): + // "Alice" → len=5 as u64 LE, then UTF-8 bytes + // NULL → skipped + let mut name_data = Sha256::new(); + name_data.update(5_u64.to_le_bytes()); // length prefix + name_data.update(b"Alice"); // raw UTF-8 bytes + // NULL element: nothing fed + let name_data_finalized = name_data.finalize(); + + // ── Step 4: Final combination ──────────────────────────────────── + // Fields in alphabetical order: "age", "name" + let mut final_digest = Sha256::new(); + + // Schema + final_digest.update(schema_digest); + + // Field "age" (non-nullable → just the data digest) + final_digest.update(age_data_finalized); + + // Field "name" (nullable → bit_count + validity words + data digest) + final_digest.update(bit_count.to_le_bytes()); // 02 00 00 00 00 00 00 00 + final_digest.update(validity_word.to_le_bytes()); // 01 + final_digest.update(name_data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + // ── Verify ─────────────────────────────────────────────────────── + assert_eq!( + ArrowDigester::hash_record_batch(&batch), + expected, + "Example A: two-column table hash mismatch" + ); + } + + // 
══════════════════════════════════════════════════════════════════════ + // Example B: Boolean Array with Nulls (hash_array API) + // BooleanArray [true, NULL, false, true] (nullable) + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_b_boolean_array_with_nulls() { + let array = BooleanArray::from(vec![Some(true), None, Some(false), Some(true)]); + + // ── Type metadata ──────────────────────────────────────────────── + // data_type_to_value(Boolean) → JSON value "Boolean" + // serde_json::to_string(json!("Boolean")) → "\"Boolean\"" + let type_json = b"\"Boolean\""; + + // ── Validity bits (Lsb0, u8 storage) ────────────────────────── + // [valid, null, valid, valid] → bits [1, 0, 1, 1] + // Lsb0 in u8: bit0=1, bit1=0, bit2=1, bit3=1 → 0b1101 = 13 + let bit_count: u64 = 4; + let validity_word: u8 = 0b1101; // = 13 + + // ── Data bits (Lsb0 packed, valid values only) ─────────────────── + // Valid values: [true, false, true] → 3 bits + // Lsb0: bit0=1(true), bit1=0(false), bit2=1(true) → 0b101 = 0x05 + let mut data_digest = Sha256::new(); + data_digest.update([0x05_u8]); + let data_finalized = data_digest.finalize(); + + // ── Final combination ──────────────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(type_json); + // Nullable finalization + final_digest.update(bit_count.to_le_bytes()); + final_digest.update(validity_word.to_le_bytes()); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&array), + expected, + "Example B: boolean array hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example C: Non-Nullable Int32 Array (hash_array API) + // Int32Array [1, 2, 3] (non-nullable) + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_c_non_nullable_int32_array() { + let 
array = Int32Array::from(vec![1_i32, 2, 3]); + + // ── Type metadata ──────────────────────────────────────────────── + let type_json = b"\"Int32\""; + + // ── Data (contiguous LE buffer) ────────────────────────────────── + // [1, 2, 3] as i32 LE: + // 01 00 00 00 02 00 00 00 03 00 00 00 + let mut data_digest = Sha256::new(); + data_digest.update(1_i32.to_le_bytes()); + data_digest.update(2_i32.to_le_bytes()); + data_digest.update(3_i32.to_le_bytes()); + let data_finalized = data_digest.finalize(); + + // ── Final (non-nullable) ───────────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(type_json); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&array), + expected, + "Example C: non-nullable int32 array hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example D: Non-Nullable Binary Array (hash_array API) + // BinaryArray [b"hi", b""] (non-nullable) + // Tests type canonicalization: Binary → LargeBinary + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_d_non_nullable_binary_array() { + let array = BinaryArray::from(vec![b"hi".as_ref(), b"".as_ref()]); + + // ── Type metadata (canonicalized) ──────────────────────────────── + // Binary → LargeBinary in canonical form + let type_json = b"\"LargeBinary\""; + + // ── Data ───────────────────────────────────────────────────────── + // b"hi": len=2 as u64 LE + raw bytes + // b"": len=0 as u64 LE + (no bytes) + let mut data_digest = Sha256::new(); + data_digest.update(2_u64.to_le_bytes()); // 02 00 00 00 00 00 00 00 + data_digest.update(b"hi"); // 68 69 + data_digest.update(0_u64.to_le_bytes()); // 00 00 00 00 00 00 00 00 + let data_finalized = data_digest.finalize(); + + // ── Final (non-nullable) ───────────────────────────────────────── + let mut final_digest = 
Sha256::new(); + final_digest.update(type_json); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&array), + expected, + "Example D: non-nullable binary array hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example E: Column-Order Independence + // Batch 1: columns [x: Int32, y: Boolean nullable] → x=10, y=true + // Batch 2: columns [y: Boolean nullable, x: Int32] → y=true, x=10 + // Both must produce the same hash. + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_e_column_order_independence() { + let ints = Arc::new(Int32Array::from(vec![10_i32])) as ArrayRef; + let bools = Arc::new(BooleanArray::from(vec![Some(true)])) as ArrayRef; + + let batch_xy = RecordBatch::try_new( + Arc::new(Schema::new(vec![ + Field::new("x", DataType::Int32, false), + Field::new("y", DataType::Boolean, true), + ])), + vec![Arc::clone(&ints), Arc::clone(&bools)], + ) + .unwrap(); + + let batch_yx = RecordBatch::try_new( + Arc::new(Schema::new(vec![ + Field::new("y", DataType::Boolean, true), + Field::new("x", DataType::Int32, false), + ])), + vec![Arc::clone(&bools), Arc::clone(&ints)], + ) + .unwrap(); + + // ── Manual computation ─────────────────────────────────────────── + let schema_json = r#"{"x":{"data_type":"Int32","nullable":false},"y":{"data_type":"Boolean","nullable":true}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); + + // Field "x" (Int32, non-nullable): value 10 + let mut x_data = Sha256::new(); + x_data.update(10_i32.to_le_bytes()); // 0a 00 00 00 + let x_finalized = x_data.finalize(); + + // Field "y" (Boolean, nullable): value true (valid) + // Validity: [1] → bit_count=1, word=1 (Lsb0) + // Data: [true] Lsb0 → bit0=1 → 0x01 + let bit_count: u64 = 1; + let validity_word: u8 = 1; + + let mut y_data = Sha256::new(); + y_data.update([0x01_u8]); 
// true in Lsb0 = 0000_0001 + let y_finalized = y_data.finalize(); + + // Final combination: schema, then fields alphabetically (x, y) + let mut final_digest = Sha256::new(); + final_digest.update(schema_digest); + // x (non-nullable) + final_digest.update(x_finalized); + // y (nullable) + final_digest.update(bit_count.to_le_bytes()); + final_digest.update(validity_word.to_le_bytes()); + final_digest.update(y_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + // ── Verify both column orderings produce the same hash ─────────── + let hash_xy = ArrowDigester::hash_record_batch(&batch_xy); + let hash_yx = ArrowDigester::hash_record_batch(&batch_yx); + + assert_eq!(hash_xy, hash_yx, "Column order should not affect hash"); + assert_eq!( + hash_xy, expected, + "Example E: column-order independence hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example F: Type Equivalence (Utf8 vs LargeUtf8, hash_array API) + // StringArray ["ab"] (Utf8, non-nullable) + // LargeStringArray ["ab"] (LargeUtf8, non-nullable) + // Both must produce the same hash. 
+ // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_f_utf8_large_utf8_equivalence() { + let small = StringArray::from(vec!["ab"]); + let large = LargeStringArray::from(vec!["ab"]); + + // ── Manual computation ─────────────────────────────────────────── + // Type metadata: both canonicalize to "LargeUtf8" + let type_json = b"\"LargeUtf8\""; + + // Data: "ab" → len=2 as u64 LE + UTF-8 bytes + let mut data_digest = Sha256::new(); + data_digest.update(2_u64.to_le_bytes()); + data_digest.update(b"ab"); + let data_finalized = data_digest.finalize(); + + let mut final_digest = Sha256::new(); + final_digest.update(type_json); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&small), + expected, + "Example F: Utf8 hash mismatch" + ); + assert_eq!( + ArrowDigester::hash_array(&large), + expected, + "Example F: LargeUtf8 hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example G: Nullable Int32 Array with Nulls (hash_array API) + // Int32Array [Some(42), None, Some(-7), Some(0)] + // Tests nullable fixed-size path with actual nulls. 
+ // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_g_nullable_int32_with_nulls() { + let array = Int32Array::from(vec![Some(42), None, Some(-7), Some(0)]); + + // ── Type metadata ──────────────────────────────────────────────── + let type_json = b"\"Int32\""; + + // ── Validity bits (Lsb0, u8) ────────────────────────────────── + // [valid, null, valid, valid] → bits [1, 0, 1, 1] → 0b1101 = 13 + let bit_count: u64 = 4; + let validity_word: u8 = 0b1101; // 13 + + // ── Data (only valid elements, in order) ───────────────────────── + // 42 as i32 LE: 2a 00 00 00 + // -7 as i32 LE: f9 ff ff ff + // 0 as i32 LE: 00 00 00 00 + let mut data_digest = Sha256::new(); + data_digest.update(42_i32.to_le_bytes()); + data_digest.update((-7_i32).to_le_bytes()); + data_digest.update(0_i32.to_le_bytes()); + let data_finalized = data_digest.finalize(); + + // ── Final (nullable) ───────────────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(type_json); + final_digest.update(bit_count.to_le_bytes()); + final_digest.update(validity_word.to_le_bytes()); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&array), + expected, + "Example G: nullable int32 array hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example H: Nullable String Array with Nulls (hash_array API) + // StringArray [Some("hello"), None, Some("world"), Some("")] + // Tests nullable variable-length path with type canonicalization. 
+ // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_h_nullable_string_array_with_nulls() { + let array = StringArray::from(vec![Some("hello"), None, Some("world"), Some("")]); + + // ── Type metadata (canonicalized) ──────────────────────────────── + // Utf8 → LargeUtf8 + let type_json = b"\"LargeUtf8\""; + + // ── Validity bits (Lsb0, u8) ────────────────────────────────── + // [valid, null, valid, valid] → bits [1, 0, 1, 1] → 0b1101 = 13 + let bit_count: u64 = 4; + let validity_word: u8 = 0b1101; + + // ── Data (only valid elements) ─────────────────────────────────── + // "hello" → len=5 u64 LE + "hello" + // "world" → len=5 u64 LE + "world" + // "" → len=0 u64 LE + let mut data_digest = Sha256::new(); + data_digest.update(5_u64.to_le_bytes()); + data_digest.update(b"hello"); + // NULL: skipped + data_digest.update(5_u64.to_le_bytes()); + data_digest.update(b"world"); + data_digest.update(0_u64.to_le_bytes()); + let data_finalized = data_digest.finalize(); + + // ── Final (nullable) ───────────────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(type_json); + final_digest.update(bit_count.to_le_bytes()); + final_digest.update(validity_word.to_le_bytes()); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&array), + expected, + "Example H: nullable string array hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example I: Empty Table (schema only, no data) + // Tests that finalize() on a fresh digester with no update() calls + // produces schema_digest + empty field digests. 
+ // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_i_empty_table() { + let schema = Schema::new(vec![ + Field::new("a", DataType::Int32, false), + Field::new("b", DataType::Boolean, true), + ]); + + // ── Schema digest ──────────────────────────────────────────────── + let schema_json = r#"{"a":{"data_type":"Int32","nullable":false},"b":{"data_type":"Boolean","nullable":true}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); + + // ── Field "a" (Int32, non-nullable): no data fed ───────────────── + // data_digest = SHA-256() with no updates → SHA-256 of empty input + let a_data_finalized = Sha256::digest(b""); + + // ── Field "b" (Boolean, nullable): no data fed ─────────────────── + // bit_count = 0 (no elements) + // as_raw_slice() = [] (no words) + // data_digest = SHA-256 of empty input + let bit_count: u64 = 0; + let b_data_finalized = Sha256::digest(b""); + + // ── Final ──────────────────────────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(schema_digest); + // Field "a" (non-nullable) + final_digest.update(a_data_finalized); + // Field "b" (nullable) — bit_count=0, no words, empty data digest + final_digest.update(bit_count.to_le_bytes()); + // no validity words (raw_slice is empty for 0-length BitVec) + final_digest.update(b_data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + let digester = ArrowDigester::new(&schema); + assert_eq!( + digester.finalize(), + expected, + "Example I: empty table hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example J: Multi-Batch Streaming + // Feeding two small batches must produce the same hash as feeding + // one combined batch (batch-split independence). 
+ // Schema: {v: Int32 non-nullable} + // Batch 1: [1, 2] + // Batch 2: [3] + // Combined: [1, 2, 3] + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_j_multi_batch_streaming() { + let schema = Schema::new(vec![Field::new("v", DataType::Int32, false)]); + + // ── Two-batch path ─────────────────────────────────────────────── + let batch1 = RecordBatch::try_new( + Arc::new(schema.clone()), + vec![Arc::new(Int32Array::from(vec![1_i32, 2])) as ArrayRef], + ) + .unwrap(); + let batch2 = RecordBatch::try_new( + Arc::new(schema.clone()), + vec![Arc::new(Int32Array::from(vec![3_i32])) as ArrayRef], + ) + .unwrap(); + + let mut digester_stream = ArrowDigester::new(&schema); + digester_stream.update(&batch1); + digester_stream.update(&batch2); + let hash_stream = digester_stream.finalize(); + + // ── Single-batch path ──────────────────────────────────────────── + let combined = RecordBatch::try_new( + Arc::new(schema), + vec![Arc::new(Int32Array::from(vec![1_i32, 2, 3])) as ArrayRef], + ) + .unwrap(); + let hash_combined = ArrowDigester::hash_record_batch(&combined); + + assert_eq!( + hash_stream, hash_combined, + "Streaming two batches should equal single combined batch" + ); + + // ── Manual computation ─────────────────────────────────────────── + let schema_json = r#"{"v":{"data_type":"Int32","nullable":false}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); + + // Field "v": data is [1, 2, 3] as i32 LE — accumulated across batches + // The digester is streaming, so it updates the same SHA-256 state: + // update(01 00 00 00 02 00 00 00) from batch 1 + // update(03 00 00 00) from batch 2 + // SHA-256 is incremental, so this is identical to hashing all 12 bytes at once. 
+ let mut v_data = Sha256::new(); + v_data.update(1_i32.to_le_bytes()); + v_data.update(2_i32.to_le_bytes()); + v_data.update(3_i32.to_le_bytes()); + let v_finalized = v_data.finalize(); + + let mut final_digest = Sha256::new(); + final_digest.update(schema_digest); + final_digest.update(v_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + hash_stream, expected, + "Example J: multi-batch streaming hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example K: Struct Column in a Record Batch + // Schema: {person: Struct<age: Int32, name: LargeUtf8> non-nullable} + // Row 0: {age: 25, name: "Alice"} + // Row 1: {age: 30, name: "Bob"} + // + // In the record-batch path, struct fields are decomposed into leaf + // fields: "person/age" and "person/name", each hashed independently. + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_k_struct_column_in_record_batch() { + // ── Build the table ────────────────────────────────────────────── + let age = Arc::new(Int32Array::from(vec![25_i32, 30])) as ArrayRef; + let name = Arc::new(LargeStringArray::from(vec!["Alice", "Bob"])) as ArrayRef; + let struct_array = StructArray::from(vec![ + ( + Arc::new(Field::new("age", DataType::Int32, false)), + Arc::clone(&age), + ), + ( + Arc::new(Field::new("name", DataType::LargeUtf8, false)), + Arc::clone(&name), + ), + ]); + + let schema = Schema::new(vec![Field::new( + "person", + DataType::Struct( + vec![ + Field::new("age", DataType::Int32, false), + Field::new("name", DataType::LargeUtf8, false), + ] + .into(), + ), + false, + )]); + let batch = RecordBatch::try_new( + Arc::new(schema.clone()), + vec![Arc::new(struct_array) as ArrayRef], + ) + .unwrap(); + + // ── Step 1: Schema digest ──────────────────────────────────────── + // Canonical JSON: struct fields sorted by name, keys sorted recursively + // "person" has data_type: {"Struct": [{"data_type": "Int32", 
"name": "age", "nullable": false}, + // {"data_type": "LargeUtf8", "name": "name", "nullable": false}]} + let schema_json = r#"{"person":{"data_type":{"Struct":[{"data_type":"Int32","name":"age","nullable":false},{"data_type":"LargeUtf8","name":"name","nullable":false}]},"nullable":false}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); + + assert_eq!( + ArrowDigester::hash_schema(&schema), + with_version(schema_digest.to_vec()), + "Example K: schema hash mismatch" + ); + + // ── Step 2: Leaf field "person/age" (Int32, non-nullable) ──────── + // Values: [25, 30] as i32 LE + let mut age_data = Sha256::new(); + age_data.update(25_i32.to_le_bytes()); + age_data.update(30_i32.to_le_bytes()); + let age_data_finalized = age_data.finalize(); + + // ── Step 3: Leaf field "person/name" (LargeUtf8, non-nullable) ─── + // Values: ["Alice", "Bob"] + let mut name_data = Sha256::new(); + name_data.update(5_u64.to_le_bytes()); // "Alice" length + name_data.update(b"Alice"); + name_data.update(3_u64.to_le_bytes()); // "Bob" length + name_data.update(b"Bob"); + let name_data_finalized = name_data.finalize(); + + // ── Step 4: Final combination ──────────────────────────────────── + // Fields alphabetically: "person/age", "person/name" + let mut final_digest = Sha256::new(); + final_digest.update(schema_digest); + // "person/age" (non-nullable): just data digest + final_digest.update(age_data_finalized); + // "person/name" (non-nullable): just data digest + final_digest.update(name_data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_record_batch(&batch), + expected, + "Example K: struct column record batch hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example L: Struct Array via hash_array (non-nullable struct) + // StructArray [{a: 1, b: true}, {a: 2, b: false}] + // Children: a: Int32 non-null, b: Boolean non-null + // + // In 
hash_array, the struct is hashed compositely: + // type_json + data where data = finalized(child_a) || finalized(child_b) + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_l_struct_array_hash_array() { + let a = Arc::new(Int32Array::from(vec![1_i32, 2])) as ArrayRef; + let b = Arc::new(BooleanArray::from(vec![true, false])) as ArrayRef; + let struct_array = StructArray::from(vec![ + ( + Arc::new(Field::new("a", DataType::Int32, false)), + Arc::clone(&a), + ), + ( + Arc::new(Field::new("b", DataType::Boolean, false)), + Arc::clone(&b), + ), + ]); + + // ── Type metadata ──────────────────────────────────────────────── + // Canonical: {"Struct":[{"data_type":"Int32","name":"a","nullable":false}, + // {"data_type":"Boolean","name":"b","nullable":false}]} + let type_json = r#"{"Struct":[{"data_type":"Int32","name":"a","nullable":false},{"data_type":"Boolean","name":"b","nullable":false}]}"#; + + // ── Decomposition ──────────────────────────────────────────────── + // Struct is transparent: no BTreeMap entry for the struct itself. + // Children become separate entries, finalized directly into the + // final digest (no parent_data wrapper). 
+ // + // BTreeMap entries (sorted by key): "a", "b" + + // ── Entry "a" (Int32, non-nullable) ────────────────────────────── + // data = SHA256(1_i32_le, 2_i32_le) + let mut data_a = Sha256::new(); + data_a.update(1_i32.to_le_bytes()); + data_a.update(2_i32.to_le_bytes()); + + // ── Entry "b" (Boolean, non-nullable) ──────────────────────────── + // Values: [true, false] → Lsb0: bit0=1(true), bit1=0(false) → 0x01 + let mut data_b = Sha256::new(); + data_b.update([0x01_u8]); + + // ── Final combination ──────────────────────────────────────────── + // type_json → finalize_digest("a") → finalize_digest("b") + // Each entry: non-nullable → no null_bits, no structural, just data.finalize() + let mut final_digest = Sha256::new(); + final_digest.update(type_json.as_bytes()); + final_digest.update(data_a.finalize()); + final_digest.update(data_b.finalize()); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&struct_array), + expected, + "Example L: struct array hash_array mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example M: Nullable Struct Array via hash_array (struct-level nulls) + // StructArray [Some({a: 10, b: "x"}), None, Some({a: 30, b: "z"})] + // Struct is nullable. Children: a: Int32 non-null, b: LargeUtf8 non-null + // + // Struct-level nulls propagate to children: at row 1 (null struct), + // children's data is undefined and must be skipped. 
+ // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_m_nullable_struct_array_hash_array() { + // Build a nullable struct array with a null at row 1 + let a = Int32Array::from(vec![10_i32, 0, 30]); // row 1 value is undefined (0 placeholder) + let b = LargeStringArray::from(vec!["x", "", "z"]); // row 1 value is undefined + let struct_array = StructArray::from(( + vec![ + ( + Arc::new(Field::new("a", DataType::Int32, false)), + Arc::new(a) as ArrayRef, + ), + ( + Arc::new(Field::new("b", DataType::LargeUtf8, false)), + Arc::new(b) as ArrayRef, + ), + ], + // Struct-level validity: [valid, null, valid] + NullBuffer::from(vec![true, false, true]) + .into_inner() + .into_inner(), + )); + + // ── Type metadata ──────────────────────────────────────────────── + let type_json = r#"{"Struct":[{"data_type":"Int32","name":"a","nullable":false},{"data_type":"LargeUtf8","name":"b","nullable":false}]}"#; + + // ── Decomposition ──────────────────────────────────────────────── + // Struct is transparent: no BTreeMap entry. Struct-level nulls + // [1, 0, 1] are AND-propagated to children for data hashing. + // Children "a" and "b" are non-nullable per their Field definitions, + // so their entries have no null_bits — but null rows are skipped + // in the data stream. 
+ // + // BTreeMap entries (sorted by key): "a", "b" + + // ── Entry "a" (Int32, non-nullable) ────────────────────────────── + // Struct nulls propagated: rows 0,2 valid → data = [10, 30] + let mut data_a = Sha256::new(); + data_a.update(10_i32.to_le_bytes()); + // row 1: skipped (struct null) + data_a.update(30_i32.to_le_bytes()); + + // ── Entry "b" (LargeUtf8, non-nullable) ───────────────────────── + // Struct nulls propagated: rows 0,2 valid → data = ["x", "z"] + let mut data_b = Sha256::new(); + data_b.update(1_u64.to_le_bytes()); // "x" len + data_b.update(b"x"); + // row 1: skipped (struct null) + data_b.update(1_u64.to_le_bytes()); // "z" len + data_b.update(b"z"); + + // ── Final combination ──────────────────────────────────────────── + // type_json → finalize_digest("a") → finalize_digest("b") + // Each entry: non-nullable → no null_bits, no structural, just data.finalize() + let mut final_digest = Sha256::new(); + final_digest.update(type_json.as_bytes()); + final_digest.update(data_a.finalize()); + final_digest.update(data_b.finalize()); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&struct_array), + expected, + "Example M: nullable struct array hash_array mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example N: List-of-Struct in a Record Batch + // Schema: {items: LargeList<Struct<id: Int32, label: LargeUtf8>> nullable} + // Row 0: [{id: 1, label: "a"}, {id: 2, label: "b"}] (2 elements) + // Row 1: [{id: 3, label: "c"}] (1 element) + // + // Recursively decomposed into separate BTreeMap entries: + // "items" → validity-only (null_bits: [V, V]) + // "items/" → structural-only (list lengths: [2, 1]) + // "items//id" → data-only ([1, 2, 3] as i32 LE) + // "items//label" → data-only (["a", "b", "c"] as LargeUtf8) + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_n_list_of_struct_record_batch() { + // ── Build the table 
────────────────────────────────────────────── + let struct_fields = vec![ + Field::new("id", DataType::Int32, false), + Field::new("label", DataType::LargeUtf8, false), + ]; + let inner_struct_field = Field::new( + "item", + DataType::Struct(struct_fields.clone().into()), + false, + ); + let list_field = Field::new( + "items", + DataType::LargeList(Arc::new(inner_struct_field.clone())), + true, + ); + let schema = Schema::new(vec![list_field.clone()]); + + // Build struct sub-arrays + // Row 0: [{id:1, label:"a"}, {id:2, label:"b"}], Row 1: [{id:3, label:"c"}] + // Total struct rows: 3 (ids: [1,2,3], labels: ["a","b","c"]) + let ids = Int32Array::from(vec![1_i32, 2, 3]); + let labels = LargeStringArray::from(vec!["a", "b", "c"]); + let struct_array = StructArray::from(vec![ + ( + Arc::new(Field::new("id", DataType::Int32, false)), + Arc::new(ids) as ArrayRef, + ), + ( + Arc::new(Field::new("label", DataType::LargeUtf8, false)), + Arc::new(labels) as ArrayRef, + ), + ]); + + // Build large list array with offsets [0, 2, 3] + let list_array = LargeListArray::new( + Arc::new(inner_struct_field), + arrow::buffer::OffsetBuffer::new(vec![0_i64, 2, 3].into()), + Arc::new(struct_array) as ArrayRef, + None, // all list elements valid + ); + + let batch = RecordBatch::try_new( + Arc::new(schema.clone()), + vec![Arc::new(list_array) as ArrayRef], + ) + .unwrap(); + + // ── Step 1: Schema digest ──────────────────────────────────────── + // Canonical: element type has no name (element_type_to_value drops "item") + // The inner struct's data_type is {"Struct": [sorted children]} + let schema_json = r#"{"items":{"data_type":{"LargeList":{"data_type":{"Struct":[{"data_type":"Int32","name":"id","nullable":false},{"data_type":"LargeUtf8","name":"label","nullable":false}]},"nullable":false}},"nullable":true}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); + + assert_eq!( + ArrowDigester::hash_schema(&schema), + with_version(schema_digest.to_vec()), + "Example N: 
schema hash mismatch" + ); + + // ── Step 2: Recursive decomposition ────────────────────────────── + // + // With recursive list/struct decomposition, entries are (sorted): + // "items" → validity-only: null_bits [V, V] (2 bits, both valid) + // "items/" → structural-only: list lengths [2, 1] + // "items//id" → data-only: [1, 2, 3] as i32 LE + // "items//label" → data-only: ["a", "b", "c"] as LargeUtf8 + + // ── Step 3: Final combination ──────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(schema_digest); + + // Entry "items": null_bits V,V → bit_count=2, validity=0b11=3 + final_digest.update(2_u64.to_le_bytes()); + final_digest.update(3_u8.to_le_bytes()); + + // Entry "items/": structural [2, 1] + let mut items_structural = Sha256::new(); + items_structural.update(2_u64.to_le_bytes()); + items_structural.update(1_u64.to_le_bytes()); + final_digest.update(items_structural.finalize()); + + // Entry "items//id": data [1, 2, 3] as i32 LE + let mut id_data = Sha256::new(); + id_data.update(1_i32.to_le_bytes()); + id_data.update(2_i32.to_le_bytes()); + id_data.update(3_i32.to_le_bytes()); + final_digest.update(id_data.finalize()); + + // Entry "items//label": data ["a", "b", "c"] as LargeUtf8 + let mut label_data = Sha256::new(); + label_data.update(1_u64.to_le_bytes()); + label_data.update(b"a"); + label_data.update(1_u64.to_le_bytes()); + label_data.update(b"b"); + label_data.update(1_u64.to_le_bytes()); + label_data.update(b"c"); + final_digest.update(label_data.finalize()); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_record_batch(&batch), + expected, + "Example N: list-of-struct record batch hash mismatch" + ); + } +} diff --git a/tests/golden_files/schema_serialization_pretty.json b/tests/golden_files/schema_serialization_pretty.json index 70cb27d..f2ec2db 100644 --- a/tests/golden_files/schema_serialization_pretty.json +++ 
b/tests/golden_files/schema_serialization_pretty.json @@ -1,6 +1,6 @@ { "binary_name": { - "data_type": "Binary", + "data_type": "LargeBinary", "nullable": true }, "bool_name": { @@ -45,19 +45,9 @@ "doubly_nested_struct_name": { "data_type": { "Struct": [ - { - "data_type": "Int32", - "name": "outer_field", - "nullable": false - }, { "data_type": { "Struct": [ - { - "data_type": "Utf8", - "name": "middle_field", - "nullable": true - }, { "data_type": { "Struct": [ @@ -75,11 +65,21 @@ }, "name": "inner", "nullable": false + }, + { + "data_type": "LargeUtf8", + "name": "middle_field", + "nullable": true } ] }, "name": "middle", "nullable": false + }, + { + "data_type": "Int32", + "name": "outer_field", + "nullable": false } ] }, @@ -117,7 +117,6 @@ "data_type": { "LargeList": { "data_type": "Int32", - "name": "item", "nullable": true } }, @@ -129,9 +128,8 @@ }, "list_name": { "data_type": { - "List": { + "LargeList": { "data_type": "Int32", - "name": "item", "nullable": true } }, @@ -146,7 +144,7 @@ "nullable": false }, { - "data_type": "Utf8", + "data_type": "LargeUtf8", "name": "struct_field2", "nullable": true } @@ -195,7 +193,7 @@ "nullable": false }, "utf8_name": { - "data_type": "Utf8", + "data_type": "LargeUtf8", "nullable": true } }