diff --git a/docs/byte-layout-spec.md b/docs/byte-layout-spec.md
new file mode 100644
index 0000000..0fd7791
--- /dev/null
+++ b/docs/byte-layout-spec.md
@@ -0,0 +1,1019 @@
+# Starfix Byte Layout Specification
+
+This document describes the **exact byte-level serialization** used by Starfix to compute deterministic hashes of Apache Arrow schemas and record batches. Every byte fed into SHA-256 is specified here, making it possible to implement a compatible hasher in any language.
+
+All multi-byte integers use **little-endian** byte order unless explicitly stated otherwise.
+
+---
+
+## 1. Output Format
+
+Every Starfix hash is **35 bytes**:
+
+```
+[version: 3 bytes] [SHA-256 digest: 32 bytes]
+```
+
+The version prefix is currently `0x00 0x00 0x01` (version 0.0.1).
+
+When displayed as hex, a hash looks like:
+
+```
+000001 <64 hex chars of SHA-256>
+```
+
+---
+
+## 2. Schema Serialization
+
+### 2.1 Canonical JSON String
+
+The schema is serialized as a **compact JSON string** (no whitespace) of an object where:
+
+- **Keys** are field names, sorted alphabetically (via `BTreeMap`).
+- **Values** are objects with keys `"data_type"` and `"nullable"`, with JSON keys sorted alphabetically within every nested object (recursively).
+
+Because all JSON object keys are sorted recursively, the key order is always `"data_type"` before `"nullable"` (and `"data_type"` before `"name"` before `"nullable"` for struct children).
+
+#### Type Canonicalization
+
+Before serialization, these logical equivalence classes are collapsed:
+
+| Arrow type(s)              | Canonical JSON form           |
+|----------------------------|-------------------------------|
+| `Binary`, `LargeBinary`    | `"LargeBinary"`               |
+| `Utf8`, `LargeUtf8`        | `"LargeUtf8"`                 |
+| `List(f)`, `LargeList(f)`  | `{"LargeList": <element>}`    |
+| `Dictionary(k, v)`         | canonical form of `v`         |
+
+#### Nested Type Serialization
+
+**Struct fields** are serialized as:
+```json
+{"Struct": [<child objects>]}
+```
+Each child object: `{"data_type": ..., "name": "<field name>", "nullable": <bool>}`.
+
+**List / LargeList elements** are serialized as:
+```json
+{"LargeList": {"data_type": ..., "nullable": <bool>}}
+```
+Note: the Arrow-internal field name (typically `"item"`) is **omitted** — only `data_type` and `nullable` are included.
+
+**Primitive types** use Arrow's built-in serde:
+- `"Int32"`, `"Boolean"`, `"Float64"`, `"LargeBinary"`, `"LargeUtf8"`, etc.
+- `{"Decimal128": [38, 5]}`, `{"Time32": "Second"}`, etc.
+
+### 2.2 Schema Digest
+
+```
+schema_digest = SHA-256(canonical_json_string_bytes)
+```
+
+The UTF-8 bytes of the JSON string are fed directly into SHA-256. The result is 32 bytes.
+
+### 2.3 Concrete Example
+
+Schema: `{name: LargeUtf8 nullable, age: Int32 non-nullable}`
+
+Canonical JSON string (compact, keys sorted):
+```
+{"age":{"data_type":"Int32","nullable":false},"name":{"data_type":"LargeUtf8","nullable":true}}
+```
+
+Note: `"age"` comes before `"name"` alphabetically, and `"data_type"` comes before `"nullable"`.
+
+```
+schema_digest = SHA-256(b'{"age":{"data_type":"Int32","nullable":false},"name":{"data_type":"LargeUtf8","nullable":true}}')
+```
+
+---
+
+## 3. Field Data Serialization
+
+Each leaf field in the schema is hashed independently into its own SHA-256 digest. Struct fields are flattened: a struct field `address` with children `city` and `zip` becomes two leaf fields `address/city` and `address/zip`.
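+
+As a sketch of this flattening (a hypothetical standalone helper, not the crate's internal API), leaf paths can be collected recursively into an alphabetically ordered set:
+
+```rust
+use std::collections::BTreeSet;
+
+use arrow_schema::{DataType, Field, Fields};
+
+/// Collect leaf-field paths, flattening structs with a `/` delimiter.
+/// A `BTreeSet` keeps the paths in the alphabetical order used later
+/// when field digests are combined.
+fn leaf_paths(parent: Option<&str>, field: &Field, out: &mut BTreeSet<String>) {
+    let path = match parent {
+        Some(p) => format!("{p}/{}", field.name()),
+        None => field.name().to_string(),
+    };
+    if let DataType::Struct(children) = field.data_type() {
+        for child in children {
+            leaf_paths(Some(&path), child, out);
+        }
+    } else {
+        out.insert(path);
+    }
+}
+
+fn main() {
+    let address = Field::new(
+        "address",
+        DataType::Struct(Fields::from(vec![
+            Field::new("zip", DataType::Utf8, false),
+            Field::new("city", DataType::Utf8, false),
+        ])),
+        false,
+    );
+    let mut paths = BTreeSet::new();
+    leaf_paths(None, &address, &mut paths);
+    assert_eq!(
+        paths.into_iter().collect::<Vec<_>>(),
+        ["address/city", "address/zip"]
+    );
+}
+```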
+ +Each leaf field has a **digest buffer** containing up to three components: + +| Component | Present when | Purpose | +|-----------|-------------|---------| +| `null_bits` (BitVec) | field is nullable | Tracks which elements are valid vs null | +| `structural` (SHA-256) | field is a list type (`List` or `LargeList`) | Accumulates element counts (structure) | +| `data` (SHA-256) | always | Accumulates leaf data bytes | + +A field is nullable if the Arrow field's `nullable` flag is `true`. A field is "structured" if its (canonical) data type is `List` or `LargeList`. + +This separation of structural information from leaf data ensures that list element boundaries are hashed independently from the values they contain. For example, `[[1,2],[3]]` and `[[1],[2,3]]` differ in their structural digest (element counts `[2,1]` vs `[1,2]`) even though their leaf data digest is identical (`[1,2,3]`). + +### 3.1 Fixed-Size Types + +**Types**: `Int8`, `UInt8`, `Int16`, `UInt16`, `Int32`, `UInt32`, `Int64`, `UInt64`, `Float16`, `Float32`, `Float64`, `Date32`, `Date64`, `Time32(*)`, `Time64(*)`, `Decimal32`, `Decimal64`, `Decimal128`, `Decimal256`, `FixedSizeBinary(n)`. + +| Type | Bytes per element | +|------|-------------------| +| Int8 / UInt8 | 1 | +| Int16 / UInt16 / Float16 | 2 | +| Int32 / UInt32 / Float32 / Date32 / Decimal32 / Time32 | 4 | +| Int64 / UInt64 / Float64 / Date64 / Decimal64 / Time64 | 8 | +| Decimal128 | 16 | +| Decimal256 | 32 | +| FixedSizeBinary(n) | n | + +**Non-nullable path**: The entire contiguous byte buffer (all elements concatenated, little-endian) is fed into the data digest in a single update. + +**Nullable path**: +1. For each element `i`, push `is_valid(i)` (true=1, false=0) into the validity `BitVec`. +2. For each **valid** element, feed its little-endian bytes into the data digest. +3. **Null elements are skipped entirely** — no data bytes are fed. + +If a nullable field has no actual nulls (null buffer absent), all elements are marked valid and the entire buffer is fed in one update (same as non-nullable data path). + +### 3.2 Boolean Type + +Boolean values are **bit-packed** using **MSB-first** (`Msb0`) ordering into bytes. + +**Non-nullable**: All values are packed sequentially into a `BitVec`, then the raw bytes are fed into the data digest. + +**Nullable**: +1. Extend the validity `BitVec` as usual. +2. Only **valid** values are packed (nulls are skipped). +3. The packed bytes are fed into the data digest. + +**Example**: `[true, NULL, false, true]` (nullable, 4 elements) +- Validity bits: `[1, 0, 1, 1]` +- Data bits (valid only): `[true, false, true]` → Msb0 packed: `1_0_1_00000` = `0xA0` +- Bytes fed to data digest: `[0xA0]` + +### 3.3 Variable-Length Types (Binary, String) + +**Types**: `Binary`, `LargeBinary`, `Utf8`, `LargeUtf8`. + +Each element is serialized as: +``` +[length as u64 little-endian: 8 bytes] [raw bytes: length bytes] +``` + +The length prefix is **always `u64`** (8 bytes, little-endian) regardless of the Arrow offset type. + +**Non-nullable**: For each element, feed `(len as u64).to_le_bytes()` then the raw bytes. + +**Nullable**: +1. Extend the validity `BitVec`. +2. For valid elements: feed length prefix + raw bytes. +3. For null elements: **skip entirely** — no bytes fed to data digest. + +### 3.4 List Types + +**Types**: `List(field)`, `LargeList(field)`. + +List types use **structural hashing**: element counts are written to a separate `structural` SHA-256 digest, while leaf data from sub-arrays flows into the `data` digest. 
This separation prevents collisions between differently-grouped lists (e.g., `[[1,2],[3]]` vs `[[1],[2,3]]`).
+
+For each valid list element (a sub-array):
+
+1. **Structural digest** receives: `[sub-array element count as u64 little-endian: 8 bytes]`
+2. **Data digest** receives: recursive serialization of the sub-array's leaf values
+
+**Nullable**: Extend validity `BitVec`; skip null list entries entirely (no bytes to either digest).
+
+Sub-array elements are hashed recursively using the same rules. If a list contains nested lists (e.g., `List<List<Int32>>`), each nesting level writes its element counts to the same structural digest, and only the innermost leaf values reach the data digest.
+
+#### Concrete Example: Structural vs Leaf Separation
+
+For `LargeList<Int32>` with data `[[1,2],[3]]`:
+
+```
+structural digest receives:
+  02 00 00 00 00 00 00 00   (element 0: 2 items, u64 LE)
+  01 00 00 00 00 00 00 00   (element 1: 1 item, u64 LE)
+
+data digest receives:
+  01 00 00 00   (1 as i32 LE)
+  02 00 00 00   (2 as i32 LE)
+  03 00 00 00   (3 as i32 LE)
+```
+
+Compare with `[[1],[2,3]]`:
+
+```
+structural digest receives:
+  01 00 00 00 00 00 00 00   (element 0: 1 item)
+  02 00 00 00 00 00 00 00   (element 1: 2 items)
+
+data digest receives:
+  01 00 00 00   (same leaf bytes)
+  02 00 00 00
+  03 00 00 00
+```
+
+The data digests are identical, but the structural digests differ — so the final hashes differ.
+
+### 3.5 Struct Types
+
+Struct fields are handled differently depending on context:
+
+#### Record-Batch Path (field decomposition)
+
+In the record-batch path (`hash_record_batch`, streaming `update`/`finalize`), struct fields are **decomposed into leaf fields**. Each leaf field within the struct is extracted and hashed independently under its own path key (e.g., `address/city`, `address/zip`). These paths live in a `BTreeMap`, so they are always processed in alphabetical order. The struct itself does not appear as a separate entry.
+
+#### Composite Path (`hash_array`, list sub-arrays)
+
+When a struct appears as a standalone array (`hash_array`) or as a sub-array within a list, it is hashed **compositely**:
+
+1. **Struct-level nulls**: If the parent digest buffer is nullable, push struct-level validity into the parent's `BitVec` (same as all other types via `handle_null_bits`).
+
+2. **Children sorted alphabetically** by field name.
+
+3. **For each child** (in sorted order):
+   - Create a fresh digest buffer for the child. The child is **effectively nullable** if either the child field is nullable OR the struct has null rows. The child gets a **structural digest** if it is a list type.
+   - If the struct has null rows, **propagate struct nulls** to the child: `combined_valid(i) = struct_valid(i) AND child_valid(i)`. This ensures undefined data at null struct positions is never hashed.
+   - Hash the child recursively via `array_digest_update`.
+ - **Finalize the child digest** and write the resulting bytes into the parent's data stream (in the order: null_bits, structural, data): + - Non-nullable, non-list child: `SHA-256(child_data).finalize()` (32 bytes) + - Nullable, non-list child: `bit_count LE (8B) || validity_words BE (8B each) || SHA-256(child_data).finalize() (32B)` + - Non-nullable list child: `SHA-256(child_structural).finalize() (32B) || SHA-256(child_data).finalize() (32B)` + - Nullable list child: `bit_count LE (8B) || validity_words BE (8B each) || SHA-256(child_structural).finalize() (32B) || SHA-256(child_data).finalize() (32B)` + +The parent's data stream thus contains the concatenation of all children's finalized bytes (in alphabetical order). + +### 3.6 Dictionary-Encoded Arrays + +Dictionary arrays are **resolved to their plain equivalent** before hashing. The dictionary is unpacked so that the data stream is identical to a non-dictionary array with the same logical values. + +--- + +## 4. Field Digest Finalization + +After all record batches have been fed, each field's digest buffer is finalized and fed into the **final combining digest**. The three components are written in this fixed order: + +``` +1. null_bits (if present — nullable fields only) +2. structural (if present — list fields only) +3. data (always present) +``` + +### 4.1 Non-Nullable, Non-List Field + +``` +final_digest.update( SHA-256(data_bytes).finalize() ) // 32 bytes +``` + +Only the data digest is finalized (32 bytes). + +### 4.2 Nullable, Non-List Field + +``` +final_digest.update( bit_count.to_le_bytes() ) // 8 bytes (usize LE = u64 LE on 64-bit) +for each word in validity_bitvec.as_raw_slice(): // each word is usize (8 bytes on 64-bit) + final_digest.update( word.to_be_bytes() ) // 8 bytes big-endian per word +final_digest.update( SHA-256(data_bytes).finalize() ) // 32 bytes +``` + +### 4.3 Non-Nullable List Field + +``` +final_digest.update( SHA-256(structural_bytes).finalize() ) // 32 bytes (element counts) +final_digest.update( SHA-256(data_bytes).finalize() ) // 32 bytes (leaf values) +``` + +### 4.4 Nullable List Field + +``` +final_digest.update( bit_count.to_le_bytes() ) // 8 bytes +for each word in validity_bitvec.as_raw_slice(): + final_digest.update( word.to_be_bytes() ) // 8 bytes per word +final_digest.update( SHA-256(structural_bytes).finalize() ) // 32 bytes (element counts) +final_digest.update( SHA-256(data_bytes).finalize() ) // 32 bytes (leaf values) +``` + +**Validity BitVec details** (applies to all nullable variants): +- Storage type: `usize` (8 bytes on 64-bit platforms). +- Bit order: `Lsb0` (least significant bit first within each word). +- `bit_count` = total number of elements (valid + null), serialized as `usize` little-endian. +- Each storage word is serialized as `usize` big-endian. +- The last word may have unused high bits (zero-padded). + +--- + +## 5. Final Combining Digest + +The final hash is computed by feeding into a fresh SHA-256: + +``` +final_digest = SHA-256() + +// 1. Schema digest (32 bytes) +final_digest.update( schema_digest ) + +// 2. Field digests in alphabetical order of field path +for field_path in sorted(field_paths): + finalize field's DigestBufferType into final_digest (see Section 4) + +raw_hash = final_digest.finalize() // 32 bytes +output = [0x00, 0x00, 0x01] ++ raw_hash // 35 bytes +``` + +--- + +## 6. `hash_array` API + +The `hash_array` function hashes a single array (without a schema context). 
It works slightly differently from the record-batch path:
+
+```
+final_digest = SHA-256()
+
+// 1. Type metadata (canonical JSON string)
+canonical_type = data_type_to_value(effective_data_type)
+json_string = JSON.serialize(canonical_type)   // compact, keys sorted
+final_digest.update( json_string.as_bytes() )
+
+// 2. Data (with structural separation for list types)
+digest_buffer = {
+  null_bits:  BitVec if nullable, else absent
+  structural: SHA-256() if list type, else absent
+  data:       SHA-256()
+}
+array_digest_update(effective_data_type, effective_array, digest_buffer)
+finalize digest_buffer into final_digest (see Section 4)
+
+raw_hash = final_digest.finalize()        // 32 bytes
+output = [0x00, 0x00, 0x01] ++ raw_hash   // 35 bytes
+```
+
+Dictionary arrays are resolved to their value type before hashing.
+
+---
+
+## 7. Worked Examples
+
+### Example A: Simple Two-Column Table
+
+**Schema**: `{age: Int32 non-nullable, name: LargeUtf8 nullable}`
+
+**Data** (1 record batch, 2 rows):
+
+| age | name    |
+|-----|---------|
+| 25  | "Alice" |
+| 30  | NULL    |
+
+#### Step 1: Schema Digest
+
+Canonical JSON (compact):
+```
+{"age":{"data_type":"Int32","nullable":false},"name":{"data_type":"LargeUtf8","nullable":true}}
+```
+
+```
+schema_digest = SHA-256(b'{"age":{"data_type":"Int32","nullable":false},"name":{"data_type":"LargeUtf8","nullable":true}}')
+```
+
+#### Step 2: Field "age" (Int32, non-nullable)
+
+Values: `[25, 30]`
+
+Little-endian bytes:
+- 25 as i32 LE: `19 00 00 00`
+- 30 as i32 LE: `1e 00 00 00`
+
+Data fed to digest: `19 00 00 00 1e 00 00 00` (8 bytes, one contiguous slice)
+
+```
+age_data_digest = SHA-256(0x19000000_1e000000)
+```
+
+Finalization into final_digest (non-nullable):
+```
+final_digest.update( age_data_digest.finalize() )   // 32 bytes
+```
+
+#### Step 3: Field "name" (LargeUtf8, nullable)
+
+Values: `["Alice", NULL]`
+
+**Validity bits** (Lsb0 in usize words):
+- Element 0 ("Alice"): valid → bit = 1
+- Element 1 (NULL): null → bit = 0
+- BitVec contents: bits `[1, 0]`, bit_count = 2
+- As usize (Lsb0): bit 0 = 1, bit 1 = 0 → binary `...0000_0001` = 1
+- `as_raw_slice()` = `[1_usize]`
+
+Validity serialization:
+```
+bit_count LE: 02 00 00 00 00 00 00 00   (2 as usize little-endian)
+word 0 BE:    00 00 00 00 00 00 00 01   (1 as usize big-endian)
+```
+
+**Data bytes** (only valid elements):
+- "Alice": length 5 as u64 LE = `05 00 00 00 00 00 00 00`, then UTF-8 bytes `41 6c 69 63 65`
+- NULL: skipped entirely
+
+```
+name_data_digest = SHA-256(0x0500000000000000_416c696365)
+```
+
+Finalization into final_digest (nullable):
+```
+final_digest.update( 0x0200000000000000 )            // bit count
+final_digest.update( 0x0000000000000001 )            // word 0 BE
+final_digest.update( name_data_digest.finalize() )   // 32 bytes
+```
+
+#### Step 4: Final Combination
+
+Fields in alphabetical order: `age`, then `name`.
+
+```
+final_digest = SHA-256()
+final_digest.update( schema_digest )                 // 32 bytes
+final_digest.update( age_data_digest.finalize() )    // 32 bytes (non-nullable)
+final_digest.update( 0x0200000000000000 )            // name bit count
+final_digest.update( 0x0000000000000001 )            // name validity word
+final_digest.update( name_data_digest.finalize() )   // 32 bytes
+raw_hash = final_digest.finalize()
+output = 0x000001 ++ raw_hash
+```
+
+---
+
+### Example B: Boolean Array with Nulls (hash_array API)
+
+**Array**: `BooleanArray [true, NULL, false, true]` (nullable)
+
+#### Step 1: Type Metadata
+
+Canonical type JSON: `"Boolean"` (9 bytes as UTF-8, including the surrounding quotes)
+
+```
+final_digest.update(b'"Boolean"')
+```
+
+Note: `serde_json::to_string` of a JSON string value includes the surrounding quotes.
+
+#### Step 2: Data
+
+**Validity bits** (Lsb0 in usize):
+- `[1, 0, 1, 1]` → bits: b0=1, b1=0, b2=1, b3=1
+- As usize (Lsb0): binary `...0000_1101` = 13
+- `as_raw_slice()` = `[13_usize]`
+
+**Data bits** (Msb0 packed, valid values only):
+- Valid values: `[true, false, true]` (3 values)
+- Msb0 packing: bit7=true(1), bit6=false(0), bit5=true(1), bits4-0=0
+- Byte: `10100000` = `0xA0`
+
+```
+data_digest = SHA-256(0xA0)
+```
+
+#### Step 3: Finalization
+
+```
+final_digest = SHA-256()
+final_digest.update(b'"Boolean"')               // type metadata
+final_digest.update( 0x0400000000000000 )       // 4 bits (bit count LE)
+final_digest.update( 0x000000000000000D )       // 13 as usize BE
+final_digest.update( data_digest.finalize() )   // 32 bytes
+raw_hash = final_digest.finalize()
+output = 0x000001 ++ raw_hash
+```
+
+---
+
+### Example C: Non-Nullable Int32 Array (hash_array API)
+
+**Array**: `Int32Array [1, 2, 3]` (non-nullable)
+
+#### Step 1: Type Metadata
+
+`data_type_to_value` for Int32 produces the JSON value `"Int32"` (a JSON string). Serializing that value with `serde_json::to_string` keeps the surrounding quotes, so the type metadata is the 7-byte sequence `22 49 6e 74 33 32 22` (`"Int32"`).
+
+```
+final_digest.update(b'"Int32"')   // 7 bytes: 22 49 6e 74 33 32 22
+```
+
+#### Step 2: Data
+
+Values as i32 LE bytes:
+- 1: `01 00 00 00`
+- 2: `02 00 00 00`
+- 3: `03 00 00 00`
+
+Entire buffer fed as one slice: `01 00 00 00 02 00 00 00 03 00 00 00` (12 bytes)
+
+```
+data_digest = SHA-256(0x010000000200000003000000)
+```
+
+#### Step 3: Finalization (non-nullable)
+
+```
+final_digest = SHA-256()
+final_digest.update(b'"Int32"')                 // 7 bytes
+final_digest.update( data_digest.finalize() )   // 32 bytes
+raw_hash = final_digest.finalize()
+output = 0x000001 ++ raw_hash
+```
+
+---
+
+### Example D: Binary Array (hash_array API)
+
+**Array**: `BinaryArray [b"hi", b""]` (non-nullable)
+
+#### Step 1: Type Metadata
+
+`Binary` is canonicalized to `LargeBinary`.
+ +``` +final_digest.update(b'"LargeBinary"') // 13 bytes +``` + +#### Step 2: Data + +Each element: `[u64 LE length] [raw bytes]` + +- `b"hi"`: length 2 → `02 00 00 00 00 00 00 00` + `68 69` +- `b""`: length 0 → `00 00 00 00 00 00 00 00` (no raw bytes) + +``` +data_digest = SHA-256(0x0200000000000000_6869_0000000000000000) +``` + +#### Step 3: Finalization (non-nullable) + +``` +final_digest = SHA-256() +final_digest.update(b'"LargeBinary"') +final_digest.update( data_digest.finalize() ) +raw_hash = final_digest.finalize() +output = 0x000001 ++ raw_hash +``` + +--- + +### Example E: Column-Order Independence + +Two record batches with the same logical data but different column orders must produce identical hashes. + +**Batch 1** (columns: x, y): +``` +Schema: {x: Int32 non-nullable, y: Boolean nullable} +x: [10] +y: [true] +``` + +**Batch 2** (columns: y, x): +``` +Schema: {y: Boolean nullable, x: Int32 non-nullable} +y: [true] +x: [10] +``` + +Both produce the same canonical schema JSON: +``` +{"x":{"data_type":"Int32","nullable":false},"y":{"data_type":"Boolean","nullable":true}} +``` + +Both produce the same field digests (fields processed alphabetically: `x` then `y`): +- Field `x`: `SHA-256(0x0a000000)` (10 as i32 LE) +- Field `y`: validity `[1]` (1 bit, 1 word), data `0x80` (true packed Msb0) + +Therefore `hash_record_batch(batch1) == hash_record_batch(batch2)`. + +--- + +### Example F: Type Equivalence (Utf8 vs LargeUtf8) + +**Array 1**: `StringArray ["ab"]` (non-nullable, Arrow type `Utf8`) +**Array 2**: `LargeStringArray ["ab"]` (non-nullable, Arrow type `LargeUtf8`) + +Both produce the same type metadata: `"LargeUtf8"` (after canonicalization). + +Both produce the same data bytes: +``` +02 00 00 00 00 00 00 00 (length 2 as u64 LE) +61 62 ("ab" as UTF-8) +``` + +Therefore `hash_array(array1) == hash_array(array2)`. + +--- + +### Example G: Nullable Int32 Array with Nulls (hash_array API) + +**Array**: `Int32Array [Some(42), None, Some(-7), Some(0)]` (nullable) + +#### Step 1: Type Metadata + +``` +final_digest.update(b'"Int32"') // 7 bytes +``` + +#### Step 2: Data + +**Validity bits** (Lsb0 in usize): +- `[1, 0, 1, 1]` → bits: b0=1, b1=0, b2=1, b3=1 +- As usize (Lsb0): binary `...0000_1101` = 13 +- bit_count = 4 + +**Data bytes** (only valid elements): +- 42 as i32 LE: `2a 00 00 00` +- -7 as i32 LE: `f9 ff ff ff` +- 0 as i32 LE: `00 00 00 00` + +``` +data_digest = SHA-256(0x2a000000_f9ffffff_00000000) +``` + +#### Step 3: Finalization (nullable) + +``` +final_digest = SHA-256() +final_digest.update(b'"Int32"') // type metadata +final_digest.update( 0x0400000000000000 ) // 4 bits (bit count LE) +final_digest.update( 0x000000000000000D ) // 13 as usize BE +final_digest.update( data_digest.finalize() ) // 32 bytes +raw_hash = final_digest.finalize() +output = 0x000001 ++ raw_hash +``` + +--- + +### Example H: Nullable String Array with Nulls (hash_array API) + +**Array**: `StringArray [Some("hello"), None, Some("world"), Some("")]` (nullable, Arrow type `Utf8`) + +#### Step 1: Type Metadata + +`Utf8` is canonicalized to `LargeUtf8`. 
+
+```
+final_digest.update(b'"LargeUtf8"')   // 11 bytes
+```
+
+#### Step 2: Data
+
+**Validity bits** (Lsb0 in usize):
+- `[1, 0, 1, 1]` → 0b1101 = 13
+- bit_count = 4
+
+**Data bytes** (only valid elements, null skipped entirely):
+- `"hello"`: `05 00 00 00 00 00 00 00` (len=5 as u64 LE) + `68 65 6c 6c 6f`
+- `"world"`: `05 00 00 00 00 00 00 00` (len=5 as u64 LE) + `77 6f 72 6c 64`
+- `""`: `00 00 00 00 00 00 00 00` (len=0 as u64 LE, no raw bytes)
+
+```
+data_digest = SHA-256(len+"hello" + len+"world" + len+"")
+```
+
+#### Step 3: Finalization (nullable)
+
+```
+final_digest = SHA-256()
+final_digest.update(b'"LargeUtf8"')
+final_digest.update( 0x0400000000000000 )       // bit_count=4 LE
+final_digest.update( 0x000000000000000D )       // validity=13 BE
+final_digest.update( data_digest.finalize() )   // 32 bytes
+raw_hash = final_digest.finalize()
+output = 0x000001 ++ raw_hash
+```
+
+---
+
+### Example I: Empty Table (no data, schema only)
+
+**Schema**: `{a: Int32 non-nullable, b: Boolean nullable}`
+
+When no record batches are fed (i.e., `finalize()` is called immediately after construction), the field digests still exist — they just contain no data.
+
+#### Schema Digest
+
+```
+schema_json = '{"a":{"data_type":"Int32","nullable":false},"b":{"data_type":"Boolean","nullable":true}}'
+schema_digest = SHA-256(schema_json)
+```
+
+#### Field "a" (Int32, non-nullable)
+
+No data was fed, so:
+```
+a_data_digest = SHA-256("")   // SHA-256 of empty input
+```
+
+#### Field "b" (Boolean, nullable)
+
+No data was fed:
+- `bit_count` = 0 (no elements, BitVec is empty)
+- `as_raw_slice()` = `[]` (no words)
+- Data digest = SHA-256 of empty input
+
+#### Final Combination
+
+```
+final_digest = SHA-256()
+final_digest.update( schema_digest )            // 32 bytes
+final_digest.update( SHA-256("").finalize() )   // field "a" (non-nullable, 32 bytes)
+final_digest.update( 0x0000000000000000 )       // field "b" bit_count=0 LE
+// no validity words (raw_slice is empty for 0-length BitVec)
+final_digest.update( SHA-256("").finalize() )   // field "b" data (32 bytes)
+output = 0x000001 ++ final_digest.finalize()
+```
+
+---
+
+### Example J: Multi-Batch Streaming (batch-split independence)
+
+**Schema**: `{v: Int32 non-nullable}`
+
+Feeding two batches must produce the same hash as feeding one combined batch:
+
+- **Batch 1**: `v = [1, 2]`
+- **Batch 2**: `v = [3]`
+- **Combined**: `v = [1, 2, 3]`
+
+Because the internal SHA-256 state is incremental:
+```
+update(01 00 00 00 02 00 00 00)   // from batch 1
+update(03 00 00 00)               // from batch 2
+```
+is identical to:
+```
+update(01 00 00 00 02 00 00 00 03 00 00 00)   // single combined batch
+```
+
+#### Manual Computation
+
+```
+schema_json = '{"v":{"data_type":"Int32","nullable":false}}'
+schema_digest = SHA-256(schema_json)
+
+v_data_digest = SHA-256(0x010000000200000003000000)
+
+final_digest = SHA-256()
+final_digest.update( schema_digest )
+final_digest.update( v_data_digest.finalize() )
+output = 0x000001 ++ final_digest.finalize()
+```
+
+Therefore `hash(batch1 + batch2) == hash(combined)`.
+
+---
+
+### Example K: Struct Column in a Record Batch
+
+**Schema**: `{person: Struct<age: Int32, name: LargeUtf8> non-nullable}`
+
+**Data** (2 rows):
+
+| person.age | person.name |
+|------------|-------------|
+| 25         | "Alice"     |
+| 30         | "Bob"       |
+
+In the record-batch path, the struct is **decomposed into leaf fields**: `person/age` and `person/name`. Each is hashed independently.
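+
+For reference, a minimal sketch of building this batch with the `arrow` crate (the builder calls are illustrative; `Utf8` canonicalizes to `LargeUtf8` during hashing):
+
+```rust
+use std::sync::Arc;
+
+use arrow::array::{ArrayRef, Int32Array, RecordBatch, StringArray, StructArray};
+use arrow_schema::{DataType, Field, Fields, Schema};
+
+fn main() {
+    let age_field = Arc::new(Field::new("age", DataType::Int32, false));
+    let name_field = Arc::new(Field::new("name", DataType::Utf8, false));
+
+    // A StructArray pairs each child field with its column of values.
+    let person = StructArray::from(vec![
+        (
+            age_field.clone(),
+            Arc::new(Int32Array::from(vec![25, 30])) as ArrayRef,
+        ),
+        (
+            name_field.clone(),
+            Arc::new(StringArray::from(vec!["Alice", "Bob"])) as ArrayRef,
+        ),
+    ]);
+
+    let schema = Arc::new(Schema::new(vec![Field::new(
+        "person",
+        DataType::Struct(Fields::from(vec![age_field, name_field])),
+        false,
+    )]));
+    let batch = RecordBatch::try_new(schema, vec![Arc::new(person)]).unwrap();
+    assert_eq!(batch.num_rows(), 2);
+}
+```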
+ +#### Step 1: Schema Digest + +Canonical JSON: +``` +{"person":{"data_type":{"Struct":[{"data_type":"Int32","name":"age","nullable":false},{"data_type":"LargeUtf8","name":"name","nullable":false}]},"nullable":false}} +``` + +#### Step 2: Leaf field "person/age" (Int32, non-nullable) + +``` +age_data_digest = SHA-256(0x19000000_1e000000) // [25, 30] as i32 LE +``` + +#### Step 3: Leaf field "person/name" (LargeUtf8, non-nullable) + +``` +name_data_digest = SHA-256( + 0x0500000000000000 "Alice" // len=5 u64 LE + UTF-8 + 0x0300000000000000 "Bob" // len=3 u64 LE + UTF-8 +) +``` + +#### Step 4: Final Combination + +Fields alphabetically: `person/age`, `person/name`. + +``` +final_digest = SHA-256() +final_digest.update( schema_digest ) // 32 bytes +final_digest.update( age_data_digest.finalize() ) // 32 bytes (non-nullable) +final_digest.update( name_data_digest.finalize() ) // 32 bytes (non-nullable) +output = 0x000001 ++ final_digest.finalize() +``` + +--- + +### Example L: Struct Array via hash_array (non-nullable) + +**Array**: `StructArray [{a: 1, b: true}, {a: 2, b: false}]` + +Children: `a: Int32 non-null`, `b: Boolean non-null`. Struct is non-nullable. + +#### Step 1: Type Metadata + +Canonical type JSON (struct fields sorted alphabetically, keys sorted): +``` +{"Struct":[{"data_type":"Int32","name":"a","nullable":false},{"data_type":"Boolean","name":"b","nullable":false}]} +``` + +#### Step 2: Composite Data + +Children sorted by name: `a`, then `b`. + +**Child "a"** (Int32, non-nullable): +``` +child_a_data_digest = SHA-256(0x01000000_02000000) // [1, 2] as i32 LE +child_a_finalized = child_a_data_digest.finalize() // 32 bytes (non-nullable) +``` + +**Child "b"** (Boolean, non-nullable): +``` +// [true, false] → Msb0: bit7=1, bit6=0 → 0x80 +child_b_data_digest = SHA-256(0x80) +child_b_finalized = child_b_data_digest.finalize() // 32 bytes +``` + +**Parent data stream**: `child_a_finalized || child_b_finalized` + +``` +parent_data_digest = SHA-256( child_a_finalized || child_b_finalized ) +``` + +#### Step 3: Finalization (non-nullable) + +``` +final_digest = SHA-256() +final_digest.update( type_json_bytes ) // type metadata +final_digest.update( parent_data_digest.finalize() ) // 32 bytes +output = 0x000001 ++ final_digest.finalize() +``` + +--- + +### Example M: Nullable Struct Array via hash_array (struct-level nulls) + +**Array**: `StructArray [Some({a: 10, b: "x"}), None, Some({a: 30, b: "z"})]` + +Children: `a: Int32 non-null`, `b: LargeUtf8 non-null`. Struct is **nullable**. + +Row 1 is a null struct — children's data at row 1 is undefined and must be skipped. + +#### Step 1: Type Metadata + +Same struct type JSON as above (with appropriate fields): +``` +{"Struct":[{"data_type":"Int32","name":"a","nullable":false},{"data_type":"LargeUtf8","name":"b","nullable":false}]} +``` + +#### Step 2: Struct-Level Validity + +Struct validity: `[valid, null, valid]` → bits `[1, 0, 1]` +- bit_count = 3 +- usize word (Lsb0): `0b101` = 5 + +This goes into the parent's BitVec (the top-level digest for `hash_array`). 
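+
+A quick sketch checking the validity-word arithmetic with the `bitvec` crate (the same storage type and bit order the hasher uses):
+
+```rust
+use bitvec::prelude::*;
+
+fn main() {
+    // Struct validity [valid, null, valid] → bits [1, 0, 1],
+    // stored Lsb0 in usize words.
+    let mut v: BitVec<usize, Lsb0> = BitVec::new();
+    v.extend([true, false, true]);
+    assert_eq!(v.len(), 3);                      // bit_count = 3, serialized LE
+    assert_eq!(v.as_raw_slice(), &[0b101usize]); // word = 5, serialized BE
+}
+```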
+
+#### Step 3: Composite Data (children with struct-null propagation)
+
+**Child "a"** (Int32, effectively nullable due to struct nulls):
+- Combined validity: struct AND child = `[1, 0, 1]` (child has no nulls)
+- Valid data: `[10, 30]` (row 1 skipped)
+- bit_count = 3, validity_word = 5
+
+```
+child_a_data_digest = SHA-256(0x0a000000_1e000000)   // [10, 30] as i32 LE
+child_a_finalized = 0x0300000000000000               // bit_count=3 LE
+                 || 0x0000000000000005               // validity word=5 BE
+                 || child_a_data_digest.finalize()   // 32 bytes
+```
+
+**Child "b"** (LargeUtf8, effectively nullable):
+- Combined validity: `[1, 0, 1]`
+- Valid data: `"x"`, `"z"` (row 1 skipped)
+
+```
+child_b_data_digest = SHA-256(
+    0x0100000000000000 "x"   // len=1 + "x"
+    0x0100000000000000 "z"   // len=1 + "z"
+)
+child_b_finalized = 0x0300000000000000               // bit_count=3 LE
+                 || 0x0000000000000005               // validity word=5 BE
+                 || child_b_data_digest.finalize()   // 32 bytes
+```
+
+**Parent data stream**: `child_a_finalized || child_b_finalized`
+
+```
+parent_data_digest = SHA-256( child_a_finalized || child_b_finalized )
+```
+
+#### Step 4: Finalization (nullable)
+
+```
+final_digest = SHA-256()
+final_digest.update( type_json_bytes )                 // type metadata
+final_digest.update( 0x0300000000000000 )              // struct bit_count=3 LE
+final_digest.update( 0x0000000000000005 )              // struct validity word=5 BE
+final_digest.update( parent_data_digest.finalize() )   // 32 bytes
+output = 0x000001 ++ final_digest.finalize()
+```
+
+---
+
+### Example N: List-of-Struct in a Record Batch
+
+**Schema**: `{items: LargeList<Struct<id: Int32, label: LargeUtf8>> nullable}`
+
+**Data** (2 rows):
+
+| items |
+|-------|
+| `[{id: 1, label: "a"}, {id: 2, label: "b"}]` |
+| `[{id: 3, label: "c"}]` |
+
+The list column is a single field "items" in the BTreeMap. Its sub-arrays are struct arrays, hashed compositely via `array_digest_update(Struct)`.
+
+#### Step 1: Schema Digest
+
+Canonical JSON (element type omits Arrow-internal field name "item"):
+```
+{"items":{"data_type":{"LargeList":{"data_type":{"Struct":[{"data_type":"Int32","name":"id","nullable":false},{"data_type":"LargeUtf8","name":"label","nullable":false}]},"nullable":false}},"nullable":true}}
+```
+
+#### Step 2: Field "items" (nullable list — has null_bits, structural, and data)
+
+**Validity BitVec** (`null_bits`) — accumulates null bits from the list **and** all recursive sub-arrays that share this digest:
+
+1. List-level: `handle_null_bits(list)` → `[1, 1]` (both list elements valid)
+2. Element 0 struct (2 rows, no nulls): `handle_null_bits(struct)` → `[1, 1]`
+3. Element 1 struct (1 row, no nulls): `handle_null_bits(struct)` → `[1]`
+
+Total BitVec: `[1, 1, 1, 1, 1]` — 5 bits, all valid.
+- bit_count = 5 +- usize word (Lsb0): `0b11111` = 31 + +**Structural digest** — receives element counts for each valid list element: + +``` +items_structural receives: + 0x0200000000000000 // element 0: 2 struct rows (u64 LE) + 0x0100000000000000 // element 1: 1 struct row (u64 LE) +``` + +**Data digest** — receives composite struct data (no element count prefixes): + +For each list element, the struct children are sorted alphabetically and their finalized digests are written into the data stream: + +**Element 0** (2 struct rows): + +Struct children (sorted: "id", "label"): +- Child "id" (Int32, non-nullable): `SHA-256(0x01000000_02000000).finalize()` — 32 bytes +- Child "label" (LargeUtf8, non-nullable): `SHA-256(0x0100000000000000 "a" 0x0100000000000000 "b").finalize()` — 32 bytes + +**Element 1** (1 struct row): + +- Child "id": `SHA-256(0x03000000).finalize()` — 32 bytes +- Child "label": `SHA-256(0x0100000000000000 "c").finalize()` — 32 bytes + +``` +items_data_digest = SHA-256( + SHA-256([1,2] as i32 LE).finalize() // element 0 child "id" + || SHA-256(len+"a"+len+"b").finalize() // element 0 child "label" + || SHA-256([3] as i32 LE).finalize() // element 1 child "id" + || SHA-256(len+"c").finalize() // element 1 child "label" +) +``` + +Note: element counts are **not** in the data digest — they are in the structural digest. + +#### Step 3: Final Combination + +Finalization order: null_bits → structural → data (see Section 4.4). + +``` +final_digest = SHA-256() +final_digest.update( schema_digest ) // 32 bytes + +// items field finalization (nullable list = null_bits + structural + data) +final_digest.update( 0x0500000000000000 ) // bit_count=5 LE +final_digest.update( 0x000000000000001F ) // validity word=31 BE +final_digest.update( items_structural_digest.finalize() ) // 32 bytes (element counts) +final_digest.update( items_data_digest.finalize() ) // 32 bytes (leaf data) + +output = 0x000001 ++ final_digest.finalize() +``` + +--- + +## 8. Platform Considerations + +- **Integer sizes**: All length prefixes use `u64` (8 bytes). Validity bit counts and validity words use `usize`, which is 8 bytes on 64-bit platforms. This means hashes are **platform-dependent** if `usize` differs (32-bit vs 64-bit). +- **Byte order**: Data values use little-endian. Validity words use big-endian. Bit counts use little-endian. +- **Floating point**: IEEE 754 representation is hashed directly. `NaN` values with different bit patterns produce different hashes. `+0.0` and `-0.0` produce different hashes. 
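+
+As a final cross-check of the framing in Section 1, a minimal sketch with the `sha2` crate (the input here is an arbitrary stand-in for the combined digest stream):
+
+```rust
+use sha2::{Digest, Sha256};
+
+fn main() {
+    // A 3-byte version prefix (0.0.1) followed by the 32-byte SHA-256
+    // digest gives the 35-byte Starfix hash described in Section 1.
+    let raw: [u8; 32] = Sha256::digest(b"combined digest stream").into();
+    let mut out = Vec::with_capacity(35);
+    out.extend_from_slice(&[0x00, 0x00, 0x01]); // version prefix
+    out.extend_from_slice(&raw);
+    assert_eq!(out.len(), 35);
+}
+```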
diff --git a/src/arrow_digester_core.rs b/src/arrow_digester_core.rs
index 5dde5a6..112bdbe 100644
--- a/src/arrow_digester_core.rs
+++ b/src/arrow_digester_core.rs
@@ -7,24 +7,39 @@ use std::{collections::BTreeMap, iter::repeat_n};
 use arrow::{
     array::{
-        Array, BinaryArray, BooleanArray, GenericBinaryArray, GenericListArray, GenericStringArray,
-        LargeBinaryArray, LargeListArray, LargeStringArray, ListArray, OffsetSizeTrait,
-        RecordBatch, StringArray, StructArray,
+        make_array, Array, BinaryArray, BooleanArray, GenericBinaryArray, GenericListArray,
+        GenericStringArray, LargeBinaryArray, LargeListArray, LargeStringArray, ListArray,
+        OffsetSizeTrait, RecordBatch, StringArray, StructArray,
     },
+    buffer::NullBuffer,
+    compute::cast,
     datatypes::{DataType, Schema},
 };
 use arrow_schema::Field;
 use bitvec::prelude::*;
 use digest::Digest;
 
-const NULL_BYTES: &[u8] = b"NULL";
-
 const DELIMITER_FOR_NESTED_FIELD: &str = "/";
 
 #[derive(Clone)]
-enum DigestBufferType<D> {
-    NonNullable(D),
-    Nullable(BitVec, D), // Where first digest is for the bull bits, while the second is for the actual data
+struct DigestBufferType<D> {
+    null_bits: Option<BitVec>,
+    structural: Option<D>,
+    data: D,
+}
+
+impl<D: Digest> DigestBufferType<D> {
+    fn new(nullable: bool, structured: bool) -> Self {
+        Self {
+            null_bits: nullable.then(BitVec::new),
+            structural: structured.then(D::new),
+            data: D::new(),
+        }
+    }
+}
+
+const fn is_list_type(data_type: &DataType) -> bool {
+    matches!(data_type, DataType::List(_) | DataType::LargeList(_))
 }
 
 #[derive(Clone)]
@@ -56,9 +71,10 @@ impl<D: Digest> ArrowDigesterCore<D> {
 
     /// Hash a record batch and update the internal digests.
     pub fn update(&mut self, record_batch: &RecordBatch) {
-        // Verify schema matches
+        // Verify schema matches logically (same fields regardless of order, with type canonicalization)
        assert!(
-            *record_batch.schema() == self.schema,
+            Self::serialized_schema(record_batch.schema().as_ref())
+                == Self::serialized_schema(&self.schema),
             "Record batch schema does not match ArrowDigester schema"
         );
 
@@ -112,21 +128,33 @@ impl<D: Digest> ArrowDigesterCore<D> {
     /// This function will panic if JSON serialization of the data type fails.
     ///
     pub fn hash_array(array: &dyn Array) -> Vec<u8> {
+        // Resolve dictionary arrays to their plain value type
+        let (effective_type, resolved_array);
+        let effective_array: &dyn Array =
+            if let DataType::Dictionary(_, value_type) = array.data_type() {
+                resolved_array = cast(array, value_type.as_ref())
+                    .expect("Failed to cast dictionary to plain array");
+                effective_type = value_type.as_ref().clone();
+                resolved_array.as_ref()
+            } else {
+                effective_type = array.data_type().clone();
+                array
+            };
+
         let mut final_digest = D::new();
-        let data_type_serialized = serde_json::to_string(&array.data_type())
+        // Use canonical type serialization for metadata
+        let canonical_type = Self::data_type_to_value(&effective_type);
+        let data_type_serialized = serde_json::to_string(&canonical_type)
             .expect("Failed to serialize data type to string");
 
         // Update the digest buffer with the array metadata and field data
         final_digest.update(data_type_serialized);
 
         // Now we update it with the actual array data
-        let mut digest_buffer = if array.is_nullable() {
-            DigestBufferType::Nullable(BitVec::new(), D::new())
-        } else {
-            DigestBufferType::NonNullable(D::new())
-        };
-        Self::array_digest_update(array.data_type(), array, &mut digest_buffer);
+        let mut digest_buffer =
+            DigestBufferType::new(effective_array.is_nullable(), is_list_type(&effective_type));
+        Self::array_digest_update(&effective_type, effective_array, &mut digest_buffer);
 
         Self::finalize_digest(&mut final_digest, digest_buffer);
 
         // Finalize and return the digest
@@ -164,18 +192,19 @@ impl<D: Digest> ArrowDigesterCore<D> {
     /// Finalize a single field digest into the final digest.
     /// Helpers to reduce code duplication.
     fn finalize_digest(final_digest: &mut D, digest: DigestBufferType<D>) {
-        match digest {
-            DigestBufferType::NonNullable(data_digest) => {
-                final_digest.update(data_digest.finalize());
-            }
-            DigestBufferType::Nullable(null_bit_digest, data_digest) => {
-                final_digest.update(null_bit_digest.len().to_le_bytes());
-                for &word in null_bit_digest.as_raw_slice() {
-                    final_digest.update(word.to_be_bytes());
-                }
-                final_digest.update(data_digest.finalize());
+        // Null bits first (if nullable)
+        if let Some(null_bit_vec) = &digest.null_bits {
+            final_digest.update(null_bit_vec.len().to_le_bytes());
+            for &word in null_bit_vec.as_raw_slice() {
+                final_digest.update(word.to_be_bytes());
             }
         }
+        // Structural digest (if list type) — sizes separated from leaf data
+        if let Some(structural) = digest.structural {
+            final_digest.update(structural.finalize());
+        }
+        // Data/leaf digest
+        final_digest.update(digest.data.finalize());
     }
 
     /// Serialize the schema into a `BTreeMap` for field name and its digest.
@@ -201,33 +230,44 @@ impl<D: Digest> ArrowDigesterCore<D> {
     /// Convert a `DataType` to a JSON value, recursively converting any inner `Field`
     /// references to only include `name`, `data_type`, and `nullable`.
    fn data_type_to_value(data_type: &DataType) -> serde_json::Value {
-        match data_type {
+        let value = match data_type {
             DataType::Struct(fields) => {
-                let fields_json: Vec<serde_json::Value> = fields
+                let mut sorted_fields: Vec<_> = fields.iter().collect();
+                sorted_fields.sort_by_key(|f| f.name().clone());
+                let fields_json: Vec<serde_json::Value> = sorted_fields
                     .iter()
                     .map(|f| Self::inner_field_to_value(f))
                     .collect();
                 serde_json::json!({ "Struct": fields_json })
             }
-            DataType::List(field) => {
-                serde_json::json!({ "List": Self::inner_field_to_value(field) })
-            }
-            DataType::LargeList(field) => {
-                serde_json::json!({ "LargeList": Self::inner_field_to_value(field) })
+            // Canonicalize List → LargeList; drop Arrow-internal field name ("item")
+            DataType::List(field) | DataType::LargeList(field) => {
+                serde_json::json!({ "LargeList": Self::element_type_to_value(field) })
             }
             DataType::FixedSizeList(field, size) => {
-                serde_json::json!({ "FixedSizeList": [Self::inner_field_to_value(field), size] })
+                serde_json::json!({ "FixedSizeList": [Self::element_type_to_value(field), size] })
             }
             DataType::Map(field, sorted) => {
                 serde_json::json!({ "Map": [Self::inner_field_to_value(field), sorted] })
             }
+            // Canonicalize Binary → LargeBinary
+            DataType::Binary => {
+                serde_json::to_value(&DataType::LargeBinary).expect("Failed to serialize data type")
+            }
+            // Canonicalize Utf8 → LargeUtf8
+            DataType::Utf8 => {
+                serde_json::to_value(&DataType::LargeUtf8).expect("Failed to serialize data type")
+            }
+            // Canonicalize Dictionary → value type
+            DataType::Dictionary(_, value_type) => Self::data_type_to_value(value_type.as_ref()),
             // For all non-nested types, Arrow's default serde is sufficient
             other => serde_json::to_value(other).expect("Failed to serialize data type"),
-        }
+        };
+        Self::sort_json_value(value)
     }
 
-    /// Convert an inner field (e.g., list item, struct child) to a JSON value
-    /// with only `name`, `data_type`, and `nullable`.
+    /// Convert an inner field (e.g., struct child) to a JSON value
+    /// with `name`, `data_type`, and `nullable`.
     fn inner_field_to_value(field: &Field) -> serde_json::Value {
         serde_json::json!({
             "name": field.name(),
@@ -236,6 +276,15 @@ impl<D: Digest> ArrowDigesterCore<D> {
         })
     }
 
+    /// Convert a container element field (e.g., list item) to a JSON value
+    /// with only `data_type` and `nullable`, omitting the Arrow-internal field name.
+    fn element_type_to_value(field: &Field) -> serde_json::Value {
+        serde_json::json!({
+            "data_type": Self::data_type_to_value(field.data_type()),
+            "nullable": field.is_nullable(),
+        })
+    }
+
     /// Recursively sort all JSON object keys for deterministic serialization.
    fn sort_json_value(value: serde_json::Value) -> serde_json::Value {
         match value {
@@ -327,30 +376,25 @@ impl<D: Digest> ArrowDigesterCore<D> {
                     .downcast_ref::<BooleanArray>()
                     .expect("Failed to downcast to BooleanArray");
 
-                match digest {
-                    DigestBufferType::NonNullable(data_digest) => {
-                        // We want to bit pack the boolean values into bytes for hashing
-                        let mut bit_vec = BitVec::<u8, Msb0>::with_capacity(bool_array.len());
-                        for i in 0..bool_array.len() {
+                if let Some(ref mut null_bits) = digest.null_bits {
+                    // Handle null bits first
+                    Self::handle_null_bits(bool_array, null_bits);
+
+                    // Handle the data — only valid bits
+                    let mut bit_vec = BitVec::<u8, Msb0>::with_capacity(bool_array.len());
+                    for i in 0..bool_array.len() {
+                        if bool_array.is_valid(i) {
                             bit_vec.push(bool_array.value(i));
                         }
-
-                        data_digest.update(bit_vec.as_raw_slice());
                     }
-                    DigestBufferType::Nullable(null_bit_vec, data_digest) => {
-                        // Handle null bits first
-                        Self::handle_null_bits(bool_array, null_bit_vec);
-
-                        // Handle the data
-                        let mut bit_vec = BitVec::<u8, Msb0>::with_capacity(bool_array.len());
-                        for i in 0..bool_array.len() {
-                            // We only want the valid bits, for null we will discard from the hash since that is already capture by null_bits
-                            if bool_array.is_valid(i) {
-                                bit_vec.push(bool_array.value(i));
-                            }
-                        }
-                        data_digest.update(bit_vec.as_raw_slice());
+                    digest.data.update(bit_vec.as_raw_slice());
+                } else {
+                    // Non-nullable: pack all boolean values
+                    let mut bit_vec = BitVec::<u8, Msb0>::with_capacity(bool_array.len());
+                    for i in 0..bool_array.len() {
+                        bit_vec.push(bool_array.value(i));
                     }
+                    digest.data.update(bit_vec.as_raw_slice());
                 }
             }
             DataType::Int8 | DataType::UInt8 => Self::hash_fixed_size_array(array, digest, 1),
@@ -432,9 +476,75 @@ impl<D: Digest> ArrowDigesterCore<D> {
                 );
             }
             DataType::LargeListView(_) => todo!(),
-            DataType::Struct(_) => todo!(),
+            DataType::Struct(fields) => {
+                let struct_array = array
+                    .as_any()
+                    .downcast_ref::<StructArray>()
+                    .expect("Failed to downcast to StructArray");
+
+                // Push struct-level nulls to parent's BitVec (same pattern as other types)
+                if let Some(ref mut null_bits) = digest.null_bits {
+                    Self::handle_null_bits(struct_array, null_bits);
+                }
+
+                // Sort children alphabetically by field name
+                let mut sorted_fields: Vec<_> = fields.iter().enumerate().collect();
+                sorted_fields.sort_by_key(|(_, f)| f.name().clone());
+
+                for (idx, child_field) in &sorted_fields {
+                    let child_array = struct_array.column(*idx);
+
+                    // Child is effectively nullable if the child field is nullable
+                    // OR the struct itself has nulls (struct-level nulls propagate down)
+                    let effectively_nullable =
+                        child_field.is_nullable() || struct_array.nulls().is_some();
+
+                    let mut child_digest = DigestBufferType::new(
+                        effectively_nullable,
+                        is_list_type(child_field.data_type()),
+                    );
+
+                    if let Some(struct_nulls) = struct_array.nulls() {
+                        // Propagate struct-level nulls into the child array by combining
+                        // struct validity with child validity: combined = struct AND child
+                        let combined_nulls = child_array.nulls().map_or_else(
+                            || struct_nulls.clone(),
+                            |child_nulls| {
+                                NullBuffer::new(struct_nulls.inner() & child_nulls.inner())
+                            },
+                        );
+                        let child_data = child_array.to_data();
+                        let null_count = combined_nulls.null_count();
+                        let new_data = child_data
+                            .into_builder()
+                            .null_count(null_count)
+                            .null_bit_buffer(Some(combined_nulls.into_inner().into_inner()))
+                            .build()
+                            .expect("Failed to rebuild child array with combined null buffer");
+                        let combined_child = make_array(new_data);
+                        Self::array_digest_update(
+                            child_field.data_type(),
+                            combined_child.as_ref(),
+                            &mut child_digest,
+                        );
+                    } else {
+                        Self::array_digest_update(
+                            child_field.data_type(),
+                            child_array.as_ref(),
+                            &mut child_digest,
+                        );
+                    }
+
+                    // Finalize child digest into parent's data stream
+                    Self::finalize_child_into_data(digest, child_digest);
+                }
+            }
             DataType::Union(_, _) => todo!(),
-            DataType::Dictionary(_, _) => todo!(),
+            DataType::Dictionary(_, value_type) => {
+                let resolved = cast(array, value_type.as_ref())
+                    .expect("Failed to cast dictionary to plain array");
+                Self::array_digest_update(value_type.as_ref(), resolved.as_ref(), digest);
+            }
             DataType::Decimal128(_, _) => {
                 Self::hash_fixed_size_array(array, digest, 16);
             }
@@ -469,41 +579,38 @@
             )
             .expect("Failed to get buffer slice for FixedSizeBinaryArray");
 
-        match digest_buffer {
-            DigestBufferType::NonNullable(data_digest) => {
-                // No nulls, we can hash the entire buffer directly
-                data_digest.update(slice);
-            }
-            DigestBufferType::Nullable(null_bits, data_digest) => {
-                // Handle null bits first
-                Self::handle_null_bits(array, null_bits);
-
-                match array_data.nulls() {
-                    Some(null_buffer) => {
-                        // There are nulls, so we need to incrementally hash each value
-                        for i in 0..array_data.len() {
-                            if null_buffer.is_valid(i) {
-                                let data_pos = i
-                                    .checked_mul(element_size_usize)
-                                    .expect("Data position multiplication overflow");
-                                let end_pos = data_pos
-                                    .checked_add(element_size_usize)
-                                    .expect("End position addition overflow");
-
-                                data_digest.update(
-                                    slice
-                                        .get(data_pos..end_pos)
-                                        .expect("Failed to get data_slice"),
-                                );
-                            }
+        if let Some(ref mut null_bits) = digest_buffer.null_bits {
+            // Handle null bits first
+            Self::handle_null_bits(array, null_bits);
+
+            match array_data.nulls() {
+                Some(null_buffer) => {
+                    // There are nulls, so we need to incrementally hash each value
+                    for i in 0..array_data.len() {
+                        if null_buffer.is_valid(i) {
+                            let data_pos = i
+                                .checked_mul(element_size_usize)
+                                .expect("Data position multiplication overflow");
+                            let end_pos = data_pos
+                                .checked_add(element_size_usize)
+                                .expect("End position addition overflow");
+
+                            digest_buffer.data.update(
+                                slice
+                                    .get(data_pos..end_pos)
+                                    .expect("Failed to get data_slice"),
+                            );
                         }
                     }
-                    None => {
-                        // No nulls, we can hash the entire buffer directly
-                        data_digest.update(slice);
-                    }
+                }
+                None => {
+                    // No nulls, we can hash the entire buffer directly
+                    digest_buffer.data.update(slice);
                 }
             }
+        } else {
+            // No nulls, we can hash the entire buffer directly
+            digest_buffer.data.update(slice);
         }
     }
 
@@ -511,42 +618,16 @@
         array: &GenericBinaryArray<O>,
         digest: &mut DigestBufferType<D>,
     ) {
-        match digest {
-            DigestBufferType::NonNullable(data_digest) => {
-                for i in 0..array.len() {
-                    let value = array.value(i);
-                    data_digest.update(value.len().to_le_bytes());
-                    data_digest.update(value);
-                }
-            }
-            DigestBufferType::Nullable(null_bit_vec, data_digest) => {
-                // Deal with the null bits first
-                if let Some(null_buf) = array.nulls() {
-                    // We would need to iterate through the null buffer and push it into the null_bit_vec
-                    for i in 0..array.len() {
-                        null_bit_vec.push(null_buf.is_valid(i));
-                    }
+        if let Some(ref mut null_bits) = digest.null_bits {
+            Self::handle_null_bits(array, null_bits);
+        }
 
-                    for i in 0..array.len() {
-                        if null_buf.is_valid(i) {
-                            let value = array.value(i);
-                            data_digest.update(value.len().to_le_bytes());
-                            data_digest.update(value);
-                        } else {
-                            data_digest.update(NULL_BYTES);
-                        }
-                    }
-                } else {
-                    // All valid, therefore we can extend the bit vector with all true values
-                    null_bit_vec.extend(repeat_n(true, array.len()));
-
-                    // Deal with the data
-                    for i in 0..array.len() {
-                        let value = array.value(i);
-                        data_digest.update(value.len().to_le_bytes());
-                        data_digest.update(value);
-                    }
-                }
+        let null_buf = array.nulls();
+        for i in 0..array.len() {
+            if null_buf.is_none_or(|nb| nb.is_valid(i)) {
+                let value = array.value(i);
+                digest.data.update((value.len() as u64).to_le_bytes());
+                digest.data.update(value);
             }
         }
     }
 
@@ -555,38 +636,16 @@
         array: &GenericStringArray<O>,
         digest: &mut DigestBufferType<D>,
     ) {
-        match digest {
-            DigestBufferType::NonNullable(data_digest) => {
-                for i in 0..array.len() {
-                    let value = array.value(i);
-                    data_digest.update((value.len() as u64).to_le_bytes());
-                    data_digest.update(value.as_bytes());
-                }
-            }
-            DigestBufferType::Nullable(null_bit_vec, data_digest) => {
-                // Deal with the null bits first
-                Self::handle_null_bits(array, null_bit_vec);
-
-                match array.nulls() {
-                    Some(null_buf) => {
-                        for i in 0..array.len() {
-                            if null_buf.is_valid(i) {
-                                let value = array.value(i);
-                                data_digest.update((value.len() as u64).to_le_bytes());
-                                data_digest.update(value.as_bytes());
-                            } else {
-                                data_digest.update(NULL_BYTES);
-                            }
-                        }
-                    }
-                    None => {
-                        for i in 0..array.len() {
-                            let value = array.value(i);
-                            data_digest.update((value.len() as u64).to_le_bytes());
-                            data_digest.update(value.as_bytes());
-                        }
-                    }
-                }
+        if let Some(ref mut null_bits) = digest.null_bits {
+            Self::handle_null_bits(array, null_bits);
+        }
+
+        let null_buf = array.nulls();
+        for i in 0..array.len() {
+            if null_buf.is_none_or(|nb| nb.is_valid(i)) {
+                let value = array.value(i);
+                digest.data.update((value.len() as u64).to_le_bytes());
+                digest.data.update(value.as_bytes());
             }
         }
     }
@@ -596,40 +655,27 @@
         field_data_type: &DataType,
         digest: &mut DigestBufferType<D>,
     ) {
-        match digest {
-            // Wildcard `_` avoids binding so `digest` remains usable below
-            DigestBufferType::NonNullable(_) => {
-                for i in 0..array.len() {
-                    let sub = array.value(i);
-                    // Prefix sub-array element count to prevent cross-boundary collisions.
-                    // Without this [[1,2],[3]] and [[1],[2,3]] produce identical byte streams.
-                    // sub.len() returns usize, avoiding the non-primitive OffsetSizeTrait cast.
-                    Self::update_data_digest(digest, (sub.len() as u64).to_le_bytes());
-                    Self::array_digest_update(field_data_type, sub.as_ref(), digest);
-                }
-            }
-            DigestBufferType::Nullable(bit_vec, _) => {
-                // Deal with null bits first; NLL ends bit_vec borrow after this call
-                Self::handle_null_bits(array, bit_vec);
-
-                match array.nulls() {
-                    Some(null_buf) => {
-                        for i in 0..array.len() {
-                            if null_buf.is_valid(i) {
-                                let sub = array.value(i);
-                                Self::update_data_digest(digest, (sub.len() as u64).to_le_bytes());
-                                Self::array_digest_update(field_data_type, sub.as_ref(), digest);
-                            }
-                        }
-                    }
-                    None => {
-                        for i in 0..array.len() {
-                            let sub = array.value(i);
-                            Self::update_data_digest(digest, (sub.len() as u64).to_le_bytes());
-                            Self::array_digest_update(field_data_type, sub.as_ref(), digest);
-                        }
-                    }
+        // Handle null bits first (if nullable)
+        if let Some(ref mut null_bits) = digest.null_bits {
+            Self::handle_null_bits(array, null_bits);
+        }
+
+        let null_buf = array.nulls();
+        for i in 0..array.len() {
+            if null_buf.is_none_or(|nb| nb.is_valid(i)) {
+                let sub = array.value(i);
+                let size_bytes = (sub.len() as u64).to_le_bytes();
+
+                // Write element count to structural digest (separating structure from leaf data).
+                // If no structural digest exists, fall back to data digest for backward compat.
+                if let Some(ref mut structural) = digest.structural {
+                    structural.update(size_bytes);
+                } else {
+                    digest.data.update(size_bytes);
                 }
+
+                // Recurse into sub-array — leaf data goes to data digest
+                Self::array_digest_update(field_data_type, sub.as_ref(), digest);
            }
         }
     }
@@ -655,11 +701,7 @@
             // Base case, just add the the combine field name to the map
             fields_digest_buffer.insert(
                 Self::construct_field_name_hierarchy(parent_field_name, field.name()),
-                if field.is_nullable() {
-                    DigestBufferType::Nullable(BitVec::new(), D::new())
-                } else {
-                    DigestBufferType::NonNullable(D::new())
-                },
+                DigestBufferType::new(field.is_nullable(), is_list_type(field.data_type())),
             );
         }
     }
@@ -672,12 +714,33 @@
         }
     }
 
-    /// Write bytes directly into the data digest portion of the buffer, bypassing null-bit tracking.
+    /// Write bytes directly into the data/leaf digest portion of the buffer, bypassing null-bit tracking.
     /// Used to write length prefixes that sit in the data stream but are not nullable values.
     fn update_data_digest(digest: &mut DigestBufferType<D>, data: impl AsRef<[u8]>) {
-        match digest {
-            DigestBufferType::NonNullable(d) | DigestBufferType::Nullable(_, d) => d.update(data),
+        digest.data.update(data);
+    }
+
+    /// Finalize a child's digest and write the resulting bytes into the parent's data stream.
+    /// Used for composite types (structs) where each child is independently hashed and then
+    /// its finalized representation is fed into the parent digest.
+    #[expect(
+        clippy::big_endian_bytes,
+        reason = "Use for bit packing the null_bit_values"
+    )]
+    fn finalize_child_into_data(parent: &mut DigestBufferType<D>, child: DigestBufferType<D>) {
+        // Null bits first (if nullable child)
+        if let Some(null_bit_vec) = &child.null_bits {
+            Self::update_data_digest(parent, null_bit_vec.len().to_le_bytes());
+            for &word in null_bit_vec.as_raw_slice() {
+                Self::update_data_digest(parent, word.to_be_bytes());
+            }
+        }
+        // Structural digest (if list child)
+        if let Some(structural) = child.structural {
+            Self::update_data_digest(parent, structural.finalize());
        }
+        // Data/leaf digest
+        Self::update_data_digest(parent, child.data.finalize());
     }
 
     fn handle_null_bits(array: &dyn Array, null_bit_vec: &mut BitVec) {
@@ -727,7 +790,7 @@ mod tests {
     use pretty_assertions::assert_eq;
     use sha2::{Digest as _, Sha256};
 
-    use crate::arrow_digester_core::{ArrowDigesterCore, DigestBufferType};
+    use crate::arrow_digester_core::ArrowDigesterCore;
     use arrow::array::{Decimal256Array, Decimal64Array};
     use arrow_buffer::i256;
 
@@ -920,7 +983,7 @@
         // Check the digest
         assert_eq!(
             encode(digester.finalize()),
-            "9841aab2dfeb637872d41422d33fca1e939f06b8fa0dcec66ff3782592cf9565"
+            "e13ce8a993a636f70e30bc2f4c0667fa6a42aeef94d1a32e78e8fd8dbc59b0a0"
         );
     }
 
@@ -944,11 +1007,9 @@
             .unwrap(),
         );
 
-        let DigestBufferType::Nullable(null_bit_vec, data_digest) =
-            &digester.fields_digest_buffer["col"]
-        else {
-            panic!("Expected Nullable buffer");
-        };
+        let buf = &digester.fields_digest_buffer["col"];
+        let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable");
+        let data_digest = &buf.data;
 
         assert_eq!(null_bit_vec.len(), 4);
         assert!(null_bit_vec[0], "index 0 (true) should be valid");
@@ -981,10 +1042,9 @@
             .unwrap(),
         );
 
-        let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"]
-        else {
-            panic!("Expected NonNullable buffer");
-        };
+        let buf =
&digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = &buf.data; // [false, true, false] packed Msb0: bit0=0, bit1=1, bit2=0 → 0100_0000 = 0x40 let mut manual = Sha256::new(); @@ -1008,11 +1068,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1039,10 +1097,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = &buf.data; let mut manual = Sha256::new(); manual.update([0x01_u8, 0x02_u8, 0xFF_u8]); @@ -1067,11 +1124,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1102,10 +1157,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = &buf.data; let mut manual = Sha256::new(); manual.update(100_u16.to_le_bytes()); @@ -1138,10 +1192,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = &buf.data; let mut manual = Sha256::new(); manual.update(half::f16::from_f32(1.0).to_le_bytes()); @@ -1176,13 +1229,12 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = digester + let buf = digester .fields_digest_buffer .get("int32_col") - .expect("int32_col field should exist in digest buffer") - else { - panic!("Expected a Nullable digest buffer for int32_col"); - }; + .expect("int32_col field should exist in digest buffer"); + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; // The null bit vector should be [true, false, true, true] for [Some(42), None, Some(-7), Some(0)] assert_eq!(null_bit_vec.len(), 4); @@ -1217,11 +1269,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1256,11 +1306,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - 
}; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1296,11 +1344,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1333,10 +1379,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = &buf.data; let mut manual = Sha256::new(); manual.update(0_i32.to_le_bytes()); @@ -1361,11 +1406,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1392,11 +1435,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1429,10 +1470,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = &buf.data; let mut manual = Sha256::new(); manual.update(1.0_f64.to_le_bytes()); @@ -1464,11 +1504,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1501,10 +1539,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = &buf.data; let mut manual = Sha256::new(); manual.update(0_i64.to_le_bytes()); @@ -1529,11 +1566,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1560,11 +1595,9 @@ mod 
tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1601,11 +1634,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1640,11 +1671,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1680,11 +1709,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1724,11 +1751,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1766,11 +1791,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1789,8 +1812,8 @@ mod tests { #[test] fn digest_binary_nullable_bytes() { // [b"hello", None, b"world"] - // Valid entries: (length as usize LE) ++ bytes. - // Null entries contribute the sentinel b"NULL" to the data digest. + // Valid entries: (length as u64 LE) ++ bytes. + // Null entries are skipped entirely in the data digest. 
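+ // (Skipping is safe: null positions are still recorded in the validity BitVec and folded into the final combination.)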
let array = BinaryArray::from(vec![Some(b"hello".as_ref()), None, Some(b"world".as_ref())]); let schema = Schema::new(vec![Field::new("col", DataType::Binary, true)]); let mut digester = ArrowDigesterCore::<Sha256>::new(schema); @@ -1802,11 +1825,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1814,10 +1835,10 @@ mod tests { assert!(null_bit_vec[2]); let mut manual = Sha256::new(); - manual.update(5_usize.to_le_bytes()); // len("hello") + manual.update(5_u64.to_le_bytes()); // len("hello") manual.update(b"hello"); - manual.update(b"NULL"); // null sentinel - manual.update(5_usize.to_le_bytes()); // len("world") + // null entry skipped — no sentinel bytes + manual.update(5_u64.to_le_bytes()); // len("world") manual.update(b"world"); assert_eq!(data_digest.clone().finalize(), manual.finalize()); } @@ -1840,15 +1861,14 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = &buf.data; let mut manual = Sha256::new(); - manual.update(2_usize.to_le_bytes()); + manual.update(2_u64.to_le_bytes()); manual.update(b"ab"); - manual.update(3_usize.to_le_bytes()); + manual.update(3_u64.to_le_bytes()); manual.update(b"cde"); assert_eq!(data_digest.clone().finalize(), manual.finalize()); } @@ -1859,7 +1879,7 @@ mod tests { fn digest_utf8_nullable_bytes() { // ["foo", None, "ba"] // Valid entries: (length as u64 LE) ++ UTF-8 bytes. - // Null entries contribute the sentinel b"NULL" to the data digest. + // Null entries are skipped entirely in the data digest. 
let array = StringArray::from(vec![Some("foo"), None, Some("ba")]); let schema = Schema::new(vec![Field::new("col", DataType::Utf8, true)]); let mut digester = ArrowDigesterCore::<Sha256>::new(schema); @@ -1871,11 +1891,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::Nullable(null_bit_vec, data_digest) = - &digester.fields_digest_buffer["col"] - else { - panic!("Expected Nullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + let null_bit_vec = buf.null_bits.as_ref().expect("Expected nullable"); + let data_digest = &buf.data; assert_eq!(null_bit_vec.len(), 3); assert!(null_bit_vec[0]); @@ -1885,7 +1903,7 @@ mod tests { let mut manual = Sha256::new(); manual.update(3_u64.to_le_bytes()); // len("foo") manual.update(b"foo"); - manual.update(b"NULL"); // null sentinel + // null entry skipped — no sentinel bytes manual.update(2_u64.to_le_bytes()); // len("ba") manual.update(b"ba"); assert_eq!(data_digest.clone().finalize(), manual.finalize()); @@ -1909,10 +1927,9 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let data_digest = &buf.data; let mut manual = Sha256::new(); manual.update(1_u64.to_le_bytes()); @@ -1958,18 +1975,28 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let structural_digest = buf + .structural + .as_ref() + .expect("Expected structural digest for list"); + let data_digest = &buf.data; + + // Structural digest: element count (sizes separated from leaf data) + let mut manual_structural = Sha256::new(); + manual_structural.update(3_u64.to_le_bytes()); // element count prefix + assert_eq!( + structural_digest.clone().finalize(), + manual_structural.finalize() + ); - // sub-array has 3 elements at offset 0 → raw buffer slice from byte 0 - let mut manual = Sha256::new(); - manual.update(3_u64.to_le_bytes()); // element count prefix - manual.update(10_i32.to_le_bytes()); - manual.update(20_i32.to_le_bytes()); - manual.update(30_i32.to_le_bytes()); - assert_eq!(data_digest.clone().finalize(), manual.finalize()); + // Data/leaf digest: only the raw leaf values + let mut manual_data = Sha256::new(); + manual_data.update(10_i32.to_le_bytes()); + manual_data.update(20_i32.to_le_bytes()); + manual_data.update(30_i32.to_le_bytes()); + assert_eq!(data_digest.clone().finalize(), manual_data.finalize()); } #[test] @@ -2001,16 +2028,27 @@ mod tests { .unwrap(), ); - let DigestBufferType::NonNullable(data_digest) = &digester.fields_digest_buffer["col"] - else { - panic!("Expected NonNullable buffer"); - }; + let buf = &digester.fields_digest_buffer["col"]; + assert!(buf.null_bits.is_none(), "Expected non-nullable"); + let structural_digest = buf + .structural + .as_ref() + .expect("Expected structural digest for list"); + let data_digest = &buf.data; + + // Structural digest: element count (sizes separated from leaf data) + let mut manual_structural = Sha256::new(); + manual_structural.update(3_u64.to_le_bytes()); + assert_eq!( + structural_digest.clone().finalize(), + manual_structural.finalize() + ); - let mut manual = Sha256::new(); - manual.update(3_u64.to_le_bytes()); - 
manual.update(1_i32.to_le_bytes()); - manual.update(2_i32.to_le_bytes()); - manual.update(3_i32.to_le_bytes()); - assert_eq!(data_digest.clone().finalize(), manual.finalize()); + // Data/leaf digest: only the raw leaf values + let mut manual_data = Sha256::new(); + manual_data.update(1_i32.to_le_bytes()); + manual_data.update(2_i32.to_le_bytes()); + manual_data.update(3_i32.to_le_bytes()); + assert_eq!(data_digest.clone().finalize(), manual_data.finalize()); } } diff --git a/tests/arrow_digester.rs b/tests/arrow_digester.rs index 303e258..45d9581 100644 --- a/tests/arrow_digester.rs +++ b/tests/arrow_digester.rs @@ -73,7 +73,7 @@ mod tests { assert_eq!( encode(ArrowDigester::new(schema.clone()).finalize()), - "0000019c75bd0c40bd2fb15e878418c151c0b792c966476b35ded7d0f6fd1922cf5a00" + "0000016a44e0dc5c25d5ca0c53312a6afcffa6e07168afc7f16f5e16c8ca052f09f1bb" ); let batch = RecordBatch::try_new( @@ -129,7 +129,7 @@ mod tests { // Hash the record batch assert_eq!( encode(ArrowDigester::hash_record_batch(&batch)), - "00000199f7ba7f6c7ec30ad487996c2b3eb6f0e1c750c318a32b09afcdfdce7de8c08e" + "0000010bc624523e362eb2377c47ccfaf9399a5631404bc20821fdd4e09ca25ea49fde" ); } @@ -199,10 +199,10 @@ mod tests { let hash = hex::encode(ArrowDigester::hash_array(&binary_array)); assert_eq!( hash, - "000001466801efd880d2acecd6c78915b5c2a51476870f9116912834d79de43a000071" + "000001fd0b85d56d72f59c5981c0b54cea148d3a737db10b696e3e3d1d444aed764893" ); - // Test large binary array with same data to ensure consistency + // Large binary array with same data should produce identical hash (type canonicalization) let large_binary_array = LargeBinaryArray::from(vec![ Some(b"hello".as_ref()), None, @@ -210,7 +210,7 @@ mod tests { Some(b"".as_ref()), ]); - assert_ne!( + assert_eq!( hex::encode(ArrowDigester::hash_array(&large_binary_array)), hash ); @@ -263,14 +263,14 @@ mod tests { let hash = hex::encode(ArrowDigester::hash_array(&string_array)); assert_eq!( hash, - "000001811f2407a0d2e90ef9688514d37cd92225242e7614f02ef5ef36abcae73ca374" + "000001088e379f978a8f8ed7148e118bfbcdda99f5bc28c203cdb793da765c76987a9b" ); - // Test large string array with same data to ensure consistency + // Large string array with same data should produce identical hash (type canonicalization) let large_string_array = LargeStringArray::from(vec![Some("hello"), None, Some("world"), Some("")]); - assert_ne!( + assert_eq!( hex::encode(ArrowDigester::hash_array(&large_string_array)), hash ); @@ -289,7 +289,7 @@ mod tests { let hash = hex::encode(ArrowDigester::hash_array(&list_array)); assert_eq!( hash, - "00000114b8faee7c56d2a94d77095db599152df41aaf4d11e485035eebc94e8981f769" + "00000125939ebc0815ab1fb13b19fd7c0f36a1b27c09ec33d8100f5ba9f0e0032442ae" ); // Collision test: [[1, 2], [3]] vs [[1], [2, 3]] @@ -603,7 +603,7 @@ mod tests { /// Two schemas with the same struct fields in different order should produce identical schema hashes. /// Bug: `data_type_to_value()` preserves struct field insertion order in the JSON Vec. #[test] - #[ignore = "Bug: struct fields not sorted in data_type_to_value (Issue 1)"] + fn struct_field_order_in_schema_should_not_affect_hash() { let schema1 = Schema::new(vec![Field::new( "my_struct", @@ -640,7 +640,7 @@ mod tests { /// Record batches with struct columns whose inner fields are reordered should produce identical hashes. 
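+ /// (Struct children are flattened into leaf fields and combined in sorted-name order, so declaration order never reaches the digest.)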
#[test] - #[ignore = "Bug: struct fields not sorted in data_type_to_value (Issue 1)"] + fn struct_field_order_in_record_batch_should_not_affect_hash() { let schema1 = Arc::new(Schema::new(vec![Field::new( "s", @@ -667,8 +667,7 @@ mod tests { )])); let ints = Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef; - let bools = - Arc::new(BooleanArray::from(vec![Some(true), Some(false), None])) as ArrayRef; + let bools = Arc::new(BooleanArray::from(vec![Some(true), Some(false), None])) as ArrayRef; let struct1 = StructArray::from(vec![ ( @@ -692,10 +691,8 @@ mod tests { ), ]); - let batch1 = - RecordBatch::try_new(schema1, vec![Arc::new(struct1) as ArrayRef]).unwrap(); - let batch2 = - RecordBatch::try_new(schema2, vec![Arc::new(struct2) as ArrayRef]).unwrap(); + let batch1 = RecordBatch::try_new(schema1, vec![Arc::new(struct1) as ArrayRef]).unwrap(); + let batch2 = RecordBatch::try_new(schema2, vec![Arc::new(struct2) as ArrayRef]).unwrap(); assert_eq!( encode(ArrowDigester::hash_record_batch(&batch1)), @@ -707,7 +704,7 @@ mod tests { // ── Issue 5: Type canonicalization (Binary/LargeBinary, Utf8/LargeUtf8, List/LargeList) ── #[test] - #[ignore = "Bug: no type canonicalization for Binary vs LargeBinary (Issue 5)"] + fn binary_and_large_binary_schema_should_hash_equal() { let schema1 = Schema::new(vec![Field::new("col", DataType::Binary, true)]); let schema2 = Schema::new(vec![Field::new("col", DataType::LargeBinary, true)]); @@ -720,7 +717,7 @@ mod tests { } #[test] - #[ignore = "Bug: no type canonicalization for Utf8 vs LargeUtf8 (Issue 5)"] + fn utf8_and_large_utf8_schema_should_hash_equal() { let schema1 = Schema::new(vec![Field::new("col", DataType::Utf8, true)]); let schema2 = Schema::new(vec![Field::new("col", DataType::LargeUtf8, true)]); @@ -733,7 +730,7 @@ mod tests { } #[test] - #[ignore = "Bug: no type canonicalization for List vs LargeList (Issue 5)"] + fn list_and_large_list_schema_should_hash_equal() { let list_field = Field::new("item", DataType::Int32, true); let schema1 = Schema::new(vec![Field::new( @@ -755,18 +752,11 @@ mod tests { } #[test] - #[ignore = "Bug: no type canonicalization for Binary vs LargeBinary in hash_array (Issue 5)"] + fn binary_and_large_binary_array_should_hash_equal() { - let bin = BinaryArray::from(vec![ - Some(b"hello".as_ref()), - None, - Some(b"world".as_ref()), - ]); - let large_bin = LargeBinaryArray::from(vec![ - Some(b"hello".as_ref()), - None, - Some(b"world".as_ref()), - ]); + let bin = BinaryArray::from(vec![Some(b"hello".as_ref()), None, Some(b"world".as_ref())]); + let large_bin = + LargeBinaryArray::from(vec![Some(b"hello".as_ref()), None, Some(b"world".as_ref())]); assert_eq!( encode(ArrowDigester::hash_array(&bin)), @@ -776,7 +766,7 @@ mod tests { } #[test] - #[ignore = "Bug: no type canonicalization for Utf8 vs LargeUtf8 in hash_array (Issue 5)"] + fn utf8_and_large_utf8_array_should_hash_equal() { let arr = StringArray::from(vec![Some("hello"), None, Some("world")]); let large_arr = LargeStringArray::from(vec![Some("hello"), None, Some("world")]); @@ -789,7 +779,7 @@ mod tests { } #[test] - #[ignore = "Bug: no type canonicalization for Binary vs LargeBinary in hash_record_batch (Issue 5)"] + fn binary_and_large_binary_record_batch_should_hash_equal() { let schema1 = Arc::new(Schema::new(vec![Field::new("col", DataType::Binary, true)])); let schema2 = Arc::new(Schema::new(vec![Field::new( @@ -800,19 +790,13 @@ mod tests { let batch1 = RecordBatch::try_new( schema1, - vec![Arc::new(BinaryArray::from(vec![ - Some(b"abc".as_ref()), - 
None, - ])) as ArrayRef], + vec![Arc::new(BinaryArray::from(vec![Some(b"abc".as_ref()), None])) as ArrayRef], ) .unwrap(); let batch2 = RecordBatch::try_new( schema2, - vec![Arc::new(LargeBinaryArray::from(vec![ - Some(b"abc".as_ref()), - None, - ])) as ArrayRef], + vec![Arc::new(LargeBinaryArray::from(vec![Some(b"abc".as_ref()), None])) as ArrayRef], ) .unwrap(); @@ -826,7 +810,7 @@ mod tests { // ── Issue 6: Dictionary-encoded array equivalence ─────────────────── #[test] - #[ignore = "Bug: Dictionary arrays hit todo!() panic (Issue 6)"] + fn dictionary_utf8_should_hash_same_as_plain_string() { let plain = StringArray::from(vec![Some("apple"), Some("banana"), Some("apple")]); @@ -842,13 +826,12 @@ mod tests { } #[test] - #[ignore = "Bug: Dictionary arrays hit todo!() panic (Issue 6)"] + fn dictionary_int_values_should_hash_same_as_plain() { let plain = StringArray::from(vec![Some("x"), Some("y"), Some("x")]); - let dict: DictionaryArray = vec![Some("x"), Some("y"), Some("x")] - .into_iter() - .collect(); + let dict: DictionaryArray = + vec![Some("x"), Some("y"), Some("x")].into_iter().collect(); assert_eq!( encode(ArrowDigester::hash_array(&plain)), @@ -858,13 +841,12 @@ mod tests { } #[test] - #[ignore = "Bug: Dictionary arrays hit todo!() panic (Issue 6)"] + fn dictionary_with_nulls_should_hash_same_as_plain() { let plain = StringArray::from(vec![Some("a"), None, Some("b"), None]); - let dict: DictionaryArray = vec![Some("a"), None, Some("b"), None] - .into_iter() - .collect(); + let dict: DictionaryArray = + vec![Some("a"), None, Some("b"), None].into_iter().collect(); assert_eq!( encode(ArrowDigester::hash_array(&plain)), @@ -877,7 +859,7 @@ mod tests { /// Feeding a batch with reordered columns into a digester should not panic. #[test] - #[ignore = "Bug: update() uses strict schema equality including column order (Issue 7)"] + fn streaming_update_with_reordered_columns_should_succeed() { let schema = Schema::new(vec![ Field::new("a", DataType::Int32, false), @@ -908,7 +890,7 @@ mod tests { /// A digester fed batches with different column orders should produce the same hash /// as one fed batches in the original order. #[test] - #[ignore = "Bug: update() uses strict schema equality including column order (Issue 7)"] + fn streaming_reordered_columns_produce_same_hash() { let schema_ab = Schema::new(vec![ Field::new("a", DataType::Int32, false), diff --git a/tests/digest_bytes.rs b/tests/digest_bytes.rs index 5c6016f..f1df3c3 100644 --- a/tests/digest_bytes.rs +++ b/tests/digest_bytes.rs @@ -1,2 +1,977 @@ +/// Manual byte-level verification tests for the Starfix hashing specification. +/// +/// Each test in this module manually computes the expected SHA-256 hash by +/// feeding the exact bytes described in `docs/byte-layout-spec.md` into a +/// fresh SHA-256 hasher, then asserts that the library produces the identical +/// result. This serves as both a conformance check and a reference +/// implementation for anyone porting Starfix to another language. 
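+/// +/// As a quick orientation (a sketch of the byte flow; Example C below is the executable version): hashing the non-nullable array Int32Array [1, 2, 3] reduces to version ++ SHA-256(b"\"Int32\"" ++ SHA-256(le(1) ++ le(2) ++ le(3))), where le(x) is the 4-byte little-endian encoding and ++ is byte concatenation.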
#[cfg(test)] -mod tests {} +mod tests { + #![expect(clippy::unwrap_used, reason = "Okay in test")] + #![expect( + clippy::similar_names, + reason = "child_a/child_b naming is clear in test context" + )] + #![expect(clippy::redundant_clone, reason = "Clones for clarity in test setup")] + #![expect(clippy::absolute_paths, reason = "One-off use in test")] + #![expect( + clippy::big_endian_bytes, + reason = "Starfix spec requires BE serialization of validity words" + )] + + use std::sync::Arc; + + use arrow::array::{ + ArrayRef, BinaryArray, BooleanArray, Int32Array, LargeListArray, LargeStringArray, + RecordBatch, StringArray, StructArray, + }; + use arrow::buffer::NullBuffer; + use arrow_schema::{DataType, Field, Schema}; + use sha2::{Digest as _, Sha256}; + use starfix::ArrowDigester; + + const VERSION: [u8; 3] = [0x00, 0x00, 0x01]; + + // ── Helper ─────────────────────────────────────────────────────────── + + /// Prepend the 3-byte version prefix to a 32-byte SHA-256 digest, + /// returning the full 35-byte Starfix hash. + fn with_version(digest: Vec<u8>) -> Vec<u8> { + let mut out = VERSION.to_vec(); + out.extend(digest); + out + } + + // ══════════════════════════════════════════════════════════════════════ + // Example A: Simple Two-Column Table (record batch) + // Schema: {age: Int32 non-nullable, name: LargeUtf8 nullable} + // Row 0: age=25, name="Alice" + // Row 1: age=30, name=NULL + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_a_two_column_table() { + // ── Build the table ────────────────────────────────────────────── + let schema = Schema::new(vec![ + Field::new("age", DataType::Int32, false), + Field::new("name", DataType::LargeUtf8, true), + ]); + let batch = RecordBatch::try_new( + Arc::new(schema.clone()), + vec![ + Arc::new(Int32Array::from(vec![25_i32, 30])) as ArrayRef, + Arc::new(LargeStringArray::from(vec![Some("Alice"), None])) as ArrayRef, + ], + ) + .unwrap(); + + // ── Step 1: Schema digest ──────────────────────────────────────── + let schema_json = r#"{"age":{"data_type":"Int32","nullable":false},"name":{"data_type":"LargeUtf8","nullable":true}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); + + // Verify the library agrees on schema hash + assert_eq!( + ArrowDigester::hash_schema(&schema), + with_version(schema_digest.to_vec()), + "Schema hash mismatch — canonical JSON may differ" + ); + + // ── Step 2: Field "age" (Int32, non-nullable) ──────────────────── + // Values: [25, 30] → little-endian bytes + let mut age_data = Sha256::new(); + age_data.update(25_i32.to_le_bytes()); // 19 00 00 00 + age_data.update(30_i32.to_le_bytes()); // 1e 00 00 00 + let age_data_finalized = age_data.finalize(); + + // ── Step 3: Field "name" (LargeUtf8, nullable) ─────────────────── + // Values: ["Alice", NULL] + // + // Validity BitVec (Lsb0, usize storage): + // bit 0 = 1 (valid), bit 1 = 0 (null) + // → usize word = 0b01 = 1 + // bit_count = 2 + let bit_count: usize = 2; + let validity_word: usize = 1; // bits: [1, 0] in Lsb0 + + // Data bytes (only valid elements): + // "Alice" → len=5 as u64 LE, then UTF-8 bytes + // NULL → skipped + let mut name_data = Sha256::new(); + name_data.update(5_u64.to_le_bytes()); // length prefix + name_data.update(b"Alice"); // raw UTF-8 bytes + // NULL element: nothing fed + let name_data_finalized = name_data.finalize(); + + // ── Step 4: Final combination ──────────────────────────────────── + // Fields in alphabetical order: "age", "name" + let mut final_digest = Sha256::new(); + 
+ // Schema + final_digest.update(schema_digest); + + // Field "age" (non-nullable → just the data digest) + final_digest.update(age_data_finalized); + + // Field "name" (nullable → bit_count + validity words + data digest) + final_digest.update(bit_count.to_le_bytes()); // 02 00 00 00 00 00 00 00 + final_digest.update(validity_word.to_be_bytes()); // 00 00 00 00 00 00 00 01 + final_digest.update(name_data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + // ── Verify ─────────────────────────────────────────────────────── + assert_eq!( + ArrowDigester::hash_record_batch(&batch), + expected, + "Example A: two-column table hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example B: Boolean Array with Nulls (hash_array API) + // BooleanArray [true, NULL, false, true] (nullable) + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_b_boolean_array_with_nulls() { + let array = BooleanArray::from(vec![Some(true), None, Some(false), Some(true)]); + + // ── Type metadata ──────────────────────────────────────────────── + // data_type_to_value(Boolean) → JSON value "Boolean" + // serde_json::to_string(json!("Boolean")) → "\"Boolean\"" + let type_json = b"\"Boolean\""; + + // ── Validity bits (Lsb0, usize storage) ───────────────────────── + // [valid, null, valid, valid] → bits [1, 0, 1, 1] + // Lsb0 in usize: bit0=1, bit1=0, bit2=1, bit3=1 → 0b1101 = 13 + let bit_count: usize = 4; + let validity_word: usize = 0b1101; // = 13 + + // ── Data bits (Msb0 packed, valid values only) ─────────────────── + // Valid values: [true, false, true] → 3 bits + // Msb0: bit7=1(true), bit6=0(false), bit5=1(true), bits4-0=0 + // Byte: 0b1010_0000 = 0xA0 + let mut data_digest = Sha256::new(); + data_digest.update([0xA0_u8]); + let data_finalized = data_digest.finalize(); + + // ── Final combination ──────────────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(type_json); + // Nullable finalization + final_digest.update(bit_count.to_le_bytes()); + final_digest.update(validity_word.to_be_bytes()); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&array), + expected, + "Example B: boolean array hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example C: Non-Nullable Int32 Array (hash_array API) + // Int32Array [1, 2, 3] (non-nullable) + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_c_non_nullable_int32_array() { + let array = Int32Array::from(vec![1_i32, 2, 3]); + + // ── Type metadata ──────────────────────────────────────────────── + let type_json = b"\"Int32\""; + + // ── Data (contiguous LE buffer) ────────────────────────────────── + // [1, 2, 3] as i32 LE: + // 01 00 00 00 02 00 00 00 03 00 00 00 + let mut data_digest = Sha256::new(); + data_digest.update(1_i32.to_le_bytes()); + data_digest.update(2_i32.to_le_bytes()); + data_digest.update(3_i32.to_le_bytes()); + let data_finalized = data_digest.finalize(); + + // ── Final (non-nullable) ───────────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(type_json); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&array), + expected, + 
"Example C: non-nullable int32 array hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example D: Non-Nullable Binary Array (hash_array API) + // BinaryArray [b"hi", b""] (non-nullable) + // Tests type canonicalization: Binary → LargeBinary + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_d_non_nullable_binary_array() { + let array = BinaryArray::from(vec![b"hi".as_ref(), b"".as_ref()]); + + // ── Type metadata (canonicalized) ──────────────────────────────── + // Binary → LargeBinary in canonical form + let type_json = b"\"LargeBinary\""; + + // ── Data ───────────────────────────────────────────────────────── + // b"hi": len=2 as u64 LE + raw bytes + // b"": len=0 as u64 LE + (no bytes) + let mut data_digest = Sha256::new(); + data_digest.update(2_u64.to_le_bytes()); // 02 00 00 00 00 00 00 00 + data_digest.update(b"hi"); // 68 69 + data_digest.update(0_u64.to_le_bytes()); // 00 00 00 00 00 00 00 00 + let data_finalized = data_digest.finalize(); + + // ── Final (non-nullable) ───────────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(type_json); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&array), + expected, + "Example D: non-nullable binary array hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example E: Column-Order Independence + // Batch 1: columns [x: Int32, y: Boolean nullable] → x=10, y=true + // Batch 2: columns [y: Boolean nullable, x: Int32] → y=true, x=10 + // Both must produce the same hash. + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_e_column_order_independence() { + let ints = Arc::new(Int32Array::from(vec![10_i32])) as ArrayRef; + let bools = Arc::new(BooleanArray::from(vec![Some(true)])) as ArrayRef; + + let batch_xy = RecordBatch::try_new( + Arc::new(Schema::new(vec![ + Field::new("x", DataType::Int32, false), + Field::new("y", DataType::Boolean, true), + ])), + vec![Arc::clone(&ints), Arc::clone(&bools)], + ) + .unwrap(); + + let batch_yx = RecordBatch::try_new( + Arc::new(Schema::new(vec![ + Field::new("y", DataType::Boolean, true), + Field::new("x", DataType::Int32, false), + ])), + vec![Arc::clone(&bools), Arc::clone(&ints)], + ) + .unwrap(); + + // ── Manual computation ─────────────────────────────────────────── + let schema_json = r#"{"x":{"data_type":"Int32","nullable":false},"y":{"data_type":"Boolean","nullable":true}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); + + // Field "x" (Int32, non-nullable): value 10 + let mut x_data = Sha256::new(); + x_data.update(10_i32.to_le_bytes()); // 0a 00 00 00 + let x_finalized = x_data.finalize(); + + // Field "y" (Boolean, nullable): value true (valid) + // Validity: [1] → bit_count=1, word=1 (Lsb0) + // Data: [true] Msb0 → bit7=1 → 0x80 + let bit_count: usize = 1; + let validity_word: usize = 1; + + let mut y_data = Sha256::new(); + y_data.update([0x80_u8]); // true in Msb0 = 1000_0000 + let y_finalized = y_data.finalize(); + + // Final combination: schema, then fields alphabetically (x, y) + let mut final_digest = Sha256::new(); + final_digest.update(schema_digest); + // x (non-nullable) + final_digest.update(x_finalized); + // y (nullable) + final_digest.update(bit_count.to_le_bytes()); + 
final_digest.update(validity_word.to_be_bytes()); + final_digest.update(y_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + // ── Verify both column orderings produce the same hash ─────────── + let hash_xy = ArrowDigester::hash_record_batch(&batch_xy); + let hash_yx = ArrowDigester::hash_record_batch(&batch_yx); + + assert_eq!(hash_xy, hash_yx, "Column order should not affect hash"); + assert_eq!( + hash_xy, expected, + "Example E: column-order independence hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example F: Type Equivalence (Utf8 vs LargeUtf8, hash_array API) + // StringArray ["ab"] (Utf8, non-nullable) + // LargeStringArray ["ab"] (LargeUtf8, non-nullable) + // Both must produce the same hash. + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_f_utf8_large_utf8_equivalence() { + let small = StringArray::from(vec!["ab"]); + let large = LargeStringArray::from(vec!["ab"]); + + // ── Manual computation ─────────────────────────────────────────── + // Type metadata: both canonicalize to "LargeUtf8" + let type_json = b"\"LargeUtf8\""; + + // Data: "ab" → len=2 as u64 LE + UTF-8 bytes + let mut data_digest = Sha256::new(); + data_digest.update(2_u64.to_le_bytes()); + data_digest.update(b"ab"); + let data_finalized = data_digest.finalize(); + + let mut final_digest = Sha256::new(); + final_digest.update(type_json); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&small), + expected, + "Example F: Utf8 hash mismatch" + ); + assert_eq!( + ArrowDigester::hash_array(&large), + expected, + "Example F: LargeUtf8 hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example G: Nullable Int32 Array with Nulls (hash_array API) + // Int32Array [Some(42), None, Some(-7), Some(0)] + // Tests nullable fixed-size path with actual nulls. 
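+ // Null slots contribute a 0 validity bit but no data bytes (see section 3.1 of the spec).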
+ // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_g_nullable_int32_with_nulls() { + let array = Int32Array::from(vec![Some(42), None, Some(-7), Some(0)]); + + // ── Type metadata ──────────────────────────────────────────────── + let type_json = b"\"Int32\""; + + // ── Validity bits (Lsb0, usize) ───────────────────────────────── + // [valid, null, valid, valid] → bits [1, 0, 1, 1] → 0b1101 = 13 + let bit_count: usize = 4; + let validity_word: usize = 0b1101; // 13 + + // ── Data (only valid elements, in order) ───────────────────────── + // 42 as i32 LE: 2a 00 00 00 + // -7 as i32 LE: f9 ff ff ff + // 0 as i32 LE: 00 00 00 00 + let mut data_digest = Sha256::new(); + data_digest.update(42_i32.to_le_bytes()); + data_digest.update((-7_i32).to_le_bytes()); + data_digest.update(0_i32.to_le_bytes()); + let data_finalized = data_digest.finalize(); + + // ── Final (nullable) ───────────────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(type_json); + final_digest.update(bit_count.to_le_bytes()); + final_digest.update(validity_word.to_be_bytes()); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&array), + expected, + "Example G: nullable int32 array hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example H: Nullable String Array with Nulls (hash_array API) + // StringArray [Some("hello"), None, Some("world"), Some("")] + // Tests nullable variable-length path with type canonicalization. + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_h_nullable_string_array_with_nulls() { + let array = StringArray::from(vec![Some("hello"), None, Some("world"), Some("")]); + + // ── Type metadata (canonicalized) ──────────────────────────────── + // Utf8 → LargeUtf8 + let type_json = b"\"LargeUtf8\""; + + // ── Validity bits (Lsb0, usize) ───────────────────────────────── + // [valid, null, valid, valid] → bits [1, 0, 1, 1] → 0b1101 = 13 + let bit_count: usize = 4; + let validity_word: usize = 0b1101; + + // ── Data (only valid elements) ─────────────────────────────────── + // "hello" → len=5 u64 LE + "hello" + // "world" → len=5 u64 LE + "world" + // "" → len=0 u64 LE + let mut data_digest = Sha256::new(); + data_digest.update(5_u64.to_le_bytes()); + data_digest.update(b"hello"); + // NULL: skipped + data_digest.update(5_u64.to_le_bytes()); + data_digest.update(b"world"); + data_digest.update(0_u64.to_le_bytes()); + let data_finalized = data_digest.finalize(); + + // ── Final (nullable) ───────────────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(type_json); + final_digest.update(bit_count.to_le_bytes()); + final_digest.update(validity_word.to_be_bytes()); + final_digest.update(data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&array), + expected, + "Example H: nullable string array hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example I: Empty Table (schema only, no data) + // Tests that finalize() on a fresh digester with no update() calls + // produces schema_digest + empty field digests. 
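+ // (SHA-256 of empty input is the well-known e3b0c442... constant.)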
+ // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_i_empty_table() { + let schema = Schema::new(vec![ + Field::new("a", DataType::Int32, false), + Field::new("b", DataType::Boolean, true), + ]); + + // ── Schema digest ──────────────────────────────────────────────── + let schema_json = r#"{"a":{"data_type":"Int32","nullable":false},"b":{"data_type":"Boolean","nullable":true}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); + + // ── Field "a" (Int32, non-nullable): no data fed ───────────────── + // data_digest = SHA-256() with no updates → SHA-256 of empty input + let a_data_finalized = Sha256::digest(b""); + + // ── Field "b" (Boolean, nullable): no data fed ─────────────────── + // bit_count = 0 (no elements) + // as_raw_slice() = [] (no words) + // data_digest = SHA-256 of empty input + let bit_count: usize = 0; + let b_data_finalized = Sha256::digest(b""); + + // ── Final ──────────────────────────────────────────────────────── + let mut final_digest = Sha256::new(); + final_digest.update(schema_digest); + // Field "a" (non-nullable) + final_digest.update(a_data_finalized); + // Field "b" (nullable) — bit_count=0, no words, empty data digest + final_digest.update(bit_count.to_le_bytes()); + // no validity words (raw_slice is empty for 0-length BitVec) + final_digest.update(b_data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + let digester = ArrowDigester::new(schema); + assert_eq!( + digester.finalize(), + expected, + "Example I: empty table hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example J: Multi-Batch Streaming + // Feeding two small batches must produce the same hash as feeding + // one combined batch (batch-split independence). 
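+ // (SHA-256 state carries across update() calls, so no per-batch framing bytes exist to distinguish the two paths.)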
+ // Schema: {v: Int32 non-nullable} + // Batch 1: [1, 2] + // Batch 2: [3] + // Combined: [1, 2, 3] + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_j_multi_batch_streaming() { + let schema = Schema::new(vec![Field::new("v", DataType::Int32, false)]); + + // ── Two-batch path ─────────────────────────────────────────────── + let batch1 = RecordBatch::try_new( + Arc::new(schema.clone()), + vec![Arc::new(Int32Array::from(vec![1_i32, 2])) as ArrayRef], + ) + .unwrap(); + let batch2 = RecordBatch::try_new( + Arc::new(schema.clone()), + vec![Arc::new(Int32Array::from(vec![3_i32])) as ArrayRef], + ) + .unwrap(); + + let mut digester_stream = ArrowDigester::new(schema.clone()); + digester_stream.update(&batch1); + digester_stream.update(&batch2); + let hash_stream = digester_stream.finalize(); + + // ── Single-batch path ──────────────────────────────────────────── + let combined = RecordBatch::try_new( + Arc::new(schema), + vec![Arc::new(Int32Array::from(vec![1_i32, 2, 3])) as ArrayRef], + ) + .unwrap(); + let hash_combined = ArrowDigester::hash_record_batch(&combined); + + assert_eq!( + hash_stream, hash_combined, + "Streaming two batches should equal single combined batch" + ); + + // ── Manual computation ─────────────────────────────────────────── + let schema_json = r#"{"v":{"data_type":"Int32","nullable":false}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); + + // Field "v": data is [1, 2, 3] as i32 LE — accumulated across batches + // The digester is streaming, so it updates the same SHA-256 state: + // update(01 00 00 00 02 00 00 00) from batch 1 + // update(03 00 00 00) from batch 2 + // SHA-256 is incremental, so this is identical to hashing all 12 bytes at once. + let mut v_data = Sha256::new(); + v_data.update(1_i32.to_le_bytes()); + v_data.update(2_i32.to_le_bytes()); + v_data.update(3_i32.to_le_bytes()); + let v_finalized = v_data.finalize(); + + let mut final_digest = Sha256::new(); + final_digest.update(schema_digest); + final_digest.update(v_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + hash_stream, expected, + "Example J: multi-batch streaming hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example K: Struct Column in a Record Batch + // Schema: {person: Struct non-nullable} + // Row 0: {age: 25, name: "Alice"} + // Row 1: {age: 30, name: "Bob"} + // + // In the record-batch path, struct fields are decomposed into leaf + // fields: "person/age" and "person/name", each hashed independently. 
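+ // Leaf names join parent and child with "/", so they sort as "person/age" < "person/name".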
+ // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_k_struct_column_in_record_batch() { + // ── Build the table ────────────────────────────────────────────── + let age = Arc::new(Int32Array::from(vec![25_i32, 30])) as ArrayRef; + let name = Arc::new(LargeStringArray::from(vec!["Alice", "Bob"])) as ArrayRef; + let struct_array = StructArray::from(vec![ + ( + Arc::new(Field::new("age", DataType::Int32, false)), + Arc::clone(&age), + ), + ( + Arc::new(Field::new("name", DataType::LargeUtf8, false)), + Arc::clone(&name), + ), + ]); + + let schema = Schema::new(vec![Field::new( + "person", + DataType::Struct( + vec![ + Field::new("age", DataType::Int32, false), + Field::new("name", DataType::LargeUtf8, false), + ] + .into(), + ), + false, + )]); + let batch = RecordBatch::try_new( + Arc::new(schema.clone()), + vec![Arc::new(struct_array) as ArrayRef], + ) + .unwrap(); + + // ── Step 1: Schema digest ──────────────────────────────────────── + // Canonical JSON: struct fields sorted by name, keys sorted recursively + // "person" has data_type: {"Struct": [{"data_type": "Int32", "name": "age", "nullable": false}, + // {"data_type": "LargeUtf8", "name": "name", "nullable": false}]} + let schema_json = r#"{"person":{"data_type":{"Struct":[{"data_type":"Int32","name":"age","nullable":false},{"data_type":"LargeUtf8","name":"name","nullable":false}]},"nullable":false}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); + + assert_eq!( + ArrowDigester::hash_schema(&schema), + with_version(schema_digest.to_vec()), + "Example K: schema hash mismatch" + ); + + // ── Step 2: Leaf field "person/age" (Int32, non-nullable) ──────── + // Values: [25, 30] as i32 LE + let mut age_data = Sha256::new(); + age_data.update(25_i32.to_le_bytes()); + age_data.update(30_i32.to_le_bytes()); + let age_data_finalized = age_data.finalize(); + + // ── Step 3: Leaf field "person/name" (LargeUtf8, non-nullable) ─── + // Values: ["Alice", "Bob"] + let mut name_data = Sha256::new(); + name_data.update(5_u64.to_le_bytes()); // "Alice" length + name_data.update(b"Alice"); + name_data.update(3_u64.to_le_bytes()); // "Bob" length + name_data.update(b"Bob"); + let name_data_finalized = name_data.finalize(); + + // ── Step 4: Final combination ──────────────────────────────────── + // Fields alphabetically: "person/age", "person/name" + let mut final_digest = Sha256::new(); + final_digest.update(schema_digest); + // "person/age" (non-nullable): just data digest + final_digest.update(age_data_finalized); + // "person/name" (non-nullable): just data digest + final_digest.update(name_data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_record_batch(&batch), + expected, + "Example K: struct column record batch hash mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example L: Struct Array via hash_array (non-nullable struct) + // StructArray [{a: 1, b: true}, {a: 2, b: false}] + // Children: a: Int32 non-null, b: Boolean non-null + // + // In hash_array, the struct is hashed compositely: + // type_json + data where data = finalized(child_a) || finalized(child_b) + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_l_struct_array_hash_array() { + let a = Arc::new(Int32Array::from(vec![1_i32, 2])) as ArrayRef; + let b = Arc::new(BooleanArray::from(vec![true, false])) as ArrayRef; + let struct_array = 
StructArray::from(vec![ + ( + Arc::new(Field::new("a", DataType::Int32, false)), + Arc::clone(&a), + ), + ( + Arc::new(Field::new("b", DataType::Boolean, false)), + Arc::clone(&b), + ), + ]); + + // ── Type metadata ──────────────────────────────────────────────── + // Canonical: {"Struct":[{"data_type":"Int32","name":"a","nullable":false}, + // {"data_type":"Boolean","name":"b","nullable":false}]} + let type_json = r#"{"Struct":[{"data_type":"Int32","name":"a","nullable":false},{"data_type":"Boolean","name":"b","nullable":false}]}"#; + + // ── Child "a" (Int32, non-nullable) ────────────────────────────── + // Values: [1, 2] + let mut child_a_data = Sha256::new(); + child_a_data.update(1_i32.to_le_bytes()); + child_a_data.update(2_i32.to_le_bytes()); + let child_a_finalized = child_a_data.finalize(); + + // ── Child "b" (Boolean, non-nullable) ──────────────────────────── + // Values: [true, false] → Msb0: bit7=1(true), bit6=0(false) → 0x80 + let mut child_b_data = Sha256::new(); + child_b_data.update([0x80_u8]); + let child_b_finalized = child_b_data.finalize(); + + // ── Parent data digest ─────────────────────────────────────────── + // Children sorted by name: "a" then "b" + // Each child is non-nullable, so finalized = SHA256(data).finalize() (32 bytes) + let mut parent_data = Sha256::new(); + // Child "a" finalized (non-nullable → just data digest) + parent_data.update(child_a_finalized); + // Child "b" finalized (non-nullable → just data digest) + parent_data.update(child_b_finalized); + let parent_data_finalized = parent_data.finalize(); + + // ── Final combination ──────────────────────────────────────────── + // Struct is non-nullable → NonNullable finalization + let mut final_digest = Sha256::new(); + final_digest.update(type_json.as_bytes()); + final_digest.update(parent_data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&struct_array), + expected, + "Example L: struct array hash_array mismatch" + ); + } + + // ══════════════════════════════════════════════════════════════════════ + // Example M: Nullable Struct Array via hash_array (struct-level nulls) + // StructArray [Some({a: 10, b: "x"}), None, Some({a: 30, b: "z"})] + // Struct is nullable. Children: a: Int32 non-null, b: LargeUtf8 non-null + // + // Struct-level nulls propagate to children: at row 1 (null struct), + // children's data is undefined and must be skipped. 
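+ // Each child is therefore hashed with the combined (struct AND child) validity, as computed below.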
+ // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_m_nullable_struct_array_hash_array() { + // Build a nullable struct array with a null at row 1 + let a = Int32Array::from(vec![10_i32, 0, 30]); // row 1 value is undefined (0 placeholder) + let b = LargeStringArray::from(vec!["x", "", "z"]); // row 1 value is undefined + let struct_array = StructArray::from(( + vec![ + ( + Arc::new(Field::new("a", DataType::Int32, false)), + Arc::new(a) as ArrayRef, + ), + ( + Arc::new(Field::new("b", DataType::LargeUtf8, false)), + Arc::new(b) as ArrayRef, + ), + ], + // Struct-level validity: [valid, null, valid] + // Buffer from NullBuffer: true=valid, false=null + NullBuffer::from(vec![true, false, true]) + .into_inner() + .into_inner(), + )); + + // ── Type metadata ──────────────────────────────────────────────── + let type_json = r#"{"Struct":[{"data_type":"Int32","name":"a","nullable":false},{"data_type":"LargeUtf8","name":"b","nullable":false}]}"#; + + // ── Struct-level validity (Lsb0, usize) ───────────────────────── + // [valid, null, valid] → bits [1, 0, 1] → 0b101 = 5 + let struct_bit_count: usize = 3; + let struct_validity_word: usize = 0b101; // 5 + + // ── Child "a" (Int32, effectively nullable due to struct nulls) ── + // Combined validity: struct AND child = [1, 0, 1] (child has no nulls of its own) + // Valid data: [10, 30] (row 1 skipped) + let child_a_bit_count: usize = 3; + let child_a_validity_word: usize = 0b101; + + let mut child_a_data = Sha256::new(); + child_a_data.update(10_i32.to_le_bytes()); + // row 1: skipped (null) + child_a_data.update(30_i32.to_le_bytes()); + let child_a_data_finalized = child_a_data.finalize(); + + // ── Child "b" (LargeUtf8, effectively nullable due to struct nulls) + let child_b_bit_count: usize = 3; + let child_b_validity_word: usize = 0b101; + + let mut child_b_data = Sha256::new(); + child_b_data.update(1_u64.to_le_bytes()); // "x" len + child_b_data.update(b"x"); + // row 1: skipped (null) + child_b_data.update(1_u64.to_le_bytes()); // "z" len + child_b_data.update(b"z"); + let child_b_data_finalized = child_b_data.finalize(); + + // ── Parent data digest ─────────────────────────────────────────── + // Children sorted by name: "a", "b" + // Each child is effectively nullable → finalized as: + // bit_count LE + validity_words BE + data_digest.finalize() + let mut parent_data = Sha256::new(); + // Child "a" finalized (nullable) + parent_data.update(child_a_bit_count.to_le_bytes()); + parent_data.update(child_a_validity_word.to_be_bytes()); + parent_data.update(child_a_data_finalized); + // Child "b" finalized (nullable) + parent_data.update(child_b_bit_count.to_le_bytes()); + parent_data.update(child_b_validity_word.to_be_bytes()); + parent_data.update(child_b_data_finalized); + let parent_data_finalized = parent_data.finalize(); + + // ── Final combination ──────────────────────────────────────────── + // Struct is nullable → parent finalization includes struct validity + let mut final_digest = Sha256::new(); + final_digest.update(type_json.as_bytes()); + // Struct-level nullable finalization + final_digest.update(struct_bit_count.to_le_bytes()); + final_digest.update(struct_validity_word.to_be_bytes()); + final_digest.update(parent_data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_array(&struct_array), + expected, + "Example M: nullable struct array hash_array mismatch" + ); + } + + // 
══════════════════════════════════════════════════════════════════════ + // Example N: List-of-Struct in a Record Batch + // Schema: {items: LargeList<Struct> nullable} + // Row 0: [{id: 1, label: "a"}, {id: 2, label: "b"}] (2 elements) + // Row 1: [{id: 3, label: "c"}] (1 element) + // + // The list column is decomposed into leaf fields: + // "items" in the BTreeMap (the list field itself, not its inner struct fields). + // But the list's sub-arrays ARE struct arrays, which are now hashed + // compositely via array_digest_update(Struct). + // ══════════════════════════════════════════════════════════════════════ + + #[test] + fn example_n_list_of_struct_record_batch() { + // ── Build the table ────────────────────────────────────────────── + let struct_fields = vec![ + Field::new("id", DataType::Int32, false), + Field::new("label", DataType::LargeUtf8, false), + ]; + let inner_struct_field = Field::new( + "item", + DataType::Struct(struct_fields.clone().into()), + false, + ); + let list_field = Field::new( + "items", + DataType::LargeList(Arc::new(inner_struct_field.clone())), + true, + ); + let schema = Schema::new(vec![list_field.clone()]); + + // Build struct sub-arrays + // Row 0: [{id:1, label:"a"}, {id:2, label:"b"}], Row 1: [{id:3, label:"c"}] + // Total struct rows: 3 (ids: [1,2,3], labels: ["a","b","c"]) + let ids = Int32Array::from(vec![1_i32, 2, 3]); + let labels = LargeStringArray::from(vec!["a", "b", "c"]); + let struct_array = StructArray::from(vec![ + ( + Arc::new(Field::new("id", DataType::Int32, false)), + Arc::new(ids) as ArrayRef, + ), + ( + Arc::new(Field::new("label", DataType::LargeUtf8, false)), + Arc::new(labels) as ArrayRef, + ), + ]); + + // Build large list array with offsets [0, 2, 3] + let list_array = LargeListArray::new( + Arc::new(inner_struct_field), + arrow::buffer::OffsetBuffer::new(vec![0_i64, 2, 3].into()), + Arc::new(struct_array) as ArrayRef, + None, // all list elements valid + ); + + let batch = RecordBatch::try_new( + Arc::new(schema.clone()), + vec![Arc::new(list_array) as ArrayRef], + ) + .unwrap(); + + // ── Step 1: Schema digest ──────────────────────────────────────── + // Canonical: element type has no name (element_type_to_value drops "item") + // The inner struct's data_type is {"Struct": [sorted children]} + let schema_json = r#"{"items":{"data_type":{"LargeList":{"data_type":{"Struct":[{"data_type":"Int32","name":"id","nullable":false},{"data_type":"LargeUtf8","name":"label","nullable":false}]},"nullable":false}},"nullable":true}}"#; + let schema_digest = Sha256::digest(schema_json.as_bytes()); + + assert_eq!( + ArrowDigester::hash_schema(&schema), + with_version(schema_digest.to_vec()), + "Example N: schema hash mismatch" + ); + + // ── Step 2: Field "items" (LargeList, nullable) ────────── + // + // With structural hashing, list sizes go to a separate structural digest, + // while leaf data (struct composites) goes to the data/leaf digest. + // + // The BitVec accumulates ALL null bits from the list AND its sub-arrays. 
+ // List-level: handle_null_bits(list) → [1, 1] (both list elements valid) + // Then for each list element, the struct sub-array also pushes its validity: + // Element 0 struct (2 rows, no nulls): → [1, 1] + // Element 1 struct (1 row, no nulls): → [1] + // Total BitVec: [1, 1, 1, 1, 1] → 5 bits, all valid + let items_bit_count: usize = 5; + let items_validity_word: usize = 0b11111; // 31 + + // ── Structural digest: element counts (sizes) ──────────────────── + let mut items_structural = Sha256::new(); + items_structural.update(2_u64.to_le_bytes()); // element 0 has 2 struct rows + items_structural.update(1_u64.to_le_bytes()); // element 1 has 1 struct row + let items_structural_finalized = items_structural.finalize(); + + // ── Data/leaf digest: struct composites (no size prefixes) ──────── + // + // --- List element 0: [{id:1,label:"a"}, {id:2,label:"b"}] (2 rows) --- + // Struct composite: children sorted by name: "id" then "label" + // No struct-level nulls, children are non-nullable + // + // Child "id" (Int32, non-null): values [1, 2] + let mut e0_child_id_data = Sha256::new(); + e0_child_id_data.update(1_i32.to_le_bytes()); + e0_child_id_data.update(2_i32.to_le_bytes()); + let e0_child_id_finalized = e0_child_id_data.finalize(); + + // Child "label" (LargeUtf8, non-null): values ["a", "b"] + let mut e0_child_label_data = Sha256::new(); + e0_child_label_data.update(1_u64.to_le_bytes()); // "a" len + e0_child_label_data.update(b"a"); + e0_child_label_data.update(1_u64.to_le_bytes()); // "b" len + e0_child_label_data.update(b"b"); + let e0_child_label_finalized = e0_child_label_data.finalize(); + + // --- List element 1: [{id:3,label:"c"}] (1 row) --- + // Child "id": values [3] + let mut e1_child_id_data = Sha256::new(); + e1_child_id_data.update(3_i32.to_le_bytes()); + let e1_child_id_finalized = e1_child_id_data.finalize(); + + // Child "label": values ["c"] + let mut e1_child_label_data = Sha256::new(); + e1_child_label_data.update(1_u64.to_le_bytes()); // "c" len + e1_child_label_data.update(b"c"); + let e1_child_label_finalized = e1_child_label_data.finalize(); + + // Build leaf digest: struct composites for each list element + let mut items_data = Sha256::new(); + // List element 0: struct children finalized into data (no size prefix here) + items_data.update(e0_child_id_finalized); // non-nullable child: 32 bytes + items_data.update(e0_child_label_finalized); // non-nullable child: 32 bytes + // List element 1: struct children finalized into data + items_data.update(e1_child_id_finalized); + items_data.update(e1_child_label_finalized); + let items_data_finalized = items_data.finalize(); + + // ── Step 3: Final combination ──────────────────────────────────── + // For list fields (nullable): bit_count + validity_words + structural_digest + data_digest + let mut final_digest = Sha256::new(); + final_digest.update(schema_digest); + // "items" (nullable, structured): null bits + structural + leaf + final_digest.update(items_bit_count.to_le_bytes()); + final_digest.update(items_validity_word.to_be_bytes()); + final_digest.update(items_structural_finalized); + final_digest.update(items_data_finalized); + + let expected = with_version(final_digest.finalize().to_vec()); + + assert_eq!( + ArrowDigester::hash_record_batch(&batch), + expected, + "Example N: list-of-struct record batch hash mismatch" + ); + } +} diff --git a/tests/golden_files/schema_serialization_pretty.json b/tests/golden_files/schema_serialization_pretty.json index 70cb27d..f2ec2db 100644 --- 
a/tests/golden_files/schema_serialization_pretty.json +++ b/tests/golden_files/schema_serialization_pretty.json @@ -1,6 +1,6 @@ { "binary_name": { - "data_type": "Binary", + "data_type": "LargeBinary", "nullable": true }, "bool_name": { @@ -45,19 +45,9 @@ "doubly_nested_struct_name": { "data_type": { "Struct": [ - { - "data_type": "Int32", - "name": "outer_field", - "nullable": false - }, { "data_type": { "Struct": [ - { - "data_type": "Utf8", - "name": "middle_field", - "nullable": true - }, { "data_type": { "Struct": [ @@ -75,11 +65,21 @@ }, "name": "inner", "nullable": false + }, + { + "data_type": "LargeUtf8", + "name": "middle_field", + "nullable": true } ] }, "name": "middle", "nullable": false + }, + { + "data_type": "Int32", + "name": "outer_field", + "nullable": false } ] }, @@ -117,7 +117,6 @@ "data_type": { "LargeList": { "data_type": "Int32", - "name": "item", "nullable": true } }, @@ -129,9 +128,8 @@ }, "list_name": { "data_type": { - "List": { + "LargeList": { "data_type": "Int32", - "name": "item", "nullable": true } }, @@ -146,7 +144,7 @@ "nullable": false }, { - "data_type": "Utf8", + "data_type": "LargeUtf8", "name": "struct_field2", "nullable": true } @@ -195,7 +193,7 @@ "nullable": false }, "utf8_name": { - "data_type": "Utf8", + "data_type": "LargeUtf8", "nullable": true } }