perf: collect nested struct addresses once in field-major append [jvm shuffle / r2c] #3661

Draft
andygrove wants to merge 3 commits into apache:main from andygrove:perf/struct-collect-addrs-once

@andygrove
Member

Summary

  • In append_struct_fields_field_major, the first pass now collects nested struct addresses and sizes alongside the null bitmap
  • The per-field second pass uses these pre-collected addresses via point_to() instead of re-reading from parent row pointer arrays (read_row_at!) and calling get_struct() for every field of every row
  • Same optimization applied to the Binary, Utf8, Decimal128, nested Struct, and List/Map field cases
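
A minimal sketch of the two-pass shape described above, not the actual Comet code. Everything here is a hypothetical stand-in: `RowView`, `StructLoc`, `pack`, `decode_struct`, and `append_field_major` are illustrative names, an "address" is modeled as an offset into a shared byte buffer rather than a raw pointer, and the packed offset/length word (offset in the upper 32 bits, length in the lower 32) is an assumption about the layout.

```rust
/// Cached location of one nested struct (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct StructLoc {
    pub addr: usize,
    pub size: usize,
}

/// Pack an offset and a length into one 64-bit word (assumed layout).
pub fn pack(offset: usize, len: usize) -> u64 {
    ((offset as u64) << 32) | ((len as u64) & 0xFFFF_FFFF)
}

/// Stand-in for the per-row decode step: split the packed word back apart.
pub fn decode_struct(packed: u64) -> StructLoc {
    StructLoc {
        addr: (packed >> 32) as usize,
        size: (packed & 0xFFFF_FFFF) as usize,
    }
}

/// Stand-in for a row view that can be re-pointed cheaply.
pub struct RowView<'a> {
    buf: &'a [u8],
    addr: usize,
    size: usize,
}

impl<'a> RowView<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        RowView { buf, addr: 0, size: 0 }
    }

    /// Re-target the view; no decoding, just two assignments.
    pub fn point_to(&mut self, addr: usize, size: usize) {
        self.addr = addr;
        self.size = size;
    }

    /// Read the `field`-th little-endian i64 of the current struct.
    pub fn get_long(&self, field: usize) -> i64 {
        assert!(field * 8 + 8 <= self.size, "field out of bounds");
        let start = self.addr + field * 8;
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&self.buf[start..start + 8]);
        i64::from_le_bytes(bytes)
    }
}

/// Field-major append: one column per field, `None` marks a null parent slot.
pub fn append_field_major(
    buf: &[u8],
    parents: &[Option<u64>],
    num_fields: usize,
) -> Vec<Vec<Option<i64>>> {
    // Pass 1: decode each parent slot exactly once (N decodes).
    let locs: Vec<Option<StructLoc>> =
        parents.iter().map(|p| p.map(decode_struct)).collect();

    // Pass 2: for every field, walk the cached locations and only re-point.
    let mut view = RowView::new(buf);
    let mut columns = Vec::with_capacity(num_fields);
    for field in 0..num_fields {
        let mut col = Vec::with_capacity(locs.len());
        for loc in &locs {
            match loc {
                Some(l) => {
                    view.point_to(l.addr, l.size);
                    col.push(Some(view.get_long(field)));
                }
                None => col.push(None),
            }
        }
        columns.push(col);
    }
    columns
}
```

The point of the shape is that the inner loop of pass 2 touches only the `locs` vector and the view, so the decode cost no longer scales with the number of fields.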

Rationale

Previously, for a struct with F fields and N rows, the code performed N*F pointer dereferences into the parent row address/size arrays plus N*F get_struct() calls (each involving get_offset_and_len, which reads an i64 and does bit manipulation). After this change, parent row reads and get_struct() calls happen only N times in total, all in the first pass, and the second pass uses cheap point_to() calls with the cached addresses.
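
To make the N*F versus N count concrete, here is a small counting model. It is only an illustration under assumptions: `decode`, `count_decodes_refetch`, and `count_decodes_cached` are hypothetical names, and `decode` stands in for the combined parent-row read and get_struct() work.

```rust
/// Hypothetical decode step that bumps a counter each time it runs.
fn decode(packed: u64, counter: &mut usize) -> (usize, usize) {
    *counter += 1;
    ((packed >> 32) as usize, (packed & 0xFFFF_FFFF) as usize)
}

/// Old shape: every field pass decodes every row again.
pub fn count_decodes_refetch(parents: &[u64], num_fields: usize) -> usize {
    let mut decodes = 0;
    for _field in 0..num_fields {
        for &packed in parents {
            let _loc = decode(packed, &mut decodes);
        }
    }
    decodes
}

/// New shape: decode once per row up front, then reuse the cached pairs.
pub fn count_decodes_cached(parents: &[u64], num_fields: usize) -> usize {
    let mut decodes = 0;
    let locs: Vec<(usize, usize)> = parents
        .iter()
        .map(|&packed| decode(packed, &mut decodes))
        .collect();
    for _field in 0..num_fields {
        for &(addr, size) in &locs {
            // The real code would re-point a row view here; no decode needed.
            let _ = (addr, size);
        }
    }
    decodes
}
```

With 1,000 rows and 8 fields, the first function counts 8,000 decodes and the second counts 1,000. The model only counts decode calls and says nothing about wall-clock time.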

Test plan

  • cargo clippy --all-targets --workspace -- -D warnings passes
  • Existing struct row-to-columnar tests cover these code paths

In append_struct_fields_field_major, the first pass now collects nested
struct addresses and sizes alongside the null bitmap. The per-field
second pass uses these pre-collected addresses instead of re-reading
from the parent row pointer arrays and calling get_struct for every
field of every row.

For a struct with F fields and N rows, this reduces parent row pointer
dereferences from N*F to N, and get_struct calls from N*F to N.
@andygrove andygrove marked this pull request as draft March 11, 2026 00:29
@andygrove andygrove added the area:shuffle Shuffle (JVM and native) label Apr 9, 2026
@andygrove andygrove changed the title perf: collect nested struct addresses once in field-major append perf: collect nested struct addresses once in field-major append [jvm shuffle / r2c] Apr 9, 2026