Optimize arrow bytes view map #20055
base: main
Conversation
- Use values.views() instead of values.iter() for direct u128 access
- Use is_valid(i) for efficient null checking via validity bitmap
- Avoid dereferencing overhead for inline strings
- No additional memory overhead in Entry struct

Closes apache#19961
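A minimal sketch of the access pattern this commit describes, assuming the arrow crate's StringViewArray API (the map's actual fields and loop structure are omitted):

```rust
use arrow::array::{Array, StringViewArray};

fn main() {
    // "short" (≤ 12 bytes) is stored inline in the view word itself.
    let values = StringViewArray::from_iter_values(["short", "a much longer string value"]);

    // views() exposes the raw u128 view words directly, with no per-row
    // dereference into the data buffers.
    let views = values.views();
    for (i, &view) in views.iter().enumerate() {
        // is_valid(i) reads the validity bitmap instead of materializing a value.
        if !values.is_valid(i) {
            continue;
        }
        let len = view as u32; // low 32 bits of a view hold the length
        if len <= 12 {
            println!("row {i}: inline value, compare as u128");
        } else {
            println!("row {i}: buffer-backed value, len = {len}");
        }
    }
}
```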
Following review feedback, changed from usize to u32 for the builder index field. This saves 4 bytes per entry on 64-bit systems while still supporting up to 4 billion distinct values - more than enough for practical workloads.
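As an illustration only (the real Entry layout in binary_view_map.rs may differ), the saving comes from shrinking the index field:

```rust
// Hypothetical entry shape: a u32 index instead of usize saves 4 bytes of
// payload per entry on 64-bit targets, while 4 billion distinct values still fit.
struct Entry {
    view: u128, // raw view word for direct comparison (hypothetical field)
    hash: u64,  // cached hash (hypothetical field)
    index: u32, // index into the builder; was usize before this change
}

fn main() {
    // 16 + 8 + 4 = 28 bytes of fields, padded to 32 by alignment.
    println!("size_of::<Entry>() = {}", std::mem::size_of::<Entry>());
}
```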
Per reviewer feedback, replaced Vec<bool> with BooleanBufferBuilder for tracking null values. This uses bit-packed storage (1 bit per value) instead of byte-per-value, reducing memory usage by 8x for the null bitmap. Also fixed clippy warnings for mem_replace_with_default.
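A short sketch of the replacement, assuming arrow's BooleanBufferBuilder (the surrounding map struct is omitted):

```rust
use arrow::array::builder::BooleanBufferBuilder;
use arrow::buffer::NullBuffer;

fn main() {
    // Bit-packed: 1 bit per value instead of the 1 byte per value of Vec<bool>.
    let mut validity = BooleanBufferBuilder::new(3);
    validity.append(true);  // valid row
    validity.append(false); // null row
    validity.append(true);  // valid row

    // Appending validity (true = valid) directly also avoids the inversion
    // pass that a Vec<bool> of is_null flags would need later.
    let nulls = NullBuffer::new(validity.finish());
    println!("null_count = {}", nulls.null_count());
}
```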
…etch This change optimizes the interning process in ArrowBytesViewMap by:
1. Performing direct u128 view comparisons for inlined values.
2. Fetching the input value only once per row to avoid redundant work during hash collisions and insertion.
3. Addressing feedback from @Dandandan regarding redundant value fetches.
…-fetch optimization
Pull request overview
This PR optimizes the ArrowBytesViewMap by replacing the GenericByteViewBuilder with manual buffer management, enabling direct u128 view comparisons for inline strings and reducing redundant memory access during hash collisions.
Changes:
- Replaced GenericByteViewBuilder with manual buffer management (views, in_progress, completed, nulls vectors)
- Implemented direct u128 bitwise equality checks for inline strings (≤12 bytes) as a fast path
- Refactored comparison logic to use view-based comparisons before falling back to byte-slice comparison (sketched below)
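To make the fast path concrete, here is a standalone sketch of the comparison order described above; views_equal and the compare_bytes fallback are hypothetical names, and the view layout (length in the low 32 bits, 4-byte prefix above it, data stored inline for ≤ 12 bytes) follows the Arrow ByteView spec:

```rust
/// Compare two view words, touching the data buffers only when unavoidable.
fn views_equal(a: u128, b: u128, compare_bytes: impl FnOnce() -> bool) -> bool {
    let len = a as u32;
    if len != b as u32 {
        return false; // different lengths can never be equal
    }
    if len <= 12 {
        // Inline value: the whole string is packed into the view word,
        // so equality is a single u128 comparison.
        return a == b;
    }
    // Long value: bits 32..64 hold a 4-byte prefix; compare it first.
    if (a >> 32) as u32 != (b >> 32) as u32 {
        return false;
    }
    // Prefixes match: fall back to a full byte-slice comparison.
    compare_bytes()
}

fn main() {
    // Inline views for "a" and "b": length 1, first byte stored at bits 32..40.
    let a = (b'a' as u128) << 32 | 1;
    let b = (b'b' as u128) << 32 | 1;
    assert!(!views_equal(a, b, || unreachable!()));
    assert!(views_equal(a, a, || unreachable!()));
    println!("fast path decided both comparisons without touching buffers");
}
```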
Comments suppressed due to low confidence (5)
datafusion/physical-expr-common/src/binary_view_map.rs:447
- For accurate memory accounting, the nulls size calculation should use self.nulls.capacity() * size_of::<bool>() to match the pattern used for in_progress_size at line 445 and to account for actual allocated capacity rather than just used length. Similarly, views_size at line 444 should consider using capacity for consistency, though this depends on whether the intent is to track allocated or used memory.
let views_size = self.views.len() * size_of::<u128>();
let in_progress_size = self.in_progress.capacity();
let completed_size: usize = self.completed.iter().map(|b| b.len()).sum();
let nulls_size = self.nulls.len();
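A sketch of the suggested accounting, with the completed buffers modeled as plain Vec<u8> for illustration (field names follow the snippet above):

```rust
use std::mem::size_of;

// Hypothetical standalone version of size(), using allocated capacity
// consistently, per the suggestion above.
fn size(
    views: &Vec<u128>,
    in_progress: &Vec<u8>,
    completed: &[Vec<u8>], // stand-in for the completed arrow Buffers
    nulls: &Vec<bool>,
) -> usize {
    let views_size = views.capacity() * size_of::<u128>();
    let in_progress_size = in_progress.capacity();
    let completed_size: usize = completed.iter().map(|b| b.len()).sum();
    let nulls_size = nulls.capacity() * size_of::<bool>();
    views_size + in_progress_size + completed_size + nulls_size
}

fn main() {
    let views: Vec<u128> = Vec::with_capacity(8);
    println!("{}", size(&views, &Vec::new(), &[], &Vec::new()));
}
```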
datafusion/physical-expr-common/src/binary_view_map.rs:117
- The constant BYTE_VIEW_MAX_BLOCK_SIZE is defined in an unusual location between the documentation comment and the struct definition. For better code organization and readability, this constant should be moved to the module level (e.g., after the imports around line 30) or before the struct's documentation comment. This follows Rust conventions where constants are typically defined at the module level or clearly separated from type definitions.
/// Max size of the in-progress buffer before flushing to completed buffers
const BYTE_VIEW_MAX_BLOCK_SIZE: usize = 2 * 1024 * 1024;
datafusion/physical-expr-common/src/binary_view_map.rs:386
- The unsafe call to BinaryViewArray::new_unchecked lacks a SAFETY comment explaining why it's safe. Consider adding a comment explaining that: (1) all views are correctly constructed via make_view to point to valid data in the buffers, (2) buffer indices and offsets are valid by construction through append_value, and (3) the null buffer has the correct length matching the views. This is separate from the SAFETY comment for to_string_view_unchecked at line 391.
let array =
unsafe { BinaryViewArray::new_unchecked(views, self.completed, null_buffer) };
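The suggested comment might read something like this (documentation only; the call itself is unchanged from the snippet above, so this is a fragment, not a complete function):

```rust
// SAFETY: all views were constructed via make_view over data actually
// written to the buffers, buffer indices and offsets are in bounds by
// construction in append_value, and the null buffer (when present) has
// exactly one bit per view.
let array =
    unsafe { BinaryViewArray::new_unchecked(views, self.completed, null_buffer) };
```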
datafusion/physical-expr-common/src/binary_view_map.rs:415
- The cast to u32 at line 414 could theoretically overflow if self.completed.len() exceeds u32::MAX. While this is extremely unlikely in practice (it would require billions of 2MB buffers), consider adding a debug assertion like debug_assert!(self.completed.len() <= u32::MAX as usize, "Too many completed buffers") to catch this during development. The same applies to the offset cast at line 415.
let buffer_index = self.completed.len() as u32;
let offset = self.in_progress.len() as u32;
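In context, the suggested guards might look like this (a fragment continuing the snippet above, not a complete function):

```rust
// Hypothetical debug-only guards per the suggestion: u32 casts silently
// truncate on overflow in release builds, so fail loudly in development.
debug_assert!(
    self.completed.len() <= u32::MAX as usize,
    "Too many completed buffers"
);
debug_assert!(
    self.in_progress.len() <= u32::MAX as usize,
    "In-progress buffer offset overflows u32"
);
let buffer_index = self.completed.len() as u32;
let offset = self.in_progress.len() as u32;
```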
datafusion/physical-expr-common/src/binary_view_map.rs:382
- The null buffer construction iterates over self.nulls twice: once to check if any nulls exist, and once to invert the boolean values. Consider optimizing by inverting and collecting in a single pass, then checking if all values are true to decide whether to create a NullBuffer. However, this is a minor performance consideration and may not be worth the code complexity.
let null_buffer = if self.nulls.iter().any(|&is_null| is_null) {
Some(NullBuffer::from(
self.nulls
.iter()
.map(|&is_null| !is_null)
.collect::<Vec<_>>(),
))
} else {
None
};
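A standalone sketch of the single-pass alternative, using a plain Vec<bool> of is_null flags as in the snippet above:

```rust
use arrow::buffer::NullBuffer;

fn main() {
    let nulls = vec![false, true, false]; // is_null flags

    // One pass: invert into a validity mask while remembering whether any
    // null was seen, instead of scanning once for any() and once for map().
    let mut any_null = false;
    let validity: Vec<bool> = nulls
        .iter()
        .map(|&is_null| {
            any_null |= is_null;
            !is_null
        })
        .collect();
    let null_buffer = any_null.then(|| NullBuffer::from(validity));

    println!("null_count = {:?}", null_buffer.map(|n| n.null_count()));
}
```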
} else {
    &in_progress[offset..offset + stored_len]
Copilot AI (Jan 28, 2026)
The condition checking which buffer to read from should be more defensive. If buffer_index > completed.len(), this would incorrectly try to access in_progress with potentially invalid offsets. While this shouldn't happen with the current append logic, consider adding an assertion or explicit error handling to catch any future bugs. For example: else if buffer_index == completed.len() followed by else { panic!("Invalid buffer_index") }.
Suggested change:
- } else {
-     &in_progress[offset..offset + stored_len]
+ } else if buffer_index == completed.len() {
+     &in_progress[offset..offset + stored_len]
+ } else {
+     panic!("Invalid buffer_index: {}", buffer_index);
run benchmarks

Run benchmarks

🤖: Benchmark completed (Details)

🤖: Benchmark completed (Details)

run benchmarks
let len = view_u128 as u32;

// Check if value already exists
let mut value: Option<&[u8]> = None;
This seems not right?
Good catch — you’re right to flag this.
The intent here was to defer materializing the byte slice until we know it’s needed, but the current structure makes that unclear and introduces unnecessary indirection.
I’ll refactor this so the slice is explicitly materialized once (at the top of the loop) and then reused consistently for both lookup and insertion paths. That should make the logic clearer and align better with the original optimization goal.
} else {
    // no existing value, make a new one
-   let value: &[u8] = values.value(i).as_ref();
+   let value = value.unwrap_or_else(|| values.value(i).as_ref());
This still accesses it here?
Yes, you’re absolutely right — this still ends up calling values.value(i) in the insert path.
That contradicts the goal of fetching the value once per row. I’ll rework this to eagerly bind the slice to a local variable and thread that through both the lookup and insert branches so values.value(i) is only accessed a single time.
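A minimal standalone sketch of that restructuring, with hypothetical names standing in for the real map internals:

```rust
use std::collections::HashMap;

// Fetch each row's bytes exactly once, then share the binding between the
// probe and insert paths (hypothetical shape of the planned refactor).
fn intern<'a>(rows: &[&'a [u8]]) -> Vec<u32> {
    let mut map: HashMap<&[u8], u32> = HashMap::new();
    let mut groups = Vec::with_capacity(rows.len());
    for &value in rows {
        // Single eager binding per row: no second values.value(i) call in
        // either the lookup or the insert branch.
        let next = map.len() as u32;
        let group = *map.entry(value).or_insert(next);
        groups.push(group);
    }
    groups
}

fn main() {
    let rows: Vec<&[u8]> = vec![b"a".as_slice(), b"bb".as_slice(), b"a".as_slice()];
    assert_eq!(intern(&rows), vec![0, 1, 0]);
    println!("ok");
}
```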
🤖: Benchmark completed (Details)
Rationale for this change
The ArrowBytesViewMap is a high-performance structure used for interning string and binary views in operators like GROUP BY and COUNT DISTINCT. This PR addresses performance bottlenecks identified during maintainer review:
- Redundant value fetches: the input value (values.value(i)) was fetched and converted to a byte slice multiple times during hash collisions and insertion. This added unnecessary overhead to the interning loop.
- Missed fast path: inlined values can be compared directly via u128 bitwise equality.
What changes are included in this PR?
- u128 view comparison: updated the interning logic to perform bitwise equality checks on the u128 views for inlined values, providing a single-cycle fast path for inlined strings (≤12 bytes).
- Generic over GenericByteViewArray<B> to provide specialized access for StringView/BinaryView types (see the sketch below).
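A small sketch of what being generic over GenericByteViewArray<B> enables, assuming arrow's ByteViewType trait; count_inline is a hypothetical helper, not part of this PR:

```rust
use arrow::array::{Array, GenericByteViewArray, StringViewArray};
use arrow::datatypes::ByteViewType;

// Works for StringViewArray and BinaryViewArray alike, since both are
// GenericByteViewArray<B> for some B: ByteViewType.
fn count_inline<B: ByteViewType>(values: &GenericByteViewArray<B>) -> usize {
    values
        .views()
        .iter()
        .enumerate()
        .filter(|&(i, &view)| values.is_valid(i) && (view as u32) <= 12)
        .count()
}

fn main() {
    let strings = StringViewArray::from_iter_values(["hi", "a very long string indeed!"]);
    println!("inline values: {}", count_inline(&strings)); // prints 1
}
```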
Are these changes tested?
Yes. I have verified the changes using the existing unit test suite:
cargo test -p datafusion-physical-expr-common --lib binary_view_map::tests

Are there any user-facing changes?
No. This is an internal performance optimization.