Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 8 additions & 6 deletions datafusion/common/src/utils/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -31,12 +31,12 @@ use arrow::array::{
cast::AsArray,
};
use arrow::array::{
ArrowPrimitiveType, Datum, GenericListArray, Int32Array, Int64Array,
ArrowPrimitiveType, BooleanArray, Datum, GenericListArray, Int32Array, Int64Array,
MutableArrayData, PrimitiveArray, make_array,
};
use arrow::array::{LargeListViewArray, ListViewArray};
use arrow::buffer::{OffsetBuffer, ScalarBuffer};
use arrow::compute::kernels::cmp::neq;
use arrow::compute::kernels::cmp::eq;
use arrow::compute::kernels::length::length;
use arrow::compute::{SortColumn, SortOptions, partition};
use arrow::datatypes::{
Expand Down Expand Up @@ -1129,6 +1129,7 @@ pub fn remove_list_null_values(array: &ArrayRef) -> Result<ArrayRef> {
}
}

/// Create a new list array where all the nulls point to empty lists
fn truncate_list_nulls<O: OffsetSizeTrait>(
list: &GenericListArray<O>,
) -> Result<GenericListArray<O>> {
Expand All @@ -1142,17 +1143,18 @@ fn truncate_list_nulls<O: OffsetSizeTrait>(
&Int64Array::new_scalar(0)
};

let not_empty = neq(&lengths, zero)?;
let null_and_non_empty = &!nulls.inner() & not_empty.values();
let (mut valid_or_empty, _nulls) = eq(&lengths, zero)?.into_parts();
valid_or_empty |= nulls.inner();
let valid_or_empty = BooleanArray::from(valid_or_empty);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking can we optimize even more and avoid allocating BooleanArray here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BooleanArray::has_false used below I believe has no performatic BooleanBuffer equivalent: .iter().any(|v| !v) checks bit for bit (as a future improvement, the non-null branch could be moved to BooleanBuffer). But that BooleanArray::from BooleanBuffer should be cheap anyway.

Copy link
Copy Markdown
Member Author

@rluvaton rluvaton May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO This is really premature optimization, BooleanArray allocation is super cheap

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the proposal to move the code from BooleanArray::has_false to BooleanBuffer::has_false upstream? That seems reasonable for what it is worth

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like that too, need to create a pr for that


if null_and_non_empty.count_set_bits() > 0 {
if valid_or_empty.has_false() {
let array_data = list.values().to_data();
let offsets = list.offsets();
let capacity = offsets[offsets.len() - 1] - offsets[0];
let mut mutable_array_data =
MutableArrayData::new(vec![&array_data], false, capacity.as_usize());

let valid_or_empty = nulls.inner() | &!not_empty.values();
let (valid_or_empty, _nulls) = valid_or_empty.into_parts();

for (start, end) in valid_or_empty.set_slices() {
mutable_array_data.extend(
Expand Down
Loading