-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Enhance list casting, adding more cases for list views #9274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
e.g. cast ListView(Int32) to ListView(Float32)
e.g. Int32 to ListView(Int32) [1, 2, 3] -> [[1], [2], [3]]
e.g. FixedSizeList(Int32, 2) -> ListView(Float32)
e.g. List(Int32) -> ListView(Float32); previously we didn't account for inner types differing so it would cast to ListView(Int32) only
|
I would recommend reviewing commit by commit, as the commits more clearly show which changes are new features vs refactors, and each commit should be a manageable diff size to review. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed this commit-by-commit as suggested and it was much easier that way
Thank you @Jefffrey -- this looks really nice to me (though I have a few minor comments)
arrow-cast/src/cast/list.rs
Outdated
| let array = array.as_list_view::<O>(); | ||
|
|
||
| let nullable = cast_options.safe || array.null_count() != 0; | ||
| let mut nulls = nullable.then(|| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may be able to use NullBufferBuilder here simplify the null handling (it already has a Option<...> internally to avoid allocating if there are no nulls
I think you would have to add set_bit in NullBufferBuilder but that is probably useful anyways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me play around with that idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea I have is to use the take kernel to construct the values array instead of mutable data; I think this is more efficient based on some previous PRs I've seen do the same, though I'm stuck on what elements we fill in the null slots with, especially for weird cases such as when the values array is empty 🤔
- I guess can handle those cases separately
- We can always fill it with the first element in the array for example, though maybe that could be inefficient if its a nested type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NullBufferBuilder approach here: d28dc92
Introduces NullBufferBuilder::set_bit
Maybe using take can be a followup issue
arrow-cast/src/cast/list.rs
Outdated
| // Nulls in FixedSizeListArray take up space and so we must pad the values | ||
| if cast_options.safe || array.is_null(idx) { | ||
| mutable.extend_nulls(size as _); | ||
| nulls.as_mut().unwrap().set_bit(idx, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we know here nulls is non null? Maybe because above arrays.nulls would have set it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially yeah; I'll give the NullBufferBuilder idea a shot to see if can simplify this
| let list_view_offsets = list_view.offsets(); | ||
| let sizes = list_view.sizes(); | ||
|
|
||
| let mut take_indices: Vec<O::Native> = Vec::with_capacity(list_view.values().len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got codex to give me an esoteric bug here -- basically the with_capacity can be massive for list_view with RLE:
#[test]
fn test_cast_list_view_to_list_large_logical_len() {
// Use a RunArray to get a huge logical length without allocating
// large buffers. The list view references a single element at offset 0.
let run_ends = Int64Array::from(vec![i64::MAX - 1, i64::MAX]);
let values = Int32Array::from(vec![10, 99]);
let run_array = RunArray::<Int64Type>::try_new(&run_ends, &values).unwrap();
let item_field: FieldRef =
Arc::new(Field::new_list_field(run_array.data_type().clone(), true));
let list_view = LargeListViewArray::try_new(
Arc::clone(&item_field),
vec![0i64].into(),
vec![1i64].into(),
Arc::new(run_array),
None,
)
.unwrap();
let target_type = DataType::List(item_field);
let cast_result = cast(&list_view, &target_type).unwrap();
let got_list = cast_result.as_list::<i32>();
assert_eq!(got_list.len(), 1);
let got_values = got_list.values().as_run::<Int64Type>();
let typed = got_values.downcast::<Int32Array>().unwrap();
assert_eq!(typed.value(0), 10);
}Fails like this:
thread 'cast::tests::test_cast_list_view_to_list_large_logical_len' (15048336) panicked at arrow-cast/src/cast/list.rs:368:44:
capacity overflow
stack backtrace:
I don't think this is a huge issue personally -- I can file a follow on ticket if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do wonder how hard we should cater for weird cases like nesting REE within a List 🤔
That would likely cause issues with other code paths I'd imagine
Which issue does this PR close?
Rationale for this change
While we do have support for casting list views, some cases are missing. This PR goes through the list casting logic for all list types and tries to ensure we have comprehensive & consistent coverage for casting cases, also refactoring to try make this coverage more obvious at a glance.
What changes are included in this PR?
This PR has a mix of new features (new casts supported) and refactoring.
New features:
List/LargeListtoUtf8ViewUtf8/LargeUtf8, but view was missingListView/LargeListViewto string typesListViewtoLargeListView, but castingListView<Int32>toListView<Float32>was not supported; this adds that ability, which regularListarrays supportedListView/LargeListViewFixedSizeListtoListView/LargeListViewList/LargeListtoListView/LargeListViewwhen inner types differListtoLargeListView,LargeListtoListView,ListViewtoLargeList,LargeListViewtoListListView/LargeListViewtoFixedSizeListRefactors:
cast_with_optionsto organize the list casts together so it's easier to see we have coverage for casting between all list typesarrow-cast/src/cast/list_view.rsmerged intoarrow-cast/src/cast/list.rs) since it's easier to track them together, especially with the cross list <-> listview cast functionsAre these changes tested?
Yes, added new tests.
Are there any user-facing changes?
No.