Skip to content

Conversation

@Jefffrey
Copy link
Contributor

@Jefffrey Jefffrey commented Jan 27, 2026

Which issue does this PR close?

Rationale for this change

While we do have support for casting list views, some cases are missing. This PR goes through the list casting logic for all list types and tries to ensure we have comprehensive & consistent coverage for casting cases, also refactoring to try make this coverage more obvious at a glance.

What changes are included in this PR?

This PR has a mix of new features (new casts supported) and refactoring.

New features:

  • Cast from List/LargeList to Utf8View
    • We supported casting to Utf8/LargeUtf8, but view was missing
  • Cast from ListView/LargeListView to string types
  • Cast between list view types when inner types differ
    • Previously we could cast ListView to LargeListView, but casting ListView<Int32> to ListView<Float32> was not supported; this adds that ability, which regular List arrays supported
  • Cast non-list types to ListView/LargeListView
  • Cast FixedSizeList to ListView/LargeListView
  • Cast List/LargeList to ListView/LargeListView when inner types differ
    • This again was something we supported, except we never accounted for the inner types differing; so fix to ensure we cast inner types
  • Cast List to LargeListView, LargeList to ListView, ListView to LargeList, LargeListView to List
    • These cross-offset casts were missing
  • Cast ListView/LargeListView to FixedSizeList

Refactors:

  • Consolidate arms in cast_with_options to organize the list casts together so it's easier to see we have coverage for casting between all list types
  • Condense testing code, trying to use existing downcast methods and array builders to reduce verbosity
  • Consolidate list view cast helper functions into the list cast helper functions file (arrow-cast/src/cast/list_view.rs merged into arrow-cast/src/cast/list.rs) since it's easier to track them together, especially with the cross list <-> listview cast functions

Are these changes tested?

Yes, added new tests.

Are there any user-facing changes?

No.

e.g. cast ListView(Int32) to ListView(Float32)
e.g. Int32 to ListView(Int32)

[1, 2, 3] -> [[1], [2], [3]]
e.g. FixedSizeList(Int32, 2) -> ListView(Float32)
e.g. List(Int32) -> ListView(Float32); previously we didn't account for
inner types differing so it would cast to ListView(Int32) only
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 27, 2026
@Jefffrey Jefffrey marked this pull request as ready for review January 27, 2026 05:22
@Jefffrey
Copy link
Contributor Author

I would recommend reviewing commit by commit, as the commits more clearly show which changes are new features vs refactors, and each commit should be a manageable diff size to review.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this commit-by-commit as suggested and it was much easier that way

Thank you @Jefffrey -- this looks really nice to me (though I have a few minor comments)

let array = array.as_list_view::<O>();

let nullable = cast_options.safe || array.null_count() != 0;
let mut nulls = nullable.then(|| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may be able to use NullBufferBuilder here simplify the null handling (it already has a Option<...> internally to avoid allocating if there are no nulls

I think you would have to add set_bit in NullBufferBuilder but that is probably useful anyways

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me play around with that idea

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea I have is to use the take kernel to construct the values array instead of mutable data; I think this is more efficient based on some previous PRs I've seen do the same, though I'm stuck on what elements we fill in the null slots with, especially for weird cases such as when the values array is empty 🤔

  • I guess can handle those cases separately
  • We can always fill it with the first element in the array for example, though maybe that could be inefficient if its a nested type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullBufferBuilder approach here: d28dc92

Introduces NullBufferBuilder::set_bit

Maybe using take can be a followup issue

// Nulls in FixedSizeListArray take up space and so we must pad the values
if cast_options.safe || array.is_null(idx) {
mutable.extend_nulls(size as _);
nulls.as_mut().unwrap().set_bit(idx, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we know here nulls is non null? Maybe because above arrays.nulls would have set it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially yeah; I'll give the NullBufferBuilder idea a shot to see if can simplify this

let list_view_offsets = list_view.offsets();
let sizes = list_view.sizes();

let mut take_indices: Vec<O::Native> = Vec::with_capacity(list_view.values().len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got codex to give me an esoteric bug here -- basically the with_capacity can be massive for list_view with RLE:

    #[test]
    fn test_cast_list_view_to_list_large_logical_len() {
        // Use a RunArray to get a huge logical length without allocating
        // large buffers. The list view references a single element at offset 0.
        let run_ends = Int64Array::from(vec![i64::MAX - 1, i64::MAX]);
        let values = Int32Array::from(vec![10, 99]);
        let run_array = RunArray::<Int64Type>::try_new(&run_ends, &values).unwrap();

        let item_field: FieldRef =
            Arc::new(Field::new_list_field(run_array.data_type().clone(), true));
        let list_view = LargeListViewArray::try_new(
            Arc::clone(&item_field),
            vec![0i64].into(),
            vec![1i64].into(),
            Arc::new(run_array),
            None,
        )
        .unwrap();

        let target_type = DataType::List(item_field);
        let cast_result = cast(&list_view, &target_type).unwrap();
        let got_list = cast_result.as_list::<i32>();
        assert_eq!(got_list.len(), 1);
        let got_values = got_list.values().as_run::<Int64Type>();
        let typed = got_values.downcast::<Int32Array>().unwrap();
        assert_eq!(typed.value(0), 10);
    }

Fails like this:

thread 'cast::tests::test_cast_list_view_to_list_large_logical_len' (15048336) panicked at arrow-cast/src/cast/list.rs:368:44:
capacity overflow
stack backtrace:

I don't think this is a huge issue personally -- I can file a follow on ticket if needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder how hard we should cater for weird cases like nesting REE within a List 🤔

That would likely cause issues with other code paths I'd imagine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve casting for listview types

2 participants