
Sync fork#13

Merged
amlynczak-rel merged 722 commits into
relativityone:mainfrom
apache:main
Jan 14, 2026

Conversation

@amlynczak-rel


Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Closes #NNN.

Rationale for this change

Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.

What changes are included in this PR?

There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

rluvaton and others added 30 commits October 17, 2025 12:18
# Which issue does this PR close?

N/A

# Rationale for this change

not testing the correct length

# What changes are included in this PR?

remove * 8 as the length of the buffer is in bytes already
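The exact check touched by this commit is not shown here, so the following is only a minimal sketch of the general idea: a buffer length that is already in bytes is compared against a bit count rounded up to whole bytes. The function names `bytes_for_bits` and `buffer_fits_bits` are hypothetical and not part of the library.

```rust
/// Number of bytes needed to hold `bit_len` bits, rounding up (ceil).
fn bytes_for_bits(bit_len: usize) -> usize {
    bit_len.div_ceil(8)
}

/// Returns true if a buffer of `buffer_len_bytes` bytes can hold `bit_len` bits.
/// `buffer_len_bytes` is already a byte count, so it is not multiplied by 8 again.
fn buffer_fits_bits(buffer_len_bytes: usize, bit_len: usize) -> bool {
    buffer_len_bytes >= bytes_for_bits(bit_len)
}
```

Rounding up matters for lengths that are not a multiple of 8: nine bits need two bytes, not one.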

# Are these changes tested?
Added tests that fail before this change, plus tests that make sure
rounding up (ceil) keeps being used in future changes

# Are there any user-facing changes?

Nope
…tArray` (#8627)

# Which issue does this PR close?

- Closes #8610

# Rationale for this change

Since the fields of `VariantArray` impl `PartialEq`, this PR simply
derives `PartialEq` for `VariantArray`.

Based off of #8625
…trings for error messages (#8636)

# Which issue does this PR close?

This is a small performance improvement for the thrift remodeling

- Part of #5853.

# Rationale for this change

Some of the often-called methods in the thrift protocol implementation
created `ParquetError` instances with a string message that had to be
allocated and formatted. This formatting code and probably also some
drop glue bloats these otherwise small methods and prevented inlining.

# What changes are included in this PR?

Introduce a separate error type `ThriftProtocolError` that is smaller
than `ParquetError` and does not contain any allocated data. The
`ReadThrift` trait is not changed, since its custom implementations
actually require the more expressive `ParquetError`.
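The pattern can be sketched roughly as follows. This is not the crate's actual code: `SmallProtocolError`, `RichError` and `read_flag` are hypothetical stand-ins, and the flag encoding is illustrative rather than Thrift's. The point is that the hot path returns a small `Copy` enum and the message is only formatted when converting to the richer error.

```rust
use std::fmt;

/// Hypothetical allocation-free protocol error: a plain `Copy` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SmallProtocolError {
    Eof,
    InvalidFlag(u8),
}

/// Hypothetical "rich" error carrying a heap-allocated message.
#[derive(Debug)]
enum RichError {
    General(String),
}

impl fmt::Display for SmallProtocolError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Eof => write!(f, "unexpected end of input"),
            Self::InvalidFlag(b) => write!(f, "invalid flag byte {b}"),
        }
    }
}

impl From<SmallProtocolError> for RichError {
    fn from(e: SmallProtocolError) -> Self {
        // Formatting and allocation happen only on this cold conversion path.
        RichError::General(e.to_string())
    }
}

/// Hot-path helper: returns the small error without formatting or allocating.
fn read_flag(byte: u8) -> Result<bool, SmallProtocolError> {
    match byte {
        0 => Ok(false),
        1 => Ok(true),
        other => Err(SmallProtocolError::InvalidFlag(other)),
    }
}

/// Sizes of the two error types, for comparison.
fn error_sizes() -> (usize, usize) {
    (
        std::mem::size_of::<SmallProtocolError>(),
        std::mem::size_of::<RichError>(),
    )
}
```

Because `From` is implemented, callers that need the richer error can still use `?` on results carrying the small one.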

# Are these changes tested?

The success path is covered by existing tests. Testing the error paths
would require crafting some actually malformed files, or using a fuzzer.

# Are there any user-facing changes?

The `ThriftProtocolError` is crate-internal so there should be no api
changes. Some error messages might differ slightly.
# Which issue does this PR close?

N/A

# Rationale for this change

I have a PR to improve zip perf for scalar but I don't see any
benchmarks for it:
- #8653 

# What changes are included in this PR?

created zip benchmarks for scalar and non scalar with different masks 

# Are these changes tested?
N/A

# Are there any user-facing changes?

Nope
# Which issue does this PR close?

- Partial fix for apache/datafusion#17857

# Rationale for this change

These changes add `try_append_value`, a safer version of `append_value`
in `ByteViewBuilder` that returns an error instead of panicking.
DataFusion will consume the API and handle the `Result` returned by the
function.
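The relationship between a panicking method and its fallible counterpart can be sketched like this. `TinyBytesBuilder` and its `max_bytes` limit are hypothetical and exist only to make the error path easy to trigger; the real builder's failure conditions and error type are not shown in this commit message.

```rust
/// Hypothetical builder: all values share one byte buffer, addressed by offsets.
struct TinyBytesBuilder {
    data: Vec<u8>,
    offsets: Vec<usize>,
    max_bytes: usize, // illustrative capacity limit
}

impl TinyBytesBuilder {
    fn with_limit(max_bytes: usize) -> Self {
        Self { data: Vec::new(), offsets: vec![0], max_bytes }
    }

    /// Fallible append: reports a failure as `Err` and leaves the builder unchanged.
    fn try_append_value(&mut self, value: &[u8]) -> Result<(), String> {
        let new_len = self
            .data
            .len()
            .checked_add(value.len())
            .filter(|&n| n <= self.max_bytes)
            .ok_or_else(|| {
                format!(
                    "appending {} bytes exceeds the {}-byte limit",
                    value.len(),
                    self.max_bytes
                )
            })?;
        self.data.extend_from_slice(value);
        self.offsets.push(new_len);
        Ok(())
    }

    /// Panicking wrapper: same behaviour as before for existing callers.
    fn append_value(&mut self, value: &[u8]) {
        self.try_append_value(value).expect("append_value failed");
    }

    /// Number of values appended so far.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}
```

Keeping `append_value` as a thin wrapper over `try_append_value` is one way to add the fallible API without changing existing callers.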

# What changes are included in this PR?

# Are these changes tested?

The method is already covered by existing tests.

# Are there any user-facing changes?

No breaking changes, as the original `append_value` method hasn't
changed.

---------

Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com>
Co-authored-by: Matthew Kim <38759997+friendlymatthew@users.noreply.github.com>
Co-authored-by: Jörn Horstmann <git@jhorstmann.net>
…8642)

# Which issue does this PR close?

- Closes #8639

# Rationale for this change

add write_batch_size config and change compression to use parquet
Compression

# What changes are included in this PR?

add write_batch_size config and change compression to use parquet
Compression

# Are these changes tested?

I've tried these commands myself.

# Are there any user-facing changes?

Yeah

1. zstd level previously is default 1, not change to 3
2. str zStd might not pass
# Which issue does this PR close?

- Part of #7835 

# Rationale for this change

Let's get the code to the people

# What changes are included in this PR?

Update version number and CHANGELOG. See rendered version here:
https://github.com/alamb/arrow-rs/blob/alamb/prepare_57.0.0/CHANGELOG.md

# Are these changes tested?

N/A

# Are there any user-facing changes?

No
# Which issue does this PR close?
- Closes #8218

# Rationale for this change

# What changes are included in this PR?
Split `builder.rs` into three files: `builder/list`,
`builder/object` and `builder/metadata`

# Are these changes tested?
Yes
# Are there any user-facing changes?
No
# Which issue does this PR close?

Related to: 
- #7456
- #8565

# Rationale for this change

Improve the performance of `ParquetRecordBatchReader`, especially when
the `RowSelector`s are short.
- By changing a hash map to an enum array

# What changes are included in this PR?
For `parquet/src/arrow/array_reader/cached_array_reader.rs`, update the
hash function

# Are these changes tested?
The hashmaps are already covered by existing tests.
Also tested by manually reading Parquet files.

# Are there any user-facing changes?
No

# Performance results in arrow_reader_row_filter.rs
on my 3950X
Benchmark | Change | Verdict
-- | -- | --
int64 == 9999 / all_columns / async | 🟢 -1.61% | Improved
int64 == 9999 / all_columns / sync | 🔴 +1.56% | Regressed
int64 == 9999 / exclude_filter_column / async | 🟢 -1.11% | Improved
int64 == 9999 / exclude_filter_column / sync | ⚪ -0.97% | Within noise
float64 > 99.0 / all_columns / async | 🟢 -6.25% | Improved
float64 > 99.0 / all_columns / sync | 🟢 -11.24% | Improved
float64 > 99.0 / exclude_filter_column / async | 🟢 -11.10% | Improved
float64 > 99.0 / exclude_filter_column / sync | 🟢 -3.31% | Improved
ts ≥ 9000 / all_columns / async | 🔴 +2.77% | Regressed
ts ≥ 9000 / all_columns / sync | ⚪ -0.06% | Within noise
ts ≥ 9000 / exclude_filter_column / async | 🟢 -2.54% | Improved
ts ≥ 9000 / exclude_filter_column / sync | ⚪ +0.28% | Within noise
int64 > 90 / all_columns / async | 🟢 -14.68% | Improved
int64 > 90 / all_columns / sync | 🟢 -21.00% | Improved
int64 > 90 / exclude_filter_column / async | 🟢 -17.66% | Improved
int64 > 90 / exclude_filter_column / sync | 🟢 -14.53% | Improved
float64 ≤ 99.0 / all_columns / async | 🟢 -9.20% | Improved
float64 ≤ 99.0 / all_columns / sync | 🟢 -11.07% | Improved
float64 ≤ 99.0 / exclude_filter_column / async | 🟢 -10.01% | Improved
float64 ≤ 99.0 / exclude_filter_column / sync | 🟢 -11.80% | Improved
ts < 9000 / all_columns / async | 🟢 -3.43% | Improved
ts < 9000 / all_columns / sync | 🟢 -6.23% | Improved
ts < 9000 / exclude_filter_column / async | 🟢 -4.00% | Improved
ts < 9000 / exclude_filter_column / sync | 🟢 -3.91% | Improved
utf8View <> '' / all_columns / async | 🟢 -16.56% | Improved
utf8View <> '' / all_columns / sync | 🟢 -12.10% | Improved
utf8View <> '' / exclude_filter_column / async | 🟢 -13.00% | Improved
utf8View <> '' / exclude_filter_column / sync | 🟢 -17.29% | Improved
float64 > 99.0 AND ts ≥ 9000 / all_columns / async | 🔴 +3.51% | Regressed
float64 > 99.0 AND ts ≥ 9000 / all_columns / sync | 🟢 -2.19% | Improved
float64 > 99.0 AND ts ≥ 9000 / exclude_filter_column / async | 🟢 -2.63% | Improved
float64 > 99.0 AND ts ≥ 9000 / exclude_filter_column / sync | 🟢 -2.72% | Improved
# Which issue does this PR close?

N/A

# Rationale for this change

Doing `OffsetBuffer::from_lengths(std::iter::repeat_n(size,
value.len()));` does not utilize SIMD (I can explain further if you want).
See [GodBolt Link](https://godbolt.org/z/PTsfvfjqx)

Extracted from:
- #8653

After this and the PR below are merged, I will improve the DataFusion
scalar-to-array conversion to use them and make it really, really fast:
- #8658

# What changes are included in this PR?

added new function

# Are these changes tested?

yes

# Are there any user-facing changes?

yes
…<repeat>));` with `OffsetBuffer::from_repeated_length(<val>, <repeat>);` (#8669)

# Which issue does this PR close?

N/A

# Rationale for this change

Use the dedicated, faster function for creating offsets from a repeated
length

# What changes are included in this PR?

replace
```rust
OffsetBuffer::from_lengths(std::iter::repeat_n(<val>, <repeat>));
```

with
```rust
OffsetBuffer::from_repeated_length(<val>, <repeat>);
```
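Why the two calls are interchangeable can be shown with a standalone sketch that works on plain `Vec<i32>` offsets rather than the real `OffsetBuffer`. Both helper functions below are hypothetical illustrations: one runs a prefix sum over an iterator of lengths, the other computes the i-th offset directly as `i * length`.

```rust
/// Builds offsets by running a prefix sum over an iterator of lengths.
fn offsets_from_lengths(lengths: impl IntoIterator<Item = usize>) -> Vec<i32> {
    let mut offsets = vec![0i32];
    let mut acc = 0i32;
    for len in lengths {
        let len = i32::try_from(len).expect("length overflow");
        acc = acc.checked_add(len).expect("offset overflow");
        offsets.push(acc);
    }
    offsets
}

/// Builds the same offsets directly: the i-th offset is `i * length`.
fn offsets_from_repeated_length(length: usize, n: usize) -> Vec<i32> {
    // Validate the largest offset once, up front, instead of on every step.
    let last = length.checked_mul(n).expect("offset overflow");
    i32::try_from(last).expect("offset overflow");
    (0..=n).map(|i| (i * length) as i32).collect()
}
```

The direct form has no loop-carried dependency on the previous offset, which is the kind of loop a compiler can vectorize more readily.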

# Are these changes tested?

Existing tests

# Are there any user-facing changes?

Nope

----

Related to:
- #8656
# Which issue does this PR close?

N/A

# Rationale for this change

I want to repeat the same value multiple times in a very fast way
which will be used in:
- #8653

After this and the PR below are merged, I will improve the DataFusion
scalar-to-array conversion to use them and make it really, really fast:
- #8656 

# What changes are included in this PR?

Created a function in `MutableBuffer` to repeat a slice a number of
times in a logarithmic way to reduce memcopy calls
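The doubling idea can be sketched on a plain `Vec<u8>` as below. This is an illustration of the technique, not the `MutableBuffer` implementation: the already-written prefix is copied onto the end of the buffer, so the number of copy calls grows logarithmically with `n` instead of linearly.

```rust
/// Repeats `slice` `n` times using O(log n) bulk copies.
fn repeat_slice_n_times(slice: &[u8], n: usize) -> Vec<u8> {
    let total = slice.len().checked_mul(n).expect("length overflow");
    let mut out = Vec::with_capacity(total);
    if total == 0 {
        return out;
    }
    out.extend_from_slice(slice);
    // Double the filled region while the doubled size still fits.
    while out.len() <= total / 2 {
        out.extend_from_within(..);
    }
    // Fill the remaining tail from the start of the already-filled region.
    let remaining = total - out.len();
    out.extend_from_within(..remaining);
    out
}
```

The filled region is always a whole number of repetitions, so copying a prefix of it continues the pattern correctly.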

# Are these changes tested?

Yes

# Are there any user-facing changes?

Yes, and added docs

-------

Extracted from:
- #8653

Benchmark results on local machine

| Slice Length | Repetitions (n) | repeat_slice_n_times | extend_from_slice loop | Speedup |
|--------------|-----------------|----------------------|------------------------|---------|
| 3 | 3 | 47.092 ns | 41.910 ns | 0.89x |
| 3 | 64 | 63.548 ns | 222.29 ns | 3.50x |
| 3 | 1024 | 105.57 ns | 3.031 µs | 28.7x |
| 3 | 8192 | 405.71 ns | 24.170 µs | 59.6x |
| 20 | 3 | 48.437 ns | 46.437 ns | 0.96x |
| 20 | 64 | 74.993 ns | 319.04 ns | 4.25x |
| 20 | 1024 | 350.94 ns | 4.437 µs | 12.6x |
| 20 | 8192 | 2.440 µs | 35.524 µs | 14.6x |
| 100 | 3 | 50.369 ns | 47.568 ns | 0.94x |
| 100 | 64 | 119.70 ns | 165.37 ns | 1.38x |
| 100 | 1024 | 1.734 µs | 2.623 µs | 1.51x |
| 100 | 8192 | 10.615 µs | 19.750 µs | 1.86x |

these are the results:

<details>
<summary>Result</summary>


```
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=3
                        time:   [46.719 ns 47.092 ns 47.453 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=3
                        time:   [41.833 ns 41.910 ns 41.996 ns]
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=64
                        time:   [62.935 ns 63.548 ns 64.183 ns]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=64
                        time:   [221.75 ns 222.29 ns 222.86 ns]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=1024
                        time:   [105.15 ns 105.57 ns 106.01 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=1024
                        time:   [3.0240 µs 3.0308 µs 3.0395 µs]
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=8192
                        time:   [401.57 ns 405.71 ns 409.94 ns]
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=8192
                        time:   [24.124 µs 24.170 µs 24.222 µs]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=3
                        time:   [48.287 ns 48.437 ns 48.606 ns]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=3
                        time:   [46.289 ns 46.437 ns 46.611 ns]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=64
                        time:   [74.625 ns 74.993 ns 75.395 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=64
                        time:   [318.20 ns 319.04 ns 319.98 ns]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=1024
                        time:   [346.66 ns 350.94 ns 355.17 ns]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=1024
                        time:   [4.4251 µs 4.4369 µs 4.4506 µs]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=8192
                        time:   [2.4336 µs 2.4401 µs 2.4465 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=8192
                        time:   [35.466 µs 35.524 µs 35.589 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=3
                        time:   [50.209 ns 50.369 ns 50.530 ns]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=3
                        time:   [47.439 ns 47.568 ns 47.701 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=64
                        time:   [117.77 ns 119.70 ns 122.00 ns]
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=64
                        time:   [164.88 ns 165.37 ns 166.07 ns]
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=1024
                        time:   [1.7278 µs 1.7335 µs 1.7398 µs]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=1024
                        time:   [2.6176 µs 2.6232 µs 2.6305 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=8192
                        time:   [10.583 µs 10.615 µs 10.649 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=8192
                        time:   [19.471 µs 19.750 µs 20.185 µs]
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe
```

</details>
# Which issue does this PR close?

- Followup of #8552.

# Rationale for this change

Code cleanup and optimization

# What changes are included in this PR?

Addressed the post-comments in #8552 and refactor/optimize the method
`rescale_decimal`

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

No
# Which issue does this PR close?
- Closes #8637.
- Support Variant to Arrow for Null/Time/Decimal{4,8,16}

# What changes are included in this PR?
- Add logic in
`typed_value_to_variant`/`PrimitiveVariantToArrowRowBuilder` for
`Null/Time/Decimal{4,8,16}`
- Implement `PrimitiveFromVariant` for `Time64MicrosecondType`
- Add tests to cover the added logic
# Are these changes tested?
Added some tests

# Are there any user-facing changes?

No
# Which issue does this PR close?

Add utf8-view support for json key

# Rationale for this change

Add utf8-view support for json key

# What changes are included in this PR?

Add utf8-view support for json key

# Are these changes tested?

* [x] TODO

# Are there any user-facing changes?

No
# Which issue does this PR close?

- Closes #8674

# Rationale for this change

Add json encoding for binary view

# What changes are included in this PR?

Add BinaryViewEncoder

# Are these changes tested?

* [x] TODO

# Are there any user-facing changes?

No
# Which issue does this PR close?

- part of #7835 

# Rationale for this change

We added a new crate so let's add that to the instructions too
# What changes are included in this PR?


# Are these changes tested?

# Are there any user-facing changes?
# Which issue does this PR close?

N/A

# Rationale for this change

It is not obvious that the thrift macros produce public enums only (e.g.
see
#8680 (comment)).
This should be made clear in the documentation.

# What changes are included in this PR?

Add said clarification.

# Are these changes tested?

Documentation only, so no tests required.

# Are there any user-facing changes?

No, only changes to private documentation
# Which issue does this PR close?

- Closes #8691

# Rationale for this change

The `README.md` file in `arrow-avro` instructs users to install version
56. This is invalid and should be changed to version 57.

# What changes are included in this PR?

Updated the `README.md` file to reference version 57.

# Are these changes tested?

N/A since this is a small `README.md` file change.

# Are there any user-facing changes?

The `README.md` file in `arrow-avro` now instructs users to install
version 57.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?

chore: add test case of `RowSelection::trim`

# Rationale for this change

Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.

# What changes are included in this PR?

There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.

# Are these changes tested?

We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

# Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
…DING_SLOTS (#8663)

# Which issue does this PR close?

- Closes [#8662]

# Rationale for this change

Related to #8607
We need to know how many encodings are supported in order to create the decoder slots.

# What changes are included in this PR?
Update `thrift_enum` to expose the number of variants of the `Encoding`
enum; that value is passed to `EncodingMask` and `ENCODING_SLOTS`


# Are these changes tested?
1. Originally I thought adding a unit test would guard against failures
when new encodings are introduced. I then realized the count is already
passed through automatically, so the unit test is not required; the
existing tests already cover the code.

# Are there any user-facing changes?
No
- A follow up from #8625

# Rationale for this change

While working on a separate task, I noticed `create_test_variant_array`
was redundant. Since `VariantArray` can already be constructed directly
from an iterator of Variants, this PR removes the now-unnecessary test
helper.
# Which issue does this PR close?

- Closes #8685.

# What changes are included in this PR?

In the implementation of `RowConverter::from_binary`, the `BinaryArray`
is broken into parts and an attempt is made to convert the data buffer
into `Vec` at no copying cost with `Buffer::into_vec`. Only if this
fails, the data is copied out for a newly allocated `Vec`.
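The "reuse the allocation if possible, otherwise copy" pattern can be illustrated with standard-library types only. This is an analogy, not the arrow `Buffer` API: here `Arc<Vec<u8>>` stands in for a possibly shared buffer, and `into_vec_or_copy` is a hypothetical helper name.

```rust
use std::sync::Arc;

/// Returns the bytes as an owned `Vec`, reusing the allocation when
/// `shared` is the only owner and copying only when it is still shared.
fn into_vec_or_copy(shared: Arc<Vec<u8>>) -> Vec<u8> {
    match Arc::try_unwrap(shared) {
        // Sole owner: take the Vec out without copying its contents.
        Ok(vec) => vec,
        // Other owners exist: fall back to copying into a new allocation.
        Err(still_shared) => still_shared.as_ref().clone(),
    }
}
```

In the unique-owner case the returned `Vec` points at the same heap memory as the input, so the conversion costs no copy.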

# Are these changes tested?

Passes existing tests using `RowConverter::from_binary`, which all
convert a non-shared buffer taking advantage of the optimization.
Another test is added to cover the copying path.

# Are there any user-facing changes?

No
# Which issue does this PR close?

- Closes #8692.

# Rationale for this change

Explained in issue.

# What changes are included in this PR?

- Adds `FilterPredicate::filter_record_batch`
- Adapts the free function `filter_record_batch` to use the new function
- Uses `new_unchecked` to create the filtered result. The rationale for
this is identical to #8583

# Are these changes tested?

Covered by existing tests for `filter_record_batch`

# Are there any user-facing changes?

No

---------

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Part of #7806 
# Rationale for this change

The `DeltaBitPackDecoder` can panic if it encounters a bit width in the
encoded data that is larger than the bit width of the data type being
decoded.
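A guard against this kind of input might look like the sketch below. The function name `check_bit_width` and the `String` error type are assumptions for illustration; the actual fix in the decoder is not shown in this commit message.

```rust
/// Rejects an encoded bit width that is wider than the target type `T`.
fn check_bit_width<T>(bit_width: u8) -> Result<(), String> {
    let max_bits = std::mem::size_of::<T>() * 8;
    if usize::from(bit_width) > max_bits {
        return Err(format!(
            "bit width {bit_width} exceeds the {max_bits} bits of the target type"
        ));
    }
    Ok(())
}
```

Returning an error at this point lets callers reject malformed data instead of hitting a panic later in the unpacking code.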

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?

- Part of #5853.

# Rationale for this change

While converting to the new Thrift model, the `ConvertedType` enum was
done manually due to the `NONE` variant, which used the discriminant of
`0`. This PR changes that to `-1` which allows the `thrift_enum` macro
to be used instead. This improves code maintainability.

# What changes are included in this PR?

See above.

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

No, this only changes the discriminant value for a unit variant enum.
# Which issue does this PR close?

- Contributes towards the RunEndEncoded (REE) epic #3520, but there is no
specific issue for casting.
- Replaces PRs #7713 and
#8384.

# Rationale for this change

This PR implements casting support for RunEndEncoded arrays in Apache
Arrow.

# What changes are included in this PR?
- `run_end_encoded_cast` in `arrow-cast/src/cast/run_array.rs`
- `cast_to_run_end_encoded` in `arrow-cast/src/cast/run_array.rs`
- Tests in `arrow-cast/src/cast/mod.rs`

# Are these changes tested?
Yes!

# Are there any user-facing changes?

No breaking changes, just new functionality

---------

Co-authored-by: Richard Baah <richard.baah@datadoghq.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…erator (#8696)

# Which issue does this PR close?

N/A

# Rationale for this change

Overriding these functions improves performance over the fallback
implementations

# What changes are included in this PR?

Override implementation of:
- `count` which is not optimized away even when `ExactSizeIterator` is
implemented
- `nth` to avoid calling `next` `n + 1` times (which is also used when
doing `.skip`)
- `nth_back`
- `last`
- `max`

# Are these changes tested?

Yes, I've added a lot of tests

# Are there any user-facing changes?

Nope
# Rationale for this change

Our internal testing triggered some unexpected panics. We've added
error checks for all of them so that they don't affect other users.

# What changes are included in this PR?

Various error checks to ensure panics don't occur.

# Are these changes tested?
Tests should continue to pass.

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
Existing tests should cover these changes.

# Are there any user-facing changes?
None.

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
# Which issue does this PR close?

Part of #5375

Vortex was encountering some issues after we switched our preferred List
type to `ListView`, the first thing we noticed was that
`arrow_select::filter_array` would fail on ListView (and LargeListView,
though we don't use that).

This PR addresses some missing select kernel implementations for
ListView and LargeListView.

This also fixes an existing bug in the ArrayData validation for ListView
arrays that would trigger an out of bounds index panic.

# Are these changes tested?

- [x] filter_array
- [x] concat
- [x] take


# Are there any user-facing changes?

ListView/LargeListView can now be used with the `take`, `concat` and
`filter_array` kernels

You can now use the `PartialEq` to compare ListView arrays.

---------

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Dandandan and others added 20 commits January 10, 2026 18:16
# Which issue does this PR close?

- Closes #9135 

# Rationale for this change
The code was calling `.reserve(batch_size)`, which reserves space for at
least `batch_size` additional elements
(https://doc.rust-lang.org/std/vec/struct.Vec.html#method.reserve).

This also improves performance a bit:

```
filter: primitive, 8192, nulls: 0, selectivity: 0.001
                        time:   [59.509 ms 59.660 ms 59.856 ms]
                        change: [−3.0781% −2.7917% −2.4795%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

filter: primitive, 8192, nulls: 0, selectivity: 0.01
                        time:   [6.0072 ms 6.0226 ms 6.0428 ms]
                        change: [−8.7042% −7.1161% −6.0455%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Benchmarking filter: primitive, 8192, nulls: 0, selectivity: 0.1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.5s, enable flat sampling, or reduce sample count to 50.
filter: primitive, 8192, nulls: 0, selectivity: 0.1
                        time:   [1.8664 ms 1.8709 ms 1.8772 ms]
                        change: [−15.187% −14.905% −14.632%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

filter: primitive, 8192, nulls: 0, selectivity: 0.8
                        time:   [2.5191 ms 2.5444 ms 2.5717 ms]
                        change: [−13.064% −11.414% −10.003%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe

Benchmarking filter: primitive, 8192, nulls: 0.1, selectivity: 0.001: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.7s, or reduce sample count to 60.
filter: primitive, 8192, nulls: 0.1, selectivity: 0.001
                        time:   [76.422 ms 76.671 ms 76.973 ms]
                        change: [−5.5096% −4.0229% −2.8048%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

filter: primitive, 8192, nulls: 0.1, selectivity: 0.01
                        time:   [10.197 ms 10.228 ms 10.262 ms]
                        change: [−3.6627% −3.0569% −2.4919%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

filter: primitive, 8192, nulls: 0.1, selectivity: 0.1
                        time:   [4.6635 ms 4.6750 ms 4.6915 ms]
                        change: [−9.4939% −8.5908% −7.8383%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

filter: primitive, 8192, nulls: 0.1, selectivity: 0.8
                        time:   [4.7777 ms 4.8115 ms 4.8467 ms]
                        change: [−9.9226% −9.1384% −8.3813%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```

# What changes are included in this PR?
Changes it to use `self.views.reserve(self.batch_size -
self.views.len())` to avoid allocating more than necessary (i.e. 2x the
amount).
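The difference can be reproduced with a plain `Vec`. The helper below is a hypothetical standalone version of the fix; it uses `saturating_sub` as a defensive assumption, whereas the PR text uses a plain subtraction.

```rust
/// Reserves room so that `views` can hold `batch_size` elements in total.
fn reserve_up_to_batch(views: &mut Vec<u128>, batch_size: usize) {
    // `Vec::reserve` takes the number of *additional* elements,
    // so subtract the elements that are already stored.
    views.reserve(batch_size.saturating_sub(views.len()));
}
```

With 6 elements already stored and a batch size of 8, `reserve(8)` guarantees capacity for at least 14 elements, while the helper only asks for the 2 that are still missing.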

# Are these changes tested?

Existing tests

# Are there any user-facing changes?
)

# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Closes #NNN.

# Rationale for this change

Improve JSON binary decoding performance by avoiding per-value
allocations and enabling direct hex decoding into builders.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?

Optimized binary hex decoding paths to reduce allocations and improve
throughput.


```
decode_binary_hex_json  time:   [3.6780 ms 3.6953 ms 3.7150 ms]
                        change: [−61.051% −60.818% −60.565%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

decode_fixed_binary_hex_json
                        time:   [4.0404 ms 4.1400 ms 4.2901 ms]
                        change: [−56.149% −55.040% −53.330%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  7 (7.00%) high mild
  12 (12.00%) high severe

decode_binary_view_hex_json
                        time:   [4.3731 ms 4.4242 ms 4.4767 ms]
                        change: [−53.305% −52.771% −52.239%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```
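The general idea of decoding hex straight into a destination buffer, rather than allocating a temporary `Vec` per value, can be sketched as follows. `hex_val` and `decode_hex_into` are hypothetical helpers on a plain `Vec<u8>`; the actual decoder and builder code from this PR are not shown here.

```rust
/// Value of a single ASCII hex digit, or `None` if it is not one.
fn hex_val(c: u8) -> Option<u8> {
    match c {
        b'0'..=b'9' => Some(c - b'0'),
        b'a'..=b'f' => Some(c - b'a' + 10),
        b'A'..=b'F' => Some(c - b'A' + 10),
        _ => None,
    }
}

/// Decodes `hex` and appends the bytes to `out`.
/// On error, `out` is restored to its original length.
fn decode_hex_into(hex: &str, out: &mut Vec<u8>) -> Result<(), String> {
    let bytes = hex.as_bytes();
    if bytes.len() % 2 != 0 {
        return Err("hex string has odd length".to_string());
    }
    let start = out.len();
    out.reserve(bytes.len() / 2);
    for pair in bytes.chunks_exact(2) {
        match (hex_val(pair[0]), hex_val(pair[1])) {
            (Some(hi), Some(lo)) => out.push(hi << 4 | lo),
            _ => {
                out.truncate(start);
                return Err("invalid hex digit".to_string());
            }
        }
    }
    Ok(())
}
```

Reusing one output buffer across many values means the only allocations are the occasional growth of that buffer.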

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?
Yes

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

# Are there any user-facing changes?

No

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- related to #9085
- related to #9022

# Rationale for this change

The `nullif` kernel has an optimization that counts nulls as part of
applying `nullif`, and I did not know whether it made a difference (it
turns out it does).

I made a benchmark to measure this difference.

# What changes are included in this PR?

Add a `nullif_kernel` benchmark

# Are these changes tested?

I ran them manually
# Are there any user-facing changes?

No, this only adds a new benchmark.
…9129)

# Which issue does this PR close?

- part of #9128

# Rationale for this change

While studying / profiling the Parquet reader I have noticed several
places where unnecessary allocations are happening.

# What changes are included in this PR?

Avoid ArrayData allocation in `PrimitiveArray::reinterpret_cast`

# Are these changes tested?

# Are there any user-facing changes?

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
# Which issue does this PR close?


- Part of #9136

# Rationale for this change

API for #8951, as part of a 2-3x speedup for filtering primitive types.

# What changes are included in this PR?

Adds `BooleanBufferBuilder::extend`, a fast way to extend the buffer
from an iterator.
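
To illustrate what extending a bit-packed boolean buffer from an iterator can look like, here is a minimal, std-only sketch. `BitBuilder` and all of its methods are hypothetical names for this illustration; this is not the arrow-rs `BooleanBufferBuilder` implementation, and the real `extend` may use a different signature and a faster bulk path.

```rust
/// Toy bit-packed boolean builder (hypothetical; not arrow-rs code).
#[derive(Default)]
pub struct BitBuilder {
    bytes: Vec<u8>,
    len: usize, // number of bits written so far
}

impl BitBuilder {
    /// Append a single bit, adding a new byte whenever the current one is full.
    pub fn append(&mut self, value: bool) {
        if self.len % 8 == 0 {
            self.bytes.push(0);
        }
        if value {
            self.bytes[self.len / 8] |= 1u8 << (self.len % 8);
        }
        self.len += 1;
    }

    /// Extend from an iterator, reserving once based on the iterator's
    /// lower size bound instead of growing byte by byte.
    pub fn extend<I: IntoIterator<Item = bool>>(&mut self, iter: I) {
        let iter = iter.into_iter();
        let (lower, _) = iter.size_hint();
        let needed_bytes = (self.len + lower + 7) / 8;
        self.bytes
            .reserve(needed_bytes.saturating_sub(self.bytes.len()));
        for value in iter {
            self.append(value);
        }
    }

    /// Number of bits stored.
    pub fn len(&self) -> usize {
        self.len
    }

    /// Read bit `i`; panics if `i` is out of bounds.
    pub fn get(&self, i: usize) -> bool {
        assert!(i < self.len);
        (self.bytes[i / 8] & (1u8 << (i % 8))) != 0
    }

    /// The packed bytes, least significant bit first within each byte.
    pub fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }
}
```

The sketch still appends one bit at a time; it only shows the reserve-once shape of the API, not the speedup itself.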

# Are these changes tested?

Yes, new tests

# Are there any user-facing changes?
# Which issue does this PR close?

- Closes #NNN.

# Rationale for this change

One of my New Year's resolutions is to try to encourage / build the
arrow-rs community more. Part of doing so is to lower the barrier for
new contributors with clear communication and documentation.

Thus I would like to improve the main README.

# What changes are included in this PR?

Update the README. See rendered preview here:
https://github.com/alamb/arrow-rs/blob/alamb/improved_comms_docs/README.md

1. Move community section to the top of the README file
2. Clarify that most communication happens on GitHub

<img width="1348" height="775" alt="Screenshot 2026-01-09 at 9 33 06 AM"
src="https://github.com/user-attachments/assets/fc3f6041-478d-4170-bb40-3956764a9b8d"
/>


# Are these changes tested?

CI
# Are there any user-facing changes?

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
…9134)

# Which issue does this PR close?

- Closes #9133

# Rationale for this change

Allow casting null arrays to all types we support. We missed (large)
list view, run-end encoded, and union.

# What changes are included in this PR?

Refactor the from-null arm to accept all target types, enabling casts to
large list view, list view, run-end encoded, and union types.

# Are these changes tested?

Added tests.

# Are there any user-facing changes?

No.
… return type (#9139)

### Which issue does this PR close?

Closes #9105

### Rationale for this change

The documentation for `VariantObject::get` previously described
`Result`-style semantics (`Ok(None)` / `Err`), but the method actually
returns `Option<Variant>`. This mismatch could confuse users of the API.

### What changes are included in this PR?

- Update the documentation for `VariantObject::get` to correctly
  describe its `Option` return type.
- No functional or behavioral changes are included.
…ray` (#9114)

# Which issue does this PR close?


- Part of #9061
- broken out of #9058

# Rationale for this change

The current implementation of `make_array` for StructArray and
GenericByteViewArray clones `ArrayData`, which allocates a new Vec. This
is unnecessary given that `make_array` is passed an owned ArrayData.


# What changes are included in this PR?

1. Add a new API to ArrayData to break it down into parts (`into_parts`)
2. Use that API to avoid cloning while constructing StructArray and
GenericByteViewArray
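
As a rough illustration of the idea (consume an owned value and move its parts out instead of cloning them), here is a std-only sketch. `ToyData`, `ToyStruct`, and the shape of `into_parts` here are hypothetical stand-ins; the actual `ArrayData::into_parts` signature and returned fields in arrow-rs may differ.

```rust
/// Toy stand-in for an untyped array description (hypothetical).
pub struct ToyData {
    pub len: usize,
    pub buffers: Vec<Vec<u8>>,
    pub children: Vec<ToyData>,
}

impl ToyData {
    /// Consume `self` and hand back the owned parts without copying them.
    pub fn into_parts(self) -> (usize, Vec<Vec<u8>>, Vec<ToyData>) {
        (self.len, self.buffers, self.children)
    }
}

/// Toy stand-in for a typed struct array (hypothetical).
pub struct ToyStruct {
    pub len: usize,
    pub columns: Vec<ToyData>,
}

impl From<ToyData> for ToyStruct {
    fn from(data: ToyData) -> Self {
        // Moving the `children` Vec transfers ownership of its existing heap
        // allocation; cloning it would allocate a new Vec and copy.
        let (len, _buffers, children) = data.into_parts();
        ToyStruct { len, columns: children }
    }
}
```

Because the child `Vec` is moved rather than cloned, the resulting `columns` points at the same heap allocation the input already owned.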

# Are these changes tested?

Yes by CI

# Are there any user-facing changes?
A few fewer allocations when creating arrays

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
# Which issue does this PR close?

- Closes #9147.

# Rationale for this change

# What changes are included in this PR?

Check issue

# Are these changes tested?

It's a test

# Are there any user-facing changes?

No

Closes #9131

This PR updates the Apache Software Foundation copyright year
in `arrow-rs/NOTICE.txt`, as discussed in the release verification
follow-up. No third-party entries are modified.
# Which issue does this PR close?

- Part of #9018.

# Rationale for this change

Take the offset into account when slicing a RunArray.

# What changes are included in this PR?

1. Take the offset into account when slicing a RunArray.
2. Enhance the RunArray slice API.
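
To show why a slice of run-end-encoded data has to carry its logical offset, here is a small std-only sketch. `ToyRuns` and its methods are hypothetical names for this illustration and do not reflect the actual `RunArray` API; the sketch assumes run ends are cumulative and strictly increasing.

```rust
/// Toy run-end-encoded view (hypothetical; not arrow-rs's `RunArray`).
pub struct ToyRuns<'a> {
    run_ends: &'a [usize], // cumulative logical end of each run
    values: &'a [char],    // one value per run
    offset: usize,         // logical offset of this slice into the full data
    len: usize,            // logical length of this slice
}

impl<'a> ToyRuns<'a> {
    pub fn new(run_ends: &'a [usize], values: &'a [char]) -> Self {
        let len = run_ends.last().copied().unwrap_or(0);
        Self { run_ends, values, offset: 0, len }
    }

    /// Zero-copy slice: only the logical offset and length change.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len);
        Self {
            run_ends: self.run_ends,
            values: self.values,
            offset: self.offset + offset,
            len,
        }
    }

    /// Map a logical index in this slice to the run that contains it.
    fn physical_index(&self, logical: usize) -> usize {
        // The offset must be added first; ignoring it would resolve the
        // index against the unsliced data.
        let absolute = self.offset + logical;
        self.run_ends.partition_point(|&end| end <= absolute)
    }

    pub fn get(&self, i: usize) -> char {
        assert!(i < self.len);
        self.values[self.physical_index(i)]
    }

    pub fn to_vec(&self) -> Vec<char> {
        (0..self.len).map(|i| self.get(i)).collect()
    }
}
```

Slicing never touches `run_ends` or `values`, so every lookup must add the accumulated offset before searching the run ends.
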
# Are these changes tested?

yes
# Are there any user-facing changes?

Yes, extended the API to access RunArray slices directly rather than
getting them from an index.
# Which issue does this PR close?

No known issue, but this confirms that nested dicts do indeed work in
Map and Union arrays. It was brought up here:
#9126 (comment)

# Rationale for this change

Ensure that IPC roundtripping nested dicts works in Map and Union
arrays.

# What changes are included in this PR?

Unit tests testing the functionality.

# Are these changes tested?

The whole PR consists of tests only.

# Are there any user-facing changes?

No

@alamb @Jefffrey
# Which issue does this PR close?

N/A

# Rationale for this change

Allow reserving capacity so we can avoid reallocating.

# What changes are included in this PR?

Added a `reserve` function to `Rows`, plus tests.

# Are these changes tested?

yes

# Are there any user-facing changes?

yes
…ow conversion (#9080)

# Which issue does this PR close?

N/A

# Rationale for this change

Make the row length calculation faster, which results in faster row
conversion.

# What changes are included in this PR?

1. Instead of iterating over the bytes and getting the length from the
byte slice, we use the offsets directly. This is faster because it avoids
going to the buffer.
2. Added new API for `GenericByteViewArray` (explained below)

# Are these changes tested?

Yes

# Are there any user-facing changes?

Yes, added `lengths` function to `GenericByteViewArray` to get an
iterator over the lengths of the items in the array
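
For context, the Arrow view layout stores each value's length in the first four bytes of its 16-byte view, so lengths can be read without touching the data buffers. The std-only sketch below assumes views are held as little-endian `u128` values; `view_lengths` and `make_inline_view` are hypothetical helpers for illustration, not the new `lengths` method itself.

```rust
/// Read the length stored in the low 32 bits of each 128-bit view.
pub fn view_lengths(views: &[u128]) -> impl Iterator<Item = u32> + '_ {
    views.iter().map(|v| *v as u32)
}

/// Build an inline view for a short value (at most 12 bytes):
/// a 4-byte little-endian length followed by the bytes themselves.
pub fn make_inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= 12, "only short values are stored inline");
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}
```

Truncating a view to `u32` is enough to recover the length, which is why an iterator over lengths does not need to materialize the values.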

-----

Related to:
- #9078 
- #9079

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…rayData` (#9122)

# Which issue does this PR close?
- related to #9061
- Part of #9128


# Rationale for this change

- similarly to #9120

Creating Arrays via ArrayData / `make_array` has overhead (at least 2
Vec allocations) compared to simply creating the arrays directly

# What changes are included in this PR?

Update the parquet reader to create `PrimitiveArray`s directly

# Are these changes tested?
By CI

# Are there any user-facing changes?

# Which issue does this PR close?

- Closes #9096.

# Rationale for this change

The RowFilter API exists and can evaluate predicates during the read,
but it has no examples.

# What changes are included in this PR?

- Added a rustdoc example and blog link to
`ParquetRecordBatchReaderBuilder::with_row_filter`.
- Added a runnable example in `parquet/examples/read_with_row_filter.rs`

# Are these changes tested?

Yes 
```
cargo run -p parquet --example read_with_row_filter
cargo test -p parquet --doc
```

# Are there any user-facing changes?

Yes, doc only. No API changes.
…ta` (1% improvement) (#9120)

# Which issue does this PR close?

- part of #9061
- Part of #9128

# Rationale for this change

I noticed on #9061 that there
is non-trivial overhead in creating struct arrays. I am trying to improve
`make_array` in parallel, but @tustvold had an even better idea in
#9058 (comment)

> My 2 cents is it would be better to move the codepaths relying on
ArrayData over to using the typed arrays directly, this should not only
cut down on allocations but unnecessary validation and dispatch
overheads.

# What changes are included in this PR?

Update the parquet `StructArray` reader (used for the top level
RecordBatch) to directly construct StructArray rather than using
ArrayData

# Are these changes tested?
By existing CI

Benchmarks show a small, repeatable improvement of a few percent. For
example:

```
arrow_reader_clickbench/async/Q10    1.00     12.7±0.35ms        ? ?/sec    1.02     12.9±0.44ms        ? ?/sec
```

I am pretty sure this is because the ClickBench dataset has more than
100 columns. Creating such a struct array requires cloning 100
`ArrayData` (one for each child), each of which has a `Vec<Buffer>`. So
this saves (at least) 100 allocations per batch.

# Are there any user-facing changes?

…p) (#9086)

# Which issue does this PR close?

- Closes #NNN.

# Rationale for this change

Optimize JSON struct decoding on wide objects by reducing per-row
allocations and repeated field lookups.

# What changes are included in this PR?

Reuse a flat child-position buffer in `StructArrayDecoder` and add an
optional field-name index for object mode.
Skip building the field-name index for list mode; add
overflow/allocation checks.
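
A rough sketch of the two ideas described above (a name-to-column index built once, and a flat position buffer reused across batches) might look like the following. `ToyStructDecoder`, its row-major layout, and the use of `0` as a "missing" marker are all assumptions made for this illustration; they are not taken from the arrow-json `StructArrayDecoder` source.

```rust
use std::collections::HashMap;

/// Toy decoder state (hypothetical; not the arrow-json implementation).
pub struct ToyStructDecoder {
    field_index: HashMap<String, usize>, // field name -> column, built once
    num_fields: usize,
    child_pos: Vec<u32>, // flat rows * fields buffer, reused across calls
}

impl ToyStructDecoder {
    pub fn new(field_names: &[&str]) -> Self {
        let field_index = field_names
            .iter()
            .enumerate()
            .map(|(i, name)| ((*name).to_string(), i))
            .collect();
        Self {
            field_index,
            num_fields: field_names.len(),
            child_pos: Vec::new(),
        }
    }

    /// Record, for every (row, field), the token position of its value.
    /// Assumes real positions are >= 1, so 0 can mean "field missing".
    pub fn index_rows(&mut self, rows: &[Vec<(&str, u32)>]) -> &[u32] {
        // Guard the size computation against overflow before allocating.
        let total = rows
            .len()
            .checked_mul(self.num_fields)
            .expect("rows * fields overflows usize");
        self.child_pos.clear(); // keeps the existing capacity
        self.child_pos.resize(total, 0);
        for (row, fields) in rows.iter().enumerate() {
            for (name, pos) in fields {
                // One hash lookup per key instead of a scan over all fields.
                if let Some(&col) = self.field_index.get(*name) {
                    self.child_pos[row * self.num_fields + col] = *pos;
                }
            }
        }
        &self.child_pos
    }
}
```

Keys that are not in the index are simply skipped here; how the real decoder treats unknown fields is not covered by this sketch.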

```
decode_wide_object_i64_json
                        time:   [11.828 ms 11.865 ms 11.905 ms]
                        change: [−67.828% −67.378% −67.008%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

decode_wide_object_i64_serialize
                        time:   [7.6923 ms 7.7402 ms 7.7906 ms]
                        change: [−75.652% −75.483% −75.331%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

```

# Are these changes tested?
Yes
# Are there any user-facing changes?
No
…om `ArrayData` (#9156)

# Which issue does this PR close?

- part of #9061
- follow on #9114


# Rationale for this change

@scovich noted in
#9114 (comment) that
calling `Vec::remove` does an extra copy and that `Vec::from` doesn't
actually reuse the allocation the way I thought it did.


# What changes are included in this PR?

Build the Arc for buffers directly

# Are these changes tested?

By existing tests

# Are there any user-facing changes?

alamb and others added 3 commits January 14, 2026 13:28
# Which issue does this PR close?

- Part of #9061
- broken out of #9058

# Rationale for this change

Let's make arrow-rs as fast as we can: the fewer allocations, the
better.

# What changes are included in this PR?

Apply pattern from #9114 

# Are these changes tested?

Existing tests 

# Are there any user-facing changes?

No
…9160)

# Which issue does this PR close?

- Part of #9061
- broken out of #9058

# Rationale for this change

Let's make arrow-rs as fast as we can: the fewer allocations, the
better.

# What changes are included in this PR?

Apply pattern from #9114 

# Are these changes tested?

Existing tests 

# Are there any user-facing changes?

No
…n row conversion (#9078)

# Which issue does this PR close?

N/A

# Rationale for this change

Make the row length calculation faster, which results in faster row
conversion.

# What changes are included in this PR?

Instead of iterating over the items in the array and getting the length
from the byte slice, we use the offsets directly and zip with nulls if
necessary
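
The approach can be sketched with plain slices: value lengths fall out of adjacent offsets, and validity is zipped in only when nulls are present. `value_lengths` is a hypothetical helper for illustration; it uses a `&[bool]` validity slice rather than Arrow's packed null buffer, and it is not the row-converter code itself.

```rust
/// Per-value byte lengths computed from an offsets slice.
/// Returns `None` for null slots when a validity slice is supplied.
pub fn value_lengths(offsets: &[i32], validity: Option<&[bool]>) -> Vec<Option<usize>> {
    // Each length is the difference of two adjacent offsets; the value
    // bytes themselves are never read.
    let lengths = offsets.windows(2).map(|w| (w[1] - w[0]) as usize);
    match validity {
        // No nulls: every slot has a length.
        None => lengths.map(Some).collect(),
        // Nulls present: zip lengths with validity flags.
        Some(valid) => lengths
            .zip(valid.iter())
            .map(|(len, &is_valid)| is_valid.then_some(len))
            .collect(),
    }
}
```

The `None` branch avoids the per-value validity check entirely, which matches the "zip with nulls if necessary" wording above.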

# Are these changes tested?

Existing tests

# Are there any user-facing changes?

Faster encoding

------

Split into 2 more PRs, as the other 2 change the public API.

Related to:
- #9079
- #9080

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
… conversion (#9079)

# Which issue does this PR close?

N/A

# Rationale for this change

Make the row length calculation faster, which results in faster row
conversion.

# What changes are included in this PR?

1. Instead of iterating over the rows and getting the length from the
byte slice, we use the offsets directly
2. Added 3 new APIs for `Rows` (explained below)

# Are these changes tested?

Yes

# Are there any user-facing changes?

Yes, added 3 functions to `Rows`:
- `row_len` - get the row length at index
- `row_len_unchecked` - get the row length at index without bound checks
- `lengths` - get iterator over the lengths of the rows
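
To make the offsets-based approach concrete, here is a std-only sketch of a row container that answers length queries from its offsets alone. `ToyRows` is a hypothetical type for this illustration; the method names mirror the ones listed above, but the signatures are assumptions, and the unchecked variant is omitted.

```rust
/// Toy row container: one flat byte buffer plus row offsets
/// (hypothetical; not arrow-row's `Rows`).
pub struct ToyRows {
    buffer: Vec<u8>,
    offsets: Vec<usize>, // always num_rows + 1 entries, starting at 0
}

impl ToyRows {
    pub fn new() -> Self {
        Self { buffer: Vec::new(), offsets: vec![0] }
    }

    /// Append a row's bytes and record where it ends.
    pub fn push(&mut self, row: &[u8]) {
        self.buffer.extend_from_slice(row);
        self.offsets.push(self.buffer.len());
    }

    pub fn num_rows(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Length of row `i`, computed from two offsets; panics if out of bounds.
    pub fn row_len(&self, i: usize) -> usize {
        assert!(i < self.num_rows(), "row index out of bounds");
        self.offsets[i + 1] - self.offsets[i]
    }

    /// Iterator over all row lengths; the byte buffer is never read.
    pub fn lengths(&self) -> impl Iterator<Item = usize> + '_ {
        self.offsets.windows(2).map(|w| w[1] - w[0])
    }
}
```

Neither `row_len` nor `lengths` slices into `buffer`, which is the point of working from offsets.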

-----

Related to:
- #9078
- #9080

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@amlynczak-rel amlynczak-rel merged commit 7194aae into relativityone:main Jan 14, 2026
3 checks passed