update 57.2.0 to 57.3.0 #15

Open
mandrush wants to merge 430 commits into relativityone:v57.2.0-main from apache:main

Conversation

@mandrush

Which issue does this PR close?

  • Closes #NNN.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

kunalsinghdadhwal and others added 25 commits March 19, 2026 21:14
# Which issue does this PR close?


- Closes #9378 

# Rationale for this change

Implements the optimizations listed in the issue description:

- Align to 8 bytes
- Don't try to return a buffer with bit_offset 0 but round it to a
multiple of 64
- Use chunk_exact for the fallback path


# What changes are included in this PR?

When both inputs share the same sub-64-bit alignment (left_offset % 64
== right_offset % 64), the optimized path is used. This covers the
common cases (both offset 0, both sliced equally, etc.). The BitChunks
fallback is retained only when the two offsets have different sub-64-bit
alignment.
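
The alignment condition above can be sketched with plain `u64` words. This is a toy, std-only illustration and not the arrow-rs code: `same_sub_word_alignment`, `bitwise_binary_aligned` and `get_bit` are hypothetical names, and the real implementation works on `Buffer`/`BooleanBuffer` rather than `&[u64]`.

```rust
/// Hypothetical helper: true when both bit offsets sit at the same position
/// inside a 64-bit word (left_offset % 64 == right_offset % 64).
fn same_sub_word_alignment(left_offset: usize, right_offset: usize) -> bool {
    left_offset % 64 == right_offset % 64
}

/// Toy word-wise path over `u64` words (an assumption, not the arrow-rs API).
/// Returns the result words plus the bit offset of the first logical bit,
/// or `None` when the caller would have to use a bit-realigning fallback.
fn bitwise_binary_aligned(
    left: &[u64],
    left_offset: usize,
    right: &[u64],
    right_offset: usize,
    len: usize,
    op: impl Fn(u64, u64) -> u64,
) -> Option<(Vec<u64>, usize)> {
    if !same_sub_word_alignment(left_offset, right_offset) {
        return None;
    }
    // The shared sub-word offset is kept in the result instead of shifting to 0.
    let bit_offset = left_offset % 64;
    let n_words = (bit_offset + len + 63) / 64;
    let l = &left[left_offset / 64..][..n_words];
    let r = &right[right_offset / 64..][..n_words];
    // Whole words are combined, so bits outside the logical range are padding.
    let words = l.iter().zip(r).map(|(a, b)| op(*a, *b)).collect();
    Some((words, bit_offset))
}

/// Reads logical bit `i` of a result that starts at `bit_offset`.
fn get_bit(words: &[u64], bit_offset: usize, i: usize) -> bool {
    let pos = bit_offset + i;
    (words[pos / 64] >> (pos % 64)) & 1 == 1
}
```

Note how the returned offset is non-zero whenever the inputs are sliced, which mirrors the user-facing change described below: callers have to read the result through its offset rather than assume bit 0 is the first logical bit.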

# Are these changes tested?

Yes, the existing tests were updated and are included.

# Are there any user-facing changes?
  

Yes, this is a minor breaking change to from_bitwise_binary_op:

- The returned BooleanBuffer may now have a non-zero offset (previously
always 0)
- The returned BooleanBuffer may have padding bits set outside the
logical range in values()

---------

Signed-off-by: Kunal Singh Dadhwal <kunalsinghdadhwal@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?

None

# Rationale for this change

We want to use the SBBF Bloom Filter, but need to construct/serialize it
manually. Currently there is no way to create a new `Sbbf` outside of
this crate. Alongside this, we want to store the `Sbbf` in a
`FixedSizeBinary` column for some fancy indexing.

# What changes are included in this PR?

Some methods become public

# Are these changes tested?

N/A

# Are there any user-facing changes?

Yes, we add a few more public methods to the `Sbbf` struct
# Which issue does this PR close?

- Closes #NNN.

# Rationale for this change

Rust implementation of apache/arrow#45360

Traditional Parquet writing splits data pages at fixed sizes, so a
single inserted or deleted row causes all subsequent pages to shift —
resulting in nearly every byte being re-uploaded to content-addressable
storage (CAS) systems. CDC determines page boundaries via a rolling
gearhash over column values, so unchanged data produces identical pages
across different writes enabling storage cost reductions and faster
upload times.
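
To illustrate content-defined boundaries, here is a minimal, std-only sketch over raw bytes. It is not the parquet implementation: the gear table (filled here with a splitmix64 generator), the `mask` parameter and the function names are assumptions made for illustration, and the real chunker hashes column values rather than a plain byte slice.

```rust
/// Builds a deterministic 256-entry gear table with splitmix64.
/// (Assumption: the real chunker ships its own table.)
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Returns the end offsets of each chunk. A boundary is placed when the
/// rolling hash matches `mask` (after `min_size` bytes) or when the chunk
/// reaches `max_size` bytes. Assumes 1 <= min_size <= max_size.
fn chunk_boundaries(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<usize> {
    let table = gear_table();
    let mut boundaries = Vec::new();
    let mut hash = 0u64;
    let mut chunk_len = 0usize;
    for (i, &byte) in data.iter().enumerate() {
        // Gear hash step: older bytes shift out, so the hash depends only on
        // the most recent bytes rather than on absolute position.
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        chunk_len += 1;
        let hit = chunk_len >= min_size && (hash & mask) == 0;
        if hit || chunk_len >= max_size {
            boundaries.push(i + 1);
            hash = 0;
            chunk_len = 0;
        }
    }
    if chunk_len > 0 {
        boundaries.push(data.len()); // trailing partial chunk
    }
    boundaries
}
```

Because a boundary depends on local content only, an insertion near the start of the data tends to move just the nearby boundaries, and later chunks can come out byte-identical.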

See more details in https://huggingface.co/blog/parquet-cdc

The original C++ implementation
apache/arrow#45360

Evaluation tool https://github.com/huggingface/dataset-dedupe-estimator
where I already integrated this PR to verify that deduplication
effectiveness is on par with parquet-cpp (lower is better):

<img width="984" height="411" alt="image"
src="https://github.com/user-attachments/assets/e6e80931-ac76-4bdd-bf9c-ba7e06559411"
/>


# What changes are included in this PR?

- **Content-defined chunker**  at `parquet/src/column/chunker/`
- **Arrow writer integration** integrated in `ArrowColumnWriter`
- **Writer properties** via `CdcOptions` struct (`min_chunk_size`,
`max_chunk_size`, `norm_level`)
- **ColumnDescriptor**: added a `repeated_ancestor_def_level` field for
iterating over nested field values

# Are these changes tested?

Yes — unit tests are located in `cdc.rs` and ported from the C++
implementation.

# Are there any user-facing changes?

New **experimental** API, disabled by default — no behavior change for
existing code:

```rust
// Simple toggle (256 KiB min, 1 MiB max, norm_level 0)
let props = WriterProperties::builder()
    .set_content_defined_chunking(true)
    .build();

// Explicit CDC parameters
let props = WriterProperties::builder()
    .set_cdc_options(CdcOptions { min_chunk_size: 128 * 1024, max_chunk_size: 512 * 1024, norm_level: 1 })
    .build();
```

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
#9576)

# Which issue does this PR close?

- Closes #9526 

# Rationale for this change

`shred_variant` already supports Binary and LargeBinary types (#9525,
#9554), but unshred_variant does not handle these types. This means
shredded Binary/LargeBinary columns cannot be converted back to
unshredded VariantArrays.

# What changes are included in this PR?

Adds unshred_variant support for DataType::Binary and
DataType::LargeBinary in parquet-variant-compute/src/unshred_variant.rs:
  - New enum variants PrimitiveBinary and PrimitiveLargeBinary
  - Match arms in append_row and try_new_opt
  - AppendToVariantBuilder impls for BinaryArray and LargeBinaryArray



# Are these changes tested?

Yes

# Are there any user-facing changes?

No breaking changes

---------

Signed-off-by: Kunal Singh Dadhwal <kunalsinghdadhwal@gmail.com>
# Which issue does this PR close?

- part of #9108

# Rationale for this change

Prepare for next release

# What changes are included in this PR?

1. Update version to `58.1.0`
2. Add changelog. See rendered preview here:
https://github.com/alamb/arrow-rs/blob/alamb/prepare_58.1.0/CHANGELOG.md

# Are these changes tested?

By CI
# Are there any user-facing changes?

Yes
…#9590)

## Summary

- Reserve `output.views` capacity in
`ByteViewArrayDecoderDictionary::read` before the decode loop
- Reserve `output.offsets` capacity in
`ByteArrayDecoderDictionary::read` before the decode loop

This avoids per-chunk reallocation during `extend` calls inside the
dictionary decode loop.
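
The pattern can be sketched with a plain `Vec`. This is a simplified stand-in, not the parquet decoder: `decode_dictionary` is a hypothetical function, and the real code reserves on `output.views` / `output.offsets`.

```rust
/// Toy dictionary decode: looks up each key in `dict`, chunk by chunk.
/// `chunk_size` must be non-zero (`chunks` panics on 0).
fn decode_dictionary(dict: &[u32], keys: &[usize], chunk_size: usize) -> Vec<u32> {
    let mut out = Vec::new();
    // Reserve once up front so the `extend` calls below do not have to grow
    // the vector repeatedly inside the loop.
    out.reserve(keys.len());
    for chunk in keys.chunks(chunk_size) {
        out.extend(chunk.iter().map(|&k| dict[k]));
    }
    out
}
```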

Closes #9587

## Test plan

- [ ] Existing tests pass (no functional change, only pre-allocation)
- [ ] Benchmark dictionary-encoded StringView/BinaryView/String reads

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Rationale for this change

In some cases, it is desirable to print strings with surrounding
quotation marks. A typical example that we run into in
https://github.com/rerun-io/rerun is a `StructArray` that contains empty
strings:

Current formatting:

```text
{name: }
```

Added option in this PR:

```text
{name: ""}
```

# What changes are included in this PR?

This PR relies on `std::fmt::Debug` to do the actual formatting of
strings, which means that all escaping is handled out of the box.
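
A minimal sketch of what relying on `Debug` buys, using only the standard library. `format_string_value` and its `quoted` flag are hypothetical names for illustration, not the option added by this PR.

```rust
/// Formats a string either raw or via `Debug`, which adds the surrounding
/// quotation marks and escapes characters such as `"` and newlines.
fn format_string_value(value: &str, quoted: bool) -> String {
    if quoted {
        format!("{:?}", value)
    } else {
        value.to_string()
    }
}
```

With `quoted` set, an empty string is rendered as `""` instead of nothing, which is what makes the `{name: ""}` output above possible.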

# Are these changes tested?

This PR contains tests for different types of inputs, including escape
sequences. It also tests the `StructArray` example outlined above.

# Are there any user-facing changes?

By default this option is false, making the feature opt-in.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
## Which issue does this PR close?

Closes #9580

## Rationale

The current VLQ decoder calls `get_aligned` for each byte, which
involves repeated offset calculations and bounds checks in the hot loop.

## What changes are included in this PR?

Align to the byte boundary once, then iterate directly over the buffer
slice, avoiding per-byte overhead from `get_aligned`.
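
As a rough illustration of iterating directly over the byte slice, here is a std-only VLQ (ULEB128-style) decoder. It is a sketch, not the parquet `BitReader` code: `read_vlq` is a hypothetical name and it assumes the input already starts on a byte boundary.

```rust
/// Decodes one VLQ-encoded unsigned integer from the start of `buf`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input ends (or exceeds 10 bytes) before a terminating byte is seen.
fn read_vlq(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    // A u64 needs at most 10 groups of 7 bits.
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if (byte & 0x80) == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```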

## Are there any user-facing changes?

No.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Rationale for this change

The `object_store` crate release 0.13.2 breaks the build of parquet
because it feature-gates the `buffered` module. I have filed
apache/arrow-rs-object-store#677 about the
breakage; meanwhile this fix is made in expectation that 0.13.2 will not
be yanked and the feature gate will remain.

# What changes are included in this PR?

Bump the version to 0.13.2 and request the "tokio" feature.

# Are these changes tested?

The build should succeed in CI workflows.

# Are there any user-facing changes?

No

Co-authored-by: Mikhail Zabaluev <mikhail.zabaluev@gmail.com>
Updates the requirements on [sha2](https://github.com/RustCrypto/hashes)
to permit the latest version.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/RustCrypto/hashes/commit/ffe093984c004769747e998f77da8ff7c0e7a765"><code>ffe0939</code></a>
Release sha2 0.11.0 (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/806">#806</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/8991b65fe400c31c4cc189510f86ae642c470cd9"><code>8991b65</code></a>
Use the standard order of the <code>[package]</code> section fields (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/807">#807</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/3d2bc57db40fd6aeb25d6c6da98d67e2784c2985"><code>3d2bc57</code></a>
sha2: refactor backends (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/802">#802</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/faa55fb83697c8f3113636d88070e5f5edc8c335"><code>faa55fb</code></a>
sha3: bump <code>keccak</code> to v0.2 (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/803">#803</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/d3e6489e56f8486d4a93ceb7a8abf4924af1de7b"><code>d3e6489</code></a>
sha3 v0.11.0-rc.9 (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/801">#801</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/bbf6f51ff97f81ab15e6e5f6cf878bfbcb1f47c8"><code>bbf6f51</code></a>
sha2: tweak backend docs (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/800">#800</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/155dbbf2959dbec0ec75948a82590ddaede2d3bc"><code>155dbbf</code></a>
sha3: add default value for the <code>DS</code> generic parameter on
<code>TurboShake128/256</code>...</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/ed514f2b34526683b3b7c41670f1887982c3df64"><code>ed514f2</code></a>
Use published version of <code>keccak</code> v0.2 (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/799">#799</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/702bcd83735a49c928c0fc24506924f5c0aa22af"><code>702bcd8</code></a>
Migrate to closure-based <code>keccak</code> (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/796">#796</a>)</li>
<li><a
href="https://github.com/RustCrypto/hashes/commit/827c043f82d57666a0b146d156e91c39535c1305"><code>827c043</code></a>
sha3 v0.11.0-rc.8 (<a
href="https://redirect.github.com/RustCrypto/hashes/issues/794">#794</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/RustCrypto/hashes/compare/groestl-v0.10.0...sha2-v0.11.0">compare
view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# Which issue does this PR close?

- Closes #9340.

# Rationale for this change

# What changes are included in this PR?

Support `ListView` codec in arrow-json. Using `ListLikeArray` trait to
simplify implementation.

# Are these changes tested?

Tests added

# Are there any user-facing changes?

New encoder/decoder
… verification (#9604)

# Which issue does this PR close?

- Closes #9603 

# Rationale for this change

The release and dev KEYS files could get out of sync.
We should use the release/ version:
- Users use the release/ version, not the dev/ version, when they verify
our artifacts' signatures
- https://dist.apache.org/ may reject our requests if CI makes too many
of them

# What changes are included in this PR?

Use
`https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/KEYS`
to download the KEYS file and the expected
`https://dist.apache.org/repos/dist/dev/arrow` for the RC artifacts.

# Are these changes tested?

Yes, I've verified 58.1.0 1 both previous to the change and after the
change.

# Are there any user-facing changes?

No
…uct)` (#9597)

# Which issue does this PR close?

- Closes #9596.

# Rationale for this change

Check issue

# What changes are included in this PR?

Reuse `shred_basic_variant` as a fast path for unshredded `Struct`
handling in `variant_get(..., Struct)`

# Are these changes tested?

Yes, added two unit tests to establish safe mode behavior.

# Are there any user-facing changes?

## Summary

- Fix `MutableArrayData::extend_nulls` which previously panicked
unconditionally for both sparse and dense Union arrays
- For sparse unions: append the first type_id and extend nulls in all
children
- For dense unions: append the first type_id, compute offsets into the
first child, and extend nulls in that child only
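
The two cases above can be sketched with a toy union made of plain vectors. This is not `MutableArrayData`: `ToyUnion` and its fields are hypothetical, children are modelled as `Vec<Option<i64>>`, and the sketch assumes `children[0]` is the child for the first declared type id.

```rust
/// Toy union: `offsets` is `None` for sparse unions and `Some` for dense ones.
struct ToyUnion {
    declared_type_ids: Vec<i8>,
    type_ids: Vec<i8>,
    offsets: Option<Vec<i32>>,
    children: Vec<Vec<Option<i64>>>,
}

impl ToyUnion {
    /// Appends `len` null slots, all tagged with the first declared type id.
    fn extend_nulls(&mut self, len: usize) {
        let first = self.declared_type_ids[0];
        self.type_ids.resize(self.type_ids.len() + len, first);
        match &mut self.offsets {
            // Sparse: every child has one slot per row, so all children grow.
            None => {
                for child in &mut self.children {
                    child.resize(child.len() + len, None);
                }
            }
            // Dense: only the first child grows, and the new offsets point at
            // the null slots appended to it.
            Some(offsets) => {
                let child = &mut self.children[0];
                let start = child.len() as i32;
                offsets.extend(start..start + len as i32);
                child.resize(child.len() + len, None);
            }
        }
    }
}
```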

## Background

This bug was discovered via DataFusion. `CaseExpr` uses
`MutableArrayData` via `scatter()` to build result arrays. When a `CASE`
expression returns a Union type (e.g., from `json_get` which returns a
JSON union) and there are rows where no `WHEN` branch matches (implicit
`ELSE NULL`), `scatter` calls `extend_nulls` which panics with "cannot
call extend_nulls on UnionArray as cannot infer type".

Any query like:
```sql
SELECT CASE WHEN condition THEN returns_union(col, 'key') END FROM table
```
would panic if `condition` is false for any row.

## Root Cause

The `extend_nulls` implementation for Union arrays unconditionally
panicked because it claimed it "cannot infer type". However, the Union's
field definitions (child types and type IDs) are available in the
`MutableArrayData`'s data type — there's enough information to produce
valid null entries by picking the first declared type_id.

## Test plan

- [x] Added test for sparse union `extend_nulls`
- [x] Added test for dense union `extend_nulls`
- [x] Existing `test_union_dense` continues to pass
- [x] All `array_transform` tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
# Which issue does this PR close?

- Relates to #9497.

# Rationale for this change

# What changes are included in this PR?

As part of the effort to move the JSON reader away from `ArrayData`
toward typed `ArrayRef` APIs, this PR changes the
`ArrayDecoder::decode` interface to return `ArrayRef` directly and
updates all decoder implementations (list, struct, map, run-end encoded)
to construct typed arrays without intermediate `ArrayData` round-trips.
New benchmarks for map and run-end encoded decoding are added to verify
there is no performance regression.

# Are these changes tested?

Yes

# Are there any user-facing changes?

No
# Which issue does this PR close?

- closes #9593

# Rationale for this change

In a previous PR (#9593), I changed instances of `truncate(0)` to
`clear()`. However, this broke the test `test_truncate_with_pool` at
`arrow-buffer/src/buffer/mutable.rs:1357`, due to an inconsistency
between the implementations of `truncate` and `clear`. This PR fixes that
test.

# What changes are included in this PR?

This PR copies a section of code related to the `pool` feature present
in `truncate` but absent in `clear`, fixing the failing unit test.

# Are these changes tested?

Yes.

# Are there any user-facing changes?

No.
# Rationale for this change

CdcOptions only contains primitive fields (usize, usize, i32) so
deriving PartialEq and Eq is straightforward. This is needed by
downstream crates such as DataFusion that embed CdcOptions in their own
configuration structs and need to compare them.

# What changes are included in this PR?

Implemented PartialEq and Eq for CdcOptions.
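
A small sketch of why the derive matters downstream. The `CdcOptions` below is a local stand-in with the three fields named earlier in this PR list (the real struct may derive other traits too), and `WriterConfig` is a hypothetical downstream configuration struct.

```rust
/// Local stand-in mirroring the three primitive fields of `CdcOptions`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CdcOptions {
    min_chunk_size: usize,
    max_chunk_size: usize,
    norm_level: i32,
}

/// Hypothetical downstream config. Its own `PartialEq`/`Eq` derive only
/// compiles because the embedded `CdcOptions` implements both traits.
#[derive(Debug, Clone, PartialEq, Eq)]
struct WriterConfig {
    compression_level: u32,
    cdc: Option<CdcOptions>,
}
```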

# Are these changes tested?

Added an equality test.

# Are there any user-facing changes?

No.
# Which issue does this PR close?

- Closes #8400.

# Rationale for this change

Check issue

# What changes are included in this PR?

- Added an `AppendNullMode` enum supporting all semantics
- Replaced the bool logic with the new enum
- Fixed test outputs for List Array cases

# Are these changes tested?
- Added unit tests

# Are there any user-facing changes?

# Rationale for this change

Makes the code simpler and more readable by relying on newer PyO3 and Rust
features. No behavior should have changed, apart from the error message
raised if `__arrow_c_array__` does not return a tuple.

# What changes are included in this PR?

- Use `.call_method0(M)?` instead of `.getattr(M)?.call0()`
- Use `.extract()`, which allows more advanced features like directly
extracting tuple elements
- Remove temporary variables just before returning
- Use `&raw const` and `&raw mut` pointers instead of casting and `addr_of!`
# Which issue does this PR close?

- Part of #9637
# Rationale for this change

I can't benchmark the arrow-writer changes in
#9447 due to hitting a panic:
- #9637

# What changes are included in this PR?

Temporarily disable the cdc benchmarks until the underlying bug is fixed

# Are these changes tested?

# Are there any user-facing changes?

…ly (#9447)

# Which issue does this PR close?

- Closes #9446.
- closes #9636

# Rationale for this change

When writing a Parquet column with very sparse data,
`GenericColumnWriter` accumulates unbounded memory for definition and
repetition levels. The raw `i16` values are appended into `Vec<i16>`
sinks on every `write_batch` call and only RLE-encoded in bulk when a
data page is flushed. For a column that is almost entirely nulls, the
actual RLE-encoded output can be tiny, yet the intermediate buffer grows
linearly with the number of rows.

# What changes are included in this PR?

Replace the two raw-level `Vec<i16>` sinks (`def_levels_sink` /
`rep_levels_sink`) with streaming `LevelEncoder` fields
(`def_levels_encoder` / `rep_levels_encoder`). Behavior is the same, but
we keep running RLE-encoded state rather than the full list of rows in
memory. Existing logic is reused.
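
The memory difference can be seen in a toy run-length encoder that keeps only running state. This is a sketch under simplifying assumptions, not the parquet `LevelEncoder`: `StreamingRunEncoder` is a hypothetical type, and the real encoding is the RLE/bit-packing hybrid rather than a list of `(value, count)` pairs.

```rust
/// Toy streaming encoder: stores one `(level, run_length)` pair per run
/// instead of one `i16` per row.
#[derive(Default)]
struct StreamingRunEncoder {
    runs: Vec<(i16, u32)>,
}

impl StreamingRunEncoder {
    /// Adds one level, extending the current run when the value repeats.
    fn put(&mut self, level: i16) {
        if let Some((value, count)) = self.runs.last_mut() {
            if *value == level {
                *count += 1;
                return;
            }
        }
        self.runs.push((level, 1));
    }

    /// Encodes a whole batch as it arrives, so no raw levels are retained.
    fn put_batch(&mut self, levels: &[i16]) {
        for &level in levels {
            self.put(level);
        }
    }
}
```

For a column that is almost entirely null, the state stays at a handful of runs no matter how many rows are written, whereas a `Vec<i16>` sink grows by two bytes per row until the page is flushed.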

# Are these changes tested?

Yes, all tests passing.
Benchmarks show no regression. `list_primitive` benches improved by
3-5%:

```
Benchmarking list_primitive/default: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
list_primitive/default  time:   [1.2109 ms 1.2171 ms 1.2248 ms]
                        thrpt:  [1.6999 GiB/s 1.7105 GiB/s 1.7194 GiB/s]
                 change:
                        time:   [−3.7197% −2.8848% −2.0036%] (p = 0.00 < 0.05)
                        thrpt:  [+2.0445% +2.9705% +3.8634%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Benchmarking list_primitive/bloom_filter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, enable flat sampling, or reduce sample count to 50.
list_primitive/bloom_filter
                        time:   [1.4405 ms 1.4810 ms 1.5292 ms]
                        thrpt:  [1.3615 GiB/s 1.4058 GiB/s 1.4452 GiB/s]
                 change:
                        time:   [−6.4332% −4.7568% −2.9048%] (p = 0.00 < 0.05)
                        thrpt:  [+2.9917% +4.9944% +6.8755%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Benchmarking list_primitive/parquet_2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.3s, enable flat sampling, or reduce sample count to 60.
list_primitive/parquet_2
                        time:   [1.2271 ms 1.2311 ms 1.2362 ms]
                        thrpt:  [1.6841 GiB/s 1.6911 GiB/s 1.6966 GiB/s]
                 change:
                        time:   [−5.8536% −4.9672% −4.1905%] (p = 0.00 < 0.05)
                        thrpt:  [+4.3738% +5.2269% +6.2175%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
list_primitive/zstd     time:   [2.0056 ms 2.0148 ms 2.0262 ms]
                        thrpt:  [1.0275 GiB/s 1.0333 GiB/s 1.0381 GiB/s]
                 change:
                        time:   [−4.7073% −3.6719% −2.6698%] (p = 0.00 < 0.05)
                        thrpt:  [+2.7431% +3.8118% +4.9398%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe
list_primitive/zstd_parquet_2
                        time:   [2.0455 ms 2.0730 ms 2.1120 ms]
                        thrpt:  [1009.4 MiB/s 1.0043 GiB/s 1.0178 GiB/s]
                 change:
                        time:   [−5.8626% −3.7672% −1.4196%] (p = 0.00 < 0.05)
                        thrpt:  [+1.4401% +3.9146% +6.2277%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Benchmarking list_primitive_non_null/default: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.6s, enable flat sampling, or reduce sample count to 60.
list_primitive_non_null/default
                        time:   [1.3199 ms 1.3333 ms 1.3504 ms]
                        thrpt:  [1.5384 GiB/s 1.5581 GiB/s 1.5740 GiB/s]
                 change:
                        time:   [−4.1662% −2.3491% −0.7148%] (p = 0.01 < 0.05)
                        thrpt:  [+0.7200% +2.4056% +4.3473%]
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
Benchmarking list_primitive_non_null/bloom_filter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.4s, enable flat sampling, or reduce sample count to 50.
list_primitive_non_null/bloom_filter
                        time:   [1.6567 ms 1.6668 ms 1.6805 ms]
                        thrpt:  [1.2362 GiB/s 1.2464 GiB/s 1.2540 GiB/s]
                 change:
                        time:   [−2.7884% −1.3493% +0.2820%] (p = 0.07 > 0.05)
                        thrpt:  [−0.2812% +1.3677% +2.8684%]
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
Benchmarking list_primitive_non_null/parquet_2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.2s, enable flat sampling, or reduce sample count to 50.
list_primitive_non_null/parquet_2
                        time:   [1.4279 ms 1.4409 ms 1.4551 ms]
                        thrpt:  [1.4277 GiB/s 1.4418 GiB/s 1.4550 GiB/s]
                 change:
                        time:   [−2.0598% −0.9952% −0.1318%] (p = 0.04 < 0.05)
                        thrpt:  [+0.1319% +1.0052% +2.1032%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
list_primitive_non_null/zstd
                        time:   [2.6966 ms 2.7358 ms 2.7994 ms]
                        thrpt:  [759.93 MiB/s 777.60 MiB/s 788.89 MiB/s]
                 change:
                        time:   [−3.8379% −2.1418% +0.0785%] (p = 0.03 < 0.05)
                        thrpt:  [−0.0784% +2.1887% +3.9911%]
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
list_primitive_non_null/zstd_parquet_2
                        time:   [2.7684 ms 2.7861 ms 2.8099 ms]
                        thrpt:  [757.07 MiB/s 763.55 MiB/s 768.44 MiB/s]
                 change:
                        time:   [−6.4460% −4.1387% −2.1474%] (p = 0.00 < 0.05)
                        thrpt:  [+2.1946% +4.3174% +6.8901%]
                        Performance has improved.
```

# Are there any user-facing changes?

None. Some internal symbols are now unused. I added some
`#[allow(dead_code)]` statements since these were experimental-visible
and might be externally relied on.

---------

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?

- Follow on to #9594

# Rationale for this change


@kylebarron says
#9594 (comment):

> fwiw previously there was a nice user-facing error here, while now the
error generated from extract will be much more obtuse. Ideally this
exception will never be raised except if the producer doesn't follow the
spec correctly.

# What changes are included in this PR?
Restore the nice error

# Are these changes tested?

yes, added a test

# Are there any user-facing changes?

# Which issue does this PR close?

- Closes #NNN.

# Rationale for this change

Miri in CI is VERY slow (around 2.5 hours), but the github runners
actually have 4 vCPUs and some memory, so using nextest can give us some
speedup.

# What changes are included in this PR?

Install nextest in CI and then use it to run Miri

# Are these changes tested?

tested the script locally

# Are there any user-facing changes?

No
…er (#9497)

# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Part of #9298.

# Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

While implementing `ListViewArrayDecoder` in arrow-json, I noticed we
could potentially retire `ArrayDataBuilder` inside `ListArrayDecoder`.
Therefore, I'd like to use a small PR here to make sure there's no
regression

# What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Replace `ArrayDataBuilder` with `GenericListArray` in `ListArrayDecoder`

# Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Covered by existing tests

# Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->

No
# Which issue does this PR close?

- Closes #9627.

# Rationale for this change

Adding benchmarks makes it easier to measure performance and evaluate
the impact of changes to the implementation. I also have a PR including
some significant improvements, but figured it's worth splitting it into
two parts. LMK if it's better to do that in one step.

# What changes are included in this PR?

Add a couple of utility functions to generate list and list_view arrays
without providing a seed

# Are these changes tested?

Benchmarks run locally, same setup as other benchmarks.

# Are there any user-facing changes?

No
Rich-T-kid and others added 30 commits June 4, 2026 13:48
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->
- Closes #10029.

# Rationale for this change
Increase the duplex buffer from 1 MB to 64 MB to eliminate artificial
back-pressure in the roundtrip benchmarks.
See the rationale in this
[comment](#10044 (comment))
<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?
Bumps `max_buf_size` to 64 MB.
<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?
n/a
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

If this PR claims a performance improvement, please include evidence
such as benchmark results.
-->

# Are there any user-facing changes?
n/a
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
# Which issue does this PR close?

- Part of #9110

# Rationale for this change

This prepares for the `59.0.0` (major) release of the Rust Arrow /
Parquet crates.

# What changes are included in this PR?

1. Update version to `59.0.0`
2. Update CHANGELOG. See rendered preview here:
https://github.com/alamb/arrow-rs/blob/alamb/make_release_59.0.0/CHANGELOG.md

# Are these changes tested?

By CI

# Are there any user-facing changes?

yes
# Which issue does this PR close?

- Issue raised in #9110

# Rationale for this change
Add a "bad_data" test for newly added file in parquet-testing

# What changes are included in this PR?

Adds a new test so the `bad_data` unit test doesn't fail.

# Are these changes tested?

Yes

# Are there any user-facing changes?

No, only tests
… comment (#10072)

# Which issue does this PR close?

Follow-up to #9972.

# Rationale for this change

A test comment added in #9972 described granular mode as writing "more
pages than `main`". As noted in [review
feedback](#9972 (comment)),
comparing to `main` is confusing now that the PR has merged — that code
*is* main. This rephrases the comment to compare against the default
batched path instead, which the same comment already references.

# What changes are included in this PR?

- Reword one test comment in
`test_arrow_writer_granular_mode_roundtrip`. No behavior change.

# Are there any user-facing changes?

No.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Which issue does this PR close?

- Closes #10080.

# Rationale for this change

`From` is already implemented for all other signed integer primitives,
ran into it working on decimal aggregations in DataFusion, which this
will make much simpler.

# What changes are included in this PR?

Adds an additional trait implementation for i256. I've also considered
deprecating `i256::from_i128` as a public function, but figured I'll see
what reviewers think.
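
As a rough illustration of what a lossless `From<i128>` conversion involves, the sketch below sign-extends an `i128` into a two-limb 256-bit value. `Wide256` and its fields are hypothetical stand-ins for this example and are not arrow's `i256` type.

```rust
/// Hypothetical 256-bit signed integer stored as two 128-bit limbs.
/// This is a stand-in for illustration, not arrow's `i256`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Wide256 {
    pub low: u128,
    pub high: i128,
}

impl Wide256 {
    /// Sign-extends `v`: the high limb is 0 for non-negative values
    /// and -1 (all bits set) for negative ones.
    pub fn from_i128(v: i128) -> Self {
        Self {
            low: v as u128,
            high: v >> 127, // arithmetic shift keeps the sign
        }
    }

    /// Returns the value as `i128` if it fits, `None` otherwise.
    pub fn to_i128(self) -> Option<i128> {
        let low = self.low as i128;
        // It fits only when the high limb is the sign extension of the low limb.
        (self.high == low >> 127).then_some(low)
    }
}

/// With this impl, `Wide256::from(x)` and `x.into()` work like they do
/// for the smaller signed integer types.
impl From<i128> for Wide256 {
    fn from(v: i128) -> Self {
        Self::from_i128(v)
    }
}
```

Because the conversion can never fail, `From` (rather than `TryFrom`) is the natural trait, and generic code bounded on `Into<Wide256>` can accept `i128` directly.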

# Are these changes tested?

Just exposes an additional path for existing functionality.

# Are there any user-facing changes?

No

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
…s` (#10089)

# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Spawn off from #9848
- Contributes to #9731

# Rationale for this change

The recursive `build_reader` / `build_*_reader` methods in the array
reader builder thread `field` and `mask` through every call.

# What changes are included in this PR?

Bundle them into a small `Copy` `ReaderArgs` struct so the recursive
signatures stay compact and there is a single, documented home for
per-field reader options added in the future. This is a mechanical,
behavior-preserving change: `build_array_reader` constructs the args at
the entry point, group readers recurse with `args.with_field(child)`,
and leaf readers read `args.field` and `args.mask`.
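
The pattern can be sketched in isolation as follows. Only the names `ReaderArgs`, `with_field`, `field`, and `mask` come from the description above; `SchemaNode`, the `&[bool]` mask, and `selected_leaves` are hypothetical simplifications, not the parquet crate's types.

```rust
/// Hypothetical schema node; a leaf carries its column index.
pub struct SchemaNode {
    pub name: &'static str,
    pub leaf_index: Option<usize>,
    pub children: Vec<SchemaNode>,
}

/// Arguments threaded through the recursion, bundled into one `Copy` value.
#[derive(Clone, Copy)]
pub struct ReaderArgs<'a> {
    pub field: &'a SchemaNode,
    pub mask: &'a [bool],
}

impl<'a> ReaderArgs<'a> {
    pub fn new(field: &'a SchemaNode, mask: &'a [bool]) -> Self {
        Self { field, mask }
    }

    /// Same options, different field: used when recursing into a child.
    pub fn with_field(self, field: &'a SchemaNode) -> Self {
        Self { field, ..self }
    }
}

/// Collects the names of the leaf columns selected by the mask.
pub fn selected_leaves(args: ReaderArgs<'_>) -> Vec<&'static str> {
    match args.field.leaf_index {
        // Leaf: consult the mask.
        Some(i) if args.mask.get(i).copied().unwrap_or(false) => vec![args.field.name],
        Some(_) => Vec::new(),
        // Group: recurse, swapping only the field.
        None => {
            let mut out = Vec::new();
            for child in &args.field.children {
                out.extend(selected_leaves(args.with_field(child)));
            }
            out
        }
    }
}
```

Adding a future per-field option would then mean adding one struct member rather than touching every recursive signature.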

# Are these changes tested?

All tests passing.

# Are there any user-facing changes?

No.
# Which issue does this PR close?

- Closes #9815.

# Rationale for this change

As noted in
#9813 (comment),
Rust debug builds panic on arithmetic overflow / underflow but release
builds do not (they simply overflow / underflow). This means that some
code paths may panic in debug builds that would have silently failed in
release builds.

As we harden the security posture of arrow-rs, I would like to start
testing in release mode too, to ensure overflows such as
#9813 can be properly validated.
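
To make the difference concrete, here is a minimal sketch of an offset computation that behaves the same in both profiles because it uses explicit checked arithmetic. The function name and error messages are hypothetical and are not taken from arrow-rs.

```rust
/// Computes `offset + len` and validates it against `buffer_len`.
///
/// A plain `offset + len` would panic on overflow in a debug build but wrap
/// around in a default release build, so the bounds check below could pass
/// with a bogus small value. `checked_add` reports the overflow in both.
pub fn checked_slice_end(offset: usize, len: usize, buffer_len: usize) -> Result<usize, String> {
    let end = offset
        .checked_add(len)
        .ok_or_else(|| format!("offset {offset} + len {len} overflows usize"))?;
    if end > buffer_len {
        return Err(format!("range end {end} exceeds buffer length {buffer_len}"));
    }
    Ok(end)
}
```

Running tests only in debug mode would never exercise the wrapping behaviour, which is why release-mode test jobs are useful for this class of bug.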

# What changes are included in this PR?

Add some new release mode tests: `linux-release-test:` et al.

# Are these changes tested?

They are only tests, no code changes

# Are there any user-facing changes?

No
#10014)

# Which issue does this PR close?

- Closes #10013
- Related to #6736

# Rationale for this change

`variant_get` / `variant_to_arrow` can already convert Variant values
into many native Arrow array layouts, but requesting
`DataType::Dictionary` or `DataType::RunEndEncoded` was not supported.

This PR adds support for those output encodings without changing Variant
shredding semantics. `Dictionary` and `RunEndEncoded` are produced as
Arrow result arrays only; they are not introduced as valid Parquet
Variant shredded `typed_value` layouts.

# What changes are included in this PR?

1. Adds an encoded output builder in `variant_to_arrow` for
`DataType::Dictionary` and `DataType::RunEndEncoded`.
2. Builds the logical child value array using the existing
Variant-to-Arrow builders, then delegates the final Dictionary/REE
encoding to Arrow's existing cast kernels.
3. Adds `variant_get` regression coverage for string dictionary, numeric
dictionary, and run-end encoded outputs.
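
For readers unfamiliar with the two output encodings, the std-only sketch below shows what each one represents once the logical values have been built. It is an illustration under simplifying assumptions (string values, no nulls, `i32` keys and run ends); the PR itself delegates the encoding to Arrow's cast kernels, and the function names here are hypothetical.

```rust
use std::collections::HashMap;

/// Dictionary encoding: each distinct value is stored once and rows
/// refer to it by key. Returns `(keys, dictionary)`.
pub fn dictionary_encode(values: &[&str]) -> (Vec<i32>, Vec<String>) {
    let mut index: HashMap<&str, i32> = HashMap::new();
    let mut dict: Vec<String> = Vec::new();
    let mut keys: Vec<i32> = Vec::with_capacity(values.len());
    for &v in values {
        let key = match index.get(v) {
            Some(&k) => k,
            None => {
                let k = dict.len() as i32;
                dict.push(v.to_string());
                index.insert(v, k);
                k
            }
        };
        keys.push(key);
    }
    (keys, dict)
}

/// Run-end encoding: consecutive equal values collapse into one run,
/// identified by the exclusive end index of the run.
/// Returns `(run_ends, run_values)`.
pub fn run_end_encode(values: &[&str]) -> (Vec<i32>, Vec<String>) {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut run_values: Vec<String> = Vec::new();
    for (i, &v) in values.iter().enumerate() {
        let end = i as i32 + 1;
        if run_values.last().map(String::as_str) == Some(v) {
            // Same value as the current run: extend it.
            *run_ends.last_mut().unwrap() = end;
        } else {
            run_values.push(v.to_string());
            run_ends.push(end);
        }
    }
    (run_ends, run_values)
}
```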

# Are these changes tested?

Yes:

- `cargo fmt --check`
- `cargo test -p parquet-variant-compute`
- `cargo test -p parquet-variant`
- `cargo clippy --workspace --all-targets`

# Are there any user-facing changes?

Yes. `variant_get` with `as_type` set to `DataType::Dictionary` or
`DataType::RunEndEncoded` can now return those Arrow array encodings.

Co-authored-by: Neetika Mittal <mneetika@users.noreply.github.com>
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Closes #10095.

# Rationale for this change

- check issue
<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?

- Replaced old .md issue templates with new .yaml templates following
DataFusion.
- `config.yaml` is new and has a convenient link to discussions, since we
got rid of the `question` template
<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?
n/a
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

If this PR claims a performance improvement, please include evidence
such as benchmark results.
-->

# Are there any user-facing changes?

- Yes, better UX for issue creation.
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
# Which issue does this PR close?

- Closes #10083 .

# Rationale for this change

Add benchmarks for list types with nested repetition levels:
- `list_nested`: List<List<Int32>>
- `list_struct_with_list`: List<Struct<a:Int32, b:Float32,
c:List<Int32>>>

These exercise the per-slot (non-batched) write path where
child_has_no_nested_rep() returns false, providing a baseline for future
optimizations.
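
As background on why nesting matters, the sketch below assigns Dremel-style repetition levels to the values of a `List<List<Int32>>` column. It assumes every list is non-null and non-empty, which sidesteps definition levels entirely; the function is a hypothetical illustration and not the parquet writer's code.

```rust
/// Repetition levels for `List<List<Int32>>`, assuming every outer and
/// inner list is non-null and non-empty (a simplification).
///
/// Level 0 starts a new row, level 1 starts a new inner list within the
/// same row, and level 2 continues the current inner list.
pub fn repetition_levels(rows: &[Vec<Vec<i32>>]) -> Vec<u8> {
    let mut levels = Vec::new();
    for row in rows {
        for (i, inner) in row.iter().enumerate() {
            for j in 0..inner.len() {
                let level = if j > 0 {
                    2
                } else if i > 0 {
                    1
                } else {
                    0
                };
                levels.push(level);
            }
        }
    }
    levels
}
```

With a single list level only 0 and 1 occur; the extra level is what a benchmark over nested lists exercises.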

# What changes are included in this PR?

Add some benchmarks

# Are these changes tested?

They're already tests

# Are there any user-facing changes?

No

Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
…ve (#10025)

# Which issue does this PR close?

- Closes #10022.

# Rationale for this change

Optimize interleave_list when child is primitive type.

# What changes are included in this PR?

1. Special path when child is primitive type.
2. new `interleave_list_primitive_child` function
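
The idea behind a primitive-child fast path can be sketched as follows: when the child values are plain fixed-width numbers, each selected list can be copied as one contiguous slice. `ListOfI32` and `interleave_lists` are hypothetical names for this sketch, and null handling is omitted.

```rust
/// Hypothetical list array: `offsets[i]..offsets[i + 1]` indexes row `i`
/// in the flat `values` buffer. Nulls are ignored in this sketch.
pub struct ListOfI32 {
    pub offsets: Vec<usize>,
    pub values: Vec<i32>,
}

/// Builds a new list array by picking `(array_index, row_index)` pairs.
/// Child values are copied slice by slice instead of element by element.
pub fn interleave_lists(arrays: &[&ListOfI32], indices: &[(usize, usize)]) -> ListOfI32 {
    // First pass: size the output so the value buffer is allocated once.
    let total: usize = indices
        .iter()
        .map(|&(a, r)| arrays[a].offsets[r + 1] - arrays[a].offsets[r])
        .sum();
    let mut offsets = Vec::with_capacity(indices.len() + 1);
    let mut values = Vec::with_capacity(total);
    offsets.push(0);
    // Second pass: bulk-copy each selected row's child slice.
    for &(a, r) in indices {
        let src = arrays[a];
        values.extend_from_slice(&src.values[src.offsets[r]..src.offsets[r + 1]]);
        offsets.push(values.len());
    }
    ListOfI32 { offsets, values }
}
```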

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

no
…ow selectivity filters and inlined Utf8View/BinaryView (#9755)

## Summary
- fuse the sparse inline `BinaryView` filter and coalescing paths so
primitive columns and inline views can be appended directly without
materialising an intermediate filtered `RecordBatch`
- reuse optimised filter indices and null-mask handling for coalescing,
while preserving the existing fallback paths for dense and non-inline
`BinaryView` inputs
- add focused tests and benchmarks for single-column and mixed
`BinaryView` filter cases related to `#9143`
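
A much-simplified sketch of the fused idea follows. In the Arrow view layout each view is 16 bytes: a 4-byte length followed, for values of at most 12 bytes, by the bytes themselves. Such inline views are self-contained, so selected rows can be appended straight to the output without building an intermediate filtered array. The function names are hypothetical, and nulls and non-inline views are left out.

```rust
/// Packs `bytes` into an inline view (little-endian length in the low
/// 4 bytes, data in the next 12). Returns `None` if it does not fit inline.
pub fn inline_view(bytes: &[u8]) -> Option<u128> {
    if bytes.len() > 12 {
        return None;
    }
    let mut raw = [0u8; 16];
    raw[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    raw[4..4 + bytes.len()].copy_from_slice(bytes);
    Some(u128::from_le_bytes(raw))
}

/// Appends the views selected by `filter` directly to `out`.
/// No intermediate filtered array is materialised.
pub fn append_filtered(views: &[u128], filter: &[bool], out: &mut Vec<u128>) {
    for (&view, &keep) in views.iter().zip(filter) {
        if keep {
            out.push(view);
        }
    }
}
```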

## Verification
- `cargo test -p arrow-select coalesce --lib`
- `cargo clippy -p arrow-select --lib --tests -- -D warnings`
- `cargo clippy -p arrow --bench coalesce_kernels --features test_utils
-- -D warnings`
- `cargo bench -p arrow --bench coalesce_kernels --features test_utils
-- --noplot single_binaryview`
- `cargo bench -p arrow --bench coalesce_kernels --features test_utils
-- --noplot mixed_binaryview`

## Benchmark Results
Measured against a clean `origin/main` worktree with the same
`BinaryView` benchmark additions. The figures below compare
representative median times from the baseline worktree and this branch.

### Mixed primitive + BinaryView
- `mixed_binaryview (max_string_len=8), 8192, nulls: 0, selectivity:
0.001`: `23.16 ms` -> `8.51 ms`
- `mixed_binaryview (max_string_len=8), 8192, nulls: 0, selectivity:
0.01`: `2.37 ms` -> `1.31 ms`
- `mixed_binaryview (max_string_len=8), 8192, nulls: 0.1, selectivity:
0.001`: `31.70 ms` -> `14.33 ms`
- `mixed_binaryview (max_string_len=8), 8192, nulls: 0.1, selectivity:
0.01`: `3.92 ms` -> `2.44 ms`

### Single BinaryView
- `single_binaryview, 8192, nulls: 0, selectivity: 0.01`: `4.86 ms` ->
`4.90 ms` (roughly flat, slightly slower)
- `single_binaryview (max_string_len=8), 8192, nulls: 0, selectivity:
0.001`: `34.72 ms` -> `19.33 ms`
- `single_binaryview (max_string_len=8), 8192, nulls: 0, selectivity:
0.01`: `3.46 ms` -> `2.03 ms`
- `single_binaryview (max_string_len=8), 8192, nulls: 0.1, selectivity:
0.01`: `5.93 ms` -> `3.97 ms`
- `single_binaryview (max_string_len=8), 8192, nulls: 0, selectivity:
0.8`: `597 µs` -> `619 µs` (regression)
- `single_binaryview (max_string_len=8), 8192, nulls: 0.1, selectivity:
0.8`: `1.78 ms` -> `1.79 ms` (roughly flat, slightly slower)

In short, this change substantially improves the mixed primitive +
inline `BinaryView` path that motivated `#9143`, while the single-column
`BinaryView` benchmarks still show trade-offs: sparse inline cases
improve, but dense inline cases are slightly slower and the non-inline
single-column path is effectively unchanged.

Closes #9143.

---------

Signed-off-by: cl <cailue@apache.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Closes #10029.
[A document that provides a bit of
context](https://github.com/user-attachments/files/28477762/Arrow.flight.speed.up.2.pdf)


# Rationale for this change
Compression is the most compute and memory intensive part of the
arrow-ipc encoding pipeline. It runs per buffer, not per record batch.
For a Flight stream of 10 batches with 5 primitive arrays each, that is
100 compression calls minimum, [more for string and struct
arrays](https://arrow.apache.org/docs/format/Columnar.html#compression).
Each of those calls produced an owned compressed Vec that was then
copied a second time into a flat arrow_data accumulator before being
written to the output. For the uncompressed path the situation was the
same: Arc-backed buffer slices that required no compression were still
copied into that accumulator unnecessarily.

Separately, the original **write_message()** function flushed after
every dictionary and every record batch, causing repeated small OS write
calls per batch (**for non-vector-backed writer implementations**).
The goal was to eliminate both problems: stop copying buffers that do
not need to be copied, and stop flushing on every message.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

- Introduced EncodedBuffer, an enum that wraps either a raw Arc-backed
Buffer for the uncompressed path or an owned Vec for the compressed
path, so both can be held in a uniform collection without an extra copy
into a flat accumulator
- Changed write_array_data to push EncodedBuffer segments instead of
copying bytes into arrow_data
- FileWriter and StreamWriter both now call **write_batch_direct()**,
eliminating the flush-per-message behavior and the intermediate copy on
the hot path
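
A std-only sketch of the shape described above is shown here. Only the name `EncodedBuffer` comes from this description; the variant names, the use of `Arc<[u8]>` in place of arrow's `Buffer`, and `write_segments` are assumptions made for illustration.

```rust
use std::io::{self, Write};
use std::sync::Arc;

/// Either shared, reference-counted bytes (uncompressed path) or bytes
/// owned by the encoder (compressed path).
pub enum EncodedBuffer {
    Shared(Arc<[u8]>),
    Owned(Vec<u8>),
}

impl EncodedBuffer {
    pub fn as_bytes(&self) -> &[u8] {
        match self {
            EncodedBuffer::Shared(b) => &b[..],
            EncodedBuffer::Owned(v) => v.as_slice(),
        }
    }
}

/// Writes every segment straight to `out` and flushes once at the end,
/// instead of first copying all segments into one flat accumulator.
pub fn write_segments<W: Write>(out: &mut W, segments: &[EncodedBuffer]) -> io::Result<usize> {
    let mut written = 0;
    for segment in segments {
        let bytes = segment.as_bytes();
        out.write_all(bytes)?;
        written += bytes.len();
    }
    out.flush()?;
    Ok(written)
}
```

Holding the uncompressed case as a reference-counted handle means queuing a buffer for output costs a pointer copy rather than a byte copy.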

# Are these changes tested?
These changes are intended to be completely seamless. I didn't write new
unit tests for the code, as nothing changed externally. All tests still
pass.
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

If this PR claims a performance improvement, please include evidence
such as benchmark results.
-->
## benchmarks
[**main** -> `cargo bench --bench ipc_writer -- "StreamWriter/write_10$"
--sample-size 100`]
[**my branch** -> `cargo bench --bench ipc_writer --
"StreamWriter/write_10$" --sample-size 100` ]
<img width="1832" height="982" alt="Image 6-1-26 at 3 19 PM"
src="https://github.com/user-attachments/assets/8e6253a4-8a53-4d03-bdab-d0321edc2561"
/>


[**main** -> `cargo bench --bench ipc_writer -- --sample-size 1000`]
[**my branch** -> `cargo bench --bench ipc_writer -- --sample-size
1000`]
<img width="1944" height="1000" alt="Image 6-1-26 at 3 20 PM"
src="https://github.com/user-attachments/assets/dc8015e8-ed60-487c-aa66-06f5d35499fe"
/>


# Are there any user-facing changes?
no
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

Resolves this
#10044 (comment)
from #10044

# Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->
Code in this file is hard to navigate and it's unclear what is happening.
# What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->
This PR introduces `IpcMetadataBuilder`, a struct that groups the nodes
and buffers vecs previously passed separately into `write_array_data()`,
and removes the redundant `num_rows`/`null_count` parameters by deriving
them from `array_data` directly. Together these reduce
`write_array_data()` from 10 arguments to 7, eliminating the
`#[allow(clippy::too_many_arguments)]` suppression. Doc comments are
added to clarify the two-channel output model between
`IpcMetadataBuilder` (flatbuffer header metadata) and `IpcBodySink` (raw
Arrow data bytes).
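
The refactoring pattern can be sketched on its own as below. Every type and method here (`ArraySketch`, `FieldNode`, `MetadataBuilder`, `push_array`) is a hypothetical simplification, not the arrow-ipc implementation; the point is only that the node and buffer lists travel together and that length and null count are derived from the array instead of being passed in.

```rust
/// Hypothetical array: one validity flag per row.
pub struct ArraySketch {
    pub validity: Vec<bool>,
}

/// Per-array header entry.
pub struct FieldNode {
    pub length: usize,
    pub null_count: usize,
}

/// Groups the two header lists that were previously separate parameters.
#[derive(Default)]
pub struct MetadataBuilder {
    pub nodes: Vec<FieldNode>,
    /// `(offset, length)` of each buffer within the message body.
    pub buffers: Vec<(usize, usize)>,
}

impl MetadataBuilder {
    /// Records one array. Length and null count come from `data` itself,
    /// so callers cannot pass values that disagree with the array.
    pub fn push_array(&mut self, data: &ArraySketch, body_offset: usize, body_len: usize) {
        let null_count = data.validity.iter().filter(|valid| !**valid).count();
        self.nodes.push(FieldNode {
            length: data.validity.len(),
            null_count,
        });
        self.buffers.push((body_offset, body_len));
    }
}
```
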
# Are these changes tested?
yes
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

If this PR claims a performance improvement, please include evidence
such as benchmark results.
-->

# Are there any user-facing changes?
no
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
…es (#10110)

# Which issue does this PR close?

- Closes #10092.

# Rationale for this change

check issue

# What changes are included in this PR?

- Add two `field_` APIs symmetric to `column_` ones.
- Reuse `Fields::find` in `column_by_name` to avoid a `Vec` alloc.
- Fix doc pointing to an old Jira issue. Now points to #9205
- `MapArray::entries_fields` avoid a `Vec` alloc.
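
A minimal sketch of symmetric name-based lookups that share one allocation-free search is shown below. `FieldMeta`, `StructColumns`, and the method bodies are hypothetical and only mirror the idea; they are not the `StructArray` implementation.

```rust
/// Hypothetical field description.
#[derive(Debug, PartialEq)]
pub struct FieldMeta {
    pub name: String,
    pub nullable: bool,
}

/// Hypothetical struct array: `fields[i]` describes `columns[i]`.
pub struct StructColumns {
    fields: Vec<FieldMeta>,
    columns: Vec<Vec<i64>>,
}

impl StructColumns {
    pub fn new(fields: Vec<FieldMeta>, columns: Vec<Vec<i64>>) -> Self {
        assert_eq!(fields.len(), columns.len(), "one column per field");
        Self { fields, columns }
    }

    /// Linear search over the fields; no temporary `Vec` of names is built.
    fn find(&self, name: &str) -> Option<usize> {
        self.fields.iter().position(|f| f.name == name)
    }

    /// Data lookup by name.
    pub fn column_by_name(&self, name: &str) -> Option<&[i64]> {
        self.find(name).map(|i| self.columns[i].as_slice())
    }

    /// Metadata lookup by name, symmetric to `column_by_name`.
    pub fn field_by_name(&self, name: &str) -> Option<&FieldMeta> {
        self.find(name).map(|i| &self.fields[i])
    }
}
```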

# Are these changes tested?

- Yes, unit tests

# Are there any user-facing changes?

New `StructArray` APIs.
…bility (#10115)

# Which issue does this PR close?

N/A

# Rationale for this change

Although `Bytes` and some of its functions are marked as `pub`, it is
never exposed outside the crate.
This change updates the visibility so that reading the code is less confusing.

# What changes are included in this PR?

changed `pub` to `pub(crate)` in `Bytes` impl
 
# Are these changes tested?

existing tests

# Are there any user-facing changes?

no since it was never exposed anyway
…ption<Vec<(Key, Option<Value>)>>>` for tests (#10123)

# Which issue does this PR close?

N/A

# Rationale for this change

Writing tests that use `MapArray` currently requires a very
verbose way of building the `MapArray` with the specific values you want.

Adding this helper will allow arrow tests and user tests to be
cleaner.

# What changes are included in this PR?

Added the function and updated some of the tests in the repo that use
`MapBuilder` (excluding those that test the builder itself) to use the
new method, showcasing how much cleaner it looks.
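
To show what such a helper has to do, the sketch below flattens nested rows of the shape named in the title into the offsets, keys, values, and validity that a map layout needs. `FlatMap` and `flatten_map_rows` are hypothetical, with `&str` keys and `i32` values chosen only for the example; this is not the function added by the PR.

```rust
/// Hypothetical flattened map layout: row `i` owns the entries
/// `offsets[i]..offsets[i + 1]` of `keys` and `values`.
#[derive(Debug, PartialEq)]
pub struct FlatMap {
    pub offsets: Vec<i32>,
    pub keys: Vec<String>,
    pub values: Vec<Option<i32>>,
    pub validity: Vec<bool>,
}

/// Flattens `rows`, where `None` is a null map and each entry is a
/// `(key, optional value)` pair.
pub fn flatten_map_rows(rows: Vec<Option<Vec<(&str, Option<i32>)>>>) -> FlatMap {
    let mut flat = FlatMap {
        offsets: vec![0],
        keys: Vec::new(),
        values: Vec::new(),
        validity: Vec::new(),
    };
    for row in rows {
        flat.validity.push(row.is_some());
        // A null row contributes no entries, so its offset repeats.
        for (key, value) in row.unwrap_or_default() {
            flat.keys.push(key.to_string());
            flat.values.push(value);
        }
        flat.offsets.push(flat.keys.len() as i32);
    }
    flat
}
```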

# Are these changes tested?
yes

# Are there any user-facing changes?

new function
# Which issue does this PR close?

N/A

# Rationale for this change

`MapArray` is not very different from `ListArray`, which is already
supported in the `lengths` kernel

# What changes are included in this PR?

Added `MapArray` support and tests
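
The similarity comes from the shared offsets layout: for both lists and maps, the length of row `i` is the difference between two adjacent offsets. A minimal sketch, with a hypothetical free function standing in for the kernel:

```rust
/// Per-row entry counts from an offsets buffer; null rows yield `None`.
/// `offsets` has one more element than `validity`.
pub fn lengths(offsets: &[i32], validity: &[bool]) -> Vec<Option<i32>> {
    offsets
        .windows(2)
        .zip(validity)
        .map(|(pair, &valid)| valid.then(|| pair[1] - pair[0]))
        .collect()
}
```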

# Are these changes tested?
yes

# Are there any user-facing changes?

`lengths` now supports `MapArray`
# Which issue does this PR close?

- Closes #10047 .

# Rationale for this change

Implement concat for map

# What changes are included in this PR?

Implement concat for map
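
The core bookkeeping when concatenating offset-based arrays such as maps is rebasing each input's offsets onto the end of the output so far. The sketch below shows only that step, under the assumption that the entries themselves are concatenated separately; `concat_offsets` is a hypothetical name.

```rust
/// Concatenates offsets buffers. Each input may start at a non-zero
/// offset (for example a sliced array), so offsets are rebased relative
/// to the input's first offset and shifted by the running total.
pub fn concat_offsets(parts: &[&[i32]]) -> Vec<i32> {
    let mut out = vec![0i32];
    for offsets in parts {
        let base = *out.last().unwrap();
        let first = offsets.first().copied().unwrap_or(0);
        out.extend(offsets.iter().skip(1).map(|&o| base + (o - first)));
    }
    out
}
```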

# Are these changes tested?

Yes

# Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…10015)

# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Closes #8420.

# Rationale for this change

Shredding into `FixedSizeBinary(16)` means we're shredding into `UUID`
Parquet logical type. `shred_variant` currently doesn't preserve
extension type metadata for the typed value field.

UUID is the only valid `Variant` shredding type that requires an arrow
extension type.
https://github.com/apache/parquet-format/blob/master/VariantShredding.md

Earlier in
[#](#8665 (comment))
@scovich mentioned:

> Yeah, as long as `shred_variant` only takes a `DataType` instead of a
`Field`, we are forced to assume 16-byte fixed binary is UUID. If it
accepted a `Field`, we should additionally require the UUID extension
type. Otherwise, we potentially run into problems because Decimal128 can
_also_ use 16-byte fixed binary!

This is an argument proposing to use `Field` instead of `DataType` for
`as_type` parameter in `shred_variant`. This should not be an issue
because arrow has a `Decimal128Type` to represent `Decimal128` logical
Parquet type. This way there's no ambiguity in using
`FixedSizeBinary(16)` arrow type to represent `UUID`. Switching
`as_type` to `Field` is unnecessary.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?

- `VariantArray::from_parts/ShreddedVariantFieldArray::from_parts` now
add `UUID` extension type metadata to the typed_value `Field` if
`DataType` is `FixedSizeBinary(16)`
- Uncommented `UUID` extension part metadata validation in a unit test.
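
The rule in the first bullet can be sketched as follows. `ARROW:extension:name` is the standard Arrow field-metadata key for extension types and `arrow.uuid` is the canonical UUID extension name; the `DataType` enum, `TypedField` struct, and `typed_value_field` function are hypothetical simplifications for this example.

```rust
use std::collections::HashMap;

/// Hypothetical, reduced set of data types.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Utf8,
    FixedSizeBinary(i32),
}

/// Hypothetical field with string metadata.
pub struct TypedField {
    pub name: String,
    pub data_type: DataType,
    pub metadata: HashMap<String, String>,
}

/// Builds the `typed_value` field, tagging 16-byte fixed binary as UUID.
pub fn typed_value_field(data_type: DataType) -> TypedField {
    let mut metadata = HashMap::new();
    if data_type == DataType::FixedSizeBinary(16) {
        metadata.insert(
            "ARROW:extension:name".to_string(),
            "arrow.uuid".to_string(),
        );
    }
    TypedField {
        name: "typed_value".to_string(),
        data_type,
        metadata,
    }
}
```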

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?

- Yes, unit test.
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

If this PR claims a performance improvement, please include evidence
such as benchmark results.
-->

# Are there any user-facing changes?

- Shredded `UUID` typed value fields now preserve `UUID` extension type
metadata.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->

---------

Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
# Which issue does this PR close?

- Closes #NNN.

# Rationale for this change

Miri currently takes just under an hour to run, with most of it being
the actual tests.

# What changes are included in this PR?

This PR modifies the script that runs miri to optionally use nextest's
[partitioning](https://nexte.st/docs/ci-features/partitioning/) feature,
and makes use of it in CI with 4 partitions. This should reduce the
overall miri runtime to just over 15 minutes with a minimal increase in
CI resource usage.

This is also scalable if the number of tests keeps increasing, changing
the number of partitions is trivial, picking 4 here is an arbitrary
choice.

# Are these changes tested?

Tested the script locally.

# Are there any user-facing changes?

No
# Which issue does this PR close?

None, just a dependency update.

# Rationale for this change

pyo3 has a security vulnerability:
https://rustsec.org/advisories/RUSTSEC-2026-0176.html

This PR updates to 0.29 to resolve this vulnerability.

# What changes are included in this PR?

Update all crates that use the pyo3 dependency to 0.29

# Are these changes tested?

Updated and run against existing integration test suite.

# Are there any user-facing changes?

No

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Which issue does this PR close?
This PR works towards an initial solution closing #8016 
<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Closes #8016.

# Rationale for this change      
Currently `arrow_writer` does not support writing Run End Encoded
columns out to parquet. This PR works towards solving this by first
expanding out the REE to its value type & then writing out to parquet.
Once it's possible to write REE to parquet we can work on optimizing it
by keeping the compact representation intact.
<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?
`arrow_writer()` now supports writing Run End Encoded (REE) arrays to
Parquet by hydrating them to their underlying value type before
encoding. This is an initial, correctness-first implementation. A
follow-up can/should optimize to preserve the compacted structure.

**parquet/src/arrow/arrow_writer/mod.rs**: generate a value-type
arrow-column writer & test
**parquet/src/arrow/arrow_writer/levels.rs**: core writer logic updated
to detect REE columns and expand them to their flat value type before
the existing write path.
**parquet/src/arrow/schema/mod.rs**: schema conversion updated to map
RunEndEncodedType to an appropriate Parquet physical type.
**parquet/benches/arrow_writer.rs**: REE write benchmarks added with low
and high null density scenarios, now unblocked by the implementation.
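
"Hydrating" here means expanding each run back into repeated flat values. A std-only sketch of that expansion follows, assuming valid, strictly increasing run ends; `hydrate` is a hypothetical helper and not the writer's actual code path.

```rust
/// Expands a run-end encoded column into its flat values.
/// `run_ends[i]` is the exclusive end index of the run holding `values[i]`.
pub fn hydrate<T: Clone>(run_ends: &[i32], values: &[T]) -> Vec<T> {
    let total = run_ends.last().copied().unwrap_or(0).max(0) as usize;
    let mut out = Vec::with_capacity(total);
    let mut start = 0i32;
    for (&end, value) in run_ends.iter().zip(values) {
        // Repeat the run's value once per logical row it covers.
        for _ in start..end {
            out.push(value.clone());
        }
        start = end;
    }
    out
}
```

The cost of this correctness-first approach is that the output grows to the logical row count, which is the overhead a later optimisation would aim to avoid.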

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?
Yes
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

If this PR claims a performance improvement, please include evidence
such as benchmark results.
-->

# Are there any user-facing changes?
Users will be able to write their REE columns out to parquet using
`arrow_writer`
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
# Which issue does this PR close?

- Closes #10093.

# Rationale for this change

check issue

# What changes are included in this PR?

- Rename existing `_field` APIs that return `&ArrayRef` to `_column`
- Add new `_field` APIs that return `&FieldRef` and tests for them

# Are these changes tested?

- Yes, unit tests
# Are there any user-facing changes?

- Yes, breaking API name change.
…nversion (#10065)

# Which issue does this PR close?


- Closes #9929.

# Rationale for this change

There were several issues with conversion identified when I tried to
integrate this in SedonaDB and that came to light when the spec was
recently clarified.

I am sorry for missing these changes when I reviewed the initial
implementation.

# What changes are included in this PR?

- A Parquet crs of `None` for geometry or geography is now converted to
a GeoArrow CRS of `"OGC:CRS84"` (the named value for the default CRS in
the Parquet spec)
- A Parquet crs of `"srid:0"` is now converted to a GeoArrow "omitted"
CRS. This was recently clarified in the Parquet spec (srid:0 is a named
example in the list of allowed values).
- A GeoArrow missing CRS is now encoded as `"srid:0"`
- A GeoArrow CRS that is "lonlat-like" is now encoded as a Parquet crs
of `None`. This logic was included in the previous implementation but
was reversed (Parquet CRSes that looked like lonlat were omitted when
written to GeoArrow, which is not correct).
- The GeoArrow metadata struct uses the name `"algorithm"` and was
serializing it to JSON. The GeoArrow spec uses the `"edges"` key. This
led to invalid metadata being generated which was either rejected or
incorrectly interpreted by consumers.
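
The CRS rules in the first four bullets can be summarised as a pair of mappings. The sketch below models both sides as `Option<&str>`, with `None` meaning "no CRS present". The function names are hypothetical, and `is_lonlat_like` is deliberately reduced to a single string comparison, whereas a real check would likely accept more spellings.

```rust
/// Simplified stand-in for a "lonlat-like" check (assumption: only the
/// exact string "OGC:CRS84" is recognised here).
fn is_lonlat_like(crs: &str) -> bool {
    crs == "OGC:CRS84"
}

/// Parquet logical type CRS -> GeoArrow CRS.
pub fn parquet_to_geoarrow(crs: Option<&str>) -> Option<String> {
    match crs {
        // Parquet's default CRS is made explicit on the GeoArrow side.
        None => Some("OGC:CRS84".to_string()),
        // "srid:0" means unknown, which GeoArrow expresses by omission.
        Some("srid:0") => None,
        Some(other) => Some(other.to_string()),
    }
}

/// GeoArrow CRS -> Parquet logical type CRS.
pub fn geoarrow_to_parquet(crs: Option<&str>) -> Option<String> {
    match crs {
        // A missing GeoArrow CRS is encoded as "srid:0".
        None => Some("srid:0".to_string()),
        // Lon/lat is Parquet's default, so it is left unset.
        Some(c) if is_lonlat_like(c) => None,
        Some(other) => Some(other.to_string()),
    }
}
```

With these rules the two mappings invert each other for the default and the unknown cases, so a round trip through either format preserves the meaning.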

# Are these changes tested?

Yes. I added high-level end-to-end LogicalType <-> extension metadata
tests, since that is what matters (there were a few lower level tests
that I updated as well).
# Which issue does this PR close?

- Closes #10119

# Rationale for this change

This PR adds writer benchmarks for dictionaries so that we can measure
the performance impact of code changes on those code paths.

# What changes are included in this PR?

Three new benchmarks:

- StreamWriter benchmark for dictionaries
- StreamWriter benchmark for delta dictionaries
- FileWriter benchmark for delta dictionaries

# Are these changes tested?

Yes, just benchmarks included which I ran locally.

# Are there any user-facing changes?

No.
# Which issue does this PR close?

Follow-up while reviewing #10044.

# Rationale for this change

While reviewing #10044 (which reworks the IPC writer's buffer handling),
I found that the **compressed `IpcDataGenerator::encode` path is not
exercised by any test in the repository**.


# What changes are included in this PR?

This PR adds that missing coverage.

# Are these changes tested?

This PR is test-only.

# Are there any user-facing changes?

No.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…z)> (#10099)

# Which issue does this PR close?

N/A — test-only coverage improvement.

# Rationale for this change

While reviewing #10025 (which adds a primitive-child fast path to
`interleave_list`), I noticed `interleave` over list arrays had no test
for a primitive child that carries logical type parameters —
`Decimal128`/`Decimal256` precision & scale, or timezone-aware
`Timestamp`.

# What changes are included in this PR?

Two new tests in `arrow-select/src/interleave.rs`, each parameterized
over `i32`/`i64` offsets:

# Are these changes tested?

This PR is the tests. They pass on `main`, and also guard the fast path
added in #10025.

# Are there any user-facing changes?

No.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ry focused benchmarks (#10126)

# Which issue does this PR close?


Part of

- #10125

# Rationale for this change

Going through the arrow-flight codebase, I noticed that
`DictionaryHandling` defaults to `Hydrate`, which expands dictionary
arrays out to their logical form. In other words, when the variant is
`Hydrate`, `arrow-ipc::IpcDataGenerator::encode_all_dicts()` never
actually runs.
This matters because of the arrow-ipc work that @alamb, @JakeDern, and
I have been doing. [Efforts are being made to
optimize](#10044 (comment))
arrow-ipc's use of dictionaries. This PR makes those changes
visible through arrow-flight benchmarks.
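
As a rough illustration of what "expanding to the logical form" means, here is a plain-Rust sketch. It is not arrow-flight's implementation: `hydrate` is a hypothetical helper, and a dictionary array is modeled simply as a slice of keys indexing into a slice of values.

```rust
/// Hypothetical helper: expand dictionary-encoded data to its logical form.
/// `keys[i]` is an index into `values`; returns None if any key is out of range.
fn hydrate<T: Clone>(keys: &[usize], values: &[T]) -> Option<Vec<T>> {
    keys.iter().map(|&k| values.get(k).cloned()).collect()
}
```

Once the data has been expanded like this, no dictionary is left to send, which is why the dictionary-encoding path is skipped in that mode.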
# What changes are included in this PR?
This PR adds a benchmark for arrow-flight's `do_put` endpoint using
dictionary arrays, measuring the latency difference between the two
DictionaryHandling variants.

# Are these changes tested?
The changes are benchmarks.

# Are there any user-facing changes?
No.
…mp (#9825)

Instead of panicking (in debug builds) or wrapping, check for overflow,
as other conversions do (e.g. Timestamp -> Timestamp).
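
A minimal sketch of the idea, assuming the conversion scales an `i64` by a constant factor. `scale_checked` is a hypothetical name and is not the function changed in this PR. It uses the standard `i64::checked_mul`, which returns `None` on overflow instead of panicking or wrapping.

```rust
/// Hypothetical helper: scale a value by `factor`, reporting overflow as an error
/// rather than panicking (debug builds) or wrapping (release builds).
fn scale_checked(value: i64, factor: i64) -> Result<i64, String> {
    value
        .checked_mul(factor)
        .ok_or_else(|| format!("overflow converting {value} with factor {factor}"))
}
```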

# Which issue does this PR close?

- Closes #9824.

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>