update 57.2.0 to 57.3.0 #15
Open
mandrush wants to merge 430 commits into
# Which issue does this PR close?

- Closes #9378

# Rationale for this change

The optimizations listed in the issue description:

- Align to 8 bytes
- Don't try to return a buffer with bit_offset 0, but round it to a multiple of 64
- Use `chunks_exact` for the fallback path

# What changes are included in this PR?

When both inputs share the same sub-64-bit alignment (`left_offset % 64 == right_offset % 64`), the optimized path is used. This covers the common cases (both offset 0, both sliced equally, etc.). The BitChunks fallback is retained only when the two offsets have different sub-64-bit alignment.

# Are these changes tested?

Yes, the tests are updated and included.

# Are there any user-facing changes?

Yes, this is a minor breaking change to `from_bitwise_binary_op`:

- The returned BooleanBuffer may now have a non-zero offset (previously always 0)
- The returned BooleanBuffer may have padding bits set outside the logical range in `values()`

---------

Signed-off-by: Kunal Singh Dadhwal <kunalsinghdadhwal@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
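As a rough illustration of the alignment check and the `chunks_exact` word loop described above, here is a minimal sketch in plain Rust. It is not the arrow-buffer implementation: the function names `same_sub_word_alignment`, `word`, and `bitwise_binary_bytes` are hypothetical, and the sketch works on whole bytes rather than on bit-offset buffers.

```rust
/// Returns true when two bit offsets sit at the same position inside a 64-bit word,
/// which is the condition quoted in the PR description for taking the optimized path.
fn same_sub_word_alignment(left_offset: usize, right_offset: usize) -> bool {
    left_offset % 64 == right_offset % 64
}

/// Reads exactly 8 bytes as a little-endian u64 (hypothetical helper).
fn word(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    u64::from_le_bytes(buf)
}

/// Applies `op` to two equally long byte slices, 8 bytes at a time,
/// then handles the trailing bytes that do not fill a whole word.
fn bitwise_binary_bytes<F>(left: &[u8], right: &[u8], op: F) -> Vec<u8>
where
    F: Fn(u64, u64) -> u64,
{
    assert_eq!(left.len(), right.len());
    let mut out = Vec::with_capacity(left.len());
    let mut l_chunks = left.chunks_exact(8);
    let mut r_chunks = right.chunks_exact(8);
    for (l, r) in l_chunks.by_ref().zip(r_chunks.by_ref()) {
        out.extend_from_slice(&op(word(l), word(r)).to_le_bytes());
    }
    // The remainder is shorter than 8 bytes, so process it byte by byte.
    for (&l, &r) in l_chunks.remainder().iter().zip(r_chunks.remainder()) {
        out.push(op(u64::from(l), u64::from(r)) as u8);
    }
    out
}
```

For example, `bitwise_binary_bytes(&a, &b, |x, y| x & y)` computes a byte-wise AND, and offsets such as 3 and 67 would qualify for the aligned path because both are 3 modulo 64.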
# Which issue does this PR close?

None

# Rationale for this change

We want to use the SBBF Bloom Filter, but need to construct/serialize it manually. Currently there is no way to create a new `Sbbf` outside of this crate. Alongside this, we want to store the `Sbbf` in a `FixedSizeBinary` column for some fancy indexing.

# What changes are included in this PR?

Some methods become public.

# Are these changes tested?

N/A

# Are there any user-facing changes?

Yes, we add a few more public methods to the `Sbbf` struct.
# Which issue does this PR close?

- Closes #NNN.

# Rationale for this change

Rust implementation of apache/arrow#45360

Traditional Parquet writing splits data pages at fixed sizes, so a single inserted or deleted row causes all subsequent pages to shift, and nearly every byte has to be re-uploaded to content-addressable storage (CAS) systems. CDC determines page boundaries via a rolling gearhash over column values, so unchanged data produces identical pages across different writes, enabling storage cost reductions and faster upload times.

See more details in https://huggingface.co/blog/parquet-cdc

The original C++ implementation: apache/arrow#45360

Evaluation tool: https://github.com/huggingface/dataset-dedupe-estimator, where I already integrated this PR to verify that deduplication effectiveness is on par with parquet-cpp (lower is better):

<img width="984" height="411" alt="image" src="https://github.com/user-attachments/assets/e6e80931-ac76-4bdd-bf9c-ba7e06559411" />

# What changes are included in this PR?

- **Content-defined chunker** at `parquet/src/column/chunker/`
- **Arrow writer integration** in `ArrowColumnWriter`
- **Writer properties** via `CdcOptions` struct (`min_chunk_size`, `max_chunk_size`, `norm_level`)
- **ColumnDescriptor**: added `repeated_ancestor_def_level` field for iterating over nested field values

# Are these changes tested?

Yes. Unit tests are located in `cdc.rs` and ported from the C++ implementation.

# Are there any user-facing changes?

New **experimental** API, disabled by default, so there is no behavior change for existing code:

```rust
// Simple toggle (256 KiB min, 1 MiB max, norm_level 0)
let props = WriterProperties::builder()
    .set_content_defined_chunking(true)
    .build();

// Explicit CDC parameters
let props = WriterProperties::builder()
    .set_cdc_options(CdcOptions {
        min_chunk_size: 128 * 1024,
        max_chunk_size: 512 * 1024,
        norm_level: 1,
    })
    .build();
```

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
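To make the rolling-gearhash idea above more concrete, here is a minimal, std-only sketch of content-defined chunking over a byte slice. It is only an illustration under stated assumptions, not the chunker added in this PR: the names `gear_table` and `chunk_lengths` are hypothetical, the gear table is filled with a splitmix64 generator chosen for convenience, the `mask` parameter stands in for whatever boundary condition the real chunker derives from its options, and `norm_level` is not modeled at all.

```rust
/// Builds a 256-entry gear table from a fixed seed.
/// Assumption: any well-mixed table is good enough for illustration.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        // One splitmix64 step per table entry.
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Splits `data` into chunks and returns their lengths.
/// A boundary is cut when the rolling hash matches `mask` after at least
/// `min_size` bytes, or unconditionally once `max_size` bytes are reached.
fn chunk_lengths(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<usize> {
    assert!(min_size > 0 && min_size <= max_size);
    let table = gear_table();
    let mut lengths = Vec::new();
    let mut hash: u64 = 0;
    let mut current = 0usize;
    for &byte in data {
        // Gear hash update: shift out old history, mix in the new byte.
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        current += 1;
        let at_boundary = current >= min_size && (hash & mask) == 0;
        if at_boundary || current >= max_size {
            lengths.push(current);
            current = 0;
            hash = 0;
        }
    }
    if current > 0 {
        lengths.push(current);
    }
    lengths
}
```

Because each boundary depends only on the bytes since the previous cut, an insertion early in the input tends to change only the chunks around it, after which boundaries can realign with the previous write. That is the property the deduplication argument relies on.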
#9576)

# Which issue does this PR close?

- Closes #9526

# Rationale for this change

`shred_variant` already supports Binary and LargeBinary types (#9525, #9554), but `unshred_variant` does not handle these types. This means shredded Binary/LargeBinary columns cannot be converted back to unshredded VariantArrays.

# What changes are included in this PR?

Adds `unshred_variant` support for `DataType::Binary` and `DataType::LargeBinary` in `parquet-variant-compute/src/unshred_variant.rs`:

- New enum variants `PrimitiveBinary` and `PrimitiveLargeBinary`
- Match arms in `append_row` and `try_new_opt`
- `AppendToVariantBuilder` impls for `BinaryArray` and `LargeBinaryArray`

# Are these changes tested?

Yes

# Are there any user-facing changes?

No breaking changes

---------

Signed-off-by: Kunal Singh Dadhwal <kunalsinghdadhwal@gmail.com>
# Which issue does this PR close? - part of #9108 # Rationale for this change Prepare for next release # What changes are included in this PR? 1. Update version to `58.1.0` 2. Add changelog. See rendered preview here: https://github.com/alamb/arrow-rs/blob/alamb/prepare_58.1.0/CHANGELOG.md # Are these changes tested? By CI # Are there any user-facing changes? Yes
…#9590) ## Summary - Reserve `output.views` capacity in `ByteViewArrayDecoderDictionary::read` before the decode loop - Reserve `output.offsets` capacity in `ByteArrayDecoderDictionary::read` before the decode loop This avoids per-chunk reallocation during `extend` calls inside the dictionary decode loop. Closes #9587 ## Test plan - [ ] Existing tests pass (no functional change, only pre-allocation) - [ ] Benchmark dictionary-encoded StringView/BinaryView/String reads 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
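The pre-allocation pattern described above can be sketched with a plain `Vec`. This is a simplified stand-in, not the Parquet decoder code: `decode_dictionary_keys` is a hypothetical function, and it assumes the total number of keys is known before the loop starts.

```rust
/// Materializes dictionary-encoded keys chunk by chunk.
/// Capacity is reserved once before the loop so that the `extend` calls
/// inside it do not have to grow the output repeatedly.
fn decode_dictionary_keys(dictionary: &[&str], keys: &[usize], chunk_size: usize) -> Vec<String> {
    assert!(chunk_size > 0);
    let mut output = Vec::new();
    // Single up-front reservation for all values that will be appended.
    output.reserve(keys.len());
    for chunk in keys.chunks(chunk_size) {
        output.extend(chunk.iter().map(|&key| dictionary[key].to_string()));
    }
    output
}
```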
# Rationale for this change

In some cases, it is desirable to print strings with surrounding quotation marks. A typical example that we run into in https://github.com/rerun-io/rerun is a `StructArray` that contains empty strings.

Current formatting:

```text
{name: }
```

Added option in this PR:

```text
{name: ""}
```

# What changes are included in this PR?

This PR relies on `std::fmt::Debug` to do the actual formatting of strings, which means that all escaping is handled out of the box.

# Are these changes tested?

This PR contains tests for different types of inputs, including escape sequences. It also tests the `StructArray` example outlined above.

# Are there any user-facing changes?

By default this option is false, making the feature opt-in.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
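The effect of delegating to `std::fmt::Debug` can be seen with a small standalone sketch. The function `format_string_value` and its `quoted` flag are hypothetical names used only for illustration; they are not the option added by this PR.

```rust
/// Formats a string value either verbatim or through `Debug`,
/// which adds surrounding quotes and escapes special characters.
fn format_string_value(value: &str, quoted: bool) -> String {
    if quoted {
        format!("{:?}", value)
    } else {
        value.to_string()
    }
}

/// Renders a one-field struct the way the example in the description shows it.
fn format_name_struct(name: &str, quoted: bool) -> String {
    format!("{{name: {}}}", format_string_value(name, quoted))
}
```

With `quoted` set to false an empty string renders as `{name: }`, and with it set to true the same value renders as `{name: ""}`.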
## Which issue does this PR close? Closes #9580 ## Rationale The current VLQ decoder calls `get_aligned` for each byte, which involves repeated offset calculations and bounds checks in the hot loop. ## What changes are included in this PR? Align to the byte boundary once, then iterate directly over the buffer slice, avoiding per-byte overhead from `get_aligned`. ## Are there any user-facing changes? No. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
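A minimal sketch of decoding a VLQ (LEB128-style) integer by iterating directly over a byte slice follows. It is not the Parquet bit reader: `decode_vlq` is a hypothetical free function, and it assumes the caller has already aligned to a byte boundary and passes the remaining buffer.

```rust
/// Decodes an unsigned VLQ integer from the start of `bytes`.
/// Each byte carries 7 payload bits, and a set high bit means more bytes follow.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input ends early or runs past the 10 bytes a u64 can need.
fn decode_vlq(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in bytes.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

Iterating over the slice lets the bounds check happen once per loop step through the iterator rather than through a separate aligned read per byte.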
# Rationale for this change The `object_store` crate release 0.13.2 breaks the build of parquet because it feature-gates the `buffered` module. I have filed apache/arrow-rs-object-store#677 about the breakage; meanwhile this fix is made in expectation that 0.13.2 will not be yanked and the feature gate will remain. # What changes are included in this PR? Bump the version to 0.13.2 and requesting the "tokio" feature. # Are these changes tested? The build should succeed in CI workflows. # Are there any user-facing changes? No Co-authored-by: Mikhail Zabaluev <mikhail.zabaluev@gmail.com>
Updates the requirements on [sha2](https://github.com/RustCrypto/hashes) to permit the latest version.

<details>
<summary>Commits</summary>
<ul>
<li><a href="https://github.com/RustCrypto/hashes/commit/ffe093984c004769747e998f77da8ff7c0e7a765"><code>ffe0939</code></a> Release sha2 0.11.0 (<a href="https://redirect.github.com/RustCrypto/hashes/issues/806">#806</a>)</li>
<li><a href="https://github.com/RustCrypto/hashes/commit/8991b65fe400c31c4cc189510f86ae642c470cd9"><code>8991b65</code></a> Use the standard order of the <code>[package]</code> section fields (<a href="https://redirect.github.com/RustCrypto/hashes/issues/807">#807</a>)</li>
<li><a href="https://github.com/RustCrypto/hashes/commit/3d2bc57db40fd6aeb25d6c6da98d67e2784c2985"><code>3d2bc57</code></a> sha2: refactor backends (<a href="https://redirect.github.com/RustCrypto/hashes/issues/802">#802</a>)</li>
<li><a href="https://github.com/RustCrypto/hashes/commit/faa55fb83697c8f3113636d88070e5f5edc8c335"><code>faa55fb</code></a> sha3: bump <code>keccak</code> to v0.2 (<a href="https://redirect.github.com/RustCrypto/hashes/issues/803">#803</a>)</li>
<li><a href="https://github.com/RustCrypto/hashes/commit/d3e6489e56f8486d4a93ceb7a8abf4924af1de7b"><code>d3e6489</code></a> sha3 v0.11.0-rc.9 (<a href="https://redirect.github.com/RustCrypto/hashes/issues/801">#801</a>)</li>
<li><a href="https://github.com/RustCrypto/hashes/commit/bbf6f51ff97f81ab15e6e5f6cf878bfbcb1f47c8"><code>bbf6f51</code></a> sha2: tweak backend docs (<a href="https://redirect.github.com/RustCrypto/hashes/issues/800">#800</a>)</li>
<li><a href="https://github.com/RustCrypto/hashes/commit/155dbbf2959dbec0ec75948a82590ddaede2d3bc"><code>155dbbf</code></a> sha3: add default value for the <code>DS</code> generic parameter on <code>TurboShake128/256</code>...</li>
<li><a href="https://github.com/RustCrypto/hashes/commit/ed514f2b34526683b3b7c41670f1887982c3df64"><code>ed514f2</code></a> Use published version of <code>keccak</code> v0.2 (<a href="https://redirect.github.com/RustCrypto/hashes/issues/799">#799</a>)</li>
<li><a href="https://github.com/RustCrypto/hashes/commit/702bcd83735a49c928c0fc24506924f5c0aa22af"><code>702bcd8</code></a> Migrate to closure-based <code>keccak</code> (<a href="https://redirect.github.com/RustCrypto/hashes/issues/796">#796</a>)</li>
<li><a href="https://github.com/RustCrypto/hashes/commit/827c043f82d57666a0b146d156e91c39535c1305"><code>827c043</code></a> sha3 v0.11.0-rc.8 (<a href="https://redirect.github.com/RustCrypto/hashes/issues/794">#794</a>)</li>
<li>Additional commits viewable in <a href="https://github.com/RustCrypto/hashes/compare/groestl-v0.10.0...sha2-v0.11.0">compare view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# Which issue does this PR close?

- Closes #9340.

# Rationale for this change

# What changes are included in this PR?

Support the `ListView` codec in arrow-json, using the `ListLikeArray` trait to simplify the implementation.

# Are these changes tested?

Tests added

# Are there any user-facing changes?

New encoder/decoder
… verification (#9604)

# Which issue does this PR close?

- Closes #9603

# Rationale for this change

The release and dev KEYS files could get out of sync. We should use the release/ version:

- Users use the release/ version, not the dev/ version, when they verify our artifacts' signatures
- https://dist.apache.org/ may reject our requests when CI sends too many of them

# What changes are included in this PR?

Use `https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/KEYS` to download the KEYS file and the expected `https://dist.apache.org/repos/dist/dev/arrow` for the RC artifacts.

# Are these changes tested?

Yes, I've verified 58.1.0 1 both before and after the change.

# Are there any user-facing changes?

No
…uct)` (#9597)

# Which issue does this PR close?

- Closes #9596.

# Rationale for this change

Check issue

# What changes are included in this PR?

Reuse `shred_basic_variant` as a fast path for unshredded `Struct` handling in `variant_get(..., Struct)`

# Are these changes tested?

Yes, added two unit tests to establish safe mode behavior.

# Are there any user-facing changes?
## Summary

- Fix `MutableArrayData::extend_nulls`, which previously panicked unconditionally for both sparse and dense Union arrays
- For sparse unions: append the first type_id and extend nulls in all children
- For dense unions: append the first type_id, compute offsets into the first child, and extend nulls in that child only

## Background

This bug was discovered via DataFusion. `CaseExpr` uses `MutableArrayData` via `scatter()` to build result arrays. When a `CASE` expression returns a Union type (e.g., from `json_get`, which returns a JSON union) and there are rows where no `WHEN` branch matches (implicit `ELSE NULL`), `scatter` calls `extend_nulls`, which panics with "cannot call extend_nulls on UnionArray as cannot infer type".

Any query like:

```sql
SELECT CASE WHEN condition THEN returns_union(col, 'key') END FROM table
```

would panic if `condition` is false for any row.

## Root Cause

The `extend_nulls` implementation for Union arrays unconditionally panicked because it claimed it "cannot infer type". However, the Union's field definitions (child types and type IDs) are available in the `MutableArrayData`'s data type, so there is enough information to produce valid null entries by picking the first declared type_id.

## Test plan

- [x] Added test for sparse union `extend_nulls`
- [x] Added test for dense union `extend_nulls`
- [x] Existing `test_union_dense` continues to pass
- [x] All `array_transform` tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
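The sparse and dense strategies from the summary above can be sketched on a toy model. This is not `MutableArrayData`: the structs `SparseUnionSketch` and `DenseUnionSketch` are hypothetical, children are modeled as `Vec<Option<i32>>`, and the sketch assumes the first declared type_id belongs to the child at index 0.

```rust
/// Toy sparse union: every child has one slot per row.
struct SparseUnionSketch {
    type_ids: Vec<i8>,
    children: Vec<Vec<Option<i32>>>,
}

/// Toy dense union: each row stores an offset into exactly one child.
struct DenseUnionSketch {
    type_ids: Vec<i8>,
    offsets: Vec<i32>,
    children: Vec<Vec<Option<i32>>>,
}

impl SparseUnionSketch {
    /// Appends `len` null rows: the first type_id, plus a null in every child.
    fn extend_nulls(&mut self, first_type_id: i8, len: usize) {
        self.type_ids.extend(vec![first_type_id; len]);
        for child in self.children.iter_mut() {
            child.extend(vec![None; len]);
        }
    }
}

impl DenseUnionSketch {
    /// Appends `len` null rows: the first type_id, offsets pointing at newly
    /// added slots in the first child, and nulls in that child only.
    fn extend_nulls(&mut self, first_type_id: i8, len: usize) {
        self.type_ids.extend(vec![first_type_id; len]);
        let first_child = &mut self.children[0];
        let start = first_child.len() as i32;
        self.offsets.extend((0..len as i32).map(|i| start + i));
        first_child.extend(vec![None; len]);
    }
}
```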
# Which issue does this PR close?

- Relates to #9497.

# Rationale for this change

# What changes are included in this PR?

As part of the effort to move the JSON reader away from `ArrayData` toward typed `ArrayRef` APIs, this PR changes the `ArrayDecoder::decode` interface to return `ArrayRef` directly and updates all decoder implementations (list, struct, map, run-end encoded) to construct typed arrays without intermediate `ArrayData` round-trips. New benchmarks for map and run-end encoded decoding are added to verify there is no performance regression.

# Are these changes tested?

Yes

# Are there any user-facing changes?

No
# Which issue does this PR close?

- closes #9593

# Rationale for this change

In a previous PR (#9593), I changed instances of `truncate(0)` to `clear()`. However, this breaks the test `test_truncate_with_pool` at `arrow-buffer/src/buffer/mutable.rs:1357`, due to an inconsistency between the implementations of `truncate` and `clear`. This PR fixes that test.

# What changes are included in this PR?

This PR copies a section of code related to the `pool` feature that is present in `truncate` but absent in `clear`, fixing the failing unit test.

# Are these changes tested?

Yes.

# Are there any user-facing changes?

No.
# Rationale for this change CdcOptions only contains primitive fields (usize, usize, i32) so deriving PartialEq and Eq is straightforward. This is needed by downstream crates such as DataFusion that embed CdcOptions in their own configuration structs and need to compare them. # What changes are included in this PR? Implemented PartialEq and Eq for CdcOptions. # Are these changes tested? Added an equality test. # Are there any user-facing changes? No.
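The downstream use case can be illustrated with a stand-in struct. `CdcOptionsSketch` below only mirrors the field names and primitive types mentioned in the descriptions (`usize`, `usize`, `i32`); it is not the real `CdcOptions`, and `DownstreamConfig` is a hypothetical embedding struct.

```rust
/// Stand-in with the same field types as described for `CdcOptions`.
/// All fields are primitives, so `PartialEq` and `Eq` can simply be derived.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CdcOptionsSketch {
    min_chunk_size: usize,
    max_chunk_size: usize,
    norm_level: i32,
}

/// Hypothetical downstream configuration that embeds the options.
/// Deriving `PartialEq` here only compiles because the embedded type implements it.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DownstreamConfig {
    cdc: Option<CdcOptionsSketch>,
    row_group_size: usize,
}
```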
# Which issue does this PR close?

- Closes #8400.

# Rationale for this change

Check issue

# What changes are included in this PR?

- Added `AppendNullMode` enum supporting all semantics.
- Replaced the bool logic with the new enum
- Fixed test outputs for List Array cases

# Are these changes tested?

- Added unit tests

# Are there any user-facing changes?
# Rationale for this change Makes the code simpler and more readable by relying on new PyO3 and Rust features. No behavior should have changed outside of an error message if `__arrow_c_array__` does not return a tuple # What changes are included in this PR? - use `.call_method0(M)?` instead of `.getattr(M)?.call0()` - Use `.extract()` that allows more advanced features like directly extracting tuple elements - remove temporary variables just before returning - use &raw const and &raw mut pointers instead of casting and addr_of!
# Which issue does this PR close?

- Part of #9637

# Rationale for this change

I can't benchmark the arrow-writer changes in #9447 due to hitting a panic:

- #9637

# What changes are included in this PR?

Temporarily disable the cdc benchmarks until the underlying bug is fixed

# Are these changes tested?

# Are there any user-facing changes?
…ly (#9447)

# Which issue does this PR close?

- Closes #9446.
- closes #9636

# Rationale for this change

When writing a Parquet column with very sparse data, `GenericColumnWriter` accumulates unbounded memory for definition and repetition levels. The raw `i16` values are appended into `Vec<i16>` sinks on every `write_batch` call and only RLE-encoded in bulk when a data page is flushed. For a column that is almost entirely nulls, the actual RLE-encoded output can be tiny, yet the intermediate buffer grows linearly with the number of rows.

# What changes are included in this PR?

Replace the two raw-level `Vec<i16>` sinks (`def_levels_sink` / `rep_levels_sink`) with streaming `LevelEncoder` fields (`def_levels_encoder` / `rep_levels_encoder`). Behavior is the same, but we keep running RLE-encoded state rather than the full list of rows in memory. Existing logic is reused.

# Are these changes tested?

Yes, all tests passing. Benchmarks show no regression. `list_primitive` benches improved by 3-5%:

```
Benchmarking list_primitive/default: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
list_primitive/default
    time:   [1.2109 ms 1.2171 ms 1.2248 ms]
    thrpt:  [1.6999 GiB/s 1.7105 GiB/s 1.7194 GiB/s]
  change:
    time:   [−3.7197% −2.8848% −2.0036%] (p = 0.00 < 0.05)
    thrpt:  [+2.0445% +2.9705% +3.8634%]
    Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Benchmarking list_primitive/bloom_filter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, enable flat sampling, or reduce sample count to 50.
list_primitive/bloom_filter
    time:   [1.4405 ms 1.4810 ms 1.5292 ms]
    thrpt:  [1.3615 GiB/s 1.4058 GiB/s 1.4452 GiB/s]
  change:
    time:   [−6.4332% −4.7568% −2.9048%] (p = 0.00 < 0.05)
    thrpt:  [+2.9917% +4.9944% +6.8755%]
    Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

Benchmarking list_primitive/parquet_2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.3s, enable flat sampling, or reduce sample count to 60.
list_primitive/parquet_2
    time:   [1.2271 ms 1.2311 ms 1.2362 ms]
    thrpt:  [1.6841 GiB/s 1.6911 GiB/s 1.6966 GiB/s]
  change:
    time:   [−5.8536% −4.9672% −4.1905%] (p = 0.00 < 0.05)
    thrpt:  [+4.3738% +5.2269% +6.2175%]
    Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

list_primitive/zstd
    time:   [2.0056 ms 2.0148 ms 2.0262 ms]
    thrpt:  [1.0275 GiB/s 1.0333 GiB/s 1.0381 GiB/s]
  change:
    time:   [−4.7073% −3.6719% −2.6698%] (p = 0.00 < 0.05)
    thrpt:  [+2.7431% +3.8118% +4.9398%]
    Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe

list_primitive/zstd_parquet_2
    time:   [2.0455 ms 2.0730 ms 2.1120 ms]
    thrpt:  [1009.4 MiB/s 1.0043 GiB/s 1.0178 GiB/s]
  change:
    time:   [−5.8626% −3.7672% −1.4196%] (p = 0.00 < 0.05)
    thrpt:  [+1.4401% +3.9146% +6.2277%]
    Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Benchmarking list_primitive_non_null/default: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.6s, enable flat sampling, or reduce sample count to 60.
list_primitive_non_null/default
    time:   [1.3199 ms 1.3333 ms 1.3504 ms]
    thrpt:  [1.5384 GiB/s 1.5581 GiB/s 1.5740 GiB/s]
  change:
    time:   [−4.1662% −2.3491% −0.7148%] (p = 0.01 < 0.05)
    thrpt:  [+0.7200% +2.4056% +4.3473%]
    Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

Benchmarking list_primitive_non_null/bloom_filter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.4s, enable flat sampling, or reduce sample count to 50.
list_primitive_non_null/bloom_filter
    time:   [1.6567 ms 1.6668 ms 1.6805 ms]
    thrpt:  [1.2362 GiB/s 1.2464 GiB/s 1.2540 GiB/s]
  change:
    time:   [−2.7884% −1.3493% +0.2820%] (p = 0.07 > 0.05)
    thrpt:  [−0.2812% +1.3677% +2.8684%]
    No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

Benchmarking list_primitive_non_null/parquet_2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.2s, enable flat sampling, or reduce sample count to 50.
list_primitive_non_null/parquet_2
    time:   [1.4279 ms 1.4409 ms 1.4551 ms]
    thrpt:  [1.4277 GiB/s 1.4418 GiB/s 1.4550 GiB/s]
  change:
    time:   [−2.0598% −0.9952% −0.1318%] (p = 0.04 < 0.05)
    thrpt:  [+0.1319% +1.0052% +2.1032%]
    Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

list_primitive_non_null/zstd
    time:   [2.6966 ms 2.7358 ms 2.7994 ms]
    thrpt:  [759.93 MiB/s 777.60 MiB/s 788.89 MiB/s]
  change:
    time:   [−3.8379% −2.1418% +0.0785%] (p = 0.03 < 0.05)
    thrpt:  [−0.0784% +2.1887% +3.9911%]
    Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

list_primitive_non_null/zstd_parquet_2
    time:   [2.7684 ms 2.7861 ms 2.8099 ms]
    thrpt:  [757.07 MiB/s 763.55 MiB/s 768.44 MiB/s]
  change:
    time:   [−6.4460% −4.1387% −2.1474%] (p = 0.00 < 0.05)
    thrpt:  [+2.1946% +4.3174% +6.8901%]
    Performance has improved.
```

# Are there any user-facing changes?

None. Some internal symbols are now unused. I added some `#[allow(dead_code)]` statements since these were experimental-visible and might be externally relied on.

---------

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
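The memory argument in the description above (keeping running encoded state instead of a raw `Vec<i16>`) can be illustrated with a simplified run-length accumulator. This is only a sketch: `RunLengthLevels` is a hypothetical type that stores plain `(level, count)` runs, whereas the real `LevelEncoder` produces Parquet's encoded level bytes.

```rust
/// Streaming run-length accumulator for definition or repetition levels.
/// Memory grows with the number of runs, not with the number of rows.
struct RunLengthLevels {
    runs: Vec<(i16, u64)>,
    num_values: u64,
}

impl RunLengthLevels {
    fn new() -> Self {
        RunLengthLevels { runs: Vec::new(), num_values: 0 }
    }

    /// Adds one level, extending the current run when the value repeats.
    fn push(&mut self, level: i16) {
        self.num_values += 1;
        if let Some(last) = self.runs.last_mut() {
            if last.0 == level {
                last.1 += 1;
                return;
            }
        }
        self.runs.push((level, 1));
    }

    /// Adds a batch of levels, as a `write_batch`-style call would.
    fn extend_from_slice(&mut self, levels: &[i16]) {
        for &level in levels {
            self.push(level);
        }
    }
}
```

For a column that is almost entirely null, a hundred thousand identical levels collapse into a single run here, while a raw `Vec<i16>` sink would hold every one of them until the page is flushed.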
# Which issue does this PR close?

- Follow on to #9594

# Rationale for this change

@kylebarron says #9594 (comment):

> fwiw previously there was a nice user-facing error here, while now the error generated from extract will be much more obtuse. Ideally this exception will never be raised except if the producer doesn't follow the spec correctly.

# What changes are included in this PR?

Restore the nice error

# Are these changes tested?

yes, added a test

# Are there any user-facing changes?
# Which issue does this PR close? - Closes #NNN. # Rationale for this change Miri in CI is VERY slow (around 2.5 hours), but the github runners actually have 4 vCPUs and some memory, so using nextest can give us some speedup. # What changes are included in this PR? Install nextest in CI and then use it to run Miri # Are these changes tested? tested the script locally # Are there any user-facing changes? No
…er (#9497)

# Which issue does this PR close?

- Part of #9298.

# Rationale for this change

While implementing `ListViewArrayDecoder` in arrow-json, I noticed we could potentially retire `ArrayDataBuilder` inside `ListArrayDecoder`. Therefore, I'd like to use a small PR here to make sure there's no regression.

# What changes are included in this PR?

Replace `ArrayDataBuilder` with `GenericListArray` in `ListArrayDecoder`

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

No
# Which issue does this PR close?

- Closes #9627.

# Rationale for this change

Adding benchmarks makes it easier to measure performance and evaluate the impact of changes to the implementation. I also have a PR including some significant improvements, but figured it's worth splitting it into two parts. Let me know if it's better to do that in one step.

# What changes are included in this PR?

Add a couple of utility functions to generate list and list_view arrays without providing a seed

# Are these changes tested?

Benchmarks run locally, same setup as other benchmarks.

# Are there any user-facing changes?

No
# Which issue does this PR close? - Closes #10029. # Rationale for this change Increase the duplex buffer from 1 MB to 64 MB to eliminate artificial back-pressure in the roundtrip benchmarks. See the rationale in this [comment](#10044 (comment)) # What changes are included in this PR? Bumps `max_buf_size` to 64 MB # Are these changes tested? n/a # Are there any user-facing changes? n/a
# Which issue does this PR close? - Part of #9110 # Rationale for this change This prepares for the `59.0.0` (major) release of the Rust Arrow / Parquet crates. # What changes are included in this PR? 1. Update version to `59.0.0` 2. Update CHANGELOG. See rendered preview here: https://github.com/alamb/arrow-rs/blob/alamb/make_release_59.0.0/CHANGELOG.md # Are these changes tested? By CI # Are there any user-facing changes? yes
# Which issue does this PR close? - Issue raised in #9110 # Rationale for this change Add a "bad_data" test for newly added file in parquet-testing # What changes are included in this PR? Adds a new test so the `bad_data` unit test doesn't fail. # Are these changes tested? Yes # Are there any user-facing changes? No, only tests
… comment (#10072) # Which issue does this PR close? Follow-up to #9972. # Rationale for this change A test comment added in #9972 described granular mode as writing "more pages than `main`". As noted in [review feedback](#9972 (comment)), comparing to `main` is confusing now that the PR has merged — that code *is* main. This rephrases the comment to compare against the default batched path instead, which the same comment already references. # What changes are included in this PR? - Reword one test comment in `test_arrow_writer_granular_mode_roundtrip`. No behavior change. # Are there any user-facing changes? No. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Which issue does this PR close? - Closes #10080. # Rationale for this change `From` is already implemented for all other signed integer primitives, ran into it working on decimal aggregations in DataFusion, which this will make much simpler. # What changes are included in this PR? Adds an additional trait implementation for i256. I've also considered deprecating `i256::from_i128` as a public function, but figured I'll see what reviewers think. # Are these changes tested? Just exposes an additional path for existing functionality. # Are there any user-facing changes? No Signed-off-by: Adam Gutglick <adam@spiraldb.com>
…s` (#10089) # Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Spawn off from #9848 - Contributes to #9731 # Rationale for this change The recursive `build_reader` / `build_*_reader` methods in the array reader builder thread `field` and `mask` through every call. # What changes are included in this PR? Bundle them into a small `Copy` `ReaderArgs` struct so the recursive signatures stay compact and there is a single, documented home for per-field reader options added in the future. This is a mechanical, behavior-preserving change: `build_array_reader` constructs the args at the entry point, group readers recurse with `args.with_field(child)`, and leaf readers read `args.field` and `args.mask`. # Are these changes tested? All tests passing. # Are there any user-facing changes? No.
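The bundling described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Field`, `Mask`, their members, and the body of `build_reader` are hypothetical stand-ins, and only the names `ReaderArgs`, `with_field`, `args.field` and `args.mask` come from the description.

```rust
/// Hypothetical stand-in for a schema field with optional children.
#[derive(Debug)]
pub struct Field {
    pub name: &'static str,
    pub children: Vec<Field>,
}

/// Hypothetical stand-in for a projection mask: the leaf names to read.
#[derive(Debug)]
pub struct Mask {
    pub selected: Vec<&'static str>,
}

/// Small `Copy` bundle of the arguments threaded through the recursion.
#[derive(Debug, Clone, Copy)]
pub struct ReaderArgs<'a> {
    pub field: &'a Field,
    pub mask: &'a Mask,
}

impl<'a> ReaderArgs<'a> {
    pub fn new(field: &'a Field, mask: &'a Mask) -> Self {
        Self { field, mask }
    }

    /// Same options, different field: used when recursing into a child.
    pub fn with_field(self, field: &'a Field) -> Self {
        Self { field, ..self }
    }
}

/// Collects the selected leaf names, standing in for building leaf readers.
pub fn build_reader(args: ReaderArgs<'_>, out: &mut Vec<&'static str>) {
    if args.field.children.is_empty() {
        // Leaf: consult both bundled values.
        if args.mask.selected.contains(&args.field.name) {
            out.push(args.field.name);
        }
    } else {
        // Group: recurse with the child swapped in, everything else unchanged.
        for child in &args.field.children {
            build_reader(args.with_field(child), out);
        }
    }
}
```

Because the struct is `Copy`, a future per-field option would be one more member rather than one more parameter on every recursive signature.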
# Which issue does this PR close? - Closes #9815. # Rationale for this change As noted in #9813 (comment), Rust debug builds panic on arithmetic overflow / underflow but release builds do not (they silently wrap around). This means that some code paths may panic in debug builds but fail silently in release builds. As we harden the security posture of arrow-rs, I would like to start testing in release mode too, to ensure overflows such as #9813 are properly validated # What changes are included in this PR? Add some new release mode tests: `linux-release-test:` et al # Are these changes tested? They are only tests, no code changes # Are there any user-facing changes? No
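To make the debug/release difference concrete, here is a small sketch of the kind of range check that behaves identically in both build modes. The function names are hypothetical and not taken from arrow-rs; the point is only that `checked_add` reports overflow explicitly, whereas a plain `start + len` panics in a debug build and wraps in a default release build.

```rust
/// Computes `start + len` for a buffer range, reporting overflow as `None`
/// instead of panicking (debug) or wrapping (default release profile).
pub fn range_end(start: usize, len: usize) -> Option<usize> {
    start.checked_add(len)
}

/// Returns true when `[start, start + len)` fits in a buffer of `buffer_len` bytes.
pub fn range_in_bounds(start: usize, len: usize, buffer_len: usize) -> bool {
    match range_end(start, len) {
        Some(end) => end <= buffer_len,
        // An overflowing range can never be in bounds.
        None => false,
    }
}
```

With a wrapping add, `usize::MAX + 1` would become `0` and pass a naive `end <= buffer_len` check, which is the class of bug a release-mode test run is meant to expose.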
#10014) # Which issue does this PR close? - Closes #10013 - Related to #6736 # Rationale for this change `variant_get` / `variant_to_arrow` can already convert Variant values into many native Arrow array layouts, but requesting `DataType::Dictionary` or `DataType::RunEndEncoded` was not supported. This PR adds support for those output encodings without changing Variant shredding semantics. `Dictionary` and `RunEndEncoded` are produced as Arrow result arrays only; they are not introduced as valid Parquet Variant shredded `typed_value` layouts. # What changes are included in this PR? 1. Adds an encoded output builder in `variant_to_arrow` for `DataType::Dictionary` and `DataType::RunEndEncoded`. 2. Builds the logical child value array using the existing Variant-to-Arrow builders, then delegates the final Dictionary/REE encoding to Arrow's existing cast kernels. 3. Adds `variant_get` regression coverage for string dictionary, numeric dictionary, and run-end encoded outputs. # Are these changes tested? Yes: - `cargo fmt --check` - `cargo test -p parquet-variant-compute` - `cargo test -p parquet-variant` - `cargo clippy --workspace --all-targets` # Are there any user-facing changes? Yes. `variant_get` with `as_type` set to `DataType::Dictionary` or `DataType::RunEndEncoded` can now return those Arrow array encodings. Co-authored-by: Neetika Mittal <mneetika@users.noreply.github.com>
# Which issue does this PR close? - Closes #10095. # Rationale for this change - See the issue. # What changes are included in this PR? - Replaced the old .md issue templates with new .yaml templates, following DataFusion. - config.yaml is new and includes a convenient link to discussions, since the `question` template is removed. # Are these changes tested? n/a # Are there any user-facing changes? - Yes, better UX for issue creation.
# Which issue does this PR close? - Closes #10083 . # Rationale for this change Add benchmarks for list types with nested repetition levels: - `list_nested`: List<List<Int32>> - `list_struct_with_list`: List<Struct<a:Int32, b:Float32, c:List<Int32>>> These exercise the per-slot (non-batched) write path where child_has_no_nested_rep() returns false, providing a baseline for future optimizations. # What changes are included in this PR? Add some benchmarks # Are these changes tested? They're already tests # Are there any user-facing changes? No Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
…ve (#10025) # Which issue does this PR close? - Closes #10022. # Rationale for this change Optimize interleave_list when child is primitive type. # What changes are included in this PR? 1. Special path when child is primitive type. 2. new `interleave_list_primitive_child` function # Are these changes tested? Covered by existing # Are there any user-facing changes? no
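As a rough picture of why a primitive child allows a special path: each selected row is a contiguous slice of the child values, so interleaving reduces to slice copies plus offset bookkeeping. The sketch below is an assumption-laden illustration, not the `interleave_list_primitive_child` function from the PR; `ListI32` and `interleave_lists` are hypothetical names, and nulls are ignored.

```rust
/// A list array with an `i32` child, reduced to offsets plus flat values.
/// `offsets` has one more entry than there are rows.
pub struct ListI32 {
    pub offsets: Vec<usize>,
    pub values: Vec<i32>,
}

impl ListI32 {
    /// Child values of row `i` as one contiguous slice.
    fn row(&self, i: usize) -> &[i32] {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Builds a new list by picking `(array_index, row_index)` pairs in order.
pub fn interleave_lists(arrays: &[&ListI32], indices: &[(usize, usize)]) -> ListI32 {
    // First pass: total child length, so the values buffer is allocated once.
    let total: usize = indices.iter().map(|&(a, r)| arrays[a].row(r).len()).sum();
    let mut offsets = Vec::with_capacity(indices.len() + 1);
    let mut values = Vec::with_capacity(total);
    offsets.push(0);
    // Second pass: copy each row's slice and record where it ends.
    for &(a, r) in indices {
        values.extend_from_slice(arrays[a].row(r));
        offsets.push(values.len());
    }
    ListI32 { offsets, values }
}
```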
…ow selectivity filters and inlined Utf8View/BinaryView (#9755) ## Summary - fuse the sparse inline `BinaryView` filter and coalescing paths so primitive columns and inline views can be appended directly without materialising an intermediate filtered `RecordBatch` - reuse optimised filter indices and null-mask handling for coalescing, while preserving the existing fallback paths for dense and non-inline `BinaryView` inputs - add focused tests and benchmarks for single-column and mixed `BinaryView` filter cases related to `#9143` ## Verification - `cargo test -p arrow-select coalesce --lib` - `cargo clippy -p arrow-select --lib --tests -- -D warnings` - `cargo clippy -p arrow --bench coalesce_kernels --features test_utils -- -D warnings` - `cargo bench -p arrow --bench coalesce_kernels --features test_utils -- --noplot single_binaryview` - `cargo bench -p arrow --bench coalesce_kernels --features test_utils -- --noplot mixed_binaryview` ## Benchmark Results Measured against a clean `origin/main` worktree with the same `BinaryView` benchmark additions. The figures below compare representative median times from the baseline worktree and this branch. 
### Mixed primitive + BinaryView
- `mixed_binaryview (max_string_len=8), 8192, nulls: 0, selectivity: 0.001`: `23.16 ms` -> `8.51 ms`
- `mixed_binaryview (max_string_len=8), 8192, nulls: 0, selectivity: 0.01`: `2.37 ms` -> `1.31 ms`
- `mixed_binaryview (max_string_len=8), 8192, nulls: 0.1, selectivity: 0.001`: `31.70 ms` -> `14.33 ms`
- `mixed_binaryview (max_string_len=8), 8192, nulls: 0.1, selectivity: 0.01`: `3.92 ms` -> `2.44 ms`

### Single BinaryView
- `single_binaryview, 8192, nulls: 0, selectivity: 0.01`: `4.86 ms` -> `4.90 ms` (roughly flat, slightly slower)
- `single_binaryview (max_string_len=8), 8192, nulls: 0, selectivity: 0.001`: `34.72 ms` -> `19.33 ms`
- `single_binaryview (max_string_len=8), 8192, nulls: 0, selectivity: 0.01`: `3.46 ms` -> `2.03 ms`
- `single_binaryview (max_string_len=8), 8192, nulls: 0.1, selectivity: 0.01`: `5.93 ms` -> `3.97 ms`
- `single_binaryview (max_string_len=8), 8192, nulls: 0, selectivity: 0.8`: `597 µs` -> `619 µs` (regression)
- `single_binaryview (max_string_len=8), 8192, nulls: 0.1, selectivity: 0.8`: `1.78 ms` -> `1.79 ms` (roughly flat, slightly slower)

In short, this change substantially improves the mixed primitive + inline `BinaryView` path that motivated `#9143`, while the single-column `BinaryView` benchmarks still show trade-offs: sparse inline cases improve, but dense inline cases are slightly slower and the non-inline single-column path is effectively unchanged.

Closes #9143.

---------

Signed-off-by: cl <cailue@apache.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close? - Closes #10029. [A document that provides a bit of context](https://github.com/user-attachments/files/28477762/Arrow.flight.speed.up.2.pdf) # Rationale for this change Compression is the most compute- and memory-intensive part of the arrow-ipc encoding pipeline. It runs per buffer, not per record batch. For a Flight stream of 10 batches with 5 primitive arrays each, that is 100 compression calls minimum, [more for string and struct arrays](https://arrow.apache.org/docs/format/Columnar.html#compression). Each of those calls produced an owned compressed Vec that was then copied a second time into a flat arrow_data accumulator before being written to the output. For the uncompressed path the situation was the same: Arc-backed buffer slices that required no compression were still copied into that accumulator unnecessarily. Separately, the original **write_message()** function flushed after every dictionary and every record batch, causing repeated small OS write calls per batch (**for non-vector-backed writer implementations**). The goal was to eliminate both problems: stop copying buffers that do not need to be copied, and stop flushing on every message. # What changes are included in this PR? - Introduced EncodedBuffer, an enum that wraps either a raw Arc-backed Buffer for the uncompressed path or an owned Vec for the compressed path, so both can be held in a uniform collection without an extra copy into a flat accumulator - Changed write_array_data to push EncodedBuffer segments instead of copying bytes into arrow_data - FileWriter and StreamWriter both now call **write_batch_direct()**, eliminating the flush-per-message behavior and the intermediate copy on the hot path # Are these changes tested? These changes are intended to be completely seamless. I didn't write new unit tests for the code as nothing externally changed. All tests still pass. ## benchmarks [**main** -> `cargo bench --bench ipc_writer -- "StreamWriter/write_10$" --sample-size 100`] [**my branch** -> `cargo bench --bench ipc_writer -- "StreamWriter/write_10$" --sample-size 100` ] <img width="1832" height="982" alt="Image 6-1-26 at 3 19 PM" src="https://github.com/user-attachments/assets/8e6253a4-8a53-4d03-bdab-d0321edc2561" /> [**main** -> `cargo bench --bench ipc_writer -- --sample-size 1000`] [**my branch** -> `cargo bench --bench ipc_writer -- --sample-size 1000`] <img width="1944" height="1000" alt="Image 6-1-26 at 3 20 PM" src="https://github.com/user-attachments/assets/dc8015e8-ed60-487c-aa66-06f5d35499fe" /> # Are there any user-facing changes? no
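The two-variant idea can be pictured with a standard-library-only sketch. Everything here is an assumption made for illustration: the variant names, `as_slice`, and `write_segments` are hypothetical, and `Arc<[u8]>` stands in for arrow's `Buffer`. Only the enum name `EncodedBuffer` and the shared-versus-owned split come from the description.

```rust
use std::io::{self, Write};
use std::sync::Arc;

/// One segment of a message body.
pub enum EncodedBuffer {
    /// Uncompressed path: shares the original allocation, no copy.
    Shared(Arc<[u8]>),
    /// Compressed path: owns the freshly produced bytes.
    Owned(Vec<u8>),
}

impl EncodedBuffer {
    /// Uniform read access regardless of who owns the bytes.
    pub fn as_slice(&self) -> &[u8] {
        match self {
            EncodedBuffer::Shared(bytes) => &bytes[..],
            EncodedBuffer::Owned(bytes) => bytes.as_slice(),
        }
    }
}

/// Writes every segment straight to `writer`, without first gathering the
/// bytes into one flat accumulator. Flushing is left to the caller, so it
/// does not have to happen once per message.
pub fn write_segments<W: Write>(writer: &mut W, segments: &[EncodedBuffer]) -> io::Result<usize> {
    let mut written = 0;
    for segment in segments {
        let bytes = segment.as_slice();
        writer.write_all(bytes)?;
        written += bytes.len();
    }
    Ok(written)
}
```

Holding both kinds of segment in one `Vec<EncodedBuffer>` is what lets the uncompressed path skip its copy: the shared variant only bumps a reference count.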
# Which issue does this PR close? Resolves this #10044 (comment) from #10044 # Rationale for this change Code in this file is hard to navigate and it's unclear what is happening. # What changes are included in this PR? This PR introduces `IpcMetadataBuilder`, a struct that groups the nodes and buffers vecs previously passed separately into `write_array_data()`, and removes the redundant num_rows/null_count parameters by deriving them from `array_data` directly. Together these reduce `write_array_data()` from 10 arguments to 7, eliminating the #[allow(clippy::too_many_arguments)] suppression. Doc comments are added to clarify the two-channel output model between `IpcMetadataBuilder` (flatbuffer header metadata) and `IpcBodySink` (raw Arrow data bytes). # Are these changes tested? yes # Are there any user-facing changes? no
…es (#10110) # Which issue does this PR close? - Closes #10092. # Rationale for this change check issue # What changes are included in this PR? - Add two `field_` APIs symmetric to `column_` ones. - Reuse `Fields::find` in `column_by_name` to avoid a `Vec` alloc. - Fix doc pointing to an old Jira issue. Now points to #9205 - `MapArray::entries_fields` avoid a `Vec` alloc. # Are these changes tested? - Yes, unit tests # Are there any user-facing changes? New `StructArray` APIs.
…bility (#10115) # Which issue does this PR close? N/A # Rationale for this change Although `Bytes` and some of its functions are marked as `pub`, the type is never exposed outside the crate. This change makes the code less confusing to read. # What changes are included in this PR? Changed `pub` to `pub(crate)` in the `Bytes` impl # Are these changes tested? Existing tests # Are there any user-facing changes? No, since it was never exposed anyway
…ption<Vec<(Key, Option<Value>)>>>` for tests (#10123) # Which issue does this PR close? N/A # Rationale for this change Writing tests that use `MapArray` currently requires a very verbose way of building the MapArray with the specific values you want. Adding this helper allows arrow tests and user tests to be cleaner. # What changes are included in this PR? Added the function and updated some of the tests in the repo that use the `MapBuilder` (those that do not test the builder itself, of course) to use the new method, showcasing how much cleaner it looks. # Are these changes tested? yes # Are there any user-facing changes? New function
# Which issue does this PR close? N/A # Rationale for this change `MapArray` is not very different from `ListArray`, which is already supported in the `lengths` kernel # What changes are included in this PR? Added MapArray support and tests # Are these changes tested? yes # Are there any user-facing changes? `lengths` now supports `MapArray`
# Which issue does this PR close? - Closes #10047 . # Rationale for this change Implement concat for map # What changes are included in this PR? Implement concat for map # Are these changes tested? Yes # Are there any user-facing changes? No --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…10015) # Which issue does this PR close? - Closes #8420. # Rationale for this change Shredding into `FixedSizeBinary(16)` means we're shredding into the `UUID` Parquet logical type. `shred_variant` currently doesn't preserve extension type metadata for the typed value field. UUID is the only valid `Variant` shredding type that requires an arrow extension type. https://github.com/apache/parquet-format/blob/master/VariantShredding.md Earlier in #8665 (comment), @scovich mentioned: > Yeah, as long as `shred_variant` only takes a `DataType` instead of a `Field`, we are forced to assume 16-byte fixed binary is UUID. If it accepted a `Field`, we should additionally require the UUID extension type. Otherwise, we potentially run into problems because Decimal128 can _also_ use 16-byte fixed binary! This is an argument for using `Field` instead of `DataType` for the `as_type` parameter in `shred_variant`. This should not be an issue because arrow has a `Decimal128Type` to represent the `Decimal128` logical Parquet type. This way there's no ambiguity in using the `FixedSizeBinary(16)` arrow type to represent `UUID`. Switching `as_type` to `Field` is unnecessary. # What changes are included in this PR? - `VariantArray::from_parts/ShreddedVariantFieldArray::from_parts` now add `UUID` extension type metadata to the typed_value `Field` if `DataType` is `FixedSizeBinary(16)` - Uncommented `UUID` extension type metadata validation in a unit test. # Are these changes tested? - Yes, unit test. # Are there any user-facing changes? - Shredded `UUID` typed value fields now preserve `UUID` extension type metadata. --------- Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
# Which issue does this PR close? - Closes #NNN. # Rationale for this change Miri currently takes just under an hour to run, with most of it being the actual tests. # What changes are included in this PR? This PR modifies the script that runs miri to optionally use nextest's [partitioning](https://nexte.st/docs/ci-features/partitioning/) feature, and makes use of it in CI with 4 partitions. This should reduce the overall miri runtime to just over 15 minutes with a minimal increase in CI resource usage. This is also scalable if the number of tests keeps increasing, changing the number of partitions is trivial, picking 4 here is an arbitrary choice. # Are these changes tested? Tested the script locally. # Are there any user-facing changes? No
# Which issue does this PR close? None, just a dependency update. # Rationale for this change pyo3 has a security vulnerability: https://rustsec.org/advisories/RUSTSEC-2026-0176.html This PR updates to 0.29 to resolve this vulnerability. # What changes are included in this PR? Update all crates that use the pyo3 dependency to 0.29 # Are these changes tested? Updated and run against the existing integration test suite. # Are there any user-facing changes? No --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Which issue does this PR close? This PR works towards an initial solution closing #8016 - Closes #8016. # Rationale for this change Currently `arrow_writer` does not support writing Run End Encoded columns out to Parquet. This PR works towards solving this by first expanding the REE array to its value type and then writing it out to Parquet. Once it's possible to write REE to Parquet, we can work on optimizing it by keeping the compact representation intact. # What changes are included in this PR? `arrow_writer()` now supports writing Run End Encoded (REE) arrays to Parquet by hydrating them to their underlying value type before encoding. This is an initial, correctness-first implementation. A follow-up can/should optimize to preserve the compacted structure. **parquet/src/arrow/arrow_writer/mod.rs**: generate a value-type arrow-column writer & test **parquet/src/arrow/arrow_writer/levels.rs**: core writer logic updated to detect REE columns and expand them to their flat value type before the existing write path. **parquet/src/arrow/schema/mod.rs**: schema conversion updated to map RunEndEncodedType to an appropriate Parquet physical type. **parquet/benches/arrow_writer.rs**: REE write benchmarks added with low and high null density scenarios, now unblocked by the implementation. # Are these changes tested? Yes # Are there any user-facing changes? Users will be able to write their REE columns out to Parquet using `arrow_writer`
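"Hydrating" a run-end encoded column, as described above, means repeating each run's value until its run end is reached. The following is a minimal sketch under simplifying assumptions: `hydrate_ree` is a hypothetical helper operating on plain slices rather than Arrow arrays, and `None` stands in for a null run value.

```rust
/// Expands run-end encoded data to one value per logical row.
/// `run_ends[i]` is the exclusive logical end index of run `i`,
/// so the sequence must be strictly increasing.
pub fn hydrate_ree<T: Clone>(run_ends: &[usize], values: &[Option<T>]) -> Vec<Option<T>> {
    assert_eq!(run_ends.len(), values.len(), "one value per run");
    let mut out = Vec::with_capacity(run_ends.last().copied().unwrap_or(0));
    let mut start = 0;
    for (&end, value) in run_ends.iter().zip(values) {
        assert!(end > start, "run ends must be strictly increasing");
        // Repeat this run's value for every logical row it covers.
        out.extend(std::iter::repeat(value.clone()).take(end - start));
        start = end;
    }
    out
}
```

The flat result can then go through an ordinary write path, at the cost of giving up the compact representation, which is the trade-off the follow-up work is meant to address.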
# Which issue does this PR close? - Closes #10093. # Rationale for this change check issue # What changes are included in this PR? - Rename existing `_field` APIs that return `&ArrayRef` to `_column` - Add new `_field` APIs that return `&FieldRef` and tests for them # Are these changes tested? - Yes, unit tests # Are there any user-facing changes? - Yes, breaking API name change.
…nversion (#10065) # Which issue does this PR close? - Closes #9929. # Rationale for this change There were several issues with conversion identified when I tried to integrate this in SedonaDB and that came to light when the spec was recently clarified. I am sorry for missing these changes when I reviewed the initial implementation. # What changes are included in this PR? - A Parquet crs of `None` for geometry or geography is now converted to a GeoArrow CRS of `"OGC:CRS84"` (the named value for the default CRS in the Parquet spec) - A Parquet crs of `"srid:0"` is now converted to a GeoArrow "omitted" CRS. This was recently clarified in the Parquet spec (srid:0 is a named example in the list of allowed values). - A GeoArrow missing CRS is now encoded as `"srid:0"` - A GeoArrow CRS that is "lonlat-like" is now encoded as a Parquet crs of `None`. This logic was included in the previous implementation but was reversed (Parquet CRSes that looked like lonlat were omitted when written to GeoArrow, which is not correct). - The GeoArrow metadata struct uses the name `"algorithm"` and was serializing it to JSON. The GeoArrow spec uses the `"edges"` key. This led to invalid metadata being generated which was either rejected or incorrectly interpreted by consumers. # Are these changes tested? Yes. I added high-level end-to-end LogicalType <-> extension metadata tests, since that is what matters (there were a few lower level tests that I updated as well).
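The four CRS rules listed above can be summarized as a pair of mapping functions. This is only a sketch of the stated rules, not the crate's code: the function names are hypothetical, a GeoArrow CRS is modeled as `Option<&str>` with `None` meaning "omitted", the pass-through of other CRS strings is an assumption, and `is_lonlat_like` is a placeholder that recognizes just `"OGC:CRS84"`.

```rust
/// Parquet -> GeoArrow: a missing Parquet crs means the spec default,
/// while "srid:0" means the CRS is explicitly omitted.
pub fn parquet_crs_to_geoarrow(crs: Option<&str>) -> Option<String> {
    match crs {
        None => Some("OGC:CRS84".to_string()),
        Some("srid:0") => None,
        // Assumption: any other value is carried over unchanged.
        Some(other) => Some(other.to_string()),
    }
}

/// GeoArrow -> Parquet: an omitted CRS becomes "srid:0", and a
/// lon/lat-like CRS becomes `None` (the Parquet default).
pub fn geoarrow_crs_to_parquet(crs: Option<&str>) -> Option<String> {
    match crs {
        None => Some("srid:0".to_string()),
        Some(c) if is_lonlat_like(c) => None,
        // Assumption: any other value is carried over unchanged.
        Some(other) => Some(other.to_string()),
    }
}

/// Placeholder predicate; a real check would accept more spellings.
fn is_lonlat_like(crs: &str) -> bool {
    crs == "OGC:CRS84"
}
```

Written this way, the two directions are inverses for the default and omitted cases, which is the property the end-to-end tests in the PR are described as checking.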
# Which issue does this PR close? - Closes #10119 # Rationale for this change This PR adds writer benchmarks for dictionaries so that we can measure the performance impact of code changes on those code paths. # What changes are included in this PR? Three new benchmarks: - StreamWriter benchmark for dictionaries - StreamWriter benchmark for delta dictionaries - FileWriter benchmark for delta dictionaries # Are these changes tested? Yes, just benchmarks included which I ran locally. # Are there any user-facing changes? No.
# Which issue does this PR close? Follow-up while reviewing #10044. # Rationale for this change While reviewing #10044 (which reworks the IPC writer's buffer handling), I found that the **compressed `IpcDataGenerator::encode` path is not exercised by any test in the repository**. # What changes are included in this PR? This PR adds that missing coverage # Are these changes tested? This PR is test-only. # Are there any user-facing changes? No. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…z)> (#10099) # Which issue does this PR close? N/A — test-only coverage improvement. # Rationale for this change While reviewing #10025 (which adds a primitive-child fast path to `interleave_list`), I noticed `interleave` over list arrays had no test for a primitive child that carries logical type parameters — `Decimal128`/`Decimal256` precision & scale, or timezone-aware `Timestamp`. # What changes are included in this PR? Two new tests in `arrow-select/src/interleave.rs`, each parameterized over `i32`/`i64` offsets: # Are these changes tested? This PR is the tests. They pass on `main`, and also guard the fast path added in #10025. # Are there any user-facing changes? No. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ry focused benchmarks (#10126) # Which issue does this PR close? Part of - #10125 # Rationale for this change Going through the arrow-flight codebase I noticed that by default `DictionaryHandling` is set to Hydrate. This means it expands the arrays out to their logical form. In other words, when the variant is set to Hydrate, `arrow-ipc::IpcDataGenerator::encode_all_dicts()` never actually runs. This is important due to the arrow-ipc work that @alamb, @JakeDern and I have been working on. [Efforts are being made to optimize](#10044 (comment)) arrow-ipc's use of dictionaries. This PR allows those changes to be visible through arrow-flight benchmarks # What changes are included in this PR? This PR adds a benchmark for arrow-flight's `do_put` endpoint using dictionary arrays, measuring the latency difference between the two DictionaryHandling variants. # Are these changes tested? The changes are benchmarks # Are there any user-facing changes? no