[REL-1242132] update fork#12
Conversation
…apache#8625) # Which issue does this PR close? - Closes apache#8606 # Rationale for this change Mapping from an iterator of `Variant`s into a `VariantArray` currently requires a lot of ceremony. This PR abstracts over this.
# Which issue does this PR close? - Closes apache#8538 . # Rationale for this change # What changes are included in this PR? Removes redundant `get_type_name` and leverages the `Display` trait of `DataType` # Are these changes tested? Yes, 3 tests added # Are there any user-facing changes? No Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close? - Closes apache#8030 # Rationale for this change There is no agreement yet on the what to do with extension types when modifying `Field` data types (apache#8350), but I think we agree on removing the experimental warning from the `extension` module. Opening this PR so we can include it in 57.0.0. # What changes are included in this PR? The warning removal from apache#8350. # Are these changes tested? Not required. # Are there any user-facing changes? Yes, extension types no longer marked experimental.
## Which issue does this PR close? - Closes apache#8351 ## Rationale for this change ## What changes are included in this PR? This PR refactors and improves the `Display` formatting for `DataType` by: - **Improving Union type display** - now shows field information with parentheses for clarity: `Union(Sparse, 0: ('a': Int32), 1: ('b': nullable Utf8))` - **Improving RunEndEncoded type display** - now properly formats both run_ends and values fields: `RunEndEncoded('run_ends': UInt32, 'values': nullable Int32)` ## Are these changes tested? Yes ## Are there any user-facing changes?
…nt Thrift serialization (apache#8634) # Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Part of apache#5853. # Rationale for this change Structures that are to be Thrift serialized as fields in another struct must implement the `WriteThriftField` trait. In most instances the implementation of that trait is trivial and adds unnecessary verbosity to the code. # What changes are included in this PR? Add a new `write_thrift_field` macro that generates this boilerplate code. # Are these changes tested? Covered by existing tests # Are there any user-facing changes? No
…e#8621) # Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Related to apache#7835 # Rationale for this change While testing the arrow 57 upgrade in DataFusion I found a few things that need to be fixed in parquet-rs. - apache/datafusion#17888 One was that the method `ArrowWriter::into_serialized_writer` was deprecated, (which I know I suggested in apache#8389 🤦 ). However, when testing it turns out that the constructor of `SerializedFileWriter` does a lot of work (like creating the parquet schema from the arrow schema and messing with metadata) https://github.com/apache/arrow-rs/blob/c4f0fc12199df696620c73d62523c8eef5743bf2/parquet/src/arrow/arrow_writer/mod.rs#L230-L263 Creating a `RowGroupWriterFactory` directly would involve a bunch of code duplication # What changes are included in this PR? So let's not deprecate this method for now and instead add some additional docs to guide people to the right lace # Are these changes tested? I tested manually upstream # Are there any user-facing changes? If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out.
# Which issue does this PR close? N/A # Rationale for this change not testing the correct length # What changes are included in this PR? remove * 8 as the length of the buffer is in bytes already # Are these changes tested? created tests to make sure they are failing before AND created tests that make sure that ceil is used for future changes # Are there any user-facing changes? Nope
…tArray` (apache#8627) # Which issue does this PR close? - Closes apache#8610 # Rationale for this change Since the fields of `VariantArray` impl `PartialEq`, this PR simply derives `PartialEq` for `VariantArray` out. Based off of apache#8625
…trings for error messages (apache#8636) # Which issue does this PR close? This is a small performance improvement for the thrift remodeling - Part of apache#5853. # Rationale for this change Some of the often-called methods in the thrift protocol implementation created `ParquetError` instances with a string message that had to be allocated and formatted. This formatting code and probably also some drop glue bloats these otherwise small methods and prevented inlining. # What changes are included in this PR? Introduce a separate error type `ThriftProtocolError` that is smaller than `ParquetError` and does not contain any allocated data. The `ReadThrift` trait is not changed, since its custom implementations actually require the more expressive `ParquetError`. # Are these changes tested? The success path is covered by existing tests. Testing the error paths would require crafting some actually malformed files, or using a fuzzer. # Are there any user-facing changes? The `ThriftProtocolError` is crate-internal so there should be no api changes. Some error messages might differ slightly.
# Which issue does this PR close? N/A # Rationale for this change I have a PR to improve zip perf for scalar but I don't see any benchmarks for it: - apache#8653 # What changes are included in this PR? created zip benchmarks for scalar and non scalar with different masks # Are these changes tested? N/A # Are there any user-facing changes? Nope
…#8594) # Which issue does this PR close? - Partial fix for apache/datafusion#17857 # Rationale for this change These changes add a safer version of `append_value` in `ByteViewBuilder` that handles panics called `try_append_value`. Datafusions will consume the API and handle the Result coming back from the function. # What changes are included in this PR? # Are these changes tested? The method is already covered by existing tests. # Are there any user-facing changes? No breaking changes, as the original `append_value` method hasn't changed. --------- Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com> Co-authored-by: Matthew Kim <38759997+friendlymatthew@users.noreply.github.com> Co-authored-by: Jörn Horstmann <git@jhorstmann.net>
…pache#8642) # Which issue does this PR close? - Closes apache#8639 # Rationale for this change add write_batch_size config and change compression to use parquet Compression # What changes are included in this PR? add write_batch_size config and change compression to use parquet Compression # Are these changes tested? I've try these command by myself. # Are there any user-facing changes? Yeah 1. zstd level previously is default 1, not change to 3 2. str zStd might not pass
# Which issue does this PR close? - Part of apache#7835 # Rationale for this change Let's get the code to the people # What changes are included in this PR? Update version number and CHANGELOG. See rendered version here: https://github.com/alamb/arrow-rs/blob/alamb/prepare_57.0.0/CHANGELOG.md # Are these changes tested? N/A # Are there any user-facing changes? No
…he#8635) # Which issue does this PR close? - Closes apache#8218 # Rationale for this change # What changes are included in this PR? Split the `builder.rs` to three files, `/builder/list`, `/builder/object` and `builder/metadata` # Are these changes tested? Yes # Are there any user-facing changes? No
# Which issue does this PR close? Related to: - apache#7456 - apache#8565 # Rationale for this change Improve the performance in ParquetRecoredBatchReader, especially when the `rowselector` is short. - By changing a hash map to a enum array # What changes are included in this PR? For `parquet/src/arrow/array_reader/cached_array_reader.rs`, update the hash function # Are these changes tested? The hashmaps are already covered by existing tests. Also tested by manual read parquets. # Are there any user-facing changes? No # Performance results in arrow_reader_row_filter.rs on my 3950X Benchmark | Change | Verdict -- | -- | -- int64 == 9999 / all_columns / async | 🟢 -1.61% | Improved int64 == 9999 / all_columns / sync | 🔴 +1.56% | Regressed int64 == 9999 / exclude_filter_column / async | 🟢 -1.11% | Improved int64 == 9999 / exclude_filter_column / sync | ⚪ -0.97% | Within noise float64 > 99.0 / all_columns / async | 🟢 -6.25% | Improved float64 > 99.0 / all_columns / sync | 🟢 -11.24% | Improved float64 > 99.0 / exclude_filter_column / async | 🟢 -11.10% | Improved float64 > 99.0 / exclude_filter_column / sync | 🟢 -3.31% | Improved ts ≥ 9000 / all_columns / async | 🔴 +2.77% | Regressed ts ≥ 9000 / all_columns / sync | ⚪ -0.06% | Within noise ts ≥ 9000 / exclude_filter_column / async | 🟢 -2.54% | Improved ts ≥ 9000 / exclude_filter_column / sync | ⚪ +0.28% | Within noise int64 > 90 / all_columns / async | 🟢 -14.68% | Improved int64 > 90 / all_columns / sync | 🟢 -21.00% | Improved int64 > 90 / exclude_filter_column / async | 🟢 -17.66% | Improved int64 > 90 / exclude_filter_column / sync | 🟢 -14.53% | Improved float64 ≤ 99.0 / all_columns / async | 🟢 -9.20% | Improved float64 ≤ 99.0 / all_columns / sync | 🟢 -11.07% | Improved float64 ≤ 99.0 / exclude_filter_column / async | 🟢 -10.01% | Improved float64 ≤ 99.0 / exclude_filter_column / sync | 🟢 -11.80% | Improved ts < 9000 / all_columns / async | 🟢 -3.43% | Improved ts < 9000 / all_columns / sync | 🟢 -6.23% | Improved ts < 9000 / exclude_filter_column / async | 🟢 -4.00% | Improved ts < 9000 / exclude_filter_column / sync | 🟢 -3.91% | Improved utf8View <> '' / all_columns / async | 🟢 -16.56% | Improved utf8View <> '' / all_columns / sync | 🟢 -12.10% | Improved utf8View <> '' / exclude_filter_column / async | 🟢 -13.00% | Improved utf8View <> '' / exclude_filter_column / sync | 🟢 -17.29% | Improved float64 > 99.0 AND ts ≥ 9000 / all_columns / async | 🔴 +3.51% | Regressed float64 > 99.0 AND ts ≥ 9000 / all_columns / sync | 🟢 -2.19% | Improved float64 > 99.0 AND ts ≥ 9000 / exclude_filter_column / async | 🟢 -2.63% | Improved float64 > 99.0 AND ts ≥ 9000 / exclude_filter_column / sync | 🟢 -2.72% | Improved
…e#8656) # Which issue does this PR close? N/A # Rationale for this change doing `OffsetBuffer::from_lengths(std::iter::repeat_n(size, value.len()));` does not utilize SIMD (I explain further if you want) See [GodBolt Link](https://godbolt.org/z/PTsfvfjqx) Extracted from: - apache#8653 After this and the pr below is merged will improve the datafusion scalar to array to use this and make it really really fast: - apache#8658 # What changes are included in this PR? added new function # Are these changes tested? yes # Are there any user-facing changes? yes
…<repeat>));` with `OffsetBuffer::from_repeated_length(<val>, <repeat>);` (apache#8669) # Which issue does this PR close? N/A # Rationale for this change Use the dedicated faster function for creating offset with the same length # What changes are included in this PR? replace ```rust OffsetBuffer::from_lengths(std::iter::repeat_n(<val>, <repeat>)); ``` with ```rust OffsetBuffer::from_repeated_length(<val>, <repeat>); ``` # Are these changes tested? Existing tests # Are there any user-facing changes? Nope ---- Related to: - apache#8656
# Which issue does this PR close? N/A # Rationale for this change I want to repeat the same value multiple times in a very fast way which will be used in: - apache#8653 After this and the pr below is merged will improve the datafusion scalar to array to use this and make it really really fast: - apache#8656 # What changes are included in this PR? Created a function in `MutableBuffer` to repeat a slice a number of times in a logarithmic way to reduce memcopy calls # Are these changes tested? Yes # Are there any user-facing changes? Yes, and added docs ------- Extracted from: - apache#8653 Benchmark results on local machine | Slice Length | Repetitions (n) | repeat_slice_n_times | extend_from_slice loop | Speedup | |--------------|-----------------|----------------------|------------------------|---------| | 3 | 3 | 47.092 ns | 41.910 ns | 0.89x | | 3 | 64 | 63.548 ns | 222.29 ns | 3.50x | | 3 | 1024 | 105.57 ns | 3.031 µs | 28.7x | | 3 | 8192 | 405.71 ns | 24.170 µs | 59.6x | | 20 | 3 | 48.437 ns | 46.437 ns | 0.96x | | 20 | 64 | 74.993 ns | 319.04 ns | 4.25x | | 20 | 1024 | 350.94 ns | 4.437 µs | 12.6x | | 20 | 8192 | 2.440 µs | 35.524 µs | 14.6x | | 100 | 3 | 50.369 ns | 47.568 ns | 0.94x | | 100 | 64 | 119.70 ns | 165.37 ns | 1.38x | | 100 | 1024 | 1.734 µs | 2.623 µs | 1.51x | | 100 | 8192 | 10.615 µs | 19.750 µs | 1.86x | these are the results: <details> <summary>Result</summary> ``` MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=3 time: [46.719 ns 47.092 ns 47.453 ns] Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=3 time: [41.833 ns 41.910 ns 41.996 ns] Found 11 outliers among 100 measurements (11.00%) 9 (9.00%) high mild 2 (2.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=64 time: [62.935 ns 63.548 ns 64.183 ns] Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=64 time: [221.75 ns 222.29 ns 222.86 ns] Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=1024 time: [105.15 ns 105.57 ns 106.01 ns] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=1024 time: [3.0240 µs 3.0308 µs 3.0395 µs] Found 11 outliers among 100 measurements (11.00%) 2 (2.00%) low mild 5 (5.00%) high mild 4 (4.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=8192 time: [401.57 ns 405.71 ns 409.94 ns] Found 6 outliers among 100 measurements (6.00%) 6 (6.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=8192 time: [24.124 µs 24.170 µs 24.222 µs] Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=3 time: [48.287 ns 48.437 ns 48.606 ns] Found 8 outliers among 100 measurements (8.00%) 5 (5.00%) high mild 3 (3.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=3 time: [46.289 ns 46.437 ns 46.611 ns] Found 6 outliers among 100 measurements (6.00%) 3 (3.00%) high mild 3 (3.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=64 time: [74.625 ns 74.993 ns 75.395 ns] Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=64 time: [318.20 ns 319.04 ns 319.98 ns] Found 8 outliers among 100 measurements (8.00%) 3 (3.00%) high mild 5 (5.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=1024 time: [346.66 ns 350.94 ns 355.17 ns] Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) low mild 2 (2.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=1024 time: [4.4251 µs 4.4369 µs 4.4506 µs] Found 8 outliers among 100 measurements (8.00%) 1 (1.00%) low mild 2 (2.00%) high mild 5 (5.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=8192 time: [2.4336 µs 2.4401 µs 2.4465 µs] Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=8192 time: [35.466 µs 35.524 µs 35.589 µs] Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=3 time: [50.209 ns 50.369 ns 50.530 ns] Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=3 time: [47.439 ns 47.568 ns 47.701 ns] Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=64 time: [117.77 ns 119.70 ns 122.00 ns] Found 12 outliers among 100 measurements (12.00%) 7 (7.00%) high mild 5 (5.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=64 time: [164.88 ns 165.37 ns 166.07 ns] Found 6 outliers among 100 measurements (6.00%) 5 (5.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=1024 time: [1.7278 µs 1.7335 µs 1.7398 µs] Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 5 (5.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=1024 time: [2.6176 µs 2.6232 µs 2.6305 µs] Found 5 outliers among 100 measurements (5.00%) 1 (1.00%) high mild 4 (4.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=8192 time: [10.583 µs 10.615 µs 10.649 µs] Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=8192 time: [19.471 µs 19.750 µs 20.185 µs] Found 9 outliers among 100 measurements (9.00%) 2 (2.00%) high mild 7 (7.00%) high severe ``` </details>
# Which issue does this PR close? - Followup of apache#8552. # Rationale for this change Code cleanup and optimization # What changes are included in this PR? Addressed the post-comments in apache#8552 and refactor/optimize the method `rescale_decimal` # Are these changes tested? Covered by existing tests # Are there any user-facing changes? No
…che#8638) # Which issue does this PR close? - Closes apache#8637. - Support Variant to Arrow for Null/Time/Decimlal{4,8,16} # What changes are included in this PR? - Add logic in `typed_value_to_variant`/`PrimitiveVariantToArrowRowBuilder` for `Null/Time/Decimal{4,8,16}` - Implement `PrimitiveFromVariant` for `Time64MicrosecondType` - Add tests to cover the added logic - # Are these changes tested? Added some tests # Are there any user-facing changes? No
# Which issue does this PR close? Add utf8-view support for json key # Rationale for this change Add utf8-view support for json key # What changes are included in this PR? Add utf8-view support for json key # Are these changes tested? * [x] TODO # Are there any user-facing changes? No
# Which issue does this PR close? - Closes apache#8674 # Rationale for this change Add json encoding for binary view # What changes are included in this PR? Add BinaryViewEncoder # Are these changes tested? * [x] TODO # Are there any user-facing changes? No
# Which issue does this PR close? - part of apache#7835 # Rationale for this change We added a new crate so let's add that to the instructions too # What changes are included in this PR? # Are these changes tested? # Are there any user-facing changes?
…che#8706) # Which issue does this PR close? N/A # Rationale for this change It is not obvious that the thrift macros produce public enums only (e.g. see apache#8680 (comment)). This should be made clear in the documentation. # What changes are included in this PR? Add said clarification. # Are these changes tested? Documentation only, so no tests required. # Are there any user-facing changes? No, only changes to private documentation
# Which issue does this PR close? - Closes apache#8691 # Rationale for this change The `README.md` file in `arrow-avro` instructs users to install version 56. This is invalid and should be changed to version 57. # What changes are included in this PR? Updated the `README.md` file to reference version 57. # Are these changes tested? N/A since this a small `README.md` file change. # Are there any user-facing changes? The `README.md` file in `arrow-avro` now instructs users to install version 57. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close? chore: add test case of `RowSelection::trim` # Rationale for this change Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. # What changes are included in this PR? There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. # Are these changes tested? We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? # Are there any user-facing changes? If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out.
…DING_SLOTS (apache#8663) # Which issue does this PR close? - Closes [apache#8662] # Rationale for this change Related to apache#8607 We need to know how many encoding are support to create a decoder slot. # What changes are included in this PR? Update the `thrift_enum` to know the fields count of enum `Encoding`, and the value is passed to `EncodingMask` And the `ENCODING_SLOTS` # Are these changes tested? 1. Originally I think add a UT can prevent failure after the new encoding are introduced, then I realized the counts are already transferred, the UT is not required, the original tests can already cover the code. # Are there any user-facing changes? No
- A follow up from apache#8625 # Rationale for this change While working on a separate task, I noticed `create_test_variant_array` was redundant. Since `VariantArray` can already be constructed directly from an iterator of Variants, this PR removes the now-unnecessary test helper.
# Which issue does this PR close? - Closes apache#8685. # What changes are included in this PR? In the implementation of `RowConverter::from_binary`, the `BinaryArray` is broken into parts and an attempt is made to convert the data buffer into `Vec` at no copying cost with `Buffer::into_vec`. Only if this fails, the data is copied out for a newly allocated `Vec`. # Are these changes tested? Passes existing tests using `RowConverter::from_binary`, which all convert a non-shared buffer taking advantage of the optimization. Another test is added to cover the copying path. # Are there any user-facing changes? No
# Which issue does this PR close? - Closes apache#8692. # Rationale for this change Explained in issue. # What changes are included in this PR? - Adds `FilterPredicate::filter_record_batch` - Adapts the free function `filter_record_batch` to use the new function - Uses `new_unchecked` to create the filtered result. The rationale for this is identical to apache#8583 # Are these changes tested? Covered by existing tests for `filter_record_batch` # Are there any user-facing changes? No --------- Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
…quet metadata (apache#9008) # Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Part of apache#5853. # Rationale for this change Add ability to skip the decoding of more types of statistics contained in the Parquet column metadata. While this currently doesn't have a huge impact on decode time, it can reduce the amount of memory used by the `ParquetMetaData`. # What changes are included in this PR? Adds more options and tests for those options. Also adds size statistics to the metadata bench. # Are these changes tested? Yes # Are there any user-facing changes? Only adds new options, no breaking changes.
Use the suggested Arc<[Buffer]> storage for ViewArray storage instead of an owned Vec<Buffer> so that the slice clone does not allocate. # Which issue does this PR close? - Closes apache#6408 . --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…mns (apache#9081) # Which issue does this PR close? N/A # Rationale for this change I noticed that converting around 50 columns the conversion become very slow, so adding a benchmark as I'm optimizing those parts # What changes are included in this PR? added new benchmark for `row_format` that convert 50 columns arrays # Are these changes tested? N/A # Are there any user-facing changes? Nope --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…he#9112) # Which issue does this PR close? - Part of apache#9061 # Rationale for this change @tustvold says apache#9058 (comment) > My 2 cents is it would be better to move the codepaths relying on ArrayData over to using the typed arrays directly, this should not only cut down on allocations but unnecessary validation and dispatch overheads. Let's make sure the documentation also encodes this wisdom # What changes are included in this PR? Improve the documentation for `make_array` to include the information about `make_array` # Are these changes tested? By CI # Are there any user-facing changes? Only docs and an example, no functional changes
…#8916) # Which issue does this PR close? None. # Rationale for this change 1. Keep consistent with `FlightClient`. 2. We need inner error (for example `tonic::Status`) to know if we should retry, but ArrowError stores string of `tonic::Status`. # What changes are included in this PR? 1. Let flight sql client returns `FlightError` . 2. Cleanup code. # Are these changes tested? CI # Are there any user-facing changes? Yes. --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close? - Closes apache#9135 # Rationale for this change The code was calling `.reserve(batch_size)` which reserves space to at least `batch_size` additional elements (https://doc.rust-lang.org/std/vec/struct.Vec.html#method.reserve). This also improves performance a bit: ``` filter: primitive, 8192, nulls: 0, selectivity: 0.001 time: [59.509 ms 59.660 ms 59.856 ms] change: [−3.0781% −2.7917% −2.4795%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe filter: primitive, 8192, nulls: 0, selectivity: 0.01 time: [6.0072 ms 6.0226 ms 6.0428 ms] change: [−8.7042% −7.1161% −6.0455%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 2 (2.00%) high mild 2 (2.00%) high severe Benchmarking filter: primitive, 8192, nulls: 0, selectivity: 0.1: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.5s, enable flat sampling, or reduce sample count to 50. filter: primitive, 8192, nulls: 0, selectivity: 0.1 time: [1.8664 ms 1.8709 ms 1.8772 ms] change: [−15.187% −14.905% −14.632%] (p = 0.00 < 0.05) Performance has improved. Found 11 outliers among 100 measurements (11.00%) 5 (5.00%) high mild 6 (6.00%) high severe filter: primitive, 8192, nulls: 0, selectivity: 0.8 time: [2.5191 ms 2.5444 ms 2.5717 ms] change: [−13.064% −11.414% −10.003%] (p = 0.00 < 0.05) Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high severe Benchmarking filter: primitive, 8192, nulls: 0.1, selectivity: 0.001: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.7s, or reduce sample count to 60. filter: primitive, 8192, nulls: 0.1, selectivity: 0.001 time: [76.422 ms 76.671 ms 76.973 ms] change: [−5.5096% −4.0229% −2.8048%] (p = 0.00 < 0.05) Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe filter: primitive, 8192, nulls: 0.1, selectivity: 0.01 time: [10.197 ms 10.228 ms 10.262 ms] change: [−3.6627% −3.0569% −2.4919%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild filter: primitive, 8192, nulls: 0.1, selectivity: 0.1 time: [4.6635 ms 4.6750 ms 4.6915 ms] change: [−9.4939% −8.5908% −7.8383%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 5 (5.00%) high mild 2 (2.00%) high severe filter: primitive, 8192, nulls: 0.1, selectivity: 0.8 time: [4.7777 ms 4.8115 ms 4.8467 ms] change: [−9.9226% −9.1384% −8.3813%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild ``` # What changes are included in this PR? Changes it to use `self.views.reserve(self.batch_size - self.views.len())` to avoid allocating more than necessary (i.e. 2x the amount). # Are these changes tested? Existing tests # Are there any user-facing changes?
…ache#9091) # Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Closes #NNN. # Rationale for this change Improve JSON binary decoding performance by avoiding per-value allocations and enabling direct hex decoding into builders. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> # What changes are included in this PR? Optimized binary hex decoding paths to reduce allocations and improve throughput. ``` decode_binary_hex_json time: [3.6780 ms 3.6953 ms 3.7150 ms] change: [−61.051% −60.818% −60.565%] (p = 0.00 < 0.05) Performance has improved. Found 8 outliers among 100 measurements (8.00%) 5 (5.00%) high mild 3 (3.00%) high severe decode_fixed_binary_hex_json time: [4.0404 ms 4.1400 ms 4.2901 ms] change: [−56.149% −55.040% −53.330%] (p = 0.00 < 0.05) Performance has improved. Found 19 outliers among 100 measurements (19.00%) 7 (7.00%) high mild 12 (12.00%) high severe decode_binary_view_hex_json time: [4.3731 ms 4.4242 ms 4.4767 ms] change: [−53.305% −52.771% −52.239%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild ``` <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> # Are these changes tested? Yes <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> # Are there any user-facing changes? No <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. -->
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - related to apache#9085 - related to apache#9022 # Rationale for this change the `nullif` kernel has an optimization for counting nulls as part of applying nullif, and I did not know if that made a difference (turns out it does). I made a benchmark to be able to measure this difference # What changes are included in this PR? Add a `nullif_kernel` benchmark # Are these changes tested? I ran them manually # Are there any user-facing changes? No new benchmark
…pache#9129) # Which issue does this PR close? - part of apache#9128 # Rationale for this change While studying / profiling the Parquet reader I have noticed several places where unecessary allocations are happening # What changes are included in this PR? Avoid ArrayData allocation in `PrimitiveArray::reinterpret_cast` # Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --> --------- Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
# Which issue does this PR close? - Part of apache#9136 # Rationale for this change API for apache#8951 , as part of a 2-3x speedup for filtering primitive types. # What changes are included in this PR? Adds `BooleanBufferBuilder::extend`, a fast way to extend the buffer from an iterator. # Are these changes tested? Yes, new tests # Are there any user-facing changes?
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Closes #NNN. # Rationale for this change One of my new years resolutions is to try and encourage / build the arrow-rs community more. Part of doing so is to lower the barrier for new contributors with clear communication and documentation. Thus I would like to improve the main readme # What changes are included in this PR? Update the README. See rendered preview here: https://github.com/alamb/arrow-rs/blob/alamb/improved_comms_docs/README.md 1. Move community section to the top of the README file 2. Clarify that most communication happens on github <img width="1348" height="775" alt="Screenshot 2026-01-09 at 9 33 06 AM" src="https://github.com/user-attachments/assets/fc3f6041-478d-4170-bb40-3956764a9b8d" /> # Are these changes tested? CI # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --> --------- Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
…pache#9134) # Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Closes apache#9133 # Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Allow casting null arrays to all types we support. We missed (large) list view, run end encoded and union. # What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> Refactor from null arm to accept all target types to enable casting to large list view, list view, run end encoded and union types. # Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Added tests. # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --> No.
… return type (apache#9139) ### Which issue does this PR close? Closes apache#9105 ### Rationale for this change The documentation for `VariantObject::get` previously described `Result`-style semantics (`Ok(None)` / `Err`), but the method actually returns `Option<Variant>`. This mismatch could confuse users of the API. ### What changes are included in this PR? - Update the documentation for `VariantObject::get` to correctly describe its `Option` return type. - No functional or behavioral changes are included.
…ray` (apache#9114) # Which issue does this PR close? - Part of apache#9061 - broken out of apache#9058 # Rationale for this change The current implementation of `make_array` for StructArray and GenericByteViewArray clones `ArrayData` which allocates a new Vec. This is unnecessary given that `make_array` is passed an owned ArrayData # What changes are included in this PR? 1. Add a new API to ArrayData to break it down into parts (`into_parts`) 2. Use that API to avoid cloning while constructing StructArray and GenericByteViewArray # Are these changes tested? Yes by CI # Are there any user-facing changes? A few fewer allocations when creating arrays --------- Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Closes apache#9147 . # Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> # What changes are included in this PR? Check issue <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> # Are these changes tested? It's a test <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> # Are there any user-facing changes? No <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. -->
Closes apache#9131 This PR updates the Apache Software Foundation copyright year in `arrow-rs/NOTICE.txt`, as discussed in the release verification follow-up. No third-party entries are modified.
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Part of apache#9018 . # Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> To consider offset in slicing of RunArray. # What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> 1. Considered offset in slicing of RunArray. 2. Enhanced RunArray slice API. # Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 3. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> yes # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --> Yes, extended API to access RunArray slices directly than getting it from index.
…e#9146) # Which issue does this PR close? No known issue, but this confirms that nested dicts do indeed work in Map and Union arrays. It was brought up here: apache#9126 (comment) # Rationale for this change Ensure that IPC roundtripping nested dicts works in Map and Union arrays. # What changes are included in this PR? Unit tests testing the functionality. # Are these changes tested? The whole PR consists of tests only. # Are there any user-facing changes? No @alamb @Jefffrey
# Which issue does this PR close? N/A # Rationale for this change allow to reserve so we can avoid reallocating # What changes are included in this PR? added `reserve` function to `Rows` + tests # Are these changes tested? yes # Are there any user-facing changes? yes
…ow conversion (apache#9080) # Which issue does this PR close? N/A # Rationale for this change Making the row length calculation faster which result in faster row conversion # What changes are included in this PR? 1. Instead of iterating over the bytes and getting the length from the byte slice, we use the offsets directly, this is faster as it saves us going to the buffer 2. Added new API for `GenericByteViewArray` (explained below) # Are these changes tested? Yes # Are there any user-facing changes? Yes, added `lengths` function to `GenericByteViewArray` to get an iterator over the lengths of the items in the array ----- Related to: - apache#9078 - apache#9079 --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…rayData` (apache#9122) # Which issue does this PR close? - related to apache#9061 - Part of apache#9128 # Rationale for this change - similarly to apache#9120 Creating Arrays via ArrayData / `make_array` has overhead (at least 2 Vec allocations) compared to simply creating the arrays directly # What changes are included in this PR? Update the parquet reader to create `PrimitiveArray`s directly # Are these changes tested? By CI # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. -->
# Which issue does this PR close? - Closes apache#9096. # Rationale for this change The RowFilter API does exist and can evaluate predicates during evaluation, but it has no examples. # What changes are included in this PR? - Added a rustdoc example and blog link to `ParquetRecordBatchReaderBuilder::with_row_filter`. - Added a running example in `parquet/examples/read_with_row_filter.rs` # Are these changes tested? Yes ``` cargo run -p parquet --example read_with_row_filter cargo test -p parquet --doc ``` # Are there any user-facing changes? Yes, doc only. No API changes.
…ta` (1% improvement) (apache#9120) # Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - part of apache#9061 - Part of apache#9128 # Rationale for this change I noticed on apache#9061 that there is non trivial overhead to create struct arrays. I am trying to improve `make_array` in parallel, but @tustvold had an even better idea in apache#9058 (comment) > My 2 cents is it would be better to move the codepaths relying on ArrayData over to using the typed arrays directly, this should not only cut down on allocations but unnecessary validation and dispatch overheads. # What changes are included in this PR? Update the parquet `StructArray` reader (used for the top level RecordBatch) to directly construct StructArray rather than using ArrayData # Are these changes tested? By existing CI Benchmarks show a small repeatable improvement of a few percent. For example ``` arrow_reader_clickbench/async/Q10 1.00 12.7±0.35ms ? ?/sec 1.02 12.9±0.44ms ? ?/sec ``` I am pretty sure this is because the click bench dataset has more than 100 columns. Creating such a struct array requires cloning 100 `ArrayData` (one for each child) which each has a Vec<Buffers>. So this saves (at least) 100 allocations per batch # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. -->
| name: Test | ||
| runs-on: ubuntu-latest | ||
| container: | ||
| image: amd64/rust | ||
| steps: | ||
| - uses: actions/checkout@v6 | ||
| with: | ||
| submodules: true | ||
| - name: Setup Rust toolchain | ||
| uses: ./.github/actions/setup-builder | ||
| - name: Test parquet-geospatial | ||
| run: cargo test -p parquet-geospatial | ||
|
|
||
| # test compilation | ||
| linux-features: |
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 months ago
In general, the fix is to explicitly set GITHUB_TOKEN permissions in the workflow to the minimal scope required. For this workflow, the jobs only check out code and run Rust commands; they do not write to the repository or interact with issues, PRs, or releases. Therefore contents: read is sufficient. The cleanest fix without changing any functionality is to add a root-level permissions: block that applies to all jobs.
Concretely, in .github/workflows/parquet-geospatial.yml, add a permissions: block near the top of the workflow (for example, after the name: or after concurrency:) with contents: read. This will apply to all three jobs (linux-test, linux-features, clippy) because none of them define their own permissions. No imports or additional methods are needed, as this is purely a YAML configuration change.
| @@ -23,6 +23,9 @@ | ||
| group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }} | ||
| cancel-in-progress: true | ||
|
|
||
| permissions: | ||
| contents: read | ||
|
|
||
| # trigger for all PRs that touch certain files and changes to main | ||
| on: | ||
| push: |
| name: Check Compilation | ||
| runs-on: ubuntu-latest | ||
| container: | ||
| image: amd64/rust | ||
| steps: | ||
| - uses: actions/checkout@v6 | ||
| with: | ||
| submodules: true | ||
| - name: Setup Rust toolchain | ||
| uses: ./.github/actions/setup-builder | ||
| - name: Check compilation (parquet-geospatial) | ||
| run: cargo check -p parquet-geospatial | ||
|
|
||
| clippy: |
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 months ago
In general, the fix is to add an explicit permissions block to the workflow (at the top level or for each job) that grants only the minimal scopes required. Here, all jobs simply check out code and run Rust commands; they only need read access to repository contents. The single best fix is to add a workflow-level permissions block with contents: read, which will apply to all jobs that don’t override it. This should be added near the top of .github/workflows/parquet-geospatial.yml, after the name: declaration and before concurrency: or on:. No additional imports or external dependencies are required, and existing behavior will remain unchanged because none of the jobs used elevated permissions.
| @@ -19,6 +19,9 @@ | ||
| # tests for parquet-geospatial crate | ||
| name: "parquet-geospatial" | ||
|
|
||
| permissions: | ||
| contents: read | ||
|
|
||
| concurrency: | ||
| group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }} | ||
| cancel-in-progress: true |
| name: Clippy | ||
| runs-on: ubuntu-latest | ||
| container: | ||
| image: amd64/rust | ||
| steps: | ||
| - uses: actions/checkout@v6 | ||
| - name: Setup Rust toolchain | ||
| uses: ./.github/actions/setup-builder | ||
| - name: Setup Clippy | ||
| run: rustup component add clippy | ||
| - name: Run clippy (parquet-geospatial) | ||
| run: cargo clippy -p parquet-geospatial --all-targets --all-features -- -D warnings |
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 5 months ago
To fix the problem, explicitly declare a minimal permissions block so the workflow does not inherit potentially broad repository defaults. Since all three jobs (linux-test, linux-features, clippy) only check out code and run cargo commands, they only need read access to the repository contents.
The best way without changing functionality is to add a workflow-level permissions section (so it applies to all jobs) near the top of the file, setting contents: read. This preserves all current behavior (the jobs still run exactly the same steps) but ensures the automatically provided GITHUB_TOKEN cannot write to the repository or perform other unnecessary actions. Concretely, in .github/workflows/parquet-geospatial.yml, add:
permissions:
contents: readafter the name: "parquet-geospatial" line (or before concurrency: / on:). No imports or additional methods are needed because this is a YAML configuration change only.
| @@ -19,6 +19,9 @@ | ||
| # tests for parquet-geospatial crate | ||
| name: "parquet-geospatial" | ||
|
|
||
| permissions: | ||
| contents: read | ||
|
|
||
| concurrency: | ||
| group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }} | ||
| cancel-in-progress: true |
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
Closes #NNN.
Rationale for this change
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
What changes are included in this PR?
There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.