Conversation

@Jefffrey
Contributor

Which issue does this PR close?

N/A

Rationale for this change

Simplify code.

What changes are included in this PR?

See comments.

Are these changes tested?

Yes.

Are there any user-facing changes?

Removed some public APIs, but they weren't meant to be public anyway.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Jan 28, 2026

Self {
signature: Signature::one_of(variants, Volatility::Immutable),
aliases: vec![],
Contributor Author

No aliases, remove this

fn aliases(&self) -> &[String] {
&self.aliases
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
make_scalar_function(compute_hex, vec![Hint::AcceptsSingular])(&args.args)
Contributor Author

Using Hint::AcceptsSingular simplifies how we handle scalars; we could work with ColumnarValue directly, but for now we can make use of make_scalar_function.
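
Roughly, the behavior being leaned on here (a sketch of the idea only, not the actual datafusion_functions implementation) is that a scalar argument gets turned into a one-row array before the array kernel runs, and the single result row is converted back into a scalar afterwards:

```rust
use std::sync::Arc;

use arrow::array::ArrayRef;
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::ColumnarValue;

// Sketch only: approximates what make_scalar_function with
// Hint::AcceptsSingular does for a single-argument function.
fn call_array_kernel(
    kernel: impl Fn(&[ArrayRef]) -> Result<ArrayRef>,
    arg: &ColumnarValue,
) -> Result<ColumnarValue> {
    match arg {
        // Array input: run the kernel directly.
        ColumnarValue::Array(array) => {
            Ok(ColumnarValue::Array(kernel(&[Arc::clone(array)])?))
        }
        // Scalar input: AcceptsSingular means the scalar becomes a one-row
        // array, the kernel runs once, and the single result row is
        // unwrapped back into a scalar.
        ColumnarValue::Scalar(scalar) => {
            let array = scalar.to_array()?;
            let result = kernel(&[array])?;
            Ok(ColumnarValue::Scalar(ScalarValue::try_from_array(&result, 0)?))
        }
    }
}
```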

i -= 1;
buffer[i] = HEX_CHARS_UPPER[(n & 0xF) as usize];
n >>= 4;
fn hex_values(array: &ArrayRef) -> Result<ArrayRef> {
Contributor Author

Deduplicating this match arm (previously the dictionary case duplicated the same code)

}

/// Hex encoding lookup tables for fast byte-to-hex conversion
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";
Contributor Author

The lowercase table was only meant for sha2; remove it here, since sha2 wasn't actually using it from this module.
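
For reference, a minimal self-contained sketch of the lookup-table encoding that the loop above uses (the function name and zero handling are illustrative, not the PR's exact code; negative values are treated as their unsigned 64-bit pattern):

```rust
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";

fn hex_int64(v: i64) -> String {
    // Treat the value as an unsigned 64-bit bit pattern.
    let mut n = v as u64;
    if n == 0 {
        return "0".to_string();
    }
    // 16 hex digits is the maximum for a 64-bit value.
    let mut buffer = [0u8; 16];
    let mut i = buffer.len();
    while n > 0 {
        i -= 1;
        buffer[i] = HEX_CHARS_UPPER[(n & 0xF) as usize];
        n >>= 4;
    }
    // The buffer only ever holds ASCII, so this is valid UTF-8.
    String::from_utf8_lossy(&buffer[i..]).into_owned()
}
```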

fn hex_encode_bytes<'a, I, T>(iter: I) -> Result<ArrayRef>
where
I: Iterator<Item = Option<T>>,
I: ExactSizeIterator<Item = Option<T>>,
Contributor Author

Using ExactSizeIterator lets us get the length from the iterator, instead of passing it separately
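
An illustrative sketch of the pattern (the function name and return type here are hypothetical, not the PR's actual helper): the ExactSizeIterator bound lets the output be pre-sized from iter.len() rather than from a separately passed length.

```rust
// Sketch only: pre-size the output using the iterator's own length.
fn hex_encode_strings<'a, I>(iter: I) -> Vec<Option<String>>
where
    I: ExactSizeIterator<Item = Option<&'a [u8]>>,
{
    let mut out = Vec::with_capacity(iter.len());
    for item in iter {
        out.push(item.map(|bytes| {
            bytes.iter().map(|b| format!("{b:02X}")).collect::<String>()
        }));
    }
    out
}
```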

use datafusion_expr::ColumnarValue;

#[test]
fn test_dictionary_hex_utf8() {
Contributor Author

Moving tests to SLTs

FROM VALUES (1), (2), (NULL), (3)
) SELECT hex(column1), arrow_typeof(hex(column1)) FROM values
----
1 Utf8
Contributor Author

Ideally these would stay dictionary-encoded, but there's a known issue with passing dictionary types via the coercible API.

@Jefffrey marked this pull request as ready for review January 28, 2026 13:24
}

/// Hex encoding lookup tables for fast byte-to-hex conversion
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";
Contributor

👍

Contributor

@comphead left a comment

Thanks @Jefffrey, great PR and I appreciate the commented changes. WDYT about the hex bench results? datafusion/spark/benches/hex.rs

@Jefffrey
Contributor Author

Thanks @Jefffrey, great PR and I appreciate the commented changes. WDYT about the hex bench results? datafusion/spark/benches/hex.rs

I ran the bench on my M4 Mac:

     Running benches/hex.rs (/Users/jeffrey/.cargo_target_cache/release/deps/hex-093ddc801ba8123c)
Gnuplot not found, using plotters backend
hex_int64/size=1024     time:   [6.3674 µs 6.4616 µs 6.5677 µs]
                        change: [+5.4026% +8.6451% +12.018%] (p = 0.00 < 0.05)
                        Performance has regressed.

hex_int64/size=4096     time:   [27.309 µs 27.679 µs 28.046 µs]
                        change: [−1.2305% +0.6185% +2.2777%] (p = 0.49 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

hex_int64/size=8192     time:   [57.803 µs 58.466 µs 59.090 µs]
                        change: [+5.0492% +5.9280% +6.9853%] (p = 0.00 < 0.05)
                        Performance has regressed.

hex_utf8/size=1024      time:   [41.243 µs 41.363 µs 41.477 µs]
                        change: [−0.1689% +0.1669% +0.4908%] (p = 0.32 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  12 (12.00%) high mild

hex_utf8/size=4096      time:   [163.62 µs 164.08 µs 164.50 µs]
                        change: [−0.4326% −0.0739% +0.2639%] (p = 0.68 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  7 (7.00%) high mild

hex_utf8/size=8192      time:   [326.48 µs 327.31 µs 328.12 µs]
                        change: [−0.4520% −0.1235% +0.1939%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild

hex_binary/size=1024    time:   [41.158 µs 41.225 µs 41.286 µs]
                        change: [−0.4329% −0.1577% +0.1180%] (p = 0.27 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

hex_binary/size=4096    time:   [163.86 µs 164.18 µs 164.48 µs]
                        change: [−0.5476% −0.2517% +0.0122%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild

hex_binary/size=8192    time:   [326.72 µs 327.45 µs 328.29 µs]
                        change: [−0.3658% −0.0302% +0.3973%] (p = 0.89 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

hex_int64_dict/size=1024
                        time:   [6.6908 µs 6.8738 µs 7.0860 µs]
                        change: [−2.2874% −0.2044% +1.5752%] (p = 0.84 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

hex_int64_dict/size=4096
                        time:   [27.934 µs 28.394 µs 28.926 µs]
                        change: [−1.8159% −0.5246% +0.7127%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

hex_int64_dict/size=8192
                        time:   [55.631 µs 56.239 µs 56.968 µs]
                        change: [−5.1281% −2.8229% −0.6794%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

I find the supposed int64 regressions surprising; the main hex loop wasn't changed apart from simplifying how the length is passed in. Maybe it was already so fast that going through make_scalar_function() and the altered match path to reach the hot loop adds this extra overhead? 🤔

