Conversation

@Jefffrey
Contributor

Which issue does this PR close?

N/A

Rationale for this change

Simplify code.

What changes are included in this PR?

See comments.

Are these changes tested?

Yes.

Are there any user-facing changes?

Removed some public APIs, but they weren't meant to be public anyway.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Jan 28, 2026

Self {
signature: Signature::one_of(variants, Volatility::Immutable),
aliases: vec![],
Contributor Author

No aliases, remove this

fn aliases(&self) -> &[String] {
&self.aliases
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
make_scalar_function(compute_hex, vec![Hint::AcceptsSingular])(&args.args)
Contributor Author

Using Hint::AcceptsSingular simplifies how we handle scalars; we could work with ColumnarValue directly, but for now we can make use of make_scalar_function.
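
Roughly, the behavior being leaned on here (a sketch of the idea only, not the actual datafusion_functions implementation) is that a scalar argument gets turned into a one-row array before the array kernel runs, and the single result row is converted back into a scalar afterwards:

```rust
use std::sync::Arc;

use arrow::array::ArrayRef;
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::ColumnarValue;

// Sketch only: approximates what make_scalar_function with
// Hint::AcceptsSingular does for a single-argument function.
fn call_array_kernel(
    kernel: impl Fn(&[ArrayRef]) -> Result<ArrayRef>,
    arg: &ColumnarValue,
) -> Result<ColumnarValue> {
    match arg {
        // Array input: run the kernel directly.
        ColumnarValue::Array(array) => {
            Ok(ColumnarValue::Array(kernel(&[Arc::clone(array)])?))
        }
        // Scalar input: AcceptsSingular means the scalar becomes a one-row
        // array, the kernel runs once, and the single result row is
        // unwrapped back into a scalar.
        ColumnarValue::Scalar(scalar) => {
            let array = scalar.to_array()?;
            let result = kernel(&[array])?;
            Ok(ColumnarValue::Scalar(ScalarValue::try_from_array(&result, 0)?))
        }
    }
}
```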

i -= 1;
buffer[i] = HEX_CHARS_UPPER[(n & 0xF) as usize];
n >>= 4;
fn hex_values(array: &ArrayRef) -> Result<ArrayRef> {
Contributor Author

Deduplicating this match arm (previously the dictionary case duplicated the same code)

}

/// Hex encoding lookup tables for fast byte-to-hex conversion
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";
Contributor Author

The lowercase table was only meant for sha2; remove it here, since sha2 wasn't actually using it from this module.
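
For reference, a minimal self-contained sketch of the lookup-table encoding that the loop above uses (the function name and zero handling are illustrative, not the PR's exact code; negative values are treated as their unsigned 64-bit pattern):

```rust
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";

fn hex_int64(v: i64) -> String {
    // Treat the value as an unsigned 64-bit bit pattern.
    let mut n = v as u64;
    if n == 0 {
        return "0".to_string();
    }
    // 16 hex digits is the maximum for a 64-bit value.
    let mut buffer = [0u8; 16];
    let mut i = buffer.len();
    while n > 0 {
        i -= 1;
        buffer[i] = HEX_CHARS_UPPER[(n & 0xF) as usize];
        n >>= 4;
    }
    // The buffer only ever holds ASCII, so this is valid UTF-8.
    String::from_utf8_lossy(&buffer[i..]).into_owned()
}
```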

fn hex_encode_bytes<'a, I, T>(iter: I) -> Result<ArrayRef>
where
I: Iterator<Item = Option<T>>,
I: ExactSizeIterator<Item = Option<T>>,
Contributor Author

Using ExactSizeIterator lets us get the length from the iterator, instead of passing it separately
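
An illustrative sketch of the pattern (the function name and return type here are hypothetical, not the PR's actual helper): the ExactSizeIterator bound lets the output be pre-sized from iter.len() rather than from a separately passed length.

```rust
// Sketch only: pre-size the output using the iterator's own length.
fn hex_encode_strings<'a, I>(iter: I) -> Vec<Option<String>>
where
    I: ExactSizeIterator<Item = Option<&'a [u8]>>,
{
    let mut out = Vec::with_capacity(iter.len());
    for item in iter {
        out.push(item.map(|bytes| {
            bytes.iter().map(|b| format!("{b:02X}")).collect::<String>()
        }));
    }
    out
}
```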

use datafusion_expr::ColumnarValue;

#[test]
fn test_dictionary_hex_utf8() {
Contributor Author

Moving tests to SLTs

FROM VALUES (1), (2), (NULL), (3)
) SELECT hex(column1), arrow_typeof(hex(column1)) FROM values
----
1 Utf8
Contributor Author

Ideally these would stay dictionary-encoded, but there's a known issue with passing dictionary types via the coercible API.

@Jefffrey marked this pull request as ready for review January 28, 2026 13:24
}

/// Hex encoding lookup tables for fast byte-to-hex conversion
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";
Contributor

👍

Contributor

@comphead left a comment

Thanks @Jefffrey, great PR and I appreciate the commented changes. WDYT about the hex bench results? datafusion/spark/benches/hex.rs

@Jefffrey
Contributor Author

Thanks @Jefffrey, great PR and I appreciate the commented changes. WDYT about the hex bench results? datafusion/spark/benches/hex.rs

I ran the bench on my M4 Mac:

     Running benches/hex.rs (/Users/jeffrey/.cargo_target_cache/release/deps/hex-093ddc801ba8123c)
Gnuplot not found, using plotters backend
hex_int64/size=1024     time:   [6.3674 µs 6.4616 µs 6.5677 µs]
                        change: [+5.4026% +8.6451% +12.018%] (p = 0.00 < 0.05)
                        Performance has regressed.

hex_int64/size=4096     time:   [27.309 µs 27.679 µs 28.046 µs]
                        change: [−1.2305% +0.6185% +2.2777%] (p = 0.49 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

hex_int64/size=8192     time:   [57.803 µs 58.466 µs 59.090 µs]
                        change: [+5.0492% +5.9280% +6.9853%] (p = 0.00 < 0.05)
                        Performance has regressed.

hex_utf8/size=1024      time:   [41.243 µs 41.363 µs 41.477 µs]
                        change: [−0.1689% +0.1669% +0.4908%] (p = 0.32 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  12 (12.00%) high mild

hex_utf8/size=4096      time:   [163.62 µs 164.08 µs 164.50 µs]
                        change: [−0.4326% −0.0739% +0.2639%] (p = 0.68 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  7 (7.00%) high mild

hex_utf8/size=8192      time:   [326.48 µs 327.31 µs 328.12 µs]
                        change: [−0.4520% −0.1235% +0.1939%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild

hex_binary/size=1024    time:   [41.158 µs 41.225 µs 41.286 µs]
                        change: [−0.4329% −0.1577% +0.1180%] (p = 0.27 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

hex_binary/size=4096    time:   [163.86 µs 164.18 µs 164.48 µs]
                        change: [−0.5476% −0.2517% +0.0122%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild

hex_binary/size=8192    time:   [326.72 µs 327.45 µs 328.29 µs]
                        change: [−0.3658% −0.0302% +0.3973%] (p = 0.89 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

hex_int64_dict/size=1024
                        time:   [6.6908 µs 6.8738 µs 7.0860 µs]
                        change: [−2.2874% −0.2044% +1.5752%] (p = 0.84 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

hex_int64_dict/size=4096
                        time:   [27.934 µs 28.394 µs 28.926 µs]
                        change: [−1.8159% −0.5246% +0.7127%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

hex_int64_dict/size=8192
                        time:   [55.631 µs 56.239 µs 56.968 µs]
                        change: [−5.1281% −2.8229% −0.6794%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

I find the supposed int64 regressions surprising; the main hex loop wasn't changed apart from simplifying how the length is passed in. Maybe it was already so fast that going through make_scalar_function() and the altered match path to reach the hot loop adds this extra overhead? 🤔

