
feat: Arrow-direct codegen dispatcher for Spark expressions and Scala UDFs #4267

Draft
mbutrovich wants to merge 40 commits into apache:main from mbutrovich:codegen_scala_udf

Conversation


@mbutrovich mbutrovich commented May 8, 2026

Draft while we discuss with #4233 and #4239.

Which issue does this PR close?

Closes #.

Rationale for this change

#4232 merged the JVM UDF bridge. This PR adds a codegen dispatcher on top: one generic CometUDF that compiles a specialized batch kernel per bound Catalyst Expression + input schema via Janino.

Benefits:

  • Any supported ScalaUDF or Catalyst expression routes through native without a hand-written CometUDF.
  • UDFs stop being opaque operator boundaries; ScalaUDFs and Catalyst expressions share one expression tree, so Comet keeps surrounding native operators in place.
  • An entire expression subtree compiles into one per-row loop with stack-local intermediates instead of per-subexpression Arrow batches.

Opt-in via spark.comet.exec.codegenDispatch.mode = auto | force | disabled. Primary targets: string expressions and user ScalaUDFs.
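The "one per-row loop with stack-local intermediates" point above can be pictured with a hypothetical sketch of the kind of kernel the dispatcher might emit for an expression tree like upper(concat(a, b)). The class and method names are illustrative only (the real generated source reads and writes Arrow vectors; plain arrays stand in here so the sketch is self-contained):

```java
// Hypothetical shape of a Janino-compiled kernel for upper(concat(a, b)).
// Both subexpressions evaluate into stack locals inside one fused loop,
// so no intermediate batch is materialized for concat(a, b).
public class GeneratedKernelSketch {
    public static String[] process(String[] a, String[] b, int numRows) {
        String[] out = new String[numRows];
        for (int row = 0; row < numRows; row++) {
            // NullIntolerant short-circuit: any null input yields a null output.
            if (a[row] == null || b[row] == null) {
                out[row] = null;
                continue;
            }
            String concat = a[row] + b[row];   // stack-local intermediate
            out[row] = concat.toUpperCase();   // consumed in the same iteration
        }
        return out;
    }

    public static void main(String[] args) {
        String[] r = process(new String[] {"foo", null}, new String[] {"Bar", "x"}, 2);
        System.out.println(r[0] + "," + r[1]); // FOOBAR,null
    }
}
```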

What changes are included in this PR?

  • Codegen dispatcher: CometBatchKernelCodegen (orchestrator) + CometBatchKernelCodegenInput / CometBatchKernelCodegenOutput (per-side emission) + CometCodegenDispatchUDF (bridge entry, three-layer cache).
  • Complex type support: ArrayType, StructType, and MapType as both input and output, including arbitrary nesting. Sealed ArrowColumnSpec + recursive nested-class emission.
  • Per-expression specialization: direct-bytes RegExpReplace emitter bypassing the UTF8String round-trip Spark's doGenCode forces.
  • Optimization set applied per (expression, input schema):
      • zero-copy UTF8 reads (VarCharVector / ViewVarCharVector)
      • non-nullable isNullAt elision
      • decimal short-precision fast path on both sides
      • UTF8 on-heap write shortcut
      • pre-sized variable-length output buffers
      • NullIntolerant short-circuit
      • non-nullable output short-circuit
      • subexpression elimination
    Complex-type output writes hoist getChildByOrdinal + cast into once-per-batch setup, so the per-row body has no runtime type dispatch and no redundant casts.
  • Bridge contract additions: numRows parameter (zero-column expressions); TaskContext propagation across JNI so partition-sensitive expressions (Rand, Uuid, MonotonicallyIncreasingID, user UDFs reading TaskContext.get()) see the Spark task context from the Tokio worker.
  • Serde routing: CometScalaUDF routes any ScalaUDF; the regex family (rlike, regexp_replace, regexp_extract, regexp_extract_all, regexp_instr, split) gets a uniform pickWithMode switch; native Rust paths preserved where they exist. Proto-building factored into CodegenDispatchSerdeHelpers.buildJvmUdfExpr.
  • Allocation reuses Utils.toArrowField + Field.createVector for every output type. Input spec derives Spark DataTypes via Utils.fromArrowField. Exception paths close partially-allocated vectors to avoid leaks.
  • Docs split: docs/source/user-guide/latest/jvm_udf_dispatch.md (configuration, supported expressions and types, regex routing matrix, behavioral limitations); docs/source/contributor-guide/jvm_udf_dispatch.md (architecture, optimizations, caching, CSE rationale, WSCG-exploration notes, open items cross-referencing in-code TODOs, file map). Both ASCII-only.
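Since specialization is keyed per (expression, input schema), compiled kernels have to be memoized. The PR description mentions a three-layer cache in CometCodegenDispatchUDF but does not show its structure, so the key shape and names below are assumptions; this collapses the idea into a single map to show why compilation runs at most once per distinct key:

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical kernel cache keyed on (expression fingerprint, schema fingerprint).
// Stands in for the three-layer cache described in the PR; the compile step
// below is a placeholder for Janino compilation of the specialized source.
public class KernelCacheSketch {
    static final class Key {
        final String exprFingerprint;
        final String schemaFingerprint;
        Key(String e, String s) { exprFingerprint = e; schemaFingerprint = s; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return exprFingerprint.equals(k.exprFingerprint)
                && schemaFingerprint.equals(k.schemaFingerprint);
        }
        @Override public int hashCode() {
            return Objects.hash(exprFingerprint, schemaFingerprint);
        }
    }

    private final Map<Key, String> compiled = new ConcurrentHashMap<>();
    int compileCount = 0;

    String kernelFor(String expr, String schema) {
        return compiled.computeIfAbsent(new Key(expr, schema), k -> {
            compileCount++; // placeholder for the expensive Janino compile
            return "kernel[" + expr + "|" + schema + "]";
        });
    }

    public static void main(String[] args) {
        KernelCacheSketch c = new KernelCacheSketch();
        c.kernelFor("upper(a)", "string:a");
        c.kernelFor("upper(a)", "string:a"); // cache hit, no recompilation
        System.out.println(c.compileCount);   // 1
    }
}
```

Keying on the schema as well as the expression matters because the emitted reads differ per input layout (e.g. VarCharVector vs ViewVarCharVector, nullable vs non-nullable).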

How are these changes tested?

  • CometCodegenSourceSuite - generated-source assertions for every optimization and every complex-type shape.
  • CometCodegenDispatchSmokeSuite - end-to-end correctness across the scalar and complex type surface (primitives, decimal precision boundaries, date/timestamp/timestampNTZ, array/struct/map round-trips including nested shapes), composed-UDF trees, subquery reuse, TaskContext propagation.
  • CometCodegenDispatchFuzzSuite - randomized string fuzz + decimal identity fuzz at several null densities.
  • CometRegExpJvmSuite - SQL-level Spark-vs-Comet correctness for the regex family.
  • CometScalaUDFCompositionBenchmark - Spark vs Comet native built-ins vs dispatcher disabled vs dispatcher force over three shapes.

@mbutrovich (Contributor, Author) commented:

There are around four Spark SQL test failures that look like they might need updating, but otherwise it's looking good. Not going to worry about them until we discuss moving forward.

* generated subclass is not thread-safe across concurrent {@code process} calls, so kernels are
* allocated per dispatcher invocation and init is run once on the fresh instance.
*/
public void init(int partitionIndex) {}
Member:
nit: I think moving init before process helps with reading this

*/
trait CometUDF {
def evaluate(inputs: Array[ValueVector]): ValueVector
def evaluate(inputs: Array[ValueVector], numRows: Int): ValueVector
Member:
Would it be worth creating a separate PR with this change?

Contributor Author:

Yes, I am peeling this and the TaskContext change off into their own PRs.

val REGEXP_ENGINE_RUST = "rust"
val REGEXP_ENGINE_JAVA = "java"

val COMET_REGEXP_ENGINE: ConfigEntry[String] =
Member:

Using the regexp work to test the new framework makes sense, but I think we should split this work out into a follow on PR

Member:

We still need compatibility docs for regexp for the Rust path

Comment on lines +130 to +135
keyUnwrapper,
// Capture the Spark task thread's TaskContext at `createPlan` time. Stashed native-side
// in the ExecutionContext and passed through the JVM UDF bridge so that Tokio workers
// running JVM UDFs see the real `TaskContext` via their thread-local. See
// `CometUdfBridge.evaluate` and `CometTaskContextShim` for the receive side.
TaskContext.get())
Member:

could this change be a separate pr?

Contributor Author:

Yes, peeling this off as a separate PR.
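The TaskContext propagation being split out here can be pictured as an install-and-restore on the worker thread's thread-local: capture the context on the Spark task thread, then install it around each UDF call on the Tokio worker so TaskContext.get() resolves. This is a minimal sketch using a generic ThreadLocal as a stand-in for Spark's internal TaskContext slot; CometTaskContextShim's actual mechanism is not shown in this PR excerpt:

```java
import java.util.function.Supplier;

// Hypothetical receive-side shim: the captured task context is installed into
// the calling thread's thread-local for the duration of the UDF call, then the
// previous value is restored so the worker thread is left clean.
public class TaskContextShimSketch {
    static final ThreadLocal<Object> CURRENT = new ThreadLocal<>();

    static Object runWithContext(Object capturedCtx, Supplier<Object> udf) {
        Object previous = CURRENT.get();
        CURRENT.set(capturedCtx); // the worker now "sees" the Spark task context
        try {
            return udf.get();
        } finally {
            if (previous == null) CURRENT.remove();
            else CURRENT.set(previous);
        }
    }

    public static void main(String[] args) {
        Object ctx = new Object();
        Object seen = runWithContext(ctx, CURRENT::get);
        System.out.println(seen == ctx);          // true: UDF saw the context
        System.out.println(CURRENT.get() == null); // true: restored afterward
    }
}
```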
