From 16534567c8f198180175ab630510e7719340fa55 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Tue, 28 Apr 2026 10:39:59 -0600 Subject: [PATCH 1/4] docs: clarify when scalar function serde must set return type Document the failure mode where a Comet scalar function name collides with a DataFusion built-in that has different arity or input types. Without an explicit return type on the proto, the native planner consults DataFusion's UDF registry first and signature mismatches fail before Comet's UDF is swapped in. The classic case is levenshtein, where Spark's optional 3-arg threshold form fails because DF's built-in is 2-arg only. Add the same pointer to the review skill so reviewers catch this on incoming PRs. --- .claude/skills/review-comet-pr/SKILL.md | 1 + .../adding_a_new_expression.md | 34 +++++++++++++++++++ 2 files changed, 35 insertions(+) diff --git a/.claude/skills/review-comet-pr/SKILL.md b/.claude/skills/review-comet-pr/SKILL.md index 8c77605076..39429b7f9a 100644 --- a/.claude/skills/review-comet-pr/SKILL.md +++ b/.claude/skills/review-comet-pr/SKILL.md @@ -259,6 +259,7 @@ If the PR adds a new expression or operator but does not update the relevant doc 4. **Tests in wrong framework**: Expression tests should use the SQL file-based framework (`CometSqlFileTestSuite`) rather than adding to Scala test suites like `CometExpressionSuite`. Suggest migration if the PR adds Scala tests for expressions that could use SQL files instead. 5. **Stale native code**: PR might need `./mvnw install -pl common -DskipTests` 6. **Missing `getSupportLevel`**: Edge cases should be marked as `Incompatible` +7. **Scalar function name collides with a DataFusion built-in**: If the PR registers a Spark function whose name is also defined by `datafusion-functions` (e.g. `levenshtein`, `concat`, `coalesce`, `sha2`, `regexp_replace`), check that the serde sets the return type explicitly via `scalarFunctionExprToProtoWithReturnType` rather than `scalarFunctionExprToProto` or the bare `CometScalarFunction(name)` shortcut. Without an explicit return type, the native planner consults DataFusion's UDF registry first for type resolution, and any arity or input-type difference between the Spark and DataFusion versions will fail native execution with `Error from DataFusion: Function 'X' expects N arguments but received M`. The Comet UDF is only swapped in *after* DF's signature validation passes. See the "When to set the return type explicitly" section in `docs/source/contributor-guide/adding_a_new_expression.md`. --- diff --git a/docs/source/contributor-guide/adding_a_new_expression.md b/docs/source/contributor-guide/adding_a_new_expression.md index 57131d0091..4b0a46a011 100644 --- a/docs/source/contributor-guide/adding_a_new_expression.md +++ b/docs/source/contributor-guide/adding_a_new_expression.md @@ -84,6 +84,40 @@ For simple scalar functions that map directly to a DataFusion function, you can classOf[Cos] -> CometScalarFunction("cos") ``` +#### When to set the return type explicitly + +`CometScalarFunction(name)` and the lower-level `scalarFunctionExprToProto(name, args)` helper both produce a protobuf `ScalarFunc` message **without** a `return_type` field. That is fine when the function name does not collide with a DataFusion built-in, or when it does collide and the Spark and DataFusion versions take the same arity and types. In that case the native planner consults DataFusion's UDF registry only to resolve the return type, then swaps in Comet's UDF for execution. 
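For contrast with the explicit-return-type example later in this section, this is roughly what the shortcut path looks like when written out as a custom serde. It is a sketch only: `CometCosExample` is a hypothetical name (in practice the one-line `classOf[Cos] -> CometScalarFunction("cos")` registration shown above already covers this case), and the exact helper signatures should be checked against `QueryPlanSerde.scala`.

```scala
// Hypothetical sketch -- the registration-map shortcut above is the normal way
// to wire up `cos`; this form is shown only to contrast the two helpers.
object CometCosExample extends CometExpressionSerde[Cos] {
  override def convert(
      expr: Cos,
      inputs: Seq[Attribute],
      binding: Boolean): Option[ExprOuterClass.Expr] = {
    val childExprs = expr.children.map(exprToProtoInternal(_, inputs, binding))
    // No return type is set on the proto. That is safe for `cos` because Spark
    // and DataFusion agree on its arity and input types, so DataFusion's
    // registry lookup resolves the return type without a signature mismatch.
    val optExpr = scalarFunctionExprToProto("cos", childExprs: _*)
    optExprWithInfo(optExpr, expr, expr.children: _*)
  }
}
```

When the signatures do differ, this shortcut is exactly what breaks, as described next.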
+ +It is **not** fine when the Spark function and the DataFusion built-in differ in arity or input types. The native planner calls `coerce_types` and `return_field_from_args` on DataFusion's UDF before Comet's UDF is selected, and a signature mismatch fails the query at execution time with an error like: + +``` +org.apache.comet.CometNativeException: Error from DataFusion: +Function 'levenshtein' expects 2 arguments but received 3. +``` + +The classic case is `levenshtein`. Spark accepts an optional 3rd `threshold` argument, DataFusion's built-in is 2-arg only, so the 3-arg form fails native execution unless the serde sets the return type explicitly. Other names that exist in both engines with potentially different signatures include `concat`, `coalesce`, `sha2`, and `regexp_replace`. If you are adding a function whose name is shared with `datafusion-functions`, check the upstream signature before deciding how to serialize. + +To avoid the registry lookup, write a custom `CometExpressionSerde` and use `scalarFunctionExprToProtoWithReturnType`, passing the Spark expression's declared `dataType`: + +```scala +object CometLevenshtein extends CometExpressionSerde[Levenshtein] { + override def convert( + expr: Levenshtein, + inputs: Seq[Attribute], + binding: Boolean): Option[ExprOuterClass.Expr] = { + val childExprs = expr.children.map(exprToProtoInternal(_, inputs, binding)) + val optExpr = scalarFunctionExprToProtoWithReturnType( + "levenshtein", + expr.dataType, + false, + childExprs: _*) + optExprWithInfo(optExpr, expr, expr.children: _*) + } +} +``` + +When the return type is set on the proto, the native planner skips the registry lookup entirely and routes straight to the Comet UDF registered in `create_comet_physical_fun_with_eval_mode`. + #### Registering the Expression Handler Once you've created your `CometExpressionSerde` implementation, register it in `QueryPlanSerde.scala` by adding it to the appropriate expression map (e.g., `mathExpressions`, `stringExpressions`, `predicateExpressions`, etc.): From 4521a6e2c308727f37d4483ef98b01b46ae4d0af Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Tue, 28 Apr 2026 10:43:46 -0600 Subject: [PATCH 2/4] docs: rewrite review skill's documentation check section The old guidance pointed reviewers at compatibility.md and configs.md as files contributors should hand-edit, but both are now generated by GenerateDocs from getIncompatibleReasons / getUnsupportedReasons on the serde and republished by CI on every merge to main. Reframe the section so reviewers check the source (serde reason methods) for generated docs and only ask for hand edits to the still manual pages like expressions.md and the non-expression compatibility pages. --- .claude/skills/review-comet-pr/SKILL.md | 26 ++++++++++++++++++++----- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/.claude/skills/review-comet-pr/SKILL.md b/.claude/skills/review-comet-pr/SKILL.md index 39429b7f9a..b0bf5a5d98 100644 --- a/.claude/skills/review-comet-pr/SKILL.md +++ b/.claude/skills/review-comet-pr/SKILL.md @@ -243,13 +243,29 @@ gh pr checks $ARGUMENTS --repo apache/datafusion-comet --failed ### 7. Documentation Check -Check whether the PR requires updates to user-facing documentation in `docs/`: +Some user-facing docs are auto-generated from the serde. Others are hand-edited. Treat them differently. -- **Compatibility guide** (`docs/source/user-guide/compatibility.md`): New expressions or operators should be listed. Incompatible behaviors should be documented. 
-- **Configuration guide** (`docs/source/user-guide/configs.md`): New config options should be documented. -- **Expressions list** (`docs/source/user-guide/expressions.md`): New expressions should be added. +**Generated by `GenerateDocs`** — do NOT ask the contributor to edit these by hand. CI regenerates and publishes them on every merge to `main`: -If the PR adds a new expression or operator but does not update the relevant docs, flag this as something that needs to be addressed. +- Compatibility guide pages under `docs/source/user-guide/latest/compatibility/expressions/` (`math.md`, `datetime.md`, `array.md`, `string.md`, `aggregate.md`, `struct.md`, `map.md`, `misc.md`, `cast.md`) +- Configuration reference at `docs/source/user-guide/latest/configs.md` + +For these, check the *source* instead. Does the new or modified `CometExpressionSerde` provide accurate `getIncompatibleReasons()` and `getUnsupportedReasons()` strings? Each returned string is rendered as a bullet on the corresponding compat page. Common gaps to flag: + +- Expression marked `Incompatible(Some("..."))` in `getSupportLevel` but `getIncompatibleReasons()` is empty, so the compat page shows it as supported with no caveats. +- `Unsupported(Some("..."))` for specific data types or argument shapes but no `getUnsupportedReasons()` to surface the limitation to users. +- Reason strings drifting from the `notes` argument passed to `Compatible` / `Incompatible` / `Unsupported`. They do not have to match exactly, but consistency helps users. +- Reason strings that are too terse to be useful in user-facing docs (a single word, no context, no link to a tracking issue when behavior is known to differ). + +See `docs/source/contributor-guide/adding_a_new_expression.md` (sections "Documenting Incompatible and Unsupported Reasons") for the contract these methods follow. + +**Hand-edited** — PR should update if relevant: + +- `docs/source/user-guide/latest/expressions.md` — the supported-expressions list. New expressions belong here. +- Other `latest/compatibility/` pages such as `floating-point.md`, `operators.md`, `regex.md`, `scans.md`. +- Top-level user-guide pages such as `iceberg.md`, `installation.md`, `tuning.md` when the PR changes user-visible behavior. + +If the PR adds a new expression but does not update `expressions.md`, flag that. If it touches incompatibility behavior, flag that the serde reasons should reflect the change. ### 8. Common Comet Review Issues From 783df6ac70885aabf96b6aad1d1b40faf240d813 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Tue, 28 Apr 2026 16:36:05 -0600 Subject: [PATCH 3/4] prettier --- .claude/skills/review-comet-pr/SKILL.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/.claude/skills/review-comet-pr/SKILL.md b/.claude/skills/review-comet-pr/SKILL.md index b0bf5a5d98..f3c0556796 100644 --- a/.claude/skills/review-comet-pr/SKILL.md +++ b/.claude/skills/review-comet-pr/SKILL.md @@ -68,6 +68,7 @@ For expression PRs, check how similar expressions are implemented in the codebas ``` 3. **Read the Spark implementation carefully.** Pay attention to: + - The `eval` and `doGenEval`/`nullSafeEval` methods. These define the exact behavior. - The `inputTypes` and `dataType` fields. These define which types Spark accepts and what it returns. - Null handling. Does it use `nullable = true`? Does `nullSafeEval` handle nulls implicitly? @@ -82,6 +83,7 @@ For expression PRs, check how similar expressions are implemented in the codebas ``` 5. 
**Compare the Spark behavior against the Comet implementation in the PR.** Identify: + - Edge cases tested in Spark but not in the PR - Data types supported in Spark but not handled in the PR - Behavioral differences that should be marked `Incompatible` @@ -95,12 +97,14 @@ For expression PRs, check how similar expressions are implemented in the codebas For expression PRs, verify against the Spark source you read in step 2: 1. **Check edge cases** + - Null handling - Overflow behavior - Empty input behavior - Type-specific behavior 2. **Verify all data types are handled** + - Does Spark support this type? (Check `inputTypes` in Spark source) - Does the PR handle all Spark-supported types? @@ -216,11 +220,13 @@ SELECT known_buggy_expr(v) FROM test_table 2. **Look for a microbenchmark implementation.** Expression benchmarks live in `spark/src/test/scala/org/apache/spark/sql/benchmark/`. Check whether the PR adds a benchmark for the new expression. 3. **Review the benchmark results if provided:** + - Is Comet actually faster than Spark for this expression? - Are the benchmarks representative? They should test with realistic data sizes, not just trivial inputs. - Are different data types benchmarked if the expression supports multiple types? 4. **Review the Rust implementation for performance concerns:** + - Unnecessary allocations or copies - Row-by-row processing where batch/array operations are possible - Redundant type conversions @@ -250,7 +256,7 @@ Some user-facing docs are auto-generated from the serde. Others are hand-edited. - Compatibility guide pages under `docs/source/user-guide/latest/compatibility/expressions/` (`math.md`, `datetime.md`, `array.md`, `string.md`, `aggregate.md`, `struct.md`, `map.md`, `misc.md`, `cast.md`) - Configuration reference at `docs/source/user-guide/latest/configs.md` -For these, check the *source* instead. Does the new or modified `CometExpressionSerde` provide accurate `getIncompatibleReasons()` and `getUnsupportedReasons()` strings? Each returned string is rendered as a bullet on the corresponding compat page. Common gaps to flag: +For these, check the _source_ instead. Does the new or modified `CometExpressionSerde` provide accurate `getIncompatibleReasons()` and `getUnsupportedReasons()` strings? Each returned string is rendered as a bullet on the corresponding compat page. Common gaps to flag: - Expression marked `Incompatible(Some("..."))` in `getSupportLevel` but `getIncompatibleReasons()` is empty, so the compat page shows it as supported with no caveats. - `Unsupported(Some("..."))` for specific data types or argument shapes but no `getUnsupportedReasons()` to surface the limitation to users. @@ -275,7 +281,7 @@ If the PR adds a new expression but does not update `expressions.md`, flag that. 4. **Tests in wrong framework**: Expression tests should use the SQL file-based framework (`CometSqlFileTestSuite`) rather than adding to Scala test suites like `CometExpressionSuite`. Suggest migration if the PR adds Scala tests for expressions that could use SQL files instead. 5. **Stale native code**: PR might need `./mvnw install -pl common -DskipTests` 6. **Missing `getSupportLevel`**: Edge cases should be marked as `Incompatible` -7. **Scalar function name collides with a DataFusion built-in**: If the PR registers a Spark function whose name is also defined by `datafusion-functions` (e.g. 
`levenshtein`, `concat`, `coalesce`, `sha2`, `regexp_replace`), check that the serde sets the return type explicitly via `scalarFunctionExprToProtoWithReturnType` rather than `scalarFunctionExprToProto` or the bare `CometScalarFunction(name)` shortcut. Without an explicit return type, the native planner consults DataFusion's UDF registry first for type resolution, and any arity or input-type difference between the Spark and DataFusion versions will fail native execution with `Error from DataFusion: Function 'X' expects N arguments but received M`. The Comet UDF is only swapped in *after* DF's signature validation passes. See the "When to set the return type explicitly" section in `docs/source/contributor-guide/adding_a_new_expression.md`. +7. **Scalar function name collides with a DataFusion built-in**: If the PR registers a Spark function whose name is also defined by `datafusion-functions` (e.g. `levenshtein`, `concat`, `coalesce`, `sha2`, `regexp_replace`), check that the serde sets the return type explicitly via `scalarFunctionExprToProtoWithReturnType` rather than `scalarFunctionExprToProto` or the bare `CometScalarFunction(name)` shortcut. Without an explicit return type, the native planner consults DataFusion's UDF registry first for type resolution, and any arity or input-type difference between the Spark and DataFusion versions will fail native execution with `Error from DataFusion: Function 'X' expects N arguments but received M`. The Comet UDF is only swapped in _after_ DF's signature validation passes. See the "When to set the return type explicitly" section in `docs/source/contributor-guide/adding_a_new_expression.md`. --- From 2be1941208e0c324272d1a8cb587743f1321aa5c Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Tue, 28 Apr 2026 20:58:58 -0600 Subject: [PATCH 4/4] style: apply prettier formatting --- .claude/skills/review-comet-pr/SKILL.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/.claude/skills/review-comet-pr/SKILL.md b/.claude/skills/review-comet-pr/SKILL.md index f3c0556796..1cda5f6b58 100644 --- a/.claude/skills/review-comet-pr/SKILL.md +++ b/.claude/skills/review-comet-pr/SKILL.md @@ -68,7 +68,6 @@ For expression PRs, check how similar expressions are implemented in the codebas ``` 3. **Read the Spark implementation carefully.** Pay attention to: - - The `eval` and `doGenEval`/`nullSafeEval` methods. These define the exact behavior. - The `inputTypes` and `dataType` fields. These define which types Spark accepts and what it returns. - Null handling. Does it use `nullable = true`? Does `nullSafeEval` handle nulls implicitly? @@ -83,7 +82,6 @@ For expression PRs, check how similar expressions are implemented in the codebas ``` 5. **Compare the Spark behavior against the Comet implementation in the PR.** Identify: - - Edge cases tested in Spark but not in the PR - Data types supported in Spark but not handled in the PR - Behavioral differences that should be marked `Incompatible` @@ -97,14 +95,12 @@ For expression PRs, check how similar expressions are implemented in the codebas For expression PRs, verify against the Spark source you read in step 2: 1. **Check edge cases** - - Null handling - Overflow behavior - Empty input behavior - Type-specific behavior 2. **Verify all data types are handled** - - Does Spark support this type? (Check `inputTypes` in Spark source) - Does the PR handle all Spark-supported types? @@ -220,13 +216,11 @@ SELECT known_buggy_expr(v) FROM test_table 2. 
**Look for a microbenchmark implementation.** Expression benchmarks live in `spark/src/test/scala/org/apache/spark/sql/benchmark/`. Check whether the PR adds a benchmark for the new expression. 3. **Review the benchmark results if provided:** - - Is Comet actually faster than Spark for this expression? - Are the benchmarks representative? They should test with realistic data sizes, not just trivial inputs. - Are different data types benchmarked if the expression supports multiple types? 4. **Review the Rust implementation for performance concerns:** - - Unnecessary allocations or copies - Row-by-row processing where batch/array operations are possible - Redundant type conversions