diff --git a/.claude/skills/review-comet-pr/SKILL.md b/.claude/skills/review-comet-pr/SKILL.md
index 8c77605076..1cda5f6b58 100644
--- a/.claude/skills/review-comet-pr/SKILL.md
+++ b/.claude/skills/review-comet-pr/SKILL.md
@@ -243,13 +243,29 @@ gh pr checks $ARGUMENTS --repo apache/datafusion-comet --failed

 ### 7. Documentation Check

-Check whether the PR requires updates to user-facing documentation in `docs/`:
+Some user-facing docs are auto-generated from the serde. Others are hand-edited. Treat them differently.

-- **Compatibility guide** (`docs/source/user-guide/compatibility.md`): New expressions or operators should be listed. Incompatible behaviors should be documented.
-- **Configuration guide** (`docs/source/user-guide/configs.md`): New config options should be documented.
-- **Expressions list** (`docs/source/user-guide/expressions.md`): New expressions should be added.
+**Generated by `GenerateDocs`** — do NOT ask the contributor to edit these by hand. CI regenerates and publishes them on every merge to `main`:

-If the PR adds a new expression or operator but does not update the relevant docs, flag this as something that needs to be addressed.
+- Compatibility guide pages under `docs/source/user-guide/latest/compatibility/expressions/` (`math.md`, `datetime.md`, `array.md`, `string.md`, `aggregate.md`, `struct.md`, `map.md`, `misc.md`, `cast.md`)
+- Configuration reference at `docs/source/user-guide/latest/configs.md`
+
+For these, check the _source_ instead. Does the new or modified `CometExpressionSerde` provide accurate `getIncompatibleReasons()` and `getUnsupportedReasons()` strings? Each returned string is rendered as a bullet on the corresponding compat page. Common gaps to flag:
+
+- Expression marked `Incompatible(Some("..."))` in `getSupportLevel` but `getIncompatibleReasons()` is empty, so the compat page shows it as supported with no caveats.
+- `Unsupported(Some("..."))` for specific data types or argument shapes but no `getUnsupportedReasons()` to surface the limitation to users.
+- Reason strings drifting from the `notes` argument passed to `Compatible` / `Incompatible` / `Unsupported`. They do not have to match exactly, but consistency helps users.
+- Reason strings that are too terse to be useful in user-facing docs (a single word, no context, no link to a tracking issue when behavior is known to differ).
+
+See `docs/source/contributor-guide/adding_a_new_expression.md` (section "Documenting Incompatible and Unsupported Reasons") for the contract these methods follow.
+
+**Hand-edited** — PR should update if relevant:
+
+- `docs/source/user-guide/latest/expressions.md` — the supported-expressions list. New expressions belong here.
+- Other `latest/compatibility/` pages such as `floating-point.md`, `operators.md`, `regex.md`, `scans.md`.
+- Top-level user-guide pages such as `iceberg.md`, `installation.md`, `tuning.md` when the PR changes user-visible behavior.
+
+If the PR adds a new expression but does not update `expressions.md`, flag that. If it touches incompatibility behavior, flag that the serde reasons should reflect the change.

 ### 8. Common Comet Review Issues

@@ -259,6 +275,7 @@ If the PR adds a new expression or operator but does not update the relevant doc
 4. **Tests in wrong framework**: Expression tests should use the SQL file-based framework (`CometSqlFileTestSuite`) rather than adding to Scala test suites like `CometExpressionSuite`. Suggest migration if the PR adds Scala tests for expressions that could use SQL files instead.
 5. **Stale native code**: PR might need `./mvnw install -pl common -DskipTests`
 6. **Missing `getSupportLevel`**: Edge cases should be marked as `Incompatible`
+7. **Scalar function name collides with a DataFusion built-in**: If the PR registers a Spark function whose name is also defined by `datafusion-functions` (e.g.
`levenshtein`, `concat`, `coalesce`, `sha2`, `regexp_replace`), check that the serde sets the return type explicitly via `scalarFunctionExprToProtoWithReturnType` rather than `scalarFunctionExprToProto` or the bare `CometScalarFunction(name)` shortcut. Without an explicit return type, the native planner consults DataFusion's UDF registry first for type resolution, and any arity or input-type difference between the Spark and DataFusion versions will fail native execution with `Error from DataFusion: Function 'X' expects N arguments but received M`. The Comet UDF is only swapped in _after_ DF's signature validation passes. See the "When to set the return type explicitly" section in `docs/source/contributor-guide/adding_a_new_expression.md`.

---

diff --git a/docs/source/contributor-guide/adding_a_new_expression.md b/docs/source/contributor-guide/adding_a_new_expression.md
index 57131d0091..4b0a46a011 100644
--- a/docs/source/contributor-guide/adding_a_new_expression.md
+++ b/docs/source/contributor-guide/adding_a_new_expression.md
@@ -84,6 +84,40 @@ For simple scalar functions that map directly to a DataFusion function, you can
 classOf[Cos] -> CometScalarFunction("cos")
 ```

+#### When to set the return type explicitly
+
+`CometScalarFunction(name)` and the lower-level `scalarFunctionExprToProto(name, args)` helper both produce a protobuf `ScalarFunc` message **without** a `return_type` field. That is fine when the function name does not collide with a DataFusion built-in, or when it does collide and the Spark and DataFusion versions take the same arity and types. In that case the native planner consults DataFusion's UDF registry only to resolve the return type, then swaps in Comet's UDF for execution.
+
+It is **not** fine when the Spark function and the DataFusion built-in differ in arity or input types.
The native planner calls `coerce_types` and `return_field_from_args` on DataFusion's UDF before Comet's UDF is selected, and a signature mismatch fails the query at execution time with an error like:
+
+```
+org.apache.comet.CometNativeException: Error from DataFusion:
+Function 'levenshtein' expects 2 arguments but received 3.
+```
+
+The classic case is `levenshtein`. Spark accepts an optional third `threshold` argument, while DataFusion's built-in is 2-arg only, so the 3-arg form fails native execution unless the serde sets the return type explicitly. Other names that exist in both engines with potentially different signatures include `concat`, `coalesce`, `sha2`, and `regexp_replace`. If you are adding a function whose name is shared with `datafusion-functions`, check the upstream signature before deciding how to serialize.
+
+To avoid the registry lookup, write a custom `CometExpressionSerde` and use `scalarFunctionExprToProtoWithReturnType`, passing the Spark expression's declared `dataType`:
+
+```scala
+object CometLevenshtein extends CometExpressionSerde[Levenshtein] {
+  override def convert(
+      expr: Levenshtein,
+      inputs: Seq[Attribute],
+      binding: Boolean): Option[ExprOuterClass.Expr] = {
+    val childExprs = expr.children.map(exprToProtoInternal(_, inputs, binding))
+    val optExpr = scalarFunctionExprToProtoWithReturnType(
+      "levenshtein",
+      expr.dataType,
+      false,
+      childExprs: _*)
+    optExprWithInfo(optExpr, expr, expr.children: _*)
+  }
+}
+```
+
+When the return type is set on the proto, the native planner skips the registry lookup entirely and routes straight to the Comet UDF registered in `create_comet_physical_fun_with_eval_mode`.
+
 #### Registering the Expression Handler

 Once you've created your `CometExpressionSerde` implementation, register it in `QueryPlanSerde.scala` by adding it to the appropriate expression map (e.g., `mathExpressions`, `stringExpressions`, `predicateExpressions`, etc.):
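Continuing the `CometLevenshtein` example, a registration is a single map entry keyed by the Spark expression class. The sketch below is a self-contained illustration only: the trait, expression class, and map defined here are minimal stand-ins, not Comet's actual definitions from `QueryPlanSerde.scala`.

```scala
// Minimal stand-ins (NOT Comet's real types) so the registration
// pattern can be shown and run on its own.
trait CometExpressionSerde[T]
case class Levenshtein(left: String, right: String)
object CometLevenshtein extends CometExpressionSerde[Levenshtein]

// In QueryPlanSerde.scala the real entry goes into an existing map
// such as stringExpressions, alongside entries like
// classOf[Cos] -> CometScalarFunction("cos").
val stringExpressions: Map[Class[_], CometExpressionSerde[_]] =
  Map(classOf[Levenshtein] -> CometLevenshtein)
```

At plan-serialization time the serde framework looks up the incoming Spark expression's class in these maps, so an unregistered serde is silently never invoked; forgetting this entry is worth checking for in review.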