docs: improve review skill and contributor guide for serde patterns#4132
Merged
andygrove merged 4 commits intoapache:mainfrom Apr 29, 2026
Merged
docs: improve review skill and contributor guide for serde patterns#4132andygrove merged 4 commits intoapache:mainfrom
andygrove merged 4 commits intoapache:mainfrom
Conversation
Document the failure mode where a Comet scalar function name collides with a DataFusion built-in that has different arity or input types. Without an explicit return type on the proto, the native planner consults DataFusion's UDF registry first and signature mismatches fail before Comet's UDF is swapped in. The classic case is levenshtein, where Spark's optional 3-arg threshold form fails because DF's built-in is 2-arg only. Add the same pointer to the review skill so reviewers catch this on incoming PRs.
The old guidance pointed reviewers at compatibility.md and configs.md as files contributors should hand-edit, but both are now generated by GenerateDocs from getIncompatibleReasons / getUnsupportedReasons on the serde and republished by CI on every merge to main. Reframe the section so reviewers check the source (serde reason methods) for generated docs and only ask for hand edits to the still manual pages like expressions.md and the non-expression compatibility pages.
coderfender
approved these changes
Apr 28, 2026
Contributor
|
LGTM @andygrove . We might need to run prettier to fix markdown formatting ? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
N/A. Documentation-only follow-up to issues observed while reviewing scalar function PRs (e.g. #4105 for
levenshtein).Rationale for this change
Two related cleanups to the contributor guide and the local
review-comet-prskill:Document when a scalar function serde must set its return type explicitly. When a Comet scalar function shares its name with a
datafusion-functionsbuilt-in (e.g.levenshtein,concat,coalesce,sha2,regexp_replace), the serde must usescalarFunctionExprToProtoWithReturnType, notscalarFunctionExprToProtoor the bareCometScalarFunction(name)shortcut. Without an explicit return type the native planner consults DataFusion's UDF registry first to resolve the return type, and any arity or input-type difference between the Spark and DataFusion versions fails native execution at runtime with errors like:This is subtle because the matching-arity path happens to work even without an explicit return type. It only blows up when Spark's signature is wider than DF's, which is exactly when a contributor is most likely to hit it without warning. The contributor guide already mentioned
scalarFunctionExprToProtoWithReturnTypein passing but did not explain when it is required versus optional.Refresh the review skill's documentation-check guidance. The existing instructions told reviewers to flag missing hand edits to
compatibility.mdandconfigs.md. Both of those paths are stale. The compatibility guide pages (docs/source/user-guide/latest/compatibility/expressions/*.md) and the configuration reference (docs/source/user-guide/latest/configs.md) are now generated byGenerateDocsfromgetIncompatibleReasons/getUnsupportedReasonson the serde and republished by CI on every merge tomain. Asking contributors to edit them by hand is wrong. The skill now distinguishes generated docs (where reviewers should check the serde reason methods instead) from the still hand-edited pages likeexpressions.md.What changes are included in this PR?
docs/source/contributor-guide/adding_a_new_expression.md: new "When to set the return type explicitly" subsection that explains the failure mode, lists known colliding names, and shows aCometLevenshtein-style example usingscalarFunctionExprToProtoWithReturnType..claude/skills/review-comet-pr/SKILL.md:expressions.md, etc.), with concrete things to look for in reason strings.How are these changes tested?
Documentation-only change. No tests required. The Sphinx build will validate the Markdown on the next docs publish.