-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Integrate CastColumnExpr into PhysicalExprAdapter with Owned Cast Options and Nullable-Aware Schema Rewriting
#20038
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
kosiew
wants to merge
48
commits into
apache:main
Choose a base branch
from
kosiew:castintegration-17330
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+1,893
−151
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Extend cast unwrapping and interval support checks to treat CastColumnExpr like other casting nodes. Update ordering equivalence substitution to recognize widening CastColumnExpr projections when mapping orderings. Add unit tests to cover CastColumnExpr behavior in simplification, interval support, and equivalence projection flows.
Implement a round-trip test for CastColumnExpr that checks the preservation of target field name and type changes while ensuring the child column identity remains intact after serialization. This improves the robustness of the serialization process for CastColumnExpr.
Extend physical expression protobuf schema to include PhysicalCastOptions, FormatOptions, and DurationFormat. Add PhysicalCastColumnNode with cast options support. Update serialization/deserialization for proper round- tripping of cast expressions with options, including legacy field fallback.
Expose configured cast options via a new accessor on CastColumnExpr for consumers like proto serialization. Update cast-column serialization to utilize the actual options for safe, format_options, and cast_options fields instead of relying on defaults.
Validate input and target types in CastColumnExpr::new, including struct compatibility checks and castability verification. Update schema rewriting and proto deserialization to accommodate the new constructor behavior. Ensure robust error handling during type casting operations.
Add a new CastColumnExpr::new_with_schema constructor that accepts and stores the input schema. Document the column-only helper for single-field validation paths. Update CastColumnExpr construction to include full input schemas during schema rewriting and proto parsing, ensuring correct type resolution.
Simplify CastColumnExpr constructor to ensure format options are always present by using the FormatOptionsSlot trait. Add new_with_schema method for cases requiring full schema. Update schema_rewriter.rs to properly wrap Schema references in Arc. Add default format options fallback in serialization for CastColumnExpr to protobuf.
Mutate serialized proto to remove format_options, ensuring deserializer correctly falls back to DEFAULT_CAST_OPTIONS. This change helps verify robustness of the cast handling logic.
Centralize cast option normalization and type/struct validation in the CastColumnExpr construction. Update new and new_with_schema to utilize a shared build path while maintaining separate schema setup. This improves code organization and consistency.
Remove the format-options slot trait and introduce a dedicated normalize_cast_options helper. Wire this helper into CastColumnExpr::build to ensure explicit normalization behavior for casting options. This improves code clarity and maintainability.
Own SchemaRef values in the updated physical expression adapter rewriter and pass cloned schema arcs during rewriting. Reuse the existing physical schema Arc when constructing CastColumnExpr to minimize deep cloning and improve performance.
Replace per-call string leaks in format_options_from_proto with an interned cache to reuse stable format strings across calls. Ensure cast_options_from_proto remains on the same path. Add a unit test to verify cache reuse for repeated and distinct format strings.
Expand the rustdoc for CastColumnExpr::new to detail its single-field schema behavior and usage constraints. Clarify when to use new_with_schema for scenarios with broader schema dependencies.
Add internal helper for interned format strings, reusing it for building ArrowFormatOptions. Update format string cache test to compare pointer equality of interned strings, eliminating global cache contention issues.
Ensure columns in CastColumnExpr match the input schema's field name and type at the referenced index. Implement a clear planning error on mismatch. Add a unit test to verify errors when a column's schema field does not align with the provided input field.
Ensure safe schema field access in CastColumnExpr::build and return a structured error when a column index is out of bounds. Add a unit test to validate that out-of-range column indexes produce an error instead of causing a panic during new_with_schema construction.
Document the rationale behind the format-string cache leak. Implement a size cap to prevent further interning when the cache limit is reached. Add tests to validate that interning stops appropriately once the cache is full.
Document the rationale behind the format-string cache leak. Implement a size cap to prevent further interning when the cache limit is reached. Add tests to validate that interning stops appropriately once the cache is full.
Replace the format string cache with a bounded map/queue that evicts older entries. Reuse leaked strings for recent overflow values, integrating this change into intern_format_str. Update the cache limit test to assert pointer reuse for overflow values that remain in the eviction window.
the input schema before resolving the expression data type, preserving existing type compatibility checks and error messages.
Require full Field equality, including nullability and metadata, in CastColumnExpr::build. Expand mismatch error messages to provide detailed attribute information. Update schema-mismatch tests and add a case for nullability and metadata differences.
Return errors when format string cache limit is reached, preventing leaks while still reusing existing entries. Update documentation for clarity. Expand tests to validate over-limit error path and ensure interned strings are reused after limit is hit.
- Simplify validation to only check data type compatibility and column bounds - Remove strict name/nullability/metadata matching to allow schema evolution - Fix SchemaRewriter to use actual physical field at column's index - Update tests to reflect new, more lenient validation behavior
…ache limit checks
…arly return for matching types refactor(cast_column): introduce validate_cast_compatibility function for improved type checking fix(proto): deprecate legacy fields in PhysicalCastColumnNode for backward compatibility
…in schema adaptation
…ng the same field compatibility logic
…time compatibility with protobuf deserialization
…-tripping to collapse duplicate setup across the cast column tests, improving readability and maintainability
…wing * Derive Default for OwnedCastOptions and remove manual impl * Replace 'static CastOptions with borrowed lifetimes across scalar, columnar, and physical exprs * Normalize cast option handling with OwnedCastOptions defaults * Update proto serialization/deserialization to use owned format options * Adjust tests and roundtrip logic to reflect owned cast/format options
…s in OwnedCastOptions and OwnedFormatOptions
91362a8 to
2d114c6
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
common
Related to common crate
core
Core DataFusion crate
datasource
Changes to the datasource crate
logical-expr
Logical plan and expressions
physical-expr
Changes to the physical-expr crates
proto
Related to proto crate
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
The existing
PhysicalExprAdapterand casting infrastructure relied primarily onCastExprwith ArrowCastOptions<'static>, which imposed several limitations:'staticstring lifetimes for format options, making it unsafe or impractical to construct cast options dynamically (e.g. from SQL, protobuf, or IPC).CastExprnodes even when column-aware semantics were required, complicating optimization, equivalence reasoning, interval analysis, and pruning.This PR addresses these issues by fully integrating
CastColumnExprinto the physical planning pipeline, introducing owned cast/format options, and tightening schema- and nullability-aware validation across DataFusion.What changes are included in this PR?
1. Owned cast and format options
OwnedFormatOptionsandOwnedCastOptionsindatafusion-common.FormatOptions<'static>and prevents memory leaks or string interning.CastOptions<'_>for execution.2.
CastColumnExprintegrationPhysicalExprAdapterto emitCastColumnExprinstead ofCastExprfor column casts.validate_field_compatibilityandvalidate_struct_compatibility.3. Schema rewriter cleanup
There are 712 proto generated lines:
4. Optimizer and execution support
Extends:
to recognize and reason about
CastColumnExpr.5. Serialization / deserialization
CastColumnExpr,PhysicalCastOptions, andFormatOptions.safe,format_options).CastColumnExpr.6. Nullability correctness fixes
Are these changes tested?
Yes. This PR adds and updates extensive test coverage, including:
CastColumnExprconstruction, evaluation, and validation.Existing tests were also updated where schema nullability assumptions changed.
Are there any user-facing changes?
There are no intentional breaking API changes, though downstream users relying on internal physical expressions may need to account for
CastColumnExpr.LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed, validated, and tested before submission.