@kosiew kosiew commented Jan 28, 2026

Which issue does this PR close?


Rationale for this change

The existing PhysicalExprAdapter and casting infrastructure relied primarily on CastExpr with Arrow CastOptions<'static>, which imposed several limitations:

  • It required 'static string lifetimes for format options, making it unsafe or impractical to construct cast options dynamically (e.g. from SQL, protobuf, or IPC).
  • Struct-aware casting and nullability validation were fragmented across multiple call sites, leading to subtle correctness issues (especially around nullable → non-nullable casts).
  • The adapter produced generic CastExpr nodes even when column-aware semantics were required, complicating optimization, equivalence reasoning, interval analysis, and pruning.

This PR addresses these issues by fully integrating CastColumnExpr into the physical planning pipeline, introducing owned cast/format options, and tightening schema- and nullability-aware validation across DataFusion.


What changes are included in this PR?

1. Owned cast and format options

  • Introduces OwnedFormatOptions and OwnedCastOptions in datafusion-common.
  • Eliminates the need for FormatOptions<'static>, so format strings no longer have to be leaked or interned to obtain a 'static lifetime.
  • Provides safe, ephemeral conversion to Arrow CastOptions<'_> for execution.
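The owned-to-borrowed conversion described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `BorrowedFormatOptions`, `OwnedFormatOptionsSketch`, `as_borrowed`, and `render_null` are hypothetical stand-ins, and only two fields are modeled.

```rust
// Stand-in for a borrowed options type in the style of `FormatOptions<'a>`.
// Only two illustrative fields are modeled here (an assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct BorrowedFormatOptions<'a> {
    pub null: &'a str,
    pub timestamp_format: Option<&'a str>,
}

// Hypothetical owned counterpart: it stores `String`s, so it can be built
// from runtime data (SQL text, protobuf bytes) without leaking memory.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct OwnedFormatOptionsSketch {
    pub null: String,
    pub timestamp_format: Option<String>,
}

impl OwnedFormatOptionsSketch {
    // Borrow the owned strings for the duration of a single cast call.
    // The returned value cannot outlive `self`, so nothing needs to be 'static.
    pub fn as_borrowed(&self) -> BorrowedFormatOptions<'_> {
        BorrowedFormatOptions {
            null: self.null.as_str(),
            timestamp_format: self.timestamp_format.as_deref(),
        }
    }
}

// Hypothetical consumer standing in for a kernel that takes borrowed options.
pub fn render_null(options: &BorrowedFormatOptions<'_>) -> String {
    format!("<{}>", options.null)
}
```

The design point is that the borrow is created at the call site and dropped right after, so the owned value remains the single source of truth.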

2. CastColumnExpr integration

  • Refactors the PhysicalExprAdapter to emit CastColumnExpr instead of CastExpr for column casts.
  • Adds robust validation via validate_field_compatibility and validate_struct_compatibility.
  • Ensures nullable → non-nullable casts are rejected early and consistently.
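As a rough illustration of the nullability rule above, the check below rejects a nullable source paired with a non-nullable target. `FieldSketch` and `check_nullability` are hypothetical names used only for this sketch; the actual validate_field_compatibility in this PR may have a different signature and cover more cases (data types, nested structs).

```rust
// Minimal stand-in for a schema field; real fields also carry a data type
// and metadata, which this sketch leaves out.
#[derive(Debug, Clone, PartialEq)]
pub struct FieldSketch {
    pub name: String,
    pub nullable: bool,
}

// Reject casts that would drop nullability: a nullable source may contain
// nulls that a non-nullable target cannot represent.
pub fn check_nullability(source: &FieldSketch, target: &FieldSketch) -> Result<(), String> {
    if source.nullable && !target.nullable {
        return Err(format!(
            "cannot cast nullable field '{}' to non-nullable field '{}'",
            source.name, target.name
        ));
    }
    // All other combinations are accepted by this check.
    Ok(())
}
```

Running such a check when the expression is constructed, rather than during evaluation, is what lets the failure surface at planning time.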

3. Schema rewriter cleanup

  • Simplifies and clarifies schema-rewrite logic with helper routines.
  • Correctly handles column index mismatches, reordered schemas, and nested structs.
  • Improves error messages and correctness for mismatched physical vs logical schemas.

Of the changed lines, 712 are generated protobuf code:

 datafusion/proto/src/generated/pbjson.rs           | 631 +++++++++++++++++++++
 datafusion/proto/src/generated/prost.rs            |  81 ++-

4. Optimizer and execution support

  • Extends the following to recognize and reason about CastColumnExpr:

    • Equivalence properties
    • Ordering propagation
    • Interval reasoning
    • Cast-unwrapping simplifications
    • Statistics-based pruning

5. Serialization / deserialization

  • Adds protobuf support for CastColumnExpr, PhysicalCastOptions, and FormatOptions.
  • Maintains backward compatibility with pre-43.0 fields (safe, format_options).
  • Enables distributed and IPC round-tripping of plans containing CastColumnExpr.
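The legacy-field fallback can be sketched as below, assuming the new message carries an optional options sub-message alongside the older flat fields. All type and field names here (`CastColumnNodeSketch`, `legacy_safe`, `legacy_null_format`, `resolve_cast_options`) are hypothetical and do not mirror the generated prost types.

```rust
// Hypothetical resolved options used at execution time.
#[derive(Debug, Clone, PartialEq)]
pub struct CastOptionsSketch {
    pub safe: bool,
    pub null_format: String,
}

// Hypothetical decoded protobuf node: older producers fill only the legacy
// fields, newer producers fill `cast_options`.
#[derive(Debug, Clone, Default)]
pub struct CastColumnNodeSketch {
    pub legacy_safe: bool,
    pub legacy_null_format: Option<String>,
    pub cast_options: Option<CastOptionsSketch>,
}

// Prefer the new sub-message; otherwise rebuild options from the legacy
// fields, using defaults for anything the old producer did not set.
pub fn resolve_cast_options(node: &CastColumnNodeSketch) -> CastOptionsSketch {
    match &node.cast_options {
        Some(options) => options.clone(),
        None => CastOptionsSketch {
            safe: node.legacy_safe,
            null_format: node.legacy_null_format.clone().unwrap_or_default(),
        },
    }
}
```

Under this scheme a plan serialized by an older version still deserializes to a usable set of options, while newer plans round-trip their full options unchanged.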

6. Nullability correctness fixes

  • Updates tests and examples to use nullable logical schemas where appropriate.
  • Fixes incorrect assumptions that missing columns are non-nullable.

Are these changes tested?

Yes. This PR adds and updates extensive test coverage, including:

  • Unit tests for CastColumnExpr construction, evaluation, and validation.
  • Tests for nullable vs non-nullable casting behavior.
  • Schema rewrite and adapter behavior tests.
  • Optimizer tests covering ordering, equivalence classes, interval reasoning, and cast unwrapping.
  • Protobuf (de)serialization round-trip tests.

Existing tests were also updated where schema nullability assumptions changed.


Are there any user-facing changes?

  • Behavioral fixes: Queries involving missing columns, reordered schemas, or nested structs now behave correctly with respect to nullability.
  • Improved correctness: Invalid nullable → non-nullable casts now fail deterministically at planning time.
  • No SQL syntax changes are introduced.

There are no intentional breaking API changes, though downstream users relying on internal physical expressions may need to account for CastColumnExpr.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed, validated, and tested before submission.

kosiew added 30 commits January 28, 2026 11:53
Extend cast unwrapping and interval support checks to treat
CastColumnExpr like other casting nodes. Update ordering
equivalence substitution to recognize widening CastColumnExpr
projections when mapping orderings. Add unit tests to cover
CastColumnExpr behavior in simplification, interval support,
and equivalence projection flows.
Implement a round-trip test for CastColumnExpr that checks
the preservation of target field name and type changes while
ensuring the child column identity remains intact after
serialization. This improves the robustness of the
serialization process for CastColumnExpr.
Extend physical expression protobuf schema to include PhysicalCastOptions,
FormatOptions, and DurationFormat. Add PhysicalCastColumnNode with cast
options support. Update serialization/deserialization for proper round-
tripping of cast expressions with options, including legacy field fallback.
Expose configured cast options via a new accessor on
CastColumnExpr for consumers like proto serialization.
Update cast-column serialization to utilize the actual
options for safe, format_options, and cast_options
fields instead of relying on defaults.
Validate input and target types in CastColumnExpr::new,
including struct compatibility checks and castability
verification. Update schema rewriting and proto
deserialization to accommodate the new constructor
behavior. Ensure robust error handling during type
casting operations.
Add a new CastColumnExpr::new_with_schema constructor
that accepts and stores the input schema. Document the
column-only helper for single-field validation paths.

Update CastColumnExpr construction to include full input
schemas during schema rewriting and proto parsing, ensuring
correct type resolution.
Simplify CastColumnExpr constructor to ensure format options are always
present by using the FormatOptionsSlot trait. Add new_with_schema method
for cases requiring full schema. Update schema_rewriter.rs to properly
wrap Schema references in Arc. Add default format options fallback in
serialization for CastColumnExpr to protobuf.
Mutate serialized proto to remove format_options,
ensuring deserializer correctly falls back to DEFAULT_CAST_OPTIONS.
This change helps verify robustness of the cast handling logic.
Centralize cast option normalization and type/struct validation in
the CastColumnExpr construction. Update new and new_with_schema to
utilize a shared build path while maintaining separate schema
setup. This improves code organization and consistency.
Remove the format-options slot trait and introduce a
dedicated normalize_cast_options helper. Wire this
helper into CastColumnExpr::build to ensure explicit
normalization behavior for casting options. This
improves code clarity and maintainability.
Own SchemaRef values in the updated physical expression
adapter rewriter and pass cloned schema arcs during
rewriting. Reuse the existing physical schema Arc when
constructing CastColumnExpr to minimize deep cloning
and improve performance.
Replace per-call string leaks in format_options_from_proto
with an interned cache to reuse stable format strings across
calls. Ensure cast_options_from_proto remains on the same
path. Add a unit test to verify cache reuse for repeated
and distinct format strings.
Expand the rustdoc for CastColumnExpr::new to detail its
single-field schema behavior and usage constraints. Clarify
when to use new_with_schema for scenarios with broader
schema dependencies.
Add internal helper for interned format strings, reusing it
for building ArrowFormatOptions. Update format string cache
test to compare pointer equality of interned strings,
eliminating global cache contention issues.
Ensure columns in CastColumnExpr match the input schema's
field name and type at the referenced index. Implement a
clear planning error on mismatch. Add a unit test to verify
errors when a column's schema field does not align with
the provided input field.
Ensure safe schema field access in CastColumnExpr::build
and return a structured error when a column index is
out of bounds. Add a unit test to validate that out-of-range
column indexes produce an error instead of causing a panic
during new_with_schema construction.
Document the rationale behind the format-string cache leak.
Implement a size cap to prevent further interning when the
cache limit is reached. Add tests to validate that interning
stops appropriately once the cache is full.
Replace the format string cache with a bounded map/queue that
evicts older entries. Reuse leaked strings for recent overflow
values, integrating this change into intern_format_str.

Update the cache limit test to assert pointer reuse for
overflow values that remain in the eviction window.
the input schema before resolving the expression data type, preserving
existing type compatibility checks and error messages.
Require full Field equality, including nullability and
metadata, in CastColumnExpr::build. Expand mismatch error
messages to provide detailed attribute information. Update
schema-mismatch tests and add a case for nullability and
metadata differences.
Return errors when format string cache limit is reached,
preventing leaks while still reusing existing entries.
Update documentation for clarity. Expand tests to validate
over-limit error path and ensure interned strings are reused
after limit is hit.
- Simplify validation to only check data type compatibility and column bounds
- Remove strict name/nullability/metadata matching to allow schema evolution
- Fix SchemaRewriter to use actual physical field at column's index
- Update tests to reflect new, more lenient validation behavior
…arly return for matching types

refactor(cast_column): introduce validate_cast_compatibility function for improved type checking

fix(proto): deprecate legacy fields in PhysicalCastColumnNode for backward compatibility
kosiew added 16 commits January 28, 2026 11:54
…time compatibility with protobuf deserialization
…-tripping to collapse duplicate setup across the cast column tests, improving readability and maintainability
…wing

* Derive Default for OwnedCastOptions and remove manual impl
* Replace 'static CastOptions with borrowed lifetimes across scalar, columnar, and physical exprs
* Normalize cast option handling with OwnedCastOptions defaults
* Update proto serialization/deserialization to use owned format options
* Adjust tests and roundtrip logic to reflect owned cast/format options
…s in OwnedCastOptions and OwnedFormatOptions
@kosiew kosiew force-pushed the castintegration-17330 branch from 91362a8 to 2d114c6 Compare January 28, 2026 03:55
@github-actions github-actions bot added the logical-expr, physical-expr, core, common, proto, and datasource labels Jan 28, 2026
@kosiew kosiew marked this pull request as ready for review January 28, 2026 07:25
Development

Successfully merging this pull request may close these issues.

Integrate cast_column into PhysicalExprAdapter
