Relax OneOf schema comparison to ignore nullability differences#37
Conversation
Different table types (FileScanTable, LiveTable, MV rollups) may expose different nullability for the same column. For example: - A partition column in one table is non-nullable - The same column in a rollup MV is a file column, forced nullable for DF 52 RecordBatch validation compatibility Previously, OneOf rejected candidates with any schema difference including nullability. Now it only checks field names and data types, allowing nullable/non-nullable variants of the same column.
b64bbd7 to
7202ae6
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates the OneOf candidate compatibility checks in the view exploitation planner to treat schemas as equal even when they differ only by field nullability, addressing mismatches introduced by forced-nullable file columns.
Changes:
- Replaced strict logical/physical schema equality checks with a comparison that ignores field nullability.
- Added
schemas_equal_ignoring_nullability()helper for Arrow schemas. - Added unit tests covering equality/inequality cases for the new helper.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| .any(|schema| { | ||
| !schemas_equal_ignoring_nullability( | ||
| schema.as_arrow(), | ||
| logical_inputs[0].schema().as_arrow(), | ||
| ) |
There was a problem hiding this comment.
By relaxing the logical schema compatibility check to ignore nullability, OneOf can now contain branches whose DFSchema differs from branches[0].schema() (which is what OneOf::schema() reports). This makes the OneOf node’s output schema potentially stricter (non-nullable) than some candidate branches, which can violate DataFusion’s expectation that a logical node’s schema matches all its inputs and can enable incorrect optimizations based on non-nullability. Consider normalizing the OneOf schema to a merged/most-permissive schema (nullable if any branch is nullable) or otherwise ensuring the OneOf-reported schema is compatible with every branch when nullability differs.
| if logical_inputs | ||
| .iter() | ||
| .map(|plan| plan.schema()) | ||
| .any(|schema| schema != logical_inputs[0].schema()) | ||
| .any(|schema| { | ||
| !schemas_equal_ignoring_nullability( |
There was a problem hiding this comment.
Minor: logical_inputs[0].schema().as_arrow() is recomputed inside the .any(...) closure for every input. Consider binding the baseline schema once before the iterator (and similarly for the physical schema check) to reduce repeated work and make the comparison a bit easier to read.
Summary
OneOf candidate schema comparison now ignores nullability differences, only checking field names and data types.
Problem
After DF 52's removal of
SchemaAdapter,FileScanTablemust force all file columns nullable to passRecordBatch::try_new_with_optionsvalidation. This causes OneOf candidates to have mismatched nullability:day_aggs_v3:dateis a partition column → non-nullableday_aggs_alltime_v3:dateis a file column → forced nullableOneOf's strict
schema != schemacomparison rejected these as incompatible, causing:Fix
Replace strict
PartialEqschema comparison withschemas_equal_ignoring_nullability()that only checks field names and data types. Nullability differences between candidates are acceptable because they don't affect query correctness — OneOf just picks the best candidate to execute.Tests
4 unit tests for
schemas_equal_ignoring_nullability: