fix: reject incompatible decimal precision/scale in native_datafusion scan #4090
Conversation
… scan

The native_datafusion Spark physical expression adapter previously fell through to a Spark Cast for decimal-to-decimal type changes, which silently rescales or truncates values that should have raised an error. Mirror Spark's TypeUtil.isDecimalTypeMatched (Spark 3.x rule) by rejecting reads where the target precision is smaller than the source precision or the scales differ. Closes apache#4089.
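As an illustration of the rule described in this commit, here is a minimal Rust sketch of a predicate in the spirit of Spark's TypeUtil.isDecimalTypeMatched. The function name and signature are hypothetical (not a port of Spark's Java code); it only encodes the rule as stated above: reject when the target precision is smaller than the source precision or the scales differ.

```rust
/// Illustrative stand-in for the rule mirrored from Spark's
/// TypeUtil.isDecimalTypeMatched (Spark 3.x): a requested decimal
/// read type matches the file type only if it does not narrow the
/// precision and leaves the scale unchanged.
fn is_decimal_type_matched(file_p: u8, file_s: i8, read_p: u8, read_s: i8) -> bool {
    read_p >= file_p && read_s == file_s
}

fn main() {
    // Same type, or precision widening at the same scale: matched.
    assert!(is_decimal_type_matched(10, 2, 10, 2));
    assert!(is_decimal_type_matched(10, 2, 12, 2));
    // Precision narrowing, or any scale change: rejected.
    assert!(!is_decimal_type_matched(10, 2, 5, 0));
    assert!(!is_decimal_type_matched(10, 2, 10, 4));
    println!("ok");
}
```

Note that this is the strict Spark-side rule; as discussed later in this PR, the guard actually added to the schema adapter is intentionally narrower and only rejects scale narrowing.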
In a case where we expect an exception to be generated anyway, can we catch this at CometScanRule rather than going all the way to serialization and native operators?
I think the issue is that we do not know the types of all the parquet files until runtime?
IIRC from looking at this a while back, Spark has read the physical schema already, but thrown it away by the time our Comet rules run, with no good way to get it again. I'm not opposed to handling it this way, just wanted to think through if we could catch it earlier. It's also a fairly uncommon scenario.
I do think this is an edge case that is fairly unlikely IRL because it only happens when the user provides a schema that is incompatible with the file schema |
Thanks @andygrove! I double-checked Spark's logic and the scale-narrowing guard looks right for preventing silent truncation. Spark's …
Per review feedback, document that scale widening (e.g. Decimal(10,2) read as Decimal(10,4)) is also rejected by Spark's vectorized reader via isDecimalTypeMatched but is allowed here because the cast is lossless. Reference Spark's ParquetVectorUpdaterFactory check directly in the comment so future readers don't need to consult the PR description.
Expanded the comment in |
…l-precision

# Conflicts:
#	native/core/src/parquet/schema_adapter.rs
#	spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala
Merged. Thanks @mbutrovich |
…4090, apache#4091

Cases 4 (Decimal(10,2) -> Decimal(5,0)) and 6 (STRING -> INT) now throw SparkException on native_datafusion after the schema adapter rejection fixes landed on main. Update assertions and the behavior matrix.
Which issue does this PR close?
Closes #4089.
Rationale for this change
When the `native_datafusion` scan reads a Parquet column whose physical type is `Decimal(p1, s1)` under a requested read schema of `Decimal(p2, s2)` with `s2 < s1`, the existing schema adapter falls through to Spark's `Cast` expression. `Cast` happily truncates fractional digits, silently producing wrong values. Spark's vectorized reader rejects this with `SchemaColumnConvertNotSupportedException`, and `native_iceberg_compat` already does the same via `TypeUtil.checkParquetType`. The native scan should match.

What changes are included in this PR?
`native/core/src/parquet/schema_adapter.rs`: in `replace_with_spark_cast`, add a guard before the existing branches that returns `DataFusionError::Plan` when both `physical_type` and `target_type` are `Decimal128` and the target scale is smaller than the source scale.

The check is intentionally narrow: precision-narrowing reads at the same scale (e.g. `Decimal(5,2)` read as `Decimal(3,2)`) are still allowed and fall through to Spark's `Cast`, which produces null on per-value overflow. This matches Spark 4.0's parquet-mr fallback behavior exercised by `ParquetTypeWideningSuite`'s `parquet decimal type change Decimal(5, 2) -> Decimal(3, 2) overflows with parquet-mr` test.

How are these changes tested?
Added a focused test to `ParquetReadSuite`: `native_datafusion rejects incompatible decimal precision/scale`. It writes `Decimal(10, 2)` data, reads it under `Decimal(5, 0)` (scale narrowed from 2 to 0), forces `spark.comet.scan.impl=native_datafusion` and `spark.sql.sources.useV1SourceList=parquet`, and asserts `collect()` raises `SparkException`. Verified against `ParquetReadV1Suite` (44 tests, all pass; 1 pre-existing test ignored).

The behavior is also covered by the per-impl matrix added in #4087 (`decimal(10,2) read as decimal(5,0): native_datafusion`), whose assertion will need flipping from "succeeds" to "throws" once that PR merges.
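The narrow guard described in this PR can be sketched in isolation. The following is a self-contained, illustrative Rust version: the `DataType` enum is a stand-in for Arrow's, and `check_decimal_rescale` is a hypothetical name (the real change lives inside `replace_with_spark_cast` and returns a `DataFusionError::Plan`), but the predicate matches the behavior the PR describes.

```rust
// Stand-in for Arrow's DataType, limited to what this sketch needs.
#[allow(dead_code)]
#[derive(Debug)]
enum DataType {
    Decimal128(u8, i8), // (precision, scale)
    Utf8,               // stands in for any non-decimal type
}

// Sketch of the guard: reject only scale narrowing between two
// Decimal128 types; every other combination falls through (to
// Spark's Cast in the real adapter).
fn check_decimal_rescale(physical: &DataType, target: &DataType) -> Result<(), String> {
    if let (DataType::Decimal128(src_p, src_s), DataType::Decimal128(tgt_p, tgt_s)) =
        (physical, target)
    {
        if tgt_s < src_s {
            return Err(format!(
                "cannot read Decimal({src_p}, {src_s}) as Decimal({tgt_p}, {tgt_s}): \
                 narrowing the scale would silently truncate fractional digits"
            ));
        }
    }
    Ok(())
}

fn main() {
    use DataType::Decimal128;
    // Scale narrowing (the #4089 case): rejected.
    assert!(check_decimal_rescale(&Decimal128(10, 2), &Decimal128(5, 0)).is_err());
    // Scale widening is lossless: allowed.
    assert!(check_decimal_rescale(&Decimal128(10, 2), &Decimal128(10, 4)).is_ok());
    // Precision narrowing at the same scale: allowed (Cast nulls on overflow).
    assert!(check_decimal_rescale(&Decimal128(5, 2), &Decimal128(3, 2)).is_ok());
    println!("ok");
}
```

The three assertions in `main` correspond to the cases discussed above: the rejected `Decimal(10,2)` -> `Decimal(5,0)` read, the lossless scale widening allowed per review feedback, and the `Decimal(5,2)` -> `Decimal(3,2)` precision narrowing that still falls through to `Cast`.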