fix: throw SchemaColumnConvertNotSupportedException from native_datafusion schema mismatch #4117

Merged
andygrove merged 2 commits into apache:main from andygrove:enable-spark358-3720-tests
Apr 28, 2026

Conversation

@andygrove
Member

Which issue does this PR close?

Partially addresses #3720.

Rationale for this change

Spark's vectorized Parquet reader signals incompatible column reads (e.g. reading a STRING column as INT, scale-narrowing decimal, scalar read as ARRAY) with SchemaColumnConvertNotSupportedException. FileScanRDD (3.x) and FileDataSourceV2 (4.0) then wrap that into a typed SparkException (_LEGACY_ERROR_TEMP_2063 on 3.x, FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH on 4.0).

native_datafusion was raising the same incompatible-read cases (after #4090 and #4091) but as opaque CometNativeException, so several Spark SQL tests that assert the typed exception chain or message format had to be ignored under IgnoreCometNativeDataFusion("...3720"). This PR aligns native_datafusion's error path with native_iceberg_compat's, so the same tests pass without changing user-visible semantics for the cases that already errored.

What changes are included in this PR?

Native side

  • New SparkError::ParquetSchemaConvert variant flows through the existing CometQueryExecutionException JSON pipeline. It is wired into the two schema_adapter.rs guards added by #4090/#4091, plus a new planning-time guard for scalar/complex mismatch (covers SPARK-45604: Timestamp read as Array<Timestamp>).

JVM shims (3.4 / 3.5 / 4.0 versions of ShimSparkErrorConverter.scala)

  • New case "ParquetSchemaConvert" translates the JSON-encoded native error to a SchemaColumnConvertNotSupportedException, then wraps it via the version-appropriate QueryExecutionErrors:
    • 3.4 / 3.5: unsupportedSchemaColumnConvertError (_LEGACY_ERROR_TEMP_2063)
    • 4.0: parquetColumnDataTypeMismatchError (FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH)
  • File path is currently passed empty since DataFusion's PhysicalExprAdapterFactory::create doesn't expose it. The wrapped message still satisfies the errMsg.contains("Parquet column cannot be converted in file") assertions.
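The translation path above can be sketched in miniature. This is an illustrative sketch only, not the actual Comet code: the variant, field names, and JSON keys here are assumptions. It shows the general idea of a native-side error carrying the mismatched column details, encoded as JSON so a JVM-side converter can match on the "ParquetSchemaConvert" error class and rebuild SchemaColumnConvertNotSupportedException.

```rust
// Hypothetical sketch of a native scan error variant; the real Comet
// SparkError type and its serialization differ.
#[derive(Debug)]
enum NativeScanError {
    ParquetSchemaConvert {
        column: String,        // column path, e.g. "_1"
        physical_type: String, // type found in the Parquet file
        logical_type: String,  // type requested by the Spark read schema
    },
}

impl NativeScanError {
    /// Encode as a small JSON payload the JVM-side error converter
    /// could pattern-match on by error class.
    fn to_json(&self) -> String {
        match self {
            NativeScanError::ParquetSchemaConvert { column, physical_type, logical_type } => {
                format!(
                    "{{\"errorClass\":\"ParquetSchemaConvert\",\"column\":\"{column}\",\"physicalType\":\"{physical_type}\",\"logicalType\":\"{logical_type}\"}}"
                )
            }
        }
    }
}

fn main() {
    let err = NativeScanError::ParquetSchemaConvert {
        column: "_1".to_string(),
        physical_type: "Utf8".to_string(),
        logical_type: "Int32".to_string(),
    };
    println!("{}", err.to_json());
}
```

On the JVM side, the shim would parse this payload, construct SchemaColumnConvertNotSupportedException from the column and type fields, and wrap it via the version-appropriate QueryExecutionErrors method listed above.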

Spark SQL diffs

Tests previously ignored under IgnoreCometNativeDataFusion("...3720") are unignored where they now pass:

  • dev/diffs/3.5.8.diff: SPARK-35640 binary as timestamp, SPARK-45604 ntz to array (verified locally end-to-end on Spark v3.5.8 with ENABLE_COMET=true ENABLE_COMET_ONHEAP=true).
  • dev/diffs/3.4.3.diff: same two tests (parallel structure with 3.5; same shim code path).
  • dev/diffs/4.0.2.diff: SPARK-45604 ntz to array only. SPARK-35640 binary as timestamp on 4.0 uses checkErrorMatchPVals with strict parameter format (column="[_1]", actualType="BINARY", expectedType="timestamp"); the shim currently passes Arrow type names (Utf8, Timestamp(µs, "UTC")) without brackets and would need an Arrow-to-Parquet/Spark type-name translation step. Left ignored as a follow-up.
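The follow-up translation step mentioned in the last bullet could take roughly this shape. This is a guess at the mapping, not Comet code: the function name and the exact set of arms are assumptions, illustrating how Arrow type names reported by the native error might be mapped to the Parquet physical-type names Spark's 4.0 assertion expects.

```rust
// Hypothetical Arrow-to-Parquet type-name translation for the shim follow-up.
// Arms shown are illustrative; a real mapping would need to cover decimals,
// nested types, and logical-type annotations.
fn arrow_to_parquet_type_name(arrow: &str) -> String {
    match arrow {
        "Utf8" | "LargeUtf8" | "Binary" | "LargeBinary" => "BINARY".to_string(),
        "Int32" | "Date32" => "INT32".to_string(),
        "Int64" => "INT64".to_string(),
        "Float32" => "FLOAT".to_string(),
        "Float64" => "DOUBLE".to_string(),
        "Boolean" => "BOOLEAN".to_string(),
        // Arrow renders timestamps like `Timestamp(Microsecond, Some("UTC"))`;
        // Parquet stores microsecond timestamps as INT64.
        t if t.starts_with("Timestamp") => "INT64".to_string(),
        other => other.to_string(),
    }
}

fn main() {
    // The 4.0 test also expects the column wrapped in brackets, e.g. "[_1]".
    let column = format!("[{}]", "_1");
    println!("{} {}", column, arrow_to_parquet_type_name("Binary"));
}
```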

How are these changes tested?

End-to-end against apache/spark v3.5.8 and v4.0.2 with the Comet jar and ENABLE_COMET=true ENABLE_COMET_ONHEAP=true:

| Test | 3.5.8 | 4.0.2 |
| --- | --- | --- |
| ParquetIOSuite SPARK-35640 binary as timestamp | pass | fail (param format, see above) |
| ParquetSchemaSuite SPARK-45604 ntz to array | pass | pass |
| ParquetSchemaSuite schema mismatch failure error message vectorized | fail (test extracts the file path from the message and re-reads it; needs file-path plumbing through PhysicalExprAdapterFactory::create, separate follow-up) | not exercised |

Existing Comet regression tests added by #4090 and #4091 (ParquetReadSuite "native_datafusion rejects string read as non-string/binary type" and "native_datafusion rejects incompatible decimal precision/scale") continue to pass — they assert assertThrows[SparkException], which the new wrapped SparkException satisfies.

cargo clippy --all-targets --workspace -- -D warnings runs clean.

Out of scope (follow-ups)

  • 3.5.8 vectorized schema mismatch test (requires file-path plumbing).
  • 4.0.2 binary-as-timestamp test (requires Arrow-to-Parquet/Spark type-name translation in shim).
  • 3.4.3 end-to-end run (parallel to 3.5.8, not separately verified).

fix: throw SchemaColumnConvertNotSupportedException from native_datafusion schema mismatch

Spark's vectorized reader signals incompatible Parquet column reads with
SchemaColumnConvertNotSupportedException, which FileScanRDD / FileDataSourceV2
then wraps in a typed SparkException (_LEGACY_ERROR_TEMP_2063 on 3.x,
FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH on 4.0). The native_datafusion
scan previously surfaced these as opaque CometNativeException, so several
Spark SQL tests covering schema-mismatch behavior had to be ignored
(issue apache#3720).

Add a SparkError::ParquetSchemaConvert variant that flows through the
existing CometQueryExecutionException JSON pipeline and have each shim
translate it to the version-appropriate Spark exception with
SchemaColumnConvertNotSupportedException as the cause. Wire the variant
into the two schema_adapter.rs guards added by apache#4090/apache#4091, plus a new
planning-time guard for scalar/complex mismatch (covers SPARK-45604:
Timestamp read as Array<Timestamp>).
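
The planning-time guard described above can be sketched as follows. This is a minimal sketch under assumed names, not Comet's actual implementation: it illustrates rejecting a scalar file column that the read schema requests as a complex (list) type, the SPARK-45604 shape.

```rust
// Hypothetical guard: compare the "shape" (scalar vs. list) of the file
// column against the requested Spark column before planning the read.
#[derive(Debug, PartialEq)]
enum TypeShape {
    Scalar,
    List,
}

fn check_shape(file: TypeShape, requested: TypeShape, column: &str) -> Result<(), String> {
    if file != requested {
        // In Comet this would surface through the ParquetSchemaConvert
        // error class; here we just return a descriptive message.
        return Err(format!(
            "ParquetSchemaConvert: column {column} is {file:?} in the file but requested as {requested:?}"
        ));
    }
    Ok(())
}

fn main() {
    // Timestamp read as Array<Timestamp>: scalar in file, list requested.
    assert!(check_shape(TypeShape::Scalar, TypeShape::List, "ts").is_err());
    assert!(check_shape(TypeShape::Scalar, TypeShape::Scalar, "ts").is_ok());
    println!("guard behaves as expected");
}
```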

Unignore the now-passing tests in dev/diffs:
- 3.4.3 / 3.5.8: SPARK-35640 binary as timestamp, SPARK-45604 ntz to array
- 4.0.2: SPARK-45604 ntz to array (binary-as-timestamp on 4.0 needs
  Arrow-to-Parquet/Spark type-name translation in the shim before
  checkErrorMatchPVals will accept it)
@andygrove andygrove marked this pull request as ready for review April 28, 2026 03:08
@andygrove andygrove requested a review from mbutrovich April 28, 2026 03:15
Contributor

@coderfender coderfender left a comment


Thank you

@andygrove andygrove requested a review from comphead April 28, 2026 14:30
Contributor

@mbutrovich mbutrovich left a comment


Thanks @andygrove!

@andygrove andygrove merged commit b1ca457 into apache:main Apr 28, 2026
170 of 175 checks passed
@andygrove andygrove deleted the enable-spark358-3720-tests branch April 28, 2026 18:12
@andygrove
Member Author

Merged. Thanks @coderfender @mbutrovich
