fix: throw SchemaColumnConvertNotSupportedException from native_datafusion schema mismatch #4117

Merged
andygrove merged 2 commits into apache:main from andygrove:enable-spark358-3720-tests
Apr 28, 2026

Conversation

@andygrove
Member

Which issue does this PR close?

Partially addresses #3720.

Rationale for this change

Spark's vectorized Parquet reader signals incompatible column reads (e.g. reading a STRING column as INT, scale-narrowing decimal, scalar read as ARRAY) with SchemaColumnConvertNotSupportedException. FileScanRDD (3.x) and FileDataSourceV2 (4.0) then wrap that into a typed SparkException (_LEGACY_ERROR_TEMP_2063 on 3.x, FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH on 4.0).

native_datafusion was raising the same incompatible-read cases (after #4090 and #4091) but as opaque CometNativeException, so several Spark SQL tests that assert the typed exception chain or message format had to be ignored under IgnoreCometNativeDataFusion("...3720"). This PR aligns native_datafusion's error path with native_iceberg_compat's, so the same tests pass without changing user-visible semantics for the cases that already errored.

What changes are included in this PR?

Native side

  • New SparkError::ParquetSchemaConvert variant flows through the existing CometQueryExecutionException JSON pipeline. It is wired into the two schema_adapter.rs guards added by #4090/#4091, plus a new planning-time guard for scalar/complex mismatch (covers SPARK-45604: Timestamp read as Array<Timestamp>).

JVM shims (3.4 / 3.5 / 4.0 versions of ShimSparkErrorConverter.scala)

  • New case "ParquetSchemaConvert" translates the JSON-encoded native error to a SchemaColumnConvertNotSupportedException, then wraps it via the version-appropriate QueryExecutionErrors:
    • 3.4 / 3.5: unsupportedSchemaColumnConvertError (_LEGACY_ERROR_TEMP_2063)
    • 4.0: parquetColumnDataTypeMismatchError (FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH)
  • File path is currently passed empty since DataFusion's PhysicalExprAdapterFactory::create doesn't expose it. The wrapped message still satisfies the errMsg.contains("Parquet column cannot be converted in file") assertions.
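The translation path above can be sketched in miniature. This is an illustrative sketch only, not the actual Comet code: the variant, field names, and JSON keys here are assumptions. It shows the general idea of a native-side error carrying the mismatched column details, encoded as JSON so a JVM-side converter can match on the "ParquetSchemaConvert" error class and rebuild SchemaColumnConvertNotSupportedException.

```rust
// Hypothetical sketch of a native scan error variant; the real Comet
// SparkError type and its serialization differ.
#[derive(Debug)]
enum NativeScanError {
    ParquetSchemaConvert {
        column: String,        // column path, e.g. "_1"
        physical_type: String, // type found in the Parquet file
        logical_type: String,  // type requested by the Spark read schema
    },
}

impl NativeScanError {
    /// Encode as a small JSON payload the JVM-side error converter
    /// could pattern-match on by error class.
    fn to_json(&self) -> String {
        match self {
            NativeScanError::ParquetSchemaConvert { column, physical_type, logical_type } => {
                format!(
                    "{{\"errorClass\":\"ParquetSchemaConvert\",\"column\":\"{column}\",\"physicalType\":\"{physical_type}\",\"logicalType\":\"{logical_type}\"}}"
                )
            }
        }
    }
}

fn main() {
    let err = NativeScanError::ParquetSchemaConvert {
        column: "_1".to_string(),
        physical_type: "Utf8".to_string(),
        logical_type: "Int32".to_string(),
    };
    println!("{}", err.to_json());
}
```

On the JVM side, the shim would parse this payload, construct SchemaColumnConvertNotSupportedException from the column and type fields, and wrap it via the version-appropriate QueryExecutionErrors method listed above.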

Spark SQL diffs

Tests previously ignored under IgnoreCometNativeDataFusion("...3720") are unignored where they now pass:

  • dev/diffs/3.5.8.diff: SPARK-35640 binary as timestamp, SPARK-45604 ntz to array (verified locally end-to-end on Spark v3.5.8 with ENABLE_COMET=true ENABLE_COMET_ONHEAP=true).
  • dev/diffs/3.4.3.diff: same two tests (parallel structure with 3.5; same shim code path).
  • dev/diffs/4.0.2.diff: SPARK-45604 ntz to array only. SPARK-35640 binary as timestamp on 4.0 uses checkErrorMatchPVals with strict parameter format (column="[_1]", actualType="BINARY", expectedType="timestamp"); the shim currently passes Arrow type names (Utf8, Timestamp(µs, "UTC")) without brackets and would need an Arrow-to-Parquet/Spark type-name translation step. Left ignored as a follow-up.
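The follow-up translation step mentioned in the last bullet could take roughly this shape. This is a guess at the mapping, not Comet code: the function name and the exact set of arms are assumptions, illustrating how Arrow type names reported by the native error might be mapped to the Parquet physical-type names Spark's 4.0 assertion expects.

```rust
// Hypothetical Arrow-to-Parquet type-name translation for the shim follow-up.
// Arms shown are illustrative; a real mapping would need to cover decimals,
// nested types, and logical-type annotations.
fn arrow_to_parquet_type_name(arrow: &str) -> String {
    match arrow {
        "Utf8" | "LargeUtf8" | "Binary" | "LargeBinary" => "BINARY".to_string(),
        "Int32" | "Date32" => "INT32".to_string(),
        "Int64" => "INT64".to_string(),
        "Float32" => "FLOAT".to_string(),
        "Float64" => "DOUBLE".to_string(),
        "Boolean" => "BOOLEAN".to_string(),
        // Arrow renders timestamps like `Timestamp(Microsecond, Some("UTC"))`;
        // Parquet stores microsecond timestamps as INT64.
        t if t.starts_with("Timestamp") => "INT64".to_string(),
        other => other.to_string(),
    }
}

fn main() {
    // The 4.0 test also expects the column wrapped in brackets, e.g. "[_1]".
    let column = format!("[{}]", "_1");
    println!("{} {}", column, arrow_to_parquet_type_name("Binary"));
}
```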

How are these changes tested?

End-to-end against apache/spark v3.5.8 and v4.0.2 with the Comet jar and ENABLE_COMET=true ENABLE_COMET_ONHEAP=true:

| Test | 3.5.8 | 4.0.2 |
| --- | --- | --- |
| ParquetIOSuite SPARK-35640 binary as timestamp | pass | fail (param format, see above) |
| ParquetSchemaSuite SPARK-45604 ntz to array | pass | pass |
| ParquetSchemaSuite schema mismatch failure error message vectorized | fail (test extracts the file path from the message and re-reads it; needs file-path plumbing through PhysicalExprAdapterFactory::create, separate follow-up) | not exercised |

Existing Comet regression tests added by #4090 and #4091 (ParquetReadSuite "native_datafusion rejects string read as non-string/binary type" and "native_datafusion rejects incompatible decimal precision/scale") continue to pass — they assert assertThrows[SparkException], which the new wrapped SparkException satisfies.

cargo clippy --all-targets --workspace -- -D warnings runs clean.

Out of scope (follow-ups)

  • 3.5.8 vectorized schema mismatch test (requires file-path plumbing).
  • 4.0.2 binary-as-timestamp test (requires Arrow-to-Parquet/Spark type-name translation in shim).
  • 3.4.3 end-to-end run (parallel to 3.5.8, not separately verified).

fix: throw SchemaColumnConvertNotSupportedException from native_datafusion schema mismatch

Spark's vectorized reader signals incompatible Parquet column reads with
SchemaColumnConvertNotSupportedException, which FileScanRDD / FileDataSourceV2
then wraps in a typed SparkException (_LEGACY_ERROR_TEMP_2063 on 3.x,
FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH on 4.0). The native_datafusion
scan previously surfaced these as opaque CometNativeException, so several
Spark SQL tests covering schema-mismatch behavior had to be ignored
(issue apache#3720).

Add a SparkError::ParquetSchemaConvert variant that flows through the
existing CometQueryExecutionException JSON pipeline and have each shim
translate it to the version-appropriate Spark exception with
SchemaColumnConvertNotSupportedException as the cause. Wire the variant
into the two schema_adapter.rs guards added by apache#4090/apache#4091, plus a new
planning-time guard for scalar/complex mismatch (covers SPARK-45604:
Timestamp read as Array<Timestamp>).
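
The planning-time guard described above can be sketched as follows. This is a minimal sketch under assumed names, not Comet's actual implementation: it illustrates rejecting a scalar file column that the read schema requests as a complex (list) type, the SPARK-45604 shape.

```rust
// Hypothetical guard: compare the "shape" (scalar vs. list) of the file
// column against the requested Spark column before planning the read.
#[derive(Debug, PartialEq)]
enum TypeShape {
    Scalar,
    List,
}

fn check_shape(file: TypeShape, requested: TypeShape, column: &str) -> Result<(), String> {
    if file != requested {
        // In Comet this would surface through the ParquetSchemaConvert
        // error class; here we just return a descriptive message.
        return Err(format!(
            "ParquetSchemaConvert: column {column} is {file:?} in the file but requested as {requested:?}"
        ));
    }
    Ok(())
}

fn main() {
    // Timestamp read as Array<Timestamp>: scalar in file, list requested.
    assert!(check_shape(TypeShape::Scalar, TypeShape::List, "ts").is_err());
    assert!(check_shape(TypeShape::Scalar, TypeShape::Scalar, "ts").is_ok());
    println!("guard behaves as expected");
}
```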

Unignore the now-passing tests in dev/diffs:
- 3.4.3 / 3.5.8: SPARK-35640 binary as timestamp, SPARK-45604 ntz to array
- 4.0.2: SPARK-45604 ntz to array (binary-as-timestamp on 4.0 needs
  Arrow-to-Parquet/Spark type-name translation in the shim before
  checkErrorMatchPVals will accept it)
@andygrove andygrove marked this pull request as ready for review April 28, 2026 03:08
@andygrove andygrove requested a review from mbutrovich April 28, 2026 03:15
Contributor

@coderfender coderfender left a comment


Thank you

@andygrove andygrove requested a review from comphead April 28, 2026 14:30
Contributor

@mbutrovich mbutrovich left a comment


Thanks @andygrove!

@andygrove andygrove merged commit b1ca457 into apache:main Apr 28, 2026
170 of 175 checks passed
@andygrove andygrove deleted the enable-spark358-3720-tests branch April 28, 2026 18:12
@andygrove
Member Author

Merged. Thanks @coderfender @mbutrovich
