From 0890827c256fbb0a8453a9dac2749cfd96a4284d Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Mon, 4 May 2026 07:00:49 -0600
Subject: [PATCH 1/2] docs: add Spark 4.1 known-limitations section with
 NullType Parquet entry

Start a Spark-4.1 "Known Limitations" section on the compatibility guide's
spark-versions page, mirroring the existing Spark-4.0 section. First entry
documents #4199 (parquet-rs rejects Spark's `BOOLEAN + Unknown` encoding for
`NullType` columns) with the failure mode and a user-facing workaround so
operators hitting the decode-time error have somewhere to land.
---
 .../user-guide/latest/compatibility/spark-versions.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/docs/source/user-guide/latest/compatibility/spark-versions.md b/docs/source/user-guide/latest/compatibility/spark-versions.md
index 6569c35406..115b1595be 100644
--- a/docs/source/user-guide/latest/compatibility/spark-versions.md
+++ b/docs/source/user-guide/latest/compatibility/spark-versions.md
@@ -51,6 +51,17 @@ Spark 4.1 support is experimental and intended for development and testing only.
 in production.
 ```
 
+### Known Limitations
+
+- **`NullType` columns in Parquet files**
+  ([#4199](https://github.com/apache/datafusion-comet/issues/4199)): Spark encodes a `NullType`
+  column as a Parquet `BOOLEAN` physical type annotated with `LogicalType::Unknown`. The Rust
+  `parquet` crate that Comet depends on accepts `Unknown` only when paired with `INT32` and rejects
+  any other physical type with `Parquet error: Cannot annotate Unknown from BOOLEAN for field ''`.
+  Any attempt to read a Parquet file that contains a `NullType` column fails at decode time before
+  Comet's scan runs. Workaround: project the column away, cast it to a concrete type before
+  persisting, or read the file with Comet disabled for that query.
+
 ## Spark 4.2 (Experimental)
 
 Spark 4.2.0-preview4 is provided as experimental support with Java 17 and Scala 2.13.
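The type-pairing rule the patch above documents can be sketched as a tiny model. This is an illustrative Python sketch of the behavior described (`Unknown` is only accepted on `INT32`), not the actual parquet-rs validation code; the helper name is hypothetical.

```python
# Illustrative model (hypothetical helper, NOT parquet-rs itself) of the
# annotation check described above: LogicalType::Unknown may only annotate
# the INT32 physical type, so Spark's BOOLEAN + Unknown encoding of a
# NullType column is rejected before any data is decoded.
def unknown_annotation_ok(physical_type: str) -> bool:
    """Return True if LogicalType::Unknown may annotate this physical type."""
    return physical_type == "INT32"

for physical in ("INT32", "BOOLEAN"):
    status = "ok" if unknown_annotation_ok(physical) else "rejected"
    print(f"Unknown + {physical}: {status}")
```

Under this model, the `BOOLEAN` case is exactly the one Spark emits for `NullType`, which is why the decode-time error fires before Comet's scan runs.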
From 9eaddb435d55c2ea008b24341cde96d3471bbfa5 Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Mon, 4 May 2026 07:52:25 -0600
Subject: [PATCH 2/2] docs: note UTF-8-only limitation for Parquet STRING
 columns (#4121)

Comet's native execution is built on Arrow, whose string type is strictly
UTF-8, so non-UTF-8 bytes in a STRING column are rejected at scan time rather
than silently accepted as they are by Spark. Document this under Shared
Limitations in the scans compatibility guide.
---
 docs/source/user-guide/latest/compatibility/scans.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/docs/source/user-guide/latest/compatibility/scans.md b/docs/source/user-guide/latest/compatibility/scans.md
index 27ed20c19e..d68c59d562 100644
--- a/docs/source/user-guide/latest/compatibility/scans.md
+++ b/docs/source/user-guide/latest/compatibility/scans.md
@@ -57,6 +57,15 @@ The following shared limitation may produce incorrect results without falling ba
   written using the Proleptic Gregorian calendar. This may produce incorrect results for dates
   before October 15, 1582.
 
+The following shared limitation raises an error at scan time rather than falling back to Spark:
+
+- Invalid UTF-8 bytes in `STRING` columns. Spark permits arbitrary byte sequences in a `STRING`
+  column (for example from `CAST(X'C1' AS STRING)`), but Comet's native execution path is built on
+  Arrow, whose string type is strictly UTF-8. Reading a Parquet file whose `STRING` column contains
+  non-UTF-8 bytes fails with `Parquet error: encountered non UTF-8 data`. Disable Comet for the
+  query, or cast the column to `BINARY` before persisting, if you need to preserve non-UTF-8 bytes.
+  See [#4121](https://github.com/apache/datafusion-comet/issues/4121).
+
 ## `native_datafusion` Limitations
 
 The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
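Why a strictly-UTF-8 string type cannot hold these bytes can be shown with plain Python, independent of Spark, Arrow, or Comet: `0xC1` (the byte from the patch's `CAST(X'C1' AS STRING)` example) is never a legal byte anywhere in a valid UTF-8 sequence.

```python
# The byte 0xC1 can never appear in valid UTF-8 (C0/C1 are forbidden
# overlong-encoding lead bytes), so a strictly-UTF-8 string type such as
# Arrow's has no representation for it and must reject the data.
raw = b"\xc1"
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print(is_valid_utf8)  # False
```

This is why the documented workaround is `BINARY`: a binary column carries raw bytes with no UTF-8 validity requirement.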