diff --git a/docs/source/user-guide/latest/compatibility/scans.md b/docs/source/user-guide/latest/compatibility/scans.md
index 27ed20c19e..d68c59d562 100644
--- a/docs/source/user-guide/latest/compatibility/scans.md
+++ b/docs/source/user-guide/latest/compatibility/scans.md
@@ -57,6 +57,15 @@ The following shared limitation may produce incorrect results without falling ba
   written using the Proleptic Gregorian calendar. This may produce incorrect results for dates
   before October 15, 1582.
 
+The following shared limitation raises an error at scan time rather than falling back to Spark:
+
+- Invalid UTF-8 bytes in `STRING` columns. Spark permits arbitrary byte sequences in a `STRING`
+  column (for example from `CAST(X'C1' AS STRING)`), but Comet's native execution path is built on
+  Arrow, whose string type is strictly UTF-8. Reading a Parquet file whose `STRING` column contains
+  non-UTF-8 bytes fails with `Parquet error: encountered non UTF-8 data`. Disable Comet for the
+  query, or cast the column to `BINARY` before persisting, if you need to preserve non-UTF-8 bytes.
+  See [#4121](https://github.com/apache/datafusion-comet/issues/4121).
+
 ## `native_datafusion` Limitations
 
 The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
diff --git a/docs/source/user-guide/latest/compatibility/spark-versions.md b/docs/source/user-guide/latest/compatibility/spark-versions.md
index 6569c35406..115b1595be 100644
--- a/docs/source/user-guide/latest/compatibility/spark-versions.md
+++ b/docs/source/user-guide/latest/compatibility/spark-versions.md
@@ -51,6 +51,17 @@ Spark 4.1 support is experimental and intended for development and testing only.
 in production.
 ```
 
+### Known Limitations
+
+- **`NullType` columns in Parquet files**
+  ([#4199](https://github.com/apache/datafusion-comet/issues/4199)): Spark encodes a `NullType`
+  column as a Parquet `BOOLEAN` physical type annotated with `LogicalType::Unknown`. The Rust
+  `parquet` crate that Comet depends on accepts `Unknown` only when paired with `INT32` and rejects
+  any other physical type with `Parquet error: Cannot annotate Unknown from BOOLEAN for field ''`.
+  Any attempt to read a Parquet file that contains a `NullType` column fails at decode time before
+  Comet's scan runs. Workaround: project the column away, cast it to a concrete type before
+  persisting, or read the file with Comet disabled for that query.
+
 ## Spark 4.2 (Experimental)
 
 Spark 4.2.0-preview4 is provided as experimental support with Java 17 and Scala 2.13.
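The UTF-8 limitation in the first hunk can be illustrated without Spark or Comet at all. A minimal sketch (not part of the patched docs) showing why the byte produced by `CAST(X'C1' AS STRING)` trips a strictly-UTF-8 string type such as Arrow's:

```python
# Sketch: why a strict UTF-8 string type must reject Spark's X'C1' bytes.
# 0xC1 is one of the bytes (0xC0, 0xC1, 0xF5-0xFF) that can never occur
# anywhere in well-formed UTF-8, so a conforming decoder always fails on it.
raw = b"\xc1"  # same byte sequence as Spark's X'C1' literal

try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(is_valid_utf8)  # False
```

Spark's `STRING` type carries such bytes through untouched, which is why the mismatch only surfaces when the data reaches Comet's Arrow-based scan.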