Merged
9 changes: 9 additions & 0 deletions docs/source/user-guide/latest/compatibility/scans.md
@@ -57,6 +57,15 @@ The following shared limitation may produce incorrect results without falling ba
written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
October 15, 1582.

The following shared limitation raises an error at scan time rather than falling back to Spark:

- Invalid UTF-8 bytes in `STRING` columns. Spark permits arbitrary byte sequences in a `STRING`
column (for example from `CAST(X'C1' AS STRING)`), but Comet's native execution path is built on
Arrow, whose string type is strictly UTF-8. Reading a Parquet file whose `STRING` column contains
non-UTF-8 bytes fails with `Parquet error: encountered non UTF-8 data`. To preserve non-UTF-8
bytes, disable Comet for the query or cast the column to `BINARY` before persisting.
See [#4121](https://github.com/apache/datafusion-comet/issues/4121).
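
As a rough illustration of the strictness described above, the following plain-Python sketch checks whether a byte sequence is valid UTF-8 before it is persisted as a `STRING`. The helper name `invalid_utf8_offset` is hypothetical and is not part of Comet, Spark, or Arrow. Python's strict decoder only approximates Arrow's validation, so treat this as a pre-flight check rather than a guarantee.

```python
from typing import Optional


def invalid_utf8_offset(raw: bytes) -> Optional[int]:
    """Return the offset of the first invalid UTF-8 byte, or None if `raw` is valid.

    Hypothetical helper for illustration only; it is not a Comet or Arrow API.
    """
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        return err.start
    return None


# A valid UTF-8 string passes the check.
print(invalid_utf8_offset("héllo".encode("utf-8")))

# 0xC1 (the byte from CAST(X'C1' AS STRING)) is never a valid UTF-8 byte.
print(invalid_utf8_offset(b"ok\xc1"))
```

Values that fail a check like this are candidates for storing as `BINARY` instead of `STRING`.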

## `native_datafusion` Limitations

The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
11 changes: 11 additions & 0 deletions docs/source/user-guide/latest/compatibility/spark-versions.md
@@ -51,6 +51,17 @@ Spark 4.1 support is experimental and intended for development and testing only.
in production.
```

### Known Limitations

- **`NullType` columns in Parquet files**
([#4199](https://github.com/apache/datafusion-comet/issues/4199)): Spark encodes a `NullType`
column as a Parquet `BOOLEAN` physical type annotated with `LogicalType::Unknown`. The Rust
`parquet` crate that Comet depends on accepts `Unknown` only when paired with `INT32` and rejects
any other physical type with `Parquet error: Cannot annotate Unknown from BOOLEAN for field '<name>'`.
Any attempt to read a Parquet file that contains a `NullType` column fails at decode time before
Comet's scan runs. Workaround: project the column away, cast it to a concrete type before
persisting, or read the file with Comet disabled for that query.
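
The "cast it to a concrete type before persisting" workaround could be sketched in PySpark roughly as follows. Both helper names are hypothetical. The sketch assumes the schema JSON spells `NullType` as `"void"` (or `"null"` in older releases), and it only inspects top-level columns, so a `NullType` nested inside an array, map, or struct is not handled.

```python
def null_type_columns(schema_json: dict) -> list:
    """Names of top-level fields typed as NullType in a Spark schema JSON dict.

    Assumption: NullType appears as the string "void" (or "null" in older releases).
    """
    null_names = ("void", "null")
    return [
        field["name"]
        for field in schema_json.get("fields", [])
        # Complex types are dicts in the schema JSON; only plain strings can be NullType.
        if isinstance(field.get("type"), str) and field["type"] in null_names
    ]


def cast_null_columns(df, target: str = "string"):
    """Cast every top-level NullType column of a PySpark DataFrame to `target`."""
    # Imported lazily so the pure-Python helper above works without Spark installed.
    from pyspark.sql import functions as F

    for name in null_type_columns(df.schema.jsonValue()):
        df = df.withColumn(name, F.col(name).cast(target))
    return df
```

Calling `cast_null_columns(df)` before writing would give each such column a concrete Parquet type; `string` is an arbitrary choice here, and any concrete type that suits the data would do.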

## Spark 4.2 (Experimental)

Spark 4.2.0-preview4 is provided as experimental support with Java 17 and Scala 2.13.