From 0890827c256fbb0a8453a9dac2749cfd96a4284d Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Mon, 4 May 2026 07:00:49 -0600
Subject: [PATCH 1/2] docs: add Spark 4.1 known-limitations section with
 NullType Parquet entry

Start a Spark-4.1 "Known Limitations" section on the compatibility guide's
spark-versions page, mirroring the existing Spark-4.0 section. First entry
documents #4199 (parquet-rs rejects Spark's `BOOLEAN + Unknown` encoding for
`NullType` columns) with the failure mode and a user-facing workaround so
operators hitting the decode-time error have somewhere to land.
---
 .../user-guide/latest/compatibility/spark-versions.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/docs/source/user-guide/latest/compatibility/spark-versions.md b/docs/source/user-guide/latest/compatibility/spark-versions.md
index 6569c35406..115b1595be 100644
--- a/docs/source/user-guide/latest/compatibility/spark-versions.md
+++ b/docs/source/user-guide/latest/compatibility/spark-versions.md
@@ -51,6 +51,17 @@ Spark 4.1 support is experimental and intended for development and testing only.
 in production.
 ```
 
+### Known Limitations
+
+- **`NullType` columns in Parquet files**
+  ([#4199](https://github.com/apache/datafusion-comet/issues/4199)): Spark encodes a `NullType`
+  column as a Parquet `BOOLEAN` physical type annotated with `LogicalType::Unknown`. The Rust
+  `parquet` crate that Comet depends on accepts `Unknown` only when paired with `INT32` and rejects
+  any other physical type with `Parquet error: Cannot annotate Unknown from BOOLEAN for field ''`.
+  Any attempt to read a Parquet file that contains a `NullType` column fails at decode time before
+  Comet's scan runs. Workaround: project the column away, cast it to a concrete type before
+  persisting, or read the file with Comet disabled for that query.
+
 ## Spark 4.2 (Experimental)
 
 Spark 4.2.0-preview4 is provided as experimental support with Java 17 and Scala 2.13.
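The type-pairing rule the patch above documents can be sketched as a tiny model. This is an illustrative Python sketch of the behavior described (`Unknown` is only accepted on `INT32`), not the actual parquet-rs validation code; the helper name is hypothetical.

```python
# Illustrative model (hypothetical helper, NOT parquet-rs itself) of the
# annotation check described above: LogicalType::Unknown may only annotate
# the INT32 physical type, so Spark's BOOLEAN + Unknown encoding of a
# NullType column is rejected before any data is decoded.
def unknown_annotation_ok(physical_type: str) -> bool:
    """Return True if LogicalType::Unknown may annotate this physical type."""
    return physical_type == "INT32"

for physical in ("INT32", "BOOLEAN"):
    status = "ok" if unknown_annotation_ok(physical) else "rejected"
    print(f"Unknown + {physical}: {status}")
```

Under this model, the `BOOLEAN` case is exactly the one Spark emits for `NullType`, which is why the decode-time error fires before Comet's scan runs.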
From 9eaddb435d55c2ea008b24341cde96d3471bbfa5 Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Mon, 4 May 2026 07:52:25 -0600
Subject: [PATCH 2/2] docs: note UTF-8-only limitation for Parquet STRING
 columns (#4121)

Comet's native execution is built on Arrow, whose string type is strictly
UTF-8, so non-UTF-8 bytes in a STRING column are rejected at scan time rather
than silently accepted as they are by Spark. Document this under Shared
Limitations in the scans compatibility guide.
---
 docs/source/user-guide/latest/compatibility/scans.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/docs/source/user-guide/latest/compatibility/scans.md b/docs/source/user-guide/latest/compatibility/scans.md
index 27ed20c19e..d68c59d562 100644
--- a/docs/source/user-guide/latest/compatibility/scans.md
+++ b/docs/source/user-guide/latest/compatibility/scans.md
@@ -57,6 +57,15 @@ The following shared limitation may produce incorrect results without falling ba
   written using the Proleptic Gregorian calendar. This may produce incorrect results for dates
   before October 15, 1582.
 
+The following shared limitation raises an error at scan time rather than falling back to Spark:
+
+- Invalid UTF-8 bytes in `STRING` columns. Spark permits arbitrary byte sequences in a `STRING`
+  column (for example from `CAST(X'C1' AS STRING)`), but Comet's native execution path is built on
+  Arrow, whose string type is strictly UTF-8. Reading a Parquet file whose `STRING` column contains
+  non-UTF-8 bytes fails with `Parquet error: encountered non UTF-8 data`. Disable Comet for the
+  query, or cast the column to `BINARY` before persisting, if you need to preserve non-UTF-8 bytes.
+  See [#4121](https://github.com/apache/datafusion-comet/issues/4121).
+
 ## `native_datafusion` Limitations
 
 The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
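Why a strictly-UTF-8 string type cannot hold these bytes can be shown with plain Python, independent of Spark, Arrow, or Comet: `0xC1` (the byte from the patch's `CAST(X'C1' AS STRING)` example) is never a legal byte anywhere in a valid UTF-8 sequence.

```python
# The byte 0xC1 can never appear in valid UTF-8 (C0/C1 are forbidden
# overlong-encoding lead bytes), so a strictly-UTF-8 string type such as
# Arrow's has no representation for it and must reject the data.
raw = b"\xc1"
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print(is_valid_utf8)  # False
```

This is why the documented workaround is `BINARY`: a binary column carries raw bytes with no UTF-8 validity requirement.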