perf: stop using FFI in native shuffle read path#3731

Merged
andygrove merged 23 commits into apache:main from andygrove:shuffle-direct-read
Mar 21, 2026
Conversation


@andygrove andygrove commented Mar 18, 2026

Which issue does this PR close?

Performance improvements for native shuffle read. Shows 13% improvement in TPC-H @ 1TB.

Rationale for this change

Simplifies the shuffle direct read code path, removing unnecessary FFI transfers.

What changes are included in this PR?

- Adds a design document for bypassing Arrow FFI in the shuffle read path when both the shuffle writer and the downstream operator are native.
- Adds a new ShuffleScanExec operator that pulls compressed shuffle blocks from the JVM via CometShuffleBlockIterator and decodes them natively using read_ipc_compressed(). Uses the pre-pull pattern (get_next_batch is called externally before poll_next) to avoid JNI calls on tokio threads.
- Fixes two bugs discovered during testing:
  - ClassCastException: the factory closure incorrectly cast Partition to CometExecPartition before extracting ShuffledRowRDDPartition; the partition passed to the factory is already the unwrapped partition from the input RDD.
  - NoSuchElementException in SQLShuffleReadMetricsReporter: the metrics field in CometShuffledBatchRDD was not exposed as a val, causing Map.empty to be used instead of the real shuffle metrics map.
- Removes the redundant getCurrentBlockLength() JNI call (reuses the hasNext() return value).
- Makes readAsRawStream() lazy instead of materializing all streams into a List.
- Removes a pointless DirectByteBuffer re-allocation in close().
- Removes the dead sparkPlanToInputIdx map.

How are these changes tested?

- Adds a Scala integration test that runs a repartition+aggregate query with direct read enabled and disabled to verify result parity.
- Adds a Rust unit test for the read_ipc_compressed codec round-trip (skipped under Miri since it calls foreign zstd functions that Miri cannot execute).
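The block-pull loop described above can be sketched roughly as follows. This is a hypothetical illustration, not Comet's actual code: it assumes a 16-byte little-endian header of [compressedLength, fieldCount], where compressedLength counts the trailing field-count long (hence the -8 discussed later in this review); the class and method names are invented for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Hypothetical sketch of a shuffle-block reader: header layout and names
// are assumptions for illustration, not Comet's actual wire format.
public class BlockReader {
    private final ReadableByteChannel channel;
    private final ByteBuffer headerBuf =
        ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);

    public BlockReader(ReadableByteChannel channel) {
        this.channel = channel;
    }

    /** Returns the next block's payload, or null on clean EOF. */
    public byte[] nextBlock() throws IOException {
        headerBuf.clear();
        while (headerBuf.hasRemaining()) {
            if (channel.read(headerBuf) < 0) {
                if (headerBuf.position() == 0) {
                    return null; // clean EOF between blocks
                }
                throw new IOException("truncated block header");
            }
        }
        headerBuf.flip();
        long compressedLength = headerBuf.getLong();
        headerBuf.getLong(); // field count discarded
        // compressedLength is assumed to include the 8-byte field count
        int bytesToRead = (int) (compressedLength - 8);
        ByteBuffer dataBuf = ByteBuffer.allocate(bytesToRead);
        while (dataBuf.hasRemaining()) {
            if (channel.read(dataBuf) < 0) {
                throw new IOException("truncated block payload");
            }
        }
        return dataBuf.array();
    }
}
```

The real implementation uses a direct buffer so the native side can read the bytes via the buffer's base address without a copy; a heap buffer is used here only to keep the sketch self-contained.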
@andygrove andygrove marked this pull request as ready for review March 19, 2026 13:59
@andygrove andygrove changed the title feat: bypass Arrow FFI for native shuffle read path feat: replace Arrow IPC with raw buffer format in shuffle Mar 19, 2026
@andygrove andygrove changed the title feat: replace Arrow IPC with raw buffer format in shuffle feat: shuffle direct read and raw buffer shuffle format Mar 19, 2026
@andygrove andygrove marked this pull request as draft March 19, 2026 16:33
@andygrove andygrove force-pushed the shuffle-direct-read branch from a7e9659 to 19cb04b Compare March 19, 2026 16:36
@andygrove andygrove changed the title feat: shuffle direct read and raw buffer shuffle format feat: stop using FFI in native shuffle read path Mar 19, 2026
@andygrove andygrove marked this pull request as ready for review March 19, 2026 16:38
@andygrove andygrove requested review from wForget March 19, 2026 17:02

wForget commented Mar 20, 2026

Shows 13% improvement in TPC-H @ 1TB.

Nice work! I didn't expect removing FFI to bring such great benefits. Could you share where these benefits mainly come from? Is it due to fewer JNI calls, or was the overhead from ArrowImporter relatively high?


wForget commented Mar 20, 2026

I read the relevant implementation in Gluten, which defines a lightweight ColumnarBatch that only holds a nativeHandle and does not perform Arrow imports.

https://github.com/apache/gluten/blob/main/gluten-arrow/src/main/java/org/apache/gluten/columnarbatch/IndicatorVector.java

int bytesRead = channel.read(headerBuf);
if (bytesRead < 0) {
if (headerBuf.position() == 0) {
return -1;
Member
We can call close() earlier here.

// Field count discarded - schema determined by ShuffleScan protobuf fields
headerBuf.getLong();

long bytesToRead = compressedLength - 8;
Member
nit: add a comment explaining why -8

case rdd: CometShuffledBatchRDD =>
val dep = rdd.dependency
val rddMetrics = rdd.metrics
factories(scanIdx) = (context, part) => {
Member
Much of the logic here duplicates CometShuffledBatchRDD#compute. Perhaps we could add a computeAsShuffleBlockIterator method to CometShuffledBatchRDD and reuse createReader logic. Like:

class CometShuffledBatchRDD {
  def computeAsShuffleBlockIterator(context: TaskContext, split: Partition): CometShuffleBlockIterator = {
     ...
  }
}

factories(scanIdx) = rdd.computeAsShuffleBlockIterator

Member
In that case, we no longer need to buildShuffleBlockIteratorFactories; we can compute it in CometExecRDD like this:

class CometExecRDD {
  override def compute(split: Partition, context: TaskContext): Iterator[ColumnarBatch] = {
    val partition = split.asInstanceOf[CometExecPartition]

    val inputs = inputRDDs.zip(partition.inputPartitions).zipWithIndex.map {
      case ((rdd: CometShuffledBatchRDD, part), idx) if shuffleScanIndices.contains(idx) =>
        rdd.computeAsShuffleBlockIterator(context, part)
      case ((rdd, part), _) => rdd.iterator(part, context)
    }
...
}

- Close stream on clean EOF in CometShuffleBlockIterator
- Add comment explaining compressedLength - 8 subtraction
- Extract createReader and computeAsShuffleBlockIterator into
  CometShuffledBatchRDD to eliminate duplicated reader-creation
  logic from buildShuffleBlockIteratorFactories
- Simplify CometExecRDD to take shuffleScanIndices and create
  iterators directly from CometShuffledBatchRDD inputs
- Add test with multiple shuffles in plan (join of two shuffled
  datasets)
@andygrove (Member Author)

Shows 13% improvement in TPC-H @ 1TB.

Nice work! I didn't expect removing FFI to bring such great benefits. Could you share where these benefits mainly come from? Is it due to fewer JNI calls, or was the overhead from ArrowImporter relatively high?

Thanks! Yes, the overhead from export/import is high, including serializing the schema for every batch.

In this case, the export/import was not needed at all, so the overhead was pure waste.

@andygrove (Member Author)

I read the relevant implementation in Gluten, which defines a lightweight ColumnarBatch that only holds a nativeHandle and does not perform Arrow imports.

https://github.com/apache/gluten/blob/main/gluten-arrow/src/main/java/org/apache/gluten/columnarbatch/IndicatorVector.java

That's interesting. I think Comet could benefit from this approach as well.

@andygrove andygrove changed the title feat: stop using FFI in native shuffle read path perf: stop using FFI in native shuffle read path Mar 20, 2026
… JVM shuffle

ShuffleScanExec reads compressed IPC blocks directly, but native shuffle
may dictionary-encode string/binary columns. The IPC format preserves
this encoding, causing a schema mismatch since the protobuf schema only
declares value types. Add an unpack step after read_ipc_compressed() to
cast dictionary arrays to their value types.

With this fix, the direct read optimization can safely handle
dictionary-encoded data, so extend it to also support JVM columnar
shuffle (which uses the same wire format as native shuffle).
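The unpack step in this commit can be illustrated conceptually. The sketch below uses plain arrays rather than Arrow's DictionaryArray API: a dictionary-encoded column stores integer indices into a values table, and unpacking materializes the value type that the declared schema expects. All names here are invented for the illustration.

```java
import java.util.Arrays;

// Conceptual illustration of dictionary unpacking (not Arrow's API):
// indices reference entries in a small dictionary; unpacking replaces
// each index with its value so downstream code sees plain values.
public class DictUnpack {
    static String[] unpack(int[] indices, String[] dictionary) {
        String[] out = new String[indices.length];
        for (int i = 0; i < indices.length; i++) {
            out[i] = dictionary[indices[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        // "a", "b", "a", "c" encoded as indices into a 3-entry dictionary
        String[] dict = {"a", "b", "c"};
        int[] indices = {0, 1, 0, 2};
        System.out.println(Arrays.toString(unpack(indices, dict)));
        // [a, b, a, c]
    }
}
```

In the PR itself the equivalent operation is a cast of each dictionary array to its value type after read_ipc_compressed(), so the batch schema matches the protobuf schema, which only declares value types.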
@andygrove (Member Author)

Thanks for the detailed review @wForget. I have pushed commits to address the feedback so far.

builder.clearChildren()
Some(builder.setShuffleScan(scanBuilder).build())
} else {
withInfo(op, "unsupported data types for shuffle direct read")
Contributor
this error message doesn't really match the condition IMO

Member Author
I added the node name to the message

+ bytesToRead
+ " exceeds maximum of "
+ Integer.MAX_VALUE
+ ". Try reducing shuffle batch size.");
Contributor
please include the shuffle batch size config param name in the message

// Note: native side uses get_direct_buffer_address (base pointer) + currentBlockLength,
// not the buffer's position/limit. No flip needed.

currentBlockLength = (int) bytesToRead;
Contributor
maybe we can do the int conversion only once?

Comment on lines +73 to +74
headerBuf.clear();
while (headerBuf.hasRemaining()) {
Contributor

how does this work? 🤔 If headerBuf.clear() was executed, shouldn't headerBuf.hasRemaining() be false?

Member Author
It loops while the buffer has remaining space (up to the limit) to be filled. I added a clarifying comment.
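For readers with the same question: clear() resets position to 0 and limit to capacity without erasing data, so after clear() hasRemaining() reports room left to fill, not data left to read. A minimal demonstration:

```java
import java.nio.ByteBuffer;

public class ClearDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(42L); // position = 8
        buf.clear();      // position = 0, limit = capacity = 16
        // hasRemaining() now means "space left up to the limit", so a
        // while (buf.hasRemaining()) { channel.read(buf); } fill-loop
        // runs until the full 16-byte header has arrived.
        System.out.println(buf.hasRemaining()); // true
        System.out.println(buf.remaining());    // 16
    }
}
```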

- Clarify ByteBuffer.clear()/hasRemaining() pattern with comment
- Include config param name in block size error message
- Perform int cast of bytesToRead once and reuse
- Include operator name in unsupported data type message
@andygrove andygrove force-pushed the shuffle-direct-read branch from 85b0d4e to f377995 Compare March 20, 2026 22:33
private final InputStream inputStream;
private final ByteBuffer headerBuf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
private ByteBuffer dataBuf = ByteBuffer.allocateDirect(INITIAL_BUFFER_SIZE);
private boolean closed = false;
Contributor
is this class thread-safe? And should closed be volatile or atomic?

Member Author
this will only be called on a single thread

))));
}

let mut env = JVMClasses::get_env()?;
Contributor
maybe we can pass env by reference? Or is it expensive to create the env for each block?

Member Author
get_env calls attach_current_thread, which is a no-op for already-attached threads, so the overhead is minimal AFAIK

/// The ShuffleScanExec producing input batches.
shuffle_scan: ShuffleScanExec,
/// Schema of the output.
schema: SchemaRef,
Contributor
can we take schema from shuffle_scan?

Member Author
Good point. Fixed.

Use shuffle_scan.schema() instead of storing a separate copy.
@andygrove (Member Author)
Thanks for the review @comphead. I have addressed feedback so far.

@comphead comphead left a comment


Thanks @andygrove, it is LGTM. I was also able to see some benefits in a prod-like environment.

@andygrove andygrove merged commit 8bab4a5 into apache:main Mar 21, 2026
60 checks passed
@andygrove andygrove deleted the shuffle-direct-read branch March 21, 2026 18:02
@andygrove (Member Author)

Thanks @comphead @parthchandra @wForget! Next up is #3754
