
Fix CUDA ReduceSum crash on empty tensors with explicit axes #28353

Open

justinchuby wants to merge 2 commits into main from fix-reducesum-empty-tensor

Conversation

@justinchuby
Contributor

Description

Remove the overly strict assertion in CUDA PrepareForReduce that rejects reducing along a zero-sized dimension even when explicit axes are given. This matches the behavior of the CPU implementation, which handles empty tensors via check_and_reduce_empty_set_input().

Motivation

ORT GenAI's Gemma4 CUDA pipeline triggers ReduceSum on {1, 0} tensors during prefill (past_sequence_length=0). The CPU implementation handles this correctly, but the CUDA path crashes with:

input_dims[axis] != 0 was false. Can't reduce on dim with value of 0 if 'keepdims' is false.

Reducing axis 1 of {1, 0} with keepdims=false produces shape {1} filled with the identity value (0 for sum). This is mathematically valid and numpy handles it correctly.
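The numpy behavior referenced above can be checked directly (a minimal sketch; the shape mirrors the Gemma4 prefill case described here):

```python
import numpy as np

# A {1, 0} tensor, as produced during prefill with past_sequence_length=0.
x = np.zeros((1, 0))

# Reducing the zero-sized axis with keepdims=False yields shape (1,),
# filled with the sum identity, 0 -- the behavior the fix restores on CUDA.
out = np.sum(x, axis=1, keepdims=False)
print(out.shape, out)  # (1,) [0.]
```
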

Changes

Removed the ORT_ENFORCE(input_dims[axis] != 0, ...) assertion at line 291 of reduction_ops.cc. The existing ReduceComputeCore already handles input_count == 0 correctly (lines 369-370).

The default-axes path (line 302) is left unchanged; it already conditionally checks keepdims || dim != 0.

Testing

Verified with Gemma4 e2b-it model on H200 GPU:

  • Before: ReduceSum_node_232 crashes on {1, 0} tensor
  • After: ReduceSum succeeds, inference proceeds to next node (GroupQueryAttention)

justinchuby and others added 2 commits May 4, 2026 21:19
Remove the overly strict assertion that rejected reducing along a
zero-sized dimension even with explicit axes. Reducing axis K of shape
{N, 0} with keepdims=false produces shape {N} filled with the identity
value (0 for sum), which is mathematically valid.

The CPU implementation already handles this case via
check_and_reduce_empty_set_input(). The CUDA path now allows
PrepareForReduce to succeed, and ReduceComputeCore (line 369) already
handles input_count==0 correctly.

This fixes CUDA inference for models with dynamic KV cache where
past_sequence_length=0 during prefill (e.g., Gemma4 via ORT GenAI).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Add 3 test cases verifying CUDA ReduceSum handles zero-sized dimensions:
- {1, 0} with axis=1, keepdims=false → {1} with value 0
- {1, 0} with axis=1, keepdims=true → {1, 1} with value 0
- {2, 0, 3} with axis=1, keepdims=false → {2, 3} with all zeros
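The three expected outputs listed above follow numpy's reduction semantics and can be cross-checked outside ORT (a quick sanity sketch, not part of the PR's test code):

```python
import numpy as np

# {1, 0}, axis=1, keepdims=False -> shape (1,), value 0
a = np.sum(np.zeros((1, 0)), axis=1)

# {1, 0}, axis=1, keepdims=True -> shape (1, 1), value 0
b = np.sum(np.zeros((1, 0)), axis=1, keepdims=True)

# {2, 0, 3}, axis=1, keepdims=False -> shape (2, 3), all zeros
c = np.sum(np.zeros((2, 0, 3)), axis=1)

print(a.shape, b.shape, c.shape)  # (1,) (1, 1) (2, 3)
```
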

These test the fix that removed the overly strict assertion rejecting
reduction along zero-sized dimensions on CUDA.

Signed-off-by: Justin Chu <justinchu@microsoft.com>