
Fix CUDA ReduceSum crash on empty tensors with explicit axes #28353

Open

justinchuby wants to merge 2 commits into main from fix-reducesum-empty-tensor

Conversation

@justinchuby
Contributor

Description

Remove the overly strict assertion in CUDA PrepareForReduce that rejects reducing along a zero-sized dimension even when explicit axes are given. This matches the behavior of the CPU implementation, which handles empty tensors via check_and_reduce_empty_set_input().

Motivation

ORT GenAI's Gemma4 CUDA pipeline triggers ReduceSum on {1, 0} tensors during prefill (past_sequence_length=0). The CPU implementation handles this correctly, but the CUDA path crashes with:

input_dims[axis] != 0 was false. Can't reduce on dim with value of 0 if 'keepdims' is false.

Reducing axis 1 of {1, 0} with keepdims=false produces shape {1} filled with the identity value (0 for sum). This is mathematically valid and numpy handles it correctly.
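The numpy behavior referenced above can be checked directly (a minimal sketch; the shape mirrors the Gemma4 prefill case described here):

```python
import numpy as np

# A {1, 0} tensor, as produced during prefill with past_sequence_length=0.
x = np.zeros((1, 0))

# Reducing the zero-sized axis with keepdims=False yields shape (1,),
# filled with the sum identity, 0 -- the behavior the fix restores on CUDA.
out = np.sum(x, axis=1, keepdims=False)
print(out.shape, out)  # (1,) [0.]
```
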

Changes

Removed the ORT_ENFORCE(input_dims[axis] != 0, ...) assertion at line 291 of reduction_ops.cc. The existing ReduceComputeCore already handles input_count == 0 correctly (lines 369-370).

The default-axes path (line 302) is left unchanged; it already conditionally checks keepdims || dim != 0.

Testing

Verified with Gemma4 e2b-it model on H200 GPU:

  • Before: ReduceSum_node_232 crashes on {1, 0} tensor
  • After: ReduceSum succeeds, inference proceeds to next node (GroupQueryAttention)

justinchuby and others added 2 commits May 4, 2026 21:19
Remove the overly strict assertion that rejected reducing along a
zero-sized dimension even with explicit axes. Reducing axis K of shape
{N, 0} with keepdims=false produces shape {N} filled with the identity
value (0 for sum), which is mathematically valid.

The CPU implementation already handles this case via
check_and_reduce_empty_set_input(). The CUDA path now allows
PrepareForReduce to succeed, and ReduceComputeCore (line 369) already
handles input_count==0 correctly.

This fixes CUDA inference for models with dynamic KV cache where
past_sequence_length=0 during prefill (e.g., Gemma4 via ORT GenAI).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Add 3 test cases verifying CUDA ReduceSum handles zero-sized dimensions:
- {1, 0} with axis=1, keepdims=false → {1} with value 0
- {1, 0} with axis=1, keepdims=true → {1, 1} with value 0
- {2, 0, 3} with axis=1, keepdims=false → {2, 3} with all zeros
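The three expected outputs listed above follow numpy's reduction semantics and can be cross-checked outside ORT (a quick sanity sketch, not part of the PR's test code):

```python
import numpy as np

# {1, 0}, axis=1, keepdims=False -> shape (1,), value 0
a = np.sum(np.zeros((1, 0)), axis=1)

# {1, 0}, axis=1, keepdims=True -> shape (1, 1), value 0
b = np.sum(np.zeros((1, 0)), axis=1, keepdims=True)

# {2, 0, 3}, axis=1, keepdims=False -> shape (2, 3), all zeros
c = np.sum(np.zeros((2, 0, 3)), axis=1)

print(a.shape, b.shape, c.shape)  # (1,) (1, 1) (2, 3)
```
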

These test the fix that removed the overly strict assertion rejecting
reduction along zero-sized dimensions on CUDA.

Signed-off-by: Justin Chu <justinchu@microsoft.com>