Fix CUDA ReduceSum crash on empty tensors with explicit axes #28353
Open
justinchuby wants to merge 2 commits into main
Remove the overly strict assertion that rejected reducing along a
zero-sized dimension even with explicit axes. Reducing the zero-sized
axis of shape {N, 0} with keepdims=false produces shape {N} filled with
the identity value (0 for sum), which is mathematically valid.
The CPU implementation already handles this case via
check_and_reduce_empty_set_input(). The CUDA path now allows
PrepareForReduce to succeed, and ReduceComputeCore (line 369) already
handles input_count==0 correctly.
This fixes CUDA inference for models with dynamic KV cache where
past_sequence_length=0 during prefill (e.g., Gemma4 via ORT GenAI).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
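The shape rule described in the commit message above can be sketched in a few lines. This is an illustrative sketch only: ReducedShape and ElementCount are hypothetical helpers, not functions from ONNX Runtime, and only a single explicit axis is handled. It shows why an input with zero elements can still require a non-empty output that has to be filled with the identity value.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the ORT API): shape after reducing one axis.
std::vector<int64_t> ReducedShape(const std::vector<int64_t>& input_dims,
                                  int64_t axis, bool keepdims) {
  const int64_t rank = static_cast<int64_t>(input_dims.size());
  assert(axis >= -rank && axis < rank);
  if (axis < 0) axis += rank;  // normalize a negative axis

  std::vector<int64_t> out;
  for (int64_t i = 0; i < rank; ++i) {
    if (i != axis) {
      out.push_back(input_dims[static_cast<std::size_t>(i)]);
    } else if (keepdims) {
      out.push_back(1);  // the reduced axis is kept with size 1
    }
  }
  return out;
}

// Number of elements described by a shape (1 for a rank-0 shape).
int64_t ElementCount(const std::vector<int64_t>& dims) {
  int64_t n = 1;
  for (int64_t d : dims) n *= d;
  return n;
}
```

For input shape {1, 0} reduced along axis 1 with keepdims=false, ElementCount of the input is 0 while ElementCount of the output shape {1} is 1, so the single output element must be written as 0 even though there is nothing to read.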
Add 3 test cases verifying CUDA ReduceSum handles zero-sized dimensions:
- {1, 0} with axis=1, keepdims=false → {1} with value 0
- {1, 0} with axis=1, keepdims=true → {1, 1} with value 0
- {2, 0, 3} with axis=1, keepdims=false → {2, 3} with all zeros
These test the fix that removed the overly strict assertion rejecting
reduction along zero-sized dimensions on CUDA.
Signed-off-by: Justin Chu <justinchu@microsoft.com>
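The expected values in the three test cases above can be cross-checked against a plain CPU reference. The following is a minimal sketch, not the test code added by this PR and not ORT's implementation: ReduceSumAxis is a hypothetical function that sums a row-major float tensor along one non-negative axis. Because the accumulator starts at the additive identity, a zero-sized reduced axis simply leaves zeros in the output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference (not ORT code): sum a row-major tensor along `axis`.
// The flat result is identical for keepdims=true and keepdims=false; only the
// reported shape differs.
std::vector<float> ReduceSumAxis(const std::vector<float>& data,
                                 const std::vector<int64_t>& dims,
                                 int64_t axis) {
  const int64_t rank = static_cast<int64_t>(dims.size());
  assert(axis >= 0 && axis < rank);

  // Split the shape into [outer, reduced, inner] around the reduced axis.
  int64_t outer = 1, inner = 1;
  for (int64_t i = 0; i < axis; ++i) outer *= dims[static_cast<std::size_t>(i)];
  for (int64_t i = axis + 1; i < rank; ++i) inner *= dims[static_cast<std::size_t>(i)];
  const int64_t reduced = dims[static_cast<std::size_t>(axis)];
  assert(static_cast<int64_t>(data.size()) == outer * reduced * inner);

  // Start from 0; when reduced == 0 the middle loop body never executes.
  std::vector<float> out(static_cast<std::size_t>(outer * inner), 0.0f);
  for (int64_t o = 0; o < outer; ++o) {
    for (int64_t r = 0; r < reduced; ++r) {
      for (int64_t k = 0; k < inner; ++k) {
        out[static_cast<std::size_t>(o * inner + k)] +=
            data[static_cast<std::size_t>((o * reduced + r) * inner + k)];
      }
    }
  }
  return out;
}
```

Calling ReduceSumAxis with an empty data buffer and dims {1, 0} on axis 1 yields one element equal to 0, and dims {2, 0, 3} on axis 1 yields six zeros, matching the expectations listed in the commit message.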
Description
Remove the overly strict assertion in CUDA PrepareForReduce that rejects reducing along a zero-sized dimension even with explicit axes. This matches the behavior of the CPU implementation, which handles empty tensors via check_and_reduce_empty_set_input().
Motivation
ORT GenAI's Gemma4 CUDA pipeline triggers ReduceSum on {1, 0} tensors during prefill (past_sequence_length=0). The CPU implementation handles this correctly, but the CUDA path crashes.
Reducing axis 1 of {1, 0} with keepdims=false produces shape {1} filled with the identity value (0 for sum). This is mathematically valid, and numpy handles it correctly.
Changes
Removed the ORT_ENFORCE(input_dims[axis] != 0, ...) assertion at line 291 of reduction_ops.cc. The existing ReduceComputeCore already handles input_count == 0 correctly (lines 369-370).
The default-axes path (line 302) is left unchanged; it already conditionally checks keepdims || dim != 0.
Testing
Verified with Gemma4 e2b-it model on H200 GPU:
ReduceSum_node_232 crashes on {1, 0} tensor
Related issues