[xpu] Guard oneDNN SDPA against small head_dim causing GPU page faults by laifenxiawucha · Pull Request #8 · laifenxiawucha/pytorch

laifenxiawucha · 2026-05-21T17:09:09Z

Description

Guard oneDNN SDPA against head_dim values that trigger GPU page faults in the oneDNN MatMul kernel (MFDNN-14479).

The oneDNN 3.11.1 upgrade (PyTorch pytorch#177607) fixed the original crash case (head_dim=96, seq_len=65536) but the fix was incomplete — small head_dim values still crash.

Root Cause

check_head_dim_size_xpu() in aten/src/ATen/native/mkldnn/xpu/Attention.cpp only enforced a maximum head_dim (576) but had no minimum. oneDNN's MatMul kernel for SDPA requires head_dim >= SIMD tiling block size (32 for fp32, 16 for fp16/bf16). When head_dim is below this, the kernel writes beyond allocated buffers, producing:

Segmentation fault from GPU at 0xff00000000094000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 1 (Write), banned: 1, aborting.
Abort was called at 288 line in file: ./shared/source/os_interface/linux/drm_neo.cpp

Fix

Add minimum head_dim guard in check_head_dim_size_xpu():

fp32: head_dim >= 32 (SIMD16 = 64B tiling block)
fp16/bf16: head_dim >= 16 (SIMD16 = 32B tiling block)

Shapes below the minimum fall through to the math backend.

Reproducer

import torch

# Crash case from intel/torch-xpu-ops#3394
batch_size, seq_len, num_heads, head_dim = 4, 16413, 3, 16
torch.manual_seed(0)

query = torch.randn(batch_size, seq_len, num_heads, head_dim, device="xpu", dtype=torch.float32)
key = torch.randn(batch_size, seq_len, num_heads, head_dim, device="xpu", dtype=torch.float32)
value = torch.randn(batch_size, seq_len, num_heads, head_dim, device="xpu", dtype=torch.float32)

query_t = query.permute(0, 2, 1, 3)
key_t = key.permute(0, 2, 1, 3)
value_t = value.permute(0, 2, 1, 3)

out = torch.nn.functional.scaled_dot_product_attention(
    query=query_t, key=key_t, value=value_t,
    attn_mask=None, dropout_p=0.0, is_causal=False, scale=None, enable_gqa=False,
)

After fix: falls back to math backend (no crash), produces correct output.

Self-Review

References

crash still occur in sdpa intel/torch-xpu-ops#3394 — crash still occur in sdpa
crash in sdpa. intel/torch-xpu-ops#2570 — original crash (fixed by oneDNN 3.11.1)
MFDNN-14479 — oneDNN JIRA tracking
PyTorch Upgrade submodule oneDNN to v3.11.2 pytorch/pytorch#177607 — oneDNN 3.11.1 upgrade
CUDA ref: aten/src/ATen/native/transformers/sdp_utils_cpp.cpp:14-34

Fixes: intel/torch-xpu-ops#3394

…PU page fault in oneDNN MatMul kernel (MFDNN-14479) when head_dim is\nbelow the kernel tiling block size. The oneDNN 3.11.1 upgrade\n(PyTorch pytorch#177607) fixed the original crash (head_dim=96, seq_len=65536)\nbut missed small head_dim values (head_dim=16, seq_len=16413).\n\nRoot cause: can_use_overrideable_attention had no minimum head_dim\ncheck, so shapes that crash oneDNN were still routed to it.\n\nFix: Add minimum head_dim guard in check_head_dim_size_xpu():\n fp32 requires head_dim >= 32 (SIMD16=64B tiling block)\n fp16/bf16 requires head_dim >= 16 (SIMD16=32B tiling block)\n\nFixes: intel/torch-xpu-ops#3394\nReference: MFDNN-14479\nCUDA ref: aten/src/ATen/native/transformers/sdp_utils_cpp.cpp:14-34\n (CUDA has no min check - SIMT handles small dims natively)

laifenxiawucha force-pushed the hermes/fix-333-sdpa-head-dim-guard branch from e54b848 to b2e8d74 Compare May 22, 2026 05:14

laifenxiawucha changed the base branch from main to fix/xpu-ut-issues May 22, 2026 05:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[xpu] Guard oneDNN SDPA against small head_dim causing GPU page faults#8

[xpu] Guard oneDNN SDPA against small head_dim causing GPU page faults#8
laifenxiawucha wants to merge 1 commit into
fix/xpu-ut-issuesfrom
hermes/fix-333-sdpa-head-dim-guard

laifenxiawucha commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

laifenxiawucha commented May 21, 2026

Description

Root Cause

Fix

Reproducer

Self-Review

References

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant