Support Attention head_dim=512 #412

Merged

CC-Yeh merged 2 commits into main from support_headdim_512_attention on May 14, 2026

Conversation

Contributor

CC-Yeh commented May 14, 2026

This PR adds:

  • MatmulArguments gains b_offset, b_leading_dimension, and b_transpose (the last exposed as a VARIANTS axis on gemm.metal).
  • A head_dim=512 dispatch (suffix > 8): per-group matmul(Q, K^T) → mask → softmax → matmul(P, V) → scatter, sketched below. Smaller cases keep the existing single_pass/two_pass kernels (which also gained HEAD_DIM=512).
  • A new generic Softmax kernel under kernel/softmax/.
  • Two attention helpers: ScatterScores and ScatterValues.
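
A minimal CPU reference of that per-group pipeline, to pin down the math. Shapes, helper names, and the causal mask here are illustrative assumptions, not this repo's Metal API; the scatter step is omitted since it is purely a layout concern:

```cpp
// CPU reference of the per-group head_dim=512 path:
// S = Q @ K^T, mask, P = softmax(S), O = P @ V.
// Shapes (one group): Q [seq_q, D], K/V [seq_k, D], D = 512.
#include <algorithm>
#include <cmath>
#include <vector>

using Mat = std::vector<float>;  // row-major

static Mat matmul_nt(const Mat& a, const Mat& b, int m, int n, int k) {
    // C[m,n] = A[m,k] @ B[n,k]^T  (B accessed transposed, like K^T)
    Mat c(size_t(m) * n, 0.f);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.f;
            for (int p = 0; p < k; ++p) acc += a[i * k + p] * b[j * k + p];
            c[i * n + j] = acc;
        }
    return c;
}

static Mat matmul_nn(const Mat& a, const Mat& b, int m, int n, int k) {
    // C[m,n] = A[m,k] @ B[k,n]
    Mat c(size_t(m) * n, 0.f);
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p)
            for (int j = 0; j < n; ++j) c[i * n + j] += a[i * k + p] * b[p * n + j];
    return c;
}

// O = softmax(mask(Q @ K^T * scale)) @ V for a single head/group.
Mat attention_reference(const Mat& Q, const Mat& K, const Mat& V,
                        int seq_q, int seq_k, int D, bool causal) {
    const float scale = 1.f / std::sqrt(float(D));
    Mat S = matmul_nt(Q, K, seq_q, seq_k, D);      // scores
    for (int i = 0; i < seq_q; ++i)                // scale + optional causal mask
        for (int j = 0; j < seq_k; ++j) {
            float& s = S[i * seq_k + j];
            s *= scale;
            if (causal && j > i) s = -INFINITY;
        }
    for (int i = 0; i < seq_q; ++i) {              // row-wise softmax -> P
        float mx = -INFINITY, sum = 0.f;
        for (int j = 0; j < seq_k; ++j) mx = std::max(mx, S[i * seq_k + j]);
        for (int j = 0; j < seq_k; ++j) {
            float e = std::exp(S[i * seq_k + j] - mx);
            S[i * seq_k + j] = e;
            sum += e;
        }
        for (int j = 0; j < seq_k; ++j) S[i * seq_k + j] /= sum;
    }
    return matmul_nn(S, V, seq_q, D, seq_k);       // O = P @ V
}
```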

Contributor Author

CC-Yeh commented May 14, 2026

Why BD=512 fused GEMM attention didn't work

  • Threadgroup memory ceiling is 32 KB on Apple Silicon. Fused GEMM needs the Q-tile and the K/V-tile resident at the same time (a back-of-envelope footprint check follows this list):

    BQ   BK   q_smem + kv_smem
    32   32   66 KB — overflows
    32   16   49 KB — overflows
    16   16   32 KB — at the limit
    16    8   24 KB — fits
  • The only tile that fit (BQ=16, BK=8) was 2–3× slower than the unfused matmul path:

    seq    fused BD=512   unfused matmul   ratio
    512    4.23 ms        1.80 ms          2.4× slower
    2048   66.4 ms        22.5 ms          2.9× slower
  • Why: BK=8 means 128 FMAs per KV reload vs SteelMatmul's 2048 (16× worse bandwidth amortization). BQ=16 means 2 simdgroup-matrix rows — too few to overlap math with loads.
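
For context on the footprints in the first table, a back-of-envelope check lines up with the reported numbers, assuming half-precision Q and K/V tiles of the full 512-wide head_dim and ignoring padding (my arithmetic, not taken from the kernel source):

$$
\text{smem} \approx (B_Q + B_K)\times D \times 2\,\text{B}:\qquad
(16+8)\times 512\times 2 = 24\,\text{KB},\qquad
(16+16)\times 512\times 2 = 32\,\text{KB}.
$$

The 49 KB and 66 KB rows are the same formula (48 KB and 64 KB) plus what is presumably per-row padding.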

Why we use the unfused matmul pipeline instead

  • At HEAD_DIM=512 the work is essentially two big GEMMs (Q @ K^T and P @ V). The GEMM kernels are already peak-tuned for those shapes; a hand-written attention kernel is unlikely to beat them.
  • MLX does the same: when head_dim ∉ {64, 80, 128} it sets use_fallback = true and runs matmul + softmax + matmul at the graph level (see the sketch below).
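
A minimal sketch of that graph-level decision, with a hypothetical predicate name (this is neither MLX's nor this repo's exact code):

```cpp
#include <cstdint>

// Fused SDPA kernels exist only for a few tuned head_dims; everything else,
// including head_dim = 512, falls back to matmul -> softmax -> matmul.
bool sdpa_use_fallback(int64_t head_dim) {
    return head_dim != 64 && head_dim != 80 && head_dim != 128;
}
```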

CC-Yeh marked this pull request as ready for review May 14, 2026 15:25
CC-Yeh enabled auto-merge (squash) May 14, 2026 15:26
CC-Yeh merged commit 5ff6f3a into main May 14, 2026
7 checks passed
CC-Yeh deleted the support_headdim_512_attention branch May 14, 2026 15:34