fix(mha): don't compute FA3 scheduler metadata for non-FA3 backends by qywu · Pull Request #276 · lightseekorg/tokenspeed

qywu · 2026-05-27T01:49:45Z

Summary

MHAAttnBackend._maybe_compute_scheduler_metadata's docstring says it returns None when the active backend doesn't consume pre-computed scheduler metadata, but the implementation unconditionally calls mha_decode_scheduler_metadata and returns its result. Downstream in forward_decode, the non-None tensor gets passed to the selected kernel — and triton / fa4 / flashinfer reject the unknown scheduler_metadata kwarg:

TypeError: triton_mha_decode_with_kvcache() got an unexpected keyword
argument 'scheduler_metadata'

Guard the call so the docstring matches the behaviour: skip the compute (and therefore the downstream kwarg) when kernel_solution is anything other than "fa3" or None (None == auto-select, may land on FA3).
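
For illustration only, the guard described above could look roughly like the sketch below. It is not the patch itself: the free functions `maybe_compute_scheduler_metadata` and `decode_kwargs`, and the `compute_fn` callable standing in for `mha_decode_scheduler_metadata`, are hypothetical names chosen to keep the example self-contained.

```python
from typing import Any, Callable, Dict, Optional


def maybe_compute_scheduler_metadata(
    kernel_solution: Optional[str],
    compute_fn: Callable[[], Any],
) -> Optional[Any]:
    """Return FA3 scheduler metadata, or None if the backend cannot consume it.

    `compute_fn` is a hypothetical stand-in for the real metadata computation.
    """
    # None means auto-select, which may still land on FA3, so keep computing.
    if kernel_solution not in ("fa3", None):
        return None
    return compute_fn()


def decode_kwargs(scheduler_metadata: Optional[Any]) -> Dict[str, Any]:
    """Build the extra kernel kwargs, omitting the key when there is no metadata."""
    if scheduler_metadata is None:
        return {}
    return {"scheduler_metadata": scheduler_metadata}
```

With this shape, a kernel that does not accept `scheduler_metadata` never sees the keyword, because the metadata is `None` and the kwarg is omitted rather than passed through.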

Repro

python -m tokenspeed.cli serve <any-MHA-model> --attention-backend=triton
# first decode forward TypeErrors as above

Test plan

--attention-backend=triton end-to-end inference on Hopper (Qwen2-1.5B-Instruct) — confirmed locally; produces correct output.
--attention-backend=fa3 continues to consume the pre-computed metadata as before.
--attention-backend=mha (auto) continues to compute it (auto-select may still land on FA3).
--attention-backend=fa4 / flashinfer no longer trip the TypeError.
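
The failure mode and the effect of the guard can also be checked without a GPU using stand-in kernels. The sketch below is an assumption-laden toy, not the project's test suite: `fake_fa3_decode`, `fake_triton_decode`, `run_decode`, and `raises_type_error` are all hypothetical, and the only property they model is that one kernel accepts a `scheduler_metadata` keyword while the other does not.

```python
from typing import Any, Optional


def fake_fa3_decode(q: Any, scheduler_metadata: Optional[Any] = None) -> str:
    # Stand-in for a kernel that consumes pre-computed scheduler metadata.
    return "fa3"


def fake_triton_decode(q: Any) -> str:
    # Stand-in for a kernel with no scheduler_metadata parameter.
    return "triton"


KERNELS = {"fa3": fake_fa3_decode, "triton": fake_triton_decode}


def run_decode(kernel_solution: str, guarded: bool) -> str:
    metadata: Optional[Any] = object()  # pretend metadata was computed
    if guarded and kernel_solution != "fa3":
        metadata = None  # guarded path: skip metadata for non-FA3 kernels
    kwargs = {} if metadata is None else {"scheduler_metadata": metadata}
    return KERNELS[kernel_solution]("q", **kwargs)


def raises_type_error(kernel_solution: str, guarded: bool) -> bool:
    try:
        run_decode(kernel_solution, guarded)
    except TypeError:
        return True
    return False
```

Here `raises_type_error("triton", guarded=False)` models the unguarded behaviour reported above, while the guarded call goes through for both stand-in kernels.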

``_maybe_compute_scheduler_metadata``'s docstring promises ``None`` when the active backend doesn't consume pre-computed scheduler metadata, but the implementation unconditionally calls ``mha_decode_scheduler_metadata`` and returns its result. Downstream, ``forward_decode`` then passes the non-None tensor to whichever kernel is selected, and triton / fa4 / flashinfer reject the unknown ``scheduler_metadata`` kwarg:

TypeError: triton_mha_decode_with_kvcache() got an unexpected keyword argument 'scheduler_metadata'

Guard the call so the docstring is the truth: skip the compute (and the downstream kwarg) when ``kernel_solution`` is anything other than ``"fa3"`` or ``None`` (None == auto-select, which may land on FA3).

Found while running --attention-backend=triton end-to-end on H100; reproduces with any non-FA3 selection on Hopper.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

This was referenced May 27, 2026

feat: expose POST /release_memory_occupation and /resume_memory_occupation #272

Closed

feat(memory-saver): optional CPU staging for round-trip weight preservation #275

Draft

qywu closed this May 27, 2026

lightseek-bot deleted the fix/mha-scheduler-metadata-guard branch May 27, 2026 07:17

qywu wants to merge 1 commit into main from fix/mha-scheduler-metadata-guard
