fix(deepseek-v4): close MTP acceptance gap#207
Conversation
c285d95 to
63e22c5
Compare
There was a problem hiding this comment.
💡 Codex Review
tokenspeed/python/tokenspeed/runtime/layers/attention/registry.py
Lines 433 to 436 in 63e22c5
When the draft model is also DeepSeek V4 (the new is_deepseek_v4_draft_model path), this branch still computes draft_cache_cell_size from draft_attn_config.cache_cell_size(), which is the generic MLA estimate and does not include V4 grouped caches (SWA/compressed/indexer/state). profile_deepseek_v4_max_num_pages then overestimates available KV pages for target+draft, so deployments can admit a token/page budget that exceeds real GPU memory and fail with OOM under load; this should use the V4-specific draft size (draft_profile_cache_cell_size / layout-based sizing) instead.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c01d2a08dd
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f8108eed5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2494dc30ac
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cb37d86925
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
f77d47a to
e4223a0
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e4223a006a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
@Xiangyi1996 please fix the conflicts thanks! |
e4223a0 to
c08927d
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c08927dc81
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
c08927d to
5abff4a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5abff4a8cd
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5fa924b9da
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
f819dd2 to
5cc6bcd
Compare
|
This PR is a bit large because it contains the V4/R1 MTP stack plus the acceptance-gap fix and follow-up CI/review fixes. The main risk areas are:
Validation so far:
|
dongjiyingdjy
left a comment
There was a problem hiding this comment.
Code Review
23 files, +3096/-180. Enables MTP speculative decoding for V4, fixes acceptance gap vs TRTLLM (2.8447 tok/iter match). Core approach — separate TARGET_VERIFY/DRAFT_EXTEND modes with V4-specific metadata lifecycle — is sound.
Issues to address
1. _forward_flashmla_sparse per-token Python loop (perf)
The new sparse attention method iterates per-token in Python with GPU ops inside each iteration. If this is on the MTP draft hot path, it will dominate latency. Please confirm this is only a correctness fallback, or add a guard/warning.
2. Mixed batch disabled for ALL speculative models, not just V4 (regression risk)
enable_mixed_prefill_decode = (
server_args.enable_mixed_batch and server_args.speculative_algorithm is None
)This disables mixed batch whenever any speculative algorithm is active, including EAGLE3 on non-V4 models. If mixed batch previously worked with EAGLE3, this is a regression. Should this be V4-specific?
3. _select_decode_metadata fallback chain is fragile for CUDA graph replay
The method searches through forward_metadata, forward_decode_metadata, forward_prefill_metadata by matching num_tokens. If a stale graph replay happens to match the wrong metadata's token count, it silently reads wrong data. The RuntimeError fallback is fine for eager but catastrophic inside a captured graph.
4. _refresh_cuda_graph_packed_metadata allocates temporaries every replay
self._cuda_graph_query_start_loc[:bs+1].copy_(torch.arange(bs+1, ...) * tokens_per_req)
self._cuda_graph_token_to_req[:total_tokens].copy_(torch.arange(bs, ...).repeat_interleave(tokens_per_req))torch.arange + repeat_interleave on every CUDA graph replay step. These could be precomputed once during capture and reused.
Suggestions
5. Repeated is_target_verify() or is_draft_extend() pattern (~15 occurrences)
Consider adding a ForwardMode.is_speculative() helper to reduce repetition and ensure consistent handling.
6. num_tokens fallback in init_forward_metadata uses stale spec config
When neither positions nor explicit num_tokens is provided for speculative modes, the code estimates from speculative_num_draft_tokens or speculative_num_steps set at init time. After partial accept the actual count could differ. Callers in CudaGraphWrapper do pass num_tokens explicitly, so this is a latent edge case, not a current bug.
Looks good
- Test coverage is thorough (metadata shape transitions, non-V4 backend contracts, SWA slot sanitization)
cache_start.clone()correctly prevents drafter from modifying sharedseq_lens_buf- SWA slot sanitization + CUDA kernel guard is defense-in-depth
_init_capture_metadataordering (before warmup + before capture) is correct, though a comment explaining why it runs twice would help
|
Supplement: 6 new functions in this PR are never called — dead code. Suggest removing them:
If these are reserved for follow-up work, please don't include them in this PR — add them when they're actually wired in. |
|
Thanks for the detailed review. I addressed the items one by one:
Removed
Fixed. I replaced the global
Fixed.
Added
Fixed. V4 speculative metadata setup now requires explicit dead code: Removed the unused |
…ken_id R1-0528-NVFP4-v2 marks q_a_proj / kv_a_proj_with_mqa in exclude_modules (stored as bf16 at logical shape), but DeepseekV3FusedQkvAProjWithMqa allocates an NVFP4-packed buffer because the fused prefix is not in exclude_modules. Detect component-level exclusion and pass through quant_config=None to fall back to bf16. Also add get_hot_token_id() returning None to DeepseekV3ForCausalLMNextN to match the EAGLE3/MTP drafter contract (mirrors qwen3_5_nextn.py). Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Mask graph-padded cache writes in the DeepSeek V4 paged-cache paths. CUDA graph replay can pad token rows with stale but in-range slot values, so capacity checks alone are not enough before cache inserts. Plumb ForwardMetadata.is_valid_token through V4 cache metadata and mask invalid rows to -1 before SWA KV cache, compressed KV cache, compressor state, indexer state, and CSA indexer cache inserts. Also expand per-request request indices and validity masks to per-token form for packed TP draft decode. Keep the earlier review follow-ups in this commit: preserve non-V4 draft-extend metadata, pass MTP spec_step_idx into draft models, and cover logits-processor gather fallback behavior. Validated: focused pytest groups for slot mapping and V4 group mapping pass locally; DP=4 short E2E sanity 4/4 OK with no hard errors; TP=2 short E2E sanity 2/2 OK with no NCCL hang or hard errors. Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
d7f0b0e to
a3f8f62
Compare
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f47a91173c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Xiangyi Zhang <xiangyiz@nvidia.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 01942d545a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| def __init__(self, **kwargs): | ||
| # Explicit __init__ prevents transformers from auto-generating one | ||
| # that skips Qwen3_5Config.__init__ (text/vision config setup). | ||
| super().__init__(**kwargs) |
There was a problem hiding this comment.
Why this Qwen3.5 fix is in a V4 MTP PR
This PR adds new attribute-access paths in ModelConfig.__init__ (via
_derive_num_attention_layers / is_deepseek_v4_nextn) to support V4 MTP.
Those new paths happen to exercise the Qwen3.5 config in a way that triggers
a latent bug on main introduced by transformers 5.6:
Qwen3_5MoeConfighad no explicit__init__- transformers 5.6 auto-generates one for subclasses, bypassing
Qwen3_5Config.__init__ text_configis therefore left in an uninitialized/wrapped form- Subsequent
__getattr__forwards loop ontext_configitself → RecursionError
Reproduced on origin/main (without this PR) using a minimal
self-referencing text_config setup; the same trace matches the CI
failure at qwen3_5_config.py:114-115.
@yweng0828 confirmed his other PR doesn't hit this because it doesn't go
through the new ModelConfig path.
Discussed with @dongjiyingdjy and @yweng0828: keeping the fix in this PR
since it's needed to unbreak CI here. Happy to cherry-pick to a standalone
follow-up if preferred.
CI evidence (pre-fix): https://github.com/lightseekorg/tokenspeed/actions/runs/26626283458/job/78483017501
Summary
++This PR is a bit large because it contains the V4/R1 MTP stack plus the acceptance-gap fix and follow-up CI/review fixes.
This PR closes the DeepSeek V4 MTP acceptance gap between TokenSpeed and TRTLLM.
Root cause:
Fix:
Validation
pre-commit run --all-files: passedpy_compileon touched runtime/test files: passedDecoded Tok/Iter = 2.8447Spec Accept Rate = 0.6485