perf: Optimize inter-iteration small op#291

Open
yweng0828 wants to merge 2 commits into main from yweng/dev/opt_inter_iteration_small_kernel

Contributor

@yweng0828 yweng0828 commented May 28, 2026

Summary

This PR optimizes several small operations that run between iterations (in the runtime).

Taking Kimi2.5 + eagle3 as an example:

Before:

~160us (from the last kernel of the previous iteration to the first kernel of the next iteration)

34 kernels

After:

~85us (from the last kernel of the previous iteration to the first kernel of the next iteration)

14 kernels
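
For readers who want to reproduce the measurement, the gap quoted above (last kernel of the previous iteration to first kernel of the next) can be computed from per-kernel timestamps roughly as follows. This is only a sketch: the function name and the (start, end) tuple layout are hypothetical and not taken from this PR, and the timestamps in the usage lines are made up for illustration.

```python
def inter_iteration_gap_us(prev_iter_kernels, next_iter_kernels):
    """Gap between two iterations, in microseconds.

    Each argument is a list of (start_us, end_us) tuples, one per kernel.
    This layout is an assumption for illustration, not a profiler format.
    """
    last_end = max(end for _, end in prev_iter_kernels)
    first_start = min(start for start, _ in next_iter_kernels)
    return first_start - last_end


# Made-up timestamps, chosen only to exercise the function.
prev_iter = [(0, 40), (45, 100)]
next_iter = [(185, 200), (205, 260)]
gap = inter_iteration_gap_us(prev_iter, next_iter)

# Relative reduction implied by the approximate figures reported above.
before_us, after_us = 160, 85
reduction = 1 - after_us / before_us
```

With the reported figures, `reduction` works out to a little under one half of the original gap.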

Test Plan

@yweng0828 yweng0828 force-pushed the yweng/dev/opt_inter_iteration_small_kernel branch 3 times, most recently from 65ca7c9 to 07075df Compare May 28, 2026 06:32
@yweng0828 yweng0828 marked this pull request as ready for review May 28, 2026 06:41
@yweng0828 yweng0828 requested a review from a team as a code owner May 28, 2026 06:41

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 07075dffb2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +336 to 337
if batch_size < self.max_bs:
    self.seq_lens_buf[batch_size:].fill_(1)

P1: Reset padded request slots before graph replay

When CUDA graph replay pads a decode batch (padded_bs > bs) and speculative decoding is enabled, _forward_step writes drafter outputs to future_input_map[self.input_buffers.req_pool_indices_buf[:ctx.bs]] using the persistent buffer up to the padded size. This block now refreshes only seq_lens_buf for the tail. After a previous larger batch, the padded rows can therefore still contain stale real pool indices, and the captured graph can overwrite those inactive requests' future_input_map entries. Restore zero-filling req_pool_indices_buf[batch_size:] whenever batch_size < self.max_bs.
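
The failure mode described in this comment can be illustrated with a toy model. This is a sketch under stated assumptions, not the repository's code: plain Python lists stand in for the tensors, `prepare`, `replay`, `run`, and `MAX_BS` are hypothetical names, and slot 0 is assumed to be a dummy pool index that is safe to overwrite (which the zero-fill suggestion implies but the comment does not state).

```python
# Toy model of persistent input buffers padded for graph replay.
# Lists stand in for tensors; only the buffer names come from the review.
MAX_BS = 4  # hypothetical maximum batch size


def prepare(buffers, pool_indices, seq_lens, reset_pool_tail):
    """Copy the real batch into the persistent buffers and refresh the tail."""
    bs = len(pool_indices)
    buffers["req_pool_indices_buf"][:bs] = pool_indices
    buffers["seq_lens_buf"][:bs] = seq_lens
    if bs < MAX_BS:
        buffers["seq_lens_buf"][bs:] = [1] * (MAX_BS - bs)
        if reset_pool_tail:
            # Assumption: pool index 0 is a dummy slot.
            buffers["req_pool_indices_buf"][bs:] = [0] * (MAX_BS - bs)


def replay(buffers, future_input_map, padded_bs, draft_token):
    """Stand-in for graph replay: writes one entry per padded row."""
    for slot in buffers["req_pool_indices_buf"][:padded_bs]:
        future_input_map[slot] = draft_token


def run(reset_pool_tail):
    buffers = {
        "req_pool_indices_buf": [0] * MAX_BS,
        "seq_lens_buf": [1] * MAX_BS,
    }
    future_input_map = {i: None for i in range(8)}
    # Iteration 1: a full batch using pool slots 3, 5, 6, 7.
    prepare(buffers, [3, 5, 6, 7], [10, 11, 12, 13], reset_pool_tail)
    replay(buffers, future_input_map, 4, "iter1")
    # Iteration 2: only slot 3 is active, but the graph is padded to 2 rows.
    prepare(buffers, [3], [11], reset_pool_tail)
    replay(buffers, future_input_map, 2, "iter2")
    return future_input_map


stale = run(reset_pool_tail=False)
fixed = run(reset_pool_tail=True)
```

In this model, `stale[5]` is overwritten during iteration 2 even though request slot 5 is not in the batch, while `fixed[5]` keeps its iteration-1 value and the padded row lands on the dummy slot 0 instead.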

Useful? React with 👍 / 👎.

Contributor

@borontion borontion left a comment


Thanks. This is a place I also want to optimize.

@yweng0828 yweng0828 marked this pull request as draft May 29, 2026 02:59
@yweng0828 yweng0828 force-pushed the yweng/dev/opt_inter_iteration_small_kernel branch from 07075df to b46047c Compare May 31, 2026 14:28
@yweng0828 yweng0828 marked this pull request as ready for review May 31, 2026 15:06
@yweng0828
Contributor Author

This PR is not ready yet (but it will be soon); I changed its status to ready only to trigger CI.

@yweng0828 yweng0828 force-pushed the yweng/dev/opt_inter_iteration_small_kernel branch from d474b78 to ba7d7da Compare June 1, 2026 05:52

2 participants