perf: Optimize inter-iteration small op #291
65ca7c9 to
07075df
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 07075dffb2
    if batch_size < self.max_bs:
        self.seq_lens_buf[batch_size:].fill_(1)
Reset padded request slots before graph replay
When CUDA graph replay pads a decode batch (padded_bs > bs) and speculative decoding is enabled, _forward_step writes drafter outputs to future_input_map[self.input_buffers.req_pool_indices_buf[:ctx.bs]], using the persistent buffer up to the padded size. This block now refreshes only seq_lens_buf for the tail. After a previous larger batch, the padded rows can therefore still contain stale real pool indices, and the captured graph can overwrite those inactive requests' future_input_map entries. Restore zero-filling req_pool_indices_buf[batch_size:] whenever batch_size < self.max_bs.
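The hazard described in this comment can be reproduced without a GPU. The following is a minimal sketch that uses plain Python lists in place of the persistent tensors. The class `InputBuffers`, the helpers `replay_padded` and `slots_written_by_small_batch`, the pool indices, and the convention that slot 0 is a harmless dummy are all assumptions made for illustration; only the buffer names come from the review comment.

```python
# Plain-Python stand-in for the persistent CUDA graph input buffers.
# Everything here is illustrative: the class, the helpers, and the
# "slot 0 is a dummy" convention are assumptions, not the repo's code.


class InputBuffers:
    def __init__(self, max_bs):
        self.max_bs = max_bs
        self.req_pool_indices_buf = [0] * max_bs
        self.seq_lens_buf = [1] * max_bs

    def prepare(self, req_pool_indices, seq_lens, reset_pool_indices):
        batch_size = len(req_pool_indices)
        # Copy the live rows into the head of the persistent buffers.
        self.req_pool_indices_buf[:batch_size] = req_pool_indices
        self.seq_lens_buf[:batch_size] = seq_lens
        if batch_size < self.max_bs:
            tail = self.max_bs - batch_size
            # The tail of seq_lens_buf is always refreshed.
            self.seq_lens_buf[batch_size:] = [1] * tail
            if reset_pool_indices:
                # The step the review asks to restore.
                self.req_pool_indices_buf[batch_size:] = [0] * tail


def replay_padded(buffers, padded_bs, future_input_map, draft_outputs):
    """Mimic a captured graph: it always touches padded_bs rows."""
    written = []
    for row in range(padded_bs):
        slot = buffers.req_pool_indices_buf[row]
        future_input_map[slot] = draft_outputs[row]
        written.append(slot)
    return written


def slots_written_by_small_batch(reset_pool_indices):
    buffers = InputBuffers(max_bs=4)
    future_input_map = {}
    # A full batch of four requests runs first.
    buffers.prepare([5, 6, 7, 8], [10, 11, 12, 13], reset_pool_indices)
    replay_padded(buffers, 4, future_input_map, ["a", "b", "c", "d"])
    # Then a batch of two is padded back up to four rows.
    buffers.prepare([5, 6], [11, 12], reset_pool_indices)
    return replay_padded(buffers, 4, future_input_map, ["e", "f", "pad", "pad"])
```

Without the reset, the padded rows still point at slots 7 and 8 from the earlier batch, so the replay overwrites entries that belong to requests outside the current batch. With the reset, the padded rows all land on the dummy slot.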
borontion
left a comment
thanks. this is a place I also want to optimize.
07075df to
b46047c
This PR is not ready yet (but it will be soon), and I changed its status to ready in order to trigger CI.
d474b78 to
ba7d7da
Summary
This PR optimizes several small runtime operations that run between iterations.
Taking Kimi2.5 + eagle3 as an example:
Before:
~160us (from the last kernel of the previous iteration to the first kernel of the next iteration)

34 kernels

After:
~85us (from the last kernel of the previous iteration to the first kernel of the next iteration)

14 kernels
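
For a quick sense of scale, the reported figures work out to roughly a 47% shorter inter-iteration gap and about 59% fewer kernels. A small sketch of that arithmetic follows; the helper `relative_reduction` is a hypothetical name, and the inputs are the approximate values quoted above rather than new measurements.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Approximate values quoted in the summary above.
gap_reduction = relative_reduction(160.0, 85.0)   # inter-iteration gap in us
kernel_reduction = relative_reduction(34, 14)     # kernels launched in the gap

print(f"gap reduction: {gap_reduction:.3f}")
print(f"kernel reduction: {kernel_reduction:.3f}")
```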

Test Plan