perf: optimize logits allgather and parallelize eagle3 input projection by syuoni · Pull Request #295 · lightseekorg/tokenspeed

syuoni · 2026-05-28T09:56:30Z

Summary

Hidden-dim allgather fast path that beats NCCL on small token counts, wired through ColumnParallelLinear + logits projection, and EAGLE3 input projection parallelized.

New kernel: all_gather_inner (Triton, NVIDIA-only) — multimem.st-based push allgather over the hidden axis, with skip_entry_sync constexpr to drop the entry CAS barrier when the caller has externally guaranteed peer drain.
Backend wiring: TritonRSAGBackend.all_gather(dim=-1) dispatches to all_gather_inner on NVIDIA; AutoBackend routes 2-D hidden-dim allgathers through it. Comm-backend signatures cleaned up — rank is now derived internally via group.index(dist.get_rank()).
Logits: LogitsProcessor uses all_gather_inner for T ≤ 128, NCCL fallback above.
EAGLE3 fc: Eagle3MlaModel.fc (DSv3) and Eagle3LlamaModel.fc (Llama) switched from ReplicatedLinear / nn.Linear to ColumnParallelLinear(gather_output=True) — sharding the fc weight across attn TP. Expected fc time: ~50 μs → ~22 μs; saves ~231 MB/rank at TP=4.
Cleanup: deleted the now-dead simple_all_gather stack — FusionOp.AG_VOCAB, allgather_vocab, the TRT-LLM workspace plumbing, the CUDA kernel (all_gather.cu + flashinfer simple_all_gather namespace), Python bindings, and the kernel test.
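
The EAGLE3 fc change can be illustrated with a small pure-Python sketch. A column-parallel linear splits the weight's output columns across ranks, each rank computes its own slab, and gathering along the hidden dim recovers the replicated result. The helper names below are hypothetical and do not come from this PR. The shapes used for the memory arithmetic (an fc mapping 3 × 7168 inputs to 7168 outputs, stored in bf16) are also an assumption; under them the per-rank saving at TP=4 lands near the ~231 MB quoted above.

```python
def matmul(x, w):
    """Plain (tokens, in) @ (in, out) matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def column_shards(w, tp_size):
    """Split the output columns of w evenly across tp_size ranks."""
    out_dim = len(w[0])
    if out_dim % tp_size != 0:
        raise ValueError("output dim must divide evenly across ranks")
    width = out_dim // tp_size
    return [[row[r * width:(r + 1) * width] for row in w] for r in range(tp_size)]


def column_parallel_forward(x, w, tp_size):
    """Each rank multiplies by its shard; slabs are concatenated along hidden."""
    slabs = [matmul(x, shard) for shard in column_shards(w, tp_size)]
    return [sum((slab[t] for slab in slabs), []) for t in range(len(x))]


def saved_bytes_per_rank(in_dim, out_dim, tp_size, bytes_per_elem=2):
    """Weight bytes no longer held per rank once the weight is sharded."""
    full = in_dim * out_dim * bytes_per_elem
    return full - full // tp_size


# Assumed shapes (not stated in the PR): 3 * 7168 -> 7168, bf16, TP=4.
saving = saved_bytes_per_rank(3 * 7168, 7168, tp_size=4)
```

With the assumed shapes, `saving` is 231,211,008 bytes, and `column_parallel_forward(x, w, tp_size)` returns the same matrix as `matmul(x, w)` for any evenly divisible output dim.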

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 50603a6bba

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 22492842c2

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 2e011953b5

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 50056a0d89

… skippable entry sync

Adds a new collective entry point ``all_gather_inner`` next to the existing ``all_gather``. Unlike the outer (which gathers along the token dim and stages input through ``state.comm_buff``), the inner:

* Concatenates along the **hidden** dim: each rank contributes a column slab ``(total_tokens, hidden_list_in_group[rank])`` and the kernel multimem-broadcasts it into ``state.comm_buff`` at this rank's column band. Result is ``(total_tokens, sum(hidden_list_in_group))``.
* Reads ``hidden_states`` directly via the kernel's input pointer; no staging copy into ``state.comm_buff``. (The local rank's slot is populated by ``multimem.st``-to-self.)
* Accepts ``skip_entry_sync: bool`` (compile-time constexpr in the kernel) that elides the entry CAS barrier when the caller can externally guarantee cross-rank synchronization since the last buffer read.
* Uses CAS-based ``blockwise_barrier`` for both barriers (entry conditional via SKIP_ENTRY_SYNC constexpr, exit always). No change to barrier semantics, signal pad sizing, or symm-mem handle ownership.
* Mirrors the outer's ``rsag_resize_hidden_if_needed`` resize trick so a state oversized vs the active ``total_hidden`` returns a contiguous slice instead of a strided view.

API parallel to the outer: ``tp_hidden_dim`` (auto even-split, refuses remainder distribution because per-rank slices must be 8-aligned) or ``hidden_list_in_group`` (explicit per-rank widths). Each list entry must be ``> 0`` and a multiple of 8 bf16 (16-byte multimem.st alignment). Input must be contiguous bf16 with a 16-byte-aligned ``data_ptr()``; ``state.hidden_dim`` must be a multiple of 8.

NVIDIA-only; AMD is intentionally unsupported on this path.

Acks two rounds of codex review covering input/state alignment, zero-width shard rejection, and the precise skip_entry_sync safety contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yicheng Qiang <qiangyicheng@icloud.com>

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
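
The shard-width contract in the commit message above can be modelled in plain Python, which may help when reading the kernel. This is only a reference sketch of the described semantics (even split without remainder distribution, widths that are positive multiples of 8 elements, concatenation along the hidden dim). It is not the Triton kernel, and `even_split`, `check_widths`, `column_offsets` and `reference_all_gather_inner` are hypothetical names introduced here for illustration.

```python
ALIGN_ELEMS = 8  # 8 bf16 elements = 16 bytes, the alignment named in the commit message


def even_split(total_hidden, world_size):
    """Even per-rank widths; refuses a remainder instead of distributing it."""
    if total_hidden % world_size != 0:
        raise ValueError("total_hidden must divide evenly across ranks")
    return [total_hidden // world_size] * world_size


def check_widths(hidden_list_in_group):
    """Every shard width must be > 0 and a multiple of 8 elements."""
    for rank, width in enumerate(hidden_list_in_group):
        if width <= 0 or width % ALIGN_ELEMS != 0:
            raise ValueError(f"rank {rank}: invalid shard width {width}")


def column_offsets(hidden_list_in_group):
    """Start column of each rank's band in the gathered buffer."""
    offsets, start = [], 0
    for width in hidden_list_in_group:
        offsets.append(start)
        start += width
    return offsets


def reference_all_gather_inner(slabs):
    """Model of the result: slabs[rank] is a (total_tokens, width[rank]) nested list.

    Assumes at least one token per slab.
    """
    widths = [len(slab[0]) for slab in slabs]
    check_widths(widths)
    total_tokens = len(slabs[0])
    if any(len(slab) != total_tokens for slab in slabs):
        raise ValueError("all ranks must contribute the same number of tokens")
    out = [[None] * sum(widths) for _ in range(total_tokens)]
    for slab, start, width in zip(slabs, column_offsets(widths), widths):
        for t in range(total_tokens):
            # Each rank writes only its own column band of every token row.
            out[t][start:start + width] = slab[t]
    return out
```

A reference like this could serve as the expected value in a unit test that compares the kernel output against a simple concatenation along the last dim.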

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: dbfcafed6f

syuoni · 2026-05-29T12:32:40Z

The timeout of PR Test / ut-runtime-1gpu / linux-mi355-1gpu-lightseek (pull_request) is a pre-existing issue: #217 (comment)

Merging.

syuoni requested review from LorrinWWW, dongjiyingdjy, yweng0828 and zhyncs May 28, 2026 09:56

syuoni requested a review from a team as a code owner May 28, 2026 09:56

chatgpt-codex-connector Bot reviewed May 28, 2026

Comment thread python/tokenspeed/runtime/distributed/comm_ops.py

Comment thread python/tokenspeed/runtime/distributed/comm_backend/auto.py

Comment thread python/tokenspeed/runtime/layers/logits_processor.py

syuoni force-pushed the perf/logits-allgather branch from 50603a6 to 2249284 Compare May 28, 2026 10:02

chatgpt-codex-connector Bot reviewed May 28, 2026

Comment thread python/tokenspeed/runtime/layers/logits_processor.py Outdated

chatgpt-codex-connector Bot reviewed May 28, 2026

Comment thread python/tokenspeed/runtime/distributed/comm_ops.py

chatgpt-codex-connector Bot reviewed May 28, 2026

Comment thread python/tokenspeed/runtime/distributed/comm_backend/triton_rsag.py

qiangyicheng and others added 11 commits May 29, 2026 05:32

logits allgather

a222b90

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

TritonRSAG

b58b189

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

clean up

0593f8e

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

parallelize eagle input proj

c54fc52

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

clean simple_all_gather

bdac711

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix

acdccff

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix

882b667

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix ut

1436457

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix tp=1

275357b

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix eagle3 input proj

dbfcafe

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

syuoni force-pushed the perf/logits-allgather branch from 50056a0 to dbfcafe Compare May 29, 2026 06:15

chatgpt-codex-connector Bot reviewed May 29, 2026

Comment thread python/tokenspeed/runtime/layers/logits_processor.py

yweng0828 approved these changes May 29, 2026

syuoni merged commit 5241fd9 into lightseekorg:main May 29, 2026
108 of 122 checks passed
