Each prefill group uses 2 nodes, and 1 to 3 prefill groups can be launched in total. The current scripts launch 1 prefill group.
Each decode group uses 12 nodes.
The router and the client both run on the first node.
The client here is the benchmark.
All commands below are executed after logging in to the first node.
Some of them additionally need to be run inside the docker container.
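A minimal sketch of how the total node count follows from this topology (the variable names are illustrative; 1.salloc.sh contains the actual resource request):
# Node budget: 2 nodes per prefill group plus 12 nodes for the decode group
NUM_PREFILL_GROUPS=1                                  # 1~3 groups are supported; the current scripts start 1
TOTAL_NODES=$(( NUM_PREFILL_GROUPS * 2 + 12 ))
echo "requesting ${TOTAL_NODES} nodes"                # the router and the benchmark client share the first node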
bash 0.prepare_docker.sh
bash 0.download_model.sh
bash 1.salloc.sh
In the shell where the resources were allocated, run:
bash 2.get_node_list_env.sh
Running the command above generates node_list_env.sh, which the server launch scripts below rely on.
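The exact variables that node_list_env.sh exports depend on the script; a hedged sketch of how the launch scripts consume it (NODE_LIST is an assumed name):
# Hypothetical consumption pattern inside a launch script
source ./node_list_env.sh                 # generated by 2.get_node_list_env.sh
echo "head node: ${NODE_LIST[0]}"         # NODE_LIST is an assumed variable name; the router and benchmark run on the head node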
bash 3.launch_prefill_server.sh
bash 4.launch_decode_server.sh
Wait until the decode server and the prefill server have both finished starting before launching the router.
A decode (or prefill) server has finished starting once it prints "The server is fired up and ready to roll!".
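A hedged way to wait for that line automatically (the decode log path is the one used in the throughput example at the end of this document; the prefill log is assumed to follow the same naming pattern):
# Block until the decode server reports readiness (adjust the log path for the prefill server)
until grep -q "The server is fired up and ready to roll!" logs/launch_server_decode_node_rank_0.log; do
    sleep 10
done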
# Enter the docker container
bash enroot_exec_first_container.sh
# Launch the router
bash 5.launch_router.sh
# Enter the docker container
bash enroot_exec_first_container.sh
# Start the benchmark
bash 6.start_benchmark.sh
# Enter the docker container
bash enroot_exec_first_container.sh
# After receiving this command, the decoder sleeps 180s at the start of every run_batch() before executing the model forward.
bash 7.start_slow_down_decode.sh
After the 180s slow down has been applied, decode sleeps 180s at every run_batch(), so the KV cache produced by prefill keeps accumulating.
decode run_batch() can therefore pick up more running requests to execute in parallel.
running-req can grow up to SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK.
Watch the decode logs; once running-req has grown to SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK, run the next command (slow_down null) so that decode returns to normal and stops sleeping.
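A hedged helper for keeping an eye on that counter (the decode log path is the one used in the throughput example at the end of this document):
# Print the most recent #running-req value every 10 seconds
watch -n 10 'grep -o "#running-req: [0-9]*" logs/launch_server_decode_node_rank_0.log | tail -n 1'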
# Enter the docker container
bash enroot_exec_first_container.sh
# After receiving this command, the decoder stops sleeping before each forward pass.
bash 8.stop_slow_down_decode.sh
After sending this command, it takes about 180s before decode responds.
[2025-11-22 23:05:07 DP0 TP0 EP0] Capture cuda graph begin. This can take up to several minutes. avail mem=40.56 GB
[2025-11-22 23:05:07 DP0 TP0 EP0] Capture cuda graph bs [1024]
[2025-11-22 23:05:27 DP0 TP0 EP0] Capture cuda graph end. Time elapsed: 19.62 s. mem usage=31.07 GB. avail mem=9.49 GB.
[2025-11-22 23:05:30 DP0 TP0 EP0] max_total_num_tokens=3122368, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=1024, context_len=2176, available_gpu_mem=9.49 GB
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 0.42, #queue-req: 0,
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 25.99, #queue-req: 0,
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.31, #queue-req: 0,
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.20, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.08, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.12, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 0, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12.76, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 0, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 116.35, #queue-req: 0,
[2025-11-22 23:08:00 DP0 TP0 EP0] Cache flushed successfully!
[2025-11-22 23:15:42 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:18:43 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:21:43 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 0.89, #queue-req: 853,
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12112.66, #queue-req: 853,
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12241.95, #queue-req: 853,
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12156.81, #queue-req: 853,
Once decode has responded to the previous command, it is no longer sleeping and has entered the normal decode phase; you can then capture a profile with the commands below.
This can be repeated several times to obtain profile information at different moments (see the loop sketch after the profiling log below).
The script saves 5 steps of profile information per run.
# Enter the docker container
bash enroot_exec_first_container.sh
# Capture a profile
bash 9.sglang_profile.sh
[2025-11-22 23:24:52 DP0 TP0 EP0] Profiling starts. Traces will be saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/7.SGLang_PD/Scripts-SGLang/../torch_profiler (with profile id: 1763882692.8351896)
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 6892.73, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13324.23, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13146.41, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13180.73, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Stop profiling...
[2025-11-22 23:24:56 DP0 TP0 EP0] Profiling done. Traces are saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/7.SGLang_PD/Scripts-SGLang/../torch_profiler
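Since each run saves only 5 steps, a simple loop can collect snapshots at different moments (the number of iterations and the interval are arbitrary choices):
# Capture three profiles, one minute apart
for i in 1 2 3; do
    bash 9.sglang_profile.sh
    sleep 60
done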
To balance the load across EP ranks, there are two solutions (flag sketches follow solution-2 below):
Solution-1: record the expert distribution and reuse it when relaunching decode. The GB200 blog2 uses this solution.
Launch the decode server with --expert-distribution-recorder-mode stat and --expert-distribution-recorder-buffer-size -1.
Before starting the benchmark, start expert-distribution recording with bash enroot_exec_first_container.sh; bash z.1.start_record.sh.
Wait 30 minutes after executing bash 8.stop_slow_down_decode.sh (slow_down null), then dump the expert-distribution record with bash enroot_exec_first_container.sh; bash z.2.dump_record.sh.
The dumped file will be saved in ${SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR}.
For more information, please refer to Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs | LMSYS Org.
Then relaunch the decode server with --init-expert-location ${SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR}/expert_distribution_recorder_xxx.pt.
Solution-2: use the --eplb-algorithm deepseek and --enable-eplb flags to run real-time load rebalancing.
Launch the decode server with --eplb-algorithm deepseek and --enable-eplb. The EPLB manager then redistributes experts at runtime to balance the load. This step requires extra GPU memory.
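For reference, a hedged sketch of how these flags attach to the decode launch command (python -m sglang.launch_server is SGLang's standard entry point; all other arguments used by 4.launch_decode_server.sh are elided as "..."):
# Solution-1, recording run:
#   python -m sglang.launch_server ... --expert-distribution-recorder-mode stat --expert-distribution-recorder-buffer-size -1
# Solution-1, serving run with the recorded placement:
#   python -m sglang.launch_server ... --init-expert-location ${SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR}/expert_distribution_recorder_xxx.pt
# Solution-2, real-time rebalancing:
#   python -m sglang.launch_server ... --enable-eplb --eplb-algorithm deepseek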
Here is an example of throughput calculation. You can follow this example to calculate the throughput for ep48.
Search for "Profiling starts" and "Stop profiling" in logs/launch_server_decode_node_rank_0.log. We can find lines like the following, where "1760576844.4826152" is the filename prefix of the torch profile.
[2025-10-15 18:07:24 DP0 TP0 EP0] Profiling starts. Traces will be saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/6.SGLang_PD/Scripts-SGLang/../torch_profiler (with profile id: 1760576844.4826152)
We can also find lines like the following, where "#running-req: 285" indicates that the local batch size for DP0 is 285.
[2025-10-15 18:07:24 DP0 TP0 EP0] Decode batch. #running-req: 285, #token: 328960, token usage: 0.51, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 3344.60, #queue-req: 0,
Using the method above, we can find the local batch size for each DP rank: 285, 256, 254, 265, 271, 227, 221, 269, so the global batch size is 285 + 256 + 254 + 265 + 271 + 227 + 221 + 269 = 2048.
Now open the torch profile torch_profiler/1760576844.4826152-TP-0.trace.json.gz. We can see that the duration of each forward step is 65 ms.
Now we can calculate the throughput for each GPU: 2048 / 0.065 / 8 = 3938 (tok/s/GPU)
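The same arithmetic, reproduced as a shell one-liner (Python is only used as a calculator here):
python3 -c "print(sum([285, 256, 254, 265, 271, 227, 221, 269]) / 0.065 / 8)"   # 2048 tokens per step / 0.065 s per step / 8 GPUs ≈ 3938 tok/s/GPU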
SGLang developer guide: bench_serving
Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G | LMSYS Org
This SGLang PR contains scripts and results for B200 TP8 parallelism; it can be used as a reference example for setting up GB200 EP8.
Awesome-ML-SYS-Tutorial/sglang/code-walk-through
FlashInfer serves as a backend of SGLang; it is a good place to learn about some of the underlying design ideas.
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
