Each prefill group uses 2 nodes, and 1 to 3 prefill groups can be launched in total. The current scripts launch 1 prefill group.
Each decode group uses 12 nodes.
The router and the client both run on the first node.
The client here is the benchmark.
All commands below are executed after logging in to the first node.
Some of them additionally need to be run inside the docker container.
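A minimal sketch of how the total node count follows from this topology (the variable names are illustrative; 1.salloc.sh contains the actual resource request):
# Node budget: 2 nodes per prefill group plus 12 nodes for the decode group
NUM_PREFILL_GROUPS=1                                  # 1~3 groups are supported; the current scripts start 1
TOTAL_NODES=$(( NUM_PREFILL_GROUPS * 2 + 12 ))
echo "requesting ${TOTAL_NODES} nodes"                # the router and the benchmark client share the first node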
bash 0.prepare_docker.sh
bash 0.download_model.sh
bash 1.salloc.sh
In the shell where the resources were allocated, run:
bash 2.get_node_list_env.sh
Running the command above generates node_list_env.sh, which the server launch scripts below rely on.
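The exact variables that node_list_env.sh exports depend on the script; a hedged sketch of how the launch scripts consume it (NODE_LIST is an assumed name):
# Hypothetical consumption pattern inside a launch script
source ./node_list_env.sh                 # generated by 2.get_node_list_env.sh
echo "head node: ${NODE_LIST[0]}"         # NODE_LIST is an assumed variable name; the router and benchmark run on the head node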
bash 3.launch_prefill_server.sh
bash 4.launch_decode_server.sh
Wait until the decode server and the prefill server have both finished starting before launching the router.
A decode (or prefill) server has finished starting once it prints "The server is fired up and ready to roll!".
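A hedged way to wait for that line automatically (the decode log path is the one used in the throughput example at the end of this document; the prefill log is assumed to follow the same naming pattern):
# Block until the decode server reports readiness (adjust the log path for the prefill server)
until grep -q "The server is fired up and ready to roll!" logs/launch_server_decode_node_rank_0.log; do
    sleep 10
done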
# Enter the docker container
bash enroot_exec_first_container.sh
# Launch the router
bash 5.launch_router.sh
# Enter the docker container
bash enroot_exec_first_container.sh
# Start the benchmark
bash 6.start_benchmark.sh
# Enter the docker container
bash enroot_exec_first_container.sh
# After receiving this command, the decoder sleeps 180s at the start of every run_batch() before executing the model forward.
bash 7.start_slow_down_decode.sh
After the 180s slow down has been applied, decode sleeps 180s at every run_batch(), so the KV cache produced by prefill keeps accumulating.
decode run_batch() can therefore pick up more running requests to execute in parallel.
running-req can grow up to SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK.
Watch the decode logs; once running-req has grown to SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK, run the next command (slow_down null) so that decode returns to normal and stops sleeping.
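A hedged helper for keeping an eye on that counter (the decode log path is the one used in the throughput example at the end of this document):
# Print the most recent #running-req value every 10 seconds
watch -n 10 'grep -o "#running-req: [0-9]*" logs/launch_server_decode_node_rank_0.log | tail -n 1'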
# Enter the docker container
bash enroot_exec_first_container.sh
# After receiving this command, the decoder stops sleeping before each forward pass.
bash 8.stop_slow_down_decode.sh
After sending this command, it takes about 180s before decode responds.
[2025-11-22 23:05:07 DP0 TP0 EP0] Capture cuda graph begin. This can take up to several minutes. avail mem=40.56 GB
[2025-11-22 23:05:07 DP0 TP0 EP0] Capture cuda graph bs [1024]
[2025-11-22 23:05:27 DP0 TP0 EP0] Capture cuda graph end. Time elapsed: 19.62 s. mem usage=31.07 GB. avail mem=9.49 GB.
[2025-11-22 23:05:30 DP0 TP0 EP0] max_total_num_tokens=3122368, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=1024, context_len=2176, available_gpu_mem=9.49 GB
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 0.42, #queue-req: 0,
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 25.99, #queue-req: 0,
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.31, #queue-req: 0,
[2025-11-22 23:05:32 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.20, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.08, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 27.12, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 0, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12.76, #queue-req: 0,
[2025-11-22 23:05:33 DP0 TP0 EP0] Decode batch, #running-req: 1, #token: 0, token usage: 0.00, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 116.35, #queue-req: 0,
[2025-11-22 23:08:00 DP0 TP0 EP0] Cache flushed successfully!
[2025-11-22 23:15:42 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:18:43 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:21:43 DP0 TP0 EP0] Scheduler.run_batch sleep 180.0s
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 0.89, #queue-req: 853,
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12112.66, #queue-req: 853,
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12241.95, #queue-req: 853,
[2025-11-22 23:24:43 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 1922048, token usage: 0.62, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 12156.81, #queue-req: 853,
Once decode has responded to the previous command, it is no longer sleeping and has entered the normal decode phase; you can then capture a profile with the commands below.
This can be repeated several times to obtain profile information at different moments (see the loop sketch after the profiling log below).
The script saves 5 steps of profile information per run.
# Enter the docker container
bash enroot_exec_first_container.sh
# Capture a profile
bash 9.sglang_profile.sh
[2025-11-22 23:24:52 DP0 TP0 EP0] Profiling starts. Traces will be saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/7.SGLang_PD/Scripts-SGLang/../torch_profiler (with profile id: 1763882692.8351896)
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 6892.73, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13324.23, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13146.41, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Decode batch, #running-req: 1024, #token: 2053120, token usage: 0.66, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 13180.73, #queue-req: 853,
[2025-11-22 23:24:53 DP0 TP0 EP0] Stop profiling...
[2025-11-22 23:24:56 DP0 TP0 EP0] Profiling done. Traces are saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/7.SGLang_PD/Scripts-SGLang/../torch_profiler
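Since each run saves only 5 steps, a simple loop can collect snapshots at different moments (the number of iterations and the interval are arbitrary choices):
# Capture three profiles, one minute apart
for i in 1 2 3; do
    bash 9.sglang_profile.sh
    sleep 60
done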
To balance the load across EP ranks, there are two solutions (flag sketches follow solution-2 below):
Solution-1: record the expert distribution and reuse it when relaunching decode. The GB200 blog2 uses this solution.
Launch the decode server with --expert-distribution-recorder-mode stat and --expert-distribution-recorder-buffer-size -1.
Before starting the benchmark, start expert-distribution recording with bash enroot_exec_first_container.sh; bash z.1.start_record.sh.
Wait 30 minutes after executing bash 8.stop_slow_down_decode.sh (slow_down null), then dump the expert-distribution record with bash enroot_exec_first_container.sh; bash z.2.dump_record.sh.
The dumped file will be saved in ${SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR}.
For more information, please refer to Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs | LMSYS Org.
Then relaunch the decode server with --init-expert-location ${SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR}/expert_distribution_recorder_xxx.pt.
Solution-2: use the --eplb-algorithm deepseek and --enable-eplb flags to run real-time load rebalancing.
Launch the decode server with --eplb-algorithm deepseek and --enable-eplb. The EPLB manager then redistributes experts at runtime to balance the load. This step requires extra GPU memory.
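For reference, a hedged sketch of how these flags attach to the decode launch command (python -m sglang.launch_server is SGLang's standard entry point; all other arguments used by 4.launch_decode_server.sh are elided as "..."):
# Solution-1, recording run:
#   python -m sglang.launch_server ... --expert-distribution-recorder-mode stat --expert-distribution-recorder-buffer-size -1
# Solution-1, serving run with the recorded placement:
#   python -m sglang.launch_server ... --init-expert-location ${SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR}/expert_distribution_recorder_xxx.pt
# Solution-2, real-time rebalancing:
#   python -m sglang.launch_server ... --enable-eplb --eplb-algorithm deepseek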
Here is an example of throughput calculation. You can follow this example to calculate the throughput for ep48.
Search for "Profiling starts" and "Stop profiling" in logs/launch_server_decode_node_rank_0.log. We can find lines like the following, where "1760576844.4826152" is the filename prefix of the torch profile.
[2025-10-15 18:07:24 DP0 TP0 EP0] Profiling starts. Traces will be saved to: /lustre/fs1/portfolios/coreai/projects/coreai_devtech_all/users/shifangx/1.workspace/6.SGLang_PD/Scripts-SGLang/../torch_profiler (with profile id: 1760576844.4826152)
We can also find lines like the following, where "#running-req: 285" indicates that the local batch size for DP0 is 285.
[2025-10-15 18:07:24 DP0 TP0 EP0] Decode batch. #running-req: 285, #token: 328960, token usage: 0.51, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 3344.60, #queue-req: 0,
Using the method above, we can find the local batch size for each DP rank: 285, 256, 254, 265, 271, 227, 221, 269, so the global batch size is 285 + 256 + 254 + 265 + 271 + 227 + 221 + 269 = 2048.
Now open the torch profile torch_profiler/1760576844.4826152-TP-0.trace.json.gz. We can see that the duration of each forward step is 65 ms.
Now we can calculate the throughput for each GPU: 2048 / 0.065 / 8 = 3938 (tok/s/GPU)
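The same arithmetic, reproduced as a shell one-liner (Python is only used as a calculator here):
python3 -c "print(sum([285, 256, 254, 265, 271, 227, 221, 269]) / 0.065 / 8)"   # 2048 tokens per step / 0.065 s per step / 8 GPUs ≈ 3938 tok/s/GPU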
SGLang developer guide: bench_serving
Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G | LMSYS Org
This SGLang PR contains scripts and results for B200 TP8 parallelism; it can be used as a reference example for setting up GB200 EP8.
Awesome-ML-SYS-Tutorial/sglang/code-walk-through
FlashInfer serves as a backend of SGLang; it is a good place to learn about some of the underlying design ideas.
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
