[sglang broken] Add MI355X config: qwen3.5-fp4-sglang-mtp #1078
New file: benchmarks/single_node/qwen3.5_fp4_mi355x_mtp.sh (+73 lines)

```bash
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

export SGLANG_USE_AITER=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
MEM_FRAC_STATIC=${MEM_FRAC_STATIC:-0.8}

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
python3 -m sglang.launch_server --model-path=$MODEL --trust-remote-code \
    --host=0.0.0.0 --port=$PORT \
    --tensor-parallel-size=$TP \
    --attention-backend aiter \
    --mem-fraction-static $MEM_FRAC_STATIC \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --watchdog-timeout 1200 \
    --disable-radix-cache \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" --sleep-interval 60

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```
Comment on lines +53 to +63

Contributor

🔴 The new …

Extended reasoning

What the bug is: …

The specific code path: During benchmarking, …

Why existing code doesn't prevent it: The flag is purely opt-in on the benchmark client side. The SGLang server starts successfully with the EAGLE speculative decoding flags regardless, so there is no runtime error or warning; the benchmark runs to completion and produces results that look plausible but are inflated.

Impact: The reported MTP acceptance rates and the resulting throughput numbers will be higher than what users see in production deployments, where chat-formatted prompts are the norm. This overstates the benefit of EAGLE speculative decoding for Qwen3.5 FP4 on MI355X.

How to fix it: Add …

Step-by-step proof: (1) The non-MTP script …
```bash
# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
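For reference, a hypothetical invocation of the new script; the model ID and every parameter value below are illustrative assumptions, not values taken from this PR:

```bash
# Hypothetical invocation; the model ID and all values are illustrative
# assumptions only, not taken from the PR.
MODEL=Qwen/Qwen3.5-397B-A17B-FP4 \
TP=8 \
CONC=64 \
ISL=1024 \
OSL=1024 \
RANDOM_RANGE_RATIO=0.8 \
RESULT_FILENAME=qwen3.5_fp4_mi355x_mtp.json \
bash benchmarks/single_node/qwen3.5_fp4_mi355x_mtp.sh
```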
🔴 When EVAL_ONLY=true, qwen3.5_fp4_mi355x_mtp.sh calls setup_eval_context (lines 26-27) but never wires the resulting EVAL_MAX_MODEL_LEN into --context-length on the server launch command (lines 34-45), so the server allocates KV cache for the model's full native context window (~131K) instead of the eval-appropriate size. For an EAGLE MTP script already under memory pressure (FP4 + speculative decoding), this unnecessary over-allocation can cause OOM during eval runs. Fix by adding CONTEXT_LENGTH=$EVAL_MAX_MODEL_LEN after setup_eval_context and passing --context-length $CONTEXT_LENGTH to sglang.launch_server, matching every other MTP script in the repo.
Extended reasoning
What the bug is and how it manifests
In benchmarks/single_node/qwen3.5_fp4_mi355x_mtp.sh (lines 26-27), setup_eval_context() is called when EVAL_ONLY=true. That function (benchmark_lib.sh, lines 640-643) computes EVAL_MAX_MODEL_LEN and exports it, with an explicit comment: "Scripts then wire $EVAL_MAX_MODEL_LEN into whichever server variable they need." The new script never performs that wiring: the sglang.launch_server invocation on lines 34-45 has no --context-length argument at all.
The specific code path and why existing code does not prevent it
When EVAL_ONLY=true the server is started fresh by this script. Without --context-length, SGLang uses the model-config default, which for Qwen3.5-397B-A17B is ~131K tokens. The KV cache is allocated proportionally to that full window even when the eval tasks only require a small fraction of it. The EAGLE speculative decoding flags (--speculative-num-steps 3, --speculative-num-draft-tokens 4) already add a draft-model memory overhead on top of the main model. Combined with FP4 quantization under --mem-fraction-static 0.8, the total memory pressure during eval is significantly higher than in the non-MTP parent script, making the unnecessary KV cache over-allocation an OOM risk that the parent did not face.
Addressing the refutation
One verifier argues that compute_eval_context_length() caps EVAL_MAX_MODEL_LEN at the model's native max, so the server launched without --context-length (which also defaults to native max) would always fit eval prompts — no truncation occurs. This is technically correct for eval correctness. However, correctness and resource efficiency are distinct concerns. The server does not know at startup that eval prompts are short; it eagerly allocates KV cache pages for the full 131K window. In a memory-constrained EAGLE MTP configuration this unnecessary allocation is the trigger for OOM. The non-MTP parent script is not a valid precedent here: it runs without speculative decoding draft buffers, so its memory budget is materially different. The benchmark_lib.sh documentation explicitly designates wiring EVAL_MAX_MODEL_LEN as a required step for scripts that call setup_eval_context.
What impact this has
When CI runs this script with EVAL_ONLY=true on MI355X hardware, the server may OOM and crash before eval begins, producing a silent failure or a misleading error. At minimum, KV cache is over-allocated relative to what eval requires, reducing available memory for concurrent eval requests and potentially degrading throughput metrics.
How to fix it
After the setup_eval_context call, add:
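A minimal sketch of that wiring, reconstructed from the fix described in the summary above; the `${CONTEXT_LENGTH:+...}` parameter-expansion guard is an added assumption so the throughput-only path (where EVAL_ONLY is unset) stays unchanged:

```bash
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    # Use the eval-appropriate window instead of the model's native ~131K default.
    CONTEXT_LENGTH=$EVAL_MAX_MODEL_LEN
fi

# Forward the value only when it was set, so the non-eval path is unaffected.
python3 -m sglang.launch_server --model-path=$MODEL --trust-remote-code \
    --host=0.0.0.0 --port=$PORT \
    --tensor-parallel-size=$TP \
    --attention-backend aiter \
    --mem-fraction-static $MEM_FRAC_STATIC \
    ${CONTEXT_LENGTH:+--context-length $CONTEXT_LENGTH} \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --watchdog-timeout 1200 \
    --disable-radix-cache \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    > $SERVER_LOG 2>&1 &
```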
Then pass --context-length $CONTEXT_LENGTH to the sglang.launch_server invocation, as sketched above. This matches the pattern in every other MTP benchmark script: qwen3.5_fp8_b200_mtp.sh (L51-65), qwen3.5_fp8_b300_mtp.sh (L22-27), dsr1_fp8_b200_mtp.sh (L50-53).
Step-by-step proof