feat(launcher): add DFlash support for DeepSeek-V4-Flash target model#1379
Conversation
Add launcher examples and infrastructure to run DeepSeek-V4-Flash as the DFlash speculative decoding target model in vLLM (container: deepseekv4-cu130).

New files:

- `examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`: Runtime patch script (P0–P11) applied to the vLLM container at job startup. Adds the EAGLE3/DFlash aux-hidden-state interface to `DeepseekV4ForCausalLM`, fixes KV cache group allocation for heterogeneous DFlash draft layers, handles the `fp8_ds_mla` dtype, and bypasses the `fp8_fp4_paged_mqa_logits` kernel that overflows H100 shared memory with block_size=256 × MLA 576 bytes/token. Includes a PR contribution strategy note in the module docstring.
- `examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml`: CW-DFW H100 smoke test: 8xH100, TP=8, fp8 KV cache, block_size=256, gpu_memory_utilization=0.85 (leaves ~4 GB headroom for Triton JIT), 15 speculative tokens, draft model `z-lab/DeepSeek-V4-Flash-DFlash`.
- `examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/`: vLLM and SGLang MTP smoke test YAMLs for B200, B300, and CW-DFW H100.
- `common/specdec/sglang_smoke_test.sh`: SGLang speculative decoding smoke test.
- `common/specdec/read_vllm_files.sh`: Diagnostic that dumps relevant vLLM source lines.

Modified files:

- `common/specdec/vllm_smoke_test.sh`: Add VLLM_PATCH_SCRIPT hook — if set, run the specified Python script before starting vLLM. Allows per-model patches without modifying the common script.
- `common/vllm/query.sh`: Port B200/data-synthesis fixes: `ulimit -u unlimited`, DEEPGEMM_TMPDIR redirect, COPY_MODEL_TO_TMPFS for NFS stale-handle avoidance, and the FORCE_AF_V2 patch for torch inductor compatibility with DeepSeek V4 fp8 ops.
- `core.py` / `slurm_config.py`: Expose `retries`, `requeue`, and `additional_parameters` in SlurmConfig and slurm_factory so long-running jobs can use Slurm's native requeue mechanism.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>
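The VLLM_PATCH_SCRIPT hook described above can be sketched roughly as follows; this is an illustrative sketch only, and everything except the VLLM_PATCH_SCRIPT variable name (the function name, messages) is assumed:

```shell
# Sketch of the VLLM_PATCH_SCRIPT hook: when the env var is set, run the
# named Python patch script before the server starts; otherwise do nothing.
maybe_apply_patch() {
    if [ -n "${VLLM_PATCH_SCRIPT:-}" ]; then
        echo "Applying patch script: ${VLLM_PATCH_SCRIPT}"
        python "${VLLM_PATCH_SCRIPT}" || {
            echo "ERROR: patch script ${VLLM_PATCH_SCRIPT} failed" >&2
            return 1
        }
    fi
}

maybe_apply_patch    # no-op when VLLM_PATCH_SCRIPT is unset
```

Keeping the hook in the common script means per-model patches live next to their YAML examples rather than accumulating in shared infrastructure.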
No actionable comments were generated in the recent review. 🎉

🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

The pull request introduces new runtime patching mechanisms and configuration enhancements for vLLM speculative decoding support. New bash scripts enable PyTorch Inductor v2 forcing and model placement optimizations, a Python patching utility configures DeepSeek-V4-Flash DFlash support, and SLURM configuration is extended with retry and requeue capabilities.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Important: pre-merge checks failed. Please resolve all errors before merging; addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🧹 Nitpick comments (6)
tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py (2)
82-93: 💤 Low value

Consider adding error handling for write failures.

The `_patch_file` function doesn't handle potential exceptions from `path.write_text()` (e.g., permission denied, disk full). While the container environment is likely controlled, unhandled write failures could silently corrupt state.

💡 Proposed defensive error handling
```diff
 def _patch_file(path: pathlib.Path, old: str, new: str, sentinel: str, label: str) -> bool:
     src = path.read_text()
     if sentinel in src:
         print(f"{label}: already patched")
         return True
     if old not in src:
         print(f"WARNING: {label}: pattern not found — skipping")
         return False
-    path.write_text(src.replace(old, new, 1))
-    _delete_pyc(path.stem)
-    print(f"{label}: OK")
-    return True
+    try:
+        path.write_text(src.replace(old, new, 1))
+        _delete_pyc(path.stem)
+        print(f"{label}: OK")
+        return True
+    except OSError as e:
+        print(f"ERROR: {label}: write failed — {e}")
+        return False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py` around lines 82 - 93, The _patch_file function can raise exceptions from path.write_text (e.g., permission errors or disk full) which are not currently handled; update _patch_file to catch exceptions around the write and _delete_pyc call (wrap the path.write_text and _delete_pyc(path.stem) in a try/except), log or print a clear error message including the exception and label, and return False on failure so callers know the patch failed; ensure sentinel/old checks remain unchanged and only the write/delete operations are guarded.
408-414: 💤 Low value

Consider using the existing torch import instead of an inline import.

The injected code uses `import torch as _t` inline. Since `flash_attn.py` already imports `torch`, this redundant import could be avoided by using the existing module-level `torch`.

💡 Simplified patch without redundant import
```diff
 _fa_new = re.sub(
     r"([ \t]*)raise ValueError\(f\"Unrecognized FP8 dtype: \{kv_cache_dtype\}\"\)",
     lambda m: (
         m.group(1) + "# _dflash_fp8_ds_mla_patch: fp8_ds_mla is e4m3fn stored by the compressor\n"
         + m.group(1) + "if kv_cache_dtype == \"fp8_ds_mla\":\n"
-        + m.group(1) + "    import torch as _t; return _t.float8_e4m3fn\n"
+        + m.group(1) + "    return torch.float8_e4m3fn\n"
         + m.group(1) + "raise ValueError(f\"Unrecognized FP8 dtype: {kv_cache_dtype}\")"
     ),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py` around lines 408 - 414, The injected replacement in the re.sub lambda currently does an inline "import torch as _t" when handling kv_cache_dtype == "fp8_ds_mla"; change it to use the existing module-level torch symbol instead (e.g., replace the inline import/alias and return _t.float8_e4m3fn with a direct return of torch.float8_e4m3fn) so the patch uses the already-imported torch in flash_attn.py and avoids a redundant inline import; update the lambda string in the _fa_new assignment that matches the raise ValueError to reference torch directly.

tools/launcher/core.py (1)
275-276: 💤 Low value

Long line reduces readability; consider splitting.

Line 276 combines dict unpacking, a conditional expression, and `getattr` in a single expression. While functional, this is hard to parse at a glance.

💡 Proposed refactor for readability
```diff
     retries=slurm_config.retries,
-    additional_parameters={**(slurm_config.additional_parameters or {}), **({"requeue": True} if getattr(slurm_config, "requeue", False) else {})},
+    additional_parameters={
+        **(slurm_config.additional_parameters or {}),
+        **({"requeue": True} if getattr(slurm_config, "requeue", False) else {}),
+    },
     packager=packager,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/launcher/core.py` around lines 275 - 276, The one-line dict merge for additional_parameters is hard to read; extract the merge into a small, clear pre-computation: create a local variable (e.g., merged_additional_parameters) by first copying slurm_config.additional_parameters or {} and then, if getattr(slurm_config, "requeue", False) is True, set merged_additional_parameters["requeue"] = True; finally pass merged_additional_parameters to the function call instead of the long one-liner. This keeps the logic around slurm_config, requeue, and additional_parameters explicit and easy to read.

tools/launcher/common/specdec/vllm_smoke_test.sh (1)
549-560: 💤 Low value

Consider using arrays for optional arguments to avoid word-splitting fragility.

The unquoted `${PARALLELISM_ARGS}` and `${OPTIONAL_ARGS}` rely on word splitting, which works for simple flags but can break if values contain spaces or special characters. While currently functional, using arrays throughout would be more robust.

♻️ Example: array-based approach
```diff
-# Build optional args
-OPTIONAL_ARGS=""
+# Build optional args as array
+OPTIONAL_ARGS=()
 if [ -n "${REASONING_PARSER:-}" ]; then
-    OPTIONAL_ARGS="${OPTIONAL_ARGS} --reasoning-parser ${REASONING_PARSER}"
+    OPTIONAL_ARGS+=(--reasoning-parser "${REASONING_PARSER}")
 fi
 # ... similar for other flags ...
-VLLM_CMD=(vllm serve "${MODEL}"
-    --max-num-batched-tokens "${MAX_BATCHED_TOKENS:-32768}"
-    ${PARALLELISM_ARGS}
-    --port "${PORT}"
-    ${OPTIONAL_ARGS})
+VLLM_CMD=(vllm serve "${MODEL}"
+    --max-num-batched-tokens "${MAX_BATCHED_TOKENS:-32768}"
+    "${PARALLELISM_ARGS[@]}"
+    --port "${PORT}"
+    "${OPTIONAL_ARGS[@]}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/launcher/common/specdec/vllm_smoke_test.sh` around lines 549 - 560, The VLLM command currently injects ${PARALLELISM_ARGS} and ${OPTIONAL_ARGS} via unquoted string expansion which is fragile; change those to arrays and append them into VLLM_CMD as array elements (e.g., ensure PARALLELISM_ARGS and OPTIONAL_ARGS are declared as arrays and replace unquoted expansions with VLLM_CMD+=( "${PARALLELISM_ARGS[@]}" ) and VLLM_CMD+=( "${OPTIONAL_ARGS[@]}" )), keep the conditional appends for --speculative-config and --compilation-config as array additions (VLLM_CMD+=(--speculative-config "$SPEC_CONFIG") etc.), and finally invoke "${VLLM_CMD[@]}" to preserve proper tokenization.

tools/launcher/common/vllm/query.sh (1)
103-316: 💤 Low value

Consider extracting the shared FORCE_AF_V2 patching logic.

This FORCE_AF_V2 patching block (lines 139-316) is nearly identical to the implementation in `vllm_smoke_test.sh` (lines 85-316). While duplicating the logic in each script ensures independence and avoids cross-script dependencies, consider extracting it to a shared Python script if the maintenance burden increases.

The `exec(PATCH_CODE)` on line 235 is safe since `PATCH_CODE` is a hardcoded string constant, not external input.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/launcher/common/vllm/query.sh` around lines 103 - 316, The FORCE_AF_V2 patching block is duplicated; extract the shared logic into a single reusable script/module (e.g., a Python file that exposes and installs the patch) and have both query.sh and vllm_smoke_test.sh load and run it instead of embedding PATCH_CODE inline. Move the large PATCH_CODE string and the install/exec logic (symbols: PATCH_CODE, vllm_force_af_v2_runtime, _install, exec(PATCH_CODE), and the FORCE_AF_V2 conditional) into the shared script, update each shell script to check FORCE_AF_V2 and call that shared script (or import/execute it) so the behavior remains identical while centralizing maintenance. Ensure the new shared file is installed into a location accessible to the runtime and keep the existing runtime prints and exception handling unchanged.

tools/launcher/common/specdec/sglang_smoke_test.sh (1)
65-67: ⚡ Quick win

Avoid unconditional pre-release `transformers` upgrades in smoke runs.

Lines 65-67 mutate runtime dependencies on every execution (`--pre`), which makes results non-reproducible and network-sensitive. Gate this behind an opt-in env var and require a tested spec/range.

Suggested fix
```diff
-echo "Upgrading transformers (--pre for deepseek_v4 support)..."
-pip install --upgrade --pre transformers -q || echo "WARNING: transformers upgrade failed, continuing"
+if [ "${UPGRADE_TRANSFORMERS:-0}" = "1" ]; then
+    : "${TRANSFORMERS_SPEC:?Set TRANSFORMERS_SPEC to a tested version/range}"
+    echo "Upgrading transformers (${TRANSFORMERS_SPEC})..."
+    pip install --upgrade "${TRANSFORMERS_SPEC}" -q || {
+        echo "ERROR: transformers upgrade failed"
+        exit 1
+    }
+fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 65 - 67, The smoke test unconditionally runs pip install --upgrade --pre transformers which mutates runtime deps each run; change the logic around the echo/"pip install --upgrade --pre transformers -q" invocation (the upgrade step labeled "Upgrading transformers (--pre for deepseek_v4 support)...") to only run when an opt-in environment variable (e.g., UPGRADE_TRANSFORMERS or USE_PRETRANSFORMERS) is set, and require a constrained, tested spec (e.g., a VERSION or SPEC env var or documented version range) instead of always using --pre; update the message to reflect the gated behavior and skip the network upgrade by default to keep smoke runs reproducible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tools/launcher/common/specdec/sglang_smoke_test.sh`:
- Around line 35-42: The cleanup() function should guard use of SERVER_PID and
only remove tmpfs copies this script created: check SERVER_PID exists/ non-empty
using parameter expansion (e.g., test -n "${SERVER_PID-}" or similar) before
running kill/kil -9 to avoid failures under set -u, and introduce a
TMPFS_MODEL_OWNED (or TMPFS_MODEL_CREATED) flag when you create or copy the
tmpfs model in the code that discovers TMPFS_MODEL (the logic around the TMPFS
discovery at the block that sets TMPFS_MODEL near where you check Lines 49-51);
then in cleanup() only rm -rf "$TMPFS_MODEL" when TMPFS_MODEL_OWNED is true so
shared tmpfs paths are not deleted by concurrent jobs.
- Around line 90-104: The loop that tries to write _deepseek_v4_patch.py and
deepseek_v4.pth may fail silently and only exec(STUB) patches the current
interpreter, so the spawned "python -m sglang.launch_server" process won't get
the patch; add a boolean flag (e.g., wrote_pth = False) before iterating
site.getsitepackages() + [site.getusersitepackages()], set it to True when
writing deepseek_v4.pth succeeds, and after the loop check the flag and call
sys.exit(1) (and print an error) if no writable site-packages were found so the
script fails fast instead of continuing to exec(STUB) and spawning the server
without the .pth patch.
In
`@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b200.yaml`:
- Line 24: Remove the unrecognized environment variable
SGLANG_APPLY_CONFIG_BACKUP from the YAML or, if it is intended to be supported,
add handling and documentation: update sglang_smoke_test.sh to read and act on
SGLANG_APPLY_CONFIG_BACKUP (e.g., apply a backup config path or "none"
semantics) and add a comment and README entry describing the variable and valid
values; ensure the YAML entry "- SGLANG_APPLY_CONFIG_BACKUP: \"none\"" is either
deleted or retained only after the script and docs are updated to reference
SGLANG_APPLY_CONFIG_BACKUP.
---
Nitpick comments:
In `@tools/launcher/common/specdec/sglang_smoke_test.sh`:
- Around line 65-67: The smoke test unconditionally runs pip install --upgrade
--pre transformers which mutates runtime deps each run; change the logic around
the echo/"pip install --upgrade --pre transformers -q" invocation (the upgrade
step labeled "Upgrading transformers (--pre for deepseek_v4 support)...") to
only run when an opt-in environment variable (e.g., UPGRADE_TRANSFORMERS or
USE_PRETRANSFORMERS) is set, and require a constrained, tested spec (e.g., a
VERSION or SPEC env var or documented version range) instead of always using
--pre; update the message to reflect the gated behavior and skip the network
upgrade by default to keep smoke runs reproducible.
In `@tools/launcher/common/specdec/vllm_smoke_test.sh`:
- Around line 549-560: The VLLM command currently injects ${PARALLELISM_ARGS}
and ${OPTIONAL_ARGS} via unquoted string expansion which is fragile; change
those to arrays and append them into VLLM_CMD as array elements (e.g., ensure
PARALLELISM_ARGS and OPTIONAL_ARGS are declared as arrays and replace unquoted
expansions with VLLM_CMD+=( "${PARALLELISM_ARGS[@]}" ) and VLLM_CMD+=(
"${OPTIONAL_ARGS[@]}" )), keep the conditional appends for --speculative-config
and --compilation-config as array additions (VLLM_CMD+=(--speculative-config
"$SPEC_CONFIG") etc.), and finally invoke "${VLLM_CMD[@]}" to preserve proper
tokenization.
In `@tools/launcher/common/vllm/query.sh`:
- Around line 103-316: The FORCE_AF_V2 patching block is duplicated; extract the
shared logic into a single reusable script/module (e.g., a Python file that
exposes and installs the patch) and have both query.sh and vllm_smoke_test.sh
load and run it instead of embedding PATCH_CODE inline. Move the large
PATCH_CODE string and the install/exec logic (symbols: PATCH_CODE,
vllm_force_af_v2_runtime, _install, exec(PATCH_CODE), and the FORCE_AF_V2
conditional) into the shared script, update each shell script to check
FORCE_AF_V2 and call that shared script (or import/execute it) so the behavior
remains identical while centralizing maintenance. Ensure the new shared file is
installed into a location accessible to the runtime and keep the existing
runtime prints and exception handling unchanged.
In `@tools/launcher/core.py`:
- Around line 275-276: The one-line dict merge for additional_parameters is hard
to read; extract the merge into a small, clear pre-computation: create a local
variable (e.g., merged_additional_parameters) by first copying
slurm_config.additional_parameters or {} and then, if getattr(slurm_config,
"requeue", False) is True, set merged_additional_parameters["requeue"] = True;
finally pass merged_additional_parameters to the function call instead of the
long one-liner. This keeps the logic around slurm_config, requeue, and
additional_parameters explicit and easy to read.
In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`:
- Around line 82-93: The _patch_file function can raise exceptions from
path.write_text (e.g., permission errors or disk full) which are not currently
handled; update _patch_file to catch exceptions around the write and _delete_pyc
call (wrap the path.write_text and _delete_pyc(path.stem) in a try/except), log
or print a clear error message including the exception and label, and return
False on failure so callers know the patch failed; ensure sentinel/old checks
remain unchanged and only the write/delete operations are guarded.
- Around line 408-414: The injected replacement in the re.sub lambda currently
does an inline "import torch as _t" when handling kv_cache_dtype ==
"fp8_ds_mla"; change it to use the existing module-level torch symbol instead
(e.g., replace the inline import/alias and return _t.float8_e4m3fn with a direct
return of torch.float8_e4m3fn) so the patch uses the already-imported torch in
flash_attn.py and avoids a redundant inline import; update the lambda string in
the _fa_new assignment that matches the raise ValueError to reference torch
directly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e768f4bf-2470-478f-a408-7f7a00c11644
📒 Files selected for processing (17)

- tools/launcher/common/specdec/read_vllm_files.sh
- tools/launcher/common/specdec/sglang_smoke_test.sh
- tools/launcher/common/specdec/vllm_smoke_test.sh
- tools/launcher/common/vllm/query.sh
- tools/launcher/core.py
- tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/hf_offline_eagle3.yaml
- tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/hf_offline_eagle3_data_cw_dfw.yaml
- tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b200.yaml
- tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b300.yaml
- tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_cw_dfw.yaml
- tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test.yaml
- tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_b200.yaml
- tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_b300.yaml
- tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_cw_dfw.yaml
- tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py
- tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml
- tools/launcher/slurm_config.py
```bash
cleanup() {
  kill "$SERVER_PID" 2>/dev/null || true
  sleep 2
  kill -9 "$SERVER_PID" 2>/dev/null || true
  if [ -n "$TMPFS_MODEL" ] && [ -d "$TMPFS_MODEL" ]; then
    echo "Removing tmpfs model copy: $TMPFS_MODEL"
    rm -rf "$TMPFS_MODEL"
  fi
```
Guard cleanup for unset PID and avoid deleting shared tmpfs copies.
With set -u, Line [36] can fail when the script exits before Line [141] assigns SERVER_PID. Also, tmpfs copies discovered at Line [49]-Line [51] are reused but still removed at exit, which can break concurrent jobs using the same path.
Suggested fix
```diff
 TMPFS_MODEL=""
+SERVER_PID=""
+TMPFS_MODEL_OWNED=0
 cleanup() {
-  kill "$SERVER_PID" 2>/dev/null || true
-  sleep 2
-  kill -9 "$SERVER_PID" 2>/dev/null || true
-  if [ -n "$TMPFS_MODEL" ] && [ -d "$TMPFS_MODEL" ]; then
+  if [ -n "${SERVER_PID:-}" ] && kill -0 "${SERVER_PID}" 2>/dev/null; then
+    kill "${SERVER_PID}" 2>/dev/null || true
+    sleep 2
+    kill -9 "${SERVER_PID}" 2>/dev/null || true
+  fi
+  if [ "${TMPFS_MODEL_OWNED:-0}" = "1" ] && [ -n "${TMPFS_MODEL:-}" ] && [ -d "${TMPFS_MODEL}" ]; then
     echo "Removing tmpfs model copy: $TMPFS_MODEL"
     rm -rf "$TMPFS_MODEL"
   fi
 }
@@
 if [ -d "$TMPFS_MODEL" ] && [ -f "$TMPFS_MODEL/config.json" ]; then
   echo "Using existing tmpfs model copy: $TMPFS_MODEL"
+  TMPFS_MODEL_OWNED=0
 else
@@
   cp -r "$MODEL" "$TMPFS_MODEL"
   echo "Model copy done: $TMPFS_MODEL"
+  TMPFS_MODEL_OWNED=1
 fi
```

Also applies to: 46-57, 141-141
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 35 - 42, The
cleanup() function should guard use of SERVER_PID and only remove tmpfs copies
this script created: check SERVER_PID exists/ non-empty using parameter
expansion (e.g., test -n "${SERVER_PID-}" or similar) before running kill/kill -9
to avoid failures under set -u, and introduce a TMPFS_MODEL_OWNED (or
TMPFS_MODEL_CREATED) flag when you create or copy the tmpfs model in the code
that discovers TMPFS_MODEL (the logic around the TMPFS discovery at the block
that sets TMPFS_MODEL near where you check Lines 49-51); then in cleanup() only
rm -rf "$TMPFS_MODEL" when TMPFS_MODEL_OWNED is true so shared tmpfs paths are
not deleted by concurrent jobs.
```python
for sp in site.getsitepackages() + [site.getusersitepackages()]:
    if not os.path.isdir(sp):
        continue
    try:
        with open(os.path.join(sp, '_deepseek_v4_patch.py'), 'w') as f:
            f.write(STUB)
        with open(os.path.join(sp, 'deepseek_v4.pth'), 'w') as f:
            f.write('import _deepseek_v4_patch\n')
        print(f"[patch] Wrote deepseek_v4.pth to {sp}")
        break
    except Exception as e:
        print(f"[patch] Could not write to {sp}: {e}")

exec(STUB)
PYEOF
```
🧩 Analysis chain

🏁 Script executed:

```bash
cat -n tools/launcher/common/specdec/sglang_smoke_test.sh | sed -n '85,110p'
```

Repository: NVIDIA/Model-Optimizer · Length of output: 1169

🏁 Script executed:

```bash
cat -n tools/launcher/common/specdec/sglang_smoke_test.sh | sed -n '1,95p'
```

Repository: NVIDIA/Model-Optimizer · Length of output: 4915

🏁 Script executed:

```bash
cat -n tools/launcher/common/specdec/sglang_smoke_test.sh | sed -n '105,150p'
```

Repository: NVIDIA/Model-Optimizer · Length of output: 2286
Fail fast when the .pth patch cannot be installed for spawned processes.
If all writable site-packages attempts fail (lines 90–102), the script continues silently. exec(STUB) at line 103 only patches the current interpreter; the spawned python -m sglang.launch_server process (line 131) depends on the .pth file for automatic patch application at startup. Without the file, the server starts without deepseek_v4 registered, causing opaque failures downstream.
Track whether .pth was successfully written and exit with an error if all site-packages locations are non-writable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 90 - 104,
The loop that tries to write _deepseek_v4_patch.py and deepseek_v4.pth may fail
silently and only exec(STUB) patches the current interpreter, so the spawned
"python -m sglang.launch_server" process won't get the patch; add a boolean flag
(e.g., wrote_pth = False) before iterating site.getsitepackages() +
[site.getusersitepackages()], set it to True when writing deepseek_v4.pth
succeeds, and after the loop check the flag and call sys.exit(1) (and print an
error) if no writable site-packages were found so the script fails fast instead
of continuing to exec(STUB) and spawning the server without the .pth patch.
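A minimal sketch of the fail-fast behaviour this comment asks for; the `STUB` constant and the `write_file` callback are placeholders standing in for the script's real patch source and file I/O:

```python
import sys

STUB = "# patch body goes here\n"  # placeholder for the real patch source

def install_pth(candidate_dirs, write_file):
    """Try each site-packages dir; report whether the .pth was written anywhere."""
    wrote_pth = False
    for sp in candidate_dirs:
        try:
            write_file(sp, "_deepseek_v4_patch.py", STUB)
            write_file(sp, "deepseek_v4.pth", "import _deepseek_v4_patch\n")
            print(f"[patch] Wrote deepseek_v4.pth to {sp}")
            wrote_pth = True
            break
        except OSError as e:
            print(f"[patch] Could not write to {sp}: {e}")
    return wrote_pth

# Fail fast so the spawned server never starts without the patch, e.g.:
#   if not install_pth(dirs, writer): sys.exit(1)
```

Returning a flag (rather than printing and moving on) is what lets the caller abort before launching the server process that depends on the `.pth` file.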
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #1379      +/-   ##
==========================================
- Coverage   74.46%   70.83%   -3.64%
==========================================
  Files         464      485      +21
  Lines       50089    55831    +5742
==========================================
+ Hits        37300    39548    +2248
- Misses      12789    16283    +3494
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
SGLang MTP smoke tests have not been run and verified. Remove them until they are tested end-to-end on a cluster. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: chenhany <chenhany@nvidia.com>
These files belong to the draft model folder and were not created or verified in this PR. Remove until they have dedicated test coverage. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: chenhany <chenhany@nvidia.com>
The z-lab/DeepSeek-V4-Flash-DFlash checkpoint is a 4-layer Qwen3-based scaffold created from z-lab's DFlash architecture reference with random weights. The smoke test verifies the inference pipeline starts and generates tokens, not acceptance rate or generation quality. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: chenhany <chenhany@nvidia.com>
Add launcher examples and infrastructure to run DeepSeek-V4-Flash as the DFlash speculative decoding target model in vLLM (container: deepseekv4-cu130).
New files:
examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py Runtime patch script (P0–P11) applied to the vLLM container at job startup. Adds the EAGLE3/DFlash aux-hidden-state interface to DeepseekV4ForCausalLM, fixes KV cache group allocation for heterogeneous DFlash draft layers, handles the fp8_ds_mla dtype, and bypasses the fp8_fp4_paged_mqa_logits kernel that overflows H100 shared memory with block_size=256 × MLA 576 bytes/token. Includes a PR contribution strategy note in the module docstring.
examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml CW-DFW H100 smoke test: 8xH100, TP=8, fp8 KV cache, block_size=256, gpu_memory_utilization=0.85 (leaves ~4 GB headroom for Triton JIT), 15 speculative tokens, draft model z-lab/DeepSeek-V4-Flash-DFlash.
examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/ (vllm + sglang MTP smoke tests) MTP smoke test YAMLs for B200, B300, and CW-DFW H100.
common/specdec/sglang_smoke_test.sh SGLang speculative decoding smoke test.
common/specdec/read_vllm_files.sh Diagnostic: dump relevant vLLM source lines.
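A diagnostic of that shape can be sketched as follows; this is a hypothetical illustration, not the script's actual contents, and the function name and argument format are assumptions:

```shell
# Sketch of a "dump relevant source lines" diagnostic: print a numbered
# slice of each file for arguments of the form path:START-END.
dump_lines() {
    for spec in "$@"; do
        file="${spec%%:*}"
        range="${spec##*:}"
        echo "=== ${file} (${range}) ==="
        cat -n "${file}" | sed -n "${range/-/,}p"
    done
}
```

Usage would look like `dump_lines vllm/attention/backends/flash_attn.py:400-420`, which mirrors the `cat -n ... | sed -n` commands the review's analysis chain ran.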
common/specdec/vllm_smoke_test.sh:
Add VLLM_PATCH_SCRIPT hook — if set, run the specified Python script before starting vLLM. Allows per-model patches without modifying the common script.
common/vllm/query.sh:
Port B200/data-synthesis fixes: ulimit -u unlimited, DEEPGEMM_TMPDIR redirect, COPY_MODEL_TO_TMPFS for NFS stale-handle avoidance, and FORCE_AF_V2 patch for torch inductor compatibility with DeepSeek V4 fp8 ops.
core.py / slurm_config.py:
Expose retries, requeue, and additional_parameters in SlurmConfig and slurm_factory so long-running jobs can use Slurm's native requeue mechanism.
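The SlurmConfig extension could look roughly like this; the field names come from the PR description, but the dataclass shape and the merge helper are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the fields the PR exposes on SlurmConfig; only
# retries/requeue/additional_parameters come from the PR description,
# the surrounding class layout is assumed.
@dataclass
class SlurmConfig:
    retries: int = 0
    requeue: bool = False
    additional_parameters: Optional[dict] = None

def merged_additional_parameters(cfg: SlurmConfig) -> dict:
    # Merge user-supplied sbatch parameters with the native requeue flag,
    # roughly as slurm_factory would before handing them to the executor.
    params = dict(cfg.additional_parameters or {})
    if cfg.requeue:
        params["requeue"] = True
    return params
```

Surfacing `requeue` as its own boolean (rather than asking users to spell out `additional_parameters={"requeue": True}`) keeps the common case of long-running, preemptible jobs to a one-line config change.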
What does this PR do?
Type of change: ?
Usage
```python
# Add a code snippet demonstrating how to use this
```

Testing

Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
Release Notes
New Features
Chores