feat(launcher): add DFlash support for DeepSeek-V4-Flash target model#1379

Open
ChenhanYu wants to merge 4 commits into main from chenhany/dflash-deepseek-v4-flash

Conversation


ChenhanYu (Collaborator) commented Apr 30, 2026

Add launcher examples and infrastructure to run DeepSeek-V4-Flash as the DFlash speculative decoding target model in vLLM (container: deepseekv4-cu130).

New files:

  • examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py Runtime patch script (P0–P11) applied to the vLLM container at job startup. Adds the EAGLE3/DFlash aux-hidden-state interface to DeepseekV4ForCausalLM, fixes KV cache group allocation for heterogeneous DFlash draft layers, handles the fp8_ds_mla dtype, and bypasses the fp8_fp4_paged_mqa_logits kernel that overflows H100 shared memory with block_size=256 × MLA 576 bytes/token (256 × 576 B = 144 KiB per block). Includes a PR contribution strategy note in the module docstring.

  • examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml CW-DFW H100 smoke test: 8xH100, TP=8, fp8 KV cache, block_size=256, gpu_memory_utilization=0.85 (leaves ~4 GB headroom for Triton JIT), 15 speculative tokens, draft model z-lab/DeepSeek-V4-Flash-DFlash. A sketch of these settings follows this list.

  • examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/ (vllm + sglang MTP smoke tests) MTP smoke test YAMLs for B200, B300, and CW-DFW H100.

  • common/specdec/sglang_smoke_test.sh SGLang speculative decoding smoke test.

  • common/specdec/read_vllm_files.sh Diagnostic: dump relevant vLLM source lines.
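
As an illustration of what the smoke-test YAML expresses, here is a hedged sketch of its key settings. The key names are hypothetical (the launcher's actual schema may differ); the values come from the description above.

```yaml
# Hypothetical key names; values taken from the PR description.
target_model: deepseek-ai/DeepSeek-V4-Flash
draft_model: z-lab/DeepSeek-V4-Flash-DFlash   # DFlash draft checkpoint
tensor_parallel_size: 8                       # TP=8 on 8xH100
kv_cache_dtype: fp8
block_size: 256
gpu_memory_utilization: 0.85                  # ~4 GB headroom for Triton JIT
num_speculative_tokens: 15
```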

common/specdec/vllm_smoke_test.sh:
Add VLLM_PATCH_SCRIPT hook — if set, run the specified Python script before starting vLLM. Allows per-model patches without modifying the common script.
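
A minimal sketch of what such a hook typically looks like (the wiring is an assumption; only the VLLM_PATCH_SCRIPT name and semantics come from this PR):

```bash
# Run an optional per-model patch script before launching vLLM.
if [ -n "${VLLM_PATCH_SCRIPT:-}" ]; then
    echo "Applying patch script: ${VLLM_PATCH_SCRIPT}"
    python "${VLLM_PATCH_SCRIPT}" || { echo "Patch script failed"; exit 1; }
fi
```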

common/vllm/query.sh:
Port B200/data-synthesis fixes: ulimit -u unlimited, DEEPGEMM_TMPDIR redirect, COPY_MODEL_TO_TMPFS for NFS stale-handle avoidance, and FORCE_AF_V2 patch for torch inductor compatibility with DeepSeek V4 fp8 ops.
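
In shell form, the first three fixes amount to roughly the following sketch; the flag and variable names come from this PR, while the paths and exact handling are assumptions:

```bash
# Raise the process limit for JIT compilation worker pools.
ulimit -u unlimited

# Route DeepGEMM's JIT cache off NFS to local disk (path is illustrative).
export DEEPGEMM_TMPDIR="/tmp/deepgemm"
mkdir -p "${DEEPGEMM_TMPDIR}"

# Optionally copy the model into tmpfs to avoid NFS stale-handle errors.
if [ "${COPY_MODEL_TO_TMPFS:-0}" = "1" ]; then
    TMPFS_MODEL="/dev/shm/$(basename "${MODEL}")"
    cp -r "${MODEL}" "${TMPFS_MODEL}"
    MODEL="${TMPFS_MODEL}"
fi
```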

core.py / slurm_config.py:
Expose retries, requeue, and additional_parameters in SlurmConfig and slurm_factory so long-running jobs can use Slurm's native requeue mechanism.
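
A rough sketch of the new configuration surface (the three field names come from this PR; types, defaults, and surrounding class contents are assumptions):

```python
from dataclasses import dataclass


@dataclass
class SlurmConfig:
    # ... existing launcher fields ...
    retries: int = 0                 # forwarded to the executor's retry count
    requeue: bool = False            # if True, request Slurm's native requeue
    additional_parameters: dict | None = None  # extra Slurm parameters to pass through
```

A long-running job can then set retries and requeue so that preempted or node-failed jobs are requeued by Slurm instead of failing outright.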

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for running DeepSeek-V4 Flash models with speculative decoding optimization
    • Extended launcher configuration with new options for GPU memory utilization, server timeouts, and resource constraints
    • Added job retry and requeue capabilities for more reliable job execution
  • Chores

    • Improved launcher utilities and testing infrastructure


copy-pr-bot Bot commented Apr 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.



coderabbitai Bot commented Apr 30, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

The pull request introduces new runtime patching mechanisms and configuration enhancements for vLLM speculative decoding support. New bash scripts enable PyTorch Inductor v2 forcing and model placement optimizations, a Python patching utility configures DeepSeek-V4-Flash DFlash support, and SLURM configuration is extended with retry and requeue capabilities.

Changes

Cohort: vLLM Smoke Test & Execution Scripts
Files: tools/launcher/common/specdec/read_vllm_files.sh, tools/launcher/common/specdec/vllm_smoke_test.sh, tools/launcher/common/vllm/query.sh
Summary: Introduces a file-extraction utility and expands the smoke-test script with 7+ new environment variables controlling KV cache dtype, block sizing, expert parallelism, and GPU memory utilization. Adds PyTorch Inductor patching via the FORCE_AF_V2 flag, vLLM custom-op monkey patching for DeepSeek, an optional transformers upgrade, external patch-script hooks, and resource optimizations (ulimit, temp-file routing, model tmpfs copying, extended timeouts).

Cohort: DeepSeek-V4-Flash DFlash Support
Files: tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py, tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml
Summary: Adds an idempotent source-patching script performing 11 multi-module edits to enable DFlash speculative decoding for DeepSeek-V4-Flash, including DFlash registration, EAGLE3 auxiliary-state handling, KV-cache grouping logic, FP8 dtype fixes, and sparse-attention fallbacks. The workflow YAML configures a smoke test with DFlash settings (spec method, token count, tensor parallelism, fp8 cache) and points to the patching script.

Cohort: SLURM Configuration
Files: tools/launcher/core.py, tools/launcher/slurm_config.py
Summary: Extends the SlurmConfig dataclass with retries, requeue, and additional_parameters fields; updates slurm_factory to accept and propagate these parameters; modifies build_slurm_executor to forward retry counts and conditionally include the requeue flag in Slurm parameters.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

  • Security Anti-Patterns — ❌ Error: PR violates SECURITY.md by using bare # nosec comments without inline justification in core.py and launch.py, and lacks a security exception section in the PR description. Resolution: replace bare # nosec comments with inline explanations and add a security exception section to the PR description requesting @NVIDIA/modelopt-setup-codeowners review.
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: the title accurately summarizes the main objective of the PR: adding DFlash support for DeepSeek-V4-Flash as a target model in the launcher infrastructure.
  • Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.

coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py (2)

82-93: 💤 Low value

Consider adding error handling for write failures.

The _patch_file function doesn't handle potential exceptions from path.write_text() (e.g., permission denied, disk full). While the container environment is likely controlled, unhandled write failures could silently corrupt state.

💡 Proposed defensive error handling
 def _patch_file(path: pathlib.Path, old: str, new: str, sentinel: str, label: str) -> bool:
     src = path.read_text()
     if sentinel in src:
         print(f"{label}: already patched")
         return True
     if old not in src:
         print(f"WARNING: {label}: pattern not found — skipping")
         return False
-    path.write_text(src.replace(old, new, 1))
-    _delete_pyc(path.stem)
-    print(f"{label}: OK")
-    return True
+    try:
+        path.write_text(src.replace(old, new, 1))
+        _delete_pyc(path.stem)
+        print(f"{label}: OK")
+        return True
+    except OSError as e:
+        print(f"ERROR: {label}: write failed — {e}")
+        return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`
around lines 82 - 93, The _patch_file function can raise exceptions from
path.write_text (e.g., permission errors or disk full) which are not currently
handled; update _patch_file to catch exceptions around the write and _delete_pyc
call (wrap the path.write_text and _delete_pyc(path.stem) in a try/except), log
or print a clear error message including the exception and label, and return
False on failure so callers know the patch failed; ensure sentinel/old checks
remain unchanged and only the write/delete operations are guarded.

408-414: 💤 Low value

Consider using existing torch import instead of inline import.

The injected code uses import torch as _t inline. Since flash_attn.py already imports torch, this redundant import could be avoided by using the existing module-level torch.

💡 Simplified patch without redundant import
     _fa_new = re.sub(
         r"([ \t]*)raise ValueError\(f\"Unrecognized FP8 dtype: \{kv_cache_dtype\}\"\)",
         lambda m: (
             m.group(1) + "# _dflash_fp8_ds_mla_patch: fp8_ds_mla is e4m3fn stored by the compressor\n"
             + m.group(1) + "if kv_cache_dtype == \"fp8_ds_mla\":\n"
-            + m.group(1) + "    import torch as _t; return _t.float8_e4m3fn\n"
+            + m.group(1) + "    return torch.float8_e4m3fn\n"
             + m.group(1) + "raise ValueError(f\"Unrecognized FP8 dtype: {kv_cache_dtype}\")"
         ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`
around lines 408 - 414, The injected replacement in the re.sub lambda currently
does an inline "import torch as _t" when handling kv_cache_dtype ==
"fp8_ds_mla"; change it to use the existing module-level torch symbol instead
(e.g., replace the inline import/alias and return _t.float8_e4m3fn with a direct
return of torch.float8_e4m3fn) so the patch uses the already-imported torch in
flash_attn.py and avoids a redundant inline import; update the lambda string in
the _fa_new assignment that matches the raise ValueError to reference torch
directly.
tools/launcher/core.py (1)

275-276: 💤 Low value

Long line reduces readability; consider splitting.

Line 276 combines dict unpacking, conditional expression, and getattr in a single expression. While functional, this is hard to parse at a glance.

💡 Proposed refactor for readability
         retries=slurm_config.retries,
-        additional_parameters={**(slurm_config.additional_parameters or {}), **({"requeue": True} if getattr(slurm_config, "requeue", False) else {})},
+        additional_parameters={
+            **(slurm_config.additional_parameters or {}),
+            **({"requeue": True} if getattr(slurm_config, "requeue", False) else {}),
+        },
         packager=packager,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/core.py` around lines 275 - 276, The one-line dict merge for
additional_parameters is hard to read; extract the merge into a small, clear
pre-computation: create a local variable (e.g., merged_additional_parameters) by
first copying slurm_config.additional_parameters or {} and then, if
getattr(slurm_config, "requeue", False) is True, set
merged_additional_parameters["requeue"] = True; finally pass
merged_additional_parameters to the function call instead of the long one-liner.
This keeps the logic around slurm_config, requeue, and additional_parameters
explicit and easy to read.
tools/launcher/common/specdec/vllm_smoke_test.sh (1)

549-560: 💤 Low value

Consider using arrays for optional arguments to avoid word-splitting fragility.

The unquoted ${PARALLELISM_ARGS} and ${OPTIONAL_ARGS} rely on word splitting, which works for simple flags but can break if values contain spaces or special characters. While currently functional, using arrays throughout would be more robust.

♻️ Example: array-based approach
-# Build optional args
-OPTIONAL_ARGS=""
+# Build optional args as array
+OPTIONAL_ARGS=()
 if [ -n "${REASONING_PARSER:-}" ]; then
-    OPTIONAL_ARGS="${OPTIONAL_ARGS} --reasoning-parser ${REASONING_PARSER}"
+    OPTIONAL_ARGS+=(--reasoning-parser "${REASONING_PARSER}")
 fi
 # ... similar for other flags ...

-VLLM_CMD=(vllm serve "${MODEL}"
-    --max-num-batched-tokens "${MAX_BATCHED_TOKENS:-32768}"
-    ${PARALLELISM_ARGS}
-    --port "${PORT}"
-    ${OPTIONAL_ARGS})
+VLLM_CMD=(vllm serve "${MODEL}"
+    --max-num-batched-tokens "${MAX_BATCHED_TOKENS:-32768}"
+    "${PARALLELISM_ARGS[@]}"
+    --port "${PORT}"
+    "${OPTIONAL_ARGS[@]}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/vllm_smoke_test.sh` around lines 549 - 560, The
VLLM command currently injects ${PARALLELISM_ARGS} and ${OPTIONAL_ARGS} via
unquoted string expansion which is fragile; change those to arrays and append
them into VLLM_CMD as array elements (e.g., ensure PARALLELISM_ARGS and
OPTIONAL_ARGS are declared as arrays and replace unquoted expansions with
VLLM_CMD+=( "${PARALLELISM_ARGS[@]}" ) and VLLM_CMD+=( "${OPTIONAL_ARGS[@]}" )),
keep the conditional appends for --speculative-config and --compilation-config
as array additions (VLLM_CMD+=(--speculative-config "$SPEC_CONFIG") etc.), and
finally invoke "${VLLM_CMD[@]}" to preserve proper tokenization.
tools/launcher/common/vllm/query.sh (1)

103-316: 💤 Low value

Consider extracting shared FORCE_AF_V2 patching logic.

This FORCE_AF_V2 patching block (lines 139-316) is nearly identical to the implementation in vllm_smoke_test.sh (lines 85-316). While duplicating the logic in each script ensures independence and avoids cross-script dependencies, consider extracting to a shared Python script if maintenance burden increases.

The exec(PATCH_CODE) on line 235 is safe since PATCH_CODE is a hardcoded string constant, not external input.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/vllm/query.sh` around lines 103 - 316, The FORCE_AF_V2
patching block is duplicated; extract the shared logic into a single reusable
script/module (e.g., a Python file that exposes and installs the patch) and have
both query.sh and vllm_smoke_test.sh load and run it instead of embedding
PATCH_CODE inline. Move the large PATCH_CODE string and the install/exec logic
(symbols: PATCH_CODE, vllm_force_af_v2_runtime, _install, exec(PATCH_CODE), and
the FORCE_AF_V2 conditional) into the shared script, update each shell script to
check FORCE_AF_V2 and call that shared script (or import/execute it) so the
behavior remains identical while centralizing maintenance. Ensure the new shared
file is installed into a location accessible to the runtime and keep the
existing runtime prints and exception handling unchanged.
tools/launcher/common/specdec/sglang_smoke_test.sh (1)

65-67: ⚡ Quick win

Avoid unconditional pre-release transformers upgrades in smoke runs.

Lines 65–67 mutate runtime dependencies on every execution (--pre), which makes results non-reproducible and network-sensitive. Gate this behind an opt-in env var and require a tested spec/range.

Suggested fix
-echo "Upgrading transformers (--pre for deepseek_v4 support)..."
-pip install --upgrade --pre transformers -q || echo "WARNING: transformers upgrade failed, continuing"
+if [ "${UPGRADE_TRANSFORMERS:-0}" = "1" ]; then
+    : "${TRANSFORMERS_SPEC:?Set TRANSFORMERS_SPEC to a tested version/range}"
+    echo "Upgrading transformers (${TRANSFORMERS_SPEC})..."
+    pip install --upgrade "${TRANSFORMERS_SPEC}" -q || {
+        echo "ERROR: transformers upgrade failed"
+        exit 1
+    }
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 65 - 67, The
smoke test unconditionally runs pip install --upgrade --pre transformers which
mutates runtime deps each run; change the logic around the echo/"pip install
--upgrade --pre transformers -q" invocation (the upgrade step labeled "Upgrading
transformers (--pre for deepseek_v4 support)...") to only run when an opt-in
environment variable (e.g., UPGRADE_TRANSFORMERS or USE_PRETRANSFORMERS) is set,
and require a constrained, tested spec (e.g., a VERSION or SPEC env var or
documented version range) instead of always using --pre; update the message to
reflect the gated behavior and skip the network upgrade by default to keep smoke
runs reproducible.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b200.yaml`:
- Line 24: Remove the unrecognized environment variable SGLANG_APPLY_CONFIG_BACKUP from the YAML or, if it is intended to be supported, add handling and documentation: update sglang_smoke_test.sh to read and act on SGLANG_APPLY_CONFIG_BACKUP (e.g., apply a backup config path or "none" semantics) and add a comment and README entry describing the variable and valid values; ensure the YAML entry "- SGLANG_APPLY_CONFIG_BACKUP: \"none\"" is either deleted or retained only after the script and docs are updated to reference SGLANG_APPLY_CONFIG_BACKUP.

📥 Commits

Reviewing files that changed from the base of the PR and between bb08094 and 0b2f2b4.

📒 Files selected for processing (17)
  • tools/launcher/common/specdec/read_vllm_files.sh
  • tools/launcher/common/specdec/sglang_smoke_test.sh
  • tools/launcher/common/specdec/vllm_smoke_test.sh
  • tools/launcher/common/vllm/query.sh
  • tools/launcher/core.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/hf_offline_eagle3.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/hf_offline_eagle3_data_cw_dfw.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b200.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b300.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_cw_dfw.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_b200.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_b300.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_cw_dfw.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml
  • tools/launcher/slurm_config.py

Comment on lines +35 to +42
cleanup() {
    kill "$SERVER_PID" 2>/dev/null || true
    sleep 2
    kill -9 "$SERVER_PID" 2>/dev/null || true
    if [ -n "$TMPFS_MODEL" ] && [ -d "$TMPFS_MODEL" ]; then
        echo "Removing tmpfs model copy: $TMPFS_MODEL"
        rm -rf "$TMPFS_MODEL"
    fi

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard cleanup for unset PID and avoid deleting shared tmpfs copies.

With set -u, line 36 can fail when the script exits before line 141 assigns SERVER_PID. Also, tmpfs copies discovered at lines 49–51 are reused but still removed at exit, which can break concurrent jobs using the same path.

Suggested fix
 TMPFS_MODEL=""
+SERVER_PID=""
+TMPFS_MODEL_OWNED=0
 cleanup() {
-    kill "$SERVER_PID" 2>/dev/null || true
-    sleep 2
-    kill -9 "$SERVER_PID" 2>/dev/null || true
-    if [ -n "$TMPFS_MODEL" ] && [ -d "$TMPFS_MODEL" ]; then
+    if [ -n "${SERVER_PID:-}" ] && kill -0 "${SERVER_PID}" 2>/dev/null; then
+        kill "${SERVER_PID}" 2>/dev/null || true
+        sleep 2
+        kill -9 "${SERVER_PID}" 2>/dev/null || true
+    fi
+    if [ "${TMPFS_MODEL_OWNED:-0}" = "1" ] && [ -n "${TMPFS_MODEL:-}" ] && [ -d "${TMPFS_MODEL}" ]; then
         echo "Removing tmpfs model copy: $TMPFS_MODEL"
         rm -rf "$TMPFS_MODEL"
     fi
 }
@@
     if [ -d "$TMPFS_MODEL" ] && [ -f "$TMPFS_MODEL/config.json" ]; then
         echo "Using existing tmpfs model copy: $TMPFS_MODEL"
+        TMPFS_MODEL_OWNED=0
     else
@@
         cp -r "$MODEL" "$TMPFS_MODEL"
         echo "Model copy done: $TMPFS_MODEL"
+        TMPFS_MODEL_OWNED=1
     fi

Also applies to: 46-57, 141-141

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 35 - 42, the
cleanup() function should guard use of SERVER_PID and only remove tmpfs copies
this script created: check SERVER_PID exists / is non-empty using parameter
expansion (e.g., test -n "${SERVER_PID-}" or similar) before running kill / kill -9
to avoid failures under set -u, and introduce a TMPFS_MODEL_OWNED (or
TMPFS_MODEL_CREATED) flag when you create or copy the tmpfs model in the code
that discovers TMPFS_MODEL (the logic around the TMPFS discovery at the block
that sets TMPFS_MODEL near where you check lines 49-51); then in cleanup() only
rm -rf "$TMPFS_MODEL" when TMPFS_MODEL_OWNED is true so shared tmpfs paths are
not deleted by concurrent jobs.

Comment on lines +90 to +104
for sp in site.getsitepackages() + [site.getusersitepackages()]:
    if not os.path.isdir(sp):
        continue
    try:
        with open(os.path.join(sp, '_deepseek_v4_patch.py'), 'w') as f:
            f.write(STUB)
        with open(os.path.join(sp, 'deepseek_v4.pth'), 'w') as f:
            f.write('import _deepseek_v4_patch\n')
        print(f"[patch] Wrote deepseek_v4.pth to {sp}")
        break
    except Exception as e:
        print(f"[patch] Could not write to {sp}: {e}")

exec(STUB)
PYEOF

⚠️ Potential issue | 🟠 Major


Fail fast when the .pth patch cannot be installed for spawned processes.

If all writable site-packages attempts fail (lines 90–102), the script continues silently. exec(STUB) at line 103 only patches the current interpreter; the spawned python -m sglang.launch_server process (line 131) depends on the .pth file for automatic patch application at startup. Without the file, the server starts without deepseek_v4 registered, causing opaque failures downstream.

Track whether .pth was successfully written and exit with an error if all site-packages locations are non-writable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 90 - 104,
The loop that tries to write _deepseek_v4_patch.py and deepseek_v4.pth may fail
silently and only exec(STUB) patches the current interpreter, so the spawned
"python -m sglang.launch_server" process won't get the patch; add a boolean flag
(e.g., wrote_pth = False) before iterating site.getsitepackages() +
[site.getusersitepackages()], set it to True when writing deepseek_v4.pth
succeeds, and after the loop check the flag and call sys.exit(1) (and print an
error) if no writable site-packages were found so the script fails fast instead
of continuing to exec(STUB) and spawning the server without the .pth patch.
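
A minimal sketch of the requested fail-fast pattern (hypothetical; STUB is the patch source defined earlier in the script's heredoc):

```python
import os
import site
import sys

wrote_pth = False
for sp in site.getsitepackages() + [site.getusersitepackages()]:
    if not os.path.isdir(sp):
        continue
    try:
        with open(os.path.join(sp, '_deepseek_v4_patch.py'), 'w') as f:
            f.write(STUB)  # STUB: patch source from the surrounding heredoc
        with open(os.path.join(sp, 'deepseek_v4.pth'), 'w') as f:
            f.write('import _deepseek_v4_patch\n')
        wrote_pth = True
        break
    except OSError as e:
        print(f"[patch] Could not write to {sp}: {e}")

if not wrote_pth:
    print("[patch] ERROR: no writable site-packages; the spawned server would start unpatched")
    sys.exit(1)
```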


codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.83%. Comparing base (0678136) to head (ffe00b6).
⚠️ Report is 31 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1379      +/-   ##
==========================================
- Coverage   74.46%   70.83%   -3.64%     
==========================================
  Files         464      485      +21     
  Lines       50089    55831    +5742     
==========================================
+ Hits        37300    39548    +2248     
- Misses      12789    16283    +3494     
Flag | Coverage Δ
unit | 52.82% <ø> (+0.38%) ⬆️

Flags with carried forward coverage won't be shown.


ChenhanYu and others added 3 commits April 30, 2026 11:19
SGLang MTP smoke tests have not been run and verified. Remove them until
they are tested end-to-end on a cluster.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>
These files belong to the draft model folder and were not created or
verified in this PR. Remove until they have dedicated test coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>
The z-lab/DeepSeek-V4-Flash-DFlash checkpoint is a 4-layer Qwen3-based
scaffold created from z-lab's DFlash architecture reference with random
weights. The smoke test verifies the inference pipeline starts and
generates tokens, not acceptance rate or generation quality.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>