
[DeepSeek] Default to top-k calibration with peer-max input amax sync #1380

Open

cjluo-nv wants to merge 3 commits into main from chenjiel/ds_moe

Conversation

cjluo-nv (Collaborator) commented May 1, 2026

Summary

  • DeepSeek PTQ (examples/deepseek/ptq.py) now defaults to native top-k routing during MoE calibration. The previous all-tokens-to-all-experts path (CalibMoe) is preserved behind a new --calib_all_experts flag.
  • After mtq.quantize, fixup_moe_expert_amax syncs every expert's input_quantizer.amax (w1/w2/w3) to the per-layer global peer max via dist.all_reduce(MAX) across EP ranks; weight_quantizer.amax stays per-expert, and any uncalibrated expert's weight amax is filled by computing amax over the dequantized FP8 weight (see the sketch below).
  • mtq.print_quant_summary is now also written to <output_path>/.quant_summary.txt, mirroring llm_ptq/hf_ptq.py.
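
A minimal sketch of the peer-max input-amax sync referenced above (illustrative only; the helper name and the per-layer `linears`/`world_size` plumbing are assumptions, while the stack-then-all_reduce(MAX) core mirrors the code under review):

```python
# Illustrative sketch of the per-layer peer-max sync (not the committed
# fixup_moe_expert_amax); `linears` is assumed to be one MoE layer's expert
# linears for a given projection (w1/w2/w3), spread across EP ranks.
import torch
import torch.distributed as dist


def sync_expert_input_amax(linears, world_size: int) -> None:
    quantizers = [lin.input_quantizer for lin in linears if lin.input_quantizer.is_enabled]
    calibrated = [q.amax.float() for q in quantizers if q.amax is not None]
    if not calibrated:
        return
    # Local peak across this rank's experts, then MAX-reduce across EP ranks so
    # every expert ends up with the layer-wide worst-case input amax.
    peer_max = torch.stack(calibrated).amax(dim=0)
    if world_size > 1 and dist.is_initialized():
        dist.all_reduce(peer_max, op=dist.ReduceOp.MAX)
    for q in quantizers:
        q.amax = peer_max.clone()
```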

Why

Forcing all tokens through every expert doubled calibration time and inflated input_quantizer.amax for cold-routing experts by exposing them to outliers they never see at inference. The new flow matches the inference-time routing distribution, runs roughly 2x faster, and mirrors the layer_sync_moe_local_experts_amax semantics that mtq applies automatically for QuantSequentialMLP-derived MoEs.

Validation (DeepSeek-V3.2-Exp, MP=8, NVFP4_DEFAULT_CFG)

Compared _amax_baseline (CalibMoe) vs _amax_synced (new default); a comparison sketch follows the list:

  • All 44,544 expert weight amaxes bit-identical.
  • Attention, shared experts, gate: identical.
  • Expert w1.input and w3.input (shared MoE block input): identical.
  • Expert w2.input (post-SiLU gated, expert-specific): synced to layer-wide peer max — 99.3% are larger than baseline (median 11.4x) since peer-max captures the worst-case outlier from any expert in the layer; 0.7% are smaller. This is the same trade-off set_expert_quantizer_amax makes for HF MoEs in unified_export_hf.py.
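
The percentages above come from diffing the per-quantizer amax values of the two runs. A sketch of such a comparison is below; it assumes each run dumped a flat {quantizer_name: amax_tensor} dict via torch.save, so the file names and the "w2.input_quantizer" key pattern are hypothetical (the actual layout of _amax_baseline/ and _amax_synced/ is not part of this PR):

```python
# Hypothetical amax-dump comparison; file names and key pattern are assumptions.
import torch

baseline = torch.load("_amax_baseline/amax.pt")
synced = torch.load("_amax_synced/amax.pt")

ratios = torch.cat(
    [
        (synced[name].float() / base.float()).flatten()
        for name, base in baseline.items()
        if "w2.input_quantizer" in name  # expert-specific post-SiLU inputs
    ]
)
print(f"larger than baseline: {(ratios > 1).float().mean().item():.1%}")
print(f"median ratio: {ratios.median().item():.1f}x")
```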

Test plan

  • DeepSeek-V3.2-Exp MP8 PTQ with default flags — completes in ~7 min (vs ~27 min with CalibMoe), produces _amax_synced/ consistent with the comparison above.
  • DeepSeek-V3.2-Exp MP8 PTQ with --calib_all_experts — produces _amax_baseline/ identical (other than rounding) to the prior CalibMoe-default behavior.
  • .quant_summary.txt written under output_path on rank 0.

Summary by CodeRabbit

  • New Features
    • Added a --calib_all_experts option to enable an alternate PTQ calibration mode; default remains top-k routing with a post-calibration per-layer peer-max synchronization and a compute fallback for uncalibrated experts.
  • Documentation
    • Clarified default and alternate calibration behaviors and added note about generation of a .quant_summary.txt summary file.

Previously DeepSeek PTQ forced every token through every MoE expert during
calibration via CalibMoe. This doubled the calibration forward pass and
exposed cold-routing experts to outliers they would never see at inference,
inflating their input_quantizer.amax.

Default now uses native top-k routing and runs fixup_moe_expert_amax after
mtq.quantize. For each MoE layer x linear (w1/w2/w3), every expert's
input_quantizer.amax is synced to the per-layer global peer max
(dist.all_reduce(MAX) across EP ranks). weight_quantizer.amax stays
per-expert; uncalibrated experts fall back to a compute path over the
dequantized FP8 weight.

The previous behavior is preserved behind --calib_all_experts.

Also write mtq.print_quant_summary output to <output_path>/.quant_summary.txt
to mirror llm_ptq/hf_ptq.py.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested a review from a team as a code owner May 1, 2026 19:06
@cjluo-nv cjluo-nv requested a review from meenchen May 1, 2026 19:06

coderabbitai Bot commented May 1, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 3e6b05e and 681c0ad.

📒 Files selected for processing (1)
  • CHANGELOG.rst
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst

📝 Walkthrough

Adds a calib_all_experts mode to DeepSeek PTQ. Default uses native top-k routing plus a post-calibration per-layer peer-max all-reduce to fill missing expert input_quantizer.amax; --calib_all_experts forces routing every token to every expert. CLI and quant-summary output path handling updated.

Changes

  • Changelog / README (CHANGELOG.rst, examples/deepseek/README.md): Documented the new default calibration behavior (native top-k plus post-hoc per-layer peer-max sync for expert input_quantizer.amax), the alternate --calib_all_experts all-experts calibration mode, the compute fallback for uncalibrated experts, and the creation of .quant_summary.txt in the output path.
  • PTQ Script / MoE calibration (examples/deepseek/ptq.py): Adds a calib_all_experts flag threaded through model loading, PTQ, and quant-summary output. When enabled, registers CalibMoe to route every token to every expert during calibration. When disabled, introduces fixup_moe_expert_amax() to compute per-layer peer-max input_quantizer.amax (all-reduced across EP ranks) and to fill missing weight_quantizer.amax by dequantizing FP8 weights and reducing amax (see the fallback sketch below); updates CLI wiring and conditional execution.
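
A rough sketch of that weight-amax fallback follows. It is illustrative only: `weight_dequant` is assumed to be the example's FP8 dequant helper and the reduce_amax import path is an assumption based on the review comments later in this thread.

```python
# Sketch of the weight_quantizer.amax fallback for an uncalibrated expert.
# Assumed imports: weight_dequant from the example's FP8 dequant kernel module
# and reduce_amax from modelopt's quantization utils; exact paths may differ.
import torch
from kernel import weight_dequant
from modelopt.torch.quantization.utils import reduce_amax


def fill_missing_weight_amax(linear) -> None:
    wq = linear.weight_quantizer
    if wq.amax is not None:  # expert was calibrated; keep its per-expert amax
        return
    w = linear.weight
    # FP8 weights are stored as 1-byte elements; dequantize to bf16 first so
    # amax reflects the real weight distribution, not raw bytes.
    deq = weight_dequant(w, w.scale, dtype=torch.bfloat16) if w.element_size() == 1 else w
    wq.amax = reduce_amax(deq, axis=getattr(wq, "_axis", None))
```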

Sequence Diagram(s)

sequenceDiagram
    participant User as CLI/User
    participant PTQ as PTQ Script
    participant MoE as MoE Layers
    participant Dist as Distributed Backend
    participant FS as Filesystem

    User->>PTQ: run with flags (--calib_all_experts?, --output_path)
    PTQ->>MoE: load model & register replacements
    alt calib_all_experts = true
        PTQ->>MoE: register CalibMoe (route all tokens to all experts)
        PTQ->>MoE: run calibration (every token -> every expert)
    else default (top-k)
        PTQ->>MoE: run calibration (native top-k routing)
        PTQ->>PTQ: run fixup_moe_expert_amax()
        PTQ->>MoE: collect valid expert input_quantizer.amax per layer
        PTQ->>Dist: all-reduce per-layer peer-max amax
        Dist->>PTQ: reduced peer-max values
        PTQ->>MoE: fill missing expert input/weight amax values
    end
    PTQ->>FS: write .quant_summary.txt to output_path

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 30.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (5 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): the PR title accurately summarizes the main change, switching DeepSeek PTQ to default top-k calibration with peer-max input amax synchronization, which is the primary focus across all modified files.
  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns (✅ Passed): the changes adhere to secure coding practices, with proper safetensors usage, a configurable trust_remote_code parameter, and no unsafe patterns detected.


codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.10%. Comparing base (50706d1) to head (681c0ad).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1380      +/-   ##
==========================================
+ Coverage   66.36%   67.10%   +0.74%     
==========================================
  Files         471      471              
  Lines       50510    50510              
==========================================
+ Hits        33522    33896     +374     
+ Misses      16988    16614     -374     
Flag coverage (Δ vs base):
  • examples: 41.58% <ø> (+0.93% ⬆️)
  • unit: 52.83% <ø> (-0.02% ⬇️)


@cjluo-nv cjluo-nv requested review from jingyu-ml and mxinO May 1, 2026 19:30
Contributor

@meenchen meenchen left a comment

Bot review — DM the bot to share feedback.

Bug found in fixup_moe_expert_amax: weight_dequant(w, w.scale, torch.bfloat16) passes torch.bfloat16 as the block_size positional argument instead of dtype. This will cause a runtime error (or incorrect behavior) in the triton kernel. The existing linear() function in the same file correctly uses the keyword form dtype=torch.bfloat16.

Otherwise the PR is well-structured and reasonable in size (~141 lines). The approach of syncing expert input amaxes to peer max across EP ranks makes sense, and the custom fixup_moe_expert_amax is justified since the existing sync_moe_expert_amax utility doesn't handle cross-rank dist.all_reduce needed for expert parallelism. The print_quant_summary with output_path uses the newly added signature in model_quant.py correctly. No new tests, but this is an example script so integration testing is the primary validation path.

Comment thread examples/deepseek/ptq.py
# to bf16 first so we measure the real weight distribution, not bytes.
deq = weight_dequant(w, w.scale, torch.bfloat16) if w.element_size() == 1 else w
axis = getattr(wq, "_axis", None)
if axis is None:
Contributor

Bot comment.

Bug: torch.bfloat16 is passed as the third positional argument, which maps to block_size (an int), not dtype. This will fail at runtime in the triton kernel.

# Current (broken):
deq = weight_dequant(w, w.scale, torch.bfloat16) if w.element_size() == 1 else w

# Fix — use the keyword argument, matching the `linear()` function above:
deq = weight_dequant(w, w.scale, dtype=torch.bfloat16) if w.element_size() == 1 else w

weight_dequant(x, s, block_size=128, dtype=...) — the third positional arg
is block_size, not dtype. Passing torch.bfloat16 there sets block_size to
the dtype object, which would either fail inside the triton kernel or
compute amax over corrupt blocks for any uncalibrated expert.

The bug never fired in our validation run because every expert was
activated during calibration (top-k over 256 experts × 1024 samples), so
the _missing(wq.amax) branch was dead. Spotted by bot review on PR #1380.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
cjluo-nv (Collaborator, Author) commented May 1, 2026

Thanks for the catch — fixed in 3e6b05e. Confirmed via grep that this was the only call site that passed dtype positionally; the existing linear() helper in the same file already uses the keyword form. As noted, the buggy branch never fired in our validation run (every expert got activated, so _missing(wq.amax) was always False), but it would have triggered for any setup with cold experts.
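
For completeness, one defensive option is to make block_size and dtype keyword-only, so the positional misuse fails immediately at the call site instead of reaching the triton kernel. This is a hypothetical variant of the helper's declaration, not how the example's weight_dequant is actually defined:

```python
import torch

# Hypothetical keyword-only signature; the example's real weight_dequant takes
# block_size positionally (weight_dequant(x, s, block_size=128, dtype=...)).
def weight_dequant(x: torch.Tensor, s: torch.Tensor, *, block_size: int = 128,
                   dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    raise NotImplementedError  # dequant kernel body omitted in this sketch

# weight_dequant(w, w.scale, torch.bfloat16)        -> TypeError (3 positional args)
# weight_dequant(w, w.scale, dtype=torch.bfloat16)  -> OK
```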

Contributor

@meenchen meenchen left a comment

Bot review — DM the bot to share feedback.

The critical bug from the previous review — weight_dequant(w, w.scale, torch.bfloat16) passing torch.bfloat16 as positional block_size — has been fixed to use dtype=torch.bfloat16 as a keyword argument.

The PR is well-scoped (~143 lines), correctly implements peer-max input amax sync across EP ranks via dist.all_reduce(MAX), preserves the old behavior behind --calib_all_experts, and includes good documentation. As an example script, integration testing (documented in the test plan) is the appropriate validation path. The reduce_amax import is valid and available from the quantization utils.

Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2



📥 Commits

Reviewing files that changed from the base of the PR and between f5d57cb and 3e6b05e.

📒 Files selected for processing (1)
  • examples/deepseek/ptq.py

Comment thread examples/deepseek/ptq.py
Comment on lines +267 to +279
qs = [
    lin.input_quantizer
    for lin in linears
    if lin.input_quantizer.is_enabled
    and not getattr(lin.input_quantizer, "_dynamic", False)
]
valid = [q.amax.float() for q in qs if not _missing(q.amax)]
if valid:
    m = torch.stack(valid).amax(dim=0)
    if world_size > 1:
        dist.all_reduce(m, op=dist.ReduceOp.MAX)
    for q in qs:
        q.amax = m.clone()

⚠️ Potential issue | 🔴 Critical

Make the all_reduce collective unconditional to prevent distributed deadlock.

dist.all_reduce() on the default process group requires every rank to participate. In expert-sharded MoE setups, some ranks may have no local experts or all experts with missing amax values, causing them to skip the collective while peer ranks enter it. This divergence hangs the entire calibration run. Switch to a pattern where all ranks participate unconditionally (e.g., gather optional local maxima from every rank, then reduce globally), or use a synchronized fallback when no valid values are present locally.
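
A minimal sketch of the deadlock-free shape of that suggestion (illustrative; not the fix committed to this PR, and the shape/device plumbing is assumed):

```python
# Every rank joins the MAX all-reduce, contributing a -inf placeholder when it
# has no locally calibrated amax, so the collective can never hang.
from typing import Optional

import torch
import torch.distributed as dist


def peer_max_or_none(local_valid: list[torch.Tensor], shape, device) -> Optional[torch.Tensor]:
    if local_valid:
        local_max = torch.stack(local_valid).amax(dim=0)
    else:
        local_max = torch.full(shape, float("-inf"), device=device)
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(local_max, op=dist.ReduceOp.MAX)
    if torch.isinf(local_max).all():
        return None  # no rank had calibrated values; caller keeps existing amax
    return local_max
```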

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/deepseek/ptq.py` around lines 267 - 279, The current code only calls
dist.all_reduce(m, ...) on ranks that have local valid values (m), which can
deadlock; change it so every rank participates: compute a local_max tensor (use
the same shape as m when any rank has valid entries — e.g., if valid on this
rank set local_max = m, otherwise create a tensor filled with -inf or a sentinel
of the same shape on the same device), then call dist.all_reduce(local_max,
op=dist.ReduceOp.MAX) unconditionally, assign the reduced result to a global_max
variable, and finally set each q.amax = global_max.clone() (and handle the case
where global_max is all sentinel/-inf by skipping or keeping previous values).
Ensure you update the code paths around qs, valid, m, world_size, and the
dist.all_reduce call so every rank calls dist.all_reduce with matching-shaped
tensors.

Comment thread examples/deepseek/ptq.py
Comment on lines +469 to +480
if not calib_all_experts:
    synced_input, fixed_weight = fixup_moe_expert_amax(transformer)
    if int(os.environ["LOCAL_RANK"]) == 0:
        print(
            f"Synced peer-max for {synced_input} expert input_quantizer(s) "
            f"and computed {fixed_weight} weight_quantizer amax(es) on rank 0."
        )

if int(os.environ["LOCAL_RANK"]) == 0:
    mtq.print_quant_summary(transformer)
    if output_path:
        os.makedirs(output_path, exist_ok=True)
        mtq.print_quant_summary(transformer, output_path)

⚠️ Potential issue | 🟠 Major


Use global rank for rank-zero logging, not LOCAL_RANK.

LOCAL_RANK is node-local; in multi-node distributed training, this condition executes on rank 0 of every node, causing concurrent writes to quant_summary.txt and other race conditions. Additionally, accessing os.environ["LOCAL_RANK"] directly raises KeyError if the variable is unset (e.g., outside torchrun).

Use dist.get_rank() with dist.is_initialized() to check global rank zero:

Suggested fix
+    is_rank_zero = not dist.is_initialized() or dist.get_rank() == 0
+
     if not calib_all_experts:
         synced_input, fixed_weight = fixup_moe_expert_amax(transformer)
-        if int(os.environ["LOCAL_RANK"]) == 0:
+        if is_rank_zero:
             print(
                 f"Synced peer-max for {synced_input} expert input_quantizer(s) "
                 f"and computed {fixed_weight} weight_quantizer amax(es) on rank 0."
             )

-    if int(os.environ["LOCAL_RANK"]) == 0:
+    if is_rank_zero:
         if output_path:
             os.makedirs(output_path, exist_ok=True)
         mtq.print_quant_summary(transformer, output_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/deepseek/ptq.py` around lines 469 - 480, The code uses
os.environ["LOCAL_RANK"] for rank-zero logging and file writes, which is
node-local and can be unset; replace those checks with a global-rank guard using
torch.distributed: use dist.is_initialized() ? dist.get_rank() == 0 : True to
decide the rank-zero path around the fixup_moe_expert_amax call and the
mtq.print_quant_summary/output_path directory creation; remove direct os.environ
access and ensure only global rank 0 performs os.makedirs(output_path,
exist_ok=True) and the mtq.print_quant_summary(transformer, output_path) call;
keep the logic around calib_all_experts, fixup_moe_expert_amax(transformer),
synced_input, fixed_weight, and the print message but gated by the global-rank
check.
