[DeepSeek] Default to top-k calibration with peer-max input amax sync #1380
Conversation
Previously DeepSeek PTQ forced every token through every MoE expert during calibration via CalibMoe. This doubled the calibration forward pass and exposed cold-routing experts to outliers they would never see at inference, inflating their input_quantizer.amax. The default now uses native top-k routing and runs fixup_moe_expert_amax after mtq.quantize. For each MoE layer and linear (w1/w2/w3), every expert's input_quantizer.amax is synced to the per-layer global peer max (dist.all_reduce(MAX) across EP ranks). weight_quantizer.amax stays per-expert; uncalibrated experts fall back to computing amax over the dequantized FP8 weight. The previous behavior is preserved behind --calib_all_experts. Also write the mtq.print_quant_summary output to <output_path>/.quant_summary.txt to mirror llm_ptq/hf_ptq.py.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
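A minimal, self-contained sketch of the peer-max idea described above (illustrative only: the function names below are placeholders, and the real fixup_moe_expert_amax in examples/deepseek/ptq.py walks the quantized modules and handles missing amax values differently, as the review thread below discusses):

```python
import torch
import torch.distributed as dist


def sync_expert_input_amax(input_amaxes: list[torch.Tensor | None]) -> torch.Tensor:
    """Toy peer-max sync: max over all experts that saw calibration data locally,
    then a MAX all-reduce across EP ranks so every expert in the layer (calibrated
    or not) ends up with the same per-layer input amax."""
    # Assumes at least one expert on this rank has a calibrated amax; the review
    # thread below covers the multi-rank corner cases.
    valid = [a.float() for a in input_amaxes if a is not None]
    layer_max = torch.stack(valid).amax(dim=0)
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(layer_max, op=dist.ReduceOp.MAX)
    return layer_max


def weight_amax_fallback(dequantized_weight: torch.Tensor) -> torch.Tensor:
    """Toy per-expert weight fallback: amax computed directly from the already
    dequantized weight tensor; weight amax stays per-expert and is never synced."""
    return dequantized_weight.abs().amax()
```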
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as CLI/User
    participant PTQ as PTQ Script
    participant MoE as MoE Layers
    participant Dist as Distributed Backend
    participant FS as Filesystem
    User->>PTQ: run with flags (--calib_all_experts?, --output_path)
    PTQ->>MoE: load model & register replacements
    alt calib_all_experts = true
        PTQ->>MoE: register CalibMoe (route all tokens to all experts)
        PTQ->>MoE: run calibration (every token -> every expert)
    else default (top-k)
        PTQ->>MoE: run calibration (native top-k routing)
        PTQ->>PTQ: run fixup_moe_expert_amax()
        PTQ->>MoE: collect valid expert input_quantizer.amax per layer
        PTQ->>Dist: all-reduce per-layer peer-max amax
        Dist->>PTQ: reduced peer-max values
        PTQ->>MoE: fill missing expert input/weight amax values
    end
    PTQ->>FS: write .quant_summary.txt to output_path
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```diff
@@            Coverage Diff             @@
##             main    #1380      +/-   ##
==========================================
+ Coverage   66.36%   67.10%   +0.74%
==========================================
  Files         471      471
  Lines       50510    50510
==========================================
+ Hits        33522    33896     +374
+ Misses      16988    16614     -374
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
meenchen
left a comment
Bot review — DM the bot to share feedback.
Bug found in fixup_moe_expert_amax: weight_dequant(w, w.scale, torch.bfloat16) passes torch.bfloat16 as the block_size positional argument instead of dtype. This will cause a runtime error (or incorrect behavior) in the triton kernel. The existing linear() function in the same file correctly uses the keyword form dtype=torch.bfloat16.
Otherwise the PR is well-structured and reasonable in size (~141 lines). The approach of syncing expert input amaxes to peer max across EP ranks makes sense, and the custom fixup_moe_expert_amax is justified since the existing sync_moe_expert_amax utility doesn't handle cross-rank dist.all_reduce needed for expert parallelism. The print_quant_summary with output_path uses the newly added signature in model_quant.py correctly. No new tests, but this is an example script so integration testing is the primary validation path.
```python
# to bf16 first so we measure the real weight distribution, not bytes.
deq = weight_dequant(w, w.scale, torch.bfloat16) if w.element_size() == 1 else w
axis = getattr(wq, "_axis", None)
if axis is None:
```
Bot comment.
Bug: torch.bfloat16 is passed as the third positional argument, which maps to block_size (an int), not dtype. This will fail at runtime in the triton kernel.
```python
# Current (broken):
deq = weight_dequant(w, w.scale, torch.bfloat16) if w.element_size() == 1 else w
# Fix — use the keyword argument, matching the `linear()` function above:
deq = weight_dequant(w, w.scale, dtype=torch.bfloat16) if w.element_size() == 1 else w
```

weight_dequant(x, s, block_size=128, dtype=...) — the third positional arg is block_size, not dtype. Passing torch.bfloat16 there sets block_size to the dtype object, which would either fail inside the triton kernel or compute amax over corrupt blocks for any uncalibrated expert. The bug never fired in our validation run because every expert was activated during calibration (top-k over 256 experts × 1024 samples), so the _missing(wq.amax) branch was dead. Spotted by bot review on PR #1380.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Thanks for the catch — fixed in 3e6b05e. Confirmed via grep that this was the only call site that passed dtype positionally; the existing `linear()` call already uses the keyword form.
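To make the failure mode concrete, here is a small stand-in with the same parameter order as the weight_dequant(x, s, block_size=128, dtype=...) signature quoted above; the body is a placeholder, not the real triton-backed kernel.

```python
import torch


def weight_dequant_stub(x: torch.Tensor, s: torch.Tensor, block_size: int = 128,
                        dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Stand-in mimicking the parameter order of weight_dequant(x, s, block_size=128, dtype=...)."""
    assert isinstance(block_size, int), f"block_size got {block_size!r}, not an int"
    return x.to(dtype) * s.to(dtype)  # placeholder math, not real block-wise dequantization


x, s = torch.ones(4, 4), torch.ones(4, 4)

# Broken call shape: torch.bfloat16 binds to the third positional parameter, block_size.
# weight_dequant_stub(x, s, torch.bfloat16)  # raises here; the real kernel fails later instead

# Correct call shape: dtype passed by keyword, block_size keeps its default of 128.
out = weight_dequant_stub(x, s, dtype=torch.bfloat16)
print(out.dtype)  # torch.bfloat16
```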
meenchen
left a comment
Bot review — DM the bot to share feedback.
The critical bug from the previous review — weight_dequant(w, w.scale, torch.bfloat16) passing torch.bfloat16 as positional block_size — has been fixed to use dtype=torch.bfloat16 as a keyword argument.
The PR is well-scoped (~143 lines), correctly implements peer-max input amax sync across EP ranks via dist.all_reduce(MAX), preserves the old behavior behind --calib_all_experts, and includes good documentation. As an example script, integration testing (documented in the test plan) is the appropriate validation path. The reduce_amax import is valid and available from the quantization utils.
Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Actionable comments posted: 2
📒 Files selected for processing (1): examples/deepseek/ptq.py
```python
qs = [
    lin.input_quantizer
    for lin in linears
    if lin.input_quantizer.is_enabled
    and not getattr(lin.input_quantizer, "_dynamic", False)
]
valid = [q.amax.float() for q in qs if not _missing(q.amax)]
if valid:
    m = torch.stack(valid).amax(dim=0)
    if world_size > 1:
        dist.all_reduce(m, op=dist.ReduceOp.MAX)
    for q in qs:
        q.amax = m.clone()
```
🧩 Analysis chain
🏁 Scripts executed (repository: NVIDIA/Model-Optimizer):

```shell
# First, let's look at the full context around lines 267-279
head -290 examples/deepseek/ptq.py | tail -50

# Search for the entire fixup_moe_expert_amax function to understand the context better
rg -A 30 "def fixup_moe_expert_amax" examples/deepseek/ptq.py

# Check if there are any synchronization points before this code is called
rg -B 20 "fixup_moe_expert_amax" examples/deepseek/ptq.py | head -40

# Search for how world_size and distributed operations are set up
rg "world_size\s*=" examples/deepseek/ptq.py

# Look for any barrier or synchronization calls
rg "dist\.(barrier|all_gather|all_reduce|broadcast)" examples/deepseek/ptq.py
```

🏁 Web search: verify PyTorch distributed documentation on collective participation requirements.
Make the all_reduce collective unconditional to prevent distributed deadlock.
dist.all_reduce() on the default process group requires every rank to participate. In expert-sharded MoE setups, some ranks may have no local experts or all experts with missing amax values, causing them to skip the collective while peer ranks enter it. This divergence hangs the entire calibration run. Switch to a pattern where all ranks participate unconditionally (e.g., gather optional local maxima from every rank, then reduce globally), or use a synchronized fallback when no valid values are present locally.
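A hedged sketch of one way to make the collective unconditional, along the lines suggested here (assumptions: a single default process group, all ranks agree on the amax tensor shape, and the function and variable names are placeholders rather than the actual ptq.py code):

```python
import torch
import torch.distributed as dist


def peer_max_all_ranks(local_amaxes: list[torch.Tensor],
                       shape: tuple[int, ...],
                       device: torch.device) -> torch.Tensor | None:
    """Every rank calls all_reduce exactly once, even with no valid local amax,
    so expert-sharded ranks with no calibrated experts cannot stall the collective."""
    if local_amaxes:
        local_max = torch.stack([a.float() for a in local_amaxes]).amax(dim=0)
    else:
        # Sentinel that loses to any real amax under a MAX reduction.
        local_max = torch.full(shape, float("-inf"), device=device)
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(local_max, op=dist.ReduceOp.MAX)
    if torch.isinf(local_max).all():
        return None  # no rank had a calibrated expert for this layer; keep the existing amax
    return local_max
```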
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/deepseek/ptq.py` around lines 267 - 279, The current code only calls
dist.all_reduce(m, ...) on ranks that have local valid values (m), which can
deadlock; change it so every rank participates: compute a local_max tensor (use
the same shape as m when any rank has valid entries — e.g., if valid on this
rank set local_max = m, otherwise create a tensor filled with -inf or a sentinel
of the same shape on the same device), then call dist.all_reduce(local_max,
op=dist.ReduceOp.MAX) unconditionally, assign the reduced result to a global_max
variable, and finally set each q.amax = global_max.clone() (and handle the case
where global_max is all sentinel/-inf by skipping or keeping previous values).
Ensure you update the code paths around qs, valid, m, world_size, and the
dist.all_reduce call so every rank calls dist.all_reduce with matching-shaped
tensors.
```python
if not calib_all_experts:
    synced_input, fixed_weight = fixup_moe_expert_amax(transformer)
    if int(os.environ["LOCAL_RANK"]) == 0:
        print(
            f"Synced peer-max for {synced_input} expert input_quantizer(s) "
            f"and computed {fixed_weight} weight_quantizer amax(es) on rank 0."
        )

if int(os.environ["LOCAL_RANK"]) == 0:
    mtq.print_quant_summary(transformer)
    if output_path:
        os.makedirs(output_path, exist_ok=True)
        mtq.print_quant_summary(transformer, output_path)
```
🧩 Analysis chain
🏁 Scripts executed (repository: NVIDIA/Model-Optimizer):

```shell
cat -n examples/deepseek/ptq.py | sed -n '460,490p'

head -50 examples/deepseek/ptq.py

grep -n "LOCAL_RANK\|RANK\|dist\." examples/deepseek/ptq.py | head -20

sed -n '315,330p' examples/deepseek/ptq.py

sed -n '395,475p' examples/deepseek/ptq.py
```
Use global rank for rank-zero logging, not LOCAL_RANK.
LOCAL_RANK is node-local; in multi-node distributed training, this condition executes on rank 0 of every node, causing concurrent writes to quant_summary.txt and other race conditions. Additionally, accessing os.environ["LOCAL_RANK"] directly raises KeyError if the variable is unset (e.g., outside torchrun).
Use dist.get_rank() with dist.is_initialized() to check global rank zero:
Suggested fix

```diff
+is_rank_zero = not dist.is_initialized() or dist.get_rank() == 0
+
 if not calib_all_experts:
     synced_input, fixed_weight = fixup_moe_expert_amax(transformer)
-    if int(os.environ["LOCAL_RANK"]) == 0:
+    if is_rank_zero:
         print(
             f"Synced peer-max for {synced_input} expert input_quantizer(s) "
             f"and computed {fixed_weight} weight_quantizer amax(es) on rank 0."
         )

-if int(os.environ["LOCAL_RANK"]) == 0:
+if is_rank_zero:
     if output_path:
         os.makedirs(output_path, exist_ok=True)
         mtq.print_quant_summary(transformer, output_path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/deepseek/ptq.py` around lines 469 - 480, The code uses
os.environ["LOCAL_RANK"] for rank-zero logging and file writes, which is
node-local and can be unset; replace those checks with a global-rank guard using
torch.distributed: use dist.is_initialized() ? dist.get_rank() == 0 : True to
decide the rank-zero path around the fixup_moe_expert_amax call and the
mtq.print_quant_summary/output_path directory creation; remove direct os.environ
access and ensure only global rank 0 performs os.makedirs(output_path,
exist_ok=True) and the mtq.print_quant_summary(transformer, output_path) call;
keep the logic around calib_all_experts, fixup_moe_expert_amax(transformer),
synced_input, fixed_weight, and the print message but gated by the global-rank
check.
Summary
- The DeepSeek PTQ script (`examples/deepseek/ptq.py`) now defaults to native top-k routing during MoE calibration. The previous all-tokens-to-all-experts path (`CalibMoe`) is preserved behind a new `--calib_all_experts` flag (see the sketch after this list).
- After `mtq.quantize`, `fixup_moe_expert_amax` syncs every expert's `input_quantizer.amax` (w1/w2/w3) to the per-layer global peer max via `dist.all_reduce(MAX)` across EP ranks.
- `weight_quantizer.amax` stays per-expert; any uncalibrated expert is filled by computing amax over the dequantized FP8 weight.
- `mtq.print_quant_summary` output is now also written to `<output_path>/.quant_summary.txt`, mirroring `llm_ptq/hf_ptq.py`.
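A hypothetical sketch of how the flag could surface in the CLI; argparse and the help text are assumptions, and the actual parser in examples/deepseek/ptq.py may differ.

```python
import argparse

# Hypothetical wiring for the options this PR talks about; not the real parser in ptq.py.
parser = argparse.ArgumentParser(description="DeepSeek PTQ example (sketch)")
parser.add_argument(
    "--calib_all_experts",
    action="store_true",
    help="Legacy behavior: route every token through every expert (CalibMoe) during calibration. "
         "Default (off): calibrate with native top-k routing, then run fixup_moe_expert_amax.",
)
parser.add_argument(
    "--output_path",
    type=str,
    default=None,
    help="Directory for the quantized checkpoint; also receives .quant_summary.txt on rank 0.",
)

args = parser.parse_args([])  # empty argv just for demonstration; the real script parses sys.argv
print(args.calib_all_experts, args.output_path)  # -> False None
```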
Why
Forcing all tokens through every expert doubled calibration time and inflated `input_quantizer.amax` for cold-routing experts with outliers they never see at inference. The new flow matches the inference distribution, runs roughly 2x faster, and mirrors the `layer_sync_moe_local_experts_amax` semantics that mtq runs automatically for `QuantSequentialMLP`-derived MoEs.

Validation (DeepSeek-V3.2-Exp, MP=8, NVFP4_DEFAULT_CFG)
Compared `_amax_baseline` (`CalibMoe`) vs `_amax_synced` (new default):
- `w1.input` and `w3.input` (shared MoE block input): identical.
- `w2.input` (post-SiLU gated, expert-specific): synced to layer-wide peer max — 99.3% are larger than baseline (median 11.4x) since peer-max captures the worst-case outlier from any expert in the layer; 0.7% are smaller. This is the same trade-off `set_expert_quantizer_amax` makes for HF MoEs in `unified_export_hf.py`.

Test plan
- New default run: the `_amax_synced/` dump is consistent with the comparison above.
- `--calib_all_experts`: produces an `_amax_baseline/` dump identical (other than rounding) to the prior `CalibMoe`-default behavior.
- `.quant_summary.txt` written under `output_path` on rank 0.

Summary by CodeRabbit
- New `--calib_all_experts` option to enable an alternate PTQ calibration mode; default remains top-k routing with a post-calibration per-layer peer-max synchronization and a compute fallback for uncalibrated experts.
- Quantization summary is now also written to a `.quant_summary.txt` summary file.