[DeepSeek] Default to top-k calibration with peer-max input amax sync #1380
Conversation
Previously DeepSeek PTQ forced every token through every MoE expert during calibration via CalibMoe. This doubled the calibration forward pass and exposed cold-routing experts to outliers they would never see at inference, inflating their input_quantizer.amax. The default now uses native top-k routing and runs fixup_moe_expert_amax after mtq.quantize. For each MoE layer and linear (w1/w2/w3), every expert's input_quantizer.amax is synced to the per-layer global peer max (dist.all_reduce(MAX) across EP ranks). weight_quantizer.amax stays per-expert; uncalibrated experts fall back to computing amax over the dequantized FP8 weight. The previous behavior is preserved behind --calib_all_experts. Also write the mtq.print_quant_summary output to <output_path>/.quant_summary.txt to mirror llm_ptq/hf_ptq.py.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
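A minimal, self-contained sketch of the peer-max idea described above (illustrative only: the function names below are placeholders, and the real fixup_moe_expert_amax in examples/deepseek/ptq.py walks the quantized modules and handles missing amax values differently, as the review thread below discusses):

```python
import torch
import torch.distributed as dist


def sync_expert_input_amax(input_amaxes: list[torch.Tensor | None]) -> torch.Tensor:
    """Toy peer-max sync: max over all experts that saw calibration data locally,
    then a MAX all-reduce across EP ranks so every expert in the layer (calibrated
    or not) ends up with the same per-layer input amax."""
    # Assumes at least one expert on this rank has a calibrated amax; the review
    # thread below covers the multi-rank corner cases.
    valid = [a.float() for a in input_amaxes if a is not None]
    layer_max = torch.stack(valid).amax(dim=0)
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(layer_max, op=dist.ReduceOp.MAX)
    return layer_max


def weight_amax_fallback(dequantized_weight: torch.Tensor) -> torch.Tensor:
    """Toy per-expert weight fallback: amax computed directly from the already
    dequantized weight tensor; weight amax stays per-expert and is never synced."""
    return dequantized_weight.abs().amax()
```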
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as CLI/User
    participant PTQ as PTQ Script
    participant MoE as MoE Layers
    participant Dist as Distributed Backend
    participant FS as Filesystem
    User->>PTQ: run with flags (--calib_all_experts?, --output_path)
    PTQ->>MoE: load model & register replacements
    alt calib_all_experts = true
        PTQ->>MoE: register CalibMoe (route all tokens to all experts)
        PTQ->>MoE: run calibration (every token -> every expert)
    else default (top-k)
        PTQ->>MoE: run calibration (native top-k routing)
        PTQ->>PTQ: run fixup_moe_expert_amax()
        PTQ->>MoE: collect valid expert input_quantizer.amax per layer
        PTQ->>Dist: all-reduce per-layer peer-max amax
        Dist->>PTQ: reduced peer-max values
        PTQ->>MoE: fill missing expert input/weight amax values
    end
    PTQ->>FS: write .quant_summary.txt to output_path
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```diff
@@            Coverage Diff             @@
##             main    #1380      +/-   ##
==========================================
+ Coverage   66.36%   67.10%   +0.74%
==========================================
  Files         471      471
  Lines       50510    50510
==========================================
+ Hits        33522    33896     +374
+ Misses      16988    16614     -374
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
meenchen
left a comment
Bot review — DM the bot to share feedback.
Bug found in fixup_moe_expert_amax: weight_dequant(w, w.scale, torch.bfloat16) passes torch.bfloat16 as the block_size positional argument instead of dtype. This will cause a runtime error (or incorrect behavior) in the triton kernel. The existing linear() function in the same file correctly uses the keyword form dtype=torch.bfloat16.
Otherwise the PR is well-structured and reasonable in size (~141 lines). The approach of syncing expert input amaxes to peer max across EP ranks makes sense, and the custom fixup_moe_expert_amax is justified since the existing sync_moe_expert_amax utility doesn't handle cross-rank dist.all_reduce needed for expert parallelism. The print_quant_summary with output_path uses the newly added signature in model_quant.py correctly. No new tests, but this is an example script so integration testing is the primary validation path.
```python
# to bf16 first so we measure the real weight distribution, not bytes.
deq = weight_dequant(w, w.scale, torch.bfloat16) if w.element_size() == 1 else w
axis = getattr(wq, "_axis", None)
if axis is None:
```
Bot comment.
Bug: torch.bfloat16 is passed as the third positional argument, which maps to block_size (an int), not dtype. This will fail at runtime in the triton kernel.
```python
# Current (broken):
deq = weight_dequant(w, w.scale, torch.bfloat16) if w.element_size() == 1 else w
# Fix — use the keyword argument, matching the `linear()` function above:
deq = weight_dequant(w, w.scale, dtype=torch.bfloat16) if w.element_size() == 1 else w
```

weight_dequant(x, s, block_size=128, dtype=...) — the third positional arg is block_size, not dtype. Passing torch.bfloat16 there sets block_size to the dtype object, which would either fail inside the triton kernel or compute amax over corrupt blocks for any uncalibrated expert. The bug never fired in our validation run because every expert was activated during calibration (top-k over 256 experts × 1024 samples), so the _missing(wq.amax) branch was dead. Spotted by bot review on PR #1380.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Thanks for the catch — fixed in 3e6b05e. Confirmed via grep that this was the only call site that passed dtype positionally; the existing `linear()` call already uses the keyword form.
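To make the failure mode concrete, here is a small stand-in with the same parameter order as the weight_dequant(x, s, block_size=128, dtype=...) signature quoted above; the body is a placeholder, not the real triton-backed kernel.

```python
import torch


def weight_dequant_stub(x: torch.Tensor, s: torch.Tensor, block_size: int = 128,
                        dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Stand-in mimicking the parameter order of weight_dequant(x, s, block_size=128, dtype=...)."""
    assert isinstance(block_size, int), f"block_size got {block_size!r}, not an int"
    return x.to(dtype) * s.to(dtype)  # placeholder math, not real block-wise dequantization


x, s = torch.ones(4, 4), torch.ones(4, 4)

# Broken call shape: torch.bfloat16 binds to the third positional parameter, block_size.
# weight_dequant_stub(x, s, torch.bfloat16)  # raises here; the real kernel fails later instead

# Correct call shape: dtype passed by keyword, block_size keeps its default of 128.
out = weight_dequant_stub(x, s, dtype=torch.bfloat16)
print(out.dtype)  # torch.bfloat16
```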
meenchen
left a comment
Bot review — DM the bot to share feedback.
The critical bug from the previous review — weight_dequant(w, w.scale, torch.bfloat16) passing torch.bfloat16 as positional block_size — has been fixed to use dtype=torch.bfloat16 as a keyword argument.
The PR is well-scoped (~143 lines), correctly implements peer-max input amax sync across EP ranks via dist.all_reduce(MAX), preserves the old behavior behind --calib_all_experts, and includes good documentation. As an example script, integration testing (documented in the test plan) is the appropriate validation path. The reduce_amax import is valid and available from the quantization utils.
Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Actionable comments posted: 2
📒 Files selected for processing (1): examples/deepseek/ptq.py
```python
qs = [
    lin.input_quantizer
    for lin in linears
    if lin.input_quantizer.is_enabled
    and not getattr(lin.input_quantizer, "_dynamic", False)
]
valid = [q.amax.float() for q in qs if not _missing(q.amax)]
if valid:
    m = torch.stack(valid).amax(dim=0)
    if world_size > 1:
        dist.all_reduce(m, op=dist.ReduceOp.MAX)
    for q in qs:
        q.amax = m.clone()
```
🧩 Analysis chain
🏁 Scripts executed (repository: NVIDIA/Model-Optimizer):

```shell
# First, let's look at the full context around lines 267-279
head -290 examples/deepseek/ptq.py | tail -50

# Search for the entire fixup_moe_expert_amax function to understand the context better
rg -A 30 "def fixup_moe_expert_amax" examples/deepseek/ptq.py

# Check if there are any synchronization points before this code is called
rg -B 20 "fixup_moe_expert_amax" examples/deepseek/ptq.py | head -40

# Search for how world_size and distributed operations are set up
rg "world_size\s*=" examples/deepseek/ptq.py

# Look for any barrier or synchronization calls
rg "dist\.(barrier|all_gather|all_reduce|broadcast)" examples/deepseek/ptq.py
```

🏁 Web search: verify PyTorch distributed documentation on collective participation requirements.
Make the all_reduce collective unconditional to prevent distributed deadlock.
dist.all_reduce() on the default process group requires every rank to participate. In expert-sharded MoE setups, some ranks may have no local experts or all experts with missing amax values, causing them to skip the collective while peer ranks enter it. This divergence hangs the entire calibration run. Switch to a pattern where all ranks participate unconditionally (e.g., gather optional local maxima from every rank, then reduce globally), or use a synchronized fallback when no valid values are present locally.
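A hedged sketch of one way to make the collective unconditional, along the lines suggested here (assumptions: a single default process group, all ranks agree on the amax tensor shape, and the function and variable names are placeholders rather than the actual ptq.py code):

```python
import torch
import torch.distributed as dist


def peer_max_all_ranks(local_amaxes: list[torch.Tensor],
                       shape: tuple[int, ...],
                       device: torch.device) -> torch.Tensor | None:
    """Every rank calls all_reduce exactly once, even with no valid local amax,
    so expert-sharded ranks with no calibrated experts cannot stall the collective."""
    if local_amaxes:
        local_max = torch.stack([a.float() for a in local_amaxes]).amax(dim=0)
    else:
        # Sentinel that loses to any real amax under a MAX reduction.
        local_max = torch.full(shape, float("-inf"), device=device)
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(local_max, op=dist.ReduceOp.MAX)
    if torch.isinf(local_max).all():
        return None  # no rank had a calibrated expert for this layer; keep the existing amax
    return local_max
```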
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/deepseek/ptq.py` around lines 267 - 279, The current code only calls
dist.all_reduce(m, ...) on ranks that have local valid values (m), which can
deadlock; change it so every rank participates: compute a local_max tensor (use
the same shape as m when any rank has valid entries — e.g., if valid on this
rank set local_max = m, otherwise create a tensor filled with -inf or a sentinel
of the same shape on the same device), then call dist.all_reduce(local_max,
op=dist.ReduceOp.MAX) unconditionally, assign the reduced result to a global_max
variable, and finally set each q.amax = global_max.clone() (and handle the case
where global_max is all sentinel/-inf by skipping or keeping previous values).
Ensure you update the code paths around qs, valid, m, world_size, and the
dist.all_reduce call so every rank calls dist.all_reduce with matching-shaped
tensors.
```python
if not calib_all_experts:
    synced_input, fixed_weight = fixup_moe_expert_amax(transformer)
    if int(os.environ["LOCAL_RANK"]) == 0:
        print(
            f"Synced peer-max for {synced_input} expert input_quantizer(s) "
            f"and computed {fixed_weight} weight_quantizer amax(es) on rank 0."
        )

if int(os.environ["LOCAL_RANK"]) == 0:
    mtq.print_quant_summary(transformer)
    if output_path:
        os.makedirs(output_path, exist_ok=True)
        mtq.print_quant_summary(transformer, output_path)
```
🧩 Analysis chain
🏁 Scripts executed (repository: NVIDIA/Model-Optimizer):

```shell
cat -n examples/deepseek/ptq.py | sed -n '460,490p'

head -50 examples/deepseek/ptq.py

grep -n "LOCAL_RANK\|RANK\|dist\." examples/deepseek/ptq.py | head -20

sed -n '315,330p' examples/deepseek/ptq.py

sed -n '395,475p' examples/deepseek/ptq.py
```
Use global rank for rank-zero logging, not LOCAL_RANK.
LOCAL_RANK is node-local; in multi-node distributed training, this condition executes on rank 0 of every node, causing concurrent writes to quant_summary.txt and other race conditions. Additionally, accessing os.environ["LOCAL_RANK"] directly raises KeyError if the variable is unset (e.g., outside torchrun).
Use dist.get_rank() with dist.is_initialized() to check global rank zero:
Suggested fix

```diff
+is_rank_zero = not dist.is_initialized() or dist.get_rank() == 0
+
 if not calib_all_experts:
     synced_input, fixed_weight = fixup_moe_expert_amax(transformer)
-    if int(os.environ["LOCAL_RANK"]) == 0:
+    if is_rank_zero:
         print(
             f"Synced peer-max for {synced_input} expert input_quantizer(s) "
             f"and computed {fixed_weight} weight_quantizer amax(es) on rank 0."
         )

-if int(os.environ["LOCAL_RANK"]) == 0:
+if is_rank_zero:
     if output_path:
         os.makedirs(output_path, exist_ok=True)
         mtq.print_quant_summary(transformer, output_path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/deepseek/ptq.py` around lines 469 - 480, The code uses
os.environ["LOCAL_RANK"] for rank-zero logging and file writes, which is
node-local and can be unset; replace those checks with a global-rank guard using
torch.distributed: use dist.is_initialized() ? dist.get_rank() == 0 : True to
decide the rank-zero path around the fixup_moe_expert_amax call and the
mtq.print_quant_summary/output_path directory creation; remove direct os.environ
access and ensure only global rank 0 performs os.makedirs(output_path,
exist_ok=True) and the mtq.print_quant_summary(transformer, output_path) call;
keep the logic around calib_all_experts, fixup_moe_expert_amax(transformer),
synced_input, fixed_weight, and the print message but gated by the global-rank
check.
Summary
- The DeepSeek PTQ script (`examples/deepseek/ptq.py`) now defaults to native top-k routing during MoE calibration. The previous all-tokens-to-all-experts path (`CalibMoe`) is preserved behind a new `--calib_all_experts` flag (see the sketch after this list).
- After `mtq.quantize`, `fixup_moe_expert_amax` syncs every expert's `input_quantizer.amax` (w1/w2/w3) to the per-layer global peer max via `dist.all_reduce(MAX)` across EP ranks.
- `weight_quantizer.amax` stays per-expert; any uncalibrated expert is filled by computing amax over the dequantized FP8 weight.
- `mtq.print_quant_summary` output is now also written to `<output_path>/.quant_summary.txt`, mirroring `llm_ptq/hf_ptq.py`.
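A hypothetical sketch of how the flag could surface in the CLI; argparse and the help text are assumptions, and the actual parser in examples/deepseek/ptq.py may differ.

```python
import argparse

# Hypothetical wiring for the options this PR talks about; not the real parser in ptq.py.
parser = argparse.ArgumentParser(description="DeepSeek PTQ example (sketch)")
parser.add_argument(
    "--calib_all_experts",
    action="store_true",
    help="Legacy behavior: route every token through every expert (CalibMoe) during calibration. "
         "Default (off): calibrate with native top-k routing, then run fixup_moe_expert_amax.",
)
parser.add_argument(
    "--output_path",
    type=str,
    default=None,
    help="Directory for the quantized checkpoint; also receives .quant_summary.txt on rank 0.",
)

args = parser.parse_args([])  # empty argv just for demonstration; the real script parses sys.argv
print(args.calib_all_experts, args.output_path)  # -> False None
```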
Why
Forcing all tokens through every expert doubled calibration time and inflated `input_quantizer.amax` for cold-routing experts with outliers they never see at inference. The new flow matches the inference distribution, runs roughly 2x faster, and mirrors the `layer_sync_moe_local_experts_amax` semantics that mtq runs automatically for `QuantSequentialMLP`-derived MoEs.

Validation (DeepSeek-V3.2-Exp, MP=8, NVFP4_DEFAULT_CFG)
Compared `_amax_baseline` (`CalibMoe`) vs `_amax_synced` (new default):
- `w1.input` and `w3.input` (shared MoE block input): identical.
- `w2.input` (post-SiLU gated, expert-specific): synced to layer-wide peer max — 99.3% are larger than baseline (median 11.4x) since peer-max captures the worst-case outlier from any expert in the layer; 0.7% are smaller. This is the same trade-off `set_expert_quantizer_amax` makes for HF MoEs in `unified_export_hf.py`.

Test plan
- New default run: the `_amax_synced/` dump is consistent with the comparison above.
- `--calib_all_experts`: produces an `_amax_baseline/` dump identical (other than rounding) to the prior `CalibMoe`-default behavior.
- `.quant_summary.txt` written under `output_path` on rank 0.

Summary by CodeRabbit
- New `--calib_all_experts` option to enable an alternate PTQ calibration mode; default remains top-k routing with a post-calibration per-layer peer-max synchronization and a compute fallback for uncalibrated experts.
- Quantization summary is now also written to a `.quant_summary.txt` summary file.