Experiment: MXFP4 -> NVFP4 conversion MSE study (scratch)#1364
Research artifact comparing three algorithms for converting an MXFP4
tensor (block 32, E2M1 + E8M0) to NVFP4 (block 16, E2M1 + E4M3 + FP32
global scale):
Algo 1: dequantize MXFP4 -> bf16 -> standard NVFP4 quantize.
Algo 2: keep E2M1 nibbles verbatim; pick global S = 2^m and store
per-block E4M3 scales as 2^(k_j - m), snapping out-of-range
blocks. Two m strategies: midpoint and 1D integer search over
the closed-form snap-error objective.
Algo 3: hybrid - verbatim path for in-range blocks (zero error) plus
        NVFP4 requantization with fixed scale_2 = 2^m for out-of-range
        blocks. m chosen by direct-MSE 1D sweep.
Includes 27 scenarios (gaussian, heavy-tail, outlier patterns, spread
boundary tests, layer-shaped LLM weights) and a report summarizing
results, the snap-up/snap-down asymmetry that drives the m choice, and
the one pathological case (single dominant outlier) where Algo 3 still
trails Algo 1 by 0.21% due to integer-m vs continuous scale_2.
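Algo 2's scale bookkeeping and its closed-form snap-error objective can be sketched in a few lines. This is a minimal NumPy illustration, not the scratch script's code; the per-block weights S_j are assumed uniform here:

```python
import numpy as np

def snap_error(k, m, S=None):
    """Closed-form snap-error objective for global scale 2**m (sketch).

    k : per-block E8M0 exponents k_j. Each block's stored E4M3 scale
    exponent is clamp(k_j - m, -9, 8); blocks pushed outside that
    window are snapped, costing S_j * (2**k_j - 2**(m + e_j))**2.
    """
    k = np.asarray(k, dtype=float)
    S = np.ones_like(k) if S is None else np.asarray(S, dtype=float)
    e = np.clip(k - m, -9, 8)                 # stored E4M3 scale exponent
    return float(np.sum(S * (2.0 ** k - 2.0 ** (m + e)) ** 2))

# The "1D integer search over m" strategy is then a plain argmin:
k = [0, 5, 20, -12]
best_m = min(range(min(k) - 9, max(k) + 10), key=lambda m: snap_error(k, m))
# here the search lands on m = 12, i.e. k_max - 8: any lower m snaps the
# k = 20 block (huge error); any higher m only hurts the low side
```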
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #1364   +/-   ##
=======================================
  Coverage   76.93%   76.93%
=======================================
  Files         471      471
  Lines       50404    50404
=======================================
  Hits        38776    38776
  Misses      11628    11628
=======================================
The m-search loop in the original Algo 3 turns out to be unnecessary.
Across all 27 test scenarios the search converges on m = k_max - 8 and
that closed-form rule is provably the right pick:
- For spread <= 17, every block's k_j - m lands in [8 - spread, 8],
a subset of E4M3's exact-power-of-2 window [-9, 8]. All blocks take
the verbatim path; the conversion is lossless (MSE = 0).
- For spread > 17, m = k_max - 8 is the only choice that keeps the
  highest-magnitude blocks' scales finite: a lower m drives the
  per-block scale amax/(6*2^m) above E4M3's max finite value (448),
  which encodes as NaN; a higher m only shrinks in-range coverage on
  the low side without helping the high side.
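Both bullets can be checked numerically. A minimal sketch (function names are illustrative, not taken from the scratch script):

```python
def closed_form_m(k):
    """Closed-form scale pick m = k_max - 8 (sketch).

    With this m, every k_j - m lies in [8 - spread, 8]; for
    spread = k_max - k_min <= 17 that sits inside E4M3's exact
    power-of-two window [-9, 8], so every block takes the verbatim
    (lossless) path.
    """
    m = max(k) - 8
    lossless = (max(k) - min(k)) <= 17
    return m, lossless

def e4m3_scale_overflows(block_amax, m):
    """True when the per-block scale amax / (6 * 2**m) exceeds E4M3's
    max finite value (448) and would therefore encode as NaN."""
    return block_amax / (6.0 * 2.0 ** m) > 448.0

assert closed_form_m([0, 5, 10]) == (2, True)    # spread 10 <= 17: lossless
assert closed_form_m([-20, 0]) == (-8, False)    # spread 20 > 17

# Highest-magnitude block has amax = 6 * 2**k_max. At m = k_max - 8 the
# scale is 2**8 = 256 <= 448; one step lower (m = k_max - 9) gives 512.
k_max = 10
assert not e4m3_scale_overflows(6 * 2.0 ** k_max, k_max - 8)
assert e4m3_scale_overflows(6 * 2.0 ** k_max, k_max - 9)
```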
Replaces the brute-force algo3_hybrid_requant with a single-pass
algo3_hybrid using the closed-form m. The Algo 4 / Algo 5 variants
that were used to discover this rule are removed; the script is back
to three algorithms (Algo 1 / Algo 2 / Algo 3) and the report has been
rewritten accordingly.
Same MSE numbers as before. No library changes — strictly under
scratch/.
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Summary
Research artifact (in scratch/) comparing three algorithms for converting an MXFP4 tensor (block 32, E2M1 + E8M0) to NVFP4 (block 16, E2M1 + E4M3 + FP32 global scale). Not for merge; opening as a draft for visibility and discussion of the algorithm comparison and the m-search idea.

- scratch/mxfp4_to_nvfp4_mse.py: runnable comparison across 27 tensor scenarios.
- scratch/mxfp4_to_nvfp4_report.md: write-up of the algorithms, results, and conclusions.

Algorithms
- Algo 1: dequantize MXFP4 -> bf16 -> standard NVFP4 quantize.
- Algo 2: keep E2M1 nibbles verbatim; pick global S = 2^m and store per-block E4M3 scales as 2^(k_j - m), snapping out-of-range blocks. Two m strategies: midpoint and a 1D integer search over the closed-form snap-error objective Σ_j S_j · (2^k_j − 2^(m + clamp(k_j − m, −9, 8)))².
- Algo 3: hybrid: verbatim path for in-range blocks (zero error) plus NVFP4 requantization with fixed scale_2 = 2^m for out-of-range blocks; m chosen by direct-MSE 1D sweep.

Key findings (27 scenarios, RTX 6000 Ada)
- Algo 3 trails Algo 1 on the single-extreme-outlier scenario only; that gap is closeable by allowing a continuous (non-power-of-2) global scale.

Test plan
- python scratch/mxfp4_to_nvfp4_mse.py runs end-to-end on a CUDA host (~5 s on a single GPU); the output reproduces the table in the report.
- No library changes; everything stays under scratch/.

Notes
- If it proves useful, this could later graduate into modelopt/torch/quantization/qtensor/ as an MXFP4 → NVFP4 converter API.