
Experiment: MXFP4 -> NVFP4 conversion MSE study (scratch)#1364

Draft
cjluo-nv wants to merge 2 commits into main from chenjiel/demo_mxfp4_nvfp4

Conversation

@cjluo-nv
Collaborator

Summary

Research artifact (in scratch/) comparing three algorithms for converting an MXFP4 tensor (block 32, E2M1 + E8M0) to NVFP4 (block 16, E2M1 + E4M3 + FP32 global scale). Not for merge — opening as a draft for visibility and discussion of the algorithm comparison and m-search idea.

  • scratch/mxfp4_to_nvfp4_mse.py — runnable comparison across 27 tensor scenarios.
  • scratch/mxfp4_to_nvfp4_report.md — write-up of the algorithms, results, and conclusions.

Algorithms

  • Algo 1 — dequant MXFP4 → bf16 → standard NVFP4 quantize (baseline).
  • Algo 2 — keep E2M1 nibbles verbatim; pick global S = 2^m and store per-block E4M3 scales as 2^(k_j − m), snapping out-of-range (OOR) blocks. Two m strategies: midpoint and a 1D integer search over the closed-form snap-error objective Σ_j S_j · (2^k_j − 2^(m + clamp(k_j − m, −9, 8)))².
  • Algo 3 — hybrid: verbatim where in-range (zero error); for OOR blocks, dequant + NVFP4-requant each 16-element half with fixed scale_2 = 2^m. m chosen by direct-MSE 1D sweep.
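The snap-error objective and its 1D integer m-search admit a compact sketch. The following is a hypothetical Python illustration, not code from scratch/mxfp4_to_nvfp4_mse.py; the function names and the per-block weight convention for S_j are assumptions:

```python
def snap_error(k, S, m):
    """Closed-form snap error: sum_j S_j * (2^k_j - 2^(m + clamp(k_j - m, -9, 8)))^2.

    k: per-block MXFP4 E8M0 exponents k_j (ints)
    S: per-block weights S_j (e.g. sum of squared E2M1 codes; convention assumed)
    m: candidate global-scale exponent, i.e. S = 2^m
    """
    total = 0.0
    for kj, sj in zip(k, S):
        # E4M3 can represent powers of two exactly only for exponents in [-9, 8];
        # anything outside snaps to the nearest window edge.
        snapped = m + max(-9, min(8, kj - m))
        total += sj * (2.0 ** kj - 2.0 ** snapped) ** 2
    return total

def best_m(k, S):
    """Exhaustive 1D integer search over the feasible range of m."""
    lo, hi = min(k) - 8, max(k) + 9
    return min(range(lo, hi + 1), key=lambda m: snap_error(k, S, m))
```

When the exponent spread fits inside the 18-wide window, some m zeroes every term and the search returns a lossless choice; otherwise at least one block must snap and the search trades snap-up against snap-down error.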

Key findings (27 scenarios, RTX 6000 Ada)

  • Algo 3 outright wins: 4 / 27. Tied (≥ 2 algos exact at MSE = 0): 22 / 27.
  • Algo 3 ≤ Algo 2 in every scenario (strict improvement on spread-too-large cases).
  • Algo 3 trails Algo 1 by 0.21% only on the single-extreme-outlier case; the gap is closeable by allowing a continuous (non-power-of-2) global scale.
  • Largest improvements over Algo 2: spread-50 case +16 dB SNR; mixed-block-scales case +12.7 dB SNR.
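For reference, the SNR figures above follow the standard definition SNR_dB = 10 * log10(signal_power / MSE), so a +16 dB gain at fixed signal power corresponds to roughly a 40x reduction in MSE. A small illustrative snippet (not part of the script):

```python
import math

def snr_db(signal_power, mse):
    """SNR in decibels given mean signal power and MSE (standard definition)."""
    return 10.0 * math.log10(signal_power / mse)

# A +16 dB SNR improvement at fixed signal power implies the MSE ratio:
mse_ratio = 10 ** (16 / 10)  # ~39.8x lower MSE
```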

Test plan

  • python scratch/mxfp4_to_nvfp4_mse.py runs end-to-end on a CUDA host (~5 s on a single GPU). Output reproduces the table in the report.
  • Pre-commit hooks (ruff, mypy, markdownlint, license headers) pass.
  • No changes to library code — strictly additive under scratch/.

Notes

  • This is exploratory; if the m-search idea proves valuable for production conversion paths, the natural follow-up is to (a) drop the integer-m constraint and (b) productionize it inside modelopt/torch/quantization/qtensor/ as an MXFP4 → NVFP4 converter API.

Research artifact comparing three algorithms for converting an MXFP4
tensor (block 32, E2M1 + E8M0) to NVFP4 (block 16, E2M1 + E4M3 + FP32
global scale):

  Algo 1: dequantize MXFP4 -> bf16 -> standard NVFP4 quantize.
  Algo 2: keep E2M1 nibbles verbatim; pick global S = 2^m and store
          per-block E4M3 scales as 2^(k_j - m), snapping out-of-range
          blocks. Two m strategies: midpoint and 1D integer search over
          the closed-form snap-error objective.
  Algo 3: hybrid - verbatim path for in-range blocks (zero error) plus
          NVFP4 requantization with fixed scale_2 = 2^m for OOR blocks.
          m chosen by direct-MSE 1D sweep.

Includes 27 scenarios (gaussian, heavy-tail, outlier patterns, spread
boundary tests, layer-shaped LLM weights) and a report summarizing
results, the snap-up/snap-down asymmetry that drives the m choice, and
the one pathological case (single dominant outlier) where Algo 3 still
trails Algo 1 by 0.21% due to integer-m vs continuous scale_2.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f85baace-b24c-4c09-b27e-6aa39ae7a0dd


@codecov

codecov Bot commented Apr 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.93%. Comparing base (8eec6d4) to head (ab62891).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1364   +/-   ##
=======================================
  Coverage   76.93%   76.93%           
=======================================
  Files         471      471           
  Lines       50404    50404           
=======================================
  Hits        38776    38776           
  Misses      11628    11628           
| Flag | Coverage | Δ |
| ---- | -------- | - |
| unit | 52.73% <ø> | (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


The m-search loop in the original Algo 3 turns out to be unnecessary.
Across all 27 test scenarios the search converges on m = k_max - 8 and
that closed-form rule is provably the right pick:

  - For spread <= 17, every block's k_j - m lands in [8 - spread, 8],
    a subset of E4M3's exact-power-of-2 window [-9, 8]. All blocks take
    the verbatim path; the conversion is lossless (MSE = 0).
  - For spread > 17, m = k_max - 8 is the only choice that does not
    NaN the highest-magnitude blocks: a lower m drives the per-block
    scale amax/(6*2^m) above E4M3's max (448); a higher m only shrinks
    in-range coverage on the low side without helping the high side.
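The two bullets above can be sanity-checked in a few lines. This is a hypothetical sketch (the names closed_form_m and all_verbatim are illustrative, not the script's), using m = k_max - 8 and spread = k_max - k_min:

```python
def closed_form_m(k):
    """Closed-form global-scale exponent from the commit message: m = k_max - 8."""
    return max(k) - 8

def all_verbatim(k):
    """True iff every block's k_j - m lies in E4M3's exact power-of-2 window [-9, 8],
    i.e. the whole tensor takes the lossless verbatim path."""
    m = closed_form_m(k)
    return all(-9 <= kj - m <= 8 for kj in k)
```

With spread = 17 the smallest k_j gives k_j - m = 8 - 17 = -9, still inside the window; at spread = 18 it falls to -10 and that block must take the requant path.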

Replaces the brute-force algo3_hybrid_requant with a single-pass
algo3_hybrid using the closed-form m. The Algo 4 / Algo 5 variants
that were used to discover this rule are removed; the script is back
to three algorithms (Algo 1 / Algo 2 / Algo 3) and the report has been
rewritten accordingly.

Same MSE numbers as before. No library changes — strictly under
scratch/.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
