Bring in improvements from `modded-nanogpt` repo by leloykun · Pull Request #14 · KellerJordan/Muon

leloykun · 2025-02-24T03:32:54Z

Adds code for:

Optimizing Newton-Schulz coefficients
Tighter estimate of spectral norm using Gram iteration taken from https://arxiv.org/pdf/2305.16173
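A minimal sketch of why the Gram-based estimate is tighter than the plain Frobenius norm. It relies only on the standard facts that the Frobenius norm of X is the root-sum-square of its singular values and that the singular values of X @ X.mT are the squares of those of X. The helper name `norm_bounds` and the sample singular values are illustrative and do not come from the PR.

```python
import math

def norm_bounds(singular_values):
    """Return (spectral, gram_estimate, frobenius) for a matrix with these singular values."""
    # Spectral norm: the largest singular value.
    spectral = max(singular_values)
    # Frobenius norm of X: sqrt of the sum of squared singular values.
    frobenius = math.sqrt(sum(s**2 for s in singular_values))
    # Frobenius norm of the Gram matrix X @ X.mT, whose singular values are s**2.
    gram_frobenius = math.sqrt(sum(s**4 for s in singular_values))
    # One Gram iteration: sqrt(||X X^T||_F) upper-bounds the spectral norm of X.
    gram_estimate = math.sqrt(gram_frobenius)
    return spectral, gram_estimate, frobenius

spectral, gram_estimate, frobenius = norm_bounds([3.0, 2.0, 1.0, 0.5])
print(spectral, gram_estimate, frobenius)
```

For any input the ordering spectral <= gram_estimate <= frobenius holds, so dividing by the Gram estimate shrinks X less than dividing by the Frobenius norm while still keeping the spectral norm at most 1.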

Usage note:

def zeropower_via_newtonschulz5(
    G: Tensor, steps: int, enable_better_spec_norm_est: bool = False
) -> Tensor:
    assert G.ndim >= 2 # batched Muon implementation by @scottjmaddox, and put into practice in the record by @YouJiacheng
-     a, b, c = (3.4445, -4.7750,  2.0315)
    X = G.bfloat16()
    if G.size(-2) > G.size(-1):
        X = X.mT

    # Ensure spectral norm is at most 1
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    # Perform the NS iterations
-     for i in range(steps):
+     for i, (a, b, c) in enumerate([
+         ...[insert the coefficients here]...
+     ]):
        A = X @ X.mT
        if i == 0 and enable_better_spec_norm_est:
            # Tighter estimate of spectral norm using 1st Gram iteration.
            # Taken from https://arxiv.org/pdf/2305.16173
            S_norm_est_over_f_norm__squared = A.norm(dim=(-2, -1), keepdim=True)
            X = X / (S_norm_est_over_f_norm__squared**0.5 + 1e-7)
            A = A / (S_norm_est_over_f_norm__squared + 1e-7)
        B = b * A + c * A @ A # quintic computation strategy adapted from suggestion by @jxbz, @leloykun, and @YouJiacheng
        X = a * X + B @ X
    
    if G.size(-2) > G.size(-1):
        X = X.mT
    return X
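As a reading aid, the loop body can be written out in terms of the singular value decomposition. This is a standard derivation rather than something stated in the PR; the symbols U, Σ, V are introduced here for illustration, and the 1e-7 terms and bfloat16 rounding are ignored.

```latex
\begin{aligned}
X &= U \Sigma V^{\top}, \qquad A = X X^{\top} = U \Sigma^{2} U^{\top}, \\
B &= bA + cA^{2} = U \left( b\Sigma^{2} + c\Sigma^{4} \right) U^{\top}, \\
aX + BX &= U \left( a\Sigma + b\Sigma^{3} + c\Sigma^{5} \right) V^{\top}.
\end{aligned}
```

Each step therefore leaves U and V unchanged and applies the odd quintic p(σ) = aσ + bσ³ + cσ⁵ to every singular value. The Gram-based rescaling at i == 0 divides X by sqrt(‖A‖_F) and A by ‖A‖_F, which keeps A equal to X @ X.mT after the rescale (up to the epsilon terms).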


KellerJordan · 2025-02-26T01:47:21Z

Have we confirmed that this option never causes any instability?

It's potentially risky, so it's important to confirm. I will wait to accept it until I have evidence.

toothacher17 · 2025-02-27T04:52:39Z

Hey @leloykun, I tried your JAX scripts and got a new set of hyper coeffs:

(4.0246, -6.4224, 2.6026)
(3.9872, -6.2793, 2.5377)
(3.3260, -4.8258, 1.9451)
(2.8778, -3.6189, 1.6208)
(3.0133, -3.6424, 1.6122)

Do you have any recommendations for which one to use, or should I just pick a random one?

leloykun · 2025-02-27T07:08:17Z

Hi @toothacher17,

In zeropower_via_newtonschulz5, you should replace

for i in range(steps):

with

for i, (a, b, c) in enumerate([
    ...[insert the coefficients here]...
]):
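The `enumerate` loop suggests the tuples are consumed in order, one per Newton-Schulz step, rather than one being picked. Here is a scalar sketch of that reading. It assumes that each step maps a singular value σ to aσ + bσ³ + cσ⁵, which follows from the update `X = a * X + B @ X`. The coefficients are the five tuples quoted by toothacher17 above; the names `SCHEDULE`, `quintic_step` and `apply_schedule` are hypothetical and not part of the PR.

```python
# Coefficient tuples quoted from toothacher17's comment in this thread.
# Using them one per iteration, in this order, is an assumption based on
# the enumerate(...) loop shown above.
SCHEDULE = [
    (4.0246, -6.4224, 2.6026),
    (3.9872, -6.2793, 2.5377),
    (3.3260, -4.8258, 1.9451),
    (2.8778, -3.6189, 1.6208),
    (3.0133, -3.6424, 1.6122),
]

def quintic_step(sigma, a, b, c):
    """Effect of one Newton-Schulz step on a single singular value."""
    return a * sigma + b * sigma**3 + c * sigma**5

def apply_schedule(sigma, schedule):
    """Apply each (a, b, c) tuple once, in order."""
    for a, b, c in schedule:
        sigma = quintic_step(sigma, a, b, c)
    return sigma

for sigma in (0.01, 0.1, 0.5, 1.0):
    print(sigma, apply_schedule(sigma, SCHEDULE))
```

For these four sample inputs the result lands near 1, which is the behaviour the iteration aims for. The sketch says nothing about bfloat16 rounding or training stability, which is the open question raised in this thread.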

KellerJordan · 2025-03-25T01:45:27Z

I'm still a little afraid of this causing instability. Will test more and think about it.

leloykun · 2025-03-25T08:40:24Z

Same... I'll move this back to drafts until further analysis.

add code for optimizing NS coeffs & code for tighter estimate of spectral norm

646898b

leloykun mentioned this pull request Feb 24, 2025

Add code for optimizing NS coefficiefs as done in the 02/14/25 record for GPT2-medium track KellerJordan/modded-nanogpt#86

Closed

improve usage notes

9428222

leloykun marked this pull request as draft March 25, 2025 08:40

leloykun wants to merge 2 commits into KellerJordan:master from leloykun:fc--optimize-coeffs
