
Modify mfsdp default data-parallel-sharding-strategy #3691

Merged
cspades merged 4 commits into NVIDIA:main from wplf:mfsdp-fix
Apr 1, 2026

Conversation

@wplf
Member

@wplf wplf commented Mar 4, 2026

PR for dev branch #3692

What does this PR do ?

The default value "no_shard" is not compatible with many APIs. We can change the default to "optim_grads_params".
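As a sketch (the class and field names here are simplified assumptions, not the exact Megatron-FSDP config), the change amounts to flipping the default of the sharding-strategy field:

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-in for the Megatron-FSDP DDP config;
# the real config lives in Megatron-LM and has many more fields.
@dataclass
class DDPConfig:
    # Previously defaulted to "no_shard", which leaves
    # group.model_weight_buffer unset and breaks several code paths.
    data_parallel_sharding_strategy: str = "optim_grads_params"


assert DDPConfig().data_parallel_sharding_strategy == "optim_grads_params"
```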

⚠️ For major changes (either in lines of code or in impact), please make sure to first share a design doc with the team. If you're unsure of the best way to do so, contact the @mcore-oncall.

Contribution process

```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see the Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@copy-pr-bot

copy-pr-bot Bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 4, 2026 08:52
@Phlip79
Member

Phlip79 commented Mar 4, 2026

Can you explain why you are making this change?

@Phlip79 Phlip79 marked this pull request as draft March 4, 2026 23:45
@wplf
Member Author

wplf commented Mar 5, 2026

Can you explain why you are making this change?

Hi, I've found that when using `no_shard`, `group.model_weight_buffer` will not be created here:

```python
if data_parallel_sharding_strategy != "no_shard":
    group.model_weight_buffer = DataParallelBuffer(
        self.ddp_config,
        group.params,
        is_data_distributed=is_model_weight_buffer_distributed
        and model_wbuf_dp_group.size() > 1,
        dtype=param_dtype,
        device=self.device,
        # Note: This will be DP-Outer + DP-Shard when sharding
        # the optimizer state in HFSDP, else just DP-Shard when
        # using basic HSDP or FSDP.
        data_parallel_group=model_wbuf_dp_group,
        is_transpose_buffer=False,
        temporary_bucket_allocator=self.weight_alloc,
        bucket_id=group_id,
        chunk_size_factor=group.chunk_size_factor,
        mem_alloc_context=self.mem_alloc_context,
        **main_buf_extra_kwargs,
    )
```

but some code paths still require `group.model_weight_buffer` (`mbuf`), for example:

  1. swilgu's checkpoint-saving path uses `fsdp_slice`:

     ```python
     fsdp_slice = dist_param.megatron_fsdp_slice
     ```

     which is only set here:

     ```python
     mbuf = pg.model_weight_buffer
     if mbuf:
         _start, _end = mbuf._get_item_slice_in_shard(item_id)
         setattr(dist_param, "megatron_fsdp_slice", slice(_start, _end))
     ```

     But `mbuf` is `None`, so the attribute is never set; that is the bug.

  2. When using `no_shard` without `overlap_param_gather`, execution falls into the `else` branch here:

     ```python
     if not force_sync and self.ddp_config.overlap_param_gather:
         # All-gather the first bucket before the forward pass.
         if self.ddp_config.fsdp_all_gather_in_start_param_sync:
             first_param = list(self.module.parameters())[0]
             self.all_gather_and_wait_parameters_ready(
                 params=[first_param], prefetch=True, wait_bucket_ready=False
             )
     else:
         self.synchronize_param_gather()
         for bucket_id in range(self.all_gather_pipeline.num_buckets):
             self.all_gather_pipeline.async_bucket_gather(bucket_id=bucket_id, bwd=False)
             group = self.param_and_grad_buffer.parameter_groups[bucket_id]
             if group.model_weight_buffer is None:
                 continue
             if group.model_weight_buffer.is_data_distributed:
                 # If model weight is sharded, we wait for the all-gather to complete and
                 # then release the bucket immediately to save memory usage.
                 self.all_gather_pipeline.wait_bucket_ready(bucket_id, False)
         for bucket_id in range(self.all_gather_pipeline.num_buckets):
             self.all_gather_pipeline.wait_bucket_ready(bucket_id, False)
     ```

     and the buffer being `None` then raises here:

     ```python
     wbuf = self.get_fsdp_buffer(bucket_id, bwd)
     # Lazy release the unused buckets.
     self.recycle_unused_buckets()
     # Allocate an empty bucket to store the module weights.
     bucket = wbuf.fetch_bucket(set_param_data=True)
     ```
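A minimal, self-contained sketch of the failure mode (the classes are stand-ins, not the real Megatron-FSDP types): the buffer only exists for sharded strategies, so any consumer that dereferences it without a `None` check crashes under `no_shard`:

```python
# Hypothetical stand-ins illustrating why "no_shard" leaves
# model_weight_buffer as None and breaks downstream consumers.
class FakeBuffer:
    def fetch_bucket(self, set_param_data=True):
        return "bucket"


class ParameterGroup:
    def __init__(self, strategy):
        # Mirrors the guard in the quoted code: the buffer is only
        # created when the strategy actually shards the weights.
        self.model_weight_buffer = FakeBuffer() if strategy != "no_shard" else None


def fetch_bucket(group):
    wbuf = group.model_weight_buffer
    # No None check, like the all-gather pipeline path quoted above.
    return wbuf.fetch_bucket(set_param_data=True)


print(fetch_bucket(ParameterGroup("optim_grads_params")))  # works
try:
    fetch_bucket(ParameterGroup("no_shard"))
except AttributeError as e:
    print("no_shard fails:", e)
```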

cc @shjwudp @cspades

@wplf
Member Author

wplf commented Mar 5, 2026

For bug 2, I think that when the Megatron-FSDP strategy is "no_shard" we can skip `param_gather` entirely: the optimizer is not sharded and all parameters on the local rank are already valid, so we don't need to gather from other DDP ranks.

The same logic already exists here:

```python
if self.data_parallel_sharding_strategy == "no_shard":
    return
```

Does this work for the condition?
cc @Phlip79 @cspades @shjwudp
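Sketched as code (the class and counter are hypothetical, assuming the guard lands at the top of the param-sync entry point), the proposed skip would look like:

```python
class ParamSyncSketch:
    """Hypothetical sketch of the proposed early return; the real method
    lives in Megatron-FSDP's DDP wrapper and does far more."""

    def __init__(self, strategy):
        self.data_parallel_sharding_strategy = strategy
        self.gathers_started = 0  # stand-in for the all-gather pipeline

    def start_param_sync(self, force_sync=False):
        # With "no_shard", every rank already holds full, valid parameters,
        # so there is nothing to all-gather from other DP ranks.
        if self.data_parallel_sharding_strategy == "no_shard":
            return
        self.gathers_started += 1


sync = ParamSyncSketch("no_shard")
sync.start_param_sync()
assert sync.gathers_started == 0  # no_shard: gather skipped
```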

@wplf wplf marked this pull request as ready for review March 9, 2026 16:47
@cspades
Member

cspades commented Mar 9, 2026

Part of this PR has been split off into #3754. Thanks for finding this bug!

I think our `no_shard` support with Torch DCP has some gaps that we have never really tested. Megatron-LM / Megatron-FSDP always uses Torch DCP, but with `no_shard` we just use the original model weights:

```python
else:
    # If neither the wbuf nor the mbuf are utilized in the case of "no_shard",
    # we fall-back to using the original parameter data for optimization,
    # and register the new parameter as a model training weight.
    dist_param = make_fsdp_dtensor(
        local_tensor=orig_param.data,
        param=orig_param,
        dist_index=self.dist_index,
        is_sharded_param=False,
        is_expert_param=pg.is_expert_param,
        run_check=True,
        update_uneven_dtensor_chunk_meta=False,
        force_sync_tp_duplicated_param=True,
    )
    dist_main_weight[param_name] = dist_param
```

and I wonder if there is a hard-coded equivalent of this:

```python
if mbuf:
    _start, _end = mbuf._get_item_slice_in_shard(item_id)
    setattr(dist_param, "megatron_fsdp_slice", slice(_start, _end))
```

like (and I'm just guessing, need to read the code):

```python
_start = 0
_end = ...  # dist_param.shape OR dist_param.numel()
```
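As a runnable sketch of that guess (the helper name and the tensor stub are hypothetical, not from the codebase), the fallback would treat the whole flattened parameter as the local "shard":

```python
class FakeParam:
    """Hypothetical stand-in for a parameter tensor; only numel() is needed."""

    def __init__(self, n):
        self._n = n

    def numel(self):
        return self._n


def megatron_fsdp_slice_for(dist_param, mbuf, item_id):
    # When a model_weight_buffer exists, use its shard slice; under
    # "no_shard" (mbuf is None) the local data is the full parameter.
    if mbuf is not None:
        start, end = mbuf._get_item_slice_in_shard(item_id)
    else:
        start, end = 0, dist_param.numel()
    return slice(start, end)


print(megatron_fsdp_slice_for(FakeParam(12), None, 0))  # slice(0, 12, None)
```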

Will take a closer look at this later!

@wplf wplf self-assigned this Mar 10, 2026
@wplf
Member Author

wplf commented Mar 10, 2026

/ok to test b4df532

@cspades
Member

cspades commented Apr 1, 2026

/ok to test e7f0398

@cspades cspades enabled auto-merge April 1, 2026 15:07
@cspades cspades added this pull request to the merge queue Apr 1, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23860137653

Merged via the queue into NVIDIA:main with commit 3499efe Apr 1, 2026
64 checks passed