Conversation

Contributor

@leisuzz leisuzz commented Dec 18, 2025

What does this PR do?

The text encoder in Flux2 is very large, and offloading it to the CPU makes computing the prompt embeddings very slow.

  1. I added a --fsdp_text_encoder option that shards the text encoder with FSDP, so prompt embeddings can be computed efficiently across multiple GPUs (see the sketch below).
  2. Checkpoint saving did not support FSDP, so I added a code path for when Accelerate is launched with an FSDP config.
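
For context, here is a minimal sketch of what FSDP2-style sharding of a large text encoder can look like, using PyTorch's public fully_shard API (PyTorch 2.6+). The attribute names are illustrative placeholders, not the exact code added in this PR:

# Illustrative sketch only -- not the exact implementation in this PR.
# Assumes PyTorch >= 2.6 and that accelerate has already initialized the
# distributed process group.
import torch
from torch.distributed.fsdp import fully_shard

def shard_text_encoder(text_encoder: torch.nn.Module, blocks) -> torch.nn.Module:
    # `blocks` is the iterable of the encoder's transformer layers; the exact
    # attribute name depends on the text encoder class (placeholder here).
    for block in blocks:
        fully_shard(block)  # shard each layer so params are gathered one layer at a time
    fully_shard(text_encoder)  # shard the root to cover embeddings and remaining params
    return text_encoder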


Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@leisuzz leisuzz force-pushed the fsdp branch 3 times, most recently from 559a7a3 to 343b12a on December 18, 2025 at 12:31
Contributor Author

leisuzz commented Dec 18, 2025

@sayakpaul Please take a look at this PR. Thank you for your help :)

Member

@sayakpaul sayakpaul left a comment


Very cool work, thank you for this!

Just confirming -- this is FSDP2, right?

Also, could you provide an example command and your setup so that we can test?

Additionally, can we handle the denoiser the same way?

Contributor Author

leisuzz commented Dec 19, 2025

Very cool work, thank you for this!

Just confirming -- this is FSDP2, right?

Also, could you provide an example command and your setup so that we can test?

Additionally, can we handle the denoiser the same way?

It is FSDP2, and the script is:

accelerate launch --config_file ${config_file} \
  ./train_dreambooth_lora_flux2_img2img.py \
  --pretrained_model_name_or_path=$model_name  \
  --dataset_name=$dataset_name \
  --image_column="output" --cond_image_column="file_name" --caption_column="instruction" \
  --resolution=$resolution \
  --train_batch_size=$batch_size \
  --guidance_scale=1 \
  --mixed_precision=$mixed_precision \
  --max_grad_norm=1 \
  --dataloader_num_workers=0 \
  --gradient_accumulation_steps=$gradient_accumulation_steps \
  --learning_rate=1e-05 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --gradient_checkpointing \
  --max_train_steps=$max_train_steps \
  --checkpointing_steps=5000 \
  --enable_npu_flash_attention \
  --rank=16 \
  --seed="0" \
  --skip_final_inference \
  --cache_latents \
  --offload \
  --fsdp_text_encoder \
  --output_dir=${output_path}

The accelerate config is:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_version: 2
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Flux2TransformerBlock,Flux2SingleTransformerBlock
  fsdp_forward_prefetch: true
  fsdp_sync_module_states: false
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_use_orig_params: false
  fsdp_activation_checkpointing: true
  fsdp_reshard_after_forward: true
  fsdp_cpu_ram_efficient_loading: false
main_training_function: main
machine_rank: 0
main_process_ip: localhost
main_process_port: 6878
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
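
With this config (fsdp_version: 2, FULL_SHARD across 8 processes), the training script also needs to know it is running under FSDP so it can take the FSDP-aware checkpointing path mentioned in the PR description. Below is a minimal sketch of such a check using Accelerate's public API; it is an assumption about one possible approach, not necessarily the exact logic in this PR:

# Sketch: detect an FSDP launch via accelerate and branch the save path.
from accelerate import Accelerator
from accelerate.utils import DistributedType

accelerator = Accelerator()
is_fsdp = accelerator.distributed_type == DistributedType.FSDP

if is_fsdp:
    # Gather the full (unsharded) state dict before extracting LoRA weights.
    ...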

Member

@sayakpaul sayakpaul left a comment


Changes look neat to me!

Let's also update the README about this.

Comment on lines 13 to 16
import torch.distributed as dist
from torch.distributed.fsdp import CPUOffload, ShardingStrategy
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
Member


We should guard this like so:

if getattr(torch, "distributed", None) is not None:
    import torch.distributed as dist

Same for FSDP.
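
For completeness, one possible shape of the same guard extended to the FSDP imports from the snippet above (a sketch, not necessarily the exact change that ended up in the PR):

# Sketch: guard torch.distributed and the FSDP imports so the script still
# imports on torch builds without distributed support.
import torch

if getattr(torch, "distributed", None) is not None and torch.distributed.is_available():
    import torch.distributed as dist
    from torch.distributed.fsdp import CPUOffload, ShardingStrategy
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
else:
    dist = None  # downstream code should check `dist is not None and dist.is_initialized()`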

Contributor Author


I've modified it; please take a look.

Member


Doesn't seem like the commits were pushed?

Contributor Author


Please check it out

Comment on lines 469 to 470
if dist.is_initialized():
    dist.barrier()
Member


Why is this needed?

Contributor Author


I've modified it; please take a look.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul
Member

@bot /style

Contributor

github-actions bot commented Dec 22, 2025

Style bot fixed some files and pushed the changes.

Member

@sayakpaul sayakpaul left a comment


Just a few more comments, mostly minor. As mentioned earlier, let's make a note of this in README_flux2.md.


import numpy as np
import torch
import torch.distributed as dist
Member


This should be guarded as well.

if accelerator.is_main_process:
    transformer_lora_layers_to_save = None
    modules_to_save = {}
    transformer_lora_layers_to_save = None
Member


Let's simplify this block of code a bit:

transformer_cls = type(unwrap_model(transformer))

def _to_cpu_contiguous(sd):
    return {
        k: (v.detach().cpu().contiguous() if isinstance(v, torch.Tensor) else v)
        for k, v in sd.items()
    }

# 1) Validate and pick the transformer model
modules_to_save: dict[str, Any] = {}
transformer_model = None

for m in models:
    if isinstance(unwrap_model(m), transformer_cls):
        transformer_model = m
        modules_to_save["transformer"] = m
    else:
        raise ValueError(f"unexpected save model: {m.__class__}")

if transformer_model is None:
    raise ValueError("No transformer model found in `models`.")

# 2) Optionally gather FSDP state dict once
state_dict = accelerator.get_state_dict(transformer_model) if is_fsdp else None

# 3) Only main process materializes the LoRA state dict
transformer_lora_layers_to_save = None
if accelerator.is_main_process:
    peft_kwargs = {}
    if is_fsdp:
        peft_kwargs["state_dict"] = state_dict

    transformer_lora_layers_to_save = get_peft_model_state_dict(
        unwrap_model(transformer_model) if is_fsdp else transformer_model,
        **peft_kwargs,
    )

    if is_fsdp:
        transformer_lora_layers_to_save = _to_cpu_contiguous(transformer_lora_layers_to_save)

        # make sure to pop weight so that corresponding model is not saved again
        if weights:
            weights.pop()

We can move _to_cpu_contiguous() to the training_utils.py module.
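
For reference, a sketch of how the helper might look once hoisted into training_utils.py (hypothetical placement following the suggestion above; this is not an existing diffusers API):

# Hypothetical: _to_cpu_contiguous() moved out of the save hook, e.g. into
# diffusers' training_utils.py as suggested above.
import torch

def _to_cpu_contiguous(sd: dict) -> dict:
    # Detach tensors, move them to CPU, and make them contiguous so a gathered
    # FSDP state dict can be serialized safely from the main process.
    return {
        k: (v.detach().cpu().contiguous() if isinstance(v, torch.Tensor) else v)
        for k, v in sd.items()
    }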

if accelerator.is_main_process:
    transformer_lora_layers_to_save = None
    modules_to_save = {}
    transformer_lora_layers_to_save = None
Member


Same as above.
