grad_norm becomes NaN when fine-tuning 9B models

First, thanks for your great work. I tried to fine-tune the Yi-Coder-9B-Chat model on my own dataset and ran into the following problem.

## Problems

`grad_norm` becomes NaN when I try to fine-tune the Yi-Coder-9B-Chat model.

## Detailed Description

At the first logged step, `grad_norm` is already NaN, and at every later step the loss is 0.0, which I assume is caused by the NaN `grad_norm`.

```
{'loss': 9.6782, 'grad_norm': nan, 'learning_rate': 0.0008461538461538462, 'epoch': 0.15}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0006923076923076923, 'epoch': 0.31}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0005384615384615384, 'epoch': 0.46}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.00038461538461538467, 'epoch': 0.62}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0002307692307692308, 'epoch': 0.77}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 7.692307692307693e-05, 'epoch': 0.92}
```

However, when I run the same code with CodeLlama-13b-Instruct-hf as the model, everything works as expected.
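
To pin down the first logged step at which `grad_norm` stops being finite, the logged dictionaries can be scanned with a few lines of plain Python. This is only a sketch: the helper name `first_non_finite_step` is hypothetical, and the sample entries are copied from the first two log lines above.

```python
import math


def first_non_finite_step(log_entries, key="grad_norm"):
    """Return the index of the first entry whose `key` is NaN or inf, else None."""
    for index, entry in enumerate(log_entries):
        value = entry.get(key)
        if value is not None and not math.isfinite(value):
            return index
    return None


nan = float("nan")
# First two entries from the training log above.
logs = [
    {"loss": 9.6782, "grad_norm": nan, "learning_rate": 0.0008461538461538462, "epoch": 0.15},
    {"loss": 0.0, "grad_norm": nan, "learning_rate": 0.0006923076923076923, "epoch": 0.31},
]
print(first_non_finite_step(logs))
```

For the log above this reports index 0, meaning the gradients are already non-finite at the first logged step, before any loss of 0.0 appears.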


## Reproduce Code

To make the issue reproducible, I replaced my own dataset with the public dataset Genshin_Character_instruction/Genshin_Character_instruction.json, which can be found on Hugging Face.
Link: https://huggingface.co/datasets/YanFu0320/Genshin_Character_instruction

```python
from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM, SFTConfig
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "INFO"
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"


base_model_name = "MY_PATH_TO_MODEL/Yi-Coder-9B-Chat"

dataset = load_dataset(
    "json",
    data_files="MY_PATH_TO_DATASET/Genshin_Character_instruction/Genshin_Character_instruction.json",
    split="train",
)

print(dataset)


def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example["instruction"])):
        text = f"<|startoftext|>user {example['instruction'][i]} <|im_end|> \n <|startoftext|>assistant \n ### Answer: {example['output'][i]} <|im_end|>"
        output_texts.append(text)
    return output_texts


result_dir = "save_model"

training_args = SFTConfig(
    report_to="none",
    output_dir=result_dir,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-3,
    logging_steps=8,
    num_train_epochs=1,
    save_steps=200,
    bf16=True,
    gradient_checkpointing=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)


def find_all_linear_names(model):
    cls = torch.nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)


models = find_all_linear_names(base_model)

print(models)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=models,
)


tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)

response_template = "<|startoftext|>assistant"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template, tokenizer=tokenizer, mlm=False
)

max_seq_length = 256
trainer = SFTTrainer(
    model=base_model,
    formatting_func=formatting_prompts_func,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    data_collator=collator,
    args=training_args,
)

trainer.train()


OUTPUT_DIR = "save_genshin"
output_dir = os.path.join(result_dir, OUTPUT_DIR)

trainer.model.save_pretrained(output_dir)
trainer.model.config.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```
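
One thing that may be worth ruling out with a completion-only collator is that the response template is not matched in the tokenized text, in which case every label in a sample is masked. The sketch below counts the share of label positions that are actually supervised. It assumes the labels are available as nested Python lists and that masked positions use `-100`, which is the default ignore index as far as I know. The helper name `supervised_fraction` and the example labels are hypothetical.

```python
IGNORE_INDEX = -100  # assumption: default ignore index used for masked label positions


def supervised_fraction(labels, ignore_index=IGNORE_INDEX):
    """Return the fraction of label positions that are not masked out."""
    total = 0
    supervised = 0
    for row in labels:
        for token_label in row:
            total += 1
            if token_label != ignore_index:
                supervised += 1
    return supervised / total if total else 0.0


# Hypothetical labels for two samples: the second one is fully masked.
example_labels = [
    [-100, -100, 42, 7],
    [-100, -100, -100, -100],
]
print(supervised_fraction(example_labels))
```

A result of 0.0 on a real batch would suggest that the template `"<|startoftext|>assistant"` is never matched after tokenization. I have not verified whether that is the case here.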

## System and Environment Settings

System
```
Platform: AzureML
GPU: A100
```

Related package versions
```
accelerate==1.0.1
bitsandbytes==0.42.0
deepspeed==0.15.2
openai==1.40.0
tokenizers==0.20.3
torch==2.4.0
torchvision==0.19.0
vllm==0.6.3.post1
transformers==4.45.2
trl==0.11.0
peft==0.11.0
flash-attn==2.6.2
```

