First, thanks for your great work. I tried to fine-tune the Yi-Coder-9B-Chat model on my own dataset and ran into the following problem.
Problems
'grad_norm' becomes nan when I try to fine-tune the Yi-Coder-9B-Chat model.
Detailed Description
grad_norm is already nan at the first logged step, and from the next logged step on the loss is zero, apparently as a result of the nan grad_norm.
{'loss': 9.6782, 'grad_norm': nan, 'learning_rate': 0.0008461538461538462, 'epoch': 0.15}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0006923076923076923, 'epoch': 0.31}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0005384615384615384, 'epoch': 0.46}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.00038461538461538467, 'epoch': 0.62}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0002307692307692308, 'epoch': 0.77}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 7.692307692307693e-05, 'epoch': 0.92}
But when I run the same code with the model changed to CodeLlama-13b-Instruct-hf, everything works as expected.
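As a sanity check on the log above: the logged learning rates and epoch values are consistent with a linear decay from 1e-3 to 0 with no warmup over 52 optimizer steps, logged every 8 steps. The total of 52 steps is inferred from the logged values (it is not printed anywhere in the log), and the linear schedule is assumed from the default scheduler settings, so treat this as a rough check only. If it holds, the scheduler itself behaves normally and the nan is already present at the first logged step.

```python
import math


def linear_lr(step, total_steps, base_lr):
    """Linear decay from base_lr to 0 with no warmup (assumed schedule)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


# Learning rates copied from the training log above.
logged_lrs = [
    0.0008461538461538462,
    0.0006923076923076923,
    0.0005384615384615384,
    0.00038461538461538467,
    0.0002307692307692308,
    7.692307692307693e-05,
]

BASE_LR = 1e-3      # learning_rate in SFTConfig
LOGGING_STEPS = 8   # logging_steps in SFTConfig
TOTAL_STEPS = 52    # assumption: inferred from the log, not printed by the trainer

logged_steps = [LOGGING_STEPS * (i + 1) for i in range(len(logged_lrs))]
predicted_lrs = [linear_lr(s, TOTAL_STEPS, BASE_LR) for s in logged_steps]
predicted_epochs = [round(s / TOTAL_STEPS, 2) for s in logged_steps]

schedule_matches = all(
    math.isclose(p, l, rel_tol=1e-9) for p, l in zip(predicted_lrs, logged_lrs)
)
print("schedule matches log:", schedule_matches)
print("predicted epochs:", predicted_epochs)
```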
Reproduce Code
For the reproduction I replaced my own dataset with the public dataset Genshin_Character_instruction/Genshin_Character_instruction.json, which is available on Hugging Face.
link: https://huggingface.co/datasets/YanFu0320/Genshin_Character_instruction
from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM, SFTConfig
import os
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "INFO"
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"
base_model_name = "MY_PATH_TO_MODEL/Yi-Coder-9B-Chat"
dataset = load_dataset(
    "json",
    data_files="MY_PATH_TO_DATASET/Genshin_Character_instruction/Genshin_Character_instruction.json",
    split="train",
)
print(dataset)
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example["instruction"])):
        text = f"<|startoftext|>user {example['instruction'][i]} <|im_end|> \n <|startoftext|>assistant \n ### Answer: {example['output'][i]} <|im_end|>"
        output_texts.append(text)
    return output_texts
result_dir = "save_model"
training_args = SFTConfig(
    report_to="none",
    output_dir=result_dir,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-3,
    logging_steps=8,
    num_train_epochs=1,
    save_steps=200,
    bf16=True,
    gradient_checkpointing=True,
)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
def find_all_linear_names(model):
    cls = torch.nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)
models = find_all_linear_names(base_model)
print(models)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=models,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
response_template = "<|startoftext|>assistant"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template, tokenizer=tokenizer, mlm=False
)
max_seq_length = 256
trainer = SFTTrainer(
    model=base_model,
    formatting_func=formatting_prompts_func,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    data_collator=collator,
    args=training_args,
)
trainer.train()
OUTPUT_DIR = "save_genshin"
output_dir = os.path.join(result_dir, OUTPUT_DIR)
trainer.model.save_pretrained(output_dir)
trainer.model.config.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
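One thing that might be worth checking (I have not confirmed it is the cause here) is whether the response template is actually found in the tokenized examples. The sketch below is a simplified, pure-Python illustration of the idea behind completion-only masking, not TRL's actual implementation; the function names and the token ids are made up for the example. As far as I understand, when the template token ids never appear as a contiguous run in an example, every label is set to -100 and that example contributes nothing to the loss.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def find_subsequence(haystack, needle):
    """Return the start index of the first occurrence of needle, or -1."""
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start
    return -1


def mask_prompt_tokens(input_ids, response_ids):
    """Mask everything up to and including the response template.

    If the template is not found, the whole example is masked.
    """
    start = find_subsequence(input_ids, response_ids)
    if start == -1:
        return [IGNORE_INDEX] * len(input_ids)
    end = start + len(response_ids)
    return [IGNORE_INDEX] * end + list(input_ids[end:])


def supervised_fraction(labels):
    """Fraction of positions that still carry a training label."""
    if not labels:
        return 0.0
    return sum(1 for t in labels if t != IGNORE_INDEX) / len(labels)


# Hypothetical token ids: [1, 20] stands in for the response template.
example_ids = [1, 10, 11, 2, 1, 20, 30, 31, 2]

found = mask_prompt_tokens(example_ids, [1, 20])
missing = mask_prompt_tokens(example_ids, [99, 20])

print(found, supervised_fraction(found))
print(missing, supervised_fraction(missing))
```

On the real setup, the same fraction could be computed on the labels that the collator returns for a few formatted examples; a value of 0.0 for every example would point at the template not matching the tokenized text.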
System and Environment Settings
System
Platform: AzureML
GPU: A100
Related package versions
accelerate==1.0.1
bitsandbytes==0.42.0
deepspeed==0.15.2
openai==1.40.0
tokenizers==0.20.3
torch==2.4.0
torchvision==0.19.0
vllm==0.6.3.post1
transformers==4.45.2
trl==0.11.0
peft==0.11.0
flash-attn==2.6.2
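For anyone comparing environments, a list like the one above can be collected with a small script. This is a minimal sketch using only the standard library; the helper names are made up, and packages that are not installed are reported as such rather than raising an error.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the list above.
PACKAGES = [
    "accelerate", "bitsandbytes", "deepspeed", "openai", "tokenizers",
    "torch", "torchvision", "vllm", "transformers", "trl", "peft",
    "flash-attn",
]


def collect_versions(names):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def format_versions(versions):
    """Render pip-style 'name==version' lines."""
    return [
        f"{name}=={ver}" if ver is not None else f"{name} (not installed)"
        for name, ver in versions.items()
    ]


for line in format_versions(collect_versions(PACKAGES)):
    print(line)
```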