Dear dFactory and LLaDA authors and developers,
Thanks for your great projects. I'm trying to follow the full training pipeline described in the README. However, when I tried to convert the checkpoints from the merged format, I could not find the hf_ckpt directory in the output dir, as mentioned in this part of the README:
Important: Finding the Correct Input Path
The --input-path for the conversion script is the path to the saved Hugging Face checkpoint, not the root output directory you specified during training. The checkpoint is typically located in a subdirectory like:
TRAIN_OUTPUT_DIR/checkpoints/global_step_XXX/hf_ckpt/
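A check along the following lines can list which step directories actually contain an hf_ckpt folder. This is only a minimal sketch that assumes the layout quoted above (TRAIN_OUTPUT_DIR/checkpoints/global_step_XXX/hf_ckpt/); the helper name find_hf_checkpoints is hypothetical and is not part of dFactory.

```python
from pathlib import Path


def find_hf_checkpoints(train_output_dir):
    """Map each global_step_* directory name to whether it contains hf_ckpt.

    Hypothetical helper: it assumes the README layout
    TRAIN_OUTPUT_DIR/checkpoints/global_step_XXX/hf_ckpt/.
    """
    ckpt_root = Path(train_output_dir) / "checkpoints"
    if not ckpt_root.is_dir():
        # Nothing has been saved yet, or the layout differs from the README.
        return {}
    result = {}
    for step_dir in sorted(ckpt_root.glob("global_step_*")):
        if step_dir.is_dir():
            result[step_dir.name] = (step_dir / "hf_ckpt").is_dir()
    return result


if __name__ == "__main__":
    # Placeholder path: replace with the output directory used for training.
    for name, has_hf in find_hf_checkpoints("TRAIN_OUTPUT_DIR").items():
        print(f"{name}: {'hf_ckpt found' if has_hf else 'no hf_ckpt'}")
```

In my case, no step directory in the output dir contains hf_ckpt.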
After checking tasks/train_llada2_bd.py, I found the code that saves it on line 558:
if args.train.global_rank == 0 and args.train.save_hf_weights and save_checkpoint_path is not None:
    hf_weights_path = os.path.join(save_checkpoint_path, "hf_ckpt")
    model_state_dict = ckpt_to_state_dict(
        save_checkpoint_path=save_checkpoint_path,
        output_dir=args.train.output_dir,
        ckpt_manager=args.train.ckpt_manager,
    )
    save_model_weights(hf_weights_path, model_state_dict, model_assets=model_assets)
    logger.info_rank0(f"Huggingface checkpoint saved at {hf_weights_path} successfully!")
This code seems to be executed only after the entire training is complete. Does this mean that the checkpoints are available only once training has fully finished? In other words, can I not use checkpoints for testing while training is still going on? Thanks for your kind answers!