Question about storing hf_ckpt during training #19

Description

@MPX0222

Dear dFactory and LLaDA authors and developers,

Thanks for your great projects. I'm trying to follow the full training pipeline described in the README. However, when I tried to convert the checkpoints from the merged format, I could not find the hf_ckpt directory in the output dir, as mentioned in the README:

Important: Finding the Correct Input Path

The --input-path for the conversion script is the path to the saved Hugging Face checkpoint, not the root output directory you specified during training. The checkpoint is typically located in a subdirectory like:

TRAIN_OUTPUT_DIR/checkpoints/global_step_XXX/hf_ckpt/
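
For reference, here is a minimal sketch of how one might check which steps already contain an hf_ckpt directory. It assumes only the layout quoted above (TRAIN_OUTPUT_DIR/checkpoints/global_step_XXX/hf_ckpt/). The helper name `find_hf_checkpoints` is hypothetical and not part of dFactory.

```python
from pathlib import Path


def find_hf_checkpoints(train_output_dir):
    """Return (step, path) pairs for each global_step_*/hf_ckpt directory that exists.

    Assumes the layout TRAIN_OUTPUT_DIR/checkpoints/global_step_XXX/hf_ckpt/
    described in the README; this helper is illustrative, not part of the repo.
    """
    root = Path(train_output_dir) / "checkpoints"
    found = []
    for step_dir in root.glob("global_step_*"):
        hf_dir = step_dir / "hf_ckpt"
        step = step_dir.name.rsplit("_", 1)[-1]
        # Keep only step directories that actually contain an hf_ckpt folder.
        if step.isdigit() and hf_dir.is_dir():
            found.append((int(step), hf_dir))
    # Sort numerically by step so the latest checkpoint comes last.
    return sorted(found, key=lambda item: item[0])
```

An empty result for a run that has already written global_step_XXX directories would suggest that hf_ckpt has not been saved yet for any of those steps.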

After checking tasks/train_llada2_bd.py, I found the code that saves it on line 558:

    if args.train.global_rank == 0 and args.train.save_hf_weights and save_checkpoint_path is not None:
        hf_weights_path = os.path.join(save_checkpoint_path, "hf_ckpt")
        model_state_dict = ckpt_to_state_dict(
            save_checkpoint_path=save_checkpoint_path,
            output_dir=args.train.output_dir,
            ckpt_manager=args.train.ckpt_manager,
        )
        save_model_weights(hf_weights_path, model_state_dict, model_assets=model_assets)
        logger.info_rank0(f"Huggingface checkpoint saved at {hf_weights_path} successfully!")

This code seems to be executed only after the entire training run is complete. Does this mean that all checkpoints are available only after training has fully finished? Or does it mean I can't use checkpoints for testing while training is still going on? Thank you for your kind answers!
