infinigence/STAlloc

STAlloc

Description

  • STAlloc is a memory allocation tool that reduces GPU memory fragmentation and improves memory utilization.
  • STAlloc leverages the predictable memory allocation pattern of large-model training to perform ahead-of-time memory allocation planning.

Compilation

  • cd Allocator && make
  • See Allocator/README.md for more details

Env

  • STALLOC_MODE: [Torch (default), Trace, Alloc]. Sets the operating mode of STAlloc.
  • STALLOC_LIB_PATH: path to the STAlloc library; required when STALLOC_MODE is Trace or Alloc.
  • STALLOC_MODEL_INFO_PATH: directory where model memory info is saved; required when STALLOC_MODE is Trace or Alloc.
  • STALLOC_DYNAMIC: [0 (default), 1]. Required for MoE models (without batched GEMM) when STALLOC_MODE is Trace or Alloc.
  • STALLOC_LOG_LEVEL: [0, 1, 2, 3 (default)]. Sets the log level of STAlloc; the smaller the value, the more detailed the output.
  • STALLOC_STATIC_FALLBACK: [0 (default), 1]. Enables fallback in the static allocator, which may affect performance.
  • STALLOC_TRACE_FAST_MODE: [0 (default), 1]. Uses a faster dynamic allocator during tracing, but this may lead to OOM when the memory required by the model is close to the GPU limit.
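As a concrete example, a Trace run might export the variables like this. All paths below are placeholders chosen for illustration, not part of the project:

```shell
# Placeholder paths -- adjust to your installation.
STALLOC_PATH=/opt/STAlloc

export STALLOC_MODE=Trace                                   # Torch | Trace | Alloc
export STALLOC_LIB_PATH="${STALLOC_PATH}/Allocator"
export STALLOC_MODEL_INFO_PATH=/workspace/allocator_case/my-model
export STALLOC_LOG_LEVEL=1                                  # more detail than the default 3
```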

Preparation

pretrain_xxx.py

"""
At the beginning of pretrain.py, import stalloc
"""
from stalloc.utils.hook_model import hook_memory_model

def model_provider(...):
    ...
    #return model 
    return hook_memory_model(model, args)

train script

Add the following code to the training script.

export STALLOC_MODE=Trace
export STALLOC_TRACE_FAST_MODE=1  # may lead to OOM
STALLOC_PATH=YourPath
export STALLOC_LIB_PATH=${STALLOC_PATH}/Allocator

MODEL_TAG=llama3-70b-tp8pp8mbs1gbs128-node${RANK}
MEMORY_SAVED_DIR=/workspace/allocator_case
export STALLOC_MODEL_INFO_PATH=${MEMORY_SAVED_DIR}/${MODEL_TAG}
if [ "$STALLOC_MODE" == "Trace" ]; then
    if [ -e "${STALLOC_MODEL_INFO_PATH}/trace" ]; then
       rm -rf ${STALLOC_MODEL_INFO_PATH}/trace
    fi
    mkdir -p ${STALLOC_MODEL_INFO_PATH}/trace
    mkdir -p ${STALLOC_MODEL_INFO_PATH}/output
elif [ "$STALLOC_MODE" == "Alloc" ]; then
    export STALLOC_LOG_LEVEL=1
    if [ ! -e "${STALLOC_MODEL_INFO_PATH}/output/plan" ]; then
       echo "STAlloc plan not found; run Trace mode and generate a plan first." >&2
       exit 1
    fi
fi

# !!! If you set "STALLOC_MODE=Trace", please make sure that "train-iter=3" and "eval-iter=1".

Usage

  • After the preparation above is done, run the memory tool with the following steps.

Step-1

  • Set export STALLOC_MODE=Torch and run the training script.
  • Check whether torch's GPU memory fragmentation is severe.
    • Find a line of this form in the log: dev0 : max_reserved:xx.xx, max_allocated:xx.xx, utilization:xx.xx%
    • The smaller the utilization, the more severe the fragmentation.
    • Fragmentation is generally considered severe when utilization is below 90%.
  • An OOM in Torch mode does not mean that Trace mode will also OOM.
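The utilization check above can be scripted. The log line below is a made-up example, and the 90% threshold follows the rule of thumb stated above:

```shell
# Hypothetical log line; real ones appear in the training log.
line="dev0 : max_reserved:78.50, max_allocated:66.40, utilization:84.59%"

# Extract the utilization percentage from the line.
util=$(echo "$line" | sed -E 's/.*utilization:([0-9.]+)%.*/\1/')

# Flag fragmentation as severe when utilization is below 90%.
verdict=$(awk -v u="$util" 'BEGIN { if (u < 90) print "severe"; else print "ok" }')
echo "utilization=${util}% fragmentation=${verdict}"
```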

Step-2

  • Set export STALLOC_MODE=Trace and run the training script.

Step-3

  • Generate the plan by running the following commands:
  • cd ${STALLOC_PATH}/Synthesizer
  • python main.py -model-memory-dir=XXXX
  • Run python main.py --help for more details.

Step-4

  • Set export STALLOC_MODE=Alloc and run the training script.

Others

If you are using another version of Megatron-LM, you may need to check the paths of the patch functions and modify them in utils/memory_patcher.py.

License

This project is licensed under the Apache License 2.0.
