Does this have anything to do with the memory issue? #4

@minhphi1712

`bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT --input-source /home/mca/MP_PSP/myenv/env_name/MSAGPT/INPUT --output-path /home/mca/MP_PSP/myenv/env_name/MSAGPT/output --max-gen-length 64`

> NCCL_DEBUG=VERSION NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 CUDA_LAUNCH_BLOCKING=0 torchrun --nproc_per_node 1 --master_port=19865 /home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py --bf16 --skip-init --mode finetune --rotary-embedding-2d --seed 12345 --sampling-strategy BaseStrategy --max-gen-length 128 --min-gen-length 0 --num-beams 4 --length-penalty 1.0 --no-repeat-ngram-size 0 --multiline_stream --temperature 0.8 --top_k 0 --top_p 0.9 --from_pretrained ./checkpoints/MSAGPT --input-source /home/mca/MP_PSP/myenv/env_name/MSAGPT/INPUT --output-path /home/mca/MP_PSP/myenv/env_name/MSAGPT/output --max-gen-length 64
> [2025-01-21 11:21:52,777] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
> [2025-01-21 11:21:54,854] [WARNING] No training data specified
> [2025-01-21 11:21:54,855] [WARNING] No train_iters (recommended) or epochs specified, use default 10k iters.
> [2025-01-21 11:21:54,855] [INFO] using world size: 1 and model-parallel size: 1 
> [2025-01-21 11:21:54,855] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
> [2025-01-21 11:21:54,855] [INFO] [RANK 0] > initializing model parallel with size 1
> [2025-01-21 11:21:54,856] [INFO] [comm.py:652:init_distributed] cdb=None
> [2025-01-21 11:21:54,856] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1004:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1125:configure] Activation Checkpointing Information
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1126:configure] ----Partition Activations False, CPU CHECKPOINTING False
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1127:configure] ----contiguous Memory Checkpointing False with 6 total layers
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1128:configure] ----Synchronization False
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1129:configure] ----Profiling time in checkpointing False
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 15063 and data parallel seed: 12345
> [2025-01-21 11:21:54,857] [INFO] [RANK 0] building MSAGPT model ...
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 15063 and data parallel seed: 12345
> [2025-01-21 11:21:55,104] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 2860508544
> [2025-01-21 11:21:56,144] [INFO] [RANK 0] CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 5.79 GiB total capacity; 5.30 GiB already allocated; 84.75 MiB free; 5.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
> [2025-01-21 11:21:56,144] [INFO] [RANK 0] global rank 0 is loading checkpoint ./checkpoints/MSAGPT/1/mp_rank_00_model_states.pt
> [2025-01-21 11:21:58,535] [INFO] [RANK 0] > successfully loaded ./checkpoints/MSAGPT/1/mp_rank_00_model_states.pt
> Traceback (most recent call last):
>   File "/home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py", line 44, in <module>
>     model = model.to('cuda')
>             ^^^^^^^^^^^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1145, in to
>     return self._apply(convert)
>            ^^^^^^^^^^^^^^^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
>     module._apply(fn)
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
>     module._apply(fn)
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
>     module._apply(fn)
>   [Previous line repeated 2 more times]
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 820, in _apply
>     param_applied = fn(param)
>                     ^^^^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1143, in convert
>     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 5.79 GiB total capacity; 5.30 GiB already allocated; 84.75 MiB free; 5.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
> ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6700) of binary: /home/mca/anaconda3/bin/python
> Traceback (most recent call last):
>   File "/home/mca/anaconda3/bin/torchrun", line 8, in <module>
>     sys.exit(main())
>              ^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
>     return f(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
>     run(args)
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
>     elastic_launch(
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
>     return launch_agent(self._config, self._entrypoint, list(args))
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
>     raise ChildFailedError(
> torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
> ============================================================
> /home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py FAILED
> ------------------------------------------------------------
> Failures:
>   <NO_OTHER_FAILURES>
> ------------------------------------------------------------
> Root Cause (first observed failure):
> [0]:
>   time      : 2025-01-21_11:22:01
>   host      : mca-lab6
>   rank      : 0 (local_rank: 0)
>   exitcode  : 1 (pid: 6700)
>   error_file: <N/A>
>   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
> ============================================================

Output of `nvidia-smi`:

> Tue Jan 21 11:24:48 2025       
> +---------------------------------------------------------------------------------------+
> | NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
> |-----------------------------------------+----------------------+----------------------+
> | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
> |                                         |                      |               MIG M. |
> |=========================================+======================+======================|
> |   0  NVIDIA GeForce GTX 1660 Ti     Off | 00000000:01:00.0  On |                  N/A |
> |  0%   48C    P8              12W / 120W |    219MiB /  6144MiB |      3%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
>                                                                                          
> +---------------------------------------------------------------------------------------+
> | Processes:                                                                            |
> |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
> |        ID   ID                                                             Usage      |
> |=======================================================================================|
> |    0   N/A  N/A      2307      G   /usr/lib/xorg/Xorg                           92MiB |
> |    0   N/A  N/A      2435      G   ...libexec/gnome-remote-desktop-daemon        1MiB |
> |    0   N/A  N/A      2473      G   /usr/bin/gnome-shell                         70MiB |
> |    0   N/A  N/A      4776      G   ...seed-version=20250119-180455.285000       51MiB |
> +---------------------------------------------------------------------------------------+

This is my INPUT file:

7pno_D:GSGSGSGSGTNSLLNLRSRLAAKAAKEAASSNSENLYFQ---SGGTRLTNSLLNLRSRLAAKAAKEAASSNAT------STSGGTRLTNSLLNLRSRLAAKAIKEST----------
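
To check whether this is a memory issue, here is a rough back-of-the-envelope estimate using only numbers from the log above: the parameter count reported for model parallel rank 0 and the total GPU capacity from the OOM message. It assumes that `--bf16` stores every parameter in 2 bytes and that all parameters are moved to the GPU by `model.to('cuda')`. It ignores the CUDA context, activations, and any generation cache, so the real requirement is likely higher. The helper name `weights_gib` is made up for this sketch.

```python
# Rough estimate of the GPU memory needed just to hold the model weights.
# Assumption: 2 bytes per parameter for bf16, 4 bytes for fp32.

GIB = 1024 ** 3  # bytes in one GiB


def weights_gib(num_params: int, bytes_per_param: int) -> float:
    """Return the size of the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


# "number of parameters on model parallel rank 0" from the log
num_params = 2_860_508_544
# "total capacity" from the CUDA out of memory message
gpu_total_gib = 5.79

bf16_gib = weights_gib(num_params, 2)
fp32_gib = weights_gib(num_params, 4)

print(f"bf16 weights: {bf16_gib:.2f} GiB")
print(f"fp32 weights: {fp32_gib:.2f} GiB")
print(f"GPU total:    {gpu_total_gib:.2f} GiB")
print(f"headroom with bf16 weights: {gpu_total_gib - bf16_gib:.2f} GiB")
```

Under these assumptions the bf16 weights alone come to about 5.33 GiB, which is close to the 5.30 GiB already allocated when the 68.00 MiB request failed. That leaves well under 0.5 GiB on a 5.79 GiB card, part of which is already used by the desktop processes listed in the `nvidia-smi` output.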
