RecursionError: maximum recursion depth exceeded while calling a Python object #59

Description

@wangkevin02

When running the code on a single machine with multiple GPUs, I get a RecursionError regardless of the DeepSpeed ZeRO configuration I use (including ZeRO-3). The error occurs while Accelerate initializes the distributed model: `accelerator.prepare` calls into DeepSpeed, which casts the model with `bfloat16()`, and `Module._apply` then recurses until Python's recursion limit is exceeded. For context, I used Llama3 as both the critic model and the policy model, and I am simply trying to run the code end to end.

Code Modifications

I changed only part of the data preprocessing code; the model class code is unmodified.

Troubleshooting Steps Tried

I tried changing the DeepSpeed version, but the error persisted.

Traceback (most recent call last):
  File "train_ppo.py", line 228, in <module>
    main(opt)
  File "train_ppo.py", line 220, in main
    trainer = PPOTrainer(opt, policy_model, ref_model, critic_model, reward_model, accelerator)
  File "/hy-tmp/My_MOSS-RLHF/ppo/ppo_trainer.py", line 116, in __init__
    self.model, self.optimizer, self.scheduler = self.accelerator.prepare(self.model, self.optimizer, self.scheduler)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344, in prepare
    result = self._prepare_deepspeed(*args)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py", line 1851, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
    self._configure_distributed_model(model)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1160, in _configure_distributed_model
    self.module.bfloat16()
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 856, in bfloat16
    return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  [Previous line repeated 982 more times]
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 663, in _apply
    with torch.no_grad():
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 133, in __enter__
    torch.set_grad_enabled(False)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 228, in __init__
    self.prev = torch.is_grad_enabled()
RecursionError: maximum recursion depth exceeded while calling a Python object
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58428 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58430 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58431 closing signal SIGTERM
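
One possible reading of the traceback, not confirmed here: `Module._apply` recurses over each module's `_modules` dict, so nearly a thousand repeated `module._apply(fn)` frames would be consistent with a cycle in the module graph (a submodule that directly or indirectly holds a reference back to one of its ancestors). The sketch below is a minimal cycle check under that assumption. `find_module_cycle` and `FakeModule` are hypothetical names introduced for illustration; `FakeModule` only mimics the `_modules` dict of `torch.nn.Module` so the example runs without PyTorch.

```python
def find_module_cycle(root):
    """Return the attribute path to the first cycle found, or None.

    Walks the `_modules` dict, the same mapping that
    torch.nn.Module._apply iterates over when it recurses.
    """
    on_path = set()  # ids of the modules on the current depth-first path

    def visit(module, path):
        if id(module) in on_path:
            # Reached a module that is already an ancestor: this is a cycle.
            return path
        on_path.add(id(module))
        for name, child in getattr(module, "_modules", {}).items():
            if child is None:
                continue
            found = visit(child, path + [name])
            if found is not None:
                return found
        on_path.discard(id(module))
        return None

    return visit(root, [])


class FakeModule:
    """Hypothetical stand-in for torch.nn.Module (only has `_modules`)."""

    def __init__(self):
        self._modules = {}


# Build a two-node cycle: policy -> wrapper -> policy.
policy = FakeModule()
wrapper = FakeModule()
policy._modules["wrapper"] = wrapper
wrapper._modules["policy"] = policy

print(find_module_cycle(policy))  # ['wrapper', 'policy']
```

With a real model, the same function could be called on the object passed to `accelerator.prepare` (here `self.model`) just before that call. A returned path would name the attribute chain that loops back; `None` would suggest the cause lies elsewhere.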
