The root painter trainer has started to throw memory errors. They are difficult for me to debug or reproduce, but after a random number of epochs it throws something like:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sporring/.local/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 49, in _pin_memory_loop
    do_one_step()
  File "/home/sporring/.local/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 26, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/sporring/.local/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 305, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.10/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
Sometimes it recovers, other times it crashes. Has anyone experienced something similar, or does anyone have a clue as to how I can debug it?
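One thing that may be worth ruling out (an assumption, not a diagnosis): the traceback ends while the pin-memory thread is receiving a file descriptor from a data-loading worker, so the worker may have exited or the process may be running short of file descriptors. The sketch below uses only the standard library and assumes Linux (it reads /proc/self/fd); `fd_usage` and `log_fd_usage` are hypothetical helper names, not part of the trainer or of PyTorch.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /proc/self/fd has one entry per open descriptor (Linux only).
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft, hard


def log_fd_usage(epoch):
    """Hypothetical helper: print descriptor usage, e.g. once per epoch."""
    open_fds, soft, hard = fd_usage()
    line = f"epoch {epoch}: {open_fds} open fds (soft limit {soft}, hard limit {hard})"
    print(line)
    return line
```

Calling `log_fd_usage(epoch)` at the end of each epoch would show whether the count creeps toward the soft limit before a failure. If it does, raising the limit with `ulimit -n` or trying `torch.multiprocessing.set_sharing_strategy("file_system")` are possible things to test. If it does not, running with `num_workers=0` may surface an error that is otherwise hidden inside a worker process.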
Thanks, Jon