I am attempting to run comet-train with multiple GPUs.
Command (abbreviated):
CUDA_VISIBLE_DEVICES=0,1,2,3 comet-train ...
Config (abbreviated):
init_args:
  accelerator: gpu
  devices: 4
  auto_scale_batch_size: True
  auto_select_gpus: True
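(For reference, a rough Python equivalent of these init_args, assuming they map straight onto the PyTorch Lightning 1.9 Trainer constructor:)

from pytorch_lightning import Trainer

# rough equivalent of the YAML above (PL 1.9.x)
trainer = Trainer(
    accelerator="gpu",
    devices=4,
    auto_scale_batch_size=True,
    auto_select_gpus=True,
)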
Output with error (abbreviated):
...
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
...
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Added key: store_based_barrier_key:1 to store for rank: 1
...
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Added key: store_based_barrier_key:1 to store for rank: 3
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
...
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
...
1,161.732 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:208: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
rank_zero_warn(
Traceback (most recent call last):
File "/opt/conda/bin/comet-train", line 8, in <module>
sys.exit(train_command())
File "/opt/conda/lib/python3.10/site-packages/comet/cli/train.py", line 209, in train_command
trainer.fit(model)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
mp.start_processes(
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
results = function(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1204, in _run_train
self._run_sanity_check()
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1276, in _run_sanity_check
val_loop.run()
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 194, in run
self.on_run_start(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 84, in on_run_start
self._data_fetcher = iter(data_fetcher)
File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 178, in __iter__
self.dataloader_iter = iter(self.dataloader)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
return self._get_iterator()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
w.start()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/opt/conda/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/opt/conda/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/opt/conda/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/opt/conda/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/opt/conda/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'CometModel.val_dataloader.<locals>.<lambda>'
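As far as I can tell, the DataLoader worker is being started with the spawn start method (see the popen_spawn_posix frames above), and spawn has to pickle everything it hands to the child process; Python's pickle cannot serialize a lambda defined inside a method, which is what the error points at in CometModel.val_dataloader. A minimal sketch outside of COMET that triggers the same AttributeError:

import multiprocessing as mp

def build_collate():
    # locally defined lambda, analogous to the one inside CometModel.val_dataloader
    return lambda batch: batch

def worker(fn):
    print(fn([1, 2, 3]))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    # spawn pickles the Process object, including its args; the local lambda fails with:
    # AttributeError: Can't pickle local object 'build_collate.<locals>.<lambda>'
    p = ctx.Process(target=worker, args=(build_collate(),))
    p.start()
    p.join()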
I'm using NVIDIA A10G GPUs and the following software versions:
- Python - 3.10.9
- COMET - upstream
- torch - 2.0.1
- pytorch-lightning - 1.9.5
- transformers - 4.29.0
- numpy - 1.24.3
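For what it's worth, replacing the local lambda with something picklable, e.g. a module-level function or functools.partial, avoids this class of error. A sketch (not COMET's actual code; collate_samples and pad_id are hypothetical names):

import pickle
from functools import partial

def collate_samples(batch, pad_id=0):
    # module-level function (hypothetical stand-in for the real collate logic):
    # picklable, unlike a lambda defined inside CometModel.val_dataloader
    return batch

collate_fn = partial(collate_samples, pad_id=0)
pickle.dumps(collate_fn)  # succeeds, so spawn-based workers can receive it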