I ran the default code with python run.py trainer.gpus=4 +trainer.accelerator=ddp logger=tensorboard
When I start the Slurm job with sbatch srun.sh, it gets stuck at the DDP initialization step:
[2021-12-19 17:50:02,938][src.utils.utils][INFO] - Disabling python warnings! <config.ignore_warnings=True>
[2021-12-19 17:50:02,983][src.train][INFO] - Instantiating datamodule <src.datamodules.mnist_datamodule.MNISTDataModule>
[2021-12-19 17:50:02,989][src.train][INFO] - Instantiating model <src.models.mnist_model.MNISTLitModel>
[2021-12-19 17:50:03,012][src.train][INFO] - Instantiating callback <pytorch_lightning.callbacks.ModelCheckpoint>
[2021-12-19 17:50:03,014][src.train][INFO] - Instantiating callback <pytorch_lightning.callbacks.EarlyStopping>
[2021-12-19 17:50:03,015][src.train][INFO] - Instantiating callback <pytorch_lightning.callbacks.RichModelSummary>
[2021-12-19 17:50:03,015][src.train][INFO] - Instantiating callback <pytorch_lightning.callbacks.RichProgressBar>
[2021-12-19 17:50:03,016][src.train][INFO] - Instantiating logger <pytorch_lightning.loggers.tensorboard.TensorBoardLogger>
[2021-12-19 17:50:03,017][src.train][INFO] - Instantiating trainer <pytorch_lightning.Trainer>
[2021-12-19 17:50:03,095][pytorch_lightning.utilities.distributed][INFO] - Multi-processing is handled by Slurm.
[2021-12-19 17:50:03,097][pytorch_lightning.utilities.distributed][INFO] - Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
[2021-12-19 17:50:03,097][pytorch_lightning.utilities.distributed][INFO] - GPU available: True, used: True
[2021-12-19 17:50:03,097][pytorch_lightning.utilities.distributed][INFO] - TPU available: False, using: 0 TPU cores
[2021-12-19 17:50:03,098][pytorch_lightning.utilities.distributed][INFO] - IPU available: False, using: 0 IPUs
[2021-12-19 17:50:03,098][src.train][INFO] - Logging hyperparameters!
[2021-12-19 17:50:03,116][src.train][INFO] - Starting training!
[2021-12-19 17:50:03,164][pytorch_lightning.utilities.distributed][INFO] - initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
I also tested by allocating a node with salloc first and running the same command directly on the GPU node; in that case it does not get stuck.
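For context, a minimal sketch of what a batch script like srun.sh might contain is shown below. This is an assumption, not the actual script from the report; the job name, resource numbers, and time limit are placeholders. Note that PyTorch Lightning's Slurm integration generally expects one Slurm task per GPU, so ntasks-per-node should match trainer.gpus.

#!/bin/bash
# Hypothetical srun.sh sketch (assumed, not taken from the report)
#SBATCH --job-name=mnist_ddp       # placeholder job name
#SBATCH --nodes=1                  # single node
#SBATCH --ntasks-per-node=4        # one task per GPU; Lightning's Slurm mode expects this to equal trainer.gpus
#SBATCH --gres=gpu:4               # request 4 GPUs to match trainer.gpus=4
#SBATCH --time=01:00:00            # placeholder time limit

# Launch one process per task; Lightning picks up the Slurm environment for DDP
srun python run.py trainer.gpus=4 +trainer.accelerator=ddp logger=tensorboard

If ntasks-per-node is left at 1 while trainer.gpus=4, only one process is started and the "MEMBER: 1/4" rendezvous in the log above would wait indefinitely, which is one common cause of this kind of hang.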