slurm + ddp stuck #231

@Fengyee

Description

I run the default code with python run.py trainer.gpus=4 +trainer.accelerator=ddp logger=tensorboard

When I start the Slurm job with sbatch srun.sh, it gets stuck at the DDP initialization step:

[2021-12-19 17:50:02,938][src.utils.utils][INFO] - Disabling python warnings! <config.ignore_warnings=True>
[2021-12-19 17:50:02,983][src.train][INFO] - Instantiating datamodule <src.datamodules.mnist_datamodule.MNISTDataModule>
[2021-12-19 17:50:02,989][src.train][INFO] - Instantiating model <src.models.mnist_model.MNISTLitModel>
[2021-12-19 17:50:03,012][src.train][INFO] - Instantiating callback <pytorch_lightning.callbacks.ModelCheckpoint>
[2021-12-19 17:50:03,014][src.train][INFO] - Instantiating callback <pytorch_lightning.callbacks.EarlyStopping>
[2021-12-19 17:50:03,015][src.train][INFO] - Instantiating callback <pytorch_lightning.callbacks.RichModelSummary>
[2021-12-19 17:50:03,015][src.train][INFO] - Instantiating callback <pytorch_lightning.callbacks.RichProgressBar>
[2021-12-19 17:50:03,016][src.train][INFO] - Instantiating logger <pytorch_lightning.loggers.tensorboard.TensorBoardLogger>
[2021-12-19 17:50:03,017][src.train][INFO] - Instantiating trainer <pytorch_lightning.Trainer>
[2021-12-19 17:50:03,095][pytorch_lightning.utilities.distributed][INFO] - Multi-processing is handled by Slurm.
[2021-12-19 17:50:03,097][pytorch_lightning.utilities.distributed][INFO] - Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
[2021-12-19 17:50:03,097][pytorch_lightning.utilities.distributed][INFO] - GPU available: True, used: True
[2021-12-19 17:50:03,097][pytorch_lightning.utilities.distributed][INFO] - TPU available: False, using: 0 TPU cores
[2021-12-19 17:50:03,098][pytorch_lightning.utilities.distributed][INFO] - IPU available: False, using: 0 IPUs
[2021-12-19 17:50:03,098][src.train][INFO] - Logging hyperparameters!
[2021-12-19 17:50:03,116][src.train][INFO] - Starting training!
[2021-12-19 17:50:03,164][pytorch_lightning.utilities.distributed][INFO] - initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
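
The contents of srun.sh are not included in the report. For context, a minimal sketch of a batch script for this kind of single-node, 4-GPU run might look like the following (hypothetical values, not the author's actual script):

#!/bin/bash
#SBATCH --job-name=mnist_ddp            # assumed job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4             # one Slurm task per GPU
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=4               # assumed CPU count
#SBATCH --time=01:00:00

# Lightning's Slurm integration derives the DDP ranks from the SLURM_* variables,
# so the training command is launched through srun rather than directly.
srun python run.py trainer.gpus=4 +trainer.accelerator=ddp logger=tensorboard

When Slurm manages the processes, the total number of tasks (nodes x ntasks-per-node) is expected to match trainer.gpus x the number of nodes; a mismatch is a common reason for rank 0 to hang at the "initializing distributed" step while waiting for the remaining ranks to join.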

I also tested by allocating a node with salloc first and running on the GPU node directly; in that case it does not get stuck.
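
Again hypothetically, the interactive test described above would correspond to something like:

salloc --nodes=1 --gres=gpu:4
# then, on the allocated GPU node:
python run.py trainer.gpus=4 +trainer.accelerator=ddp logger=tensorboard

When the script is launched directly like this instead of through srun, Lightning typically spawns the DDP worker processes itself rather than relying on Slurm-created tasks, which may be why this path does not hang.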

Labels: bug (Something isn't working)
