There have been numerous issues about using DDP with hydra:
#231 #289 #229 #226 #194 #352
The current state of things is well described here:
facebookresearch/hydra#2070
tl;dr:
You should be good when using current lightning-hydra-template with ddp_spawn:
# run ddp_spawn on 4 GPUs
python train.py trainer.strategy=ddp_spawn trainer.accelerator=gpu trainer.devices=4
# simulate ddp_spawn on CPU on 4 processes (for testing)
python train.py trainer.strategy=ddp_spawn trainer.accelerator=cpu trainer.devices=4
As far as I'm aware, this works correctly with normal runs as well as multiruns.
(ddp_spawn is a bit slower than normal ddp and should only be run with datamodule.num_workers=0)
Normal ddp computes correctly but generates multiple output directories.
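A rough sketch of why this can happen, under an assumption that is not verified here: normal ddp re-launches the training script once per extra process, and each launch resolves a timestamped run directory from its own start time. The pattern below is modelled on Hydra's default `outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}`; the helper names and the start delays are made up for illustration and are not part of Hydra or Lightning.

```python
from datetime import datetime, timedelta

# Assumed run-dir pattern, modelled on Hydra's default
# outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}; a project may configure another.
RUN_DIR_PATTERN = "outputs/%Y-%m-%d/%H-%M-%S"


def resolve_run_dir(process_start: datetime) -> str:
    """Hypothetical helper: format the run dir from one process's start time."""
    return process_start.strftime(RUN_DIR_PATTERN)


def run_dirs_for_ranks(first_start: datetime, start_delays_s: list[float]) -> set[str]:
    """Collect the directories resolved by ranks that start at different times."""
    return {
        resolve_run_dir(first_start + timedelta(seconds=delay))
        for delay in start_delays_s
    }


start = datetime(2023, 4, 1, 12, 0, 0)
# Four ranks, each re-launched a little later than the previous one (made-up delays).
separate = run_dirs_for_ranks(start, [0, 2, 3, 5])
# If every rank reused the directory resolved once by the parent, only one exists.
shared = run_dirs_for_ranks(start, [0, 0, 0, 0])
print(len(separate), len(shared))
```

With ddp_spawn the workers are started from the already-running parent process, so under the same assumption the directory is resolved only once.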
I have not tested what happens when using SLURM.
For now, I don't see anything that can be done on the template side to fix this. This might change with future hydra releases.
Update (April 2023):
Normal DDP seems to be working correctly with the current lightning release (2.0.2). There are no longer multiple output directories.