
Problems with DDP + hydra #393

@ashleve

There have been numerous issues about using DDP with Hydra:
#231, #289, #229, #226, #194, #352

The current state of things is well described here:
facebookresearch/hydra#2070

tl;dr:
You should be fine using the current lightning-hydra-template with ddp_spawn:

# run ddp_spawn on 4 GPUs
python train.py trainer.strategy=ddp_spawn trainer.accelerator=gpu trainer.devices=4

# simulate ddp_spawn on CPU on 4 processes (for testing)
python train.py trainer.strategy=ddp_spawn trainer.accelerator=cpu trainer.devices=4

As far as I'm aware, this works correctly for both normal runs and multiruns.

(ddp_spawn is a bit slower than normal ddp and should only be run with datamodule.num_workers=0)
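To avoid forgetting the num_workers override, the command can be assembled in one place. This is only a minimal sketch: the helper name `ddp_spawn_command` is hypothetical and not part of the template, and it only builds the command string from the overrides shown above; it does not launch anything.

```python
import shlex


def ddp_spawn_command(devices, accelerator="gpu", script="train.py"):
    """Build a ddp_spawn launch command (hypothetical helper, not part of the template)."""
    overrides = [
        "trainer.strategy=ddp_spawn",
        f"trainer.accelerator={accelerator}",
        f"trainer.devices={devices}",
        # ddp_spawn should only be run with dataloader workers disabled
        "datamodule.num_workers=0",
    ]
    # shlex.join quotes arguments where needed and joins them with spaces
    return shlex.join(["python", script, *overrides])


if __name__ == "__main__":
    # command for 4 GPUs
    print(ddp_spawn_command(4))
    # command for simulating ddp_spawn on 4 CPU processes (for testing)
    print(ddp_spawn_command(4, accelerator="cpu"))
```

The printed strings can be pasted into a shell or passed to a job script.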

Normal ddp produces correct results but generates multiple output directories.

I have not tested what happens when using SLURM.

For now, I don't see anything that can be done on the template side to fix this. This might change with future Hydra releases.

Update (April 2023):
Normal DDP seems to be working correctly with the current Lightning release (2.0.2). There are no longer multiple output directories.

Labels: bug (Something isn't working), important (High importance issue)