Description
This is a very desirable feature, especially to push the throughput of single-agent training to 200K FPS and beyond.
Plan: use NCCL and/or Torch DistributedDataParallel.
We can spawn one learner process per GPU and then split the data equally (e.g. with 4 learners, learner #3 gets all trajectories with index % 4 == 3).
Then we average the gradients across the learners; see the sketch below. This will also help parallelize batching, since multiple processes will be preparing batches.
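A minimal sketch of this layout, assuming one learner process per GPU, the NCCL backend, and a hypothetical `compute_loss` function; gradient averaging is done explicitly with `all_reduce` (wrapping the model in `DistributedDataParallel` would achieve the same thing automatically).

```python
import torch
import torch.distributed as dist

def learner_loop(rank: int, world_size: int, model, trajectories):
    # One learner process per GPU; NCCL backend for GPU-to-GPU gradient exchange.
    # Assumes MASTER_ADDR / MASTER_PORT are set in the environment.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = model.to(rank)

    # Equal data split: learner k gets all trajectories with index % world_size == k.
    local_batch = [t for i, t in enumerate(trajectories) if i % world_size == rank]

    loss = compute_loss(model, local_batch)  # hypothetical loss function
    loss.backward()

    # Average gradients across all learners before the optimizer step.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```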
An alternative is to spawn the learner process (one per policy) and then have it spawn child processes for individual GPUs. This can be easier to implement.
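A sketch of this alternative, assuming `torch.multiprocessing.spawn` is used by the per-policy learner to start one child process per GPU; it reuses the `learner_loop` from the previous sketch, and the function name here is illustrative.

```python
import torch.multiprocessing as mp

def start_policy_learner(model, trajectories, num_gpus: int):
    # One learner per policy; it spawns one child process per available GPU.
    mp.spawn(
        learner_loop,                        # defined in the previous sketch
        args=(num_gpus, model, trajectories),
        nprocs=num_gpus,
        join=True,
    )
```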
To take full advantage of this, we also need to support policy workers on multiple GPUs. This requires exchanging the parameter vectors between the learner and the policy workers through CPU memory, rather than shared GPU memory. This can be step 1 of the implementation.
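A sketch of that step, assuming the learner publishes its parameters as CPU tensors in shared memory and each policy worker copies them onto its own GPU; the dictionary-based handoff is illustrative, not the project's actual IPC mechanism.

```python
import torch

def publish_params(model) -> dict:
    # Learner side: move parameters to CPU tensors placed in shared memory,
    # so policy workers on any GPU can read them without a shared CUDA context.
    cpu_state = {}
    for name, param in model.state_dict().items():
        t = param.detach().cpu()
        t.share_memory_()
        cpu_state[name] = t
    return cpu_state

def load_params(model, cpu_state: dict, device: torch.device):
    # Policy-worker side: copy the shared CPU tensors onto this worker's GPU.
    model.load_state_dict({k: v.to(device) for k, v in cpu_state.items()})
```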