Multi-GPU learner #45

@alex-petrenko

Description

This is a very desirable feature, especially to push the throughput of single-agent training to 200K FPS and beyond.

Plan: use NCCL and/or PyTorch DistributedDataParallel.
We can spawn one learner process per GPU and split the trajectory data equally between them (e.g. with three learners, learner #0 gets all trajectories with index % 3 == 0).
Then we average the gradients across learners. This will also help parallelize the batching, since multiple processes will be doing it.
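The round-robin split and the gradient averaging can be sketched in a few lines. A minimal, framework-free sketch; `shard_for_learner` and `average_gradients` are hypothetical helper names, and in the real thing the averaging would be done by an NCCL all-reduce (what DDP does under the hood) rather than in Python:

```python
def shard_for_learner(trajectories, learner_idx, num_learners):
    # Round-robin split: learner i consumes trajectories whose index
    # satisfies index % num_learners == i.
    return [t for i, t in enumerate(trajectories)
            if i % num_learners == learner_idx]

def average_gradients(per_learner_grads):
    # Elementwise mean across the per-GPU gradient vectors; this is the
    # reduction NCCL/DDP would perform across ranks for us.
    n = len(per_learner_grads)
    return [sum(g) / n for g in zip(*per_learner_grads)]
```

With three learners, `shard_for_learner(trajs, 0, 3)` yields the trajectories with index 0, 3, 6, ... while learners 1 and 2 get the remaining two interleaved streams, so every trajectory is consumed exactly once.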

An alternative is to spawn one learner process per policy and have it spawn child processes for the individual GPUs. This may be easier to implement.
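The parent-learner-with-GPU-children layout above could look roughly like this. A process-topology sketch only, using stdlib `multiprocessing` with a fork context (Linux): the function and argument names are illustrative, and a real child would call `torch.cuda.set_device(rank)` and run the actual training step instead of producing a stand-in gradient:

```python
import multiprocessing as mp

def gpu_learner(rank, num_gpus, result_queue):
    # Child process owning one GPU. Real implementation: pin to
    # cuda:<rank>, batch its trajectory shard, compute gradients.
    local_gradient = [float(rank), float(rank) * 2.0]  # stand-in gradient
    result_queue.put((rank, local_gradient))

def run_learner(num_gpus):
    # Parent learner process (one per policy) that owns the GPU children
    # and collects their gradients for averaging.
    ctx = mp.get_context("fork")  # fork keeps the sketch self-contained
    queue = ctx.Queue()
    children = [ctx.Process(target=gpu_learner, args=(rank, num_gpus, queue))
                for rank in range(num_gpus)]
    for p in children:
        p.start()
    grads = dict(queue.get() for _ in children)
    for p in children:
        p.join()
    return grads
```

The appeal of this layout is that the parent keeps the existing one-learner-per-policy interface, so the rest of the system doesn't need to know how many GPUs sit behind it.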

To take full advantage of this, we also need to support policy workers on multiple GPUs. That requires exchanging the parameter vectors between the learner and the policy workers through CPU memory rather than shared GPU memory. This could be step 1 of the implementation.
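The CPU-memory parameter exchange could be prototyped with a named shared-memory block: the learner publishes its flat parameter vector there, and a policy worker on any GPU attaches, copies it out, and uploads it to its own device. A stdlib-only sketch (Python 3.8+ `multiprocessing.shared_memory`); the function names and the flat-doubles layout are hypothetical, not sample-factory's API:

```python
import struct
from multiprocessing import shared_memory

def publish_params(shm_name, params):
    # Learner side: copy the flat parameter vector into the named CPU
    # shared-memory block (8 bytes per float64 parameter).
    shm = shared_memory.SharedMemory(name=shm_name)
    shm.buf[: 8 * len(params)] = struct.pack(f"{len(params)}d", *params)
    shm.close()

def fetch_params(shm_name, num_params):
    # Policy-worker side: read the vector back out of CPU memory before
    # uploading it to this worker's GPU.
    shm = shared_memory.SharedMemory(name=shm_name)
    raw = bytes(shm.buf[: 8 * num_params])
    shm.close()
    return list(struct.unpack(f"{num_params}d", raw))
```

Because the block is addressed by name rather than by a CUDA device pointer, learner and policy workers no longer have to live on the same GPU, which is exactly what multi-GPU policy workers need.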
