Description
This is a very desirable feature, especially to push the throughput of single-agent training to 200K FPS and beyond.
Plan: use NCCL and/or Torch DistributedDataParallel.
We can spawn one learner process per GPU and then split the data equally (e.g. with 4 learners, learner #3 gets all trajectories with index % 4 == 3).
Then we average the gradients across the learners; see the sketch below. This will also help parallelize batching, since multiple processes will be preparing batches.
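A minimal sketch of this layout, assuming one learner process per GPU, the NCCL backend, and a hypothetical `compute_loss` function; gradient averaging is done explicitly with `all_reduce` (wrapping the model in `DistributedDataParallel` would achieve the same thing automatically).

```python
import torch
import torch.distributed as dist

def learner_loop(rank: int, world_size: int, model, trajectories):
    # One learner process per GPU; NCCL backend for GPU-to-GPU gradient exchange.
    # Assumes MASTER_ADDR / MASTER_PORT are set in the environment.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = model.to(rank)

    # Equal data split: learner k gets all trajectories with index % world_size == k.
    local_batch = [t for i, t in enumerate(trajectories) if i % world_size == rank]

    loss = compute_loss(model, local_batch)  # hypothetical loss function
    loss.backward()

    # Average gradients across all learners before the optimizer step.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```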
An alternative is to spawn the learner process (one per policy) and then have it spawn child processes for individual GPUs. This can be easier to implement.
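A sketch of this alternative, assuming `torch.multiprocessing.spawn` is used by the per-policy learner to start one child process per GPU; it reuses the `learner_loop` from the previous sketch, and the function name here is illustrative.

```python
import torch.multiprocessing as mp

def start_policy_learner(model, trajectories, num_gpus: int):
    # One learner per policy; it spawns one child process per available GPU.
    mp.spawn(
        learner_loop,                        # defined in the previous sketch
        args=(num_gpus, model, trajectories),
        nprocs=num_gpus,
        join=True,
    )
```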
To take full advantage of this, we also need to support policy workers on multiple GPUs. This requires exchanging the parameter vectors between the learner and the policy workers through CPU memory, rather than shared GPU memory. This can be step 1 of the implementation.
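A sketch of that step, assuming the learner publishes its parameters as CPU tensors in shared memory and each policy worker copies them onto its own GPU; the dictionary-based handoff is illustrative, not the project's actual IPC mechanism.

```python
import torch

def publish_params(model) -> dict:
    # Learner side: move parameters to CPU tensors placed in shared memory,
    # so policy workers on any GPU can read them without a shared CUDA context.
    cpu_state = {}
    for name, param in model.state_dict().items():
        t = param.detach().cpu()
        t.share_memory_()
        cpu_state[name] = t
    return cpu_state

def load_params(model, cpu_state: dict, device: torch.device):
    # Policy-worker side: copy the shared CPU tensors onto this worker's GPU.
    model.load_state_dict({k: v.to(device) for k, v in cpu_state.items()})
```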