The 37 Implementation Details of PPO, a blog post published at ICLR, describes a number of PPO implementation details that improve both efficiency and model performance. See also: Andrychowicz et al., Engstrom et al.
Some of these optimizations are minor and probably irrelevant, many are already implemented here, and some may provide performance boosts to `trlx`. This issue documents these details as a checklist, to track this repository's progress towards the full list.
- 1. Vectorized architecture — `trlx` already does this.
- 2. Weights and biases initialization. Any layers initialized from scratch should use orthogonal initialization with scaling `sqrt(2)` and a bias of 0, with the policy network's last layer scaled by `0.01` after init (sketch below).
- 3. Adam optimizer epsilon. Andrychowicz et al. recommend `1e-7` as the Adam epsilon (and actually find that the PyTorch default of `1e-8` is the worst of the choices tested).
- 4. Optimizer weight decay and learning-rate annealing. Currently the code does not appear to use the config value of `weight_decay: 1e-6` at all. It also uses cosine annealing instead of linear decay, and decays not to 0 (as recommended by Andrychowicz et al.) but to `1.412e-4` by default. Maybe test linear decay to see if it makes a difference (sketch below).
- 5. Generalized Advantage Estimation. Correctly implemented in `trlx` (reference sketch below).
- 6. Mini-batch updates. In `trlx` this is being done in `make_experience`.
- 7. Normalization of advantages (at the mini-batch level). I believe this is being done, since I think `whiten` is called at the mini-batch level?
- 8. Clipped surrogate objective. Done in `trlx` (reference sketch below).
- 9. Value function loss clipping. Done in `trlx` (covered by the same sketch as item 8).
- 10. Overall loss and entropy bonus. Entropy is not used for regularization in `trlx`. OpenAI set it to 0 for MuJoCo anyway, and Andrychowicz et al. find that entropy regularization does not help performance, so this may not be useful to implement.
- 11. Global gradient clipping. The `trlx` `grad_clip` config option does not appear to be connected to anything. Andrychowicz et al. find a small performance boost from ensuring the norm of the gradients of all parameters does not exceed `0.5` (sketch below).
- 12. KL approximation. Check that the unbiased estimator is being used (sketch below).
- 13. Shared vs. separate policy/value networks. Irrelevant in `trlx` due to the hydra-heads implementation.
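For item 2, a minimal sketch of the recommended initialization. The layer shapes and the `init_orthogonal` helper are placeholders for illustration, not `trlx`'s actual code:

```python
import math
from torch import nn

def init_orthogonal(layer: nn.Linear, gain: float) -> nn.Linear:
    """Orthogonal weight initialization with the given gain and a zero bias."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

# Hidden layers use gain sqrt(2); the policy head uses gain 0.01;
# the value head uses gain 1.0 (per the blog post). Sizes are placeholders.
hidden = init_orthogonal(nn.Linear(64, 64), math.sqrt(2))
policy_head = init_orthogonal(nn.Linear(64, 10), 0.01)
value_head = init_orthogonal(nn.Linear(64, 1), 1.0)
```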
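For items 3 and 4, a sketch of how the optimizer and a linear learning-rate decay could be wired up. The model, base LR, and `total_updates` are placeholders, not values from the `trlx` config:

```python
import torch
from torch import nn

model = nn.Linear(8, 2)   # placeholder model
total_updates = 10_000    # placeholder; would come from the training config

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,            # placeholder base learning rate
    eps=1e-7,           # item 3: epsilon recommended by Andrychowicz et al.
    weight_decay=1e-6,  # item 4: actually pass the config's weight_decay through
)

# Item 4: linear decay from the base LR to 0 over training, instead of cosine annealing.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: max(0.0, 1.0 - step / total_updates),
)

# Per optimizer update: optimizer.step(); scheduler.step()
```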
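For item 5, the reference form of the GAE recursion from the blog post, useful for comparison against `trlx`'s implementation. This is a generic sketch; in `trlx` the `dones`/bootstrap handling may look different:

```python
import torch

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a rollout of length T.

    rewards, values, dones: shape (T,) tensors; last_value: bootstrap value
    for the state after the final step; gamma/lam: discount and GAE lambda.
    """
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.tensor(0.0)
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        last_gae = delta + gamma * lam * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # value-function targets
    return advantages, returns
```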
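For items 8 and 9, the reference form of the clipped surrogate objective and the clipped value loss as described in the blog post. A generic sketch; the tensor names and clip ranges are placeholders, not `trlx`'s:

```python
import torch

def ppo_losses(logprobs, old_logprobs, advantages, values, old_values, returns,
               clip_range=0.2, clip_range_value=0.2):
    """Clipped surrogate policy loss (item 8) and clipped value loss (item 9)."""
    # Item 8: clip the probability ratio pi_new / pi_old around 1.
    ratio = torch.exp(logprobs - old_logprobs)
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()

    # Item 9: clip the new value prediction to stay near the old one.
    values_clipped = old_values + torch.clamp(
        values - old_values, -clip_range_value, clip_range_value
    )
    vf_loss1 = (values - returns) ** 2
    vf_loss2 = (values_clipped - returns) ** 2
    vf_loss = 0.5 * torch.max(vf_loss1, vf_loss2).mean()

    return pg_loss, vf_loss
```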
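For item 11, connecting `grad_clip` would amount to clipping the global norm over all parameters right before the optimizer step, e.g. as below (the model, loss, and `max_grad_norm` name are placeholders):

```python
import torch
from torch import nn

model = nn.Linear(8, 2)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
max_grad_norm = 0.5      # the value Andrychowicz et al. recommend

loss = model(torch.randn(4, 8)).pow(2).mean()  # placeholder loss
loss.backward()

# Clip the global gradient norm over *all* parameters before stepping.
nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
optimizer.zero_grad()
```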
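For item 12, the unbiased estimator the blog post points to is the k3 estimator from John Schulman's "Approximating KL Divergence" note. A generic sketch, not `trlx`'s code:

```python
import torch

def approx_kl(logprobs_new: torch.Tensor, logprobs_old: torch.Tensor) -> torch.Tensor:
    """Estimate KL(pi_old || pi_new) from actions sampled under pi_old.

    Uses k3 = (r - 1) - log r with r = pi_new / pi_old, which is unbiased,
    always non-negative, and lower-variance than the naive -log r estimator.
    """
    log_ratio = logprobs_new - logprobs_old
    ratio = log_ratio.exp()
    return ((ratio - 1.0) - log_ratio).mean()
```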
Other items in the blog post are environment- or network-specific and concern problems `trlx` does not tackle. Andrychowicz et al. also contains other hyperparameter choices, not mentioned here, which may be of interest.