The 37 Implementation Details of PPO, a blog post published at ICLR, describes a number of PPO implementation details that improve both efficiency and model performance. See also: Andrychowicz et al., Engstrom et al.
Some of these optimizations are minor and probably irrelevant, many are already implemented here, and some may provide performance boosts to `trlx`. This issue documents these details as a checklist, to track this repository's progress towards the full list.
- 1. Vectorized architecture — `trlx` already does this.
- 2. Weights and biases initialization. Any layers initialized from scratch should use orthogonal initialization with scaling `sqrt(2)` and a bias of 0, with the policy network's last layer scaled by `0.01` after init (sketch below).
- 3. Adam optimizer epsilon. Andrychowicz et al. recommend `1e-7` as the Adam epsilon (and actually find that the PyTorch default of `1e-8` is the worst of the choices tested).
- 4. Optimizer weight decay and learning-rate annealing. Currently the code does not appear to use the config value of `weight_decay: 1e-6` at all. It also uses cosine annealing instead of linear decay, and decays not to 0 (as recommended by Andrychowicz et al.) but to `1.412e-4` by default. Maybe test linear decay to see if it makes a difference (sketch below).
- 5. Generalized Advantage Estimation. Correctly implemented in `trlx` (reference sketch below).
- 6. Mini-batch updates. In `trlx` this is being done in `make_experience`.
- 7. Normalization of advantages (at the mini-batch level). I believe this is being done, since I think `whiten` is called at the mini-batch level?
- 8. Clipped surrogate objective. Done in `trlx` (reference sketch below).
- 9. Value function loss clipping. Done in `trlx` (covered by the same sketch as item 8).
- 10. Overall loss and entropy bonus. Entropy is not used for regularization in `trlx`. OpenAI set it to 0 for MuJoCo anyway, and Andrychowicz et al. find that entropy regularization does not help performance, so this may not be useful to implement.
- 11. Global gradient clipping. The `trlx` `grad_clip` config option does not appear to be connected to anything. Andrychowicz et al. find a small performance boost from ensuring the norm of the gradients of all parameters does not exceed `0.5` (sketch below).
- 12. KL approximation. Check that the unbiased estimator is being used (sketch below).
- 13. Shared vs. separate policy/value networks. Irrelevant in `trlx` due to the hydra-heads implementation.
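For item 2, a minimal sketch of the recommended initialization. The layer shapes and the `init_orthogonal` helper are placeholders for illustration, not `trlx`'s actual code:

```python
import math
from torch import nn

def init_orthogonal(layer: nn.Linear, gain: float) -> nn.Linear:
    """Orthogonal weight initialization with the given gain and a zero bias."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

# Hidden layers use gain sqrt(2); the policy head uses gain 0.01;
# the value head uses gain 1.0 (per the blog post). Sizes are placeholders.
hidden = init_orthogonal(nn.Linear(64, 64), math.sqrt(2))
policy_head = init_orthogonal(nn.Linear(64, 10), 0.01)
value_head = init_orthogonal(nn.Linear(64, 1), 1.0)
```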
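For items 3 and 4, a sketch of how the optimizer and a linear learning-rate decay could be wired up. The model, base LR, and `total_updates` are placeholders, not values from the `trlx` config:

```python
import torch
from torch import nn

model = nn.Linear(8, 2)   # placeholder model
total_updates = 10_000    # placeholder; would come from the training config

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,            # placeholder base learning rate
    eps=1e-7,           # item 3: epsilon recommended by Andrychowicz et al.
    weight_decay=1e-6,  # item 4: actually pass the config's weight_decay through
)

# Item 4: linear decay from the base LR to 0 over training, instead of cosine annealing.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: max(0.0, 1.0 - step / total_updates),
)

# Per optimizer update: optimizer.step(); scheduler.step()
```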
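For item 5, the reference form of the GAE recursion from the blog post, useful for comparison against `trlx`'s implementation. This is a generic sketch; in `trlx` the `dones`/bootstrap handling may look different:

```python
import torch

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a rollout of length T.

    rewards, values, dones: shape (T,) tensors; last_value: bootstrap value
    for the state after the final step; gamma/lam: discount and GAE lambda.
    """
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.tensor(0.0)
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        last_gae = delta + gamma * lam * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # value-function targets
    return advantages, returns
```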
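For items 8 and 9, the reference form of the clipped surrogate objective and the clipped value loss as described in the blog post. A generic sketch; the tensor names and clip ranges are placeholders, not `trlx`'s:

```python
import torch

def ppo_losses(logprobs, old_logprobs, advantages, values, old_values, returns,
               clip_range=0.2, clip_range_value=0.2):
    """Clipped surrogate policy loss (item 8) and clipped value loss (item 9)."""
    # Item 8: clip the probability ratio pi_new / pi_old around 1.
    ratio = torch.exp(logprobs - old_logprobs)
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()

    # Item 9: clip the new value prediction to stay near the old one.
    values_clipped = old_values + torch.clamp(
        values - old_values, -clip_range_value, clip_range_value
    )
    vf_loss1 = (values - returns) ** 2
    vf_loss2 = (values_clipped - returns) ** 2
    vf_loss = 0.5 * torch.max(vf_loss1, vf_loss2).mean()

    return pg_loss, vf_loss
```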
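For item 11, connecting `grad_clip` would amount to clipping the global norm over all parameters right before the optimizer step, e.g. as below (the model, loss, and `max_grad_norm` name are placeholders):

```python
import torch
from torch import nn

model = nn.Linear(8, 2)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
max_grad_norm = 0.5      # the value Andrychowicz et al. recommend

loss = model(torch.randn(4, 8)).pow(2).mean()  # placeholder loss
loss.backward()

# Clip the global gradient norm over *all* parameters before stepping.
nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
optimizer.zero_grad()
```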
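For item 12, the unbiased estimator the blog post points to is the k3 estimator from John Schulman's "Approximating KL Divergence" note. A generic sketch, not `trlx`'s code:

```python
import torch

def approx_kl(logprobs_new: torch.Tensor, logprobs_old: torch.Tensor) -> torch.Tensor:
    """Estimate KL(pi_old || pi_new) from actions sampled under pi_old.

    Uses k3 = (r - 1) - log r with r = pi_new / pi_old, which is unbiased,
    always non-negative, and lower-variance than the naive -log r estimator.
    """
    log_ratio = logprobs_new - logprobs_old
    ratio = log_ratio.exp()
    return ((ratio - 1.0) - log_ratio).mean()
```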
Other items in the blog post are environment- or network-specific and concern problems `trlx` does not tackle. Andrychowicz et al. also contains other hyperparameter choices, not mentioned here, which may be of interest.