PPO Implementation Details - Checklist #53

@herbiebradley

Description

The 37 Implementation Details of Proximal Policy Optimization, a blog post published in the ICLR Blog Track, documents a number of PPO implementation details that improve both efficiency and model performance. See also: Andrychowicz et al., Engstrom et al.

Some of these optimizations are minor and probably irrelevant, many are already implemented here, and some may provide performance boosts to trlx. This issue documents these details as a checklist to track this repository's progress towards the full list.

  • 1. Vectorized Architecture - trlx already does this.
  • 2. Weight and bias initialization. Any layers initialized from scratch should use orthogonal initialization with scaling sqrt(2) and a bias of 0, with the policy network's last layer scaled by 0.01 after initialization.
  • 3. Adam optimizer initialization. Andrychowicz et al. recommend 1e-7 as the Adam epsilon (and in fact find that the PyTorch default of 1e-8 is the worst of the choices tested).
  • 4. Optimizer weight decay and LR schedule. Currently the code does not appear to use the weight_decay: 1e-6 config value at all. It also uses cosine annealing instead of a linear schedule, and decays the learning rate not to 0 (as recommended by Andrychowicz et al.) but to 1.412e-4 by default. It may be worth testing a linear schedule to see if it makes a difference.
  • 5. Generalized Advantage Estimation. Correctly implemented in trlx.
  • 6. Mini-batch updates. In trlx this is being done in make_experience.
  • 7. Normalization of advantages (at the mini-batch level). I believe this is already done, since whiten appears to be called at the mini-batch level?
  • 8. Clipped surrogate objective. Done in trlx.
  • 9. Value function loss clipping. Done in trlx.
  • 10. Overall loss and entropy bonus. Entropy is not used for regularization in trlx. OpenAI set the entropy coefficient to 0 for MuJoCo anyway, and Andrychowicz et al. find that entropy regularization does not help performance, so this may not be worth implementing.
  • 11. Global gradient clipping. The trlx grad_clip config option does not appear to be connected to anything. Andrychowicz et al. find a small performance boost from ensuring the norm of gradients of all parameters does not exceed 0.5.
  • 12. KL approximation. Check that the unbiased estimator is being used.
  • 13. Shared vs separate policy/value networks. Irrelevant in trlx due to the hydra heads implementation.
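
For reference, detail 5 (GAE) can be sketched in plain Python. The function name and signature below are illustrative, not trlx's actual API:

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2016).

    rewards: list of per-step rewards r_t
    values:  list of value estimates V(s_t), same length as rewards
    last_value: bootstrap value V(s_T) for the state after the final step
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Walk backwards so each step can reuse the running GAE accumulator.
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # Returns used as value-function targets: R_t = A_t + V(s_t)
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With gamma = lam = 1 and zero values this reduces to reward-to-go, which is a quick sanity check for any GAE implementation.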
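
Detail 7 (mini-batch advantage normalization) amounts to the following; this is a standalone sketch of what a whiten function typically does, not trlx's exact implementation:

```python
import math

def whiten(advantages, shift_mean=True, eps=1e-8):
    """Normalize advantages to zero mean and unit variance.

    Applied per mini-batch (detail 7); eps guards against a zero std.
    """
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var)
    if shift_mean:
        return [(a - mean) / (std + eps) for a in advantages]
    return [a / (std + eps) for a in advantages]
```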
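
Detail 8 (the clipped surrogate objective) for a single sample, written out as a minimal sketch:

```python
import math

def clipped_surrogate_loss(logprob, old_logprob, advantage, clip_range=0.2):
    """PPO clipped policy loss for one sample (detail 8).

    Returns the loss to *minimize*, i.e. the negated clipped objective.
    """
    ratio = math.exp(logprob - old_logprob)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    # Clamp the ratio to [1 - clip_range, 1 + clip_range].
    clipped = max(min(ratio, 1 + clip_range), 1 - clip_range) * advantage
    # Pessimistic bound: take the smaller of the two objectives.
    return -min(unclipped, clipped)
```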
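
Detail 9 (value function loss clipping) follows the same pattern; again a per-sample sketch rather than trlx's vectorized code:

```python
def clipped_value_loss(value, old_value, ret, clip_range=0.2):
    """PPO value-function loss with clipping (detail 9), single sample.

    The new value prediction is clipped to stay within clip_range of the
    value recorded at rollout time; the pessimistic (larger) loss is kept.
    """
    clipped = old_value + max(min(value - old_value, clip_range), -clip_range)
    loss_unclipped = (value - ret) ** 2
    loss_clipped = (clipped - ret) ** 2
    return 0.5 * max(loss_unclipped, loss_clipped)
```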
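
Detail 11 (global gradient clipping) is what torch.nn.utils.clip_grad_norm_ does; the pure-Python version below shows the idea on a flat list of gradient components:

```python
import math

def clip_grad_norm(grads, max_norm=0.5):
    """Global gradient clipping (detail 11).

    grads: flat list of gradient components across *all* parameters.
    If the global L2 norm exceeds max_norm, scale every component down
    so the norm equals max_norm; otherwise leave gradients untouched.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```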
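
For detail 12, the two per-sample KL estimators from Schulman's "Approximating KL Divergence" note are, as a sketch (function names are mine):

```python
import math

def kl_k1(logprob, ref_logprob):
    """Naive estimator k1 = log p - log q for samples from p.
    Unbiased but high-variance; individual samples can be negative."""
    return logprob - ref_logprob

def kl_k3(logprob, ref_logprob):
    """Schulman's k3 = (r - 1) - log r, with r = q(x)/p(x), x ~ p.
    Also unbiased, but lower variance and always non-negative."""
    log_ratio = ref_logprob - logprob
    return math.exp(log_ratio) - 1 - log_ratio
```

Checking which of these the trlx KL penalty uses is the action item here.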

Other items in the blog post are environment- or network-specific to problems trlx does not tackle. Andrychowicz et al. also cover other hyperparameter choices not mentioned here which may be of interest.
