6 changes: 3 additions & 3 deletions docs/algorithms/ppo.rst
@@ -87,11 +87,11 @@ Because the advantage is positive, the objective will increase if the action becomes more likely

.. math::

L(s,a,\theta_k,\theta) = \min\left(
L(s,a,\theta_k,\theta) = \max\left(
\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, (1 - \epsilon)
\right) A^{\pi_{\theta_k}}(s,a).

Because the advantage is negative, the objective will increase if the action becomes less likely---that is, if :math:`\pi_{\theta}(a|s)` decreases. But the min in this term puts a limit to how *much* the objective can increase. Once :math:`\pi_{\theta}(a|s) < (1-\epsilon) \pi_{\theta_k}(a|s)`, the min kicks in and this term hits a ceiling of :math:`(1-\epsilon) A^{\pi_{\theta_k}}(s,a)`. Thus, again: *the new policy does not benefit by going far away from the old policy*.
Because the advantage is negative, the objective will increase if the action becomes less likely---that is, if :math:`\pi_{\theta}(a|s)` decreases. But the max in this term puts a limit to how *much* the objective can increase. Once :math:`\pi_{\theta}(a|s) < (1-\epsilon) \pi_{\theta_k}(a|s)`, the max kicks in and this term hits a ceiling of :math:`(1-\epsilon) A^{\pi_{\theta_k}}(s,a)`. Thus, again: *the new policy does not benefit by going far away from the old policy*.

What we have seen so far is that clipping serves as a regularizer by removing incentives for the policy to change dramatically, and the hyperparameter :math:`\epsilon` corresponds to how far away the new policy can go from the old while still profiting the objective.
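The clipping behavior described above is easy to check numerically. The following is a minimal sketch, not code from the Spinning Up repository: `ppo_clip_objective` is a hypothetical helper computing the per-sample surrogate :math:`\min(r\,A,\ \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)\,A)`, where `ratio` stands for :math:`\pi_{\theta}(a|s)/\pi_{\theta_k}(a|s)` and `adv` for :math:`A^{\pi_{\theta_k}}(s,a)`.

```python
def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample PPO-clip surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio: pi_theta(a|s) / pi_theta_k(a|s)
    adv:   advantage estimate A^{pi_theta_k}(s, a)
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped * adv)

# Negative advantage: pushing the ratio below 1 - eps gains nothing,
# because the objective is capped at (1 - eps) * A.
print(ppo_clip_objective(0.5, -1.0))  # -0.8, i.e. (1 - 0.2) * (-1)
print(ppo_clip_objective(0.9, -1.0))  # -0.9, inside the clip range

# Positive advantage: pushing the ratio above 1 + eps gains nothing either.
print(ppo_clip_objective(1.5, 1.0))   # 1.2, i.e. (1 + 0.2) * 1
```

Note that for a negative advantage this is algebraically the same as :math:`\max(r, 1-\epsilon)\,A`: multiplying by a negative number turns the outer min over the two products into a max over the two factors.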

@@ -232,4 +232,4 @@ Other Public Implementations
.. _Baselines: https://github.com/openai/baselines/tree/master/baselines/ppo2
.. _ModularRL: https://github.com/joschu/modular_rl/blob/master/modular_rl/ppo.py
.. _rllab: https://github.com/rll/rllab/blob/master/rllab/algos/ppo.py
.. _`rllib (Ray)`: https://github.com/ray-project/ray/tree/master/python/ray/rllib/agents/ppo
.. _`rllib (Ray)`: https://github.com/ray-project/ray/tree/master/python/ray/rllib/agents/ppo