In PPO.ipynb, the positions of the action loss epoch and the value loss epoch need to be swapped.

In PPO.ipynb, the positions of the action loss epoch and the value loss epoch need to be swapped. I also suggest using RMSprop as the optimizer and reducing the learning rate, which should make these RL models converge more easily.
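To illustrate the optimizer suggestion, here is a minimal plain-Python sketch of the RMSprop update rule. It is not the notebook's code: `rmsprop_step`, the toy quadratic objective, and the `lr`, `alpha`, and `eps` values are all hypothetical placeholders chosen for illustration. If the notebook uses PyTorch (an assumption on my part), the built-in `torch.optim.RMSprop` would be the natural drop-in rather than a hand-written step like this.

```python
import math


def rmsprop_step(params, grads, sq_avg, lr=1e-4, alpha=0.99, eps=1e-8):
    """Apply one RMSprop update in place and return the updated params.

    sq_avg holds a running average of squared gradients, one per parameter.
    All default values here are illustrative, not taken from PPO.ipynb.
    """
    for i, g in enumerate(grads):
        # Exponential moving average of the squared gradient.
        sq_avg[i] = alpha * sq_avg[i] + (1.0 - alpha) * g * g
        # Divide the gradient by its running RMS, so the step size is
        # governed mostly by lr rather than by the raw gradient scale.
        params[i] -= lr * g / (math.sqrt(sq_avg[i]) + eps)
    return params


# Toy objective (hypothetical): f(x, y) = x^2 + 100 * y^2.
# The two gradients differ in scale by a factor of 100.
params = [1.0, 1.0]
sq_avg = [0.0, 0.0]
for _ in range(2000):
    grads = [2.0 * params[0], 200.0 * params[1]]
    rmsprop_step(params, grads, sq_avg, lr=1e-3)
```

Because each gradient is normalized by its own running RMS, both coordinates move at nearly the same pace despite the 100x difference in gradient scale, and lowering `lr` directly bounds the size of each step. That is the property I would expect to help with convergence here, though I have not measured it on this notebook.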