Description
🐛 Describe the bug
I've run ppo_sentiments.py, as well as an older version, and I'm seeing that the ratio is != 1 at step 0 (before any optimizer step), at this line:
https://github.com/CarperAI/trlx/blob/main/trlx/model/nn/ppo_models.py#L165
Reference on why the ratio should be 1 during the first epoch/mini-batch update: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Check if ratio=1: Check if the ratio are always 1s during the first epoch and first mini-batch update, when new and old policies are the same and therefore the ratio are 1s and has nothing to clip. If ratio are not 1s, it means there is a bug and the program has not reconstructed the probability distributions used in rollouts.
By contrast, when making experience, the ratio you'd get here at the start of training (before optimization) is 1: https://github.com/CarperAI/trlx/blob/main/trlx/orchestrator/ppo_orchestrator.py#L130
This looks like a bug with a currently unknown cause, which leads to unexpected ratio values.
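For anyone trying to reproduce this, here is a minimal sketch of the check I have in mind. It is not trlX code: the helper names `ppo_ratios` and `ratios_are_one`, the tolerance, and the log-probability values are all made up for illustration. It only assumes the usual PPO definition of the ratio as exp(new_logprob - old_logprob) per token.

```python
import math

def ppo_ratios(new_logprobs, old_logprobs):
    """Per-token PPO ratios pi_new / pi_old, computed as exp(new - old)."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]

def ratios_are_one(new_logprobs, old_logprobs, tol=1e-4):
    """True if every ratio is within `tol` of 1 (tol is an arbitrary choice)."""
    return all(abs(r - 1.0) <= tol for r in ppo_ratios(new_logprobs, old_logprobs))

# Hypothetical log-probs of three sampled tokens, as stored during rollout.
old = [-1.20, -0.35, -2.10]

# Before any optimizer step, recomputing with the same weights should
# reproduce the rollout log-probs, so every ratio is 1.
same = list(old)

# If the recomputed log-probs differ (as in this report), the ratios drift.
shifted = [-1.25, -0.30, -2.10]

print(ratios_are_one(same, old))
print(ratios_are_one(shifted, old))
```

In the real training loop the same comparison would be applied to the log-probs stored when making experience and the ones recomputed in the loss on the first mini-batch.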
Which trlX version are you using?
0.3.0
Additional system and package information
Python 3.8