
Ratio != 1 at start of PPO training (during loss function calculation) #107

@daia99

🐛 Describe the bug

I've run ppo_sentiments.py (both the current version and an older one), and in both cases the ratio is != 1 at step 0, i.e. before the first optimizer step, at this line:
https://github.com/CarperAI/trlx/blob/main/trlx/model/nn/ppo_models.py#L165

For reference, https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ recommends verifying that the ratio equals 1 on the first epoch and first mini-batch update:

Check if ratio=1: Check if the ratio are always 1s during the first epoch and first mini-batch update, when new and old policies are the same and therefore the ratio are 1s and has nothing to clip. If ratio are not 1s, it means there is a bug and the program has not reconstructed the probability distributions used in rollouts.
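To make that check concrete, here is a minimal sketch (not trlX's actual code; the tensor names `logprobs` and `old_logprobs` are illustrative). In log space the ratio is exp(logprobs - old_logprobs), and before the first optimizer step the new and old policies are identical, so every entry should be exactly 1 up to floating-point noise:

```python
import torch

def ppo_ratio(logprobs: torch.Tensor, old_logprobs: torch.Tensor) -> torch.Tensor:
    # ratio = pi_new(a|s) / pi_old(a|s), computed in log space for stability
    return torch.exp(logprobs - old_logprobs)

# Before the first optimizer step the policies match, so the ratio must be 1.
old_logprobs = torch.randn(4, 16)
logprobs = old_logprobs.clone()  # same policy => same log-probs
ratio = ppo_ratio(logprobs, old_logprobs)
assert torch.allclose(ratio, torch.ones_like(ratio), atol=1e-4), (
    "ratio != 1 at the first mini-batch: the rollout log-probs were not "
    "reproduced by the forward pass in the loss"
)
```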

When making experience, by contrast, the ratio computed at this point in the orchestrator is 1 at the start of training (before any optimization):
https://github.com/CarperAI/trlx/blob/main/trlx/orchestrator/ppo_orchestrator.py#L130

The cause is currently unknown; some bug appears to produce these unexpected ratio values during the loss computation.
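Since the cause is unknown, one way to localize it might be to recompute the log-probs for a stored rollout under the same conditions and compare them to the values saved during experience collection. This is a hedged diagnostic sketch, not part of trlX; `model` is assumed to be a Hugging Face-style causal LM, and `input_ids` / `stored_logprobs` are placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recompute_logprobs(model, input_ids: torch.Tensor) -> torch.Tensor:
    model.eval()  # dropout etc. must match the rollout-time configuration
    logits = model(input_ids).logits[:, :-1, :]  # logits predicting token t+1 from t
    logp = F.log_softmax(logits, dim=-1)
    # log-prob of each actually sampled next token
    return logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

# If this assert fires, the mismatch is introduced somewhere between rollout
# and loss computation (e.g. train/eval mode, padding, or slicing differences):
# new_logprobs = recompute_logprobs(model, input_ids)
# assert torch.allclose(new_logprobs, stored_logprobs, atol=1e-4)
```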

Which trlX version are you using?

0.3.0

Additional system and package information

Python 3.8

Labels: bug (Something isn't working)
