Ratio != 1 at start of PPO training (during loss function calculation)

### 🐛 Describe the bug

I've run `ppo_sentiments.py`, both the current version and an older one, and in both cases the ratio is != 1 at step 0 (before any optimizer step) at this line:
https://github.com/CarperAI/trlx/blob/main/trlx/model/nn/ppo_models.py#L165

Reference on why the ratio should be 1 during the first epoch and first mini-batch update: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/

> Check if ratio=1: Check if the ratio are always 1s during the first epoch and first mini-batch update, when new and old policies are the same and therefore the ratio are 1s and has nothing to clip. If ratio are not 1s, it means there is a bug and the program has not reconstructed the probability distributions used in rollouts.
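
For context, this is the standard PPO probability ratio that the quoted check refers to, written in the usual notation (the symbols below are the conventional ones, not names taken from the trlX code):

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} = \exp\bigl(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_\text{old}}(a_t \mid s_t)\bigr)
$$

Before the first optimizer step $\theta = \theta_\text{old}$, so the two log-probabilities should be identical and $r_t(\theta) = 1$ for every token.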

When making experience, the ratio computed at this point is 1 at the start of training (before optimization), as expected: https://github.com/CarperAI/trlx/blob/main/trlx/orchestrator/ppo_orchestrator.py#L130

The cause is currently unknown. It looks like a bug that produces unexpected ratio values in the loss calculation.
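
To make the expected behaviour concrete, here is a minimal, dependency-free sketch of the sanity check. It is not trlX code: `log_softmax`, `action_logprobs`, `ppo_ratios`, `ratios_are_one`, and the toy logits are all hypothetical names and values chosen for illustration. The perturbed logits simply stand in for any mismatch between the rollout forward pass and the loss forward pass; they are not a claim about the actual cause.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def action_logprobs(logits_per_step, actions):
    """Log-probability of the sampled action at each step."""
    return [log_softmax(step)[a] for step, a in zip(logits_per_step, actions)]


def ppo_ratios(new_logprobs, old_logprobs):
    """PPO ratio per step: exp(new_logprob - old_logprob)."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]


def ratios_are_one(ratios, tol=1e-6):
    """True if every ratio is within `tol` of 1."""
    return all(abs(r - 1.0) <= tol for r in ratios)


# Toy logits (assumed values) for 3 steps over a 3-token vocabulary.
rollout_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [-0.5, 1.5, 0.0]]
actions = [0, 2, 1]

# Log-probs stored during the rollout ("old" policy).
old_logprobs = action_logprobs(rollout_logits, actions)

# Case 1: the loss pass reproduces the rollout distribution exactly.
same_logprobs = action_logprobs(rollout_logits, actions)
ratios_same = ppo_ratios(same_logprobs, old_logprobs)

# Case 2: the loss pass sees slightly different logits (any mismatch).
perturbed_logits = [[x + 0.05 * i for i, x in enumerate(step)]
                    for step in rollout_logits]
perturbed_logprobs = action_logprobs(perturbed_logits, actions)
ratios_perturbed = ppo_ratios(perturbed_logprobs, old_logprobs)

print("identical passes ->", ratios_same, ratios_are_one(ratios_same))
print("mismatched passes ->", ratios_perturbed, ratios_are_one(ratios_perturbed))
```

A check like `ratios_are_one` could be applied to the ratio tensor on the first mini-batch as a debugging aid. The tolerance is an assumption and would likely need loosening for reduced-precision training.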

### Which trlX version are you using?

0.3.0

### Additional system and package information

Python 3.8

Ratio != 1 at start of PPO training (during loss function calculation) #107