Description
🐛 Describe the bug
Currently, the mean_kl used to update kl_ctl is calculated from:
trlx/trlx/trainer/accelerate_ppo_trainer.py, lines 437 to 438 in 92b68e4:
log_ratio = (logprobs - ref_logprobs) * attention_mask[:, :-1]
self.mean_kl = (log_ratio.exp() - 1 - log_ratio).mean().to(device)
which is a mean taken over every token in the batch.
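For concreteness, here is a minimal standalone sketch of that per-token mean using dummy tensors (the values, shapes, and all-ones mask below are invented for illustration and are not taken from trlx):

import torch

# Dummy per-token log-probabilities for two responses of four tokens each.
# Values, shapes, and the mask are made up for illustration.
logprobs = torch.tensor([[-1.0, -1.2, -0.9, -1.1],
                         [-0.8, -1.0, -1.3, -0.7]])
ref_logprobs = torch.tensor([[-1.1, -1.0, -1.0, -1.2],
                             [-0.9, -1.1, -1.2, -0.8]])
attention_mask = torch.ones_like(logprobs)

# Same expression as the quoted trlx lines: one mean over every token.
log_ratio = (logprobs - ref_logprobs) * attention_mask
mean_kl_token = (log_ratio.exp() - 1 - log_ratio).mean()
print(mean_kl_token)  # stays small and does not grow with response length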
In openai/lm-human-preferences, by contrast, the mean_kl is:
kl = data['logprobs'] - data['ref_logprobs']
mean_kl = tf.reduce_mean(tf.reduce_sum(kl, axis=1))
(https://github.com/openai/lm-human-preferences/blob/bd3775f200676e7c9ed438c50727e7452b1a52c1/lm_human_preferences/train_policy.py#L220-L221)
which is not only a mean over responses (the KL being summed over each response's tokens), but also the same form as the KL term used in the reward.
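For comparison, here is a PyTorch sketch of that per-response aggregation on the same kind of dummy tensors (again invented for illustration, not code from either repository):

import torch

# Same dummy tensors as in the sketch above (illustrative only).
logprobs = torch.tensor([[-1.0, -1.2, -0.9, -1.1],
                         [-0.8, -1.0, -1.3, -0.7]])
ref_logprobs = torch.tensor([[-1.1, -1.0, -1.0, -1.2],
                             [-0.9, -1.1, -1.2, -0.8]])
attention_mask = torch.ones_like(logprobs)

# Mirrors tf.reduce_mean(tf.reduce_sum(kl, axis=1)): sum the token-wise
# log-ratio over each response, then average over responses.
kl = (logprobs - ref_logprobs) * attention_mask
mean_kl_response = kl.sum(dim=1).mean()
print(mean_kl_response)  # scales with response length, unlike the per-token mean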
Also, in Anthropic's paper Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, the reported KL value can be as large as 25 (its square root being 5), which is hard to reach with a token-wise mean KL.
I wonder if there is a specific reason for using the current form of mean_kl? Thank you!
Gently ping @Dahoas @reciprocated
Which trlX version are you using?
No response
Additional system and package information
No response