maxreciprocate
Collaborator

This PR corrects the update rule of `AdaptiveKLController` and adds logging of the square root of an estimate of D_KL(pi || pi_ref).
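As a rough illustration of the logged quantity, here is a minimal sketch assuming the KL is estimated as the mean log-probability difference over tokens sampled from pi. The function name `sqrt_kl_estimate` and the choice of estimator are assumptions for illustration; the estimator used in this PR may differ.

```python
import math


def sqrt_kl_estimate(logprobs, ref_logprobs):
    """Square root of a Monte Carlo estimate of D_KL(pi || pi_ref).

    Hypothetical helper: `logprobs` and `ref_logprobs` are per-token
    log-probabilities of the same sampled tokens under pi and pi_ref.
    """
    if len(logprobs) != len(ref_logprobs) or not logprobs:
        raise ValueError("expected two non-empty sequences of equal length")
    # Since the tokens are sampled from pi, the mean of
    # (log pi - log pi_ref) estimates the KL divergence.
    kl = sum(lp - ref for lp, ref in zip(logprobs, ref_logprobs)) / len(logprobs)
    # A finite-sample estimate can be slightly negative; clamp before the root.
    return math.sqrt(max(kl, 0.0))
```

The clamp at zero is a design choice of this sketch: the true KL is non-negative, but this particular sample estimate is not guaranteed to be.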

https://wandb.ai/sorry/trlx-references/reports/fix-kl-controller-v-main--VmlldzozNzM2Mzg0

PPO runs start to diverge from the reference runs precisely at the point where the KL exceeds the controller's target and its scalar multiplier increases, whereas previously the same target value would not be reached.
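For context on why the multiplier rises once the KL exceeds the target, here is a minimal sketch of a proportional adaptive KL controller in the style of Ziegler et al. (2019). The class name `AdaptiveKLControllerSketch`, the clip range of 0.2, and the parameter names are assumptions for illustration; this is not the code changed in this PR.

```python
class AdaptiveKLControllerSketch:
    """Hypothetical proportional controller for the KL penalty coefficient."""

    def __init__(self, init_kl_coef, target, horizon):
        self.value = init_kl_coef  # current scalar multiplier on the KL penalty
        self.target = target       # desired KL value
        self.horizon = horizon     # number of steps over which to adapt

    def update(self, current_kl, n_steps):
        # Relative error against the target, clipped to limit each adjustment
        # (the 0.2 bound is an assumption of this sketch).
        error = min(max(current_kl / self.target - 1.0, -0.2), 0.2)
        # KL above the target gives a factor > 1, so the coefficient grows;
        # KL below the target gives a factor < 1, so it shrinks.
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


controller = AdaptiveKLControllerSketch(init_kl_coef=0.05, target=6.0, horizon=10000)
below = controller.update(current_kl=3.0, n_steps=256)   # KL under target
above = controller.update(current_kl=12.0, n_steps=256)  # KL over target
```

In this sketch the coefficient only starts to increase once the measured KL crosses the target, which matches the behavior described above.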

@maxreciprocate maxreciprocate requested a review from Dahoas March 10, 2023 19:20
Collaborator

@Dahoas Dahoas left a comment

Looks good!

@Dahoas Dahoas merged commit ded2e5e into main Mar 13, 2023
2 participants