More flexible TrainableModel #51
Merged
This PR makes a few changes to the API shape, specifically focused on `Model` and `LocalAPI`.

It introduces an abstraction where we separate out `PolicyModel`, which could be any LLM, and `TrainablePolicyModel`, which is specifically a policy that can be trained by our system. This will let us log trajectories from both `PolicyModel` and `TrainablePolicyModel` in a unified way.

It also adds a new `config` field to `PolicyModel`. This is opaque to our system, but is something we can log to wandb and our file system in the future to track the hparams associated with each model run. I'm using the `config` field in the following way:

And then within my training and rollout functions, I adjust behavior based on the config above. I find this pattern helps me stay sane by tracking which properties each model was called with, and by keeping a record of old training runs around (by keeping the old configs in the codebase).