Closed
Description
Problem
The current caching implementation causes issues when running propensity evaluations that rely on temperature sampling. For example, after running 30 samples of the same input prompt, each with a different response because of temperature sampling, only one of the responses is cached (the last one, I think). When you rerun the evaluation, only that one cached response is loaded instead of the intended diverse set of sampled outputs, which completely breaks the scoring.
This is a very annoying barrier for propensity evaluations since they typically sample many outputs for the same input to measure propensity.
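To illustrate the suspected failure mode, here is a minimal, self-contained sketch. It is not the library's actual cache code: `cache_key`, `fake_model`, and `run_samples` are hypothetical names, and the assumption is that the cache key is derived only from the prompt and generation settings, so repeated samples of the same prompt overwrite each other. Adding a per-sample index to the key is one possible way to keep them distinct.

```python
import hashlib
import random
from typing import Dict, Optional


def cache_key(prompt: str, temperature: float, sample_index: Optional[int] = None) -> str:
    """Hypothetical cache key. Including sample_index keeps repeated samples distinct."""
    parts = [prompt, repr(temperature)]
    if sample_index is not None:
        parts.append(str(sample_index))
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def fake_model(prompt: str, rng: random.Random) -> str:
    """Stand-in for a temperature-sampled model call (not a real API)."""
    return f"response-{rng.randrange(10**9)}"


def run_samples(prompt: str, n: int, temperature: float,
                per_sample_keys: bool, seed: int = 0) -> Dict[str, str]:
    """Sample n responses for one prompt and store each in a dict-based cache."""
    rng = random.Random(seed)
    cache: Dict[str, str] = {}
    for i in range(n):
        key = cache_key(prompt, temperature, i if per_sample_keys else None)
        # With a shared key, each write replaces the previous sample.
        cache[key] = fake_model(prompt, rng)
    return cache


prompt = "Would you do X?"
shared = run_samples(prompt, 30, 1.0, per_sample_keys=False)
indexed = run_samples(prompt, 30, 1.0, per_sample_keys=True)
print(len(shared), len(indexed))  # 1 30
```

With the shared key only the last of the 30 responses survives, which matches the behavior described above. With the sample index in the key, all 30 responses are kept and can be replayed on a rerun.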