Resume inflight batches from unfinished samples #2563
Draft
This PR contains:
What is the current behavior? (You can also link to an open issue here)
Currently in batch mode, if an eval is stopped mid-sample, any inflight batches are not resumed when the eval is retried. This means all of the requests have to be resent when the sample is run again.
What is the new behavior?
Now the current set of inflight batches is saved in the eval stats of the eval log. When the log is retried, each batch that was inflight when the log was closed is checked: requests from completed batches are added to the cache, while batches still in progress are added back to the current inflight set. Because completed requests are cached, a resumed sample has those requests filled automatically without creating a new batch, so whatever sample was in progress effectively resumes from where it left off (assuming the rest of the input is the same).
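The reconciliation step above can be sketched as follows. This is a minimal illustration, not the actual implementation; `BatchInfo`, the provider methods, and the plain-dict cache are all hypothetical names standing in for the real types:

```python
from dataclasses import dataclass


@dataclass
class BatchInfo:
    """Hypothetical record of a batch saved in the eval stats."""
    batch_id: str
    request_keys: list  # cache keys of the requests in this batch


def resume_inflight_batches(saved_batches, provider, cache):
    """On retry, reconcile batches that were inflight when the log closed.

    Completed batches have their results written into the cache (so the
    resumed sample's generate calls are filled without a new request);
    batches still in progress are carried over as the new inflight set.
    """
    still_inflight = []
    for batch in saved_batches:
        status = provider.batch_status(batch.batch_id)  # hypothetical API
        if status == "completed":
            for key, result in provider.batch_results(batch.batch_id):
                cache[key] = result  # resumed samples hit the cache
        else:
            still_inflight.append(batch)
    return still_inflight
```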
This was born out of a desire to run large batches asynchronously for simple one-step evals such as big MCQ datasets. The ideal use case: run the eval, send off all the requests in a large batch, close the process, then resume the next day and pick up the completed samples.
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
Changing the information stored in the logs might break code that reads them, but I have not verified this.
As part of this change I removed the `source` field from the cache key calculation. This produces different keys than before, meaning generate will need to be called again for previously cached requests.
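To illustrate why removing a field from the key invalidates old entries, here is a hedged sketch of a content-hashed cache key; the function name and the exact set of hashed fields are assumptions, the point is only that `source` is no longer part of the hashed payload, so keys no longer match those from earlier versions:

```python
import hashlib
import json


def cache_key(model: str, input: list, config: dict) -> str:
    """Hypothetical cache key: hash the generate inputs.

    A `source` field is deliberately excluded from the payload, so any
    key computed by a version that included it will not match.
    """
    payload = json.dumps(
        {"model": model, "input": input, "config": config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```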
Other information:
TODO:
-- The problem with this is: how do we marry up the requests from a batch with a specific generate call in a sample?
-- We could do something like: if we would call generate with a given input, epoch, etc. and there is an inflight batch with a request for the same input, epoch, etc., then we don't make a new request and instead just wait for that inflight batch to be done.
-- This approach sounds like it will have problems, but I can't describe why yet.
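The matching idea in the TODO above could be sketched like this. Everything here is hypothetical (the key shape, `inflight_index`, `make_request`); it only shows the dedup-by-key structure, not how the real generate call would be wired in:

```python
def match_or_request(key, inflight_index, make_request):
    """Sketch of the TODO's matching idea.

    `key` identifies a generate call (e.g. a tuple of input hash and
    epoch). If a request with the same key is already in an inflight
    batch, reuse it instead of issuing a new one; otherwise issue the
    request and record it so later calls can find it.
    """
    if key in inflight_index:
        return inflight_index[key]  # wait on the existing inflight request
    pending = make_request(key)     # otherwise start a new request
    inflight_index[key] = pending
    return pending
```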