
Conversation

ProFrenchToast (Contributor) commented Oct 6, 2025

This PR contains:

  • New features
  • Changes to dev-tools e.g. CI config / github tooling
  • Docs
  • Bug fixes
  • Code refactor

What is the current behavior? (You can also link to an open issue here)

Currently in batch mode, if an eval is stopped mid-sample, any batches that are in flight are not resumed when the eval is retried. This means all of the requests have to be re-sent when the sample runs again.

What is the new behavior?

Now the current set of in-flight batches is saved in the eval stats of the eval log. When the log is retried, each batch that was in flight when the log was closed is checked: every request from a completed batch is added to the cache, and batches that are still in progress are added to the current set of in-flight batches. Adding completed requests to the cache means that, when a sample resumes, those requests are filled automatically without creating a new batch, so whatever sample was in progress effectively resumes from where it was (assuming the rest of the input is the same).
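
Below is a minimal sketch of that retry-time logic, with hypothetical names throughout (`inflight_batches`, `check_batch`, `batch_results`, `cache_store`); the actual fields and helpers in the PR may differ.

```python
# Sketch only -- all names here are assumptions, not the PR's actual API.

def resume_inflight_batches(eval_log, provider, cache_store):
    """On retry, reconcile batches that were in flight when the log closed."""
    still_inflight = []
    for batch in eval_log.stats.inflight_batches:   # saved in eval stats
        status = provider.check_batch(batch.id)     # poll the provider
        if status.completed:
            # Requests from completed batches go straight into the cache,
            # so the resumed sample is filled without re-sending anything.
            for request, response in provider.batch_results(batch.id):
                cache_store.put(request.cache_key, response)
        else:
            # Batches still in progress carry over as current in-flight work.
            still_inflight.append(batch)
    return still_inflight
```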

This was born out of a desire to run large batches asynchronously for simple one-step evals such as big MCQ datasets. The ideal use case: run the eval, send off all the requests in a large batch, close the process, then resume the next day and pick up the completed samples.
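
For illustration, that workflow might look like the following from the Python API. This is a hedged sketch: `eval` and `eval_retry` are inspect_ai entry points, but the `batch` and `resume_inflight_batches` arguments shown here are assumptions, not confirmed names from this PR.

```python
# Hypothetical end-to-end usage of the workflow above; argument names
# (batch, resume_inflight_batches) are illustrative assumptions.
from inspect_ai import eval, eval_retry

# Day 1: kick off a large single-step MCQ eval in batch mode, then exit.
logs = eval("big_mcq_task.py", model="openai/gpt-4o", batch=True)

# Day 2: retry the log; completed batch requests are pulled into the
# cache, so finished samples resume without re-sending any requests.
eval_retry(logs, resume_inflight_batches=True)
```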

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

Changing the information stored in the logs might break something related to reading the logs, but I have not checked.
As part of this change I removed the source field from the cache key calculation. This produces different keys than before, meaning generate will need to be called again for previously cached requests.
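
An illustrative sketch of what the key change amounts to (not the actual inspect_ai implementation): the message `source` field no longer feeds the hash, so keys computed before and after this PR will not match.

```python
import hashlib
import json

def cache_key(messages: list[dict], config: dict, epoch: int) -> str:
    """Hypothetical cache key: hash of messages/config/epoch, minus source."""
    payload = {
        "messages": [
            # The source field is now dropped before hashing, so messages
            # that differ only in source produce the same key.
            {k: v for k, v in m.items() if k != "source"}
            for m in messages
        ],
        "config": config,
        "epoch": epoch,
    }
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
```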

Other information:

TODO:

  1. Plan how to resume batches that have not yet completed.
    -- The problem here is how to marry up the requests in a batch with a specific generate call in a sample.
    -- One option: if we would call generate with a given input, epoch, etc. and there is an in-flight batch with the same input, epoch, etc., then don't make a new request and instead just wait for that in-flight batch to finish (see the sketch after this list).
    -- This approach sounds like it will have problems, but I can't describe why yet.
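
A rough sketch of the matching idea from the TODO, with hypothetical names throughout: before issuing a new request, look for an in-flight batch request with the same (input, epoch) identity and await it instead.

```python
# Sketch only -- inflight_index, input_hash, await_batch_result, etc.
# are assumptions, not names from this PR.

async def generate_or_await(request, inflight_index, provider):
    """Reuse a matching in-flight batch request instead of re-sending it."""
    match = inflight_index.get((request.input_hash, request.epoch))
    if match is not None:
        # An equivalent request is already in an in-flight batch:
        # wait for that batch rather than creating a new request.
        return await provider.await_batch_result(match.batch_id, match.request_id)
    return await provider.generate(request)
```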

Added the functionality to read the in-flight batches from the log file and add any completed batches to the cache.
This is done because the in-flight batches don't record the source of a message and so would have a different key than the real messages. I don't see a reason we need the source in the key, so I just removed it.
ProFrenchToast (Contributor, Author) commented Oct 6, 2025

Added the epoch to batch requests and added an argument to trigger the resume feature.

Now I just need to plan what we are gonna do with the still-in-progress batches and how to test this feature.
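
For reference, carrying the epoch on each batched request might look something like this (field names are hypothetical); recording the epoch lets a retried eval tell apart otherwise-identical requests issued in different epochs.

```python
from dataclasses import dataclass

@dataclass
class BatchRequest:
    """Hypothetical record of a single request inside a provider batch."""
    request_id: str
    input_hash: str   # identity of the generate() input
    epoch: int        # which epoch of the sample issued this request
    batch_id: str     # provider batch this request was submitted in
```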

jjallaire (Collaborator) commented:

Thank you for this! There are some non-obvious additional things that will need to be done here, and we may want to move to storing the intermediate batches somewhere besides the log file (e.g. a sqlite database in the user's data dir).

@epatey I think we should pick this up once we are through the scanner pipeline work.

@ProFrenchToast We will plan on taking this from here (we could go back and forth on all of the related/required other changes but I think it will be more efficient for us to just do the work).

ProFrenchToast (Contributor, Author) commented:

> @ProFrenchToast We will plan on taking this from here (we could go back and forth on all of the related/required other changes but I think it will be more efficient for us to just do the work).

Yeah, it was getting to the point where some design decisions would be needed that I didn't feel qualified to make. Happy to let you guys take over lol.

