Draft of read_eval_log skipping validation and using multiple process… #2235

Hey Ekin! I have played around with this today - love it, and it would speed up my main workflow a lot!

My only suggestion is that we go bigger and aim to give `read_eval_log` a `skip_sample_validation` field so that users can get an `EvalLog` back. This is important to me because I have lots of analysis code that takes `EvalLog`s, `EvalSample`s, and `List[ChatMessage]` as input, and I would probably cast to these types before downstream analysis in the future.

Here's a suggestion of how we could adapt the approach you've taken to do this:
- Use `model_construct` rather than `model_validate` to populate `EvalSample` with the JSON dict.
- Extend the `Recorder` ABC to feature a `skip_sample_validation` flag. There's a todo here: I haven't made a call about what we do with the `JSONRecorder`, which straightforwardly loads everything from JSON. I suggest we skip validation for all fields in this case, and detail this in the docs.

I've included a benchmarking test that measures execution time on a realistic (for me!) log file of 5 samples, each with a few hundred messages, including lots of big messages and events. The average speedup was x2.5. I expect this to be bigger for logs with more and longer samples, and potentially slightly sub-1x for logs with a handful of short samples.
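To illustrate the first point, here's a minimal sketch of the `model_construct` vs `model_validate` trade-off. `SampleStub` is a hypothetical stand-in for `EvalSample`, not the real class; the pydantic calls themselves are standard v2 API:

```python
from pydantic import BaseModel


class SampleStub(BaseModel):
    # Hypothetical slim stand-in for EvalSample, for illustration only
    id: str
    messages: list[str]


raw = {"id": "sample-1", "messages": ["hello", "world"]}

# model_validate runs the full validation/coercion pass over the dict
validated = SampleStub.model_validate(raw)

# model_construct sets fields directly and skips validation entirely
constructed = SampleStub.model_construct(**raw)

assert validated == constructed  # same data either way when the JSON is well-formed

# The trade-off: model_construct will happily accept bad data without raising
broken = SampleStub.model_construct(id=123, messages="not a list")
print(type(broken.id))  # no coercion or checking happened
```

Since we only ever read these dicts back from logs that inspect wrote itself, trusting them without re-validation seems like a reasonable bet, but it's worth flagging that trade-off in the docs.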
Let me know what you think! I can then add in tests and docs. It could also be that we include some logic so that multiprocessing is only used if there are more than k samples, since for smaller logs I suspect the start-up overhead isn't worth it.
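The threshold idea could look something like this - a rough sketch, where `MIN_SAMPLES_FOR_POOL` and `parse_sample` are hypothetical names and the real per-sample parsing would live in `parse_sample`:

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical cutoff: below this, pool start-up overhead likely
# outweighs any parallelism win (the right k needs benchmarking)
MIN_SAMPLES_FOR_POOL = 8


def parse_sample(raw: dict) -> dict:
    # Placeholder for the real per-sample JSON -> EvalSample work
    return raw


def parse_samples(raw_samples: list[dict]) -> list[dict]:
    # Small logs: parse in-process, no pool spawned at all
    if len(raw_samples) < MIN_SAMPLES_FOR_POOL:
        return [parse_sample(s) for s in raw_samples]
    # Larger logs: fan the samples out across worker processes
    with ProcessPoolExecutor() as pool:
        return list(pool.map(parse_sample, raw_samples))
```

One caveat: `parse_sample` has to be a picklable module-level function for `ProcessPoolExecutor` to dispatch it, which constrains where this helper can live.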