Draft of read_eval_log skipping validation and using multiple process… #2235

Hey Ekin! I have played around with this today - love it, and it would speed up my main workflow a lot!

My only suggestion is that we go bigger and aim to give `read_eval_log` a `skip_sample_validation` field so that users can get an `EvalLog` back. This is important to me because I have lots of analysis code that takes `EvalLog`s, `EvalSample`s, and `List[ChatMessage]` as input, and I would probably cast to these types before downstream analysis in the future.

Here's a suggestion of how we could adapt the approach you've taken to do this:
- Use `model_construct` rather than `model_validate` to populate `EvalSample` with the JSON dict.
- Extend the `Recorder` ABC to feature a `skip_sample_validation` flag. There's a todo here: I haven't made a call about what we do with the `JSONRecorder`, which straightforwardly loads everything from JSON. I suggest we skip validation for all fields in this case, and detail this in the docs.

I've included a benchmarking test that measures execution time on a realistic (for me!) log file of 5 samples, each with a few hundred messages, including lots of big messages and events. The average speedup was x2.5. I expect this to be bigger for logs with more and longer samples, and potentially slightly sub-1x for logs with a handful of short samples.
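To illustrate the first point, here's a minimal sketch of the `model_construct` vs `model_validate` trade-off. `SampleStub` is a hypothetical stand-in for `EvalSample`, not the real class; the pydantic calls themselves are standard v2 API:

```python
from pydantic import BaseModel


class SampleStub(BaseModel):
    # Hypothetical slim stand-in for EvalSample, for illustration only
    id: str
    messages: list[str]


raw = {"id": "sample-1", "messages": ["hello", "world"]}

# model_validate runs the full validation/coercion pass over the dict
validated = SampleStub.model_validate(raw)

# model_construct sets fields directly and skips validation entirely
constructed = SampleStub.model_construct(**raw)

assert validated == constructed  # same data either way when the JSON is well-formed

# The trade-off: model_construct will happily accept bad data without raising
broken = SampleStub.model_construct(id=123, messages="not a list")
print(type(broken.id))  # no coercion or checking happened
```

Since we only ever read these dicts back from logs that inspect wrote itself, trusting them without re-validation seems like a reasonable bet, but it's worth flagging that trade-off in the docs.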
Let me know what you think! I can then add in tests and docs. It could also be that we include some logic so that multiprocessing is only used if there are more than k samples, since for smaller logs I suspect the start-up overhead isn't worth it.
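The threshold idea could look something like this - a rough sketch, where `MIN_SAMPLES_FOR_POOL` and `parse_sample` are hypothetical names and the real per-sample parsing would live in `parse_sample`:

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical cutoff: below this, pool start-up overhead likely
# outweighs any parallelism win (the right k needs benchmarking)
MIN_SAMPLES_FOR_POOL = 8


def parse_sample(raw: dict) -> dict:
    # Placeholder for the real per-sample JSON -> EvalSample work
    return raw


def parse_samples(raw_samples: list[dict]) -> list[dict]:
    # Small logs: parse in-process, no pool spawned at all
    if len(raw_samples) < MIN_SAMPLES_FOR_POOL:
        return [parse_sample(s) for s in raw_samples]
    # Larger logs: fan the samples out across worker processes
    with ProcessPoolExecutor() as pool:
        return list(pool.map(parse_sample, raw_samples))
```

One caveat: `parse_sample` has to be a picklable module-level function for `ProcessPoolExecutor` to dispatch it, which constrains where this helper can live.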