
Conversation

tboerstad

This PR fixes a bug where evaluation fails if any of the text/content responses from the API under test is None.
This happens because the parse_generations functions declare a return type of List[str] but actually return List[Optional[str]] when the API response is None.

This error scenario is more likely with reasoning models.
Some frameworks put the reasoning tokens in a separate field, and without a high enough max_gen_toks the entire token budget is spent on reasoning tokens, leaving the text/content field as None.
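To illustrate the failure mode, here is a minimal sketch (hypothetical names, not the actual lm-evaluation-harness code) of a parser whose annotation promises List[str] while the values it returns can be None:

```python
# Sketch of the bug: the type hint claims List[str], but the API's
# text field may be None, so the real return type is List[Optional[str]].
from typing import List

def parse_generations_buggy(outputs: List[dict]) -> List[str]:
    # out.get("text") can be None, e.g. when a reasoning model spends
    # the whole max_gen_toks budget on reasoning tokens and the
    # text/content field comes back empty.
    return [out.get("text") for out in outputs]

responses = [{"text": "4"}, {"text": None}]
parsed = parse_generations_buggy(responses)
# parsed == ["4", None]; downstream string operations on the None
# entry (e.g. answer extraction via regex) will then raise.
```

Any caller that trusts the annotation and runs string methods on the results will crash on the None entry.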

I've hit this error scenario when testing gpt-oss-20b on vLLM with the gsm8k_cot_llama dataset.

@CLAassistant

CLAassistant commented Sep 22, 2025

CLA assistant check
All committers have signed the CLA.

@baberabb
Contributor

baberabb commented Oct 2, 2025

Thanks for the PR! I've left a comment suggesting a warning in case an empty response is unexpected. Also, could you run pre-commit for the formatting:

pip install pre-commit
pre-commit run --all-files

- Add warning logs when API returns None/empty responses in parse_generations
- Helps users identify when reasoning models consume entire token budget
- Applied pre-commit formatting

Addresses review feedback from @baberabb
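The warning behavior described in the bullets above could look roughly like this (an illustrative sketch with hypothetical names, not the merged code): coerce None to an empty string so the declared List[str] type holds, and log a warning so users can diagnose truncated reasoning-model outputs.

```python
# Sketch of the fix: replace None/empty responses with "" and warn,
# so parse_generations genuinely returns List[str].
import logging
from typing import List

logger = logging.getLogger(__name__)

def parse_generations_fixed(outputs: List[dict]) -> List[str]:
    results: List[str] = []
    for out in outputs:
        text = out.get("text")
        if not text:
            # Likely cause: a reasoning model exhausted its token budget
            # on reasoning tokens; raising max_gen_toks may help.
            logger.warning(
                "Received None/empty response from API; "
                "consider increasing max_gen_toks."
            )
            text = ""
        results.append(text)
    return results
```

With this, parse_generations_fixed([{"text": "4"}, {"text": None}]) yields ["4", ""] and emits one warning instead of failing downstream.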
@tboerstad
Author

Thanks for the feedback. I've added a warning, tested that it's emitted, and ran pre-commit run --all-files.
