[gpt-oss] add input/output usage in responses api when harmony context is leveraged #22667
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request adds token usage statistics (num_prompt_tokens and num_output_tokens) to HarmonyContext and its subclasses, which was a missing feature. The changes correctly address the feature request.
My review focuses on improving the maintainability and correctness of the implementation. I've identified a case of code duplication that should be refactored to avoid future bugs. More critically, I've found a potential bug in the streaming context: the token processing and counting logic might be incorrect if more than one token is received in a single output, which can lead to silent errors and incorrect metrics. Addressing these points will make the implementation more robust.
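To make that concern concrete, here is a minimal sketch (hypothetical names, not the PR's actual code) of a counter that stays correct when a single streamed output carries more than one token:

```python
from dataclasses import dataclass


@dataclass
class StreamingUsage:
    """Toy stand-in for the streaming context's output-token counter."""
    num_output_tokens: int = 0

    def on_output(self, new_token_ids: list[int]) -> None:
        # Add the actual number of tokens in the delta instead of assuming
        # exactly one token per streamed output.
        self.num_output_tokens += len(new_token_ids)


usage = StreamingUsage()
usage.on_output([101])             # typical one-token delta
usage.on_output([102, 103, 104])   # a delta carrying three tokens
assert usage.num_output_tokens == 4
```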
Force-pushed from 2656338 to daf160b.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from daf160b to 83bb51b.
Please note that when the python / browser tool is enabled, there will be multiple rounds of adding new tool-call output and generating new tokens. Can you help handle this scenario correctly? To my understanding, the expected behavior is to sum the prompt length and output length over all rounds, but I'm open to discussion.
And please keep the TODOs for num_cached_tokens and num_reasoning_tokens.
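A rough sketch of that expectation, using a hypothetical accumulator rather than the PR's HarmonyContext, where every round of the built-in tool loop adds to the totals:

```python
class UsageAccumulator:
    """Toy accumulator: sums prompt/output lengths over all tool-call rounds."""

    def __init__(self) -> None:
        self.num_prompt_tokens = 0
        self.num_output_tokens = 0

    def append_output(self, prompt_token_ids: list[int],
                      output_token_ids: list[int]) -> None:
        # Accumulate instead of overwriting, so multi-round tool calls are
        # reflected in the final usage numbers.
        self.num_prompt_tokens += len(prompt_token_ids)
        self.num_output_tokens += len(output_token_ids)


ctx = UsageAccumulator()
ctx.append_output([0] * 500, [0] * 80)   # initial request
ctx.append_output([0] * 620, [0] * 40)   # second round after a tool result
assert (ctx.num_prompt_tokens, ctx.num_output_tokens) == (1120, 120)
```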
Force-pushed from 7812392 to 66bd2f0.
@heheda12345 good points.
Force-pushed from 64a35e0 to f184b23.
I have updated the changes, and they should now be robust to the "inner" multi-turn messages produced by built-in tool calls.
For the streaming case, …
Wonder if we can count num_cached_tokens and num_reasoning_tokens too? For cached tokens, we can get output.num_cached_tokens from the first output to append. Also for the input_tokens, I think we should count based on the first output to append, but not the CoT input from the multi-turn output, iiuc. Reference: the OpenAI documentation at https://platform.openai.com/docs/api-reference/responses/object#responses/object-usage
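For illustration only, a hypothetical version of the first-output handling described above (the actual fields and flow in HarmonyContext may differ):

```python
from dataclasses import dataclass


@dataclass
class FirstRoundUsage:
    num_cached_tokens: int = 0
    num_output_tokens: int = 0
    _seen_first_output: bool = False

    def append_output(self, num_cached_tokens: int,
                      output_token_ids: list[int]) -> None:
        if not self._seen_first_output:
            # Per the suggestion above, only the first round's cached-prefix
            # count would be reported.
            self.num_cached_tokens = num_cached_tokens
            self._seen_first_output = True
        # Output tokens keep accumulating on every round.
        self.num_output_tokens += len(output_token_ids)
```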
@gcalmettes Can you update the PR for the streaming case?
@heheda12345 yes, working on it now. I'll ping you once the changes have been made.
Signed-off-by: Guillaume Calmettes <[email protected]>
Signed-off-by: Guillaume Calmettes <[email protected]>
…ummed in the usage statistics Signed-off-by: Guillaume Calmettes <[email protected]>
Force-pushed from f184b23 to d726607.
Signed-off-by: Guillaume Calmettes <[email protected]>
Force-pushed from de08aee to d9509f6.
@heheda12345 I have pushed a fix for the streaming case. Let me know what you think.
@QierLi that is what I was wondering. They are …
LGTM!
I prefer to count them as input tokens for every round, because these usage statistics are mainly used for pricing, so they should reflect the compute resources consumed.
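For illustration with made-up numbers: if a request runs two rounds because of a built-in tool call, with prompt lengths of 500 and 620 tokens and generated lengths of 80 and 40 tokens, this convention reports input_tokens = 500 + 620 = 1120 and output_tokens = 80 + 40 = 120, so the reported usage tracks the compute actually spent in each round.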
Purpose
Dedicated classes for GPT-OSS were introduced in #22340 in order to handle the specifics of harmony. However, the token usage statistics are currently not reported when serving requests for gpt-oss.
This PR adds input/output usage statistics to the Responses API when the harmony context is leveraged.
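For illustration, a hedged sketch of how the added usage could be inspected against a locally served gpt-oss model (the model name, port, and example field names below are assumptions based on the OpenAI Responses API, not taken from this PR):

```python
from openai import OpenAI

# Placeholder endpoint and model for a local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.responses.create(
    model="openai/gpt-oss-20b",
    input="Say hello in one short sentence.",
)

# With this PR, the usage block should be populated, e.g.
# input_tokens=..., output_tokens=..., total_tokens=...
print(response.usage)
```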
cc: @WoosukKwon
Test Plan
Test Result
(Optional) Documentation Update