
Conversation


@gcalmettes gcalmettes commented Aug 11, 2025

Purpose

Dedicated classes for GPT-OSS were introduced in #22340 to handle the specifics of harmony. However, token usage statistics are currently not output when serving requests for gpt-oss.

This PR aims to add usage statistics.

cc: @WoosukKwon

Test Plan

Test Result

(Optional) Documentation Update

@gcalmettes gcalmettes requested a review from aarnphm as a code owner August 11, 2025 18:02

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend and gpt-oss labels Aug 11, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds token usage statistics (num_prompt_tokens and num_output_tokens) to HarmonyContext and its subclasses, which was a missing feature. The changes correctly address the feature request.

My review focuses on improving the maintainability and correctness of the implementation. I've identified a case of code duplication that should be refactored to avoid future bugs. More critically, I've found a potential bug in the streaming context where the token processing and counting logic might be incorrect if more than one token is received in a single output, which can lead to silent errors and incorrect metrics. Addressing these points will make the implementation more robust.
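
A minimal sketch of the non-streaming counting being discussed (the HarmonyContext name and append_output method come from the PR; the output.prompt_token_ids and output.outputs[*].token_ids fields follow vLLM's RequestOutput objects; the class body is simplified for illustration and is not the exact code in this PR):

```python
class HarmonyContext:
    def __init__(self) -> None:
        self.num_prompt_tokens = 0
        self.num_output_tokens = 0

    def append_output(self, output) -> None:
        # Each RequestOutput carries the full prompt for its round, so the
        # prompt length is added once per round (see the multi-turn
        # discussion further down this thread).
        if output.prompt_token_ids is not None:
            self.num_prompt_tokens += len(output.prompt_token_ids)
        # Count every generated token, not just the first one, so an output
        # carrying more than one token is not silently undercounted.
        for completion in output.outputs:
            self.num_output_tokens += len(completion.token_ids)
```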

@gcalmettes gcalmettes force-pushed the feat/responses-api-tokens-usage branch from 2656338 to daf160b Compare August 12, 2025 06:45

mergify bot commented Aug 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @gcalmettes.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 12, 2025
@gcalmettes gcalmettes force-pushed the feat/responses-api-tokens-usage branch from daf160b to 83bb51b Compare August 12, 2025 06:50
@mergify mergify bot removed the needs-rebase label Aug 12, 2025

@heheda12345 heheda12345 left a comment


Please note that when the python / browser tool is enabled, there will be multiple rounds of adding new tool-call output and generating new tokens. Can you help handle this scenario correctly? To my understanding, the expected behavior should be to sum the prompt length and output length across all rounds, but I'm open to discussion.

And please keep the TODOs for num_cached_tokens and num_reasoning_tokens.

@gcalmettes gcalmettes force-pushed the feat/responses-api-tokens-usage branch from 7812392 to 66bd2f0 Compare August 13, 2025 09:18

gcalmettes commented Aug 13, 2025

@heheda12345 good points.
I believe the summing was already done for the output tokens, but it was indeed not handled at all for the prompt tokens.

@gcalmettes gcalmettes force-pushed the feat/responses-api-tokens-usage branch 2 times, most recently from 64a35e0 to f184b23 Compare August 13, 2025 15:21
@gcalmettes

I have updated the changes, and the counting should now be robust to the "inner" multi-turn messages produced by built-in tool calls.
I have also verified that the full prompt (previous prompt + concatenation of the outputs from previous messages) is indeed re-sent with each RequestOutput when built-in tools are called and reasoned upon, so the sum has to be done.
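
A small worked example of this multi-round accumulation (the token counts are invented for illustration, not taken from the PR):

```python
# Round 1: the user prompt alone; round 2 re-sends the round-1 prompt, the
# round-1 output, and the built-in tool result (30 tokens assumed here).
round_1 = {"prompt": 100, "output": 20}
round_2 = {"prompt": 100 + 20 + 30, "output": 40}

usage = {
    "input_tokens": round_1["prompt"] + round_2["prompt"],    # 250
    "output_tokens": round_1["output"] + round_2["output"],   # 60
}
```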

@heheda12345

For the streaming case, append_output is called for each output token, so I think summing up the numbers on every call to append_output is not correct.
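
One possible way to address this, sketched under the assumption that the end of a round can be detected via finish_reason on the last streamed completion (the merged fix may detect round boundaries differently; this builds on the simplified HarmonyContext sketch above):

```python
class StreamingHarmonyContext(HarmonyContext):
    def __init__(self) -> None:
        super().__init__()
        self._prompt_counted_for_round = False

    def append_output(self, output) -> None:
        # Add the prompt length only once per round, not on every
        # per-token call to append_output.
        if not self._prompt_counted_for_round and output.prompt_token_ids:
            self.num_prompt_tokens += len(output.prompt_token_ids)
            self._prompt_counted_for_round = True
        for completion in output.outputs:
            # In streaming, token_ids holds only the newly generated token(s).
            self.num_output_tokens += len(completion.token_ids)
            if completion.finish_reason is not None:
                # The round is over; a follow-up round triggered by a
                # built-in tool call re-sends its prompt and is counted again.
                self._prompt_counted_for_round = False
```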


QierLi commented Aug 17, 2025

I wonder if we can count num_cached_tokens and num_reasoning_tokens too?

For cached tokens, we can get output.num_cached_tokens from the first output appended.
For reasoning tokens, they can be identified by parser.channel == "analysis".

Also, for the input_tokens, I think we should count based on the first output appended, not the CoT input from the multi-turn outputs, iiuc.

Reference: OpenAI documentation https://platform.openai.com/docs/api-reference/responses/object#responses/object-usage
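
A rough sketch of that suggestion, using the attribute names from the comment above (output.num_cached_tokens, parser.channel); the actual openai-harmony parser API may differ, and the helper name update_extra_usage is made up for illustration:

```python
def update_extra_usage(self, output, parser) -> None:
    # Cached tokens are reported with the first output appended for the
    # request, so record them only once.
    if not self.num_cached_tokens and getattr(output, "num_cached_tokens", None):
        self.num_cached_tokens = output.num_cached_tokens
    # Tokens parsed while the harmony parser is on the "analysis" channel
    # are chain-of-thought (reasoning) tokens.
    if parser.channel == "analysis":
        self.num_reasoning_tokens += sum(
            len(completion.token_ids) for completion in output.outputs)
```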

@heheda12345

@gcalmettes Can you update the PR for streaming case?

@gcalmettes

@heheda12345 yes, I'm working on it now; I'll ping you once the changes have been made.

@gcalmettes gcalmettes force-pushed the feat/responses-api-tokens-usage branch from f184b23 to d726607 Compare August 21, 2025 08:08
@gcalmettes gcalmettes force-pushed the feat/responses-api-tokens-usage branch from de08aee to d9509f6 Compare August 21, 2025 09:55
@gcalmettes

@heheda12345 I have pushed a fix for the streaming case. Let me know what you think.

@gcalmettes

Also for the input_tokens, I think we should count based on the first output to append, but not the CoT input from the multi-turn output iiuc.

@QierLi that is what I was wondering. They are input tokens from the model's viewpoint (in the sense that they are ingested as prompt by the model), but not necessarily from the user's viewpoint (the user did not generate those subsequent prompts used in the CoT). Would you categorize both the input and output tokens from the CoT as output tokens then?


@heheda12345 heheda12345 left a comment


LGTM!

@heheda12345

I prefer to count them as input tokens for every round, because these usage numbers are mainly used for pricing, so they should reflect the compute resource consumption.

@heheda12345 heheda12345 enabled auto-merge (squash) August 21, 2025 17:57
@github-actions github-actions bot added the ready label Aug 21, 2025
@heheda12345 heheda12345 merged commit 0ba1b54 into vllm-project:main Aug 22, 2025
47 checks passed
FFFfff1FFFfff pushed a commit to FFFfff1FFFfff/my_vllm that referenced this pull request Aug 25, 2025
…t is leveraged (vllm-project#22667)

Signed-off-by: Guillaume Calmettes <[email protected]>
Signed-off-by: FFFfff1FFFfff <[email protected]>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
juuice-lee pushed a commit to juuice-lee/vllm-moe.code that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
…t is leveraged (vllm-project#22667)

Signed-off-by: Guillaume Calmettes <[email protected]>
Signed-off-by: Xiao Yu <[email protected]>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
dumb0002 pushed a commit to dumb0002/vllm that referenced this pull request Aug 28, 2025
2015aroras pushed a commit to 2015aroras/vllm that referenced this pull request Aug 29, 2025
mengxingkongzhouhan pushed a commit to mengxingkongzhouhan/vllm that referenced this pull request Aug 30, 2025
Labels: frontend, gpt-oss, ready