[gpt-oss] add input/output usage in responses api when harmony context is leveraged #22667
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request adds token usage statistics (num_prompt_tokens and num_output_tokens) to HarmonyContext and its subclasses, which was a missing feature. The changes correctly address the feature request.
My review focuses on improving the maintainability and correctness of the implementation. I've identified a case of code duplication that should be refactored to avoid future bugs. More critically, I've found a potential bug in the streaming context: the token processing and counting logic might be incorrect if more than one token is received in a single output, which can lead to silent errors and incorrect metrics. Addressing these points will make the implementation more robust.
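To make that concern concrete, here is a minimal sketch (hypothetical names, not the PR's actual code) of a counter that stays correct when a single streamed output carries more than one token:

```python
from dataclasses import dataclass


@dataclass
class StreamingUsage:
    """Toy stand-in for the streaming context's output-token counter."""
    num_output_tokens: int = 0

    def on_output(self, new_token_ids: list[int]) -> None:
        # Add the actual number of tokens in the delta instead of assuming
        # exactly one token per streamed output.
        self.num_output_tokens += len(new_token_ids)


usage = StreamingUsage()
usage.on_output([101])             # typical one-token delta
usage.on_output([102, 103, 104])   # a delta carrying three tokens
assert usage.num_output_tokens == 4
```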
Force-pushed from 2656338 to daf160b.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from daf160b to 83bb51b.
Please note that when the python / browser tool is enabled, there will be multiple rounds of adding new tool-call output and generating new tokens. Can you help handle this scenario correctly? To my understanding, the expected behavior is to sum the prompt length and output length over all rounds, but I'm open to discussion.
And please keep the TODOs for num_cached_tokens and num_reasoning_tokens.
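A rough sketch of that expectation, using a hypothetical accumulator rather than the PR's HarmonyContext, where every round of the built-in tool loop adds to the totals:

```python
class UsageAccumulator:
    """Toy accumulator: sums prompt/output lengths over all tool-call rounds."""

    def __init__(self) -> None:
        self.num_prompt_tokens = 0
        self.num_output_tokens = 0

    def append_output(self, prompt_token_ids: list[int],
                      output_token_ids: list[int]) -> None:
        # Accumulate instead of overwriting, so multi-round tool calls are
        # reflected in the final usage numbers.
        self.num_prompt_tokens += len(prompt_token_ids)
        self.num_output_tokens += len(output_token_ids)


ctx = UsageAccumulator()
ctx.append_output([0] * 500, [0] * 80)   # initial request
ctx.append_output([0] * 620, [0] * 40)   # second round after a tool result
assert (ctx.num_prompt_tokens, ctx.num_output_tokens) == (1120, 120)
```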
Force-pushed from 7812392 to 66bd2f0.
@heheda12345 good points.
Force-pushed from 64a35e0 to f184b23.
I have updated the changes, and they should now be robust to the "inner" multi-turn messages produced by built-in tool calls.
For the streaming case, …
Wonder if we can count num_cached_tokens and num_reasoning_tokens too? For cached tokens, we can get output.num_cached_tokens from the first output to append. Also for the input_tokens, I think we should count based on the first output to append, but not the CoT input from the multi-turn output, iiuc. Reference: the OpenAI documentation at https://platform.openai.com/docs/api-reference/responses/object#responses/object-usage
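For illustration only, a hypothetical version of the first-output handling described above (the actual fields and flow in HarmonyContext may differ):

```python
from dataclasses import dataclass


@dataclass
class FirstRoundUsage:
    num_cached_tokens: int = 0
    num_output_tokens: int = 0
    _seen_first_output: bool = False

    def append_output(self, num_cached_tokens: int,
                      output_token_ids: list[int]) -> None:
        if not self._seen_first_output:
            # Per the suggestion above, only the first round's cached-prefix
            # count would be reported.
            self.num_cached_tokens = num_cached_tokens
            self._seen_first_output = True
        # Output tokens keep accumulating on every round.
        self.num_output_tokens += len(output_token_ids)
```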
@gcalmettes Can you update the PR for the streaming case?
@heheda12345 yes, working on it now. I'll ping you once the changes have been made.
Signed-off-by: Guillaume Calmettes <[email protected]>
Signed-off-by: Guillaume Calmettes <[email protected]>
…ummed in the usage statistics Signed-off-by: Guillaume Calmettes <[email protected]>
Force-pushed from f184b23 to d726607.
Signed-off-by: Guillaume Calmettes <[email protected]>
Force-pushed from de08aee to d9509f6.
@heheda12345 I have pushed a fix for the streaming case. Let me know what you think.
@QierLi that is what I was wondering. They are …
LGTM!
I prefer to count them as input tokens for every round, because these usage statistics are mainly used for pricing, so they should reflect the compute resources consumed.
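For illustration with made-up numbers: if a request runs two rounds because of a built-in tool call, with prompt lengths of 500 and 620 tokens and generated lengths of 80 and 40 tokens, this convention reports input_tokens = 500 + 620 = 1120 and output_tokens = 80 + 40 = 120, so the reported usage tracks the compute actually spent in each round.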
Purpose
Dedicated classes for GPT-OSS were introduced in #22340 in order to handle the specifics of harmony. However, the token usage statistics are currently not reported when serving requests for gpt-oss.
This PR adds input/output usage statistics to the Responses API when the harmony context is leveraged.
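For illustration, a hedged sketch of how the added usage could be inspected against a locally served gpt-oss model (the model name, port, and example field names below are assumptions based on the OpenAI Responses API, not taken from this PR):

```python
from openai import OpenAI

# Placeholder endpoint and model for a local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.responses.create(
    model="openai/gpt-oss-20b",
    input="Say hello in one short sentence.",
)

# With this PR, the usage block should be populated, e.g.
# input_tokens=..., output_tokens=..., total_tokens=...
print(response.usage)
```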
cc: @WoosukKwon
Test Plan
Test Result
(Optional) Documentation Update