[Bugfix] add qwen3 reasoning-parser fix content is None when disable thinking #17369
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run full CI, PR reviewers can add the ready label to the PR. 🚀
Force-pushed the branch from e8e6b42 to a28caaf
Force-pushed the branch from a28caaf to a4063c0
Thanks for adding this. Can you add some tests to verify the fix?
Force-pushed the branch from a4063c0 to 852ca12
Thanks for the feedback! I have already added tests to verify the fix. Please let me know if you need any additional tests or if there's anything else I should improve.
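(For reference, the kind of regression test in question might look roughly like the sketch below. The class name, import path, fixture, and method signature are assumptions for illustration, not the PR's actual test code.)

```python
# Hypothetical sketch of a regression test for the "content is None when
# thinking is disabled" bug; names and signatures are assumed, not taken
# from the PR.
import pytest
from transformers import AutoTokenizer

from vllm.reasoning.qwen3_reasoning_parser import Qwen3ReasoningParser  # assumed path


@pytest.fixture(scope="module")
def qwen3_tokenizer():
    return AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")


@pytest.mark.parametrize(
    "model_output, expected_reasoning, expected_content",
    [
        # Thinking enabled: both tags present, split into the two fields.
        ("<think>step by step</think>final answer", "step by step", "final answer"),
        # Thinking disabled: no tags; content must NOT be None (the bug).
        ("final answer", None, "final answer"),
    ],
)
def test_extract_reasoning_content(qwen3_tokenizer, model_output,
                                   expected_reasoning, expected_content):
    parser = Qwen3ReasoningParser(qwen3_tokenizer)
    # Signature assumed to mirror the deepseek_r1 parser.
    reasoning, content = parser.extract_reasoning_content(model_output, request=None)
    assert reasoning == expected_reasoning
    assert content == expected_content
```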
Thanks a lot! Looking forward to the merge.
Commit: [Bugfix] add qwen3 reasoning-parser fix content is None when disable thinking (vllm-project#17357) Signed-off-by: mofanke <[email protected]>
Force-pushed the branch from 852ca12 to 7d4031b
Thanks, LGTM
I think there might be an issue with this PR implementation. I used the following test cases:

deepseek_r1:

vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1 --guided-decoding-backend xgrammar

qwen3:

vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser qwen3 --guided-decoding-backend xgrammar

client:

from pydantic import BaseModel
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "Bearer skxx"
openai_api_base = "http://localhost:8000/v1"
class Step(BaseModel):
    ground_truth_key_ideas: str
    system_response_key_ideas: str
    discussion: str
    recall: float
    precision: float


if __name__ == '__main__':
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    # client.chat.completions.create
    json_schema = Step.model_json_schema()
    chat_response = client.beta.chat.completions.parse(
        model="",
        messages=[
            {'role': 'system',
             'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n "system_response_key_ideas": "{system_response_key_ideas}",\n "discussion": "{discussion}",\n "recall": "{recall} # note: the value you produce must be a single float value",\n "precision": "{precision} # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
            {'role': 'user',
             'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
        ],
        temperature=0.0,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": json_schema},
    )
    print("Chat response:", chat_response)
    s = Step.parse_raw(chat_response.choices[0].message.reasoning_content)
    print("-----", s.system_response_key_ideas)
result:

deepseek_r1:

Chat response: ParsedChatCompletion[NoneType](id='chatcmpl-c8ac33157c6a46aa91adede0f1f36b06', choices=[ParsedChoice[NoneType](finish_reason='stop', index=0, logprobs=None, message=ParsedChatCompletionMessage[NoneType](content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, parsed=None, reasoning_content='{\n "ground_truth_key_ideas": "1. The action space in language modeling equals the vocabulary size, which is large (tens of thousands of tokens). 2. Real-world locomotion can be condensed to three axes (X, Y, Z) or their combinations. 3. The authors note that typical RL problems have action spaces an order of magnitude smaller than language modeling.",\n "system_response_key_ideas": "1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling.",\n "discussion": "The system response aligns with the ground truth on the vocabulary size as the primary reason for the large action space in language modeling. Both mention the combinatorial complexity due to high vocabulary. However, the system response adds details about discrete vs. continuous action spaces and specific techniques to address the challenges, which are not present in the ground truth. The ground truth includes the point about real-world locomotion being condensed to three axes, which the system response does not explicitly mention.",\n "recall": 0.6,\n "precision": 0.75\n}'), stop_reason=None)], created=1746001853, model='Qwen/Qwen3-8B', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=309, prompt_tokens=766, total_tokens=1075, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
----- 1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling.

qwen3:

Chat response: ParsedChatCompletion[NoneType](id='chatcmpl-7b079ebfa7ef4c9e87779bcb6cfffccd', choices=[ParsedChoice[NoneType](finish_reason='stop', index=0, logprobs=None, message=ParsedChatCompletionMessage[NoneType](content='{\n "ground_truth_key_ideas": "1. The action space in language modeling equals the vocabulary size, which is large (tens of thousands of tokens). 2. Real-world locomotion can be condensed to three axes (X, Y, Z) or their combinations. 3. The authors note that typical RL problems have action spaces an order of magnitude smaller than language modeling.",\n "system_response_key_ideas": "1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling.",\n "discussion": "The system response aligns with the ground truth on the vocabulary size as the primary reason for the large action space in language modeling. Both mention the combinatorial complexity due to high vocabulary. However, the system response adds details about discrete vs. continuous action spaces and specific techniques to address the challenges, which are not present in the ground truth. The ground truth includes the point about real-world locomotion being condensed to three axes, which the system response does not explicitly mention.",\n "recall": 0.6,\n "precision": 0.75\n}', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, parsed=None, reasoning_content=None), stop_reason=None)], created=1746002026, model='Qwen/Qwen3-8B', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=309, prompt_tokens=766, total_tokens=1075, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.12/site-packages/pydantic/main.py", line 1187, in parse_raw
    obj = parse.load_str_bytes(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.12/site-packages/pydantic/deprecated/parse.py", line 49, in load_str_bytes
    return json_loads(b)  # type: ignore
           ^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.12/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/vllm/test14.py", line 35, in <module>
    s = Step.parse_raw(chat_response.choices[0].message.reasoning_content)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.12/site-packages/pydantic/main.py", line 1214, in parse_raw
    raise pydantic_core.ValidationError.from_exception_data(cls.__name__, [error])
pydantic_core._pydantic_core.ValidationError: 1 validation error for Step
__root__
  the JSON object must be str, bytes or bytearray, not NoneType [type=type_error, input_value=None, input_type=NoneType]
The root cause is that the parser incorrectly assumes the current request is not in reasoning mode, even though reasoning mode is enabled: the model's output was constrained into JSON by xgrammar, so it contains no think tags, which leads the qwen3 reasoning parser to conclude that thinking is disabled.

vllm/vllm/reasoning/qwen3_reasoning_parser.py, lines 114 to 117 at a39203f
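To make the failure mode concrete, here is a minimal sketch of the kind of check those lines implement (simplified and illustrative, not the actual vLLM code):

```python
# Simplified sketch of a qwen3-style tag check; not the vLLM source.
from typing import Optional, Tuple

THINK_START, THINK_END = "<think>", "</think>"

def extract_reasoning_qwen3_style(output: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning_content, content)."""
    if THINK_START in output and THINK_END in output:
        start = output.index(THINK_START) + len(THINK_START)
        end = output.index(THINK_END)
        return output[start:end], output[end + len(THINK_END):] or None
    # No tags found -> the parser assumes thinking was disabled and puts
    # everything into `content`. Under xgrammar guided decoding the model
    # is forced straight into JSON and never emits the tags, so this
    # branch is taken even though enable_thinking=True.
    return None, output

print(extract_reasoning_qwen3_style('{"recall": 0.6}'))
# -> (None, '{"recall": 0.6}'), i.e. reasoning_content is None
```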
@DarkLight1337 @mofanke @YorkSu WDYT?
vllm/vllm/model_executor/guided_decoding/xgrammar_decoding.py, lines 345 to 353 at ece5a8b

vllm/vllm/reasoning/deepseek_r1_reasoning_parser.py, lines 46 to 47 at ece5a8b

However, in the OpenAI entrypoints, the ReasoningParser only checks whether the model output contains the think tokens:

vllm/vllm/entrypoints/openai/serving_chat.py, lines 607 to 608 at 1534d38
vllm/vllm/entrypoints/openai/serving_chat.py, lines 623 to 624 at 1534d38
vllm/vllm/entrypoints/openai/serving_chat.py, lines 684 to 685 at 1534d38
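In other words, the two parsers disagree precisely on tagless output, which is exactly what guided decoding produces. A simplified contrast (illustrative only, not the actual vLLM implementations):

```python
# How the two parsers treat output with no think tags at all, e.g. JSON
# produced under xgrammar guided decoding. Simplified sketches.

def deepseek_r1_style(output: str):
    # The R1 chat template emits the opening <think> itself, so the parser
    # only looks for the closing tag; text without </think> is assumed to
    # still be inside the think block.
    if "</think>" not in output:
        return output, None                      # (reasoning_content, content)
    reasoning, _, content = output.partition("</think>")
    return reasoning or None, content or None

def qwen3_style(output: str):
    # Requires both tags; tagless output is treated as if thinking were off.
    if "<think>" not in output or "</think>" not in output:
        return None, output
    reasoning, _, content = output.partition("</think>")
    return reasoning.split("<think>", 1)[-1], content or None

guided = '{"recall": 0.6, "precision": 0.75}'
print(deepseek_r1_style(guided))  # (guided, None)  -> content is None
print(qwen3_style(guided))        # (None, guided)  -> reasoning_content is None
```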
Try to run some examples with guided_json and set …
Thanks for the PR. The commit copied from my fork looks a little outdated; for example, it still uses regex in the … @chaunceyjiang You might be interested.
FIX #17357

Adds a new reasoning parser, qwen3.

Code Attribution: adapted from the gaocegege/vllm project's deepseek_r1_reasoning_parser.py. Tests were added per the review request.
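For reference, the new parser is enabled the same way as the existing reasoning parsers; the command below is taken from the reproduction earlier in this thread:

vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser qwen3

With thinking enabled, the reasoning text is returned in the response's reasoning_content field and the final answer in content.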