[Bugfix] add qwen3 reasoning-parser fix content is None when disable … #17369


Merged
1 commit merged into vllm-project:main on Apr 29, 2025

Conversation

mofanke
Contributor

@mofanke mofanke commented Apr 29, 2025

FIX (#17357)

add a new reasoning-parser qwen3

Code Attribution

python3 -m vllm.entrypoints.openai.api_server --model Qwen3-32B -tp 4 --enable-reasoning --reasoning-parser qwen3

Test request:

response = client.chat.completions.create(
    model="Qwen3-32B",
    messages=[
        {"role": "user", "content": "who are u?"},
    ],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}, "top_k": 20},
)
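
With this fix, the request above (thinking disabled) should return the answer in message.content and leave reasoning_content as None; before the fix, content itself came back as None, which is the bug referenced in #17357.

For context, a minimal sketch of the non-streaming extraction logic such a parser needs (a sketch only, assuming <think>/</think> markers; the authoritative implementation is the PR diff):

THINK_START, THINK_END = "<think>", "</think>"

def extract_reasoning_content(model_output: str):
    """Split raw model output into (reasoning_content, content)."""
    if THINK_START not in model_output or THINK_END not in model_output:
        # No complete think block (e.g. thinking disabled): return the
        # whole output as content instead of None.
        return None, model_output
    before, _, rest = model_output.partition(THINK_START)
    reasoning, _, after = rest.partition(THINK_END)
    return reasoning.strip(), (before + after).strip()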


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI will run, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mofanke mofanke force-pushed the fix_qwen3_thinking_parser branch 2 times, most recently from e8e6b42 to a28caaf on April 29, 2025 09:51
@mergify mergify bot added the documentation Improvements or additions to documentation label Apr 29, 2025
@mofanke mofanke force-pushed the fix_qwen3_thinking_parser branch from a28caaf to a4063c0 on April 29, 2025 10:06
@DarkLight1337
Member

Thanks for adding this, can you add some tests to verify the fix?

@mofanke mofanke force-pushed the fix_qwen3_thinking_parser branch from a4063c0 to 852ca12 on April 29, 2025 13:58
@mofanke
Contributor Author

mofanke commented Apr 29, 2025

> Thanks for adding this, can you add some tests to verify the fix?

Thanks for the feedback! I have already added tests to verify the fix. Please let me know if you need any additional tests or if there’s anything else I should improve.

@ItzAmirreza

Thanks a lot! Looking forward to the merge.

@mofanke mofanke force-pushed the fix_qwen3_thinking_parser branch from 852ca12 to 7d4031b on April 29, 2025 14:14
Member

@DarkLight1337 DarkLight1337 left a comment


Thanks, LGTM

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) April 29, 2025 14:17
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 29, 2025
@DarkLight1337 DarkLight1337 merged commit a39203f into vllm-project:main Apr 29, 2025
45 of 47 checks passed
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
@chaunceyjiang
Contributor

I think there might be an issue with this PR implementation. I used the following test case:
The command --reasoning-parser deepseek_r1 works correctly, while --reasoning-parser qwen3 fails to work as expected.
Clearly, the result from deepseek_r1 is the desired one.

server:

vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1 --guided-decoding-backend xgrammar
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser qwen3 --guided-decoding-backend xgrammar

client:

from pydantic import BaseModel
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "Bearer skxx"
openai_api_base = "http://localhost:8000/v1"

class Step(BaseModel):
    ground_truth_key_ideas: str 
    system_response_key_ideas: str
    discussion: str
    recall: float
    precision: float


if __name__ == '__main__':
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    # client.chat.completions.create
    json_schema = Step.model_json_schema()

    chat_response = client.beta.chat.completions.parse(
        model="",
        messages=[
            {'role': 'system',
            'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n  "system_response_key_ideas": "{system_response_key_ideas}",\n  "discussion": "{discussion}",\n  "recall": "{recall}        # note: the value you produce must be a single float value",\n  "precision": "{precision}        # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n        Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n        You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
            {'role': 'user',
            'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
        ],
        temperature=0.0,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": json_schema},
    )
    print("Chat response:", chat_response)
    s = Step.parse_raw(chat_response.choices[0].message.reasoning_content)
    print("-----", s.system_response_key_ideas)

result:

deepseek_r1:

Chat response: ParsedChatCompletion[NoneType](id='chatcmpl-c8ac33157c6a46aa91adede0f1f36b06', choices=[ParsedChoice[NoneType](finish_reason='stop', index=0, logprobs=None, message=ParsedChatCompletionMessage[NoneType](content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, parsed=None, reasoning_content='{\n  "ground_truth_key_ideas": "1. The action space in language modeling equals the vocabulary size, which is large (tens of thousands of tokens). 2. Real-world locomotion can be condensed to three axes (X, Y, Z) or their combinations. 3. The authors note that typical RL problems have action spaces an order of magnitude smaller than language modeling.",\n  "system_response_key_ideas": "1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling.",\n  "discussion": "The system response aligns with the ground truth on the vocabulary size as the primary reason for the large action space in language modeling. Both mention the combinatorial complexity due to high vocabulary. However, the system response adds details about discrete vs. continuous action spaces and specific techniques to address the challenges, which are not present in the ground truth. The ground truth includes the point about real-world locomotion being condensed to three axes, which the system response does not explicitly mention.",\n  "recall": 0.6,\n  "precision": 0.75\n}'), stop_reason=None)], created=1746001853, model='Qwen/Qwen3-8B', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=309, prompt_tokens=766, total_tokens=1075, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
----- 1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling.

qwen3:

Chat response: ParsedChatCompletion[NoneType](id='chatcmpl-7b079ebfa7ef4c9e87779bcb6cfffccd', choices=[ParsedChoice[NoneType](finish_reason='stop', index=0, logprobs=None, message=ParsedChatCompletionMessage[NoneType](content='{\n  "ground_truth_key_ideas": "1. The action space in language modeling equals the vocabulary size, which is large (tens of thousands of tokens). 2. Real-world locomotion can be condensed to three axes (X, Y, Z) or their combinations. 3. The authors note that typical RL problems have action spaces an order of magnitude smaller than language modeling.",\n  "system_response_key_ideas": "1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling.",\n  "discussion": "The system response aligns with the ground truth on the vocabulary size as the primary reason for the large action space in language modeling. Both mention the combinatorial complexity due to high vocabulary. However, the system response adds details about discrete vs. continuous action spaces and specific techniques to address the challenges, which are not present in the ground truth. The ground truth includes the point about real-world locomotion being condensed to three axes, which the system response does not explicitly mention.",\n  "recall": 0.6,\n  "precision": 0.75\n}', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, parsed=None, reasoning_content=None), stop_reason=None)], created=1746002026, model='Qwen/Qwen3-8B', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=309, prompt_tokens=766, total_tokens=1075, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.12/site-packages/pydantic/main.py", line 1187, in parse_raw
    obj = parse.load_str_bytes(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.12/site-packages/pydantic/deprecated/parse.py", line 49, in load_str_bytes
    return json_loads(b)  # type: ignore
           ^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.12/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/vllm/test14.py", line 35, in <module>
    s = Step.parse_raw(chat_response.choices[0].message.reasoning_content)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.12/site-packages/pydantic/main.py", line 1214, in parse_raw
    raise pydantic_core.ValidationError.from_exception_data(cls.__name__, [error])
pydantic_core._pydantic_core.ValidationError: 1 validation error for Step
__root__
  the JSON object must be str, bytes or bytearray, not NoneType [type=type_error, input_value=None, input_type=NoneType]

@chaunceyjiang
Contributor

The root cause is that it incorrectly assumes the current mode is not reasoning mode, but I have indeed enabled reasoning mode. However, the model's output was formatted into JSON by xgrammar, leading the qwen3-reasoning-parser to mistakenly believe that the current mode is not reasoning mode.

# Check if the model output contains the <think> tokens.
if (self.think_start_token not in model_output
        or self.think_end_token not in model_output):
    return None, model_output
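
To make the failure concrete, here is a hypothetical repro of that check on a guided-decoding output (the JSON string is invented for the example; under xgrammar the sampled text must match the schema, so neither think token appears):

# Constrained output: the grammar forces pure JSON from the first token.
model_output = '{"recall": 0.6, "precision": 0.75}'
think_start_token, think_end_token = "<think>", "</think>"

if (think_start_token not in model_output
        or think_end_token not in model_output):
    # The parser concludes "not reasoning mode" and drops reasoning_content
    # even though --enable-reasoning was set.
    reasoning_content, content = None, model_output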

@DarkLight1337 @mofanke @YorkSu WDYT?

@YorkSu

YorkSu commented Apr 30, 2025

> The root cause is that it incorrectly assumes the current mode is not reasoning mode, but I have indeed enabled reasoning mode. However, the model's output was formatted into JSON by xgrammar, leading the qwen3-reasoning-parser to mistakenly believe that the current mode is not reasoning mode.

def __call__(self, input_ids: list[int],
             scores: torch.Tensor) -> torch.Tensor:
    # Skip the structured logits processing if reasoning is not finished.
    # reasoner is not None only when `--enable-reasoning` is set.
    if self.reasoner is not None and \
            not self.reasoner.is_reasoning_end(input_ids):
        return scores

def __call__(self, input_ids: List[int],
             scores: torch.Tensor) -> torch.Tensor:
    """Use the FSM to bias the logits before sampling the next token."""
    # Skip the structured logits processing if reasoning is not finished.
    # reasoner is not None only when `--enable-reasoning` is set.
    if self._reasoner is not None:
        if not self._reasoner.is_reasoning_end(input_ids):
            return scores

def is_reasoning_end(self, input_ids: list[int]) -> bool:

is_reasoning_end is used by the guided decoding backend to check the reasoning stage. This Qwen3ReasoningParser doesn't implement that method.

def is_reasoning_end(self, input_ids: list[int]) -> bool:
    return self.end_token_id in input_ids

However, in the OpenAI entrypoints, ReasoningParser currently only checks whether the model output contains </think>. But if </think> is already present in the prompt, the output tokens cannot contain it, so is_reasoning_end returns False. If we pass "chat_template_kwargs": {"enable_thinking": false}, the chat template adds <think>\n\n</think>\n\n at the start of the completion, as the sketch after the call sites below illustrates.

#17349 (comment)

and not reasoning_parser.is_reasoning_end(
        previous_token_ids)):

if reasoning_parser.is_reasoning_end(
        list(output.token_ids)):

if reasoning_parser.is_reasoning_end(
        list(output.token_ids)):
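
A small hypothetical illustration of this failure mode (the token id is invented for the example; the real one comes from the tokenizer):

END_THINK_ID = 12345  # stand-in for the </think> token id

def is_reasoning_end(token_ids: list[int]) -> bool:
    return END_THINK_ID in token_ids

# With enable_thinking=False the chat template injects <think>\n\n</think>
# into the PROMPT, so the generated ids never contain the end token:
prompt_token_ids = [1, END_THINK_ID, 2]  # template-injected empty think block
output_token_ids = [3, 4, 5]             # answer tokens only
assert not is_reasoning_end(output_token_ids)  # guided decoding never unblocks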

@YorkSu

YorkSu commented Apr 30, 2025

@chaunceyjiang

> extra_body={"chat_template_kwargs": {"enable_thinking": False}, "guided_json": json_schema},

Try running an example with guided_json and enable_thinking set to False; both the r1 and qwen3 reasoning parsers fail to work as expected.

@gaocegege
Contributor

Thanks for the PR. The commit copied from my fork looks a little outdated; for example, it still uses regex in extract_reasoning_content. Could we use the latest DeepSeek R1 reasoning parser's logic? https://github.com/vllm-project/vllm/blob/main/vllm/reasoning/deepseek_r1_reasoning_parser.py#L139

@chaunceyjiang You might be interested.
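
For reference, a partition-based sketch in the spirit of the linked DeepSeek-R1 parser (an illustration under the assumption that only the end token needs checking, since <think> may already sit in the prompt; the linked file is authoritative):

def extract_reasoning_content(model_output: str):
    if "</think>" not in model_output:
        # No end token: treat the whole output as reasoning (R1 behavior).
        return model_output, None
    reasoning, _, content = model_output.partition("</think>")
    reasoning = reasoning.removeprefix("<think>")  # Python 3.9+
    return reasoning or None, content or None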

RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025