[Bugfix] add qwen3 reasoning-parser fix content is None when disable … #17369


Merged
1 commit merged into vllm-project:main on Apr 29, 2025

Conversation

mofanke
Contributor

@mofanke mofanke commented Apr 29, 2025

FIX (#17357)

add a new reasoning-parser qwen3

Code Attribution

python3 -m vllm.entrypoints.openai.api_server --model Qwen3-32B -tp 4 --enable-reasoning --reasoning-parser qwen3

Test request:

response = client.chat.completions.create(
    model="Qwen3-32B",
    messages=[
        {"role": "user", "content": "who are u?"},
    ],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}, "top_k": 20},
)
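
With this fix, the request above (thinking disabled) should return the answer in message.content and leave reasoning_content as None; before the fix, content itself came back as None, which is the bug referenced in #17357.

For context, a minimal sketch of the non-streaming extraction logic such a parser needs (a sketch only, assuming <think>/</think> markers; the authoritative implementation is the PR diff):

THINK_START, THINK_END = "<think>", "</think>"

def extract_reasoning_content(model_output: str):
    """Split raw model output into (reasoning_content, content)."""
    if THINK_START not in model_output or THINK_END not in model_output:
        # No complete think block (e.g. thinking disabled): return the
        # whole output as content instead of None.
        return None, model_output
    before, _, rest = model_output.partition(THINK_START)
    reasoning, _, after = rest.partition(THINK_END)
    return reasoning.strip(), (before + after).strip()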


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI will run, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mofanke mofanke force-pushed the fix_qwen3_thinking_parser branch 2 times, most recently from e8e6b42 to a28caaf on April 29, 2025 09:51
@mergify mergify bot added the documentation Improvements or additions to documentation label Apr 29, 2025
@mofanke mofanke force-pushed the fix_qwen3_thinking_parser branch from a28caaf to a4063c0 on April 29, 2025 10:06
@DarkLight1337
Member

Thanks for adding this, can you add some tests to verify the fix?

@mofanke mofanke force-pushed the fix_qwen3_thinking_parser branch from a4063c0 to 852ca12 on April 29, 2025 13:58
@mofanke
Contributor Author

mofanke commented Apr 29, 2025

> Thanks for adding this, can you add some tests to verify the fix?

Thanks for the feedback! I have already added tests to verify the fix. Please let me know if you need any additional tests or if there’s anything else I should improve.

@ItzAmirreza

Thanks a lot! Looking forward to the merge.

@mofanke mofanke force-pushed the fix_qwen3_thinking_parser branch from 852ca12 to 7d4031b on April 29, 2025 14:14
Member

@DarkLight1337 DarkLight1337 left a comment


Thanks, LGTM

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) April 29, 2025 14:17
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 29, 2025
@DarkLight1337 DarkLight1337 merged commit a39203f into vllm-project:main Apr 29, 2025
45 of 47 checks passed
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
@chaunceyjiang
Contributor

I think there might be an issue with this PR implementation. I used the following test case:
The command --reasoning-parser deepseek_r1 works correctly, while --reasoning-parser qwen3 fails to work as expected.
Clearly, the result from deepseek_r1 is the desired one.

server:

vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1 --guided-decoding-backend xgrammar
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser qwen3 --guided-decoding-backend xgrammar

client:

from pydantic import BaseModel
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "Bearer skxx"
openai_api_base = "http://localhost:8000/v1"

class Step(BaseModel):
    ground_truth_key_ideas: str 
    system_response_key_ideas: str
    discussion: str
    recall: float
    precision: float


if __name__ == '__main__':
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    # client.chat.completions.create
    json_schema = Step.model_json_schema()

    chat_response = client.beta.chat.completions.parse(
        model="",
        messages=[
            {'role': 'system',
            'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n  "system_response_key_ideas": "{system_response_key_ideas}",\n  "discussion": "{discussion}",\n  "recall": "{recall}        # note: the value you produce must be a single float value",\n  "precision": "{precision}        # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n        Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n        You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
            {'role': 'user',
            'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
        ],
        temperature=0.0,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": json_schema},
    )
    print("Chat response:", chat_response)
    s = Step.parse_raw(chat_response.choices[0].message.reasoning_content)
    print("-----", s.system_response_key_ideas)

result:

deepseek_r1:

Chat response: ParsedChatCompletion[NoneType](id='chatcmpl-c8ac33157c6a46aa91adede0f1f36b06', choices=[ParsedChoice[NoneType](finish_reason='stop', index=0, logprobs=None, message=ParsedChatCompletionMessage[NoneType](content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, parsed=None, reasoning_content='{\n  "ground_truth_key_ideas": "1. The action space in language modeling equals the vocabulary size, which is large (tens of thousands of tokens). 2. Real-world locomotion can be condensed to three axes (X, Y, Z) or their combinations. 3. The authors note that typical RL problems have action spaces an order of magnitude smaller than language modeling.",\n  "system_response_key_ideas": "1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling.",\n  "discussion": "The system response aligns with the ground truth on the vocabulary size as the primary reason for the large action space in language modeling. Both mention the combinatorial complexity due to high vocabulary. However, the system response adds details about discrete vs. continuous action spaces and specific techniques to address the challenges, which are not present in the ground truth. The ground truth includes the point about real-world locomotion being condensed to three axes, which the system response does not explicitly mention.",\n  "recall": 0.6,\n  "precision": 0.75\n}'), stop_reason=None)], created=1746001853, model='Qwen/Qwen3-8B', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=309, prompt_tokens=766, total_tokens=1075, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
----- 1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling.

qwen3:

Chat response: ParsedChatCompletion[NoneType](id='chatcmpl-7b079ebfa7ef4c9e87779bcb6cfffccd', choices=[ParsedChoice[NoneType](finish_reason='stop', index=0, logprobs=None, message=ParsedChatCompletionMessage[NoneType](content='{\n  "ground_truth_key_ideas": "1. The action space in language modeling equals the vocabulary size, which is large (tens of thousands of tokens). 2. Real-world locomotion can be condensed to three axes (X, Y, Z) or their combinations. 3. The authors note that typical RL problems have action spaces an order of magnitude smaller than language modeling.",\n  "system_response_key_ideas": "1. The action space in language modeling is large due to high vocabulary size (e.g., 50k+ tokens). 2. This leads to combinatorial explosion in sequence generation. 3. Language models face discrete, high-cardinality decisions with combinatorial complexity. 4. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 5. Techniques like actor-critic methods and action space reduction address the challenges in language modeling.",\n  "discussion": "The system response aligns with the ground truth on the vocabulary size as the primary reason for the large action space in language modeling. Both mention the combinatorial complexity due to high vocabulary. However, the system response adds details about discrete vs. continuous action spaces and specific techniques to address the challenges, which are not present in the ground truth. The ground truth includes the point about real-world locomotion being condensed to three axes, which the system response does not explicitly mention.",\n  "recall": 0.6,\n  "precision": 0.75\n}', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, parsed=None, reasoning_content=None), stop_reason=None)], created=1746002026, model='Qwen/Qwen3-8B', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=309, prompt_tokens=766, total_tokens=1075, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.12/site-packages/pydantic/main.py", line 1187, in parse_raw
    obj = parse.load_str_bytes(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.12/site-packages/pydantic/deprecated/parse.py", line 49, in load_str_bytes
    return json_loads(b)  # type: ignore
           ^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.12/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/vllm/test14.py", line 35, in <module>
    s = Step.parse_raw(chat_response.choices[0].message.reasoning_content)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/lib/python3.12/site-packages/pydantic/main.py", line 1214, in parse_raw
    raise pydantic_core.ValidationError.from_exception_data(cls.__name__, [error])
pydantic_core._pydantic_core.ValidationError: 1 validation error for Step
__root__
  the JSON object must be str, bytes or bytearray, not NoneType [type=type_error, input_value=None, input_type=NoneType]

@chaunceyjiang
Contributor

The root cause is that it incorrectly assumes the current mode is not reasoning mode, but I have indeed enabled reasoning mode. However, the model's output was formatted into JSON by xgrammar, leading the qwen3-reasoning-parser to mistakenly believe that the current mode is not reasoning mode.

# Check if the model output contains the <think> tokens.
if (self.think_start_token not in model_output
        or self.think_end_token not in model_output):
    return None, model_output
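
To make the failure concrete, here is a hypothetical repro of that check on a guided-decoding output (the JSON string is invented for the example; under xgrammar the sampled text must match the schema, so neither think token appears):

# Constrained output: the grammar forces pure JSON from the first token.
model_output = '{"recall": 0.6, "precision": 0.75}'
think_start_token, think_end_token = "<think>", "</think>"

if (think_start_token not in model_output
        or think_end_token not in model_output):
    # The parser concludes "not reasoning mode" and drops reasoning_content
    # even though --enable-reasoning was set.
    reasoning_content, content = None, model_output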

@DarkLight1337 @mofanke @YorkSu WDYT?

@YorkSu

YorkSu commented Apr 30, 2025

> The root cause is that it incorrectly assumes the current mode is not reasoning mode, but I have indeed enabled reasoning mode. However, the model's output was formatted into JSON by xgrammar, leading the qwen3-reasoning-parser to mistakenly believe that the current mode is not reasoning mode.

def __call__(self, input_ids: list[int],
             scores: torch.Tensor) -> torch.Tensor:
    # Skip the structured logits processing if reasoning is not finished.
    # reasoner is not None only when `--enable-reasoning` is set.
    if self.reasoner is not None and \
            not self.reasoner.is_reasoning_end(input_ids):
        return scores

def __call__(self, input_ids: List[int],
             scores: torch.Tensor) -> torch.Tensor:
    """Use the FSM to bias the logits before sampling the next token."""
    # Skip the structured logits processing if reasoning is not finished.
    # reasoner is not None only when `--enable-reasoning` is set.
    if self._reasoner is not None:
        if not self._reasoner.is_reasoning_end(input_ids):
            return scores

def is_reasoning_end(self, input_ids: list[int]) -> bool:

is_reasoning_end is used by the guided decoding backend to check the reasoning stage. This Qwen3ReasoningParser doesn't implement that method.

def is_reasoning_end(self, input_ids: list[int]) -> bool:
    return self.end_token_id in input_ids

However, in the OpenAI entrypoints, ReasoningParser currently only checks whether the model output contains </think>. But if </think> is already present in the prompt, the output tokens cannot contain it, so is_reasoning_end returns False. If we pass "chat_template_kwargs": {"enable_thinking": false}, the chat template adds <think>\n\n</think>\n\n at the start of the completion, as the sketch after the call sites below illustrates.

#17349 (comment)

and not reasoning_parser.is_reasoning_end(
        previous_token_ids)):

if reasoning_parser.is_reasoning_end(
        list(output.token_ids)):

if reasoning_parser.is_reasoning_end(
        list(output.token_ids)):
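
A small hypothetical illustration of this failure mode (the token id is invented for the example; the real one comes from the tokenizer):

END_THINK_ID = 12345  # stand-in for the </think> token id

def is_reasoning_end(token_ids: list[int]) -> bool:
    return END_THINK_ID in token_ids

# With enable_thinking=False the chat template injects <think>\n\n</think>
# into the PROMPT, so the generated ids never contain the end token:
prompt_token_ids = [1, END_THINK_ID, 2]  # template-injected empty think block
output_token_ids = [3, 4, 5]             # answer tokens only
assert not is_reasoning_end(output_token_ids)  # guided decoding never unblocks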

@YorkSu

YorkSu commented Apr 30, 2025

@chaunceyjiang

> extra_body={"chat_template_kwargs": {"enable_thinking": False}, "guided_json": json_schema},

Try running an example with guided_json and enable_thinking set to False; both the r1 and qwen3 reasoning parsers fail to work as expected.

@gaocegege
Contributor

Thanks for the PR. The commit copied from my fork looks a little outdated; for example, it still uses regex in extract_reasoning_content. Could we use the latest DeepSeek R1 reasoning parser's logic? https://github.com/vllm-project/vllm/blob/main/vllm/reasoning/deepseek_r1_reasoning_parser.py#L139

@chaunceyjiang You might be interested.
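
For reference, a partition-based sketch in the spirit of the linked DeepSeek-R1 parser (an illustration under the assumption that only the end token needs checking, since <think> may already sit in the prompt; the linked file is authoritative):

def extract_reasoning_content(model_output: str):
    if "</think>" not in model_output:
        # No end token: treat the whole output as reasoning (R1 behavior).
        return model_output, None
    reasoning, _, content = model_output.partition("</think>")
    reasoning = reasoning.removeprefix("<think>")  # Python 3.9+
    return reasoning or None, content or None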

RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025