Think-mode output is truncated when running Qwen3-0.6B inference with vLLM #1680

@Frode-vivi

Description

When vLLM is run with --reasoning-parser to parse think mode, the default (think-mode) output is sometimes cut off before the think section is complete. This is confirmed not to be a max_tokens problem. Using instruct mode instead, i.e. adding "chat_template_kwargs": {"enable_thinking": False}, the output is normal and never truncated.

Inspecting result['choices'][0].get('finish_reason', 'unknown') gives stop, i.e. the model terminated on a recognized end-of-sequence marker rather than on the length limit.

The only workaround found so far is the soft switch: appending "/think" to messages[-1] produces complete output that is no longer cut short by a prematurely detected end marker.

It is unclear whether this is a bug in the vLLM reasoning parser or in the end-of-sequence configuration in tokenizer_config.
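
For reference, a minimal sketch of the three request variants described above. The prompt, model path, endpoint URL, max_tokens value, and the parser name passed to --reasoning-parser are assumptions for illustration, not the exact values from my setup:

# Assumed server launch (the parser name "qwen3" is an assumption):
#   vllm serve Qwen/Qwen3-0.6B --reasoning-parser qwen3
import json
import requests

BASE_URL = "http://localhost:8000/v1"  # assumed default vLLM OpenAI-compatible endpoint

def chat(messages, **extra):
    payload = {"model": "Qwen/Qwen3-0.6B", "messages": messages, "max_tokens": 2048, **extra}
    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
    )
    response.raise_for_status()
    return response.json()["choices"][0]

messages = [{"role": "user", "content": "Tell me about your turtle."}]  # hypothetical prompt

# 1) Default think mode: the reasoning output is cut off, yet finish_reason is "stop".
truncated = chat(messages)

# 2) Instruct mode (thinking disabled via chat_template_kwargs): output is complete.
complete_instruct = chat(messages, chat_template_kwargs={"enable_thinking": False})

# 3) Soft switch: appending "/think" to the last user message also gives complete output.
soft_messages = messages[:-1] + [
    {"role": "user", "content": messages[-1]["content"] + "/think"}
]
complete_soft = chat(soft_messages)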

Reproduction

# Imports used by this method (excerpted from a client class; llm_config supplies defaults)
import json
import re

import requests

def generate_response(self, messages, temperature=None, max_tokens=None):
    """Generate a response from the LLM."""
    if temperature is None:
        temperature = llm_config.temperature
    if max_tokens is None:
        max_tokens = llm_config.max_tokens

    print(temperature, max_tokens)

    # Workaround: appending the "/think" soft switch yields complete, untruncated output
    messages = messages[:-1] + [{"role": "user", "content": messages[-1]["content"] + "/think"}]

    print(messages)
    payload = {
        "model": self.model_name,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        # "chat_template_kwargs": {"enable_thinking": False},  # instruct mode, output is normal
    }

    try:
        response = requests.post(
            f"{self.base_url}/chat/completions",
            headers={"Content-Type": "application/json"},
            data=json.dumps(payload),
        )
        response.raise_for_status()
        result = response.json()

        content = result["choices"][0]["message"]["content"]
        print(f"content: {content}\n")
        # With vLLM's --reasoning-parser configured, check whether the think output is parsed correctly
        print(
            f"reasoning_content: {result['choices'][0]['message'].get('reasoning_content', 'unknown')}"
        )
        # finish_reason comes back as "stop", i.e. a stop token was hit, not the max_tokens limit
        print(f"finish_reason: {result['choices'][0].get('finish_reason', 'unknown')}")
        # Strip the <think>...</think> block from the content
        final_response = re.sub(
            r"<think>.*?</think>", "", content, flags=re.DOTALL
        ).strip()
        return final_response
    except requests.exceptions.RequestException as e:
        print(f"LLM request failed: {e}")
        return None
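
A quick client-side check for the truncation (a hypothetical helper, not part of my actual code): when the bug occurs, the raw content opens a <think> tag but never closes it, even though finish_reason is "stop", as in the log below.

def is_think_truncated(content: str) -> bool:
    """Return True when the think block was cut off: <think> opens but never closes."""
    return "<think>" in content and "</think>" not in content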

Logs

Log of a truncated output:
content: <think>
Okay, the user is responding to me saying I have a turtle named Timothy. They mentioned dancing at the club, running a dog obedience school, and eating sweets. Now they are confirming that they like dancing, working with dogs, and eating sweets. I need to make sure my response stays true to all their traits: like dancing at a club, running a dog school, eating sweets, and being a sweet toothy person. Let me think of a natural and friendly way to respond. Maybe mention the turtle and how it's good for them. Keep it conversational and positive. Let me check if I need to add any other details or keep the tone consistent. Yep, that should work.

finish_reason: stop

Environment Information

vLLM version: 0.11.0

Known Issue

  • The issue hasn't been already addressed in Documentation, Issues, and Discussions.
