Fix message content field handling for tool calls and multimodal input #4029
Conversation
for block in msg['content']:
    if isinstance(block, dict) and block.get('type') == 'text':
        content_parts.append(block.get('text', ''))
merged_content = '\n'.join(content_parts)
I'm not sure this aligns with the model's instructed behavior during training.
Because of that training setup, we expect users to assemble the entire text in content[0] using image token placeholders.
Thanks for flagging that! The training-time requirement to put every text chunk (including <IMAGE_TOKEN> markers) into content[0] is still respected for multimodal runs. As soon as a message contains any multimodal blocks, we never hit _merge_message_content; VLAsyncEngine._get_prompt_input intercepts those requests and forwards them down the vision path (lmdeploy/serve/vl_async_engine.py:74-118). Each vision model then pulls message['content'][0]['text'] and applies its own placeholder logic, e.g. DeepSeek2VisionModel.proc_single_message in lmdeploy/vl/model/deepseek_vl2.py:130-167.
The helper you called out only runs in the plain AsyncEngine path, where there are no image parts: either legacy text-only prompts or assistant tool-call responses that came back with content=None. Previously we just took the first text entry (msg['content'][0]['text']); now we join all text fragments so we don't silently drop extra text, while still ignoring the multimodal blocks we can't render. So the training contract for VLMs remains unchanged, and the new code simply makes the text-only fallbacks more robust.
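For reference, here is a minimal sketch of that fallback. The function name and behavior are taken from this thread and the commit notes; the actual code in lmdeploy/serve/async_engine.py may differ in detail.

def _merge_message_content(msg: dict) -> dict:
    """Normalize the 'content' field of a text-only chat message."""
    content = msg.get('content')
    if content is None:
        # Missing or None content (e.g. assistant tool-call replies)
        # becomes an empty string so Jinja2 chat templates do not fail.
        merged_content = ''
    elif isinstance(content, str):
        # Plain string content passes through unchanged.
        merged_content = content
    else:
        # List content: keep only the text blocks and join them,
        # silently ignoring image blocks this path cannot render.
        content_parts = []
        for block in content:
            if isinstance(block, dict) and block.get('type') == 'text':
                content_parts.append(block.get('text', ''))
        merged_content = '\n'.join(content_parts)
    # Preserve every other message field (tool_calls, name, ...).
    return {**msg, 'content': merged_content}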
}
result = _merge_message_content(msg)

assert result['content'] == 'Analyze this image:\nWhat do you see?'
Where should the image token placeholder be inserted in result['content']?
'Analyze this image:{image_placeholder}\nWhat do you see?' or 'Analyze this image:\nWhat do you see?{image_placeholder}'?
Great question! That test is exercising the text-only fallback path for _merge_message_content. When the helper sees a list like [{type: 'text', ...}, {type: 'image_url', ...}, {type: 'text', ...}], it keeps only the textual slices and joins them with \n, dropping the non-text blocks entirely. We do that because the plain AsyncEngine has no idea which image placeholder (and which syntax) a given vision model expects.
As soon as the request actually requires vision handling, VLAsyncEngine._get_prompt_input takes over (lmdeploy/serve/vl_async_engine.py:74-118). From there each VLM implementation (e.g. DeepSeek2VisionModel.proc_single_message in lmdeploy/vl/model/deepseek_vl2.py:130-167) injects the appropriate placeholder into content[0]['text'] before we ever apply any merging. So _merge_message_content never decides whether {image_placeholder} appears after 'Analyze this image:' or at the end of the paragraph; that logic stays with the VLM-specific code.
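To make the split concrete, here are the two message shapes being discussed (illustrative values only: the image URL is made up, and <IMAGE_TOKEN> stands for whatever placeholder a given VLM actually expects):

# Plain AsyncEngine path: the image block is dropped and the text
# fragments are joined with '\n', exactly as in the test above.
text_only_msg = {
    'role': 'user',
    'content': [
        {'type': 'text', 'text': 'Analyze this image:'},
        {'type': 'image_url', 'image_url': {'url': 'https://example.com/cat.png'}},
        {'type': 'text', 'text': 'What do you see?'},
    ],
}
# _merge_message_content(text_only_msg)['content']
# -> 'Analyze this image:\nWhat do you see?'

# Vision path: following the training contract, the caller assembles the
# whole prompt, placeholder included, in content[0]['text']; the code behind
# VLAsyncEngine._get_prompt_input handles it before any merging happens.
vision_msg = {
    'role': 'user',
    'content': [
        {'type': 'text', 'text': 'Analyze this image: <IMAGE_TOKEN>\nWhat do you see?'},
        {'type': 'image_url', 'image_url': {'url': 'https://example.com/cat.png'}},
    ],
}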
Thanks @lvhan028 for the excellent review! GPT-5-Codex is really helpful at walking through code!
This commit fixes two issues with message content processing:
1. Missing content field in tool call responses: when assistant messages contain tool_calls, the content field may be missing or None, causing a KeyError when checking isinstance(msg['content'], list).
2. Incomplete multimodal text merging: the previous implementation only extracted the first text block; all text blocks are now merged with a newline separator.
Changes:
- Add _merge_message_content() module-level function in async_engine.py
- Handle 4 cases: missing content, None content, string content, list content
- Convert None/missing content to empty string '' (prevents Jinja2 errors)
- Use '\n'.join() to merge text blocks (matches vLLM behavior)
- Preserve all message fields (e.g., tool_calls, name, etc.)
- Add comprehensive unit tests (19 test cases in test_content_merge.py)
Implementation based on vLLM's content normalization logic: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/chat_utils.py
🤖 Generated with Claude Code
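As a usage sketch of the tool-call case from the commit message (a hypothetical assistant message; the assertions mirror the described behavior and reuse the _merge_message_content sketch above, not an actual test from test_content_merge.py):

# Assistant tool-call replies frequently arrive with content=None.
# Normalizing that to '' keeps Jinja2 chat templates from raising,
# while tool_calls and every other field pass through untouched.
assistant_msg = {
    'role': 'assistant',
    'content': None,
    'tool_calls': [{
        'id': 'call_0',
        'type': 'function',
        'function': {'name': 'get_weather', 'arguments': '{"city": "Paris"}'},
    }],
}
result = _merge_message_content(assistant_msg)
assert result['content'] == ''
assert result['tool_calls'] == assistant_msg['tool_calls']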