Fix message content field handling for tool calls and multimodal input #4029
Conversation
for block in msg['content']:
    if isinstance(block, dict) and block.get('type') == 'text':
        content_parts.append(block.get('text', ''))
merged_content = '\n'.join(content_parts)
I'm not sure this aligns with the model's instructed behavior during training.
Because of that training setup, we expect users to assemble the entire text in content[0] using image token placeholders.
Thanks for flagging that! The training-time requirement to put every text chunk (including <IMAGE_TOKEN> markers) into content[0] is still respected for multimodal runs. As soon as a message contains any multimodal blocks, we never hit _merge_message_content; VLAsyncEngine._get_prompt_input intercepts those requests and forwards them down the vision path (lmdeploy/serve/vl_async_engine.py:74-118). Each vision model then pulls message['content'][0]['text'] and applies its own placeholder logic, e.g. DeepSeek2VisionModel.proc_single_message in lmdeploy/vl/model/deepseek_vl2.py:130-167.
The helper you called out only runs in the plain AsyncEngine path, where there are no image parts: either legacy text-only prompts or assistant tool-call responses that came back with content=None. Previously we just took the first text entry (msg['content'][0]['text']); now we join all text fragments so we don't silently drop extra text, while still ignoring the multimodal blocks we can't render. So the training contract for VLMs remains unchanged, and the new code simply makes the text-only fallbacks more robust.
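For reference, here is a minimal sketch of that fallback. The function name and behavior are taken from this thread and the commit notes; the actual code in lmdeploy/serve/async_engine.py may differ in detail.

def _merge_message_content(msg: dict) -> dict:
    """Normalize the 'content' field of a text-only chat message."""
    content = msg.get('content')
    if content is None:
        # Missing or None content (e.g. assistant tool-call replies)
        # becomes an empty string so Jinja2 chat templates do not fail.
        merged_content = ''
    elif isinstance(content, str):
        # Plain string content passes through unchanged.
        merged_content = content
    else:
        # List content: keep only the text blocks and join them,
        # silently ignoring image blocks this path cannot render.
        content_parts = []
        for block in content:
            if isinstance(block, dict) and block.get('type') == 'text':
                content_parts.append(block.get('text', ''))
        merged_content = '\n'.join(content_parts)
    # Preserve every other message field (tool_calls, name, ...).
    return {**msg, 'content': merged_content}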
}
result = _merge_message_content(msg)

assert result['content'] == 'Analyze this image:\nWhat do you see?'
Where should the image token placeholder be inserted in result['content']?
'Analyze this image:{image_placeholder}\nWhat do you see?' or 'Analyze this image:\nWhat do you see?{image_placeholder}'?
Great question! That test is exercising the text-only fallback path for _merge_message_content. When the helper sees a list like [{type: 'text', ...}, {type: 'image_url', ...}, {type: 'text', ...}], it keeps only the textual slices and joins them with \n, dropping the non-text blocks entirely. We do that because the plain AsyncEngine has no idea which image placeholder (and which syntax) a given vision model expects.
As soon as the request actually requires vision handling, VLAsyncEngine._get_prompt_input takes over (lmdeploy/serve/vl_async_engine.py:74-118). From there each VLM implementation (e.g. DeepSeek2VisionModel.proc_single_message in lmdeploy/vl/model/deepseek_vl2.py:130-167) injects the appropriate placeholder into content[0]['text'] before we ever apply any merging. So _merge_message_content never decides whether {image_placeholder} appears after 'Analyze this image:' or at the end of the paragraph; that logic stays with the VLM-specific code.
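To make the split concrete, here are the two message shapes being discussed (illustrative values only: the image URL is made up, and <IMAGE_TOKEN> stands for whatever placeholder a given VLM actually expects):

# Plain AsyncEngine path: the image block is dropped and the text
# fragments are joined with '\n', exactly as in the test above.
text_only_msg = {
    'role': 'user',
    'content': [
        {'type': 'text', 'text': 'Analyze this image:'},
        {'type': 'image_url', 'image_url': {'url': 'https://example.com/cat.png'}},
        {'type': 'text', 'text': 'What do you see?'},
    ],
}
# _merge_message_content(text_only_msg)['content']
# -> 'Analyze this image:\nWhat do you see?'

# Vision path: following the training contract, the caller assembles the
# whole prompt, placeholder included, in content[0]['text']; the code behind
# VLAsyncEngine._get_prompt_input handles it before any merging happens.
vision_msg = {
    'role': 'user',
    'content': [
        {'type': 'text', 'text': 'Analyze this image: <IMAGE_TOKEN>\nWhat do you see?'},
        {'type': 'image_url', 'image_url': {'url': 'https://example.com/cat.png'}},
    ],
}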
Thanks @lvhan028 for the excellent review! GPT-5-Codex is really helpful at walking through code!
This commit fixes two issues with message content processing:
1. Missing content field in tool call responses: when assistant messages contain tool_calls, the content field may be missing or None, causing a KeyError when checking isinstance(msg['content'], list).
2. Incomplete multimodal text merging: the previous implementation only extracted the first text block; all text blocks are now merged with a newline separator.
Changes:
- Add _merge_message_content() module-level function in async_engine.py
- Handle 4 cases: missing content, None content, string content, list content
- Convert None/missing content to empty string '' (prevents Jinja2 errors)
- Use '\n'.join() to merge text blocks (matches vLLM behavior)
- Preserve all message fields (e.g., tool_calls, name, etc.)
- Add comprehensive unit tests (19 test cases in test_content_merge.py)
Implementation based on vLLM's content normalization logic: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/chat_utils.py
🤖 Generated with Claude Code
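As a usage sketch of the tool-call case from the commit message (a hypothetical assistant message; the assertions mirror the described behavior and reuse the _merge_message_content sketch above, not an actual test from test_content_merge.py):

# Assistant tool-call replies frequently arrive with content=None.
# Normalizing that to '' keeps Jinja2 chat templates from raising,
# while tool_calls and every other field pass through untouched.
assistant_msg = {
    'role': 'assistant',
    'content': None,
    'tool_calls': [{
        'id': 'call_0',
        'type': 'function',
        'function': {'name': 'get_weather', 'arguments': '{"city": "Paris"}'},
    }],
}
result = _merge_message_content(assistant_msg)
assert result['content'] == ''
assert result['tool_calls'] == assistant_msg['tool_calls']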