
Conversation

QwertyJack
Contributor

This commit fixes two issues with message content processing:

  1. Missing content field in tool call responses: When assistant messages contain tool_calls, the content field may be missing or None, causing KeyError when checking isinstance(msg['content'], list).

  2. Incomplete multimodal text merging: The previous implementation only extracted the first text block. All text blocks are now merged with a newline separator.

Changes:

  • Add _merge_message_content() module-level function in async_engine.py (sketched below)
  • Handle 4 cases: missing content, None content, string content, list content
  • Use '\n'.join() to merge text blocks (matches vLLM behavior)
  • Preserve all message fields (e.g., tool_calls, name, etc.)
  • Add comprehensive unit tests (19 test cases in test_content_merge.py)

Implementation based on vLLM's content normalization logic: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/chat_utils.py

🤖 Generated with Claude Code
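
For readers who want the shape of the helper without opening the diff, here is a minimal sketch of the normalization described above. The function name and the four cases come from this description; the actual code in async_engine.py may differ in detail.

```python
def _merge_message_content(msg: dict) -> dict:
    """Normalize a message's content to a plain string (sketch).

    Cases handled, per the bullet list above:
      1. 'content' key missing       -> ''
      2. content is None             -> ''
      3. content is already a string -> kept as-is
      4. content is a list of blocks -> all text blocks joined with a newline
    Every other message field (tool_calls, name, ...) is preserved.
    """
    content = msg.get('content')
    if content is None:  # covers both a missing key and an explicit None
        merged = ''
    elif isinstance(content, str):
        merged = content
    else:
        parts = [
            block.get('text', '')
            for block in content
            if isinstance(block, dict) and block.get('type') == 'text'
        ]
        merged = '\n'.join(parts)
    new_msg = dict(msg)  # shallow copy so the caller's message is not mutated
    new_msg['content'] = merged
    return new_msg
```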

for block in msg['content']:
    if isinstance(block, dict) and block.get('type') == 'text':
        content_parts.append(block.get('text', ''))
merged_content = '\n'.join(content_parts)
Collaborator

I'm not sure this aligns with the model's instructed behavior during training.
Because of that training setup, we expect users to assemble the entire text in text[0] using image token placeholders.
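
For illustration only, the training-time convention described here would mean a multimodal message shaped roughly like the following. The placeholder name (borrowed from the <IMAGE_TOKEN> mentioned later in this thread) and the image payload are illustrative, not a fixed API.

```python
# Hypothetical message following the "all text in content[0]" convention:
msg = {
    'role': 'user',
    'content': [
        {'type': 'text',
         'text': 'Analyze this image: <IMAGE_TOKEN>\nWhat do you see?'},
        {'type': 'image_url',
         'image_url': {'url': 'https://example.com/cat.png'}},
    ],
}
```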

Contributor Author

Thanks for flagging that! The training-time “put every text chunk (including <IMAGE_TOKEN> markers) into content[0]” requirement is still respected for multimodal runs. As soon as a message contains any multimodal blocks, we never hit _merge_message_content; VLAsyncEngine._get_prompt_input intercepts those requests and forwards them down the vision path (lmdeploy/serve/vl_async_engine.py:74-118). Each vision model then pulls message['content'][0]['text'] and applies its own placeholder logic—e.g. DeepSeek2VisionModel.proc_single_message in lmdeploy/vl/model/deepseek_vl2.py:130-167.

The helper you called out only runs in the plain AsyncEngine path, where there are no image parts—either legacy text-only prompts or assistant tool-call responses that came back with content=None. Previously we just took the first text entry (msg['content'][0]['text']); now we join all text fragments so we don't silently drop extra text while still ignoring the multimodal blocks we can’t render. So the training contract for VLMs remains unchanged, and the new code simply makes the text-only fallbacks more robust.
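
To make the text-only fallbacks concrete, here is how the sketch above behaves on the two cases mentioned. The payloads are illustrative and not taken from the test suite.

```python
# Assistant tool-call response whose content came back as None:
tool_msg = {
    'role': 'assistant',
    'content': None,
    'tool_calls': [{'id': 'call_0', 'type': 'function',
                    'function': {'name': 'get_weather', 'arguments': '{}'}}],
}
normalized = _merge_message_content(tool_msg)
assert normalized['content'] == ''          # None becomes an empty string
assert 'tool_calls' in normalized           # other fields are preserved

# Legacy text-only message split across several text blocks:
text_msg = {'role': 'user',
            'content': [{'type': 'text', 'text': 'Hello'},
                        {'type': 'text', 'text': 'World'}]}
assert _merge_message_content(text_msg)['content'] == 'Hello\nWorld'
```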

}
result = _merge_message_content(msg)

assert result['content'] == 'Analyze this image:\nWhat do you see?'
Collaborator

Where should image_token_placeholder be inserted in result['content']?
'Analyze this image:{image_placeholder}\nWhat do you see?' or 'Analyze this image:\nWhat do you see?{image_placeholder}'?

Contributor Author

Great question! That test is exercising the text-only fallback path for _merge_message_content. When the helper sees a list like [{type: 'text', ...}, {type: 'image_url', ...}, {type: 'text', ...}], it keeps only the textual slices and joins them with \n, dropping the non-text blocks entirely. We do that because the plain AsyncEngine has no idea which image placeholder (and which syntax) a given vision model expects.

As soon as the request actually requires vision handling, VLAsyncEngine._get_prompt_input takes over (lmdeploy/serve/vl_async_engine.py:74-118). From there each VLM implementation (e.g. DeepSeek2VisionModel.proc_single_message in lmdeploy/vl/model/deepseek_vl2.py:130-167) injects the appropriate placeholder into content[0]['text'] before we ever apply any merging. So _merge_message_content never decides whether {image_placeholder} appears after “Analyze this image:” or at the end of the paragraph—that logic stays with the VLM-specific code.
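
For context, the assertion quoted earlier implies an input along these lines (a reconstruction, not the verbatim test case): the image_url block is dropped on this path and the two text blocks are joined.

```python
msg = {
    'role': 'user',
    'content': [
        {'type': 'text', 'text': 'Analyze this image:'},
        {'type': 'image_url',
         'image_url': {'url': 'https://example.com/photo.png'}},
        {'type': 'text', 'text': 'What do you see?'},
    ],
}
result = _merge_message_content(msg)
# No placeholder is inserted here; the plain AsyncEngine path ignores image blocks.
assert result['content'] == 'Analyze this image:\nWhat do you see?'
```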

@QwertyJack
Contributor Author

Thanks @lvhan028 for the excellent review! GPT-5-Codex is really helpful at walking through code!

This commit fixes two issues with message content processing:

1. **Missing content field in tool call responses**: When assistant messages
   contain tool_calls, the content field may be missing or None, causing
   KeyError when checking isinstance(msg['content'], list).

2. **Incomplete multimodal text merging**: The previous implementation only
   extracted the first text block. All text blocks are now correctly merged
   with a newline separator.

Changes:
- Add _merge_message_content() module-level function in async_engine.py
- Handle 4 cases: missing content, None content, string content, list content
- Convert None/missing content to empty string '' (prevents Jinja2 errors)
- Use '\n'.join() to merge text blocks (matches vLLM behavior)
- Preserve all message fields (e.g., tool_calls, name, etc.)
- Add comprehensive unit tests (19 test cases in test_content_merge.py)

Implementation based on vLLM's content normalization logic:
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/chat_utils.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
