
Commit 4e262e5

update multi-modal docs (#538)
1 parent f60e2ce commit 4e262e5

10 files changed (+337 −4 lines)

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -222,8 +222,8 @@ You can refer to the following scripts to customize your own training script.
   - Multi-Modal:
     - [qwen-vl](https://github.com/QwenLM/Qwen-VL) series: qwen-vl, qwen-vl-chat, qwen-vl-chat-int4.
     - [qwen-audio](https://github.com/QwenLM/Qwen-Audio) series: qwen-audio, qwen-audio-chat.
-    - [internlm-xcomposer2](https://github.com/InternLM/InternLM-XComposer) series: internlm-xcomposer2-7b-chat.
     - [deepseek-vl](https://github.com/deepseek-ai/DeepSeek-VL) series: deepseek-vl-1_3b-chat, deepseek-vl-7b-chat.
+    - [internlm-xcomposer2](https://github.com/InternLM/InternLM-XComposer) series: internlm-xcomposer2-7b-chat.
     - [yi-vl](https://github.com/01-ai/Yi) series: yi-vl-6b-chat, yi-vl-34b-chat.
     - [cogvlm](https://github.com/THUDM/CogVLM) series: cogvlm-17b-instruct, cogagent-18b-chat, cogagent-18b-instruct.
   - General:
```

README_CN.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -222,8 +222,8 @@ app_ui_main(infer_args)
   - Multi-Modal:
     - [qwen-vl](https://github.com/QwenLM/Qwen-VL) series: qwen-vl, qwen-vl-chat, qwen-vl-chat-int4.
     - [qwen-audio](https://github.com/QwenLM/Qwen-Audio) series: qwen-audio, qwen-audio-chat.
-    - [internlm-xcomposer2](https://github.com/InternLM/InternLM-XComposer) series: internlm-xcomposer2-7b-chat.
     - [deepseek-vl](https://github.com/deepseek-ai/DeepSeek-VL) series: deepseek-vl-1_3b-chat, deepseek-vl-7b-chat.
+    - [internlm-xcomposer2](https://github.com/InternLM/InternLM-XComposer) series: internlm-xcomposer2-7b-chat.
     - [yi-vl](https://github.com/01-ai/Yi) series: yi-vl-6b-chat, yi-vl-34b-chat.
     - [cogvlm](https://github.com/THUDM/CogVLM) series: cogvlm-17b-instruct, cogagent-18b-chat, cogagent-18b-instruct.
   - General:
```

docs/source/LLM/index.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -11,8 +11,8 @@
 1. [Qwen-VL Best Practice](../Multi-Modal/qwen-vl最佳实践.md)
 2. [Qwen-Audio Best Practice](../Multi-Modal/qwen-audio最佳实践.md)
-3. [InternLM-XComposer2 Best Practice](../Multi-Modal/internlm-xcomposer2最佳实践.md)
-4. [Deepseek-VL Best Practice](../Multi-Modal/deepseek-vl最佳实践.md)
+3. [Deepseek-VL Best Practice](../Multi-Modal/deepseek-vl最佳实践.md)
+4. [InternLM-XComposer2 Best Practice](../Multi-Modal/internlm-xcomposer2最佳实践.md)
 5. [Yi-VL Best Practice](../Multi-Modal/yi-vl最佳实践.md)
 6. [CogVLM Best Practice](../Multi-Modal/cogvlm最佳实践.md)
```

docs/source/Multi-Modal/cogvlm最佳实践.md

Lines changed: 54 additions & 0 deletions
@@ -68,6 +68,60 @@ poem:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">

**Single-sample inference**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.cogvlm_17b_instruct
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

# Images are passed separately from the text query via the `images` argument.
images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
query = 'How far is it from each city?'
response, _ = inference(model, template, query, images=images)
print(f'query: {query}')
print(f'response: {response}')

# Streaming
query = 'Which city is the farthest?'
gen = inference_stream(model, template, query, images=images)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, _ in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
"""
query: How far is it from each city?
response: From Mata, it is 14 km; from Yangjiang, it is 62 km; and from Guangzhou, it is 293 km.
query: Which city is the farthest?
response: The city 'Mata' is the farthest with a distance of 14 km.
"""
```

The sample image is shown below:

road:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">

## Fine-tuning

Fine-tuning of multi-modal large models usually uses a **custom dataset**. Here is a directly runnable demo:
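
The demo itself sits below this hunk and is not part of the diff. A minimal sketch of such a command, following the CLI pattern used elsewhere in these docs (the `coco-mini-en` dataset and the default LoRA settings are assumptions, not taken from this commit):

```shell
# Hypothetical sketch: fine-tune cogvlm-17b-instruct on a small image-caption dataset
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type cogvlm-17b-instruct \
    --dataset coco-mini-en
```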

docs/source/Multi-Modal/deepseek-vl最佳实践.md

Lines changed: 57 additions & 0 deletions
@@ -76,6 +76,63 @@ poem:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">

**Single-sample inference**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.deepseek_vl_7b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
query = '距离各城市多远?'
response, history = inference(model, template, query, images=images)
print(f'query: {query}')
print(f'response: {response}')

# Both rounds use the same image; the image list is doubled so it covers each round in the history.
query = '距离最远的城市是哪?'
images = images * 2
response, history = inference(model, template, query, history, images=images)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')
"""
query: 距离各城市多远?
response: 这个标志显示了从当前位置到以下城市的距离:

- 马塔(Mata):14公里
- 阳江(Yangjiang):62公里
- 广州(Guangzhou):293公里

这些信息是根据图片中的标志提供的。
query: 距离最远的城市是哪?
response: 距离最远的那个城市是广州,根据标志所示,从当前位置到广州的距离是293公里。
history: [('距离各城市多远?', '这个标志显示了从当前位置到以下城市的距离:\n\n- 马塔(Mata):14公里\n- 阳江(Yangjiang):62公里\n- 广州(Guangzhou):293公里\n\n这些信息是根据图片中的标志提供的。'), ('距离最远的城市是哪?', '距离最远的那个城市是广州,根据标志所示,从当前位置到广州的距离是293公里。')]
"""
```

The sample image is shown below:

road:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">

## Fine-tuning

Fine-tuning of multi-modal large models usually uses a **custom dataset**. Here is a directly runnable demo:
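
Again, the demo lives below this hunk; a minimal sketch under the same assumptions (`coco-mini-en` is illustrative, not taken from this commit):

```shell
# Hypothetical sketch: fine-tune deepseek-vl-7b-chat on a small image-caption dataset
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type deepseek-vl-7b-chat \
    --dataset coco-mini-en
```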

docs/source/Multi-Modal/index.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
## Multi-Modal Documentation

### Multi-Modal Best Practice Series

1. [Qwen-VL Best Practice](../Multi-Modal/qwen-vl最佳实践.md)
2. [Qwen-Audio Best Practice](../Multi-Modal/qwen-audio最佳实践.md)
3. [Deepseek-VL Best Practice](../Multi-Modal/deepseek-vl最佳实践.md)
4. [InternLM-XComposer2 Best Practice](../Multi-Modal/internlm-xcomposer2最佳实践.md)
5. [Yi-VL Best Practice](../Multi-Modal/yi-vl最佳实践.md)
6. [CogVLM Best Practice](../Multi-Modal/cogvlm最佳实践.md)

docs/source/Multi-Modal/internlm-xcomposer2最佳实践.md

Lines changed: 50 additions & 0 deletions
@@ -70,6 +70,56 @@ poem:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">

**Single-sample inference**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.internlm_xcomposer2_7b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

# For this template, the image is embedded in the query itself via <img> tags.
query = """<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>距离各城市多远?"""
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')

query = '距离最远的城市是哪?'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')
"""
query: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>距离各城市多远?
response: 马鞍山距离阳江62公里,广州距离广州293公里。
query: 距离最远的城市是哪?
response: 最远的距离是地球的两极,南极和北极。
history: [('<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>距离各城市多远?', ' 马鞍山距离阳江62公里,广州距离广州293公里。'), ('距离最远的城市是哪?', ' 最远的距离是地球的两极,南极和北极。')]
"""
```

The sample image is shown below:

road:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">

## Fine-tuning

Fine-tuning of multi-modal large models usually uses a **custom dataset**. Here is a directly runnable demo:
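
The demo follows below this hunk; a minimal sketch under the same assumptions (`coco-mini-en` is illustrative, not taken from this commit):

```shell
# Hypothetical sketch: fine-tune internlm-xcomposer2-7b-chat on a small image-caption dataset
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type internlm-xcomposer2-7b-chat \
    --dataset coco-mini-en
```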

docs/source/Multi-Modal/qwen-audio最佳实践.md

Lines changed: 50 additions & 0 deletions
@@ -43,6 +43,56 @@ CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-audio-chat
"""
```

**Single-sample inference**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.qwen_audio_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

# The audio clip is embedded in the query itself via <audio> tags.
query = """Audio 1:<audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>
这段语音说了什么"""
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')

# Streaming
query = '这段语音是男生还是女生'
gen = inference_stream(model, template, query, history)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: Audio 1:<audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>
这段语音说了什么
response: 这段语音说了中文:"今天天气真好呀"。
query: 这段语音是男生还是女生
response: 根据音色判断,这段语音是男性。
history: [('Audio 1:<audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>\n这段语音说了什么', '这段语音说了中文:"今天天气真好呀"。'), ('这段语音是男生还是女生', '根据音色判断,这段语音是男性。')]
"""
```

## Fine-tuning

Fine-tuning of multi-modal large models usually uses a **custom dataset**. Here is a directly runnable demo:
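
The demo is below this hunk. A rough sketch under stated assumptions (`aishell1-mini-zh` as a small speech dataset is a guess, not taken from this commit):

```shell
# Hypothetical sketch: fine-tune qwen-audio-chat on a small Chinese speech dataset
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen-audio-chat \
    --dataset aishell1-mini-zh
```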

docs/source/Multi-Modal/qwen-vl最佳实践.md

Lines changed: 56 additions & 0 deletions
@@ -69,6 +69,62 @@ poem:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">

**Single-sample inference**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.qwen_vl_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

# The image is referenced inside the query itself via <img> tags.
query = """Picture 1:<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>
距离各城市多远?"""
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')

# Streaming
query = '距离最远的城市是哪?'
gen = inference_stream(model, template, query, history)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: Picture 1:<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>
距离各城市多远?
response: 马路边距离马路边14公里;阳江边距离马路边62公里;广州边距离马路边293公里。
query: 距离最远的城市是哪?
response: 距离最远的城市是广州,距离马路边293公里。
history: [('Picture 1:<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>\n距离各城市多远?', '马路边距离马路边14公里;阳江边距离马路边62公里;广州边距离马路边293公里。'), ('距离最远的城市是哪?', '距离最远的城市是广州,距离马路边293公里。')]
"""
```

The sample image is shown below:

road:

<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">

## Fine-tuning

Fine-tuning of multi-modal large models usually uses a **custom dataset**. Here is a directly runnable demo:
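
The demo follows below this hunk; a minimal sketch under the same assumptions (`coco-mini-en` is illustrative, not taken from this commit):

```shell
# Hypothetical sketch: fine-tune qwen-vl-chat on a small image-caption dataset
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen-vl-chat \
    --dataset coco-mini-en
```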
