
Commit e95db47

Internvl series models update (#1426)
1 parent 0d88ce1 commit e95db47

File tree: 9 files changed (+381, -49 lines)

README.md

Lines changed: 3 additions & 2 deletions

@@ -55,12 +55,13 @@ You can contact us and communicate with us by adding our group:
 <img src="asset/discord_qr.jpg" width="200" height="200"> | <img src="asset/wechat.png" width="200" height="200">

 ## 🎉 News
+- 2024.07.17: Support newly released InternVL2 models: `model_type` are internvl2-1b, internvl2-40b, internvl2-llama3-76b. For best practices, refer to [here](docs/source_en/Multi-Modal/internvl-best-practice.md).
 - 2024.07.17: Support the training and inference of [NuminaMath-7B-TIR](https://huggingface.co/AI-MO/NuminaMath-7B-TIR). Use with model_type `numina-math-7b`.
 - 🔥2024.07.16: Support exporting for ollama and bitsandbytes. Use `swift export --model_type xxx --to_ollama true` or `swift export --model_type xxx --quant_method bnb --quant_bits 4`
 - 2024.07.08: Support cogvlm2-video-13b-chat. You can check the best practice [here](docs/source_en/Multi-Modal/cogvlm2-video-best-practice.md).
 - 2024.07.08: Support internlm-xcomposer2_5-7b-chat. You can check the best practice [here](docs/source_en/Multi-Modal/internlm-xcomposer2-best-practice.md).
 - 🔥2024.07.06: Support for the llava-next-video series models: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-instruct. You can refer to [llava-video best practice](docs/source_en/Multi-Modal/llava-video-best-practice.md) for more information.
-- 🔥2024.07.06: Support internvl2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
+- 🔥2024.07.06: Support InternVL2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
 - 2024.07.06: Support codegeex4-9b-chat.
 - 2024.07.04: Support internlm2_5-7b series: internlm2_5-7b, internlm2_5-7b-chat, internlm2_5-7b-chat-1m.
 - 2024.07.02: Support for using vLLM for accelerating inference and deployment of multimodal large models such as the llava series and phi3-vision models. You can refer to the [Multimodal & vLLM Inference Acceleration Documentation](docs/source_en/Multi-Modal/vllm-inference-acceleration.md) for more information.

@@ -606,7 +607,7 @@ The complete list of supported models and datasets can be found at [Supported Mo
 | Llava1.5<br>Llava1.6 | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
 | Llava-Next<br>Llava-Next-Video | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 7B-110B | chat model |
 | mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
-| InternVL<br>Mini-Internvl<br>Internvl2 | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 2B-40B<br>including quantized version | chat model |
+| InternVL<br>Mini-InternVL<br>InternVL2 | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 1B-40B<br>including quantized version | chat model |
 | Llava-llama3 | [xtuner](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) | English | 8B | chat model |
 | Phi3-Vision | Microsoft | English | 4B | chat model |
 | PaliGemma | Google | English | 3B | chat model |
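
For context on the news items above, here is a minimal, illustrative Python sketch of running one of the newly supported InternVL2 `model_type` values through the `swift.llm` API used in the best-practice doc touched by this commit. The model choice, the sampling setup, and the `images=` keyword are assumptions and may need adjusting for your swift version (older versions embed images as `<img>...</img>` tags in the query instead).

```python
# Illustrative sketch only (not part of this diff): single-image inference with
# one of the newly added InternVL2 model types via the swift.llm Python API.
# The images= keyword is an assumption; some swift versions expect <img>...</img>
# tags embedded in the query instead.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, inference, get_default_template_type,
)
from swift.utils import seed_everything

model_type = 'internvl2-1b'  # newly supported in this commit
template_type = get_default_template_type(model_type)

model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                       model_kwargs={'device_map': 'auto'})
template = get_template(template_type, tokenizer)
seed_everything(42)

query = '<image>Describe the image.'
images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png']
response, history = inference(model, template, query, images=images)
print(response)
```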

README_CN.md

Lines changed: 3 additions & 2 deletions

@@ -56,12 +56,13 @@ SWIFT has rich and comprehensive documentation; please check our documentation site:


 ## 🎉 News
+- 2024.07.17: Support the new InternVL2 series models: `model_type` are internvl2-1b, internvl2-40b, internvl2-llama3-76b. See the best practice [here](docs/source/Multi-Modal/internvl最佳实践.md).
 - 2024.07.17: Support training and inference of [NuminaMath-7B-TIR](https://www.modelscope.cn/models/AI-ModelScope/NuminaMath-7B-TIR). Use with model_type `numina-math-7b`.
 - 🔥2024.07.16: Support export to ollama and bitsandbytes. Use `swift export --model_type xxx --to_ollama true` or `swift export --model_type xxx --quant_method bnb --quant_bits 4`.
 - 2024.07.08: Support cogvlm2-video-13b-chat. See the best practice [here](docs/source/Multi-Modal/cogvlm2-video最佳实践.md).
 - 2024.07.08: Support internlm-xcomposer2_5-7b-chat. See the best practice [here](docs/source/Multi-Modal/internlm-xcomposer2最佳实践.md).
 - 🔥2024.07.06: Support the llava-next-video series models: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-instruct. See the [llava-video best practice](docs/source/Multi-Modal/llava-video最佳实践.md) for more information.
-- 🔥2024.07.06: Support the internvl-2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
+- 🔥2024.07.06: Support the InternVL-2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
 - 2024.07.06: Support codegeex4-9b-chat.
 - 2024.07.04: Support the internlm2_5-7b series: internlm2_5-7b, internlm2_5-7b-chat, internlm2_5-7b-chat-1m.
 - 2024.07.02: Support using vLLM to accelerate inference and deployment of multimodal large models such as the llava series and phi3-vision models. See the [Multimodal & vLLM Inference Acceleration documentation](docs/source/Multi-Modal/vLLM推理加速文档.md) for more information.

@@ -600,7 +601,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
 | Llava1.5<br>Llava1.6 | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
 | Llava-Next<br>Llava-Next-Video | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 7B-110B | chat model |
 | mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
-| InternVL<br>Mini-Internvl<br>Internvl2 | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 2B-40B<br>including quantized versions | chat model |
+| InternVL<br>Mini-InternVL<br>InternVL2 | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 1B-40B<br>including quantized versions | chat model |
 | Llava-llama3 | [xtuner](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) | English | 8B | chat model |
 | Phi3-Vision | Microsoft | English | 4B | chat model |
 | PaliGemma | Google | English | 3B | chat model |

docs/source/LLM/支持的模型和数据集.md

Lines changed: 2 additions & 0 deletions

@@ -354,11 +354,13 @@
 |internvl-chat-v1_5-int8|[AI-ModelScope/InternVL-Chat-V1-5-int8](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5-int8](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-int8)|
 |mini-internvl-chat-2b-v1_5|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)|
 |mini-internvl-chat-4b-v1_5|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5/summary)|qkv_proj|internvl-phi3|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5)|
+|internvl2-1b|[OpenGVLab/InternVL2-1B](https://modelscope.cn/models/OpenGVLab/InternVL2-1B/summary)|q_proj, k_proj, v_proj|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-1B](https://huggingface.co/OpenGVLab/InternVL2-1B)|
 |internvl2-2b|[OpenGVLab/InternVL2-2B](https://modelscope.cn/models/OpenGVLab/InternVL2-2B/summary)|wqkv|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B)|
 |internvl2-4b|[OpenGVLab/InternVL2-4B](https://modelscope.cn/models/OpenGVLab/InternVL2-4B/summary)|qkv_proj|internvl2-phi3|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B)|
 |internvl2-8b|[OpenGVLab/InternVL2-8B](https://modelscope.cn/models/OpenGVLab/InternVL2-8B/summary)|wqkv|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B)|
 |internvl2-26b|[OpenGVLab/InternVL2-26B](https://modelscope.cn/models/OpenGVLab/InternVL2-26B/summary)|wqkv|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-26B](https://huggingface.co/OpenGVLab/InternVL2-26B)|
 |internvl2-40b|[OpenGVLab/InternVL2-40B](https://modelscope.cn/models/OpenGVLab/InternVL2-40B/summary)|q_proj, k_proj, v_proj|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-40B](https://huggingface.co/OpenGVLab/InternVL2-40B)|
+|internvl2-llama3-76b|[OpenGVLab/InternVL2-Llama3-76B](https://modelscope.cn/models/OpenGVLab/InternVL2-Llama3-76B/summary)|q_proj, k_proj, v_proj|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-Llama3-76B](https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B)|
 |deepseek-vl-1_3b-chat|[deepseek-ai/deepseek-vl-1.3b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-1.3b-chat/summary)|q_proj, k_proj, v_proj|deepseek-vl|&#x2714;|&#x2718;||vision|[deepseek-ai/deepseek-vl-1.3b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-1.3b-chat)|
 |deepseek-vl-7b-chat|[deepseek-ai/deepseek-vl-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat/summary)|q_proj, k_proj, v_proj|deepseek-vl|&#x2714;|&#x2718;||vision|[deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)|
 |paligemma-3b-pt-224|[AI-ModelScope/paligemma-3b-pt-224](https://modelscope.cn/models/AI-ModelScope/paligemma-3b-pt-224/summary)|q_proj, k_proj, v_proj|paligemma|&#x2714;|&#x2718;|transformers>=4.41|vision|[google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224)|
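
The two added rows register new `model_type` entries in swift's model registry. As a rough illustration of how the table above maps to code, the sketch below assumes `MODEL_MAPPING` is importable from `swift.llm` (the registry this table is generated from); the per-entry field names are assumptions and may differ between swift versions.

```python
# Illustrative sketch only: list the InternVL2 model_type entries registered in swift.
# Assumes swift.llm exposes MODEL_MAPPING (a dict keyed by model_type); the field
# names accessed below ('lora_target_modules', 'requires') are assumptions, hence
# the defensive .get() calls.
from swift.llm import MODEL_MAPPING

for model_type, info in sorted(MODEL_MAPPING.items()):
    if model_type.startswith('internvl2'):
        print(model_type,
              info.get('lora_target_modules', '?'),
              info.get('requires', '?'))
```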

docs/source/Multi-Modal/internvl最佳实践.md

Lines changed: 109 additions & 4 deletions

@@ -6,18 +6,41 @@
 - [internvl-chat-v1_5-int8](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)
 - [mini-internvl-chat-2b-v1_5](https://www.modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)
 - [mini-internvl-chat-4b-v1_5](https://www.modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5)
+- [internvl2-1b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-1B)
 - [internvl2-2b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-2B)
 - [internvl2-4b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-4B)
 - [internvl2-8b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-8B)
 - [internvl2-26b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-26B)
+- [internvl2-40b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-40B)
+- [internvl2-llama3-76b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-Llama3-76B)


 The practice below uses `internvl-chat-v1_5` as an example; you can switch to another model by specifying `--model_type`.

+**FAQ**
+
+1. **The model reports `The request model does not exist!`**
+This usually happens when using the mini-internvl or InternVL2 models, because the corresponding models on ModelScope are gated behind an access application. To resolve it, log in to ModelScope and **apply for download** on the model page; once approved, you can obtain the model in either of the following ways:
+- Use `snapshot_download` to download the model locally (the model page's download section provides the corresponding code), then pass the local path via `--model_id_or_path`
+- Obtain your account's SDK token from the [ModelScope account page](https://www.modelscope.cn/my/myaccesstoken) and pass it via the `--hub_token` argument or the `MODELSCOPE_API_TOKEN` environment variable
+
+Alternatively, set the environment variable `USE_HF` to download the model from Hugging Face.
+
+2. **When running the model on multiple GPUs, why is memory distributed unevenly across cards, causing OOM?**
+The auto device-map algorithm in transformers does not handle multimodal models well, which can lead to uneven memory allocation across GPUs.
+- You can cap per-GPU memory usage with the `--device_max_memory` argument, e.g. in a four-GPU environment: `--device_max_memory 15GB 15GB 15GB 15GB`
+- Or specify the device map explicitly via `--device_map_config_path`
+
+3. **Differences between InternVL2 and the previous series (InternVL-V1.5 and Mini-InternVL)**
+- InternVL2 supports multi-turn, multi-image inference and training, i.e. multi-turn conversations carrying images, with text and images interleaved within a single turn; see [Custom Dataset](#自定义数据集) and the InternVL2 part of the inference section. The previous series supports multi-turn conversation, but only a single turn may carry images.
+- InternVL2 supports video input; see [Custom Dataset](#自定义数据集) for the specific format.
+
 ## Table of Contents
 - [Environment Setup](#环境准备)
 - [Inference](#推理)
 - [Fine-tuning](#微调)
+- [Custom Dataset](#自定义数据集)
 - [Inference After Fine-tuning](#微调后推理)
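
As a concrete illustration of FAQ item 1 above, here is a minimal sketch of downloading a gated model with ModelScope's `snapshot_download` and pointing swift at the local copy; the login step and the model id below are assumptions about your setup.

```python
# Illustrative sketch for FAQ item 1: fetch an application-gated model with
# ModelScope's snapshot_download, then pass the local path to swift.
# The login call and the model id are assumptions for illustration.
from modelscope import snapshot_download
from modelscope.hub.api import HubApi

# SDK token from https://www.modelscope.cn/my/myaccesstoken (only needed for gated models).
HubApi().login('<your-sdk-token>')

local_dir = snapshot_download('OpenGVLab/InternVL2-2B')
print(local_dir)
# Then, for example: swift infer --model_type internvl2-2b --model_id_or_path <local_dir>
```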

@@ -49,7 +72,7 @@ CUDA_VISIBLE_DEVICES=0,1 swift infer --model_type internvl-chat-v1_5 --dtype bf1
 ```

 Output: (local paths or URLs are supported)
-```python
+```
 """
 <<< Who are you?
 Input a media path or URL <<<

@@ -107,6 +130,64 @@ ui功能了。
 """
 ```

+For the **InternVL2** series models, multi-turn multi-image inference is supported, with images and text interleaved within a single turn; use `<image>` in the input to mark where each image goes.
+Multi-turn, multi-image example:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift infer --model_type internvl2-2b
+```
+
+```
+<<< <image>Describe the image
+Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
+This is a very cute kitten. Its fur is mainly white, with black and gray stripes on the face and ears. The cat's eyes are large and bright, with blue pupils that look clear and innocent. Its ears stand upright, with light-colored tips. The kitten looks lively, and its gaze conveys curiosity and friendliness. The background is blurred, probably to highlight the cat itself. The overall picture feels warm and adorable.
+--------------------------------------------------
+<<< <image>Describe the differences from the previous image
+Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
+Compared with the previous image, this one shows several notable differences:
+
+1. **Different animal species**:
+- The animal in the previous image is a cat, while the animal in this image is a sheep.
+
+2. **Different appearance and features**:
+- The cat has distinct upright ears and blue eyes, while the sheep has upright long horns and white wool.
+
+3. **Different background and environment**:
+- The previous image has a blurred background with some clouds in the sky, while this image has a clear background with green hills, a blue sky and white clouds.
+
+4. **Different expression and posture**:
+- The cat looks alert and curious, while the sheep looks gentle and friendly.
+
+5. **Different overall style**:
+- The previous image has a more realistic and detailed painting style, while this one has a cartoonish, simplified style.
+
+These differences make this image visually quite distinct from the previous one.
+--------------------------------------------------
+<<< What do the two images have in common?
+Input an image path or URL <<<
+The two images have the following in common:
+
+1. **Different animal species**:
+- The animals in both images are sheep.
+
+2. **Different background and environment**:
+- The backgrounds of both images are natural scenery, including green hills, a blue sky and white clouds.
+
+3. **Different expression and posture**:
+- The cat and the sheep differ in expression and posture, but both look very cute and friendly.
+
+4. **Different overall style**:
+- The two images differ in style, but both have cartoonish, simplified characteristics.
+
+These common points make the two images visually quite different, yet both present adorable animal figures.
+```
+
+Single-turn, multi-image example:
+```
+<<< image1: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img> image2: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img> What is the difference between the two images?
+Input an image path or URL <<<
+The two images are of the same kitten, but the first image is a close-up shot, while the second image is a more distant, artistic illustration. The close-up image captures the kitten in detail, showing its fur, eyes, and facial features in sharp focus. In contrast, the artistic illustration is more abstract and stylized, with a blurred background and a different color palette. The distant illustration gives the kitten a more whimsical and dreamy appearance, while the close-up image emphasizes the kitten's realism and detail.
+```
+
 The example images are:

 cat:

@@ -134,6 +215,7 @@ ocr:
 ```python
 import os
 os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+# os.environ['MODELSCOPE_API_TOKEN'] = 'Your API Token' # If the message "The request model does not exist!" appears.

 from swift.llm import (
     get_model_tokenizer, get_template, inference,

@@ -142,6 +224,7 @@ from swift.llm import (
 from swift.utils import seed_everything
 import torch

+
 model_type = "internvl-chat-v1_5"
 template_type = get_default_template_type(model_type)
 print(f'template_type: {template_type}')

@@ -244,17 +327,39 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
     --sft_type full \
 ```

-
+## Custom Dataset
 [Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support json and jsonl formats; below are examples of custom datasets:

-(Multi-turn conversation is supported, but only a single image may appear in the whole conversation; local paths or URLs can be passed)
+(Multi-turn conversation is supported; images can be passed as local paths or URLs, with multiple images separated by commas ',')

 ```jsonl
 {"query": "55555", "response": "66666", "images": ["image_path"]}
-{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
+{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]}
 {"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path"]}
 ```

+(Plain-text data is supported)
+```jsonl
+{"query": "55555", "response": "66666"}
+{"query": "eeeee", "response": "fffff", "history": []}
+{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
+```
+
+**InternVL2** models support multi-image, multi-turn training; use the `<image>` tag to mark where the images appear in the conversation. If the dataset contains no `<image>` tag, the images are placed at the start of the last turn's query by default.
+```jsonl
+{"query": "Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.", "response": "xxxxxxxxx", "history": [["<image> Describe the image", "xxxxxxx"], ["CCCCC", "DDDDD"]], "images": ["image_path1", "image_path2", "image_path3"]}
+```
+Alternatively, use `<img>image_path</img>` to give the image path and position directly:
+```jsonl
+{"query": "Image-1: <img>img_path</img>\n Image-2: <img>img_path2</img>\n Describe the two images in detail.", "response": "xxxxxxxxx", "history": [["<img>img_path3</img> Describe the image", "xxxxxxx"], ["CCCCC", "DDDDD"]]}
+```
+
+**InternVL2** models support training on video datasets, with no tag required:
+```jsonl
+{"query": "Describe this video in detail. Don't repeat", "response": "xxxxxxxxx", "history": [], "videos": ["video_path"]}
+```
+
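
To make the formats above concrete, here is a small illustrative sketch that writes a toy dataset file in this jsonl layout; the file name and media paths are placeholders rather than values from the diff.

```python
# Illustrative sketch: write a tiny custom dataset in the jsonl layout described
# above. The file name and image/video paths are placeholders.
import json

samples = [
    # single image, single turn
    {"query": "<image>Describe the image", "response": "...", "images": ["cat.png"]},
    # multi-turn, multi-image: one <image> tag per image, in order
    {"query": "Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.",
     "response": "...",
     "history": [["<image> Describe the image", "..."]],
     "images": ["img1.png", "img2.png", "img3.png"]},
    # video sample: no tag needed
    {"query": "Describe this video in detail.", "response": "...", "videos": ["clip.mp4"]},
]

with open("custom_dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
# Pass the file to `swift sft` via the custom-dataset argument described in the
# 自定义与拓展 doc; the exact flag name varies by swift version.
```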
 ## Inference After Fine-tuning
 Direct inference:
 ```shell
