
Commit dd9410e

support qwen1.5-moe model (#627)
1 parent 8812886 commit dd9410e

File tree

10 files changed (+164, -1 lines)


README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -39,6 +39,7 @@ To facilitate use by users unfamiliar with deep learning, we provide a Gradio we
 Additionally, we are expanding capabilities for other modalities. Currently, we support full-parameter training and LoRA training for AnimateDiff.
 
 ## 🎉 News
+- 🔥2024.03.29: Support **Qwen1.5-MoE** series: Qwen1.5-MoE-A2.7B, Qwen1.5-MoE-A2.7B-Chat, Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4.
 - 🔥2024.03.29: Support the fine-tuning and inference of **Grok-1** 300B MoE, please view details [here](https://github.com/modelscope/swift/tree/main/docs/source_en/LLM/Grok-1-best-practice.md).
 - 🔥2024.03.25: Supports inference and fine-tuning of TeleChat-12b model, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/telechat_12b/lora/sft.sh) to start training!
 - 🔥2024.03.20: Supports inference and fine-tuning for the **llava** series. For best practice, you can refer to [here](https://github.com/modelscope/swift/tree/main/docs/source/Multi-Modal/llava最佳实践.md).
```
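The new news entry is backed by the example scripts and model registrations below. For orientation, a minimal Python sketch of fine-tuning the newly supported chat model, assuming the `SftArguments`/`sft_main` entry points documented in this repo's README (the dataset mirrors the chat SFT script added in this commit):

```python
# Minimal sketch, assuming swift's documented Python entry points
# (SftArguments / sft_main); hyperparameters mirror the shell scripts
# added in this commit.
from swift.llm import SftArguments, sft_main

sft_args = SftArguments(
    model_type='qwen1half-moe-a2_7b-chat',  # model type registered by this commit
    sft_type='lora',
    dataset=['blossom-math-zh'],
    output_dir='output',
    use_flash_attn=True)
result = sft_main(sft_args)
# Assumption: sft_main returns a dict containing the best checkpoint path.
print(result['best_model_checkpoint'])
```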

README_CN.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -40,6 +40,7 @@ SWIFT supports training, inference, evaluation, and deployment of nearly **200 LLMs and MLLMs** (multimodal large models)
 In addition, we are expanding capabilities for other modalities; currently we support full-parameter training and LoRA training for AnimateDiff.
 
 ## 🎉 News
+- 🔥2024.03.29: Support the **Qwen1.5-MoE** series: Qwen1.5-MoE-A2.7B, Qwen1.5-MoE-A2.7B-Chat, Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4.
 - 🔥2024.03.29: Support inference and fine-tuning of the **Grok-1** 300B MoE model; see the best practice [here](https://github.com/modelscope/swift/tree/main/docs/source/LLM/Grok训练和推理.md).
 - 🔥2024.03.25: Support training and inference of the TeleChat-12b model; use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/telechat_12b/lora/sft.sh) to start training!
 - 🔥2024.03.20: Support inference and fine-tuning of the **llava** series; see the best practice [here](https://github.com/modelscope/swift/tree/main/docs/source/Multi-Modal/llava最佳实践.md).
```

docs/source/LLM/支持的模型和数据集.md

Lines changed: 4 additions & 1 deletion
```diff
@@ -36,12 +36,14 @@
 |qwen1half-7b|[qwen/Qwen1.5-7B](https://modelscope.cn/models/qwen/Qwen1.5-7B/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|transformers>=4.37|-|
 |qwen1half-14b|[qwen/Qwen1.5-14B](https://modelscope.cn/models/qwen/Qwen1.5-14B/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|transformers>=4.37|-|
 |qwen1half-72b|[qwen/Qwen1.5-72B](https://modelscope.cn/models/qwen/Qwen1.5-72B/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|transformers>=4.37|-|
+|qwen1half-moe-a2_7b|[qwen/Qwen1.5-MoE-A2.7B](https://modelscope.cn/models/qwen/Qwen1.5-MoE-A2.7B/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|transformers>=4.37|-|
 |qwen1half-0_5b-chat|[qwen/Qwen1.5-0.5B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-0.5B-Chat/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37|-|
 |qwen1half-1_8b-chat|[qwen/Qwen1.5-1.8B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-1.8B-Chat/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37|-|
 |qwen1half-4b-chat|[qwen/Qwen1.5-4B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-4B-Chat/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37|-|
 |qwen1half-7b-chat|[qwen/Qwen1.5-7B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-7B-Chat/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37|-|
 |qwen1half-14b-chat|[qwen/Qwen1.5-14B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-14B-Chat/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37|-|
 |qwen1half-72b-chat|[qwen/Qwen1.5-72B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-72B-Chat/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37|-|
+|qwen1half-moe-a2_7b-chat|[qwen/Qwen1.5-MoE-A2.7B-Chat](https://modelscope.cn/models/qwen/Qwen1.5-MoE-A2.7B-Chat/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37|-|
 |qwen1half-0_5b-chat-int4|[qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4](https://modelscope.cn/models/qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|auto_gptq>=0.5, transformers>=4.37|-|
 |qwen1half-1_8b-chat-int4|[qwen/Qwen1.5-1.8B-Chat-GPTQ-Int4](https://modelscope.cn/models/qwen/Qwen1.5-1.8B-Chat-GPTQ-Int4/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|auto_gptq>=0.5, transformers>=4.37|-|
 |qwen1half-4b-chat-int4|[qwen/Qwen1.5-4B-Chat-GPTQ-Int4](https://modelscope.cn/models/qwen/Qwen1.5-4B-Chat-GPTQ-Int4/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|auto_gptq>=0.5, transformers>=4.37|-|
@@ -54,6 +56,7 @@
 |qwen1half-7b-chat-int8|[qwen/Qwen1.5-7B-Chat-GPTQ-Int8](https://modelscope.cn/models/qwen/Qwen1.5-7B-Chat-GPTQ-Int8/summary)|q_proj, k_proj, v_proj|qwen|✔|✘|auto_gptq>=0.5, transformers>=4.37|-|
 |qwen1half-14b-chat-int8|[qwen/Qwen1.5-14B-Chat-GPTQ-Int8](https://modelscope.cn/models/qwen/Qwen1.5-14B-Chat-GPTQ-Int8/summary)|q_proj, k_proj, v_proj|qwen|✔|✘|auto_gptq>=0.5, transformers>=4.37|-|
 |qwen1half-72b-chat-int8|[qwen/Qwen1.5-72B-Chat-GPTQ-Int8](https://modelscope.cn/models/qwen/Qwen1.5-72B-Chat-GPTQ-Int8/summary)|q_proj, k_proj, v_proj|qwen|✔|✘|auto_gptq>=0.5, transformers>=4.37|-|
+|qwen1half-moe-a2_7b-chat-int4|[qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4](https://modelscope.cn/models/qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4/summary)|q_proj, k_proj, v_proj|qwen|✔|✘|auto_gptq>=0.5, transformers>=4.37|-|
 |qwen1half-0_5b-chat-awq|[qwen/Qwen1.5-0.5B-Chat-AWQ](https://modelscope.cn/models/qwen/Qwen1.5-0.5B-Chat-AWQ/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37, autoawq|-|
 |qwen1half-1_8b-chat-awq|[qwen/Qwen1.5-1.8B-Chat-AWQ](https://modelscope.cn/models/qwen/Qwen1.5-1.8B-Chat-AWQ/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37, autoawq|-|
 |qwen1half-4b-chat-awq|[qwen/Qwen1.5-4B-Chat-AWQ](https://modelscope.cn/models/qwen/Qwen1.5-4B-Chat-AWQ/summary)|q_proj, k_proj, v_proj|qwen|✔|✔|transformers>=4.37, autoawq|-|
@@ -198,7 +201,7 @@
 |mamba-790m|[AI-ModelScope/mamba-790m-hf](https://modelscope.cn/models/AI-ModelScope/mamba-790m-hf/summary)|in_proj, x_proj, embeddings, out_proj|default-generation|✘|✘|transformers>=4.39.0|-|
 |mamba-1.4b|[AI-ModelScope/mamba-1.4b-hf](https://modelscope.cn/models/AI-ModelScope/mamba-1.4b-hf/summary)|in_proj, x_proj, embeddings, out_proj|default-generation|✘|✘|transformers>=4.39.0|-|
 |mamba-2.8b|[AI-ModelScope/mamba-2.8b-hf](https://modelscope.cn/models/AI-ModelScope/mamba-2.8b-hf/summary)|in_proj, x_proj, embeddings, out_proj|default-generation|✘|✘|transformers>=4.39.0|-|
-|telechat-12b|[TeleAI/telechat-12B](https://modelscope.cn/models/TeleAI/telechat-12B/summary)|self_attention.key_value, self_attention.query|telechat|✔|✘||-|
+|telechat-12b|[TeleAI/TeleChat-12B](https://modelscope.cn/models/TeleAI/TeleChat-12B/summary)|self_attention.key_value, self_attention.query|telechat|✔|✘||-|
 |grok-1|[colossalai/grok-1-pytorch](https://modelscope.cn/models/colossalai/grok-1-pytorch/summary)|q_proj, k_proj, v_proj|default-generation|✘|✘||-|
```
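Per the rows added above, the bf16 MoE checkpoints support both flash attention and vLLM (✔/✔), while the GPTQ-Int4 build supports flash attention but not vLLM (✔/✘) and additionally requires auto_gptq>=0.5. A minimal inference sketch for the quantized variant, assuming the `InferArguments`/`infer_main` entry points documented in this repo's README:

```python
# Minimal sketch, assuming swift's documented InferArguments / infer_main.
# Requires auto_gptq>=0.5 and transformers>=4.37 per the table above.
from swift.llm import InferArguments, infer_main

infer_args = InferArguments(
    model_type='qwen1half-moe-a2_7b-chat-int4',  # GPTQ build added by this commit
    max_new_tokens=2048,
    temperature=0.1)
infer_main(infer_args)  # with no ckpt_dir, starts an interactive chat session
```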

Lines changed: 14 additions & 0 deletions
```sh
# Experimental environment: A100
# 36GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python llm_infer.py \
    --ckpt_dir "output/qwen1half-moe-a2_7b/vx-xxx/checkpoint-xxx" \
    --load_dataset_config true \
    --use_flash_attn true \
    --max_new_tokens 2048 \
    --temperature 0.1 \
    --top_p 0.7 \
    --repetition_penalty 1. \
    --do_sample true \
    --merge_lora false
```
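Note that `vx-xxx/checkpoint-xxx` in `--ckpt_dir` is a placeholder, left as-is here: the training script below writes checkpoints under a per-run directory of the form `output/<model_type>/<version>/checkpoint-<step>`, and the actual path must be substituted before running inference.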
Lines changed: 31 additions & 0 deletions
```sh
# Experimental environment: A100
# 42GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python llm_sft.py \
    --model_type qwen1half-moe-a2_7b \
    --sft_type lora \
    --tuner_backend swift \
    --dtype AUTO \
    --output_dir output \
    --dataset dureader-robust-zh \
    --train_dataset_sample -1 \
    --num_train_epochs 1 \
    --max_length 1024 \
    --check_dataset_strategy warning \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --lora_target_modules ALL \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.1 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --use_flash_attn true
```
Lines changed: 14 additions & 0 deletions
```sh
# Experimental environment: A100
# 36GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python llm_infer.py \
    --ckpt_dir "output/qwen1half-moe-a2_7b-chat/vx-xxx/checkpoint-xxx" \
    --load_dataset_config true \
    --use_flash_attn true \
    --max_new_tokens 2048 \
    --temperature 0.1 \
    --top_p 0.7 \
    --repetition_penalty 1. \
    --do_sample true \
    --merge_lora false
```
Lines changed: 31 additions & 0 deletions
```sh
# Experimental environment: A100
# 42GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python llm_sft.py \
    --model_type qwen1half-moe-a2_7b-chat \
    --sft_type lora \
    --tuner_backend swift \
    --dtype AUTO \
    --output_dir output \
    --dataset blossom-math-zh \
    --train_dataset_sample -1 \
    --num_train_epochs 1 \
    --max_length 1024 \
    --check_dataset_strategy warning \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --lora_target_modules ALL \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.1 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --use_flash_attn true
```
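The chat-model training script is identical to the base-model one except for `--model_type` and the dataset (`blossom-math-zh` instead of `dureader-robust-zh`); all LoRA hyperparameters and the reported ~42GB A100 memory footprint are shared.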
Lines changed: 12 additions & 0 deletions
```sh
# Experimental environment: A100
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --ckpt_dir "output/qwen1half-moe-a2_7b-chat-int4/vx-xxx/checkpoint-xxx" \
    --load_dataset_config true \
    --use_flash_attn true \
    --max_new_tokens 2048 \
    --temperature 0.1 \
    --top_p 0.7 \
    --repetition_penalty 1. \
    --do_sample true \
    --merge_lora false
```
Lines changed: 28 additions & 0 deletions
```sh
# Experimental environment: A100
# 17GB GPU memory

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model_type qwen1half-moe-a2_7b-chat-int4 \
    --sft_type lora \
    --output_dir output \
    --dataset blossom-math-zh \
    --train_dataset_sample -1 \
    --num_train_epochs 3 \
    --max_length 2048 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --lora_target_modules ALL \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.1 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --use_flash_attn true
```
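Compared with the bf16 LoRA runs above (~42GB), training on the GPTQ-Int4 checkpoint cuts the footprint to roughly 17GB per the header comment. The matching inference script keeps `--merge_lora false`; merging LoRA weights back into quantized layers is generally not supported, so the adapter is loaded alongside the quantized base at inference time.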

swift/llm/utils/model.py

Lines changed: 28 additions & 0 deletions
```diff
@@ -61,12 +61,14 @@ class ModelType:
     qwen1half_7b = 'qwen1half-7b'
     qwen1half_14b = 'qwen1half-14b'
     qwen1half_72b = 'qwen1half-72b'
+    qwen1half_moe_a2_7b = 'qwen1half-moe-a2_7b'
     qwen1half_0_5b_chat = 'qwen1half-0_5b-chat'
     qwen1half_1_8b_chat = 'qwen1half-1_8b-chat'
     qwen1half_4b_chat = 'qwen1half-4b-chat'
     qwen1half_7b_chat = 'qwen1half-7b-chat'
     qwen1half_14b_chat = 'qwen1half-14b-chat'
     qwen1half_72b_chat = 'qwen1half-72b-chat'
+    qwen1half_moe_a2_7b_chat = 'qwen1half-moe-a2_7b-chat'
 
     # qwen1.5 gptq
     qwen1half_0_5b_chat_int4 = 'qwen1half-0_5b-chat-int4'
@@ -81,6 +83,7 @@ class ModelType:
     qwen1half_7b_chat_int8 = 'qwen1half-7b-chat-int8'
     qwen1half_14b_chat_int8 = 'qwen1half-14b-chat-int8'
     qwen1half_72b_chat_int8 = 'qwen1half-72b-chat-int8'
+    qwen1half_moe_a2_7b_chat_int4 = 'qwen1half-moe-a2_7b-chat-int4'
 
     # qwen1.5 awq
     qwen1half_0_5b_chat_awq = 'qwen1half-0_5b-chat-awq'
@@ -982,6 +985,14 @@ def cross_entropy_forward(self, inputs: Tensor,
     support_flash_attn=True,
     support_vllm=True,
     requires=['transformers>=4.37'])
+@register_model(
+    ModelType.qwen1half_moe_a2_7b,
+    'qwen/Qwen1.5-MoE-A2.7B',
+    LoRATM.qwen1half,
+    TemplateType.default_generation,
+    support_flash_attn=True,
+    support_vllm=True,
+    requires=['transformers>=4.37'])
 @register_model(
     ModelType.deepseek_coder_1_3b,
     'deepseek-ai/deepseek-coder-1.3b-base',
@@ -1404,6 +1415,14 @@ def get_model_tokenizer_aqlm(model_dir: str,
     support_flash_attn=True,
     support_vllm=True,
     requires=['transformers>=4.37'])
+@register_model(
+    ModelType.qwen1half_moe_a2_7b_chat,
+    'qwen/Qwen1.5-MoE-A2.7B-Chat',
+    LoRATM.qwen1half,
+    TemplateType.qwen,
+    support_flash_attn=True,
+    support_vllm=True,
+    requires=['transformers>=4.37'])
 def get_model_tokenizer_qwen1half(model_dir: str,
                                   torch_dtype: Dtype,
                                   model_kwargs: Dict[str, Any],
@@ -1540,6 +1559,15 @@ def get_model_tokenizer_qwen1half(model_dir: str,
     torch_dtype=torch.float16,
     function_kwargs={'bits': 8},
     support_flash_attn=True)
+@register_model(
+    ModelType.qwen1half_moe_a2_7b_chat_int4,
+    'qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4',
+    LoRATM.qwen1half,
+    TemplateType.qwen,
+    requires=['auto_gptq>=0.5', 'transformers>=4.37'],
+    torch_dtype=torch.float16,
+    function_kwargs={'bits': 4},
+    support_flash_attn=True)
 def get_model_tokenizer_qwen1half_intx(model_dir: str,
                                        torch_dtype: Dtype,
                                        model_kwargs: Dict[str, Any],
```
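The three `@register_model` additions follow the file's existing pattern: each new `ModelType` constant is bound to its ModelScope model id, LoRA target modules, chat template, and dependency requirements, and attached to the loader function it decorates. Once registered, the model type resolves through swift's loader utilities; a minimal sketch, assuming the `get_model_tokenizer` helper exported by `swift.llm` as documented in the repo:

```python
# Minimal sketch, assuming swift.llm.get_model_tokenizer as documented
# in the repo; the registration above supplies the ModelScope id,
# template, and the transformers>=4.37 requirement.
import torch
from swift.llm import get_model_tokenizer

model, tokenizer = get_model_tokenizer(
    'qwen1half-moe-a2_7b-chat',
    torch_dtype=torch.bfloat16,
    model_kwargs={'device_map': 'auto'},
    use_flash_attn=True)  # opt-in, per support_flash_attn=True above
```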
