[model] support Qwen3-235B-A22B-Instruct-250718 #5033


Merged: 3 commits merged into modelscope:main on Jul 21, 2025

Conversation

@Jintao-Huang (Collaborator) commented on Jul 21, 2025

Model Finetuning

https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/qwen3_235b.sh

We introduce self-cognition finetuning of Qwen3-235B-A22B-Instruct-2507 using Megatron and LoRA, as integrated in ms-swift. You will need 8 GPUs with 80GiB of memory each.

Before starting the finetuning process, please ensure your environment is properly set up.

For instructions on installing Megatron-related dependencies, please refer to the Megatron-SWIFT training documentation (Docker images are also available):
https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html

git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .

The finetuning dataset should be prepared in the following format (the "system" field is optional). You can specify it in the training script using --dataset <dataset_path>.

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}
1. Convert the HF-format weights to Megatron format and test the conversion accuracy:
# 8 * 80GiB
# To test the conversion accuracy, add --test_convert_precision true. Note that this requires approximately 1.3 TiB of memory.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift export \
    --model Qwen/Qwen3-235B-A22B-Instruct-2507 \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen3-235B-A22B-Instruct-2507-mcore
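If sufficient memory is available, the conversion and the precision check can be done in one pass; a sketch of the same command with the flag from the comment above added (everything else unchanged):

# Requires approximately 1.3 TiB of memory for the precision check
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift export \
    --model Qwen/Qwen3-235B-A22B-Instruct-2507 \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --test_convert_precision true \
    --output_dir Qwen3-235B-A22B-Instruct-2507-mcore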
2. Self-cognition finetuning (LoRA) of Qwen3-235B-A22B-Instruct-2507-mcore uses about 90GiB on each of 8 H20 GPUs and trains at roughly 3.5 seconds per iteration:
# Memory usage: 8 x 90GiB
# Training speed: 3.5s/it
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --load Qwen3-235B-A22B-Instruct-2507-mcore \
    --dataset 'swift/Chinese-Qwen3-235B-2507-Distill-data-110k-SFT#2000' \
              'swift/self-cognition#1000' \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --split_dataset_ratio 0.01 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --micro_batch_size 2 \
    --global_batch_size 16 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --save megatron_output/Qwen3-235B-A22B-Instruct-2507 \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 2048 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend flash \
    --model_author swift \
    --model_name swift-robot
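Note: the #2000 and #1000 suffixes on the --dataset arguments use ms-swift's sampling syntax, which takes only that many samples from each dataset rather than the full set.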

Training memory usage: [screenshot]

Training log: [screenshot]
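For reference on the batch configuration: with expert_model_parallel_size 8 and no tensor or pipeline parallelism, the data-parallel size across 8 GPUs is 8, so under Megatron's usual accounting each optimizer step accumulates global_batch_size / (micro_batch_size x data_parallel_size) = 16 / (2 x 8) = 1 micro-batch per rank.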

If you need to run it on 8 GPUs with 80GiB memory each, you can use the following configuration:

# Memory usage: 8 * 78GiB
# Training speed: 9.5s/it
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --load Qwen3-235B-A22B-Instruct-2507-mcore \
    --dataset 'swift/Chinese-Qwen3-235B-2507-Distill-data-110k-SFT#2000' \
              'swift/self-cognition#1000' \
    --optimizer_cpu_offload true \
    --use_precision_aware_optimizer true \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --split_dataset_ratio 0.01 \
    --expert_model_parallel_size 2 \
    --pipeline_model_parallel_size 4 \
    --decoder_first_pipeline_num_layers 23 \
    --decoder_last_pipeline_num_layers 23 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --micro_batch_size 8 \
    --global_batch_size 16 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --save megatron_output/Qwen3-235B-A22B-Instruct-2507 \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 2048 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend flash \
    --model_author swift \
    --model_name swift-robot
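A note on the pipeline split in this configuration: Qwen3-235B-A22B has 94 decoder layers, so with pipeline_model_parallel_size 4, setting decoder_first_pipeline_num_layers and decoder_last_pipeline_num_layers to 23 leaves 24 layers for each of the two middle stages (23 + 24 + 24 + 23 = 94). Assigning fewer layers to the boundary stages helps offset the extra memory they hold for the embedding and output head.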
3. Convert the Megatron-format weights back to HF format and test the conversion accuracy:
# 8 * 80GiB
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift export \
    --mcore_adapters megatron_output/Qwen3-235B-A22B-Instruct-2507/vx-xxx \
    --to_hf true \
    --torch_dtype bfloat16 \
    --output_dir megatron_output/Qwen3-235B-A22B-Instruct-2507/vx-xxx-hf
4. After training is complete, use the following command for inference:
# 8 * 80GiB
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift infer \
    --model megatron_output/Qwen3-235B-A22B-Instruct-2507/vx-xxx-hf \
    --infer_backend vllm \
    --stream true \
    --temperature 0 \
    --vllm_tensor_parallel_size 8 \
    --vllm_max_model_len 8192 \
    --max_new_tokens 2048
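Because the self-cognition data is trained with --model_name swift-robot and --model_author swift, a quick sanity check in the interactive session is to ask the model who it is; a successfully finetuned checkpoint should identify itself as swift-robot trained by swift rather than with the base model's original identity.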
5. Push the model to ModelScope:
swift export \
    --model megatron_output/Qwen3-235B-A22B-Instruct-2507/vx-xxx-hf \
    --push_to_hub true \
    --hub_model_id '<your-model-id>' \
    --hub_token '<your-sdk-token>'
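Once pushed, the model can be retrieved from ModelScope programmatically; a minimal Python sketch, assuming '<your-model-id>' is replaced with the id used above:

# pip install modelscope
from modelscope import snapshot_download

# '<your-model-id>' is the same id passed to --hub_model_id above
model_dir = snapshot_download('<your-model-id>')
print(model_dir)  # local path containing the downloaded weights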

@gemini-code-assist bot (Contributor) left a comment:
Code Review

This pull request adds support for the Qwen3-235B-A22B-Instruct-250718 model by updating the model registration and documentation. The core changes are correct. However, I have a couple of suggestions for improvement:

  1. Code Organization: In swift/llm/model/model/qwen.py, the new model is added in a separate ModelGroup. It would be more consistent to add it to the existing ModelGroup for qwen3_moe models.

Addressing these points will improve the maintainability and clarity of the codebase.

Comment on lines 555 to 557
ModelGroup([
    Model('Qwen/Qwen3-235B-A22B-Instruct-250718', 'Qwen/Qwen3-235B-A22B-Instruct-250718'),
]),
@gemini-code-assist bot commented (severity: medium):

For better code organization and maintainability, it would be clearer to add the new model Qwen/Qwen3-235B-A22B-Instruct-250718 to the existing ModelGroup for qwen3_moe models, rather than creating a new ModelGroup for a single model. You can add it under the # instruct comment in the first ModelGroup.

@Jintao-Huang merged commit e64984a into modelscope:main on Jul 21, 2025
2 checks passed
@modelscope deleted a comment from gemini-code-assist bot on Jul 24, 2025
@modelscope deleted a comment from gemini-code-assist bot on Jul 24, 2025