
Releases: modelscope/ms-swift

v3.7.0

07 Aug 07:05

English Version

New Features

  1. GRPO
    a. Added support for the GSPO algorithm. Use --importance_sampling_level sequence during GRPO training. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/GSPO.html
    b. GRPO server mode now supports multi-node rollout: pass multiple vllm_server_host/port values. Example script: https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/server_multi_node.sh
    c. GRPO rollout is now compatible with the GYM environment specification (thanks to contributor Mouse). Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/gym_env.html
    d. Added entropy_mask for filtering low-entropy tokens during loss computation, and the logger now tracks entropy dynamics. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/entropy_mask.html
    e. Added support for training with the multi-turn DeepEyes algorithm. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/deepeyes.html
    f. GRPO supports --truncation_strategy delete: remove samples whose input length exceeds max_length and resample.
  2. Megatron-SWIFT
    a. Added LoRA training, currently for CPT/SFT/DPO, which significantly speeds up MoE training.
    - Docs: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html#lora-training
    - Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/lora
    b. Added loss_scale support, which is convenient for Agent training. Script: https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/loss_scale.sh
    c. Default megatron-core upgraded to 0.13.
    d. Added bshd tensor format to facilitate custom attention_mask.
    e. Logging improvements: now prints GPU memory usage and estimated remaining training time, and writes the training log to logging.jsonl.
    f. Faster model loading & conversion plus a progress bar.
  3. Training
    a. Added Flash-Attention-3 support (including Megatron-SWIFT). Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/flash_attention_3
    b. New --new_special_tokens flag for adding special tokens. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/new_special_tokens
    c. New --cached_dataset flag for offline tokenization in CPT/SFT. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset
    d. Refactored the sequence-packing module: packing is faster, and the disk-storage problem with multimodal packing has been optimized.
    e. Qwen2.5-VL now supports mixed-modality data (multiple modalities in a single sample) together with DeepSpeed training.
    f. Multimodal model training now supports loss_scale.
    g. rope_scaling now accepts a dict; in addition, setting max_model_len automatically adjusts the rope_scaling factor.
    h. Added DeepSpeed-AutoTP (not compatible with LoRA).
    i. Multimodal packing is compatible with transformers ≥ 4.53; sequence parallelism with transformers ≥ 4.52.
    j. resume_only_model now skips already-consumed data by default; use ignore_data_skip to control this behavior.
    k. MoE training supports router_aux_loss_coef.
    l. Templates gain a max_length truncation safeguard: image/video tokens are never truncated.
    m. tuner_backend unsloth now supports MoE models, device_map, and DDP.
    n. Embedding training supports liger_kernel.
  4. RLHF
    a. Added MPO training. Script: https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/mpo.sh
    b. Multimodal DPO now supports rejected image inputs: add a rejected_images column to the dataset.
  5. Inference & Deployment
    a. Added deployment for embedding models across pt/vllm/sglang back-ends. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/deploy/embedding
    b. InferEngine supports return_details to output prompt_token_ids and token_ids.
    c. vLLM back-end now supports more multimodal models: ovis2, glm4_1v, keye-vl, kimi-vl, glm4v, phi4-multimodal, llama4.
    d. vLLM arguments refactored: argument names now carry the vllm_ prefix. The GRPO module reuses the same vLLM arguments.
  6. Export
    a. QLoRA now supports Merge-LoRA. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora
    b. Added FP8 / BNB quantization for MoE and multimodal models. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize
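
The GSPO item under GRPO above switches importance sampling from the token level to the sequence level. The sketch below illustrates the difference on toy log-probabilities, assuming the length-normalized sequence ratio described in the GSPO paper; the function names are hypothetical and this is not ms-swift's implementation.

```python
import math

def token_level_ratios(new_logps, old_logps):
    """One importance ratio per token (token-level weighting)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_level_ratio(new_logps, old_logps):
    """One length-normalized ratio for the whole response (sequence-level weighting)."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))

# Toy per-token log-probabilities under the current and the old policy.
new_logps = [-1.0, -0.5, -2.0]
old_logps = [-1.2, -0.5, -1.6]

print(token_level_ratios(new_logps, old_logps))    # three separate ratios
print(sequence_level_ratio(new_logps, old_logps))  # one ratio shared by all tokens
```

With sequence-level weighting every token in a response is scaled by the same ratio, so a single outlier token cannot dominate the update.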

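The entropy_mask item under GRPO above excludes low-entropy tokens from the loss. A minimal sketch of that idea follows, assuming the mask keeps a fixed fraction of the highest-entropy tokens; the helper names and the keep_fraction parameter are hypothetical, not ms-swift options.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_mask(entropies, keep_fraction):
    """Return a 0/1 mask that keeps the highest-entropy tokens."""
    n_keep = max(1, round(len(entropies) * keep_fraction))
    threshold = sorted(entropies, reverse=True)[n_keep - 1]
    return [1 if h >= threshold else 0 for h in entropies]

# Toy next-token distributions for a four-token response.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.40, 0.30, 0.20, 0.10],  # uncertain
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.90, 0.05, 0.03, 0.02],  # confident
]
entropies = [token_entropy(d) for d in dists]
mask = high_entropy_mask(entropies, keep_fraction=0.5)

# Only the unmasked tokens contribute to the averaged loss.
per_token_loss = [0.8, 1.1, 0.9, 0.3]
loss = sum(l * m for l, m in zip(per_token_loss, mask)) / sum(mask)
print(mask, loss)
```

Here only the two uncertain positions survive the mask, so the loss is averaged over those two tokens alone.
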
New Models

  1. Text-only
    a. Qwen/Qwen3-235B-A22B-[Instruct/Thinking]-2507, Qwen/Qwen3-Coder-480B-A35B-Instruct, and Qwen/Qwen3-4B-[Instruct/Thinking]-2507 (Megatron-SWIFT supported). Training script: #5033
    b. openai-mirror/gpt-oss-20b family. Best practices: #5277
    c. ZhipuAI/GLM-4.5 family (Megatron-SWIFT supported). Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/glm4_5_106b.sh
    d. Hunyuan-7B-Instruct family. Best practices: #5236
    e. mistralai/Devstral-Small-2505
  2. Multimodal
    a. OpenBMB/MiniCPM-V-4. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/models/minicpmv/train.sh


Patch release v3.6.4

02 Aug 06:35

Patch release v3.6.3

29 Jul 06:24

Patch release v3.6.2

18 Jul 08:18

Patch release v3.6.1

11 Jul 02:14

v3.6.0

08 Jul 03:35

English Version

New Features

  1. Megatron-SWIFT:
    a. Support for more MoE model architectures, including: DeepseekV3ForCausalLM, Dots1ForCausalLM, and Ernie4_5_MoeForCausalLM. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/moe
    b. Support for more Dense model architectures, including: MiMoForCausalLM, InternLM3ForCausalLM, and Ernie4_5_ForCausalLM. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/dense
    c. DPO training supported. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/rlhf/dpo
    d. FP8 training supported.
    e. More rope scaling types supported, including: default, linear, yarn, dynamic, longrope, llama3, etc.
    f. --test_convert_precision parameter optimized for easier testing of weight conversion precision between mcore and huggingface models.
  2. GRPO:
    a. GRPO multi-turn training refactored, supporting accelerated multi-turn inference with AsyncEngine. Documentation: https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/DeveloperGuide/%E5%A4%9A%E8%BD%AE%E8%AE%AD%E7%BB%83.html
    b. The offload_model parameter now also offloads the reference model.
    c. Optimized GPU memory management under sleep_level and offload_model parameters.
    d. Added trainer_state as an input parameter to reward_funcs, making it easier to obtain the current and total training steps.
  3. Training:
    a. Reranker training supported. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker
    b. CPT/SFT/DPO/GRPO training of text-only LLMs supports ring-attention, which splits long sequences to reduce GPU memory usage. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text/ring_attention
    c. Channel loss in CPT/SFT training is compatible with padding_free and packing. Thanks to the technical team at China Merchants Bank for their contribution.
    d. Optimized remove_unused_columns parameter. When set to False, extra dataset columns are passed to the Trainer for custom loss functions.
    e. The default value for split_dataset_ratio changed from 0.01 to 0, so the validation set is not split by default. You now need to manually set --split_dataset_ratio or --val_dataset.
    f. Fixed loss alignment issue between packing/padding_free for multimodal models. For details, see this PR: #4838
    g. Swanlab now supports Feishu (Lark Suite) notification callback after training is completed.
  4. RLHF:
    a. Pure-text and multimodal models support GKD training, with some scenarios supporting padding_free and packing. Training scripts:
    i. Large models: https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/gkd.sh
    ii. Multimodal large models: https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/gkd.sh
    b. Reward model training now supports the margin parameter. Documentation: https://swift.readthedocs.io/zh-cn/latest/Instruction/%E4%BA%BA%E7%B1%BB%E5%AF%B9%E9%BD%90.html#rm
  5. Full Pipeline:
    a. SGLang inference engine can be used to accelerate ms-swift inference/deployment/evaluation/ui modules, by setting --infer_backend sglang. Inference script reference: https://github.com/modelscope/ms-swift/tree/main/examples/infer/sglang
    b. FP8 quantization supported. Quantization script reference: https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/fp8.sh
  6. Web-UI:
    a. Supports SFT/RLHF/GRPO training on separate tabs, and supports saving the training command line.
    b. Web-UI interface supports data sampling.
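
The trainer_state input added to reward_funcs (GRPO item d above) lets a reward depend on training progress. The sketch below is hypothetical: it assumes the state object exposes global_step and max_steps, as the transformers TrainerState does, and uses a SimpleNamespace stand-in so the example runs without ms-swift. The function name and the penalty schedule are illustrative only.

```python
from types import SimpleNamespace

def length_penalty_reward(completions, trainer_state=None, **kwargs):
    """Score completions with a length penalty that grows with training progress."""
    progress = 0.0
    if trainer_state is not None and trainer_state.max_steps > 0:
        # Fraction of training completed, between 0 and 1.
        progress = trainer_state.global_step / trainer_state.max_steps
    rewards = []
    for text in completions:
        penalty = progress * 0.001 * len(text)
        rewards.append(1.0 - penalty)
    return rewards

# Stand-in for the state object that the trainer would pass in.
state = SimpleNamespace(global_step=50, max_steps=100)
rewards = length_penalty_reward(["a" * 100, "b" * 400], trainer_state=state)
print(rewards)
```

Without a trainer_state the penalty is zero, so the same function still works outside a training loop.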

New Models

  1. Multimodal Models:
    a. ZhipuAI/GLM-4.1V-9B-Thinking series
    b. Kwai-Keye/Keye-VL-8B-Preview
    c. moonshotai/Kimi-VL-A3B-Thinking-2506
    d. google/gemma-3n-E2B-it series
  2. Pure Text Models:
    a. PaddlePaddle/ERNIE-4.5-21B-A3B-PT series
    b. rednote-hilab/dots.llm1.inst series
    c. Tencent-Hunyuan/Hunyuan-A13B-Instruct
    d. MiniMax/MiniMax-M1-80k series (inference)
    e. moonshotai/Kimi-Dev-72B
    f. cognitivecomputations/DeepSeek-R1-0528-AWQ


Patch release v3.5.3

27 Jun 05:12

Patch release v3.5.2

20 Jun 14:49

Patch release v3.5.1

13 Jun 14:24

v3.5.0

08 Jun 16:51

English Version

New Features

  1. GRPO:
    a. Code refactored; the vLLM mode (server or colocate) is now selected with the vllm_mode parameter. For details, refer to the documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html#arguments-and-execution-script:~:text=vllm_mode%20server%20parameter,in%20colocate%20mode.
    b. GRPO long-text optimization with Ulysses sequence parallelism, significantly reducing GPU memory usage during long-text training. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/long_text/sequence_parallel_grpo.sh
    c. Added sync_ref_model parameter to synchronize reference model weights during training.
    d. Supports Liger Kernel Loss via use_liger_kernel parameter, reducing GPU memory consumption.
    e. External mode supports move_model_batches to lower peak GPU memory during ZeRO-3 weight synchronization.
    f. Integrated INTELLECT-2’s Two-Sided Clipping algorithm using the delta parameter.
    g. Supports reward functions returning None, applicable for multi-task training. For details, refer to the documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html#multi-task-training
    h. Internal mode supports vllm_server_base_url for passing external vLLM server URLs.
    i. Plugin extension: Added QwenLong-L1 reward model plugin.
    j. Added steps_per_generation and generation_batch_size parameters for customizing sampling batch size.
    k. Web-UI supports GRPO training.
    l. The following parameters will be removed in v3.6: tensor_parallel_size, vllm_device, vllm_max_num_seqs, num_infer_workers.
  2. Training:
    a. CPT/SFT/DPO/GRPO support padding-free training. By flattening batch data to avoid padding, GPU memory usage is reduced and training speed is improved. Script: https://github.com/modelscope/ms-swift/tree/main/examples/train/padding_free
    b. Multimodal training enhancements: Supports separate learning rates for ViT and Aligner modules via vit_lr and aligner_lr parameters. Added vit_gradient_checkpointing to independently control gradient checkpointing for ViT modules. Benchmark: https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/vit_gradient_checkpointing.sh
    c. CPT/SFT support channel_loss to separately calculate loss for different channel datasets. Thanks to the contributions from the technical team at China Merchants Bank.
    d. CPT/SFT/DPO support use_logits_to_keep to reduce GPU memory usage and accelerate training.
    e. Qwen2.5-VL/Omni support video training by passing image directories.
  3. Inference & Deployment:
    a. Optimized batch processing in swift infer; the new write_batch_size parameter controls how often batch inference results are written to result_path.
    b. vLLM inference engine now defaults to V1 engine and supports hybrid Tensor Parallelism (TP) and Data Parallelism (DP). Script: https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/dp_tp.sh
  4. Megatron-SWIFT:
    a. For non-streaming datasets, train_iters is now computed automatically from max_epochs.
    b. Added extra_megatron_kwargs for passing Megatron parameters that ms-swift does not yet expose.
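
The Two-Sided Clipping item under GRPO above adds an upper cap, delta, on the importance ratio. The sketch below shows the effect on a single token, assuming the formulation from the INTELLECT-2 report (cap the unclipped ratio at delta before taking the usual minimum); the function name and default values are hypothetical, not ms-swift's code.

```python
def clipped_objective(ratio, advantage, epsilon=0.2, delta=None):
    """Per-token clipped surrogate; `delta` optionally caps the unclipped ratio."""
    capped = ratio if delta is None else min(ratio, delta)
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(capped * advantage, clipped * advantage)

# A token whose probability grew 10x but whose advantage is negative.
one_sided = clipped_objective(10.0, -1.0)             # unbounded below
two_sided = clipped_objective(10.0, -1.0, delta=4.0)  # bounded by delta
print(one_sided, two_sided)
```

With a negative advantage, ordinary clipping leaves the objective unbounded as the ratio grows; the cap bounds it, which limits the size of any single update.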

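The padding-free item under Training above flattens a batch instead of padding it. A minimal sketch of that idea on plain Python lists follows; the function name is hypothetical, and the cumulative sequence lengths mirror what variable-length attention kernels typically expect, which is an assumption about the downstream consumer rather than a description of ms-swift internals.

```python
def flatten_batch(sequences):
    """Concatenate variable-length token-id lists into one padding-free row."""
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for seq in sequences:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))  # positions restart for each sequence
        cu_seqlens.append(cu_seqlens[-1] + len(seq))  # running sequence boundaries
    return input_ids, position_ids, cu_seqlens

batch = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33]]
input_ids, position_ids, cu_seqlens = flatten_batch(batch)

# Tokens a padded batch would hold: batch size times the longest sequence.
padded_tokens = len(batch) * max(len(s) for s in batch)
print(len(input_ids), padded_tokens)
print(position_ids)
print(cu_seqlens)
```

In this toy batch the flattened row holds 10 tokens where a padded batch would hold 15, and the boundaries in cu_seqlens keep attention from crossing between sequences.
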
New Models

  1. Qwen/Qwen3-Embedding-0.6B series. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/train/embedding/train_emb.sh
  2. deepseek-ai/DeepSeek-R1-0528-Qwen3-8B series. Best practices: https://mp.weixin.qq.com/s/-hhfGiiGTqXUybwPH525gw
  3. iic/QwenLong-L1-32B
  4. XiaomiMiMo/MiMo-7B-RL-0530 & XiaomiMiMo/MiMo-VL-7B-SFT series
  5. OpenBMB/MiniCPM4-0.5B series
