Commit cd8500b

[WIP]Support Q-Galore (modelscope#1440)
1 parent ef31538 commit cd8500b

8 files changed (+98 -7 lines)

README.md

Lines changed: 2 additions & 1 deletion
@@ -55,6 +55,7 @@ You can contact us and communicate with us by adding our group:
 <img src="asset/discord_qr.jpg" width="200" height="200"> | <img src="asset/wechat.png" width="200" height="200">

 ## 🎉 News
+- 2024.07.19: Support [Q-Galore](https://arxiv.org/abs/2407.08296); this algorithm can reduce the training memory cost by 60% (qwen-7b-chat, full, 80G -> 35G). Use `swift sft --model_type xxx --use_galore true --galore_quantization true` to begin!
 - 2024.07.17: Support newly released InternVL2 models: `model_type` are internvl2-1b, internvl2-40b, internvl2-llama3-76b. For best practices, refer to [here](docs/source_en/Multi-Modal/internvl-best-practice.md).
 - 2024.07.17: Support the training and inference of [NuminaMath-7B-TIR](https://huggingface.co/AI-MO/NuminaMath-7B-TIR). Use with model_type `numina-math-7b`.
 - 🔥2024.07.16: Support exporting for ollama and bitsandbytes. Use `swift export --model_type xxx --to_ollama true` or `swift export --model_type xxx --quant_method bnb --quant_bits 4`

@@ -454,7 +455,7 @@ swift sft \
 NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 swift pt \
-    --model_type qwen1half-7b-chat \
+    --model_type qwen1half-7b \
     --dataset chinese_c4#10000 \
     --num_train_epochs 1 \
     --sft_type full \

README_CN.md

Lines changed: 2 additions & 1 deletion
@@ -56,6 +56,7 @@ SWIFT has rich and comprehensive documentation; please check our documentation site:


 ## 🎉 News
+- 🔥2024.07.19: Support the [Q-Galore](https://arxiv.org/abs/2407.08296) algorithm, which can reduce memory usage by about 60% (qwen-7b-chat, full, 80G -> 35G). Use the command line `swift sft --model_type xxx --use_galore true --galore_quantization true` to start training!
 - 2024.07.17: Support new models in the InternVL2 series: `model_type` are internvl2-1b, internvl2-40b, internvl2-llama3-76b. For best practices, see [here](docs/source/Multi-Modal/internvl最佳实践.md).
 - 2024.07.17: Support training and inference of [NuminaMath-7B-TIR](https://www.modelscope.cn/models/AI-ModelScope/NuminaMath-7B-TIR). The model_type to use is `numina-math-7b`.
 - 🔥2024.07.16: Support export for ollama and bitsandbytes. Use the command `swift export --model_type xxx --to_ollama true` or `swift export --model_type xxx --quant_method bnb --quant_bits 4`.

@@ -448,7 +449,7 @@ swift sft \
 NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 swift pt \
-    --model_type qwen1half-7b-chat \
+    --model_type qwen1half-7b \
     --dataset chinese_c4#10000 \
     --num_train_epochs 1 \
     --sft_type full \

docs/source/LLM/命令行参数.md

Lines changed: 7 additions & 0 deletions
@@ -181,6 +181,13 @@
 - `--galore_proj_type: str` : Default `std`. GaLore matrix decomposition type.
 - `--galore_optim_per_parameter: bool` : Default False. Whether to set a separate optimizer for each GaLore target parameter.
 - `--galore_with_embedding: bool` : Default False. Whether to apply GaLore to the embedding.
+- `--galore_quantization`: Whether to use Q-GaLore. Default `False`.
+- `--galore_proj_quant`: Whether to quantize the SVD decomposition matrices. Default `False`.
+- `--galore_proj_bits`: Number of bits for SVD quantization.
+- `--galore_proj_group_size`: Group size for SVD quantization.
+- `--galore_cos_threshold`: Cosine-similarity threshold for updating the projection matrices. Default 0.4.
+- `--galore_gamma_proj`: As the projection matrices gradually become similar, the update interval is lengthened; this parameter is the factor by which the interval grows each time. Default 2.
+- `--galore_queue_size`: Length of the queue used to compute projection-matrix similarity. Default 5.

 ### LISA Fine-tuning Parameters

docs/source_en/LLM/Command-line-parameters.md

Lines changed: 7 additions & 0 deletions
@@ -183,6 +183,13 @@
 - `--galore_proj_type: str` : Default `std`. GaLore matrix decomposition type.
 - `--galore_optim_per_parameter: bool` : Default False. Whether to set a separate optimizer for each GaLore target parameter.
 - `--galore_with_embedding: bool` : Default False. Whether to apply GaLore to the embedding.
+- `--galore_quantization`: Whether to use Q-GaLore. Default `False`.
+- `--galore_proj_quant`: Whether to quantize the SVD decomposition matrices. Default `False`.
+- `--galore_proj_bits`: Number of bits for SVD quantization.
+- `--galore_proj_group_size`: Group size for SVD quantization.
+- `--galore_cos_threshold`: Cosine-similarity threshold for updating the projection matrices. Default 0.4.
+- `--galore_gamma_proj`: As the projection matrices gradually become similar, the update interval is lengthened; this parameter is the factor by which the interval grows each time. Default 2.
+- `--galore_queue_size`: Length of the queue used to compute projection-matrix similarity. Default 5.

 ### LISA Fine-tuning Parameters
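
Each of these flags maps onto a field of `SftArguments` (see `swift/llm/utils/argument.py` further below), so they can also be set programmatically. A minimal sketch, assuming the `SftArguments`/`sft_main` entry points that `swift.llm` exports; the model matches the news entry, the dataset is illustrative, and the `galore_*` values restate the documented defaults:

# Hedged sketch: programmatic equivalent of the CLI flags documented above.
# Assumes swift.llm exports SftArguments/sft_main; the dataset is illustrative.
from swift.llm import SftArguments, sft_main

args = SftArguments(
    model_type='qwen-7b-chat',   # model quoted in the Q-Galore news entry
    dataset=['alpaca-zh#2000'],  # illustrative dataset choice
    sft_type='full',
    use_galore=True,
    galore_quantization=True,    # enable Q-GaLore
    galore_proj_quant=True,      # additionally quantize the SVD projections
    galore_proj_bits=4,
    galore_proj_group_size=256,
    galore_cos_threshold=0.4,
    galore_gamma_proj=2,
    galore_queue_size=5,
)
sft_main(args)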

scripts/benchmark/config/tuner.json

Lines changed: 33 additions & 0 deletions
@@ -138,6 +138,39 @@
       "sft_type": "full"
     }
   },
+  {
+    "name": "full+galore128+quantize",
+    "requirements":{
+      "gpu": "1",
+      "ddp": "1"
+    },
+    "args": {
+      "sft_type": "full",
+      "use_galore": "true",
+      "galore_rank": "128",
+      "galore_update_proj_gap": "200",
+      "galore_optim_per_parameter": "false",
+      "galore_with_embedding": "false",
+      "galore_quantization": "true"
+    }
+  },
+  {
+    "name": "full+galore128+quantize+proj_quant",
+    "requirements":{
+      "gpu": "1",
+      "ddp": "1"
+    },
+    "args": {
+      "sft_type": "full",
+      "use_galore": "true",
+      "galore_rank": "128",
+      "galore_update_proj_gap": "200",
+      "galore_optim_per_parameter": "false",
+      "galore_with_embedding": "false",
+      "galore_quantization": "true",
+      "galore_proj_quant": "true"
+    }
+  },
   {
     "name": "full+galore128",
     "requirements":{

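Each benchmark entry's "args" dict is a flat name/value mapping onto `swift sft` flags. A sketch of that flattening, purely for illustration (an assumption about the runner, not the benchmark script's actual code):

# Hypothetical illustration: how an entry's "args" would flatten into CLI flags.
entry_args = {
    'sft_type': 'full',
    'use_galore': 'true',
    'galore_rank': '128',
    'galore_update_proj_gap': '200',
    'galore_quantization': 'true',
}
flags = ' '.join(f'--{key} {value}' for key, value in entry_args.items())
print(f'swift sft {flags}')
# -> swift sft --sft_type full --use_galore true --galore_rank 128 --galore_update_proj_gap 200 --galore_quantization true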
swift/llm/tuner.py

Lines changed: 7 additions & 0 deletions
@@ -284,6 +284,13 @@ def prepare_model(model, args: SftArguments):
             galore_scale=args.galore_scale,
             proj_type=args.galore_proj_type,
             optim_per_parameter=args.galore_optim_per_parameter,
+            quantize=args.galore_quantization,
+            proj_quant=args.galore_proj_quant,
+            proj_bits=args.galore_proj_bits,
+            proj_group_size=args.galore_proj_group_size,
+            cos_threshold=args.galore_cos_threshold,
+            gamma_proj=args.galore_gamma_proj,
+            queue_size=args.galore_queue_size,
         )

     callbacks = []
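
For readers driving the trainer utilities directly rather than going through `swift sft`, the block above amounts to building a `GaLoreConfig` by hand. A sketch, assuming the fields not shown in the dataclass hunk below (rank, target modules, and so on) keep their dataclass defaults:

# Hedged sketch: the config object prepare_model() assembles above.
# Only fields confirmed by this commit's GaLoreConfig hunk are set;
# everything else is assumed to keep its dataclass default.
from swift.trainers.optimizers.galore.utils import GaLoreConfig

config = GaLoreConfig(
    optim_per_parameter=False,  # Q-GaLore does not support per-parameter optimizers
    quantize=True,              # --galore_quantization
    proj_quant=True,            # --galore_proj_quant
    proj_bits=4,                # --galore_proj_bits
    proj_group_size=256,        # --galore_proj_group_size
    cos_threshold=0.4,          # --galore_cos_threshold
    gamma_proj=2,               # --galore_gamma_proj
    queue_size=5,               # --galore_queue_size
)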

swift/llm/utils/argument.py

Lines changed: 7 additions & 0 deletions
@@ -561,6 +561,13 @@ class SftArguments(ArgumentsBase):
     galore_proj_type: str = 'std'
     galore_optim_per_parameter: bool = False
     galore_with_embedding: bool = False
+    galore_quantization: bool = False
+    galore_proj_quant: bool = False
+    galore_proj_bits: int = 4
+    galore_proj_group_size: int = 256
+    galore_cos_threshold: float = 0.4
+    galore_gamma_proj: int = 2
+    galore_queue_size: int = 5

     # adalora
     adalora_target_r: int = 8

swift/trainers/optimizers/galore/utils.py

Lines changed: 33 additions & 5 deletions
@@ -1,4 +1,5 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
+import importlib
 from dataclasses import dataclass
 from typing import Any, Dict, List, Tuple, Union

@@ -41,6 +42,13 @@ class GaLoreConfig:
     galore_scale: float = 1.0
     proj_type: str = 'std'
     optim_per_parameter: bool = False
+    quantize: bool = False
+    proj_quant: bool = False
+    proj_bits: int = 4
+    proj_group_size: int = 256
+    cos_threshold: float = 0.4
+    gamma_proj: int = 2
+    queue_size: int = 5


 class GaloreOptimizerWrapper(Optimizer):

@@ -82,6 +90,7 @@ def create_optimizer_and_scheduler(model: nn.Module, args: TrainingArguments, co

             logger.info(f'Enable GaLore for weights in module: {module_name}')
             galore_params.append(module.weight)
+
     id_galore_params = [id(p) for p in galore_params]
     galore_defaults = {
         'rank': config.rank,

@@ -90,9 +99,17 @@ def create_optimizer_and_scheduler(model: nn.Module, args: TrainingArguments, co
         'proj_type': config.proj_type,
         **defaults
     }
-    optim_cls, optim_kwargs = get_optimizer(args)
-
-    if config.optim_per_parameter:
+    if config.quantize:
+        galore_defaults['quant'] = config.proj_quant
+        galore_defaults['quant_n_bit'] = config.proj_bits
+        galore_defaults['quant_group_size'] = config.proj_group_size
+        galore_defaults['cos_threshold'] = config.cos_threshold
+        galore_defaults['gamma_proj'] = config.gamma_proj
+        galore_defaults['queue_size'] = config.queue_size
+    optim_cls, optim_kwargs = get_optimizer(args, config)
+
+    if config.optim_per_parameter and not config.quantize:
+        # q-galore does not support optim_per_parameter
         optimizer_dict = {}
         galore_defaults['update_proj_gap'] = galore_defaults['update_proj_gap'] * 2
         for p in model.parameters():

@@ -150,7 +167,7 @@ def create_optimizer_and_scheduler(model: nn.Module, args: TrainingArguments, co
     return optim, scheduler


-def get_optimizer(args: TrainingArguments) -> Tuple[Any, Any]:
+def get_optimizer(args: TrainingArguments, config: GaLoreConfig) -> Tuple[Any, Any]:
     # parse args.optim_args
     optim_args = {}
     if args.optim_args:

@@ -169,7 +186,18 @@ def get_optimizer(args: TrainingArguments) -> Tuple[Any, Any]:
         optimizer_cls = GaLoreAdafactor
         optimizer_kwargs.update({'scale_parameter': False, 'relative_step': False})
     elif args.optim in ('adamw_hf', 'adamw_torch'):
-        from .adamw import GaLoreAdamW
+        if config.quantize:
+            assert importlib.util.find_spec("q_galore_torch") is not None, \
+                'Please install q-galore by `pip install q_galore_torch`'
+            from swift.utils import get_dist_setting
+            _, _, world_size, _ = get_dist_setting()
+            if world_size > 1:
+                # from q_galore_torch import QGaLoreAdamW8bit_simulate as GaLoreAdamW
+                from q_galore_torch import QGaLoreAdamW8bit as GaLoreAdamW
+            else:
+                from q_galore_torch import QGaLoreAdamW8bit as GaLoreAdamW
+        else:
+            from .adamw import GaLoreAdamW
         optimizer_cls = GaLoreAdamW
         optimizer_kwargs.update(adam_kwargs)
     elif 'adamw' in args.optim and '8bit' in args.optim:
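
Read together, the three similarity parameters forwarded here describe an adaptive schedule: the optimizer tracks the last `queue_size` projection matrices, and once their cosine similarity stays above `cos_threshold`, refreshing the projection is treated as wasted work and the refresh interval is stretched by `gamma_proj`. A toy illustration of that schedule under this commit's defaults (not `q_galore_torch`'s actual code):

# Toy illustration of the adaptive projection-refresh schedule described by
# cos_threshold / gamma_proj / queue_size. NOT q_galore_torch's code.
from typing import List

def next_update_gap(gap: int, recent_cos_sims: List[float],
                    cos_threshold: float = 0.4, gamma_proj: int = 2,
                    queue_size: int = 5) -> int:
    """Steps until the next projection (SVD) refresh."""
    window = recent_cos_sims[-queue_size:]
    if len(window) == queue_size and min(window) > cos_threshold:
        # Projections barely change any more: refresh them less often.
        return gap * gamma_proj
    return gap

gap = 200  # galore_update_proj_gap
gap = next_update_gap(gap, [0.52, 0.61, 0.55, 0.58, 0.60])
print(gap)  # 400: every queued similarity exceeded the 0.4 threshold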
