26 changes: 18 additions & 8 deletions README.md
@@ -120,14 +120,6 @@
- [**Online Conversion**](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/online_conversion_zh): Colab users can use the notebook provided by this project to convert and quantize models online
- [**Manual Conversion**](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/manual_conversion_zh): Offline conversion that generates models in different formats for quantization or further fine-tuning

Below are the sizes of the full models in FP16 precision and after 4-bit quantization. If you choose manual merging, make sure your machine has enough memory and disk space.

| Model Version | 7B |
| :-------------------- | :-----: |
| FP16 model | 12.9 GB |
| 8-bit quantized model | 6.8 GB |
| 4-bit quantized model | 3.7 GB |


## Inference and Deployment

@@ -181,6 +173,24 @@ Comparison among the Alpaca series models:

For C-Eval inference code, please refer to this project's >>> [📚 GitHub Wiki](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/ceval_zh)

### Quantization Evaluation

Taking Chinese-LLaMA-2-7B as an example, we compare model size, PPL (perplexity), and C-Eval performance under different precisions to help users understand the accuracy loss introduced by quantization. PPL is computed with a 4K context, and C-Eval reports zero-shot and 5-shot results on the valid set.

| Precision | Model Size | PPL | C-Eval |
| :-------------- | :--------: | :----: | :---------: |
| FP16 | 12.9 GB | 8.1797 | 28.2 / 36.0 |
| 8-bit quantized | 6.8 GB | 8.2884 | 26.8 / 35.4 |
| 4-bit quantized | 3.7 GB | 8.8581 | 25.5 / 32.8 |

In particular, below are evaluation results for the different quantization methods in llama.cpp, for reference; speed is measured in ms/tok. See the [Wiki](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/llamacpp_zh#关于量化方法选择及推理速度) for details.

| | F16 | Q4_0 | Q4_1 | Q4_K | Q5_0 | Q5_1 | Q5_K | Q6_K | Q8_0 |
| --------- | -----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
| PPL | 8.640 | 8.987 | 9.175 | 8.836 | 8.730 | 8.776 | 8.707 | 8.671 | 8.640 |
| Size | 12.91G | 3.69G | 4.08G | 3.92G | 4.47G | 4.86G | 4.59G | 5.30G | 6.81G |
| CPU Speed | 117 | 39 | 44 | 43 | 48 | 51 | 50 | 54 | 65 |
| GPU Speed | 53 | 17 | 18 | 20 | n/a | n/a | 25 | 26 | n/a |

## Training and Fine-tuning

27 changes: 19 additions & 8 deletions README_EN.md
@@ -115,14 +115,6 @@ As the LoRA models cannot be used separately, they must be merged with the original
- [**Online Conversion**](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/online_conversion_en): Colab users can use the notebook provided by this project for online conversion and model quantization
- [**Manual Conversion**](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/manual_conversion_en): Offline method of conversion, generating different formats of models for quantization or further fine-tuning (a rough sketch of the merge step follows below)
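
To make the merge step concrete, here is a minimal sketch of what a manual merge boils down to, using Hugging Face transformers and peft. The model paths are placeholders, and the project's own conversion scripts (linked above) remain the authoritative procedure; they additionally handle the extended Chinese tokenizer, which this sketch glosses over.

```python
# Minimal sketch (not the project's actual script): fold a Chinese-Alpaca-2 LoRA
# into the base Llama-2 weights. Paths below are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # original Llama-2 base weights
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
# Note: the real scripts also resize the token embeddings for the expanded
# Chinese vocabulary before applying the LoRA; that step is omitted here.
lora = PeftModel.from_pretrained(base, "path/to/chinese-alpaca-2-lora-7b")
merged = lora.merge_and_unload()           # bake the LoRA deltas into the weights
merged.save_pretrained("chinese-alpaca-2-7b-merged", safe_serialization=True)
```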

Below are the sizes of the full models in FP16 precision and 4-bit quantization. If you choose manual conversion, please ensure that your machine has enough memory and disk space.

| Model Version | 7B |
| :-------------------- | :-----: |
| FP16 Model | 12.9 GB |
| 8-bit Quantized Model | 6.8 GB |
| 4-bit Quantized Model | 3.7 GB |

## Inference and Deployment

The models in this project mainly support the following quantization, inference, and deployment methods.
@@ -174,6 +166,25 @@ It is important to note that the comprehensive assessment of the capabilities of

For C-Eval inference code, please refer to >>> [📚 GitHub Wiki](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/ceval_en)

### Quantization Evaluation

To understand the quality loss introduced by quantization, we take Chinese-LLaMA-2-7B as an example and report the model size, PPL, and C-Eval results under different quantization levels. PPL is calculated with a 4K context, and we report zero-shot and 5-shot results on the C-Eval valid set.

| Precision | Model Size | PPL | C-Eval |
| :-------- | :--------: | :----: | :---------: |
| FP16 | 12.9 GB | 8.1797 | 28.2 / 36.0 |
| 8-bit | 6.8 GB | 8.2884 | 26.8 / 35.4 |
| 4-bit | 3.7 GB | 8.8581 | 25.5 / 32.8 |
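
To make the PPL column reproducible in spirit, below is a minimal sketch of a fixed-window perplexity computation at a 4K context with transformers. The model path, corpus file, and the 8-bit loading flag are illustrative assumptions, not the project's exact evaluation setup.

```python
# Sketch: fixed-window perplexity at a 4K context. Placeholders throughout;
# 8-bit loading requires the bitsandbytes package (drop the flag for FP16).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/chinese-llama-2-7b")
model = AutoModelForCausalLM.from_pretrained(
    "path/to/chinese-llama-2-7b",
    device_map="auto",
    load_in_8bit=True,
)

text = open("eval_corpus.txt", encoding="utf-8").read()  # any held-out text
ids = tok(text, return_tensors="pt").input_ids[0]

ctx, nll, n_tokens = 4096, 0.0, 0
for i in range(0, ids.size(0), ctx):
    chunk = ids[i : i + ctx].unsqueeze(0).to(model.device)
    if chunk.size(1) < 2:                 # need at least one next-token target
        break
    with torch.no_grad():
        out = model(chunk, labels=chunk)  # loss = mean NLL over shifted tokens
    nll += out.loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print(f"PPL: {torch.exp(torch.tensor(nll / n_tokens)).item():.4f}")
```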

Specifically, the following are benchmark results for different quantization methods in llama.cpp; speed is reported in ms/tok. For details, see our [Wiki](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/llamacpp_en#quantization-method-and-inference-speed).

| | F16 | Q4_0 | Q4_1 | Q4_K | Q5_0 | Q5_1 | Q5_K | Q6_K | Q8_0 |
| --------- | -----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
| PPL | 8.640 | 8.987 | 9.175 | 8.836 | 8.730 | 8.776 | 8.707 | 8.671 | 8.640 |
| Size | 12.91G | 3.69G | 4.08G | 3.92G | 4.47G | 4.86G | 4.59G | 5.30G | 6.81G |
| CPU Speed | 117 | 39 | 44 | 43 | 48 | 51 | 50 | 54 | 65 |
| GPU Speed | 53 | 17 | 18 | 20 | n/a | n/a | 25 | 26 | n/a |
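
As a quick way to try one of these quantization levels end to end, the sketch below loads a quantized file through the third-party llama-cpp-python bindings. The file name is a placeholder (produced by llama.cpp's quantize tool; at the time of this change the project distributed ggml .bin files), and the prompt is merely illustrative.

```python
# Sketch: run a Q5_K-quantized model via llama-cpp-python
# (pip install llama-cpp-python). The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="zh-alpaca-2-7b.q5_K.bin",  # output of llama.cpp quantization
    n_ctx=4096,                            # match the 4K context used above
)
out = llm("请列举中国的四大发明。", max_tokens=128)
print(out["choices"][0]["text"])
```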

## Training and Fine-tuning

Please refer to the corresponding Wiki for information on pre-training (Chinese LLaMA-2 training) and instruction fine-tuning (Chinese Alpaca-2 training).