26 changes: 18 additions & 8 deletions README.md
@@ -120,14 +120,6 @@
- [**Online Conversion**](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/online_conversion_zh): Colab users can use the notebook provided by this project to convert and quantize models online
- [**Manual Conversion**](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/manual_conversion_zh): Offline conversion that generates models in different formats for quantization or further fine-tuning

Below are the sizes of the full models in FP16 precision and after 4-bit quantization. If you choose manual merging, make sure your machine has enough memory and disk space.

| Model Version | 7B |
| :-------------------- | :-----: |
| FP16 model | 12.9 GB |
| 8-bit quantized model | 6.8 GB |
| 4-bit quantized model | 3.7 GB |


## Inference and Deployment

@@ -181,6 +173,24 @@ Comparison among the Alpaca series models:

For C-Eval inference code, please refer to this project's >>> [📚 GitHub Wiki](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/ceval_zh)

### Quantization Evaluation

Taking Chinese-LLaMA-2-7B as an example, we compare model size, PPL (perplexity), and C-Eval performance under different precisions to help users understand the accuracy loss introduced by quantization. PPL is computed with a 4K context, and C-Eval reports zero-shot and 5-shot results on the valid set.

| Precision | Model Size | PPL | C-Eval |
| :-------------- | :--------: | :----: | :---------: |
| FP16 | 12.9 GB | 8.1797 | 28.2 / 36.0 |
| 8-bit quantized | 6.8 GB | 8.2884 | 26.8 / 35.4 |
| 4-bit quantized | 3.7 GB | 8.8581 | 25.5 / 32.8 |

In particular, below are evaluation results for the different quantization methods in llama.cpp, for reference; speed is measured in ms/tok. See the [Wiki](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/llamacpp_zh#关于量化方法选择及推理速度) for details.

| | F16 | Q4_0 | Q4_1 | Q4_K | Q5_0 | Q5_1 | Q5_K | Q6_K | Q8_0 |
| --------- | -----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
| PPL | 8.640 | 8.987 | 9.175 | 8.836 | 8.730 | 8.776 | 8.707 | 8.671 | 8.640 |
| Size | 12.91G | 3.69G | 4.08G | 3.92G | 4.47G | 4.86G | 4.59G | 5.30G | 6.81G |
| CPU Speed | 117 | 39 | 44 | 43 | 48 | 51 | 50 | 54 | 65 |
| GPU Speed | 53 | 17 | 18 | 20 | n/a | n/a | 25 | 26 | n/a |

## Training and Fine-tuning

27 changes: 19 additions & 8 deletions README_EN.md
@@ -115,14 +115,6 @@ As the LoRA models cannot be used separately, they must be merged with the original
- [**Online Conversion**](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/online_conversion_en): Colab users can use the notebook provided by this project for online conversion and model quantization
- [**Manual Conversion**](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/manual_conversion_en): Offline method of conversion, generating different formats of models for quantization or further fine-tuning (a rough sketch of the merge step follows below)
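
To make the merge step concrete, here is a minimal sketch of what a manual merge boils down to, using Hugging Face transformers and peft. The model paths are placeholders, and the project's own conversion scripts (linked above) remain the authoritative procedure; they additionally handle the extended Chinese tokenizer, which this sketch glosses over.

```python
# Minimal sketch (not the project's actual script): fold a Chinese-Alpaca-2 LoRA
# into the base Llama-2 weights. Paths below are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # original Llama-2 base weights
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
# Note: the real scripts also resize the token embeddings for the expanded
# Chinese vocabulary before applying the LoRA; that step is omitted here.
lora = PeftModel.from_pretrained(base, "path/to/chinese-alpaca-2-lora-7b")
merged = lora.merge_and_unload()           # bake the LoRA deltas into the weights
merged.save_pretrained("chinese-alpaca-2-7b-merged", safe_serialization=True)
```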

Below are the sizes of the full models in FP16 precision and 4-bit quantization. If you choose manual conversion, please ensure that your machine has enough memory and disk space.

| Model Version | 7B |
| :-------------------- | :-----: |
| FP16 Model | 12.9 GB |
| 8-bit Quantized Model | 6.8 GB |
| 4-bit Quantized Model | 3.7 GB |

## Inference and Deployment

The models in this project mainly support the following quantization, inference, and deployment methods.
@@ -174,6 +166,25 @@ It is important to note that the comprehensive assessment of the capabilities of

For C-Eval inference code, please refer to >>> [📚 GitHub Wiki](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/ceval_en)

### Quantization Evaluation

To understand the quality loss introduced by quantization, we take Chinese-LLaMA-2-7B as an example and report the model size, PPL, and C-Eval results under different quantization levels. PPL is calculated with a 4K context, and we report zero-shot and 5-shot results on the C-Eval valid set.

| Precision | Model Size | PPL | C-Eval |
| :-------- | :--------: | :----: | :---------: |
| FP16 | 12.9 GB | 8.1797 | 28.2 / 36.0 |
| 8-bit | 6.8 GB | 8.2884 | 26.8 / 35.4 |
| 4-bit | 3.7 GB | 8.8581 | 25.5 / 32.8 |
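
To make the PPL column reproducible in spirit, below is a minimal sketch of a fixed-window perplexity computation at a 4K context with transformers. The model path, corpus file, and the 8-bit loading flag are illustrative assumptions, not the project's exact evaluation setup.

```python
# Sketch: fixed-window perplexity at a 4K context. Placeholders throughout;
# 8-bit loading requires the bitsandbytes package (drop the flag for FP16).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/chinese-llama-2-7b")
model = AutoModelForCausalLM.from_pretrained(
    "path/to/chinese-llama-2-7b",
    device_map="auto",
    load_in_8bit=True,
)

text = open("eval_corpus.txt", encoding="utf-8").read()  # any held-out text
ids = tok(text, return_tensors="pt").input_ids[0]

ctx, nll, n_tokens = 4096, 0.0, 0
for i in range(0, ids.size(0), ctx):
    chunk = ids[i : i + ctx].unsqueeze(0).to(model.device)
    if chunk.size(1) < 2:                 # need at least one next-token target
        break
    with torch.no_grad():
        out = model(chunk, labels=chunk)  # loss = mean NLL over shifted tokens
    nll += out.loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print(f"PPL: {torch.exp(torch.tensor(nll / n_tokens)).item():.4f}")
```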

Specifically, the following are benchmark results for different quantization methods in llama.cpp; speed is reported in ms/tok. For details, see our [Wiki](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/llamacpp_en#quantization-method-and-inference-speed).

| | F16 | Q4_0 | Q4_1 | Q4_K | Q5_0 | Q5_1 | Q5_K | Q6_K | Q8_0 |
| --------- | -----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
| PPL | 8.640 | 8.987 | 9.175 | 8.836 | 8.730 | 8.776 | 8.707 | 8.671 | 8.640 |
| Size | 12.91G | 3.69G | 4.08G | 3.92G | 4.47G | 4.86G | 4.59G | 5.30G | 6.81G |
| CPU Speed | 117 | 39 | 44 | 43 | 48 | 51 | 50 | 54 | 65 |
| GPU Speed | 53 | 17 | 18 | 20 | n/a | n/a | 25 | 26 | n/a |
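
As a quick way to try one of these quantization levels end to end, the sketch below loads a quantized file through the third-party llama-cpp-python bindings. The file name is a placeholder (produced by llama.cpp's quantize tool; at the time of this change the project distributed ggml .bin files), and the prompt is merely illustrative.

```python
# Sketch: run a Q5_K-quantized model via llama-cpp-python
# (pip install llama-cpp-python). The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="zh-alpaca-2-7b.q5_K.bin",  # output of llama.cpp quantization
    n_ctx=4096,                            # match the 4K context used above
)
out = llm("请列举中国的四大发明。", max_tokens=128)
print(out["choices"][0]["text"])
```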

## Training and Fine-tuning

Please refer to the corresponding Wiki for information on pre-training (Chinese LLaMA-2 training) and instruction fine-tuning (Chinese Alpaca-2 training).