llama.cpp quantization options and inference speed comparison #195

ymcui · 2023-04-21T13:11:07Z

ymcui
Apr 21, 2023
Maintainer

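llama.cpp provides several quantization methods. The table below lists the quantization parameters supported in the latest version at the time of writing, with a comparison of each, for reference.

This has also been added to the Wiki: https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/llama.cpp量化部署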

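About the quantization parameter

The last argument to the quantization program ./quantize selects the quantization method. Its default value is 2, which means q4_0. The table below compares the other methods. The tests used the default -t setting (default: 4), the inference model was Chinese Alpaca-7B, and the test machine was an M1 Max. For more on the quantization parameters, see llama.cpp#PPL.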

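Parameter	Quantization method	Inference speed	Model size	PPL on small sample set	Notes
2	q4_0	57ms/token	4.31G	25.7	Default
3	q4_1	102ms/token	5.17G	24.5	-
5 (ARM only)	q4_2	85ms/token	4.31G	24.8	Experimental; wait for a stable release
6	q4_3	156ms/token	5.17G	22.9	Experimental; wait for a stable release
-	f16	88ms/token	13.77G	21.8	Not quantized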
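
To make the trade-offs in the table easier to compare, the figures can be normalized against the f16 baseline. The following is a minimal sketch that only copies the numbers from the table above; the names `ROWS` and `relative_to_f16` are illustrative and are not part of llama.cpp.

```python
# Figures copied from the table above (Chinese Alpaca-7B, M1 Max, -t 4).
# Each entry: method -> (ms per token, model size in GB, PPL on the small sample set)
ROWS = {
    "q4_0": (57, 4.31, 25.7),
    "q4_1": (102, 5.17, 24.5),
    "q4_2": (85, 4.31, 24.8),
    "q4_3": (156, 5.17, 22.9),
    "f16": (88, 13.77, 21.8),
}


def relative_to_f16(method):
    """Return (size ratio, speed ratio, PPL increase) of a method versus f16."""
    ms, size, ppl = ROWS[method]
    base_ms, base_size, base_ppl = ROWS["f16"]
    # A speed ratio above 1 means the method is faster than f16.
    return size / base_size, base_ms / ms, ppl - base_ppl


if __name__ == "__main__":
    for method in ROWS:
        size_ratio, speedup, ppl_delta = relative_to_f16(method)
        print(f"{method}: {size_ratio:.0%} of f16 size, "
              f"{speedup:.2f}x f16 speed, PPL +{ppl_delta:.1f}")
```

With these numbers, q4_0 comes out at about 31% of the f16 size and roughly 1.5x the f16 speed, at a cost of about 3.9 PPL points on this small sample set.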

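About inference speed of quantized models

For speed, a larger -t value is not always better; it should be matched to your processor. The table below compares inference speed on an M1 Max (8 performance cores, 2 efficiency cores). Speed is best when -t equals the number of performance cores (8); beyond that, inference gets slower instead.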

-t value	Inference speed (7B-q4_0)	Inference speed (13B-q4_0)
1	230ms/token	434ms/token
2	110ms/token	208ms/token
4	58ms/token	111ms/token
6	44ms/token	80ms/token
8	36ms/token	64ms/token
10	112ms/token	202ms/token
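
The scaling behaviour in the table can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers above; `TIMINGS_MS`, `best_threads`, and `parallel_efficiency` are hypothetical helper names, not llama.cpp APIs.

```python
# ms/token copied from the table above (M1 Max), keyed by the -t value.
TIMINGS_MS = {
    "7B-q4_0": {1: 230, 2: 110, 4: 58, 6: 44, 8: 36, 10: 112},
    "13B-q4_0": {1: 434, 2: 208, 4: 111, 6: 80, 8: 64, 10: 202},
}


def best_threads(model):
    """Return the -t value with the lowest ms/token for the given model."""
    timings = TIMINGS_MS[model]
    return min(timings, key=timings.get)


def parallel_efficiency(model, threads):
    """Speedup over one thread divided by the thread count (1.0 = ideal)."""
    timings = TIMINGS_MS[model]
    return timings[1] / timings[threads] / threads


if __name__ == "__main__":
    for model in TIMINGS_MS:
        t = best_threads(model)
        print(f"{model}: fastest at -t {t}, "
              f"{1000 / TIMINGS_MS[model][t]:.1f} tokens/s, "
              f"efficiency {parallel_efficiency(model, t):.0%}")
```

For both models the minimum is at -t 8, and the efficiency at -t 10 falls to roughly 20%, which matches the slowdown described above.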

sunyuhan19981208 · 2023-05-17T09:26:57Z

sunyuhan19981208
May 17, 2023

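Wow, it's that fast on a Mac. I'm using a V100 with all 40 layers loaded onto the GPU, and it's still not as fast as yours. I don't know which option I've got wrong.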

2 replies

ymcui May 17, 2023
Maintainer Author

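llama.cpp is specifically optimized for ARM NEON, which is why some people find that a server CPU with dozens of cores is still slower than an Apple M-series chip.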

xiebaiyuan Jul 11, 2025

There's actually one more factor: with Apple silicon's unified memory architecture, memory bandwidth is used much more effectively.
