My question is: what is the difference between these two kinds of int8 quantization?

1. Using `quantize_dynamic` (which quantizes to int8, AFAIU):
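   Roughly like this (a minimal sketch; the paths are placeholders, not my exact files):

   ```python
   # Dynamic quantization: int8 weights, activations quantized on the fly at runtime.
   from onnxruntime.quantization import QuantType, quantize_dynamic

   quantize_dynamic(
       model_input="model_fp32.onnx",   # placeholder path
       model_output="model_quant.onnx",
       weight_type=QuantType.QInt8,
   )
   ```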
2. Using the `MatMulNBits` quantizer, which can also be configured to quantize to int8:
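   Roughly like this — a sketch based on the `matmul_nbits_quantizer` tooling in recent onnxruntime releases; the `bits=8` parameter of `DefaultWeightOnlyQuantConfig` is my assumption for the int8 mode, and exact names may differ between versions:

   ```python
   # Weight-only quantization via MatMulNBits. Module/class names assume a
   # recent onnxruntime release; bits=8 is my assumption for the int8 mode.
   import onnx
   from onnxruntime.quantization.matmul_nbits_quantizer import (
       DefaultWeightOnlyQuantConfig,
       MatMulNBitsQuantizer,
   )

   model = onnx.load("model_fp32.onnx")  # placeholder path
   config = DefaultWeightOnlyQuantConfig(block_size=128, is_symmetric=True, bits=8)
   quantizer = MatMulNBitsQuantizer(model, algo_config=config)
   quantizer.process()
   quantizer.model.save_model_to_file("model_int8.onnx", use_external_data_format=True)
   ```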
Also, I see a large difference in model size (starting from the original Qwen2.5-0.5B fp32 model, which is 2400 MB). Both outputs should be int8, but the sizes differ a lot:

- model_quant.onnx => 610 MB
- model_int8.onnx => 1018 MB
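For context, a back-of-envelope size check (a sketch assuming ~0.5e9 parameters all stored at the stated width, ignoring quantization scales/zero-points and any tensors left in fp32):

```python
# Rough expected sizes if every parameter were stored at the given width.
n_params = 0.5e9  # assumption: ~0.5B parameters in Qwen2.5-0.5B

print(f"fp32 estimate: ~{n_params * 4 / 1e6:.0f} MB")  # ~2000 MB vs. the 2400 MB original
print(f"int8 estimate: ~{n_params * 1 / 1e6:.0f} MB")  # ~500 MB vs. 610 MB / 1018 MB above
```

The pure-int8 estimate lands near model_quant.onnx but well below model_int8.onnx, which is what prompts the question.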