| **[W8A8-INT8](../examples/quantization_w8a8_int8.md)** | 8-bit integer (INT8) quantization for weights and activations, providing ~2X smaller weights with 8-bit arithmetic operations. Weights are compressed to 8 bits with GPTQ using channel-wise (or per-tensor) quantization; activations are quantized to 8 bits dynamically per token during inference on vLLM. Static and asymmetric activation quantization are also supported. Good for general performance and compression, especially for server and batch inference, and for speedups in high-QPS regimes or offline serving on vLLM. | Requires a calibration dataset for weight quantization and for static per-tensor activation quantization. | Supports all NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators. Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older). |
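As a rough illustration of how this scheme is applied, here is a minimal sketch using llm-compressor's `oneshot` API with a GPTQ recipe. The model ID, dataset name, output directory, and calibration sizes below are illustrative assumptions, not values taken from this document:

```python
# Minimal sketch of W8A8-INT8 quantization with llm-compressor's oneshot API.
# Model ID, dataset, and calibration settings here are illustrative assumptions.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # hypothetical example model

# SmoothQuant shifts activation outliers into the weights before GPTQ applies
# channel-wise INT8 weight quantization; activations are then quantized to
# INT8 dynamically per token at inference time (the "W8A8" scheme).
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# A calibration dataset is required for weight quantization, per the table above.
oneshot(
    model=MODEL_ID,
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-W8A8-INT8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The saved checkpoint can then be loaded directly by vLLM, which performs the dynamic per-token activation quantization during inference.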