
Commit b345391

Dipika's comment
committed
1 parent 760bfdd commit b345391

File tree

1 file changed (+1, -1 lines changed)


docs/guides/compression_schemes.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ Post-training quantization is performed to reduce the precision of quantizable w
| Scheme | Description | Data required? | vLLM Hardware Compatibility |
|--------|-------------|----------------|-----------------------------|
| **[W8A8-FP8](../examples/quantization_w8a8_fp8.md)** | 8-bit floating point (FP8) quantization for weights and activations, providing ~2X smaller weights with 8-bit arithmetic operations. Uses channel-wise quantization to compress weights to 8 bits, and dynamic per-token or static per-tensor quantization to compress activations to 8 bits. Weight scales can be generated on a per-channel or per-tensor basis. Channel-wise weight quantization with dynamic per-token activations is the most performant option. Activation quantization is carried out during inference on vLLM. Good for general performance and compression, especially for server and batch inference. | No calibration dataset is required, unless you are doing static per-tensor activation quantization. | Latest NVIDIA GPUs (Hopper and later) and latest AMD GPUs. Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell). |
-| **[W8A8-INT8](../examples/quantization_w8a8_int8.md)** | 8-bit integer (INT8) quantization for weights and activations, providing ~2X smaller weights with 8-bit arithmetic operations. Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and uses dynamic per-token quantization to compress activations to 8 bits. Weight quantization can be both per-tensor or per-channel for INT8. W8A8-INT8 is good for general performance and compression, especially for server and batch inference. Activation quantization is carried out during inference on vLLM. Activations can also be static or dynamic. Additionally, INT8 activations can also be asymmetric. W8A8-INT8 helps improve speed in high QPS scenarios or during offline serving with vLLM. | Requires calibration dataset for weight quantization. | Supports all NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators. Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older). | W8A8-INT8 is good for general performance and compression, especially for server and batch inference. Activation quantization is carried out during inference on vLLM. Useful for speed ups in high QPS regimes or offline serving on vLLM. | Requires calibration dataset for weight quantization and static per-tensor activation quantization. | Supports all NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators. Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older). |
+| **[W8A8-INT8](../examples/quantization_w8a8_int8.md)** | 8-bit integer (INT8) quantization for weights and activations, providing ~2X smaller weights with 8-bit arithmetic operations. Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and dynamic per-token quantization to compress activations to 8 bits. Weight quantization can be either per-tensor or per-channel for INT8. Activation quantization is carried out during inference on vLLM; activations can be static or dynamic, and INT8 activations can also be asymmetric. W8A8-INT8 is good for general performance and compression, especially for server and batch inference, and helps improve speed in high QPS scenarios or during offline serving with vLLM. | Requires a calibration dataset for weight quantization and static per-tensor activation quantization. | Supports all NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators. Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older). |
| **[W4A16](../examples/quantization_w4a16/README.md)** | Quantizes only weights to 4-bit integer (INT4) precision, retaining activations in 16-bit floating point (FP16) precision. W4A16 provides ~3.7X smaller weights but requires 16-bit arithmetic operations. W4A16 also supports asymmetric weight quantization. W4A16 provides maximum compression for latency-sensitive applications with limited memory, and useful speedups in low QPS regimes with more weight compression. The linked example leverages the GPTQ algorithm to decrease quantization loss, but other algorithms like [AWQ](../examples/awq/awq_one_shot.py) can also be leveraged for W4A16 quantization. Recommended for all GPU types. | Requires a calibration dataset. | All NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators |
| **W8A16** | Quantizes weights to 8-bit integer (INT8) precision, retaining activations in 16-bit floating point (FP16) precision. W8A16 delivers a smaller model than FP32 and faster inference on hardware with native 8-bit integer units, with lower power and memory bandwidth compared to floating-point weights. | Requires a calibration dataset. | All NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators |
| **NVFP4** | 4-bit floating point encoding format introduced with the NVIDIA Blackwell GPU architecture. NVFP4 maintains numerical accuracy across a wide dynamic range of tensor values by using high-precision scale encoding and a two-level micro-block scaling strategy. NVFP4 compression generates a global scale for each tensor, along with local quantization scales for groups of 16 elements. Global scale and local quantization scales are generated for weights and activations. You cannot change the group size. | Requires a calibration dataset. | All NVIDIA Blackwell GPUs or later |
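
A minimal sketch of the data-free path described in the W8A8-FP8 row above, assuming a recent llm-compressor release; the model ID and save directory are placeholders, and the `oneshot` import path has varied across versions, so adjust it to match your installation.

```python
# Sketch: W8A8-FP8 with channel-wise weight scales and dynamic per-token activations.
# No calibration data is needed; activation scales are computed at runtime in vLLM.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize every Linear layer except lm_head to FP8 weights and FP8 dynamic
# per-token activations -- the most performant option called out in the table.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)  # no dataset argument for the dynamic scheme

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"  # placeholder output location
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```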
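
For the rows that do require calibration data (W8A8-INT8 and W4A16, both routed through GPTQ in the linked examples), a similar sketch; the dataset name, sequence length, and sample count are illustrative placeholders, and swapping the `scheme` string (for example to "W8A8") follows the same flow where that preset is available in your llm-compressor/compressed-tensors version.

```python
# Sketch: GPTQ-based W4A16 weight-only quantization with a calibration dataset.
from llmcompressor import oneshot  # import path varies by release
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16"     # placeholder output location

# 4-bit weights, 16-bit activations; "W8A8" would target the INT8 row instead.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=MODEL_ID,
    dataset="open_platypus",        # placeholder built-in calibration dataset
    recipe=recipe,
    max_seq_length=2048,            # placeholder calibration settings
    num_calibration_samples=512,
    output_dir=SAVE_DIR,
)
```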
