Big updates have landed in LLM Compressor! Check out these exciting new features:

* **FP4 Weight Only Quantization Support:** Quantize weights to FP4 and seamlessly run the compressed model in vLLM. Model weights are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/1b6287a4b21c16e0842f32fadecb20bb4c0d4862/src/compressed_tensors/quantization/quant_scheme.py#L103). See an example [here](examples/quantization_w4a16_fp4/llama3_example.py) and the short sketch after this list.
* **Axolotl Sparse Finetuning Integration:** Easily finetune sparse LLMs through our seamless integration with Axolotl. [Learn more here](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
* **AutoAWQ Integration:** Perform low-bit weight-only quantization efficiently using AutoAWQ, now part of LLM Compressor. *Note: This integration should be considered experimental for now. Enhanced support, including for MoE models and improved handling of larger models via layer-sequential pipelining, is planned for upcoming releases.* [See the details](https://github.com/vllm-project/llm-compressor/pull/1177).
* **Day 0 Llama 4 Support:** Meta utilized LLM Compressor to create the [FP8-quantized Llama-4-Maverick-17B-128E](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8), optimized for vLLM inference using the [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format.
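
As a quick orientation for the new FP4 path, here is a minimal sketch modeled on the linked `llama3_example.py`. The `NVFP4A16` scheme name, the import paths, and the data-free call are assumptions that may vary between releases; the example script is authoritative.

```python
# Minimal sketch of FP4 weight-only quantization (assumed NVFP4A16 preset scheme).
# Import paths and the data-free oneshot call are assumptions; see
# examples/quantization_w4a16_fp4/llama3_example.py for the authoritative version.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize the weights of every Linear layer to FP4, keeping lm_head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])

# Weight-only quantization here is data-free, so no calibration dataset is passed.
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so vLLM can load the model directly.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Once saved, the directory can be served like any other compressed-tensors checkpoint, e.g. `vllm serve ./Meta-Llama-3-8B-Instruct-NVFP4A16`.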
### Supported Formats

* Activation Quantization: W8A8 (int8 and fp8)
* Mixed Precision: W4A16, W8A16, NVFP4A16
* 2:4 Semi-structured and Unstructured Sparsity

### Supported Algorithms
@@ -50,8 +51,9 @@ pip install llmcompressor

Applying quantization with `llmcompressor` (a rough end-to-end sketch follows this list):
* [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)
* [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)
* [Weight only quantization to `fp4`](examples/quantization_w4a16_fp4/llama3_example.py)
* [Weight only quantization to `int4` using GPTQ](examples/quantization_w4a16/README.md)
* [Weight only quantization to `int4` using AWQ](examples/awq/README.md)
* [Quantizing MoE LLMs](examples/quantizing_moe/README.md)
* [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)
* [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)
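
Each of these examples is self-contained, but they share a common pattern: load a model, build a recipe of modifiers, run `oneshot`, and save a compressed-tensors checkpoint. The sketch below illustrates that pattern for a calibrated int4 GPTQ run; the dataset name, calibration settings, and import paths are illustrative assumptions, and the linked READMEs take precedence.

```python
# Rough sketch of a calibrated one-shot run: int4 weight-only quantization with GPTQ.
# The dataset name, calibration settings, and import paths are assumptions;
# see examples/quantization_w4a16/README.md for the authoritative walkthrough.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# GPTQ reconstructs weights layer by layer, so it needs calibration data.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",        # assumed built-in calibration dataset name
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save in compressed-tensors format for vLLM inference.
SAVE_DIR = "TinyLlama-1.1B-Chat-v1.0-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Data-free schemes, such as `FP8_DYNAMIC` or the weight-only FP4 flow sketched earlier, omit the `dataset` and calibration arguments entirely.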