Describe the bug
Running GPTQ on a model with llmcompressor does not produce the expected results, at least not results comparable to auto_gptq. I think there might be a bug in the implementation.
Expected behavior
Much higher accuracy on several metrics after running GPTQ, closer to what auto_gptq and the base model achieve.
Environment
Include all relevant environment information:
- Ubuntu 22.04.3 LTS
- Python 3.10.12
- LLM Compressor 0.4.0
- torch 2.4.0, transformers 4.48.2
- CUDA 12.2
To Reproduce
Use the model "meta-llama/Llama-3.1-8B-Instruct".
Run only the quantization stage of the 2of4_w4a16_group-128_recipe.yaml example to quantize to 4 bit (a rough sketch is shown after these steps).
Run quantization with auto_gptq to 4 bit with the same parameters.
Evaluate with lm_eval.
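
For reference, the llm-compressor run looked roughly like the sketch below. The calibration dataset, sample count, sequence length, and output directory are assumptions for illustration and may differ from the actual run; only the model ID and recipe file come from the steps above.

```python
# Rough sketch of the llm-compressor quantization step (assumed calibration
# settings; the recipe is the 2of4_w4a16_group-128_recipe.yaml example).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
SAVE_DIR = "Llama-3.1-8B-Instruct-W4A16-G128"  # hypothetical output path

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Apply the GPTQ recipe in one shot over a calibration dataset
oneshot(
    model=model,
    dataset="open_platypus",            # assumed calibration dataset
    recipe="2of4_w4a16_group-128_recipe.yaml",
    max_seq_length=2048,                # assumed
    num_calibration_samples=512,        # assumed
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Evaluation was then done with lm_eval, e.g. `lm_eval --model hf --model_args pretrained=<quantized_model_dir> --tasks arc_challenge,winogrande,wikitext` (exact flags assumed).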
Results:

| Metric | llm-compressor | auto_gptq | base model |
|---|---|---|---|
| arc-c (acc) | 0.35 | 0.49 | 0.51 |
| winogrande (acc) | 0.6 | 0.72 | 0.73 |
| wikitext (ppl, lower is better) | 9.2 | 9.12 | 8.6420 |
As we can see, there is a large gap between the two, even though both use the GPTQ method.