
Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs

Hang Guo, Yawei Li, Luca Benini


Abstract:
Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their respective limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To address this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization through error compensation between the two. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is then reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.

⭐ If this work is helpful to you, please star this repo. Thanks! 🤗

✨ Highlights

1️⃣ The First to Enable W4A4KV4+50% Sparsity LLMs

Performance under different bit-widths and efficiency comparison (click to expand)

2️⃣ Strong Performance on WikiText Perplexity and Zero-shot Evaluation

Quantitative Results on the WikiText perplexity and zero-shot accuracy (click to expand)

3️⃣ Promising Efficiency against Dense INT4 Baselines

Comparison on Runtime, FLOPs, and TOPS (click to expand)

📑 Contents

🆕 News

  • 2025-09-16: arXiv paper available.
  • 2025-09-16: This repo is released and we have open sourced all our code and weights!

☑️ TODO

  • arXiv version available
  • Release code
  • Further improvements

👀 Pipeline

The core idea of our Optimal Brain Restoration (OBR) is to intervene between the pruning and quantization operators by computing an optimal compensation, thereby reconciling the conflicting requirements that these two methods place on weight distributions.

🎈Core Algorithm

For learning purposes, one can quickly locate the core code of our OBR algorithm in this line. Since OBR is a general technique, it can potentially be applied to other LLMs as well.
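To build intuition for what "computing an optimal compensation" means, below is a minimal, self-contained sketch in the spirit of the classic Optimal Brain Surgeon / GPTQ-style error compensation that OBR builds on: weights are pruned by saliency, the removal error is pushed onto the surviving weights via the inverse Hessian, and the result is then quantized with a simple round-to-nearest quantizer. This is only an illustration under simplifying assumptions; the function names and the RTN quantizer are hypothetical and this is not the exact OBR closed form from the paper (see the linked code for that).

```python
import torch

def prune_and_compensate(W, H, sparsity=0.5, damp=0.01):
    """Hypothetical sketch of multi-weight OBS-style pruning with error compensation.

    W: (out_features, in_features) weights of one linear layer.
    H: (in_features, in_features) Hessian proxy, e.g. X^T X from calibration inputs.
    NOTE: this is NOT the OBR closed-form solution from the paper, only the classic
    Optimal-Brain-Surgeon compensation it builds on, shown for intuition.
    """
    d = W.shape[1]
    H = H + damp * torch.diag(H).mean() * torch.eye(d)   # damping for invertibility
    Hinv = torch.linalg.inv(H)

    # OBS saliency: w_j^2 / [H^-1]_jj  (small => cheap to prune)
    saliency = W.pow(2) / torch.diag(Hinv).unsqueeze(0)
    k = int(sparsity * d)
    prune_idx = saliency.argsort(dim=1)[:, :k]           # per-row columns to zero out

    W_new = W.clone()
    for r in range(W.shape[0]):
        P = prune_idx[r]                                  # indices pruned in this row
        w_P = W_new[r, P]
        # optimal row update that zeroes W[r, P] while minimizing Hessian-weighted error
        delta = -Hinv[:, P] @ torch.linalg.solve(Hinv[P][:, P], w_P)
        W_new[r] += delta
        W_new[r, P] = 0.0                                 # enforce exact zeros
    return W_new

def quantize_rtn_int4(W):
    """Symmetric per-row round-to-nearest INT4 quantization (illustrative only)."""
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    return (W / scale).round().clamp(-8, 7) * scale

# Toy usage: fake calibration inputs X give the Hessian proxy H = X^T X.
X = torch.randn(256, 64)
W = torch.randn(128, 64)
W_sparse = prune_and_compensate(W, X.T @ X, sparsity=0.5)
W_q = quantize_rtn_int4(W_sparse)                         # 50% sparse + 4-bit weights
```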

🙌Environment Setup

conda create -n obr python=3.10
conda activate obr
cd ./OBR
pip install torch torchvision
pip install -r requirements.txt
cd ..
git clone https://github.com/Dao-AILab/fast-hadamard-transform.git
cd fast-hadamard-transform
pip install .

💪Get Started

Since our OBR works on rotated LLM weights, we include existing state-of-the-art rotation schemes, namely QuaRot, SpinQuant, and FlatQuant. Therefore, it is necessary to download the corresponding rotation matrices before using OBR.
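For background on why the rotation matters: these schemes multiply the weights (and, correspondingly, the activations) by an orthogonal matrix, which leaves the layer output unchanged in full precision but spreads outliers across channels, making low-bit quantization much easier. The snippet below is a small hypothetical illustration with a Sylvester-constructed Hadamard rotation, not the repo's actual rotation code (which relies on the fast-hadamard-transform kernels installed above).

```python
import torch

def hadamard_matrix(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / H.shape[0] ** 0.5   # scale so that Q @ Q.T = I

# A linear layer y = x @ W.T gives the same output if both x and W are rotated by Q:
# (x @ Q) @ (W @ Q).T = x @ Q @ Q.T @ W.T = x @ W.T for orthogonal Q.
d = 128
Q = hadamard_matrix(d)
W = torch.randn(64, d)
x = torch.randn(4, d)

y_ref = x @ W.T
y_rot = (x @ Q) @ (W @ Q).T
assert torch.allclose(y_ref, y_rot, atol=1e-3)

# The rotated weights W @ Q are typically much flatter (fewer outliers), which is what
# makes aggressive low-bit (e.g. INT4) quantization of weights and activations viable.
```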

OBR for QuaRot

Here, we use Llama2-7B as an example.

First cd into the QuaRot folder

cd ./QuaRot

For the QuaRot method, since the Hadamard rotation is not learnable, there is no matrix to download; the rotation is generated on the fly in the code.

One can use the following command line to get a W4A4KV4+50% unstructured sparse Llama2-7B model.

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py  --rotate --a_bits 4 --v_bits 4 --k_bits 4 --w_bits 4 --w_clip --ppl_eval

We also provide the OBR-calibrated model in our HF repo, so one can directly reproduce the performance reported in the paper by adding --load_qmodel_path HangGuo/Llama2-7B-QuaRot-OBR-GPTQ-W4A4KV4S50.

OBR for SpinQuant

Here, we use Llama2-7B as an example.

First cd into the SpinQuant folder

cd ./SpinQuant

Download the rotation matrix released by the official SpinQuant repo here. Please read this note from the SpinQuant authors to make sure you download the correct version.

Then one can run our OBR-GPTQ algorithm with the following command, which produces a W4A4KV4 + 50% unstructured-sparse Llama2-7B model.

python main.py --input_model meta-llama/Llama-2-7b-hf --do_train False --do_eval True --per_device_eval_batch_size 4 --model_max_length 2048 --fp16 False --bf16 True --save_safetensors False --w_bits 4 --a_bits 4 --k_bits 4 --v_bits 4 --w_clip --a_asym --k_asym --v_asym --k_groupsize 128 --v_groupsize 128 --rotate --optimized_rotation_path /data/guohang/OBR/SpinQuant/rotations/GPTQ/llama27b/R.bin --ppl_eval

We also provide the OBR-calibrated model in our HF repo, so one can directly reproduce the performance reported in the paper by adding --load_qmodel_path <HF_repo_name>.

OBR for FlatQuant

Here, we use Qwen2.5-7B-Instruct as an example.

IMPORTANT: For experiments with the Qwen2.5 series, please update the transformers library via pip install transformers==4.45.0, since older versions do not implement the Qwen2.5 model.

First cd into the FlatQuant folder

cd ./FlatQuant

Download the official rotation matrix from the FlatQuant repo here and put it in a local path.

Run the following line to generate a W4A4KV4+50% unstructured sparse Qwen2.5-7B model.

python main.py --model Qwen/Qwen2.5-7B-Instruct --w_bits 4 --a_bits 4 --k_bits 4 --k_asym --k_groupsize 128 --v_bits 4 --v_asym --v_groupsize 128 --cali_bsz 4 --epoch 15 --flat_lr 5e-3 --lwc --lac --cali_trans --add_diag --output_dir ./outputs --save_matrix --ppl_eval --reload_matrix --matrix_path /data2/guohang/pretrained/flatquant_rotation

We also provide the OBR-calibrated model in our HF repo, so one can directly reproduce the performance reported in the paper by adding --load_qmodel_path HangGuo/QWen2.5-7B-FlatQuant-OBR-GPTQ-W4A4KV4S50.

Key Args

For compatibility, the OBR-related args share the same names across all rotation frameworks. The meanings of the key args are as follows; an illustrative command combining them is shown after the list.

  • obr_rtn: use the OBR_RTN algorithm; default False (i.e., use OBR_GPTQ).

  • obr_alpha: the group partition ratio between the Retain set and the Eviction set in the OBR_GPTQ algorithm; default 0.5.

  • sparsity_ratio: the sparsity ratio for pruning; default 0.5.

  • prune_n: the N in N:M semi-structured pruning; default 0 (i.e., use unstructured pruning).

  • prune_m: the M in N:M semi-structured pruning; default 0 (i.e., use unstructured pruning).

  • ppl_eval: whether to evaluate perplexity on WikiText2.

  • lm_eval: whether to evaluate zero-shot accuracy.
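For example, combining these flags with the QuaRot command above, a hypothetical 2:4 semi-structured run with both perplexity and zero-shot evaluation might look like the following (flag values are illustrative, not a reproduction recipe from the paper):

CUDA_VISIBLE_DEVICES=0 python main.py --rotate --w_bits 4 --a_bits 4 --k_bits 4 --v_bits 4 --w_clip --prune_n 2 --prune_m 4 --obr_alpha 0.5 --ppl_eval --lm_eval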

🥰 Citation

Please cite us if our work is useful for your research.

@article{guo2025optimal,
  title={Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs},
  author={Guo, Hang and Li, Yawei and Benini, Luca},
  journal={arXiv preprint arXiv:2509.11177},
  year={2025}
}

License

Since this work builds on previous works including QuaRot, SpinQuant, and FlatQuant, users should follow the licenses of the corresponding backbones: QuaRot (Apache-2.0 License), SpinQuant (CC-BY-NC 4.0), and FlatQuant (MIT License).

Contact

If you have any questions, feel free to contact me at [email protected]
