Lawrence Liu1, Inesh Chakrabarti1, Yixiao Li2, Mengdi Wang3, Tuo Zhao2, Lin F. Yang1
1 UCLA, 2 Georgia Tech, 3 Princeton University
NoWag was accepted at COLM 2025! Looking forward to seeing you in Montreal!
This is the official implementation of NoWag, a unified framework for shape-preserving compression of large language models.
- Normalized Weight and Activation Guided Compression (NoWag) is a family of computationally efficient pruning (NoWag-P) and quantization (NoWag-VQ) algorithms for LLMs with a shared normalization method and optimization objective.
- NoWag-P and NoWag-VQ perform competitively against state-of-the-art (SOTA) pruning and quantization algorithms, respectively.
- NoWag-VQ demonstrates reduced dependence on calibration data compared with SOTA VQ methods.
- Python 3.13.2+
- Miniconda/Anaconda
- CUDA
Clone the repository and install the required dependencies:
```bash
git clone git@github.com:LawrenceRLiu/NoWAG.git
cd NoWag
conda env create -f env.yml
conda activate NoWag
```

One-shot compression compresses a large language model in a single pass, without iterative training or fine-tuning. It is the most computationally efficient method of compression.
We use the average squared l2 norms of the sample activations, also known as the diagonals of the Hessians, to provide data awareness during one-shot compression. These can be generated by running `scripts/generate_hessians.py`; alternatively, we provide `scripts/generate_hessians.bash`, a bash script that generates the Hessians for the Llama 2 7B/13B/70B and Llama-3 8B/70B models using the same seed and calibration data as in the paper.
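For intuition, here is a minimal sketch of what these diagonals are. This is illustrative only, not the repo's implementation, and the tensor shapes are assumptions:

```python
import torch

def hessian_diagonal(activations: torch.Tensor) -> torch.Tensor:
    """Diagonal of the (scaled) layer Hessian H = (1/n) * X^T X.

    activations: hypothetical (n_samples, in_features) tensor of inputs
    to a linear layer, collected from calibration data.
    Returns the per-feature mean of squared activations, i.e. diag(H).
    """
    return activations.pow(2).mean(dim=0)

# example: 128 calibration samples for a layer with 4096 input features
diag_h = hessian_diagonal(torch.randn(128, 4096))
```

To reproduce the paper's setup instead, run the provided script, e.g. `bash scripts/generate_hessians.bash`.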
To perform one-shot compression, run the `NoWag.py` script. We use Hydra for configuration management, so the model to compress, the compression method, its parameters, etc. can all be specified on the command line. By default, the script runs NoWag-P on the Llama 2 7B model.
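For example, a bare invocation uses the defaults:

```bash
# runs NoWag-P on Llama 2 7B with the default configuration
python NoWag.py
```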
Currently NoWag supports two paradigms of shape-preserving compression (example commands follow the list):

- Pruning (NoWag-P): prunes the model weights based on their importance. To run it, add `compress=prune` to the command line; this is the default method used in the `NoWag.py` script. We support both unstructured and N:M pruning, with unstructured pruning as the default. To run N:M pruning, add `+compress.kwargs.pattern=[$N:$M]` to the command line. We provide a bash script, `scripts/prune.bash`, that runs NoWag-P on the Llama 2 7B/13B/70B and Llama-3 8B/70B models using the same seed and calibration data as in the paper.
- Vector Quantization (NoWag-VQ): quantizes the model weights using vector quantization. We provide a bash script, `scripts/quantize.bash`, that runs NoWag-VQ on the Llama 2 7B/13B/70B and Llama-3 8B models using the same seed and calibration data as in the paper. Note that we are still working on the CUDA kernel for inference; the current results are from simulation only and cannot be used for inference.
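Concretely, the documented Hydra overrides compose as follows (the 2:4 pattern below is just an illustrative choice):

```bash
# unstructured pruning (the default)
python NoWag.py compress=prune

# 2:4 semi-structured (N:M) pruning
python NoWag.py compress=prune +compress.kwargs.pattern=[2:4]

# reproduce the paper's pruning / quantization runs
bash scripts/prune.bash
bash scripts/quantize.bash
```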
We also examine the performance of NoWag-VQ beyond the "one-shot" compression regime. The existing literature proposes several methods for post-quantization fine-tuning; one popular method is to fine-tune the remaining continuous parameters of each transformer block to minimize the block output error. We implement this method in our codebase. To run it, execute `finetune_layerwise.py` with `run_name` set to the same `run_name` used in the `NoWag.py` script.
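For instance, assuming `run_name` is passed as a Hydra override (the name `llama2_7b_vq` below is hypothetical):

```bash
# one-shot compression, then block-wise fine-tuning under the same run name
python NoWag.py run_name=llama2_7b_vq
python finetune_layerwise.py run_name=llama2_7b_vq
```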
```
.
├── models/                 # Model definitions
├── config/                 # Configuration files
├── scripts/                # Utility scripts
├── src/                    # Source code
├── NoWag.py                # Main script for NoWag compression
├── finetune_layerwise.py   # Script for layerwise fine-tuning
└── README.md               # Project documentation
```
If you use this framework in your research, please cite:
```bibtex
@article{liu2025nowag,
  title={NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models},
  author={Liu, Lawrence and Chakrabarti, Inesh and Li, Yixiao and Wang, Mengdi and Zhao, Tuo and Yang, Lin F},
  journal={arXiv preprint arXiv:2504.14569},
  year={2025}
}
```
This project is licensed under the GNU GPL v3 License; see the LICENSE file for details. Use of Llama models is governed by the Meta license.
For questions or issues, please contact [email protected].
