Lawrence Liu1, Inesh Chakrabarti1, Yixiao Li2, Mengdi Wang3, Tuo Zhao2, Lin F. Yang1
1 UCLA, 2 Georgia Tech, 3 Princeton University
NoWag was accepted at COLM 2025! Looking forward to seeing you in Montreal!
This is the official implementation of NoWag, a unified framework for shape-preserving compression of large language models.
- Normalized Weight and Activation Guided Compression (NoWag) is a family of computationally efficient pruning (NoWag-P) and quantization (NoWag-VQ) algorithms for LLMs with a shared normalization method and optimization objective.
- NoWag-P and NoWag-VQ perform competitively against state-of-the-art (SOTA) pruning and quantization algorithms, respectively.
- NoWag-VQ demonstrates reduced dependence on calibration data compared with SOTA VQ methods.
- Python 3.13.2+
- Miniconda/Anaconda
- CUDA
Clone the repository and install the required dependencies:
```bash
git clone git@github.com:LawrenceRLiu/NoWAG.git
cd NoWag
conda env create -f env.yml
conda activate NoWag
```

One-shot compression compresses a large language model in a single pass, without iterative training or fine-tuning. It is the most computationally efficient method of compression.
We use the average squared l2 norms of the sample activations, also known as the diagonals of the Hessians, to provide data awareness during one-shot compression. These can be generated by running `scripts/generate_hessians.py`; alternatively, we provide `scripts/generate_hessians.bash`, a bash script that generates the Hessians for the Llama 2 7B/13B/70B and Llama-3 8B/70B models using the same seed and calibration data as in the paper.
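For intuition, here is a minimal sketch of what these diagonals are. This is illustrative only, not the repo's implementation, and the tensor shapes are assumptions:

```python
import torch

def hessian_diagonal(activations: torch.Tensor) -> torch.Tensor:
    """Diagonal of the (scaled) layer Hessian H = (1/n) * X^T X.

    activations: hypothetical (n_samples, in_features) tensor of inputs
    to a linear layer, collected from calibration data.
    Returns the per-feature mean of squared activations, i.e. diag(H).
    """
    return activations.pow(2).mean(dim=0)

# example: 128 calibration samples for a layer with 4096 input features
diag_h = hessian_diagonal(torch.randn(128, 4096))
```

To reproduce the paper's setup instead, run the provided script, e.g. `bash scripts/generate_hessians.bash`.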
To perform one-shot compression, run the `NoWag.py` script. We use Hydra for configuration management, so the model to compress, the compression method, its parameters, etc. can all be specified on the command line. By default, the script runs NoWag-P on the Llama 2 7B model.
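For example, a bare invocation uses the defaults:

```bash
# runs NoWag-P on Llama 2 7B with the default configuration
python NoWag.py
```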
Currently NoWag supports two paradigms of shape-preserving compression (example commands follow the list):

- Pruning (NoWag-P): prunes the model weights based on their importance. To run it, add `compress=prune` to the command line; this is the default method used in the `NoWag.py` script. We support both unstructured and N:M pruning, with unstructured pruning as the default. To run N:M pruning, add `+compress.kwargs.pattern=[$N:$M]` to the command line. We provide a bash script, `scripts/prune.bash`, that runs NoWag-P on the Llama 2 7B/13B/70B and Llama-3 8B/70B models using the same seed and calibration data as in the paper.
- Vector Quantization (NoWag-VQ): quantizes the model weights using vector quantization. We provide a bash script, `scripts/quantize.bash`, that runs NoWag-VQ on the Llama 2 7B/13B/70B and Llama-3 8B models using the same seed and calibration data as in the paper. Note that we are still working on the CUDA kernel for inference; the current results are from simulation only and cannot be used for inference.
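Concretely, the documented Hydra overrides compose as follows (the 2:4 pattern below is just an illustrative choice):

```bash
# unstructured pruning (the default)
python NoWag.py compress=prune

# 2:4 semi-structured (N:M) pruning
python NoWag.py compress=prune +compress.kwargs.pattern=[2:4]

# reproduce the paper's pruning / quantization runs
bash scripts/prune.bash
bash scripts/quantize.bash
```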
We also examine the performance of NoWag-VQ beyond the "one-shot" compression regime. The existing literature proposes several methods for post-quantization fine-tuning; one popular method is to fine-tune the remaining continuous parameters of each transformer block to minimize the block output error. We implement this method in our codebase. To run it, execute `finetune_layerwise.py` with `run_name` set to the same `run_name` used in the `NoWag.py` script.
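For instance, assuming `run_name` is passed as a Hydra override (the name `llama2_7b_vq` below is hypothetical):

```bash
# one-shot compression, then block-wise fine-tuning under the same run name
python NoWag.py run_name=llama2_7b_vq
python finetune_layerwise.py run_name=llama2_7b_vq
```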
```
.
├── models/                 # Model definitions
├── config/                 # Configuration files
├── scripts/                # Utility scripts
├── src/                    # Source code
├── NoWag.py                # Main script for NoWag compression
├── finetune_layerwise.py   # Script for layerwise fine-tuning
└── README.md               # Project documentation
```
If you use this framework in your research, please cite:
```bibtex
@article{liu2025nowag,
  title={NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models},
  author={Liu, Lawrence and Chakrabarti, Inesh and Li, Yixiao and Wang, Mengdi and Zhao, Tuo and Yang, Lin F},
  journal={arXiv preprint arXiv:2504.14569},
  year={2025}
}
```
This project is licensed under the GNU GPL v3 License; see the LICENSE file for details. Use of Llama models is governed by the Meta license.
For questions or issues, please contact [email protected].
