
CommVQ: Commutative Vector Quantization for KV Cache Compression

[Paper] [Hugging Face Models]

This repository contains the official implementation of CommVQ, a method for memory-efficient and long-context inference through KV cache quantization with learned codebooks. It achieves strong performance across a wide range of benchmarks while significantly reducing memory overhead.
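To give a rough sense of what codebook-based KV cache quantization does, the sketch below stores each cached key/value vector as the index of its nearest codebook entry and reconstructs an approximation at read time. This is a generic nearest-neighbor vector-quantization illustration, not CommVQ's commutative codebook formulation; all names are illustrative.

# Illustrative only: generic codebook-based KV cache quantization,
# NOT CommVQ's commutative codebook formulation.
import torch

def quantize_kv(kv: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Replace each cached vector with the index of its nearest codebook entry.

    kv:       (num_tokens, head_dim) cached key or value vectors
    codebook: (num_codes, head_dim)  learned codebook
    Returns integer codes of shape (num_tokens,), i.e. the compressed cache.
    """
    dists = torch.cdist(kv, codebook)   # (num_tokens, num_codes) pairwise distances
    return dists.argmin(dim=-1)         # nearest code per token

def dequantize_kv(codes: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate KV vectors from the stored codes."""
    return codebook[codes]              # (num_tokens, head_dim)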

Table of Contents

  • News
  • Model Checkpoints
  • Installation
  • Training
  • Evaluation
  • Memory Measurement
  • Citation

News

  • [June, 2025]: Released code and model weights.
  • [May, 2025]: CommVQ has been accepted to ICML 2025! See you in Vancouver, BC.

Model Checkpoints

We release the following LLaMA-3.1 8B checkpoints with CommVQ 1-bit and 2-bit compression. Both key and value codebooks are provided below; the value codebooks are used together with the original (unchanged) model weights.

Model Variant                  Value Codebook     Key Codebook
LLaMA-3.1 8B + CommVQ 1-bit    🤗 Hugging Face    🤗 Hugging Face
LLaMA-3.1 8B + CommVQ 2-bit    🤗 Hugging Face    🤗 Hugging Face
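
The codebooks can also be fetched programmatically from the Hugging Face Hub, for example with huggingface_hub.snapshot_download. The repository ID below is a placeholder; substitute the links from the table above.

# Sketch: download a released codebook from the Hugging Face Hub.
# The repo_id is a placeholder -- use the links in the table above.
from huggingface_hub import snapshot_download

codebook_dir = snapshot_download(repo_id="<org>/<commvq-llama3.1-8b-1bit-value-codebook>")
print("Codebook files downloaded to:", codebook_dir)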

Installation

conda create -n commvq python=3.10
conda activate commvq
pip install -e .
pip install flash-attn --no-build-isolation
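
After installing, a quick sanity check (assuming a CUDA-capable GPU) can confirm that PyTorch and flash-attn are importable:

# Environment sanity check (assumes a CUDA GPU is available).
import torch
import flash_attn  # installed via `pip install flash-attn --no-build-isolation`

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)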

Training

cd training

# Step 1: Collect KV cache
bash collect_kv.sh

# Step 2: Prepare scaling factors
python make_scale.py

# Step 3: Train the codebook for key cache
bash quantize_key_cache.sh

# Step 4: Train the codebook for value cache
bash finetune/llama3.1_8b_int1.sh
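
For intuition only, the sketch below learns a codebook from collected KV vectors with plain k-means. This is not CommVQ's training procedure (that is handled by the scripts above), and the file path in the usage comment is illustrative.

# Intuition only: learning a codebook for cached KV vectors with plain k-means.
# CommVQ's codebooks are trained by the scripts above; this is not that method.
import torch

def kmeans_codebook(vectors: torch.Tensor, num_codes: int = 256, iters: int = 20) -> torch.Tensor:
    """vectors: (N, d) collected key or value vectors; returns a (num_codes, d) codebook."""
    # Initialize the codebook from randomly chosen vectors.
    idx = torch.randperm(vectors.shape[0])[:num_codes]
    codebook = vectors[idx].clone()
    for _ in range(iters):
        # Assign each vector to its nearest code.
        assign = torch.cdist(vectors, codebook).argmin(dim=-1)
        # Move each code to the mean of its assigned vectors.
        for c in range(num_codes):
            members = vectors[assign == c]
            if len(members) > 0:
                codebook[c] = members.mean(dim=0)
    return codebook

# Hypothetical usage with vectors collected in Step 1 (path is illustrative):
# kv = torch.load("kv_cache/key_vectors.pt")   # (N, head_dim)
# key_codebook = kmeans_codebook(kv)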

Evaluation

Longbench

cd evaluation/longbench
python pred.py --model $CHECKPOINT
python eval.py --model $RESULT_DIR

Infinitebench

cd evaluation/infiniteBench/src
# Download the evaluation datasets
bash scripts/download_dataset.sh
# Evaluate each task
bash run_passkey.sh
# Merge all results in each task into one jsonl file
cat ../results/commvq/preds_passkey_*.jsonl > ../results/commvq/preds_passkey.jsonl
# Compute the task score
python compute_scores.py --task all --model_name commvq

NIAH

cd evaluation/niah
bash run.sh $CHECKPOINT

Memory Measurement

We implement Triton-based kernels to further optimize memory usage and enable real memory savings with CommVQ. (This currently supports LLaMA-3.1 8B with 1-bit quantization; support for more models is under development.)

cd evaluation/memory_measurement
pip install -e ../../transformers_triton_infer
bash eval_memory.sh $CHECKPOINT
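
Independently of the script above, a generic way to check peak GPU memory during generation with plain PyTorch/Transformers looks like the snippet below; the checkpoint path is a placeholder.

# Generic peak-memory measurement with PyTorch; the repo's eval_memory.sh
# performs its own measurement, this only illustrates the idea.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "<path-or-hub-id-of-your-checkpoint>"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).cuda()

inputs = tokenizer("A long prompt ... " * 512, return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")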

Citation

If you find CommVQ useful in your research or applications, please consider citing:

@inproceedings{li2025commvq,
  title = {CommVQ: Commutative Vector Quantization for KV Cache Compression},
  author = {Junyan Li and Yang Zhang and Muhammad Yusuf Hassan and Talha Chafekar and Tianle Cai and Zhile Ren and Pengsheng Guo and Binazir Karimzadeh and Colorado J Reed and Chong Wang and Chuang Gan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year = {2025}
}
