This repository contains the official implementation of CommVQ, a method for memory-efficient long-context inference via KV cache quantization with learned codebooks. It achieves strong accuracy across a wide range of benchmarks while significantly reducing KV cache memory overhead.
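To build intuition for what codebook-based KV cache quantization does, here is a minimal vector-quantization sketch in PyTorch. It is illustrative only and is not CommVQ's algorithm (which uses learned commutative codebooks, described in the paper); all shapes and names are hypothetical.

```python
# Minimal sketch of codebook-based (vector) quantization for a KV tensor.
# Illustrative only: CommVQ itself uses learned commutative codebooks.
import torch

def vq_encode(x: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each row of x to the index of its nearest codebook entry."""
    # x: (num_vectors, dim), codebook: (num_codes, dim)
    dists = torch.cdist(x, codebook)   # (num_vectors, num_codes)
    return dists.argmin(dim=-1)        # small integer codes instead of floats

def vq_decode(codes: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximation of x by codebook lookup."""
    return codebook[codes]

kv = torch.randn(1024, 128)            # e.g. 1024 cached key vectors
codebook = torch.randn(256, 128)       # 256 entries -> 8-bit codes
codes = vq_encode(kv, codebook)
kv_hat = vq_decode(codes, codebook)
print(codes.dtype, kv_hat.shape)       # torch.int64, torch.Size([1024, 128])
```

Storing small integer codes plus one shared codebook, rather than full-precision vectors for every cached token, is where the memory saving comes from.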
- [June, 2025]: Released code and model weights.
- [May, 2025]: CommVQ is accepted to ICML 2025! See you in Vancouver, BC.
We release the following LLaMA-3.1 8B checkpoints with CommVQ 1-bit and 2-bit compression. Both the key and value codebooks are provided below; the value codebooks are used together with the original (unchanged) model weights.
| Model Variant | Value Codebook | Key Codebook |
|---|---|---|
| LLaMA-3.1 8B + CommVQ 1-bit | 🤗 Hugging Face | 🤗 Hugging Face |
| LLaMA-3.1 8B + CommVQ 2-bit | 🤗 Hugging Face | 🤗 Hugging Face |
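As a sketch, the codebooks can also be fetched programmatically with `huggingface_hub`; the repo id below is a placeholder, so substitute the repository linked in the table above.

```python
# Sketch: download a codebook from the Hugging Face Hub.
# "your-org/commvq-llama3.1-8b-1bit-key" is a placeholder repo id;
# use the actual repository linked in the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="your-org/commvq-llama3.1-8b-1bit-key")
print("codebook files downloaded to:", local_dir)
```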
```bash
# Create and activate a conda environment
conda create -n commvq python=3.10
conda activate commvq

# Install CommVQ in editable mode
pip install -e .

# Install FlashAttention (building it requires a CUDA toolchain)
pip install flash-attn --no-build-isolation
```
```bash
cd training

# Step 1: Collect KV cache
bash collect_kv.sh

# Step 2: Prepare scaling factors
python make_scale.py

# Step 3: Train the codebook for the key cache
bash quantize_key_cache.sh

# Step 4: Train the codebook for the value cache
bash finetune/llama3.1_8b_int1.sh
```
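For intuition on Step 1: with Hugging Face `transformers`, collecting a KV cache amounts to a forward pass with `use_cache=True` and saving the returned per-layer key/value tensors. The sketch below rests on that assumption only; `collect_kv.sh` is the authoritative implementation, and the model id and output path are placeholders.

```python
# Sketch of KV-cache collection (cf. Step 1). Illustrative only; see
# collect_kv.sh for the script actually used. The model id and output
# path are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"   # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tok("Some calibration text.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, use_cache=True)

# past_key_values holds the per-layer key/value tensors; depending on the
# transformers version it is a tuple of (key, value) pairs or a Cache object.
torch.save(out.past_key_values, "kv_cache_sample.pt")  # placeholder path
```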
```bash
cd evaluation/longbench

# Generate predictions with a CommVQ checkpoint
python pred.py --model $CHECKPOINT

# Compute LongBench scores over the prediction results
python eval.py --model $RESULT_DIR
```
```bash
cd evaluation/infiniteBench/src

# Download the evaluation datasets
bash scripts/download_dataset.sh

# Evaluate each task (passkey shown here)
bash run_passkey.sh

# Merge all results for each task into one jsonl file
cat ../results/commvq/preds_passkey_*.jsonl > ../results/commvq/preds_passkey.jsonl

# Compute the task scores
python compute_scores.py --task all --model_name commvq
```
```bash
cd evaluation/niah

# Run the Needle-in-a-Haystack evaluation
bash run.sh $CHECKPOINT
```
We implement Triton-based kernels that further optimize memory usage and realize actual memory savings with CommVQ at inference time. (Currently only LLaMA-3.1 8B with 1-bit quantization is supported; broader model support is under development.)
```bash
cd evaluation/memory_measurement

# Install the transformers fork with the Triton inference kernels
pip install -e ../../transformers_triton_infer

# Measure memory usage with a CommVQ checkpoint
bash eval_memory.sh $CHECKPOINT
```
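To sanity-check the savings independently of `eval_memory.sh` (whose exact methodology may differ), a generic PyTorch pattern is to read the peak-memory counters around a `generate()` call. In the sketch below, `model` and `inputs` are assumed to be a loaded causal LM and a tokenized prompt on the GPU.

```python
# Sketch: measure peak GPU memory around one generation call.
# Generic PyTorch pattern, not the repository's measurement script;
# `model` and `inputs` are assumed to already live on the GPU.
import torch

def peak_memory_gib(model, inputs, max_new_tokens: int = 512) -> float:
    """Return peak allocated GPU memory (GiB) for one generate() call."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    return torch.cuda.max_memory_allocated() / 1024**3
```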
If you find CommVQ useful in your research or applications, please consider citing:
```bibtex
@inproceedings{li2025commvq,
  title     = {CommVQ: Commutative Vector Quantization for KV Cache Compression},
  author    = {Junyan Li and Yang Zhang and Muhammad Yusuf Hassan and Talha Chafekar and Tianle Cai and Zhile Ren and Pengsheng Guo and Binazir Karimzadeh and Colorado J Reed and Chong Wang and Chuang Gan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025}
}
```