Skip to content

abdelfattah-lab/xKV

Repository files navigation

xKV Logo xKV: Cross-Layer SVD for KV-Cache Compression

Chi-Chih Chang1, Chien-Yu Lin2, Yash Akhauri1, Wei-Cheng Lin3,
Kai-Chiang Wu3, Luis Ceze2, Mohamed S. Abdelfattah1

1 Cornell University, 2University of Washington,
3National Yang Ming Chiao Tung University
[Paper] | [Website]

xKV Framework

Updates

  • [2025.03.24]:🚀 We release the 1st version of arXiv and code of xKV

TL;DR

We introduce xKV, a simple yet effective post-training compression method for KV-Cache, leveraging inter-layer redundancy. By applying singular value decomposition (SVD) across group of layers, xKV achieves up to 8× compression of the KV-Cache while maintaining strong accuracy.

Environment Setup

  1. Clone the repository (Make sure you have Git installed on your system)
git clone https://github.com/abdelfattah-lab/xKV.git
cd xKV
  1. Prepare environment To run the code in this project, first, create a Python virtual environment using e.g. uv. To install uv, follow the UV Installation Guide.
uv venv --python 3.11 && source .venv/bin/activate && uv pip install --upgrade pip

Next, install dependency

git submodule update --init --recursive
uv pip install -r requirements.txt
uv pip install flash-attn==2.7.4.post1 --no-build-isolation
uv pip install -e 3rdparty/MInference --no-build-isolation
cd 3rdparty/MInference
git am ../0001-Change-KIVI-kernel-to-Triton-version.patch

Install the optional hadamard transform dependency for better quantization integration,

uv pip install -e 3rdparty/fast-hadamard-transform --no-build-isolation # Optional
  1. Create Datasets (for RULER evaluation only)
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
cd evaluate/data/ruler
bash create_dataset.sh "meta-llama/Meta-Llama-3.1-8B-Instruct" "llama-3"

Accuracy Evaluations

We provide an evaluation script evaluate/eval_acc.py to measure the accuracy impact of compressing the KV-Cache with three different methods included in our paper:

  1. Minicache
  2. Single SVD
  3. xKV

Key Arguments

  • --model_name_or_path: Path or name of the model to evaluate (e.g., meta-llama/Meta-Llama-3.1-8B-Instruct).
  • --xKV: Toggle for enabling comprssion.
  • --dataset_name Comma-separated list of datasets (e.g., ruler/niah_single_1,ruler/niah_single_2,...).
  • --layer_group_size: Number of layers to be grouped.
  • --rank_k, --rank_v: Ranks used for each group of layers.
  • --layer_merge_impl Target compression approaches [svd(default), slerp].

Note

When increasing the layer group size, you often need to adjust these ranks for a fair comparison. For instance, if you use rank_k=128 for layer_group_size=1, then to compare performance under layer_group_size=2, set rank_k=256 so that the average rank per layer is similar.

Warning

When evaluting Qwen series, please pass --flash2 to switch backend to FlashAttention 2. ref


Evaluation on RULER Benchmark

Below we provide the example commands for running the RULER benchmarks with different suppoted KV-Cache compression results.

xKV

Enables xKV compression for all layers (start_layer_idx=0 to end_layer_idx=-1), grouping every 4 layers (layer_group_size=4), using ranks 512 and 768 for each grouped keys and values.

# xKV-4
CUDA_VISIBLE_DEVICES=4,5,6,7 OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 4 evaluate/eval_acc.py --datalen 65536 --batch_size 1 --dataset_name "ruler/vt" --model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct --xKV --merge_k --merge_v --rank_k 256 --rank_v 384 --layer_group_size 4 --start_layer_idx 0 --end_layer_idx -1

Single SVD

For evaluation of Single SVD under similar compression level, replacing the arguments --layer_group_size 1 and --rank_k 128 --rank_v_192.

CUDA_VISIBLE_DEVICES=4,5,6,7 OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 4 evaluate/eval_acc.py --datalen 65536 --batch_size 1 --dataset_name "ruler/vt" --model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct --xKV --merge_k --merge_v --rank_k 128 --rank_v 192 --layer_group_size 1 --start_layer_idx 0 --end_layer_idx -1

MiniCache

This command enables the MiniCache approach by specifying --layer_merge_impl slerp. The layers 16 through 31 are compressed.

CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 4 evaluate/eval_acc.py --datalen 65536 --batch_size 1 --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct --xKV --merge_k --merge_v --layer_merge_impl slerp --layer_group_size 2 --start_layer_idx 16 --end_layer_idx 31

Customized Merge Config

We also support customized merge config by providing a yaml file to the --customized_merge_config argument. By writing a yaml file you can experiment with different merging groups and different ranks for each group. Please refer to the configs/example.yaml for the format.

CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 4 evaluate/eval_acc.py --datalen 65536 --batch_size 1 --dataset_name "ruler/niah_single_1" --model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct --xKV --customized_merge_config e
xample.yaml 

Evalaution on DeepSeek Models

DeepSeek’s MLA (multi-latent attention) architecture has two types of hidden states that can be cached during inference:

  • Non-RoPE Latents (the learned, position-agnostic latent vectors).
  • RoPE-based Key States (rotary-positioned keys). We reuse the Key and Value compression interfaces for these two elements:
  • --merge_k and --rank_k control compression of the non-RoPE latents (treated like “Keys”).
  • --merge_v and --rank_v control compression of the RoPE-based Key states (treated like “Values”). In our paper, we focus on compressing only the non-RoPE latents only.

xKV for DeepSeek (compress only non-RoPE latents)

Enables xKV compression for all layers (start_layer_idx=0 to end_layer_idx=-1), grouping every 4 layers (layer_group_size=4), using ranks 512 for grouped latents.

CUDA_VISIBLE_DEVICES=4,5,6,7 OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 4 \
evaluate/eval_acc.py \
--datalen 65536 \
--batch_size 1 \
--dataset_name "long_bench/repobench-p" \
--model_name_or_path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
--xKV \
--merge_k \
--rank_k 64 \
--layer_group_size 1 \
--start_layer_idx 0 \
--end_layer_idx -1 \
--flash2

Upcoming Roadmap

  • Accuracy Evaluation
  • Release end-to-end system and efficiency evalution.
  • Integration with sparse attention (e.g., ShadowKV)

Citation

If you find xKV useful or relevant to your project and research, please kindly cite our paper:

@article{chang2025xkv,
  title = {xKV: Cross-Layer SVD for KV-Cache Compression},
  author = {Chang, Chi-Chih and Lin, Chien-Yu and Akhauri, Yash and Lin, Wei-Cheng and Wu, Kai-Chiang and Ceze, Luis and Abdelfattah, Mohamed S.},
  year = {2025},
  journal = {arXiv preprint arXiv:2503.18893},
  year = {2025}
}

Acknowledgement

The evaluation scripts are built upon ShadowKV and Palu repository.