Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques

Shwai He*, Daize Dong*, Liang Ding, Ang Li

Published in Transactions on Machine Learning Research (TMLR).

arXiv

Introduction

The Mixture of Experts (MoE) approach dynamically selects and activates only a subset of experts, significantly reducing computational costs while maintaining high performance. However, MoE introduces potential redundancy (e.g., parameters) and extra costs (e.g., communication overhead). Since the compression of MoE remains under-explored, we address this gap with a cutting-edge unified framework that seamlessly integrates mainstream compression methods and helps systematically understand MoE compression. This framework approaches compression from two perspectives: Expert Slimming, which compresses individual experts, and Expert Trimming, which removes structured modules. Within this framework, we explore the optimization space unexplored by existing methods and introduce aggressive Expert Trimming techniques, such as Layer Drop and Block Drop, to eliminate redundancy on a larger scale. Based on these insights, we present a comprehensive recipe to guide practitioners in effectively compressing MoE.

unified-view.svg

unified-view-table.svg

Installation

Environments

Create a conda environment and install the pipeline for pruning and Expert Trimming (based on LLaMA-Factory).

conda create -n moe-compression python=3.10
conda activate moe-compression

git clone git@github.com:CASE-Lab-UMD/Unified-MoE-Compression.git
cd ./Unified-MoE-Compression
pip install -e .
pip install flash-attn --no-build-isolation
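
As an optional sanity check (not part of the original setup; the commands below are only a suggestion), you can verify that PyTorch sees your GPUs and that flash-attn was built correctly before moving on:

# optional: confirm CUDA is visible and flash-attn imports cleanly
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"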

Install the pipeline for quantization (based on AutoAWQ and AutoGPTQ). Make sure to install the packages that match your CUDA version. For more details, refer to the README files in the corresponding folders.

cd ./AutoAWQ
pip install -e .

cd ./AutoAWQ/AutoAWQ_kernels
pip install -e .

cd ./AutoGPTQ
pip install -vvv --no-build-isolation -e .
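
Optionally, you can check that both quantization backends import without CUDA or extension errors (an unofficial quick check, only a suggestion):

python -c "import awq, auto_gptq; print('quantization backends OK')"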

Model Checkpoints

Download the Mixtral-8x7B and DeepSeek-MoE-16B models from Hugging Face, and delete the following lines from the config.json of DeepSeek-MoE-16B.

"auto_map": {
  "AutoConfig": "configuration_deepseek.DeepseekConfig",
  "AutoModel": "modeling_deepseek.DeepseekModel",
  "AutoModelForCausalLM": "modeling_deepseek.DeepseekForCausalLM"
},
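
If you prefer to script this step, the sketch below is one possible way to fetch the checkpoints and strip the auto_map entry. The Hugging Face repository IDs and local paths are assumptions for illustration (adjust them to the checkpoints you actually use), and the huggingface-cli download command requires a recent huggingface_hub.

# download the checkpoints (repo IDs and target folders are illustrative)
huggingface-cli download mistralai/Mixtral-8x7B-v0.1 --local-dir ./models/Mixtral-8x7B-v0.1
huggingface-cli download deepseek-ai/deepseek-moe-16b-base --local-dir ./models/deepseek-moe-16b

# remove the "auto_map" block from the DeepSeek config in place
python -c "import json, pathlib; p = pathlib.Path('./models/deepseek-moe-16b/config.json'); c = json.loads(p.read_text()); c.pop('auto_map', None); p.write_text(json.dumps(c, indent=2))"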

Running Compression

Just run the following shell scripts.

Expert Slimming

Pruning

bash scripts/compression/pruning/mixtral_prune.sh
bash scripts/compression/pruning/deepseek_prune.sh
bash scripts/compression/pruning/deepseek_prune_noshared.sh

Quantization

bash scripts/compression/quantization/awq.sh
bash scripts/compression/quantization/gptq.sh

Expert Trimming

Expert Drop

bash scripts/compression/expert_drop/mixtral_expert_drop.sh
bash scripts/compression/expert_drop/deepseek_expert_drop.sh

Layer Drop

bash scripts/compression/layer_drop/mixtral_layer_drop.sh
bash scripts/compression/layer_drop/deepseek_layer_drop.sh

Block Drop

bash scripts/compression/block_drop/mixtral_block_drop.sh
bash scripts/compression/block_drop/deepseek_block_drop.sh

Note that the Expert Trimming methods can also be combined: for example, you can apply Expert Drop after Layer Drop, which may provide a better trade-off between performance and efficiency.
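
A minimal sketch of such a combination, assuming you point the second script at the output of the first (the model path is configured inside each script, and the exact variable names are not shown here):

# 1. apply Layer Drop first
bash scripts/compression/layer_drop/mixtral_layer_drop.sh
# 2. edit the model path in the Expert Drop script so it loads the
#    Layer-Dropped checkpoint, then run it
bash scripts/compression/expert_drop/mixtral_expert_drop.sh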

Running Post-Finetuning

We provide example scripts for finetuning the Mixtral-8x7B and DeepSeek-MoE-16B. Just run:

bash scripts/finetuning/mixtral_finetune.sh
bash scripts/finetuning/deepseek_finetune.sh

Note that the scripts are configured for finetuning on 8 NVIDIA A100 80GB GPUs. You may need to adjust the batch_size according to your available resources.
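
If you have fewer GPUs or less memory, one common adjustment (assuming the finetuning scripts pass standard Hugging Face training arguments, as LLaMA-Factory does; the example values below are hypothetical) is to lower the per-device batch size and raise gradient accumulation so the effective batch size stays roughly constant:

# effective batch size = num_gpus x per_device_train_batch_size x gradient_accumulation_steps
# e.g., inside scripts/finetuning/mixtral_finetune.sh, change values such as
#   --per_device_train_batch_size 4 --gradient_accumulation_steps 4
# to
#   --per_device_train_batch_size 1 --gradient_accumulation_steps 16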

Running Evaluation

FLOPs & Speed

bash scripts/evaluation/speedup/measure_flops.sh
bash scripts/evaluation/speedup/measure_speed.sh

Loss & PPL

bash scripts/evaluation/loss/mixtral_evaluate.sh
bash scripts/evaluation/loss/deepseek_evaluate.sh

Benchmarks

You should first install the evaluation pipeline (based on EleutherAI/lm-evaluation-harness).

cd ./lm-evaluation-harness
pip install -e .

Then run the following script:

bash scripts/evaluation/benchmark/run_benchmark.sh
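
The script above drives the harness; if you want to invoke it directly instead, a hedged example following the lm_eval CLI of recent harness releases is sketched below. The model path, task list, and batch size are placeholders, and the exact entry point may differ for the harness version bundled in this repository.

lm_eval --model hf \
    --model_args pretrained=./path/to/compressed_model,dtype=bfloat16 \
    --tasks arc_challenge,hellaswag \
    --batch_size 8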

Others

To add a dataset, please refer to the README.md in ./data.

Citation

@article{he2025towards,
  title={Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques},
  author={Shwai He and Daize Dong and Liang Ding and Ang Li},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=HTpMOl6xSI},
  note={}
}

Contact Us

If you have any questions, please contact:
