This repository contains the training code and models for EfficientLLM, introduced in our work: "EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models".
- Feb 10, 2025: 🚀 100M ~ 1B edge models are publicly available on HuggingFace.
Modern large language models (LLMs), driven by scaling laws, achieve emergent intelligence at large model sizes. Recently, growing concerns about cloud cost, latency, and privacy have made it urgent to develop compact edge language models. Unlike direct pretraining, which is bounded by the scaling law, this work proposes pruning-aware pretraining, which focuses on retaining the performance of a much larger optimized model. It has the following characteristics: 1) Data-scalable: we introduce minimal parameter groups in the LLM and continuously optimize structural pruning, extending post-training pruning methods such as LLM-Pruner and SparseGPT into the pretraining phase. 2) Architecture-agnostic: the LLM architecture is auto-designed via saliency-driven pruning, which for the first time exceeds SoTA human-designed LLMs in modern pretraining. We show that by scaling up LLM compression and extending its boundary, this approach yields top-quality edge language models, termed EfficientLLM. EfficientLLM significantly outperforms SoTA baselines with 100M ∼ 1B parameters, such as MobileLLM, SmolLM, Qwen2.5-0.5B, OLMo-1B, and Llama3.2-1B, on common sense benchmarks.
Figure 1: Pruning-aware pretraining. (a) The training loop consists of joint saliency detection and weight optimization, pruning-type selection from the pruning space, and second-order weight updating. (b) Traditional post-training pruning can be embedded in the training loop to scale up. (c) Model size is compressed continuously during pretraining.
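For intuition, below is a minimal, self-contained sketch of the core loop: score each minimal parameter group (here, the output rows of a single linear layer) with a first-order Taylor saliency, and periodically drop the least salient group during training. The toy model, the schedule (prune_every, target_rows), and the saliency criterion are illustrative assumptions, not the paper's exact pruning space or selection rule:

import torch
import torch.nn as nn

torch.manual_seed(0)
in_dim, target_rows, prune_every = 16, 8, 50   # hypothetical schedule
model = nn.Linear(in_dim, 16, bias=False)      # stand-in for one prunable projection
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(500):
    x = torch.randn(32, in_dim)
    loss = model(x).pow(2).mean()              # toy training objective
    opt.zero_grad(); loss.backward(); opt.step()

    # Every `prune_every` steps, remove the least-salient output row
    # (a "minimal parameter group") until the target size is reached.
    if step % prune_every == 0 and model.out_features > target_rows:
        w, g = model.weight.detach(), model.weight.grad
        saliency = (w * g).abs().sum(dim=1)    # first-order Taylor score per row
        keep = saliency.topk(model.out_features - 1).indices.sort().values
        model = nn.Linear(in_dim, len(keep), bias=False)
        model.weight = nn.Parameter(w[keep])
        opt = torch.optim.SGD(model.parameters(), lr=1e-2)
        # NOTE: the paper also applies a second-order (SparseGPT-style) weight
        # update when pruning; this toy sketch omits it.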
To load a pre-trained model and tokenizer, you can use the following code snippet:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("xrxing/EfficientLLM-469M", use_fast=False)
# Load the model
model = AutoModelForCausalLM.from_pretrained("xrxing/EfficientLLM-469M", trust_remote_code=True, attn_implementation="flash_attention_2")
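You can then run a quick generation check. This is a minimal sketch: the prompt and decoding settings are placeholders, and attn_implementation="flash_attention_2" assumes a CUDA device with a compatible half-precision dtype.

prompt = "Edge language models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; adjust max_new_tokens and sampling to taste.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))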
Roadmap:
- Release technical report
- Release Huggingface models
- Evaluation code
- Pretraining code
- Demos and applications
Xingrun Xing, CASIA, BAAI ([email protected])
If you find this work useful for your research, please consider citing:
@misc{xing2025efficientllm,
      title={EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models},
      author={Xingrun Xing and Zheng Liu and Shitao Xiao and Boyan Gao and Yiming Liang and Wanpeng Zhang and Haokun Lin and Guoqi Li and Jiajun Zhang},
      year={2025},
      eprint={2502.06663},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.06663},
}

