This repository contains the training code and models for EfficientLLM, introduced in our work: "EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models".
- Feb 10, 2025: 🚀 100M ~ 1B edge models are publicly available on HuggingFace.
Modern large language models (LLMs), driven by scaling laws, achieve emergent intelligence at large model sizes. Recently, growing concerns about cloud cost, latency, and privacy have made it urgent to develop compact edge language models. Unlike direct pretraining, which is bounded by the scaling law, this work proposes pruning-aware pretraining, which focuses on retaining the performance of a much larger optimized model. It has the following characteristics: 1) Data-scalable: we introduce minimal parameter groups in the LLM and continuously optimize structural pruning, extending post-training pruning methods such as LLM-Pruner and SparseGPT into the pretraining phase. 2) Architecture-agnostic: the LLM architecture is auto-designed via saliency-driven pruning, which for the first time exceeds SoTA human-designed LLMs in modern pretraining. We show that by scaling up LLM compression and extending its boundary, this approach yields top-quality edge language models, termed EfficientLLM. EfficientLLM significantly outperforms SoTA baselines with 100M ∼ 1B parameters, such as MobileLLM, SmolLM, Qwen2.5-0.5B, OLMo-1B, and Llama3.2-1B, on common sense benchmarks.
Figure 1: Pruning-aware pretraining. (a) The training loop consists of joint saliency detection and weight optimization, pruning-type selection from the pruning space, and second-order weight updating. (b) Traditional post-training pruning can be embedded in the training loop to scale up. (c) Model size is compressed continuously during pretraining.
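For intuition, below is a minimal, self-contained sketch of the core loop: score each minimal parameter group (here, the output rows of a single linear layer) with a first-order Taylor saliency, and periodically drop the least salient group during training. The toy model, the schedule (prune_every, target_rows), and the saliency criterion are illustrative assumptions, not the paper's exact pruning space or selection rule:

import torch
import torch.nn as nn

torch.manual_seed(0)
in_dim, target_rows, prune_every = 16, 8, 50   # hypothetical schedule
model = nn.Linear(in_dim, 16, bias=False)      # stand-in for one prunable projection
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(500):
    x = torch.randn(32, in_dim)
    loss = model(x).pow(2).mean()              # toy training objective
    opt.zero_grad(); loss.backward(); opt.step()

    # Every `prune_every` steps, remove the least-salient output row
    # (a "minimal parameter group") until the target size is reached.
    if step % prune_every == 0 and model.out_features > target_rows:
        w, g = model.weight.detach(), model.weight.grad
        saliency = (w * g).abs().sum(dim=1)    # first-order Taylor score per row
        keep = saliency.topk(model.out_features - 1).indices.sort().values
        model = nn.Linear(in_dim, len(keep), bias=False)
        model.weight = nn.Parameter(w[keep])
        opt = torch.optim.SGD(model.parameters(), lr=1e-2)
        # NOTE: the paper also applies a second-order (SparseGPT-style) weight
        # update when pruning; this toy sketch omits it.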
To load a pre-trained model and tokenizer, you can use the following code snippet:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("xrxing/EfficientLLM-469M", use_fast=False)
# Load the model
model = AutoModelForCausalLM.from_pretrained("xrxing/EfficientLLM-469M", trust_remote_code=True, attn_implementation="flash_attention_2")
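You can then run a quick generation check. This is a minimal sketch: the prompt and decoding settings are placeholders, and attn_implementation="flash_attention_2" assumes a CUDA device with a compatible half-precision dtype.

prompt = "Edge language models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; adjust max_new_tokens and sampling to taste.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))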
Roadmap:
- Release technical report
- Release Huggingface models
- Evaluation code
- Pretraining code
- Demos and applications
Xingrun Xing, CASIA, BAAI ([email protected])
If you find this work useful for your research, please consider citing:
@misc{xing2025efficientllm,
      title={EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models},
      author={Xingrun Xing and Zheng Liu and Shitao Xiao and Boyan Gao and Yiming Liang and Wanpeng Zhang and Haokun Lin and Guoqi Li and Jiajun Zhang},
      year={2025},
      eprint={2502.06663},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.06663},
}

