EfficientLLM: Pruning-Aware Pretraining

This repository contains the training code and models for EfficientLLM, introduced in our work "EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models".

News

  • Feb 10, 2025: 🚀 100M ~ 1B edge models are publicly available on HuggingFace.

1. Overview

Modern large language models (LLMs), driven by scaling laws, achieve emergent intelligence at large model sizes. Growing concerns about cloud cost, latency, and privacy make compact edge language models an urgent requirement. Unlike direct pretraining, which is bounded by the scaling law, this work proposes pruning-aware pretraining, which focuses on retaining the performance of much larger optimized models. It has the following characteristics: 1) Data-scalable: we introduce minimal parameter groups in the LLM and continuously optimize structural pruning, extending post-training pruning methods such as LLM-Pruner and SparseGPT into the pretraining phase. 2) Architecture-agnostic: the LLM architecture is auto-designed via saliency-driven pruning and, for the first time, exceeds SoTA human-designed LLMs in modern pretraining. By scaling up LLM compression and extending its boundary, this approach yields top-quality edge language models, termed EfficientLLM. EfficientLLM significantly outperforms SoTA baselines with 100M ∼ 1B parameters, such as MobileLLM, SmolLM, Qwen2.5-0.5B, OLMo-1B, and Llama3.2-1B, on common-sense benchmarks.

Figure 1: Pruning-aware pretraining. (a) The training loop includes joint saliency detection and weight optimization, pruning-type selection from the pruning space, and second-order weight updates. (b) Traditional post-training pruning methods can be embedded in the training loop to scale up. (c) Continuous compression of the model size during pretraining.
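To make the loop above concrete, the following is a minimal, hypothetical sketch of what a saliency-driven pruning-aware training step could look like in PyTorch. Every name here is illustrative, and the sketch substitutes a simple first-order Taylor saliency score and a plain optimizer update for the joint saliency detection and second-order weight updating described above; it is not the repository's actual implementation.

# Minimal, illustrative sketch of a pruning-aware pretraining step.
# Group definitions, the saliency criterion, and the schedule are
# simplifying assumptions for exposition, not EfficientLLM's actual code.

def group_saliency(params):
    # First-order Taylor saliency: |w * dL/dw| summed over the group.
    return sum(float((p * p.grad).abs().sum()) for p in params)

def pruning_aware_step(model, batch, optimizer, groups, prune_ratio):
    # 1) Standard language-modeling forward/backward pass.
    optimizer.zero_grad()
    loss = model(**batch).loss
    loss.backward()

    # 2) Saliency detection over minimal parameter groups
    #    (e.g. attention heads or FFN channels), reusing the same gradients.
    scores = {name: group_saliency(params) for name, params in groups.items()}

    # 3) Mask out the least-salient groups, giving continuous
    #    structural pruning during pretraining.
    num_pruned = int(len(scores) * prune_ratio)
    for name, _ in sorted(scores.items(), key=lambda kv: kv[1])[:num_pruned]:
        for p in groups[name]:
            p.data.zero_()
            p.grad.zero_()

    # 4) Update the remaining weights as usual.
    optimizer.step()
    return loss.item()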

2. Results on Zero-shot Benchmarks

3. Auto-Designed Architecture

4. Load Huggingface Models

To load a pre-trained model and tokenizer, you can use the following code snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("xrxing/EfficientLLM-469M", use_fast=False)

# Load the model
model = AutoModelForCausalLM.from_pretrained("xrxing/EfficientLLM-469M", trust_remote_code=True, attn_implementation="flash_attention_2")
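After loading, the model can be used with the standard Hugging Face generation API; the prompt and decoding settings below are only illustrative. Note that attn_implementation="flash_attention_2" requires the flash-attn package, a supported GPU, and typically a half-precision model (e.g. torch_dtype=torch.bfloat16); drop the argument to fall back to the default attention implementation.

import torch

# Move the model to GPU and switch to inference mode.
model = model.to("cuda").eval()

# Illustrative prompt; any text works.
prompt = "Edge language models are useful because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding of up to 32 new tokens.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))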

5. ToDo List

  • Release technical report
  • Release Huggingface models
  • Evaluation code
  • Pretraining code
  • Demos and applications

Contact

Xingrun Xing, CASIA, BAAI ([email protected])

Citation

If you find this work useful for your research, please consider citing:

@misc{xing2025efficientllm,
      title={EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models}, 
      author={Xingrun Xing and Zheng Liu and Shitao Xiao and Boyan Gao and Yiming Liang and Wanpeng Zhang and Haokun Lin and Guoqi Li and Jiajun Zhang},
      year={2025},
      eprint={2502.06663},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.06663}, 
}
