EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high latency, elevated operational costs, and privacy risks arising from cloud dependency in edge deployment, this work proposes a pruning-aware pretraining paradigm that breaks conventional model scaling laws and deeply integrates large-model compression with pretraining. Methodologically, it introduces the first structured-pruning-driven automated neural architecture search, jointly optimizing pretraining objectives and sparsity constraints while incorporating parameter-group minimization. It also extends the SparseGPT and LLM-Pruner frameworks to the pretraining stage, the first such adaptation, thereby bridging the performance gap between compressed and directly pretrained models. Evaluated across 100M–1B parameter scales, the approach consistently outperforms state-of-the-art edge-oriented models, including MobileLLM, SmolLM, and Qwen2.5-0.5B, achieving significant gains on commonsense reasoning benchmarks. It supports architecture-agnostic, data-scalable compact model generation, and all code is fully open-sourced.

📝 Abstract
Modern large language models (LLMs), driven by scaling laws, achieve emergent intelligence at large model sizes. Recently, growing concerns about cloud costs, latency, and privacy have made it urgent to develop compact edge language models. In contrast to direct pretraining, which is bounded by the scaling law, this work proposes pruning-aware pretraining, which focuses on retaining the performance of much larger optimized models. It has the following characteristics: 1) Data-scalable: we introduce minimal parameter groups in LLMs and continuously optimize structural pruning, extending post-training pruning methods such as LLM-Pruner and SparseGPT into the pretraining phase. 2) Architecture-agnostic: the LLM architecture is auto-designed using saliency-driven pruning, which for the first time exceeds SoTA human-designed LLMs in modern pretraining. We show that this yields top-quality edge language models, termed EfficientLLM, by scaling up LLM compression and extending its boundary. EfficientLLM significantly outperforms SoTA baselines with $100M \sim 1B$ parameters, such as MobileLLM, SmolLM, Qwen2.5-0.5B, OLMo-1B, and Llama3.2-1B, on common-sense benchmarks. As a first attempt, EfficientLLM bridges the performance gap between traditional LLM compression and direct pretraining methods, and the code will be fully open-sourced at https://github.com/Xingrun-Xing2/EfficientLLM.
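The abstract's "minimal parameter groups" and saliency-driven structural pruning can be illustrated with a toy sketch. This is a hypothetical NumPy example, not the paper's implementation: it uses a simple L2-norm saliency (the paper's actual criterion is not specified here) to remove whole hidden channels from a pair of stacked linear layers, shrinking the architecture rather than merely zeroing weights:

```python
import numpy as np

def prune_linear_pair(w1, w2, keep_ratio=0.5):
    """Structurally prune the shared hidden dimension of two stacked layers.

    w1: (hidden, in) weights of the first layer; w2: (out, hidden) weights
    of the second. Hidden channels are ranked by an L2 saliency (a stand-in
    for the gradient-based saliency used by LLM-Pruner-style methods) and
    the least salient channels are removed from both matrices.
    """
    hidden = w1.shape[0]
    # Saliency per hidden channel: norm of its incoming + outgoing weights.
    saliency = np.linalg.norm(w1, axis=1) + np.linalg.norm(w2, axis=0)
    n_keep = max(1, int(hidden * keep_ratio))
    keep = np.sort(np.argsort(saliency)[-n_keep:])  # surviving channels
    return w1[keep, :], w2[:, keep]

rng = np.random.default_rng(0)
w1 = rng.normal(size=(8, 4))   # 8 hidden channels, 4 inputs
w2 = rng.normal(size=(3, 8))   # 3 outputs
p1, p2 = prune_linear_pair(w1, w2, keep_ratio=0.5)
print(p1.shape, p2.shape)  # (4, 4) (3, 4)
```

In pruning-aware pretraining, a decision like this would be revisited continuously during training rather than applied once after the fact, which is what lets the pruned architecture itself be searched.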
Problem

Research questions and friction points this paper is trying to address.

Develops compact edge language models
Introduces pruning-aware pretraining method
Bridges performance gap in LLM compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pruning-aware pretraining for edge models
Architecture-agnostic auto-designed LLMs
Scaling up LLM compression boundaries