IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inference budget constraints in large language model (LLM) deployment, this paper proposes an end-to-end "enlarge–prune–recover" joint pretraining paradigm. It presents the first systematic study of pretraining an enlarged model that is never itself deployed (here, a 2.8B-parameter model) and designs an iterative structured pruning mechanism that removes parameters gradually and reallocates capacity to surviving neurons, jointly optimizing enlarged-model training, sparsification, and capability recovery under a single cosine-annealed learning rate schedule. After pretraining on 2T tokens, the 2.8B model is efficiently compressed to 1.3B parameters, outperforming both same-scale from-scratch models and conventional pruning baselines across multiple benchmarks. The results empirically validate that enlarge-and-prune pretraining significantly improves token efficiency and the inference cost-performance trade-off, establishing a new paradigm for resource-constrained LLM deployment.
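The single-schedule idea can be illustrated with a small sketch: one cosine-annealed learning rate spans the enlarge, prune, and recovery phases, so pruning and recovery never restart at a high, knowledge-destroying learning rate. All step counts, phase boundaries, and learning-rate values below are illustrative assumptions, not taken from the paper.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    """Single cosine-annealed learning rate over the whole pipeline."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Hypothetical phase boundaries within one shared schedule: the prune and
# recovery phases inherit a low, still-decaying learning rate instead of
# restarting a fresh schedule at lr_max.
total = 2_000_000  # illustrative step count for the full 2T-token run
phases = {"enlarge": 0, "prune": 1_400_000, "recover": 1_600_000}
for name, step in phases.items():
    print(f"{name:8s} starts at lr = {cosine_lr(step, total):.2e}")
```

A naive pipeline would instead anneal the enlarged model to `lr_min` and then restart near `lr_max` for recovery, which the paper identifies as a source of knowledge loss.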

📝 Abstract
Recent advancements in large language models have intensified the need for efficient and deployable models within limited inference budgets. Structured pruning pipelines have shown promise in token efficiency compared to training target-size models from scratch. In this paper, we advocate incorporating enlarged model pretraining, which is often ignored in previous works, into pruning. We study the enlarge-and-prune pipeline as an integrated system to address two critical questions: whether it is worth pretraining an enlarged model even when it is never deployed, and how to optimize the entire pipeline for better pruned models. We propose an integrated enlarge-and-prune pipeline, which combines enlarged model training, pruning, and recovery under a single cosine annealing learning rate schedule. This approach is further complemented by a novel iterative structured pruning method for gradual parameter removal. The proposed method mitigates the knowledge loss caused by the rising learning rate in naive enlarge-and-prune pipelines and enables effective redistribution of model capacity among surviving neurons, facilitating smooth compression and enhanced performance. We conduct comprehensive experiments on compressing 2.8B models to 1.3B with up to 2T tokens in pretraining. The results demonstrate that the integrated approach not only provides insights into the token efficiency of enlarged model pretraining but also yields superior pruned models.
Problem

Research questions and friction points this paper is trying to address.

Deploying efficient models within limited inference budgets
Deciding whether pretraining an enlarged model pays off when it is never deployed, and optimizing the enlarge-and-prune pipeline for better pruned models
Mitigating knowledge loss during compression and redistributing capacity among surviving neurons
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated enlarge-and-prune pipeline for model compression
Cosine annealing learning rate schedule for training
Iterative structured pruning for gradual parameter removal
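The last point can be sketched in a few lines of pure Python: instead of cutting the network to its target width in one shot, each round drops only the lowest-importance output neurons, leaving room for recovery training in between. The importance score (column L2 norm), round count, and toy weights here are illustrative assumptions, not the paper's actual pruning criterion.

```python
import math

def col_importance(col):
    """L2 norm of a weight column -- an illustrative importance proxy."""
    return math.sqrt(sum(x * x for x in col))

def iterative_prune(cols, target, n_rounds=4):
    """Gradually drop the lowest-importance columns (neurons) over several
    rounds rather than pruning to the target width all at once."""
    per_round = (len(cols) - target) // n_rounds
    for _ in range(n_rounds):
        ranked = sorted(range(len(cols)), key=lambda i: col_importance(cols[i]))
        drop = set(ranked[:per_round])          # weakest neurons this round
        cols = [c for i, c in enumerate(cols) if i not in drop]
        # (the real pipeline interleaves recovery training between rounds,
        #  letting surviving neurons absorb the removed capacity)
    return cols

cols = [[float(i + 1)] * 4 for i in range(16)]  # toy "neurons" with rising norms
pruned = iterative_prune(cols, target=8)
print(len(pruned))  # 8
```

With 16 columns, a target of 8, and 4 rounds, each round removes 2 neurons, which mirrors the "gradual parameter removal" the pipeline advocates.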