LESA: Learnable LLM Layer Scaling-Up

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Training large language models (LLMs) from scratch incurs prohibitive computational costs, and existing depth-scaling methods—relying on heuristic layer duplication—suffer from poor initialization and slow convergence. Method: This paper proposes a learnable depth-scaling framework that models inter-layer parameter relationships as a data-driven, differentiable process, abandoning rule-based replication. Leveraging singular value decomposition to uncover structural regularities in layer parameters, the framework employs a lightweight neural network to predict optimal parameters for newly inserted layers. Contribution/Results: Extensive experiments demonstrate that our method significantly accelerates convergence during continued pretraining, reduces computational overhead by over 50%, and exhibits strong generalization across diverse model scales and downstream tasks—outperforming conventional depth-scaling baselines without requiring task-specific fine-tuning or architectural modifications.

📝 Abstract
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. However, existing depth scaling-up methods rely on empirical heuristic rules for layer duplication, which result in poorer initialization and slower convergence during continual pre-training. We propose LESA, a novel learnable method for depth scaling-up. By concatenating parameters from each layer and applying Singular Value Decomposition, we uncover latent patterns between layers, suggesting that inter-layer parameters can be learned. LESA uses a neural network to predict the parameters inserted between adjacent layers, enabling better initialization and faster training. Experiments show that LESA outperforms existing baselines, achieving superior performance with less than half the computational cost during continual pre-training. Extensive analyses demonstrate its effectiveness across different model sizes and tasks.
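The pipeline described in the abstract (stack per-layer parameters, apply SVD to expose shared structure, then predict the parameters of a newly inserted layer) can be illustrated with a toy sketch. This is a minimal illustration, not the authors' implementation: the matrix shapes, the 90% energy threshold, and the blending predictor `predict_inserted_layer` are all assumptions made here; LESA trains a neural network for that prediction step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for per-layer weights of a small transformer:
# row i holds the flattened parameters of layer i (hypothetical sizes).
n_layers, d = 8, 64
layer_params = rng.normal(size=(n_layers, d))

# Step 1: stack layer parameters and apply SVD to look for shared
# low-rank structure across layers, as the abstract describes.
U, S, Vt = np.linalg.svd(layer_params, full_matrices=False)
energy = np.cumsum(S**2) / np.sum(S**2)
rank_90 = int(np.searchsorted(energy, 0.90)) + 1  # rank capturing 90% energy

# Step 2 (simplified): produce parameters for a layer inserted between
# adjacent layers i and i+1. LESA learns this mapping with a neural
# network; here we use the crudest baseline -- a weighted blend of the
# two layers' coefficients in the SVD basis, mapped back to weight space.
def predict_inserted_layer(i, w=0.5):
    coeffs = w * (U[i] * S) + (1 - w) * (U[i + 1] * S)
    return coeffs @ Vt  # parameter vector for the new layer

new_layer = predict_inserted_layer(3)
assert new_layer.shape == (d,)
```

With `w=1.0` the blend reproduces layer `i` exactly, which is a useful sanity check that the SVD round-trip is lossless before swapping in a learned predictor.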
Problem

Research questions and friction points this paper is trying to address.

Prohibitive computational cost of training LLMs from scratch
Poor initialization and slow convergence caused by heuristic layer duplication
Lack of a learnable approach to depth scaling-up
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable depth scaling-up method
Singular Value Decomposition analysis
Neural network parameter prediction
Yifei Yang
Shanghai Jiao Tong University
Natural Language Processing
Zouying Cao
Shanghai Jiao Tong University
Natural Language Processing; Large Language Models; Reinforcement Learning
Xinbei Ma
Shanghai Jiao Tong University
Yao Yao
Department of Computer Science and Engineering, Shanghai Jiao Tong University; Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University; Shanghai Key Laboratory of Trusted Data Circulation and Governance in Web3
Libo Qin
School of Computer Science and Engineering, Central South University
Zhi Chen
ByteDance
Hai Zhao
Department of Computer Science and Engineering, Shanghai Jiao Tong University; Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University; Shanghai Key Laboratory of Trusted Data Circulation and Governance in Web3