SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning

📅 2026-02-02
🤖 AI Summary
This work addresses the training instability commonly observed during mid-stage width expansion of neural networks: naive initialization disrupts activation statistics and triggers loss spikes, while copy-based initialization introduces gradient symmetry between duplicated units that limits feature diversity. To resolve both issues, the authors propose SPARKLING, a stable progressive-learning framework for mid-stage width expansion. The approach enforces RMS-scale consistency to stabilize activation statistics, breaks gradient symmetry through asymmetric optimizer state resetting, and re-warms the learning rate after expansion. The framework is compatible with diverse optimizers and Mixture-of-Experts (MoE) architectures, saving up to 35% of training cost at 2× width expansion relative to training from scratch, while remaining robust across widths and optimizer configurations.
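The two symmetry-breaking ingredients named in the summary can be sketched in a few lines. This is not the authors' code: the function names (`reset_states_asymmetric`, `rewarmed_lr`), the Adam-style state layout, and the linear warmup shape are all illustrative assumptions; only the ideas (reset optimizer moments for the copied units only, re-warm the learning rate after expansion) come from the paper.

```python
import numpy as np

def reset_states_asymmetric(adam_state, n_old):
    """Zero the Adam-style moments of the newly copied rows only.

    The original rows keep their accumulated state, the clones start
    fresh, so the two copies of each unit receive different first
    updates and their gradient symmetry is broken.
    """
    for key in ("m", "v"):
        adam_state[key][n_old:] = 0.0
    return adam_state

def rewarmed_lr(step, peak_lr, warm_steps):
    """Linear learning-rate re-warmup after expansion, then hold at peak."""
    return peak_lr * min(1.0, (step + 1) / warm_steps)
```

Any schedule that ramps back up from a small value would serve the same purpose; the linear ramp is just the simplest choice to show.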

📝 Abstract
Progressive Learning (PL) reduces pre-training computational overhead by gradually increasing model scale. While prior work has extensively explored depth expansion, width expansion remains significantly understudied, with the few existing methods limited to the early stages of training. However, expanding width during the mid-stage is essential for maximizing computational savings, yet it remains a formidable challenge due to severe training instabilities. Empirically, we show that naive initialization at this stage disrupts activation statistics, triggering loss spikes, while copy-based initialization introduces gradient symmetry that hinders feature diversity. To address these issues, we propose SPARKLING (balancing Signal Preservation And symmetRy breaKing for width-progressive LearnING), a novel framework for mid-stage width expansion. Our method achieves signal preservation via RMS-scale consistency, stabilizing activation statistics during expansion. Symmetry breaking is ensured through asymmetric optimizer state resetting and learning rate re-warmup. Extensive experiments on Mixture-of-Experts (MoE) models demonstrate that, across multiple width axes and optimizer families, SPARKLING consistently outperforms training from scratch and reduces training cost by up to 35% under 2× width expansion.
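The interplay between copy-based initialization and signal preservation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the FFN shape, the `widen_ffn` name, and the weight-noise tie-breaker are assumptions (the paper breaks symmetry via asymmetric optimizer state resetting rather than noise). The sketch only shows why duplicating hidden units while rescaling the consuming weights keeps the block's output, and hence its RMS scale, unchanged at the moment of expansion.

```python
import numpy as np

def widen_ffn(W_up, W_down, factor=2, sym_eps=1e-2, rng=None):
    """Widen a two-layer FFN's hidden dimension by `factor` (illustrative).

    - Copy-based init: duplicate the hidden units (rows of W_up), so
      hidden activation statistics are unchanged.
    - Signal preservation: divide the consuming columns of W_down by
      `factor`, so the block's output is exactly preserved.
    - Symmetry breaking: perturb only the copied rows (stand-in for the
      paper's asymmetric optimizer state reset).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d_hidden = W_up.shape[0]
    W_up_new = np.tile(W_up, (factor, 1))
    W_down_new = np.tile(W_down, (1, factor)) / factor
    if sym_eps > 0:
        W_up_new[d_hidden:] += sym_eps * rng.standard_normal(
            W_up_new[d_hidden:].shape)
    return W_up_new, W_down_new

def ffn(x, W_up, W_down):
    """A plain ReLU FFN block: down-projection of relu(up-projection)."""
    return W_down @ np.maximum(W_up @ x, 0.0)
```

With `sym_eps=0` the expansion is exactly function-preserving; with a small `sym_eps`, the clones diverge under training while the output signal is only marginally perturbed.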
Problem

Research questions and friction points this paper is trying to address.

Progressive Learning
Width Expansion
Training Instability
Symmetry Breaking
Signal Preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Width-Progressive Learning
Signal Preservation
Symmetry Breaking
Optimizer State Resetting
RMS-Scale Consistency
Qifan Yu
Zhejiang University
MLLM, multimodal learning, image generation & editing

Xinyu Ma
Researcher @ ByteDance Seed | Ph.D. @ PKU
Large Language Models, Graph Learning, EMR Analysis

Zhijian Zhuo
ByteDance Seed

Minrui Wang
ByteDance Seed

Deyi Liu
ByteDance Seed

Shiyi Zhan
ByteDance Seed

Yiyuan Ma
ByteDance Seed

Liang Xiang
ByteDance Seed

Xingyan Bin
ByteDance Seed

Di He
Peking University
Machine Learning