AI Summary
Existing model merging methods lack theoretical grounding and often yield inconsistent performance gains. This work observes that, during late-stage pretraining, consecutive merged checkpoints collapse onto an approximately rank-1 (one-dimensional) subspace, even though the raw optimization steps oscillate strongly. Building on this observation, the authors propose Extra-Merge, a training-free merging strategy that extrapolates along this subspace. The analysis models the loss landscape as a "river-valley" structure in which averaging acts as a geometric low-pass filter, damping high-curvature noise and exposing the descent direction, so no additional gradient updates are needed. In experiments on GPT-2 and LLaMA models (124M–2B), Extra-Merge consistently outperforms standard merging baselines; it also yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes to models trained with the Muon optimizer.
Abstract
Model merging has emerged as a lightweight paradigm for enhancing Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. In this work, we analyze late-stage pre-training trajectories and uncover a \textbf{Rank-1 Subspace} phenomenon: while raw optimization steps oscillate violently, consecutive \emph{merged} checkpoints collapse onto a stable, approximately one-dimensional linear manifold. We theoretically ground this observation in a \emph{river-valley} landscape analysis: averaging acts as a geometric low-pass filter that dampens high-curvature noise to reveal the optimal descent direction. Capitalizing on this insight, we propose \textbf{Extra-Merge}, a training-free strategy that extrapolates along this subspace to minimize loss without additional gradient updates. Extensive experiments across GPT-2 and LLaMA families (124M to 2B) demonstrate that Extra-Merge consistently outperforms standard merging baselines. Notably, it yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes effectively to the Muon optimizer \citep{jordan2024muon}.
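To make the idea in the abstract concrete, the following is a minimal NumPy sketch on a synthetic trajectory, not the authors' implementation. It assumes a toy "river" direction with slow drift plus large per-step noise, non-overlapping window averaging as the merge operation, and a hypothetical extrapolation coefficient `alpha`. The helper names `merge_windows`, `rank1_energy`, and `extrapolate` are illustrative, and the paper's exact merging and extrapolation rules may differ.

```python
import numpy as np


def merge_windows(checkpoints, window):
    """Average consecutive, non-overlapping windows of raw checkpoints."""
    n = (len(checkpoints) // window) * window
    dim = checkpoints.shape[1]
    return checkpoints[:n].reshape(-1, window, dim).mean(axis=1)


def rank1_energy(points):
    """Fraction of squared step norm captured by the top singular direction."""
    steps = np.diff(points, axis=0)
    s = np.linalg.svd(steps, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


def extrapolate(merged, alpha):
    """Move past the last merged checkpoint along the dominant step direction.

    `alpha` is a hypothetical coefficient: alpha=1.0 advances by one
    average merged step. No gradients are computed.
    """
    steps = np.diff(merged, axis=0)
    _, _, vt = np.linalg.svd(steps, full_matrices=False)
    v = vt[0]
    # SVD leaves the sign of v arbitrary; orient it along the net movement.
    if steps.sum(axis=0) @ v < 0:
        v = -v
    step_len = float(np.mean(steps @ v))
    return merged[-1] + alpha * step_len * v


# Synthetic late-stage trajectory (assumption): slow drift along one
# "river" axis plus large isotropic noise on every raw step.
rng = np.random.default_rng(0)
dim, n_steps, window = 50, 400, 50
river = np.zeros(dim)
river[0] = 1.0
drift = 0.01 * np.arange(n_steps)[:, None] * river
raw = drift + 0.1 * rng.standard_normal((n_steps, dim))

merged = merge_windows(raw, window)
raw_energy = rank1_energy(raw)        # small: raw steps are dominated by noise
merged_energy = rank1_energy(merged)  # close to 1: merged steps are nearly collinear
w_extra = extrapolate(merged, alpha=1.0)

print(f"rank-1 energy, raw steps:    {raw_energy:.3f}")
print(f"rank-1 energy, merged steps: {merged_energy:.3f}")
print(f"progress along river axis:   {w_extra[0] - merged[-1][0]:.3f}")
```

In this toy setting, averaging suppresses the per-step noise, so the merged checkpoints line up along the drift direction while the raw steps do not. With real models, a reasonable way to choose `alpha` would be to evaluate held-out loss at a few candidate values, though that selection procedure is an assumption here rather than something stated in the abstract.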