Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training

📅 2026-05-25
🤖 AI Summary
The mechanisms behind model merging remain poorly understood. This work finds that during late-stage pre-training, raw optimization steps oscillate strongly, yet consecutive merged checkpoints collapse onto an approximately rank-1 (one-dimensional) linear subspace. The authors explain this with a river-valley view of the loss landscape, in which averaging acts as a geometric low-pass filter that dampens high-curvature noise and exposes the descent direction. Building on this, they propose Extra-Merge, a training-free strategy that extrapolates along the subspace to reduce loss without additional gradient updates. In experiments on GPT-2 and LLaMA models (124M–2B), Extra-Merge consistently outperforms standard merging baselines; it also yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes to models trained with the Muon optimizer.
📝 Abstract
Model merging has emerged as a lightweight paradigm for enhancing Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. In this work, we analyze late-stage pre-training trajectories and uncover a Rank-1 Subspace phenomenon: while raw optimization steps oscillate violently, consecutive merged checkpoints collapse onto a stable, approximately one-dimensional linear manifold. We theoretically ground this observation in a river-valley landscape analysis: averaging acts as a geometric low-pass filter that dampens high-curvature noise to reveal the optimal descent direction. Capitalizing on this insight, we propose Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss without additional gradient updates. Extensive experiments across GPT-2 and LLaMA families (124M to 2B) demonstrate that Extra-Merge consistently outperforms standard merging baselines. Notably, it yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes effectively to the Muon optimizer (Jordan et al., 2024).
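The idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not give the exact Extra-Merge update rule, averaging window, or extrapolation coefficient, so the uniform window average, the linear step `merged[-1] + alpha * (merged[-1] - merged[-2])`, and all names below (`toy_valley_trajectory`, `window_average`, `rank1_energy`, `extrapolate`, `alpha`) are assumptions made for illustration. The toy trajectory drifts slowly along a "river" axis while bouncing between the "valley walls"; averaging cancels the bounce, and the merged points then lie on a single line that can be extrapolated without any gradient step.

```python
import numpy as np


def toy_valley_trajectory(steps=20, amplitude=5.0):
    """Hypothetical checkpoints: slow drift along x, sign-flipping bounce along y."""
    river = np.array([1.0, 0.0, 0.0])  # flat direction of slow progress
    wall = np.array([0.0, 1.0, 0.0])   # steep direction with oscillation
    return np.stack(
        [0.1 * t * river + amplitude * (-1) ** t * wall for t in range(steps)]
    )


def window_average(checkpoints, window):
    """Uniformly average every run of `window` consecutive checkpoints."""
    ckpts = np.asarray(checkpoints, dtype=np.float64)
    return np.stack(
        [ckpts[i:i + window].mean(axis=0) for i in range(len(ckpts) - window + 1)]
    )


def rank1_energy(points):
    """Share of the centered points' squared norm held by the top singular direction."""
    centered = points - points.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return float(singular_values[0] ** 2 / np.sum(singular_values ** 2))


def extrapolate(merged, alpha):
    """Assumed rule: step past the newest merged point along the last merged displacement."""
    direction = merged[-1] - merged[-2]
    return merged[-1] + alpha * direction


if __name__ == "__main__":
    trajectory = toy_valley_trajectory()
    merged = window_average(trajectory, window=2)
    print("last raw step:     ", trajectory[-1] - trajectory[-2])
    print("last merged step:  ", merged[-1] - merged[-2])
    print("rank-1 energy:     ", rank1_energy(merged))
    print("extrapolated point:", extrapolate(merged, alpha=2.0))
```

In this toy setting the last raw step is dominated by the wall direction, while the last merged step points along the river, which is the low-pass filtering effect the abstract describes. On real models the parameters would be flattened weight vectors and the collapse would be only approximate, so both the window and `alpha` would need to be tuned against validation loss.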
Problem

Research questions and friction points this paper is trying to address.

model merging
Rank-1 Subspace
pre-training
large language models
optimization trajectory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rank-1 Subspace
Model Merging
Extra-Merge
River-Valley Landscape
Training-Free Extrapolation