Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Model merging is a lightweight way to improve large language models, but its underlying mechanisms remain poorly understood. This work finds that, during late-stage pre-training, consecutive merged checkpoints collapse onto an approximately rank-1 (one-dimensional) subspace, even though the raw optimization steps oscillate strongly. The authors explain this with a river-valley view of the loss landscape, in which averaging acts as a geometric low-pass filter that damps high-curvature noise and exposes the descent direction. Building on this, they propose Extra-Merge, a training-free strategy that extrapolates along this subspace to reduce loss without additional gradient updates. Across GPT-2 and LLaMA models (124M to 2B), Extra-Merge consistently outperforms standard merging baselines; it also yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes to models trained with the Muon optimizer.

📝 Abstract

Model merging has emerged as a lightweight paradigm for enhancing Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. In this work, we analyze late-stage pre-training trajectories and uncover a Rank-1 Subspace phenomenon: while raw optimization steps oscillate violently, consecutive merged checkpoints collapse onto a stable, approximately one-dimensional linear manifold. We theoretically ground this observation in a river-valley landscape analysis: averaging acts as a geometric low-pass filter that dampens high-curvature noise to reveal the optimal descent direction. Capitalizing on this insight, we propose Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss without additional gradient updates. Extensive experiments across GPT-2 and LLaMA families (124M to 2B) demonstrate that Extra-Merge consistently outperforms standard merging baselines. Notably, it yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes effectively to the Muon optimizer \citep{jordan2024muon}.
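The idea in the abstract can be illustrated with a small NumPy sketch on synthetic data. It is not the authors' implementation: the abstract does not give the Extra-Merge update rule, so the uniform window averaging, the extrapolation formula, the step size `alpha`, and all function names here are assumptions made for illustration. The toy trajectory drifts slowly along one "valley" direction while oscillating in every other direction; averaging windows of checkpoints suppresses the oscillation, and the merged points then lie close to a line that can be extrapolated.

```python
import numpy as np


def merge_windows(checkpoints, window):
    """Uniformly average consecutive, non-overlapping windows of checkpoints."""
    n = (len(checkpoints) // window) * window
    dim = checkpoints.shape[1]
    return checkpoints[:n].reshape(-1, window, dim).mean(axis=1)


def rank1_energy(points):
    """Fraction of the variance of the centered points captured by the
    top singular direction (1.0 means the points lie exactly on a line)."""
    centered = points - points.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


def extrapolate(merged, alpha):
    """Hypothetical extrapolation rule (an assumption, not from the paper):
    step past the latest merged checkpoint along the direction joining the
    last two merged checkpoints."""
    return merged[-1] + alpha * (merged[-1] - merged[-2])


# Synthetic "late-stage" trajectory: slow drift along one direction plus
# large isotropic noise standing in for high-curvature oscillation.
rng = np.random.default_rng(0)
dim, steps, window = 50, 400, 20
valley = np.zeros(dim)
valley[0] = 1.0
progress = 0.01 * np.arange(steps)
raw = progress[:, None] * valley + rng.normal(scale=0.5, size=(steps, dim))

merged = merge_windows(raw, window)
raw_energy = rank1_energy(raw)
merged_energy = rank1_energy(merged)
theta_extra = extrapolate(merged, alpha=1.0)

print(f"rank-1 energy, raw checkpoints:    {raw_energy:.3f}")
print(f"rank-1 energy, merged checkpoints: {merged_energy:.3f}")
```

In this toy setting the merged checkpoints concentrate far more of their variance in a single direction than the raw ones, which is the qualitative behavior the abstract describes. In practice the rows would be flattened model parameters, and `alpha` would need to be chosen by evaluating the loss along the extrapolation line.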

Problem

Research questions and friction points this paper is trying to address.

model merging

Rank-1 Subspace

pre-training

large language models

optimization trajectory

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rank-1 Subspace

Model Merging

Extra-Merge