How does the optimizer implicitly bias the model merging loss landscape?

📅 2025-10-06

📈 Citations: 0

✨ Influential: 0

career value

185K/year

🤖 AI Summary

This work investigates how optimization dynamics influence the geometric structure of the loss landscape and, consequently, model merging efficacy. Method: By systematically varying learning rate, weight decay, batch size, and data augmentation, we quantify an “effective noise scale” induced jointly by optimizer design and data selection. We analyze its impact on merging performance across diverse architectures and datasets. Contribution/Results: We establish, for the first time, a quantitative link between optimization dynamics and model mergeability. The effective noise scale—not only modulates local minimum flatness but also shapes global landscape geometry—exhibits a non-monotonic relationship with merging performance. Empirically, we identify a well-defined optimal range of effective noise; tuning hyperparameters to operate within this regime consistently improves linear interpolation and task arithmetic. These findings hold across multiple model architectures (e.g., ViT, ResNet) and benchmarks (e.g., ImageNet, CIFAR-100), demonstrating broad applicability and offering principled guidance for enhancing model merging.

Technology Category

Application Category

📝 Abstract

Model merging methods combine models with different capabilities into a single one while maintaining the same inference cost. Two popular approaches are linear interpolation, which linearly interpolates between model weights, and task arithmetic, which combines task vectors obtained by the difference between finetuned and base models. While useful in practice, what properties make merging effective are poorly understood. This paper explores how the optimization process affects the loss landscape geometry and its impact on merging success. We show that a single quantity -- the effective noise scale -- unifies the impact of optimizer and data choices on model merging. Across architectures and datasets, the effectiveness of merging success is a non-monotonic function of effective noise, with a distinct optimum. Decomposing this quantity, we find that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale, exhibiting the same qualitative trend. Unlike prior work that connects optimizer noise to the flatness or generalization of individual minima, we show that it also affects the global loss landscape, predicting when independently trained solutions can be merged. Our findings broaden the understanding of how optimization shapes the loss landscape geometry and its downstream consequences for model merging, suggesting the possibility of further manipulating the training dynamics to improve merging effectiveness.

Problem

Research questions and friction points this paper is trying to address.

Explores how optimizer choices affect loss landscape geometry for model merging

Identifies effective noise scale as key factor unifying optimizer and data impacts

Shows optimization noise globally influences when trained models can be merged

Innovation

Methods, ideas, or system contributions that make the work stand out.

Effective noise scale unifies optimizer and data impacts

Merging success non-monotonically depends on effective noise

Training dynamics manipulation improves model merging effectiveness

🔎 Similar Papers

Checkpoint Merging via Bayesian Optimization in LLM Pretraining