If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the prevalent issue in large-model (~100B) training where numerous suboptimal checkpoints—exhibiting significant capability trade-offs—are discarded. We propose a scalable weighted fusion method to transform them into Pareto-optimal models. Methodologically, we introduce a continuous weight optimization framework grounded in linear model merging, integrated with multi-task performance evaluation and gradient-driven multi-objective search. Empirically, we demonstrate for the first time that nearly all checkpoints—including those clearly suboptimal in isolation—contribute meaningfully to high-quality ensembles, challenging the conventional paradigm of selecting only top-performing checkpoints for merging. Experiments show that our fused models systematically outperform individual checkpoints and state-of-the-art merging baselines across diverse tasks—including instruction following and code generation—thereby substantially expanding the overall capability frontier.

📝 Abstract
Model merging has shown great promise at combining expert models, but the benefit of merging is unclear when merging "generalist" models trained on many tasks. We explore merging in the context of large (~100B) models, by recycling checkpoints that exhibit tradeoffs among different tasks. Such checkpoints are often created in the process of developing a frontier model, and the suboptimal ones are usually discarded. Given a pool of model checkpoints obtained from different training runs (e.g., different stages, objectives, hyperparameters, and data mixtures), which naturally show tradeoffs across different language capabilities (e.g., instruction following vs. code generation), we investigate whether merging can recycle such suboptimal models into a Pareto-optimal one. Our optimization algorithm tunes the weight of each checkpoint in a linear combination, resulting in an optimal merged model that outperforms both the individual models and merge-based baselines. Further analysis shows that good merges tend to include almost all checkpoints with non-zero weights, indicating that even seemingly bad initial checkpoints can contribute to good final merges.
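The core idea in the abstract can be illustrated with a small sketch: merge checkpoint parameter vectors linearly, and tune the combination weights to minimize a multi-task validation loss. This is not the paper's implementation; the simplex (softmax) parameterization, the finite-difference gradient search, and the toy two-task loss below are illustrative assumptions standing in for the paper's gradient-driven multi-objective search over ~100B-parameter models.

```python
import numpy as np

def softmax(z):
    """Map unconstrained logits to non-negative weights summing to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

def merge(checkpoints, weights):
    """Linear model merging: theta_merged = sum_i w_i * theta_i."""
    return np.tensordot(np.asarray(weights, dtype=float),
                        np.stack(checkpoints), axes=1)

def optimize_weights(checkpoints, loss_fn, steps=300, lr=0.5,
                     eps=1e-4, seed=0):
    """Tune simplex-constrained merge weights against a multi-task loss.

    Gradients are estimated by finite differences on the softmax logits,
    a simple stand-in for the gradient-driven search described above.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(scale=0.01, size=len(checkpoints))  # softmax logits
    for _ in range(steps):
        base = loss_fn(merge(checkpoints, softmax(z)))
        grad = np.zeros_like(z)
        for i in range(len(z)):
            zp = z.copy()
            zp[i] += eps
            grad[i] = (loss_fn(merge(checkpoints, softmax(zp))) - base) / eps
        z -= lr * grad
    return softmax(z)

if __name__ == "__main__":
    # Toy pool: each "checkpoint" (a 2-d parameter vector here) is good at
    # one task and weak at the other, mimicking the capability tradeoffs
    # among discarded checkpoints.
    ckpts = [np.array([1.0, -0.2]), np.array([-0.2, 1.0])]
    targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two tasks
    multitask_loss = lambda th: sum(float(np.sum((th - t) ** 2))
                                    for t in targets)

    w = optimize_weights(ckpts, multitask_loss)
    merged = merge(ckpts, w)
    print("weights:", w, "merged loss:", multitask_loss(merged))
```

In this toy setting the optimizer assigns non-zero weight to both suboptimal checkpoints, and the merged model achieves lower multi-task loss than either checkpoint alone, mirroring the paper's observation that good merges keep almost all checkpoints.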
Problem

Research questions and friction points this paper is trying to address.

Optimizing model merging
Recycling suboptimal checkpoints
Achieving Pareto-optimal performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recycling suboptimal model checkpoints
Optimizing linear combination weights
Creating Pareto-optimal merged models