Model Merging Scaling Laws in Large Language Models

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing model merging techniques lack quantifiable scaling laws, hindering principled co-scaling of expert count and model size. Method: We empirically discover that merging gain decays approximately as 1/k with respect to the number of experts *k*, and identify two key patterns: early concentration of returns and variance contraction. Building on these patterns, we propose a unified power-law formula that quantitatively maps model capacity, expert count, and performance gain, measured by cross-entropy loss reduction. Contribution/Results: The framework achieves high-fidelity fitting across diverse merging methods (Average, TA, TIES, DARE), architectures, and tasks. It enables cross-configuration performance prediction, budget-aware scaling decisions (e.g., "scale model" vs. "add experts"), and precise estimation of the minimum *k* required to achieve a target loss. This advances model merging from heuristic practice toward a systematic, predictable, and reproducible paradigm.
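The 1/k decay described above can be sketched numerically. The functional form below (a size-dependent floor plus a tail that shrinks as 1/k) is an assumption for illustration, not the paper's exact formula; because loss is linear in 1/k under this form, an ordinary least-squares fit recovers the law's parameters:

```python
import numpy as np

# Assumed illustrative merging law: merged loss = floor + tail / k,
# where k is the number of merged experts. Not the paper's equation.
def merged_loss(k, floor, tail):
    return floor + tail / k

# Synthetic "measured" losses for k = 1..8 experts (hypothetical values).
ks = np.arange(1, 9, dtype=float)
losses = merged_loss(ks, floor=2.10, tail=0.45)

# Fit a line in 1/k: the slope estimates the tail, the intercept the floor.
tail_hat, floor_hat = np.polyfit(1.0 / ks, losses, 1)
print(round(floor_hat, 3), round(tail_hat, 3))  # → 2.1 0.45
```

With real measurements the fit would be noisy, but the early concentration of returns follows directly from this shape: going from k=1 to k=2 removes half the tail, while each additional expert removes progressively less.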

📝 Abstract
We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget, turning merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
Problem

Research questions and friction points this paper is trying to address.

Identifies scaling laws for merging language models via cross-entropy
Explains diminishing returns when adding experts to merged models
Enables predictive planning for model composition under budget constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies power law linking model size and experts
Explains diminishing returns with increasing expert numbers
Enables predictive planning for model merging efficiency