Model Merging: Foundations and Algorithms

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses how to fuse independently trained neural networks directly in weight space so that their capabilities can be reused, with little or no optimization and without access to the original training data. For the single-task setting, the authors introduce C²M³, a reference-free, cycle-consistent alignment algorithm based on Frank-Wolfe optimization. For the multi-task setting, they interpret task vectors as approximate gradients and build on that view with three methods: Task Singular Vectors (TSV), a low-rank decomposition used for compression and interference reduction; MASS, an input-adaptive routing method that selects task-relevant subspaces at inference time; and MERGE³, an evolutionary merging framework that uses Item Response Theory to reduce evaluation costs by up to 50× while preserving solution quality.

📝 Abstract

Modern deep learning usually treats models as separate artifacts: trained independently, specialized for particular purposes, and replaced when improved versions appear. This thesis studies model merging as an alternative paradigm: combining independently trained neural networks directly in weight space, with little or no optimization and without requiring access to the original training data. The thesis considers two main regimes. In the single-task setting, where models share an objective but differ in initialization, we introduce C$^2$M$^3$, a cycle-consistent merging algorithm based on Frank-Wolfe optimization. C$^2$M$^3$ aligns multiple networks into a shared, reference-free parameter space, making weight averaging meaningful without privileging any individual model. In the multi-task setting, where models are fine-tuned for different downstream tasks from a common pretrained initialization, we first develop a theoretical account of task vectors as approximate gradients. This explains both the effectiveness and the limitations of task arithmetic. Building on this view, we show that task vectors inherit the low-rank structure of gradients and introduce Task Singular Vectors (TSV), a decomposition that enables compression and interference reduction through TSV-Merge. We then present MASS, an input-adaptive routing method that uses TSV geometry to select task-relevant subspaces at inference time. Finally, we introduce MERGE$^3$, an evolutionary merging framework that uses Item Response Theory to reduce evaluation costs by up to 50$\times$ while preserving solution quality. Together, these contributions provide theoretical and algorithmic foundations for model merging, supporting a paradigm in which learned capabilities can be composed, reused, and extended across models.
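
For readers new to the terminology, the following is a minimal sketch of plain task arithmetic, the baseline the abstract refers to. It is not the thesis's TSV-Merge, MASS, or MERGE$^3$. A task vector is taken to be the difference between fine-tuned and pretrained weights, and merging adds the scaled task vectors back onto the pretrained weights. The function names, the scaling coefficient `alpha`, and the representation of a model as a dictionary of flat weight lists are all assumptions made for illustration.

```python
def task_vector(finetuned, pretrained):
    """Per-parameter difference between fine-tuned and pretrained weights."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def task_arithmetic(pretrained, finetuned_models, alpha=1.0):
    """Add the scaled task vector of every fine-tuned model to the pretrained weights.

    `alpha` is a hypothetical scaling coefficient; in practice it would be
    chosen on held-out data.
    """
    merged = {name: list(values) for name, values in pretrained.items()}
    for model in finetuned_models:
        for name, delta in task_vector(model, pretrained).items():
            merged[name] = [m + alpha * d for m, d in zip(merged[name], delta)]
    return merged


# Toy weights (illustrative values only, not taken from the thesis).
pretrained = {"w": [1.0, 2.0], "b": [0.0]}
model_a = {"w": [1.5, 2.0], "b": [0.25]}
model_b = {"w": [1.0, 1.0], "b": [0.5]}

merged = task_arithmetic(pretrained, [model_a, model_b], alpha=0.5)
print(merged)  # {'w': [1.25, 1.5], 'b': [0.375]}
```

In the toy example the two models modify different coordinates of `w`, so their task vectors do not collide. When they modify the same coordinates, as with `b` here, the contributions are simply summed, which is the kind of overlap the abstract calls task interference.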

Problem

Research questions and friction points this paper is trying to address.

model merging

weight space

multi-task learning

single-task setting

task vectors

Innovation

Methods, ideas, or system contributions that make the work stand out.

model merging

task vectors

Frank-Wolfe optimization