Bayesian Model Merging

📅 2026-05-12

🤖 AI Summary

This work addresses two limitations of existing plug-and-play model merging methods: they ignore the prior knowledge carried by strong anchor models, and they share one hyperparameter setting across modules instead of optimizing hyperparameters globally. The authors propose a bi-level optimization framework. The inner level uses the inductive bias of an anchor model as a prior and solves an activation-based Bayesian regression in closed form; the outer level uses Bayesian optimization to search module-wise hyperparameters globally on a small validation set. The work also reports an alignment between activation statistics and task vectors, which yields a variant that requires no auxiliary data. Evaluated on up to 20 vision tasks and 5 language tasks, the method consistently outperforms plug-and-play baselines such as TA, WUDI-Merging, and TSV. On the 8-task ViT-L/14 benchmark, a single merged model reaches 95.1, close to the 95.8 average of the eight task-specific experts.

📝 Abstract

Model merging aims to combine multiple task-specific expert models into a single model without joint retraining, offering a practical alternative to multi-task learning when data access or computational budget is limited. Existing methods, however, face two key limitations: (1) they overlook the valuable inductive bias of strong anchor models and estimate the merged weights from scratch, and (2) they rely on a shared hyperparameter setting across different modules of the network, lacking a global optimization strategy. This paper introduces Bayesian Model Merging (BMM), a plug-and-play bi-level optimization framework, where the inner level formulates the model merging as an activation-based Bayesian regression under a strong prior induced by an anchor model, yielding an efficient closed-form solution; and the outer level leverages a Bayesian optimization procedure to search module-specific hyperparameters globally based on a small validation set. Furthermore, we reveal a key alignment between activation statistics and task vectors, enabling us to derive a data-free variant of BMM that estimates the Gram matrix for regression without any auxiliary data. Across extensive benchmarks, including up to 20-task merging in vision and 5-task merging in language, BMM consistently outperforms all plug-and-play anchor baselines (e.g., TA, WUDI-Merging, and TSV). In particular, on the ViT-L/14 benchmark for 8-task merging, a single merged model reaches 95.1, closely matching the average performance of eight task-specific experts (95.8).
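
The inner level described above can be illustrated with a minimal sketch. The abstract does not state the exact objective, so the code below assumes a standard ridge-style form: for one linear module, find W minimizing sum_t ||(W - W_t) X_t||_F^2 + lam * ||W - W_anchor||_F^2, where W_t are the expert weights, X_t the input activations of task t, and G_t = X_t X_t^T the Gram matrix. The function name `merge_layer`, the scalar `lam`, and the isotropic Gaussian prior centered on the anchor are all assumptions made for illustration, not the paper's confirmed formulation.

```python
import numpy as np


def merge_layer(expert_weights, grams, anchor_weight, lam):
    """Closed-form merge of one linear module (illustrative assumption).

    Minimizes  sum_t ||(W - W_t) X_t||_F^2 + lam * ||W - W_anchor||_F^2
    with G_t = X_t @ X_t.T. Setting the gradient to zero gives
        W (sum_t G_t + lam * I) = sum_t W_t G_t + lam * W_anchor.

    expert_weights: list of (d_out, d_in) arrays, one per task
    grams:          list of (d_in, d_in) symmetric Gram matrices
    anchor_weight:  (d_out, d_in) array used as the prior mean
    lam:            positive prior strength (hypothetical hyperparameter)
    """
    d_in = anchor_weight.shape[1]
    lhs = lam * np.eye(d_in)          # accumulates sum_t G_t + lam * I
    rhs = lam * anchor_weight.copy()  # accumulates sum_t W_t G_t + lam * W_anchor
    for w_t, g_t in zip(expert_weights, grams):
        lhs = lhs + g_t
        rhs = rhs + w_t @ g_t
    # Solve W @ lhs = rhs. lhs is symmetric, so lhs @ W.T = rhs.T.
    return np.linalg.solve(lhs, rhs.T).T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_out, d_in, n = 3, 4, 16
    experts = [rng.normal(size=(d_out, d_in)) for _ in range(2)]
    acts = [rng.normal(size=(d_in, n)) for _ in range(2)]
    grams = [x @ x.T for x in acts]
    anchor = np.mean(experts, axis=0)  # stand-in anchor: plain weight average
    merged = merge_layer(experts, grams, anchor, lam=1.0)
    print(merged.shape)
```

In this sketch a large `lam` pulls the result toward the anchor, while a small `lam` lets the activation statistics dominate. Under the same assumption, the outer level would choose a separate `lam` per module, which is the role the abstract assigns to Bayesian optimization on a small validation set.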

Problem

Research questions and friction points this paper is trying to address.

model merging

multi-task learning

inductive bias

hyperparameter optimization

Bayesian regression

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian Model Merging

anchor model prior

bi-level optimization