🤖 AI Summary
Existing model merging methods struggle to balance accuracy and efficiency, falling short of the performance of independently fine-tuned models. To address this, we propose MASS, a training-free, data-agnostic multi-task model fusion framework. MASS constructs a shared backbone via low-rank decomposition of per-task updates and adaptive singular-value subspace selection; it further introduces a non-parametric intermediate-feature routing mechanism that dynamically activates task-specific low-rank subspaces during inference based on input semantics, enabling Mixture-of-Experts–style lightweight multi-task collaboration. Evaluated on CLIP+ViT architectures across 8, 14, and 20 tasks, MASS achieves state-of-the-art performance, recovering up to ~98% of the average accuracy of individually fine-tuned models. It requires only two forward passes at inference time and a constant storage cost of roughly twice a single model, regardless of the number of tasks—significantly outperforming conventional model ensembles in both efficiency and scalability.
📝 Abstract
Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned checkpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input's intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a ~2× storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32, and ViT-L-14 for benchmarks of 8, 14, and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.
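The decompose-then-route idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact procedure: the fixed rank `k`, the random stand-in weight updates, and the projection-norm routing score are all assumptions made for the sketch (MASS selects the subspace rank adaptively and routes on real intermediate features).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_tasks = 64, 4, 3  # feature dim, kept rank, number of tasks (toy sizes)

# Hypothetical per-task updates delta_t = W_t - W_0 (random stand-ins here).
deltas = [rng.normal(size=(d, d)) for _ in range(n_tasks)]

# Subspace selection: keep only the top-k singular components of each update.
lowrank, subspaces = [], []
for delta in deltas:
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    lowrank.append((U[:, :k] * S[:k]) @ Vt[:k])  # truncated update stored in the merged model
    subspaces.append(Vt[:k])                     # orthonormal rows spanning the task's input subspace

def route(x):
    """Non-parametric routing: pick the task whose retained subspace
    best explains feature x (largest projection norm)."""
    return int(np.argmax([np.linalg.norm(V @ x) for V in subspaces]))

# A feature lying in task 1's retained subspace is routed back to task 1.
x = subspaces[1].T @ rng.normal(size=k)
task = route(x)
```

In this picture, the two-pass inference cost is natural: a first forward pass through the shared backbone yields the intermediate features used by `route`, and a second pass applies the selected truncated update. Storage is the shared model plus the per-task truncated factors, which is what keeps the overall footprint near 2× a single model independent of the task count.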