🤖 AI Summary
To address key limitations of model merging (hand-tuned coefficients in training-free methods, parameter-level rather than task-behavior alignment in training-based methods, and neglect of inter-layer heterogeneity), this paper proposes Expert Merging, a lightweight, unsupervised model fusion method. Its core contributions are: (1) an unsupervised alignment objective that matches the merged model's hidden states and logits to those of each expert; (2) learnable layer-wise merging coefficients calibrated on unlabeled data; and (3) Expert Merging++, which adds an importance-guided layer chunking strategy to explicitly model inter-layer heterogeneity and allocate coefficients efficiently. Stability is ensured via coefficient regularization, task-weighted losses, and normalized importance scoring. Evaluated on LLM and MLLM backbones (Mistral, InternVL, and Qwen2-VL), Expert Merging significantly outperforms both training-free and training-based baselines, in some cases even surpassing supervised Mixture Training.
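The core recipe (layer-wise coefficients on task vectors, optimized against an unsupervised hidden-state/logit alignment loss) can be sketched as below. The paper specifies only the high-level objective, so the coefficient parameterization, the MSE + KL loss combination, and all function names here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def merge_layerwise(base_sd, expert_sds, coeffs):
    """Merge experts with per-layer coefficients on task vectors:
    merged[l] = base[l] + sum_k alpha[k, l] * (expert_k[l] - base[l]).
    coeffs maps (expert_index, layer_name) -> scalar (learnable in practice)."""
    merged = {}
    for name, base_w in base_sd.items():
        delta = sum(coeffs[(k, name)] * (sd[name] - base_w)
                    for k, sd in enumerate(expert_sds))
        merged[name] = base_w + delta
    return merged

def alignment_loss(merged_h, expert_h, merged_logits, expert_logits, task_weight=1.0):
    """Unsupervised alignment of the merged model to one expert on unlabeled
    calibration data: MSE on hidden states plus KL on output logits
    (an assumed instantiation of the paper's objective)."""
    h_loss = F.mse_loss(merged_h, expert_h)
    kl = F.kl_div(F.log_softmax(merged_logits, dim=-1),
                  F.softmax(expert_logits, dim=-1), reduction="batchmean")
    return task_weight * (h_loss + kl)
```

In practice one would sum `alignment_loss` over all experts with the task weights mentioned in the summary, add a regularizer on the coefficients, and backpropagate only into `coeffs`, keeping the full models frozen.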
📝 Abstract
Model merging, which combines multiple domain-specialized experts into a single model, offers a practical path to endow Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) with broad capabilities without the cost of joint training or serving many models. However, training-free methods rely on hand-tuned coefficients, whereas training-based methods primarily align parameters rather than downstream task behavior and typically treat all layers uniformly, ignoring inter-layer heterogeneity. We introduce Expert Merging, a training-light method that learns a small set of layer-wise coefficients using only unlabeled calibration data. The coefficients are optimized to explicitly align the merged model's hidden states and logits with those of the corresponding experts, with a coefficient regularizer for stability and task-weighted losses for controllable trade-offs. To capture inter-layer variation, Expert Merging++ augments this design with importance-guided chunking: a normalized layer-importance metric, derived from learned coefficients, task-vector magnitudes, and parameter counts, allocates more chunk-wise coefficients to high-importance layers while keeping low-importance layers lightweight. The result is a label-free, parameter-efficient, and scalable approach to multi-expert model merging across LLMs and MLLMs. Across MLLM backbones (InternVL and Qwen2-VL) and the LLM backbone (Mistral), our method surpasses strong training-free and training-based merging baselines, with Expert Merging++ delivering further gains and, in some cases, even exceeding supervised Mixture Training. The source code is available at https://github.com/Littleor/ExpertMerging.
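The importance-guided chunking of Expert Merging++ combines three signals per layer: learned-coefficient magnitude, task-vector magnitude, and parameter count. The abstract does not give the exact combination rule, so the product form and the chunk-allocation heuristic below are illustrative assumptions:

```python
def layer_importance(coeff_mags, tv_norms, param_counts, eps=1e-8):
    """Normalized per-layer importance from |learned coefficient|,
    task-vector norm, and parameter count (assumed product form,
    normalized so scores sum to 1)."""
    raw = {l: coeff_mags[l] * tv_norms[l] / param_counts[l] for l in coeff_mags}
    total = sum(raw.values()) + eps
    return {l: v / total for l, v in raw.items()}

def allocate_chunks(importance, min_chunks=1, max_chunks=8):
    """Assign more chunk-wise coefficients to high-importance layers,
    keeping low-importance layers at the lightweight minimum."""
    top = max(importance.values())
    return {l: min_chunks + round(s / top * (max_chunks - min_chunks))
            for l, s in importance.items()}
```

A layer deemed most important then learns `max_chunks` independent coefficients (one per parameter chunk), while unimportant layers keep a single coefficient, which keeps the total number of learned parameters small.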