🤖 AI Summary
To address key limitations of model merging (hand-tuned coefficients in training-free methods, parameter-level rather than task-behavior alignment in training-based methods, and neglect of inter-layer heterogeneity), this paper proposes Expert Merging, a lightweight, unsupervised model fusion method. Its core contributions are: (1) an unsupervised alignment objective that matches the merged model's hidden states and logits to those of each expert; (2) learnable layer-wise merging coefficients calibrated on unlabeled data; and (3) Expert Merging++, which adds an importance-guided layer chunking strategy to explicitly model inter-layer heterogeneity and allocate coefficients efficiently. Stability is ensured via coefficient regularization, task-weighted losses, and normalized importance scoring. Evaluated on LLM and MLLM backbones (Mistral, InternVL, and Qwen2-VL), Expert Merging significantly outperforms both training-free and training-based baselines, in some cases even surpassing supervised Mixture Training.
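The core recipe (layer-wise coefficients on task vectors, optimized against an unsupervised hidden-state/logit alignment loss) can be sketched as below. The paper specifies only the high-level objective, so the coefficient parameterization, the MSE + KL loss combination, and all function names here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def merge_layerwise(base_sd, expert_sds, coeffs):
    """Merge experts with per-layer coefficients on task vectors:
    merged[l] = base[l] + sum_k alpha[k, l] * (expert_k[l] - base[l]).
    coeffs maps (expert_index, layer_name) -> scalar (learnable in practice)."""
    merged = {}
    for name, base_w in base_sd.items():
        delta = sum(coeffs[(k, name)] * (sd[name] - base_w)
                    for k, sd in enumerate(expert_sds))
        merged[name] = base_w + delta
    return merged

def alignment_loss(merged_h, expert_h, merged_logits, expert_logits, task_weight=1.0):
    """Unsupervised alignment of the merged model to one expert on unlabeled
    calibration data: MSE on hidden states plus KL on output logits
    (an assumed instantiation of the paper's objective)."""
    h_loss = F.mse_loss(merged_h, expert_h)
    kl = F.kl_div(F.log_softmax(merged_logits, dim=-1),
                  F.softmax(expert_logits, dim=-1), reduction="batchmean")
    return task_weight * (h_loss + kl)
```

In practice one would sum `alignment_loss` over all experts with the task weights mentioned in the summary, add a regularizer on the coefficients, and backpropagate only into `coeffs`, keeping the full models frozen.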
📝 Abstract
Model merging, which combines multiple domain-specialized experts into a single model, offers a practical path to endow Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) with broad capabilities without the cost of joint training or serving many models. However, training-free methods rely on hand-tuned coefficients, whereas training-based methods primarily align parameters rather than downstream task behavior and typically treat all layers uniformly, ignoring inter-layer heterogeneity. We introduce Expert Merging, a training-light method that learns a small set of layer-wise coefficients using only unlabeled calibration data. The coefficients are optimized to explicitly align the merged model's hidden states and logits with those of the corresponding experts, with a coefficient regularizer for stability and task-weighted losses for controllable trade-offs. To capture inter-layer variation, Expert Merging++ augments this design with importance-guided chunking: a normalized layer-importance metric, derived from learned coefficients, task-vector magnitudes, and parameter counts, allocates more chunk-wise coefficients to high-importance layers while keeping low-importance layers lightweight. The result is a label-free, parameter-efficient, and scalable approach to multi-expert model merging across LLMs and MLLMs. Across MLLM backbones (InternVL and Qwen2-VL) and the LLM backbone (Mistral), our method surpasses strong training-free and training-based merging baselines, with Expert Merging++ delivering further gains and, in some cases, even exceeding supervised Mixture Training. The source code is available at https://github.com/Littleor/ExpertMerging.
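The importance-guided chunking of Expert Merging++ combines three signals per layer: learned-coefficient magnitude, task-vector magnitude, and parameter count. The abstract does not give the exact combination rule, so the product form and the chunk-allocation heuristic below are illustrative assumptions:

```python
def layer_importance(coeff_mags, tv_norms, param_counts, eps=1e-8):
    """Normalized per-layer importance from |learned coefficient|,
    task-vector norm, and parameter count (assumed product form,
    normalized so scores sum to 1)."""
    raw = {l: coeff_mags[l] * tv_norms[l] / param_counts[l] for l in coeff_mags}
    total = sum(raw.values()) + eps
    return {l: v / total for l, v in raw.items()}

def allocate_chunks(importance, min_chunks=1, max_chunks=8):
    """Assign more chunk-wise coefficients to high-importance layers,
    keeping low-importance layers at the lightweight minimum."""
    top = max(importance.values())
    return {l: min_chunks + round(s / top * (max_chunks - min_chunks))
            for l, s in importance.items()}
```

A layer deemed most important then learns `max_chunks` independent coefficients (one per parameter chunk), while unimportant layers keep a single coefficient, which keeps the total number of learned parameters small.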