🤖 AI Summary
To address performance degradation caused by parameter conflicts in large language model (LLM) fusion, and the high storage/computation overhead and poor cross-task knowledge sharing induced by conventional model routing, this paper proposes a hierarchical fusion framework. Motivated by the empirical finding that parameter conflict varies substantially across LLM layers, the framework applies parameter averaging to low-conflict layers and introduces task-level expert routing for high-conflict layers. It further designs a sparse expert decoupling mechanism and a dynamic expert selection strategy driven by input uncertainty (entropy/variance), jointly optimizing knowledge sharing and conflict mitigation. Extensive experiments on LLaMA- and Qwen-series models show that the method significantly outperforms existing fusion baselines: average accuracy improves by 3.2–5.8%, GPU memory consumption decreases by 37%, and inference latency drops by 29%.
📝 Abstract
Model merging aggregates Large Language Models (LLMs) fine-tuned on different tasks into a single stronger one. However, parameter conflicts between models lead to performance degradation under averaging. While model routing addresses this issue by selecting individual models during inference, it imposes excessive storage and compute costs and fails to leverage the knowledge shared across models. In this work, we observe that different layers exhibit varying levels of parameter conflict. Building on this insight, we average layers with minimal parameter conflicts and apply a novel task-level expert routing to layers with significant conflicts. To further reduce storage costs, inspired by the sparsity of task arithmetic, we decouple multiple fine-tuned experts into one dense expert and several sparse experts. To handle out-of-distribution samples, we select and merge appropriate experts based on the task uncertainty of the input data. We conduct extensive experiments on both LLaMA and Qwen at varying parameter scales, evaluating on real-world reasoning tasks. Results demonstrate that our method consistently achieves significant performance improvements while incurring lower system cost than existing methods.
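The layer-wise recipe the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the sign-disagreement conflict metric, the magnitude-based sparsification, the entropy-based selector, and all function names and thresholds are assumptions chosen for concreteness.

```python
import numpy as np

def sign_conflict(task_vectors):
    """Illustrative conflict score: fraction of parameters whose
    fine-tuning deltas disagree in sign across experts."""
    signs = np.sign(np.stack(task_vectors))       # (n_experts, n_params)
    agree = np.all(signs == signs[0], axis=0)
    return 1.0 - agree.mean()

def merge_layer(base, task_vectors, threshold=0.3):
    """Average low-conflict layers; for high-conflict layers, keep the
    base weights dense and store only sparse per-expert deltas
    (top-10% magnitudes here, a stand-in for task-arithmetic sparsity)."""
    if sign_conflict(task_vectors) < threshold:
        return base + np.mean(task_vectors, axis=0)   # plain averaging
    sparse = [np.where(np.abs(tv) >= np.quantile(np.abs(tv), 0.9), tv, 0.0)
              for tv in task_vectors]
    return base, sparse   # routed at inference instead of pre-merged

def select_expert(logits_per_expert):
    """Uncertainty-driven selection: pick the expert whose output
    distribution has the lowest entropy (i.e., is most confident)."""
    def entropy(logits):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return float(-(p * np.log(p + 1e-12)).sum())
    return int(np.argmin([entropy(l) for l in logits_per_expert]))
```

Under this toy metric, identical task vectors yield zero conflict (so the layer is simply averaged), while sign-flipped vectors yield maximal conflict and fall through to the sparse routed branch.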