Mediator: Memory-efficient LLM Merging with Less Parameter Conflicts and Uncertainty Based Routing

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation caused by parameter conflicts in large language model (LLM) merging, as well as the high storage/compute overhead and poor cross-task knowledge sharing of conventional model routing, this paper proposes a hierarchical merging framework. Motivated by the empirical observation that parameter conflicts vary across LLM layers, the framework averages low-conflict layers while applying task-level expert routing to high-conflict layers. The authors further design a sparse expert decoupling mechanism and an input-uncertainty-driven (entropy/variance) dynamic expert selection strategy, jointly optimizing knowledge sharing and conflict mitigation. Extensive experiments on LLaMA and Qwen series models show that the method significantly outperforms existing merging baselines: average accuracy improves by 3.2-5.8%, GPU memory consumption decreases by 37%, and inference latency drops by 29%.
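As a rough illustration of the layer-wise strategy the summary describes, the sketch below computes a per-layer conflict score (here a hypothetical proxy: the fraction of parameters whose task-vector signs disagree), averages low-conflict layers, and keeps per-task experts for high-conflict ones. This is not the authors' implementation; the threshold and conflict measure are assumptions.

```python
import numpy as np

def sign_conflict(task_vectors):
    """Fraction of parameters whose signs disagree across task vectors.

    task_vectors: list of 1-D arrays (fine-tuned weights minus base weights)
    for one layer. A hypothetical proxy for the paper's conflict measure.
    """
    signs = np.sign(np.stack(task_vectors))            # (num_tasks, num_params)
    agree = np.abs(signs.sum(axis=0)) == len(task_vectors)
    return 1.0 - agree.mean()

def merge_layerwise(base_layers, expert_layers, threshold=0.3):
    """Average low-conflict layers; keep per-task experts for the rest."""
    merged = []
    for li, base in enumerate(base_layers):
        tvs = [expert[li] - base for expert in expert_layers]
        if sign_conflict(tvs) < threshold:             # low conflict: average
            merged.append(("avg", base + np.mean(tvs, axis=0)))
        else:                                          # high conflict: route
            merged.append(("route", [base + tv for tv in tvs]))
    return merged
```

Layers tagged `"route"` would then be dispatched per input at inference time, which is where the task-level routing of the paper comes in.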

📝 Abstract
Model merging aggregates Large Language Models (LLMs) finetuned on different tasks into a stronger one. However, parameter conflicts between models lead to performance degradation under averaging. While model routing addresses this issue by selecting individual models during inference, it imposes excessive storage and compute costs and fails to leverage common knowledge across models. In this work, we observe that different layers exhibit varying levels of parameter conflict. Building on this insight, we average layers with minimal parameter conflicts and use a novel task-level expert routing for layers with significant conflicts. To further reduce storage costs, inspired by task arithmetic sparsity, we decouple multiple fine-tuned experts into a dense expert and several sparse experts. To handle out-of-distribution samples, we select and merge appropriate experts based on the task uncertainty of the input data. We conduct extensive experiments on both LLaMA and Qwen at varying parameter scales and evaluate on real-world reasoning tasks. Results demonstrate that our method consistently achieves significant performance improvements while incurring lower system cost than existing methods.
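The two remaining ideas in the abstract (sparse expert decoupling and uncertainty-based selection) can be sketched as follows. This is a hypothetical illustration, not the paper's code: `keep` (the retained fraction of residual weights) and `entropy_cap` (the uncertainty gate) are assumed knobs, and the router probabilities are taken as given.

```python
import numpy as np

def decouple(task_vectors, keep=0.1):
    """Split fine-tuned deltas into one dense (shared) expert plus sparse
    per-task residuals, keeping only the top-`keep` fraction by magnitude."""
    dense = np.mean(task_vectors, axis=0)
    sparse = []
    for tv in task_vectors:
        resid = tv - dense
        k = max(1, int(keep * resid.size))
        thresh = np.partition(np.abs(resid), -k)[-k]    # k-th largest magnitude
        sparse.append(np.where(np.abs(resid) >= thresh, resid, 0.0))
    return dense, sparse

def select_experts(router_probs, dense, sparse, entropy_cap=0.5):
    """Entropy-gated selection: confident inputs use the single top expert,
    uncertain (possibly out-of-distribution) inputs get a weighted merge."""
    p = np.asarray(router_probs)
    entropy = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))  # normalized
    if entropy < entropy_cap:                  # low uncertainty: pick one
        return dense + sparse[int(p.argmax())]
    return dense + sum(pi * s for pi, s in zip(p, sparse))     # merge experts
```

Storing one dense expert plus sparse residuals is what cuts memory relative to keeping every fine-tuned model whole, and the entropy gate is one plausible reading of the "task uncertainty" criterion.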
Problem

Research questions and friction points this paper is trying to address.

Memory-efficient LLM merging
Reducing parameter conflicts
Task uncertainty-based expert routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise averaging minimizes parameter conflicts
Task-level expert routing handles significant conflicts
Sparse experts reduce storage costs effectively