🤖 AI Summary
To address knowledge fragmentation and poor cross-domain generalization in domain-specific multimodal large language models (MLLMs), this paper proposes a compatibility-aware parameter fusion framework that enables modular integration of expert-model capabilities. Methodologically, it combines local functional attribution with global information-theoretic signals to quantify activation-level alignment among heterogeneous expert models, performs parameter splicing at the granularity of low-rank adapters, and introduces a compatibility scoring mechanism to achieve efficient, scalable, fine-grained knowledge coordination. Experiments on multiple multimodal benchmarks show substantial gains in cross-domain adaptation while keeping inference overhead controlled. The framework offers a plug-and-play, modular paradigm for building MLLMs.
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved success across various domains. However, their performance tends to degrade when they confront unfamiliar types of data inputs, especially for MLLMs that have been fine-tuned for specific tasks. Despite its importance, knowledge sharing among domain-specific MLLMs, such as those trained for mathematics or code, remains largely underexplored. To address the fragmentation of knowledge across domain-specialized MLLMs, we propose a unified parameter integration framework that enables modular composition of expert capabilities. Our method is grounded in a novel Compatibility-Aware Parameter Splicing (CAPS) strategy, which leverages both local functional attribution and global information-theoretic signals to guide selective parameter fusion. By operating at the granularity of low-rank adaptation layers, we ensure efficient integration with minimal inference overhead. Furthermore, we introduce a domain compatibility score that quantifies inter-expert alignment at the activation level and correlates with downstream task utility. This principled fusion protocol allows the final model to synergize heterogeneous expertise while preserving structural modularity. Extensive evaluations across diverse multimodal benchmarks validate the effectiveness of our framework, offering a scalable path toward compositional, domain-adaptive MLLMs.
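The abstract does not spell out how the compatibility score or the splicing step is computed, so the following is only a minimal sketch of the general idea: score two experts by how well their activations align on shared probe inputs (a cosine term as a stand-in for local functional attribution, plus a Gaussian mutual-information proxy as the global information-theoretic signal), then fuse their LoRA deltas with score-derived weights. All function names, the specific score formula, and the softmax fusion rule are assumptions for illustration, not the paper's actual protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_delta(A, B):
    """LoRA update Delta W = B @ A for rank-r factors A (r x d_in), B (d_out x r)."""
    return B @ A

def compatibility_score(acts_i, acts_j, eps=1e-8):
    """Hypothetical activation-level compatibility between two experts.

    acts_*: (n_probes, d) activations from the same layer on shared probe inputs.
    Local signal: mean per-probe cosine similarity (attribution stand-in).
    Global signal: per-dimension Pearson correlation mapped through the
    Gaussian mutual-information formula I = -0.5 * log(1 - rho^2).
    """
    num = np.sum(acts_i * acts_j, axis=1)
    den = np.linalg.norm(acts_i, axis=1) * np.linalg.norm(acts_j, axis=1) + eps
    local = float(np.mean(num / den))

    zi = (acts_i - acts_i.mean(0)) / (acts_i.std(0) + eps)
    zj = (acts_j - acts_j.mean(0)) / (acts_j.std(0) + eps)
    rho = np.clip(np.mean(zi * zj, axis=0), -0.999, 0.999)
    mi = float(np.mean(-0.5 * np.log(1.0 - rho ** 2)))

    # Bounded combination of both signals; the weighting is an assumption.
    return 0.5 * local + 0.5 * np.tanh(mi)

def splice(deltas, scores, temperature=0.5):
    """Fuse expert LoRA deltas with softmax weights over compatibility scores."""
    s = np.asarray(scores, dtype=float) / temperature
    w = np.exp(s - s.max())
    w /= w.sum()
    return sum(wi * d for wi, d in zip(w, deltas))

# Toy usage: a near-duplicate expert should score higher than an unrelated one,
# and the spliced delta keeps the base weight shape.
acts = rng.normal(size=(16, 64))
near = acts + 0.01 * rng.normal(size=(16, 64))   # well-aligned expert
far = rng.normal(size=(16, 64))                  # unrelated expert
s_near = compatibility_score(acts, near)
s_far = compatibility_score(acts, far)

A1, B1 = rng.normal(size=(4, 64)), rng.normal(size=(32, 4))
A2, B2 = rng.normal(size=(4, 64)), rng.normal(size=(32, 4))
fused = splice([lora_delta(A1, B1), lora_delta(A2, B2)], [s_near, s_far])
```

The softmax over scores is one plausible way to realize "selective" fusion: highly compatible experts dominate the spliced update while misaligned ones are down-weighted rather than hard-dropped.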