🤖 AI Summary
To address cross-modal data-distribution heterogeneity when fine-tuning multimodal large language models (MLLMs) under federated learning (FL), this paper proposes FedMLLM, a general-purpose FL framework for MLLMs. Methodologically, it integrates classic FL paradigms with two modality-agnostic strategies: modality-agnostic feature alignment and lightweight heterogeneity modeling. It also introduces the first comprehensive benchmark covering four multimodal scenarios and over ten types of modality heterogeneity. Extensive experiments on two lightweight MLLMs, five datasets spanning three domains, and two downstream tasks demonstrate that FedMLLM significantly mitigates the performance degradation induced by modality heterogeneity, improving model generalization and robustness. The framework provides a scalable, highly adaptive paradigm for privacy-sensitive multimodal federated fine-tuning.
📝 Abstract
Multimodal Large Language Models (MLLMs) have made significant advancements, demonstrating powerful capabilities in processing and understanding multimodal data. Fine-tuning MLLMs with Federated Learning (FL) expands the scope of training data by including private data sources, thereby enhancing their practical applicability in privacy-sensitive domains. However, current research remains in its early stages, particularly in addressing the **multimodal heterogeneities** of real-world applications. In this paper, we introduce a benchmark to evaluate the performance of federated fine-tuning of MLLMs across various multimodal heterogeneous scenarios, laying the groundwork for future research in the field. Our benchmark includes two lightweight MLLMs, two downstream tasks, three evaluation metrics, and five datasets across three domains, along with six comparison baselines, covering over ten types of modality heterogeneities across four multimodal scenarios. To address the challenges posed by multimodal heterogeneity, we develop a general FedMLLM framework that integrates classic FL methods alongside two modality-agnostic strategies. Extensive experimental results show that our proposed FL paradigm improves the performance of MLLMs by broadening the range of training data and mitigating multimodal heterogeneity. Code is available in the supplementary materials.
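As a rough illustration of the classic FL building block such a framework rests on, the sketch below shows a FedAvg-style weighted aggregation of per-client parameter updates. This is a generic sketch only: the function name, the use of LoRA-style adapter tensors, and the toy client data are assumptions for illustration, not FedMLLM's actual strategies.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """FedAvg-style aggregation: average each parameter across clients,
    weighted by the number of local training samples per client.

    This is the classic FL step; it does not reproduce FedMLLM's
    modality-agnostic strategies.
    """
    total = sum(client_sizes)
    aggregated = {}
    for name in client_weights[0]:
        # Sum each client's tensor, scaled by its share of the data.
        aggregated[name] = sum(
            (n / total) * w[name]
            for w, n in zip(client_weights, client_sizes)
        )
    return aggregated

# Hypothetical example: two clients holding one LoRA-style adapter tensor.
clients = [
    {"lora_A": np.array([1.0, 2.0])},
    {"lora_A": np.array([3.0, 4.0])},
]
sizes = [1, 3]  # client 2 has 3x the data, so it dominates the average
agg = fedavg_aggregate(clients, sizes)
print(agg["lora_A"])
```

Weighting by local dataset size is the standard FedAvg choice; frameworks built on it typically keep this aggregation step and vary what is communicated (full weights, adapters, or gradients).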