FedUMM: A General Framework for Federated Learning with Unified Multimodal Models

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of deploying unified multimodal models in privacy-sensitive and geographically distributed settings, where non-IID data, high communication overhead, and the infeasibility of centralized training pose significant obstacles. To this end, we propose FedUMM, the first framework to integrate unified multimodal models into federated learning. FedUMM freezes the backbone model and performs parameter-efficient fine-tuning exclusively on lightweight LoRA adapters at the clients, while the server aggregates only these adapter updates, dramatically reducing communication costs. Evaluated on the NVIDIA FLARE platform with a BLIP3o backbone on the VQA v2 and GenEval benchmarks, FedUMM shows only marginal performance degradation with 16 highly heterogeneous clients, achieves over an order-of-magnitude reduction in communication overhead compared to full-model fine-tuning, and effectively supports both generation and understanding tasks.

📝 Abstract
Unified multimodal models (UMMs) are emerging as strong foundation models that can perform both generation and understanding tasks within a single architecture. However, they are typically trained in centralized settings where all training and downstream datasets are gathered on a central server, limiting deployment in privacy-sensitive and geographically distributed scenarios. In this paper, we present FedUMM, a general federated learning framework for UMMs under non-IID multimodal data with low communication cost. Built on NVIDIA FLARE, FedUMM instantiates federation for a BLIP3o backbone via parameter-efficient fine-tuning: clients train lightweight LoRA adapters while keeping the foundation model frozen, and the server aggregates only the adapter updates. We evaluate on VQA v2 and the GenEval compositional generation benchmark under Dirichlet-controlled heterogeneity with up to 16 clients. Performance degrades only slightly as client count and heterogeneity increase, remaining competitive with centralized training. We further analyze computation--communication trade-offs and demonstrate that adapter-only federation reduces per-round communication by over an order of magnitude compared to full fine-tuning, enabling practical federated UMM training. This work provides empirical guidance for future research on privacy-preserving federated unified multimodal models.
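The server-side step described in the abstract, aggregating only the clients' LoRA adapter updates, amounts to a FedAvg-style weighted average over the low-rank tensors. A minimal sketch of that aggregation is shown below; the parameter names, shapes, and weighting scheme are illustrative assumptions, not the paper's actual implementation (which runs inside NVIDIA FLARE's aggregation workflow).

```python
import numpy as np

def aggregate_lora_adapters(client_updates, client_weights):
    """FedAvg-style weighted average of LoRA adapter tensors.

    client_updates: list of dicts mapping adapter parameter names
        (e.g. "layer0.attn.lora_A") to np.ndarray updates. Only these
        small low-rank matrices travel over the network; the frozen
        backbone never leaves the clients.
    client_weights: per-client weights, e.g. local dataset sizes.
    """
    total = float(sum(client_weights))
    aggregated = {}
    # Average each adapter tensor across clients, weighted by data size.
    for name in client_updates[0]:
        aggregated[name] = sum(
            (w / total) * update[name]
            for update, w in zip(client_updates, client_weights)
        )
    return aggregated
```

Because a rank-r adapter pair (A of shape r x d, B of shape d x r) is orders of magnitude smaller than the full weight matrices it modulates, averaging only these tensors is what yields the reported order-of-magnitude reduction in per-round communication.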
Problem

Research questions and friction points this paper is trying to address.

Federated Learning
Unified Multimodal Models
Privacy Preservation
Non-IID Data
Communication Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated Learning
Unified Multimodal Models
Parameter-Efficient Fine-Tuning
LoRA
Communication Efficiency