FedMVP: Federated Multi-modal Visual Prompt Tuning for Vision-Language Models

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the overfitting of text-based prompt tuning to known concepts and its poor generalization in federated learning, this paper proposes a multimodal visual prompt tuning framework that enables, for the first time, dynamic visual prompt generation conditioned jointly on both image and text inputs. Methodologically, we design PromptFormer—a cross-modal alignment module that fuses frozen CLIP’s vision-language representations—and introduce a joint optimization objective comprising contrastive similarity loss and consistency loss to enhance prompt robustness and cross-domain/cross-class generalization. Extensive experiments across 20 datasets and three generalization settings—unseen classes, unseen domains, and distribution shifts—demonstrate that our method maintains in-distribution performance while achieving an average accuracy gain of 5.2% on unseen concepts, significantly outperforming existing federated prompt tuning approaches.
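The core mechanism described above — a cross-modal module in which textual attribute features attend to image-conditioned features to produce visual prompts — can be sketched as single-head cross-attention. This is an illustrative assumption of how PromptFormer might operate; the dimensions, the single attention head, and the random projections are placeholders, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64       # shared embedding dimension (assumed)
n_text = 4   # number of textual attribute tokens (assumed)
n_img = 16   # number of image-conditioned feature tokens (assumed)

text_feats = rng.standard_normal((n_text, d))  # frozen CLIP text-side features
img_feats = rng.standard_normal((n_img, d))    # frozen CLIP vision-side features

# Learnable projections (randomly initialized here purely for illustration)
W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: text attribute tokens query the image features,
# yielding dynamically generated multimodal visual prompts.
Q = text_feats @ W_q
K = img_feats @ W_k
V = img_feats @ W_v
attn = softmax(Q @ K.T / np.sqrt(d))  # (n_text, n_img) attention weights
prompts = attn @ V                    # (n_text, d) multimodal visual prompts

print(prompts.shape)  # → (4, 64)
```

The resulting prompt tokens would then be prepended to the patch sequence of the frozen CLIP vision encoder; only the prompt-generation parameters are trained and shared with the server.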

📝 Abstract
Textual prompt tuning adapts Vision-Language Models (e.g., CLIP) in federated learning by tuning lightweight input tokens (or prompts) on local client data, while keeping network weights frozen. After training, only the prompts are shared by the clients with the central server for aggregation. However, textual prompt tuning often struggles with overfitting to known concepts and may be overly reliant on memorized text features, limiting its adaptability to unseen concepts. To address this limitation, we propose Federated Multimodal Visual Prompt Tuning (FedMVP) that conditions the prompts on comprehensive contextual information -- image-conditioned features and textual attribute features of a class -- that is multimodal in nature. At the core of FedMVP is a PromptFormer module that synergistically aligns textual and visual features through cross-attention, enabling richer contextual integration. The dynamically generated multimodal visual prompts are then input to the frozen vision encoder of CLIP, and trained with a combination of CLIP similarity loss and a consistency loss. Extensive evaluation on 20 datasets spanning three generalization settings demonstrates that FedMVP not only preserves performance on in-distribution classes and domains, but also displays higher generalizability to unseen classes and domains when compared to state-of-the-art methods. Code will be released upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Overfitting in textual prompt tuning for federated learning
Limited adaptability to unseen concepts in CLIP models
Need for multimodal visual prompts to enhance generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal visual prompts integrate image and text features
PromptFormer aligns features via cross-attention for richer context
Combines CLIP similarity loss with consistency loss
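The combined training objective named above can be sketched as a CLIP-style similarity loss (cross-entropy over temperature-scaled cosine similarities between prompted image features and class text embeddings) plus a consistency term. The consistency loss shown here (mean-squared distance between prompted and unprompted image features) and the weighting coefficient are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, c = 8, 32, 5  # batch size, feature dim, number of classes (all assumed)

img_prompted = rng.standard_normal((n, d))  # features from prompted vision encoder
img_plain = rng.standard_normal((n, d))     # features without prompts (assumed anchor)
text_cls = rng.standard_normal((c, d))      # class text embeddings
labels = rng.integers(0, c, size=n)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_similarity_loss(img, txt, y, temperature=0.07):
    # Cross-entropy over cosine similarities, as in CLIP-style training
    logits = normalize(img) @ normalize(txt).T / temperature  # (n, c)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def consistency_loss(a, b):
    # Assumed form: MSE between normalized prompted and unprompted features
    return np.mean((normalize(a) - normalize(b)) ** 2)

lam = 0.5  # weighting coefficient (assumed hyperparameter)
total = (clip_similarity_loss(img_prompted, text_cls, labels)
         + lam * consistency_loss(img_prompted, img_plain))
print(float(total))
```

The consistency term acts as a regularizer keeping prompted features close to the frozen backbone's representations, which is one plausible mechanism behind the reported generalization to unseen classes and domains.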