🤖 AI Summary
To address the degraded generalization of CLIP models in multi-domain federated learning caused by client-wise domain shift and label heterogeneity, this paper proposes FedDEAP, a framework that disentangles semantic and domain-specific image features through unbiased transformation networks, introduces a dual-prompt design pairing a global semantic prompt with a local domain prompt, and thereby preserves domain-specific knowledge during federated aggregation. FedDEAP additionally aligns textual and visual representations under the two learned transformations to keep semantic and domain information consistent in the joint image–text embedding space. Theoretical analysis and experiments on four cross-domain benchmarks demonstrate that FedDEAP significantly improves CLIP's image classification accuracy and cross-domain generalization under non-IID federated settings. To the best of our knowledge, FedDEAP is the first approach to achieve an effective balance between semantic sharing and domain personalization for vision-language models in multi-domain federated learning.
📝 Abstract
Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP's generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.
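The dual-prompt design in component (2) can be illustrated with a minimal sketch: each client holds a shared global semantic prompt and a private local domain prompt, the server averages only the semantic prompts (FedAvg-style), and each client concatenates both prompts to condition the text encoder. All names and shapes here (`PROMPT_LEN`, `EMBED_DIM`, `aggregate_semantic_prompts`, `build_text_prompt`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sizes; the paper's real prompt length and CLIP embedding
# dimension are not specified here.
PROMPT_LEN, EMBED_DIM = 4, 8
N_CLIENTS = 3

rng = np.random.default_rng(0)

# Each client holds two learnable prompt token matrices:
#   "semantic" — shared across clients via server aggregation,
#   "domain"   — personalized; it never leaves the client.
clients = [
    {
        "semantic": rng.normal(size=(PROMPT_LEN, EMBED_DIM)),
        "domain": rng.normal(size=(PROMPT_LEN, EMBED_DIM)),
    }
    for _ in range(N_CLIENTS)
]

def aggregate_semantic_prompts(clients):
    """Server step: average only the global semantic prompts and
    broadcast the result back; domain prompts stay local."""
    avg = np.mean([c["semantic"] for c in clients], axis=0)
    for c in clients:
        c["semantic"] = avg.copy()
    return clients

def build_text_prompt(client):
    """Client step: concatenate the shared semantic prompt with the
    private domain prompt to condition the text encoder."""
    return np.concatenate([client["semantic"], client["domain"]], axis=0)

clients = aggregate_semantic_prompts(clients)
prompt = build_text_prompt(clients[0])
```

After aggregation, all clients agree on the semantic prompt while their domain prompts remain distinct, which is how the design balances shared and personalized information.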