Federated Cross-Modal Style-Aware Prompt Generation

πŸ“… 2025-08-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing federated prompt learning methods rely exclusively on top-layer features from the CLIP visual encoder, neglecting multi-scale visual cues and client-specific stylistic heterogeneity, which limits generalization. To address this, we propose FedCSAP, the first federated framework that jointly models low-, mid-, and high-level visual features of CLIP and incorporates batch-level style statistics as client-specific style indicators for style-aware, cross-modal contextual prompt generation. By aligning textual context with multi-scale visual features, FedCSAP generates discriminative, diverse, and non-redundant prompt vectors while preserving data privacy. Extensive experiments on multiple image classification benchmarks show that FedCSAP improves both in-distribution and out-of-distribution generalization, outperforming state-of-the-art federated prompt learning methods in accuracy and in cross-domain and cross-category transfer.
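
To make the summary's two key ingredients concrete, the sketch below pools multi-scale CLIP visual features and derives batch-level style statistics (per-channel mean and standard deviation, in the spirit of AdaIN) to condition a set of learnable prompt queries. This is a minimal illustration under stated assumptions: the name StylePromptGenerator, the layer choices, and the dimensions are hypothetical, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StylePromptGenerator(nn.Module):
    """Hypothetical sketch: fuses multi-scale CLIP visual features with
    batch-level style statistics to produce context prompt tokens.
    Layer choices and dimensions are illustrative assumptions."""

    def __init__(self, feat_dim=768, n_prompts=4, n_scales=3):
        super().__init__()
        # Project concatenated per-scale pooled features into prompt space.
        self.fuse = nn.Linear(feat_dim * n_scales, feat_dim)
        # Map style statistics (per-channel mean and std) to a style code.
        self.style_proj = nn.Linear(feat_dim * 2, feat_dim)
        # One learnable query per prompt token.
        self.queries = nn.Parameter(torch.randn(n_prompts, feat_dim))

    def forward(self, feats):
        # feats: list of n_scales tensors, each (batch, tokens, feat_dim),
        # e.g. hidden states from low/mid/high CLIP encoder blocks.
        pooled = torch.cat([f.mean(dim=1) for f in feats], dim=-1)  # (B, 3*D)
        content = self.fuse(pooled)                                 # (B, D)
        # Batch-level style statistics from the lowest-level features,
        # serving as the client-specific style indicator.
        low = feats[0]
        style = torch.cat([low.mean(dim=(0, 1)), low.std(dim=(0, 1))])  # (2*D,)
        style_code = self.style_proj(style)                             # (D,)
        # Condition the shared queries on content and client style.
        ctx = content.unsqueeze(1) + style_code        # (B, 1, D)
        return self.queries.unsqueeze(0) + ctx         # (B, n_prompts, D)
```

Because the style statistics are computed per batch rather than stored, each client's stylistic signature conditions the prompts without raw images or per-sample features ever leaving the device.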

πŸ“ Abstract
Prompt learning has propelled vision-language models like CLIP to excel in diverse tasks, making them ideal for federated learning due to computational efficiency. However, conventional approaches that rely solely on final-layer features miss out on rich multi-scale visual cues and domain-specific style variations in decentralized client data. To bridge this gap, we introduce FedCSAP (Federated Cross-Modal Style-Aware Prompt Generation). Our framework harnesses low-, mid-, and high-level features from CLIP's vision encoder alongside client-specific style indicators derived from batch-level statistics. By merging intricate visual details with textual context, FedCSAP produces robust, context-aware prompt tokens that are both distinct and non-redundant, thereby boosting generalization across seen and unseen classes. Operating within a federated learning paradigm, our approach ensures data privacy through local training and global aggregation, adeptly handling non-IID class distributions and diverse domain-specific styles. Comprehensive experiments on multiple image classification datasets confirm that FedCSAP outperforms existing federated prompt learning methods in both accuracy and overall generalization.
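
The abstract's requirement that prompt tokens be "distinct and non-redundant" suggests some form of diversity regularization. A minimal, hypothetical version penalizes pairwise cosine similarity among the generated prompt vectors; prompt_diversity_loss is an assumed name, and the paper's actual objective may differ.

```python
import torch
import torch.nn.functional as F

def prompt_diversity_loss(prompts):
    """Hypothetical regularizer encouraging distinct, non-redundant prompt
    tokens by penalizing pairwise cosine similarity between them."""
    # prompts: (batch, n_prompts, dim)
    p = F.normalize(prompts, dim=-1)
    sim = p @ p.transpose(-2, -1)                    # (B, n, n) cosine similarities
    n = prompts.size(1)
    off_diag = sim - torch.eye(n, device=prompts.device)  # zero out self-similarity
    return off_diag.abs().mean()
```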
Problem

Research questions and friction points this paper is trying to address.

Conventional prompt learning relies only on final-layer CLIP features, missing multi-scale visual cues
Decentralized client data carries domain-specific style variations that top-layer features do not capture
Existing federated prompt learning falls short in accuracy and generalization under non-IID distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale CLIP features for rich visual cues
Client-specific style indicators from batch statistics
Federated learning with local training and global aggregation (see the sketch after this list)
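
To make the local-training/global-aggregation loop concrete, here is a minimal FedAvg-style sketch that aggregates only the small prompt-generator parameters, with the CLIP backbone kept frozen on every client. The function name fedavg_prompt_params and the dataset-size weighting are illustrative assumptions; the paper's exact aggregation rule is not shown here.

```python
import copy

def fedavg_prompt_params(client_states, client_sizes):
    """Minimal FedAvg sketch over prompt-generator parameters only.
    CLIP backbones stay frozen, so just the lightweight prompt modules
    are communicated; weighting by local dataset size is a standard
    choice, not necessarily the paper's exact scheme."""
    total = sum(client_sizes)
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Usage: each client trains locally, then the server aggregates.
# states = [client.prompt_gen.state_dict() for client in clients]
# sizes = [len(client.dataset) for client in clients]
# merged = fedavg_prompt_params(states, sizes)
# for client in clients:
#     client.prompt_gen.load_state_dict(merged)
```

Communicating only prompt parameters keeps per-round traffic small, which is the practical appeal of prompt learning in the federated setting.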
Suraj Prasad
Indian Institute of Technology Bombay
Navyansh Mahla
Indian Institute of Technology Bombay
Sunny Gupta
Indian Institute of Technology Bombay
Amit Sethi
Indian Institute of Technology Bombay, Indian Institute of Technology Guwahati, University of
Image processing, computer vision, machine learning, medical image processing