Federated Cross-Modal Style-Aware Prompt Generation

πŸ“… 2025-08-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing federated prompt learning methods rely exclusively on top-layer features from the CLIP visual encoder, neglecting multi-scale visual cues and client-specific stylistic heterogeneity, which limits generalization. To address this, we propose FedCSAP, the first federated framework that jointly models low-, mid-, and high-level visual features of CLIP and incorporates batch-level style statistics as client-specific style indicators for style-aware, cross-modal contextual prompt generation. By aligning textual context with multi-scale visual features, FedCSAP generates discriminative, diverse, and non-redundant prompt vectors while preserving data privacy. Extensive experiments on multiple image classification benchmarks show that FedCSAP improves both in-distribution and out-of-distribution generalization, outperforming state-of-the-art federated prompt learning methods in accuracy and in cross-domain and cross-category transfer.
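
To make the summary's two key ingredients concrete, the sketch below pools multi-scale CLIP visual features and derives batch-level style statistics (per-channel mean and standard deviation, in the spirit of AdaIN) to condition a set of learnable prompt queries. This is a minimal illustration under stated assumptions: the name StylePromptGenerator, the layer choices, and the dimensions are hypothetical, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StylePromptGenerator(nn.Module):
    """Hypothetical sketch: fuses multi-scale CLIP visual features with
    batch-level style statistics to produce context prompt tokens.
    Layer choices and dimensions are illustrative assumptions."""

    def __init__(self, feat_dim=768, n_prompts=4, n_scales=3):
        super().__init__()
        # Project concatenated per-scale pooled features into prompt space.
        self.fuse = nn.Linear(feat_dim * n_scales, feat_dim)
        # Map style statistics (per-channel mean and std) to a style code.
        self.style_proj = nn.Linear(feat_dim * 2, feat_dim)
        # One learnable query per prompt token.
        self.queries = nn.Parameter(torch.randn(n_prompts, feat_dim))

    def forward(self, feats):
        # feats: list of n_scales tensors, each (batch, tokens, feat_dim),
        # e.g. hidden states from low/mid/high CLIP encoder blocks.
        pooled = torch.cat([f.mean(dim=1) for f in feats], dim=-1)  # (B, 3*D)
        content = self.fuse(pooled)                                 # (B, D)
        # Batch-level style statistics from the lowest-level features,
        # serving as the client-specific style indicator.
        low = feats[0]
        style = torch.cat([low.mean(dim=(0, 1)), low.std(dim=(0, 1))])  # (2*D,)
        style_code = self.style_proj(style)                             # (D,)
        # Condition the shared queries on content and client style.
        ctx = content.unsqueeze(1) + style_code        # (B, 1, D)
        return self.queries.unsqueeze(0) + ctx         # (B, n_prompts, D)
```

Because the style statistics are computed per batch rather than stored, each client's stylistic signature conditions the prompts without raw images or per-sample features ever leaving the device.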

πŸ“ Abstract
Prompt learning has propelled vision-language models like CLIP to excel in diverse tasks, making them ideal for federated learning due to computational efficiency. However, conventional approaches that rely solely on final-layer features miss out on rich multi-scale visual cues and domain-specific style variations in decentralized client data. To bridge this gap, we introduce FedCSAP (Federated Cross-Modal Style-Aware Prompt Generation). Our framework harnesses low-, mid-, and high-level features from CLIP's vision encoder alongside client-specific style indicators derived from batch-level statistics. By merging intricate visual details with textual context, FedCSAP produces robust, context-aware prompt tokens that are both distinct and non-redundant, thereby boosting generalization across seen and unseen classes. Operating within a federated learning paradigm, our approach ensures data privacy through local training and global aggregation, adeptly handling non-IID class distributions and diverse domain-specific styles. Comprehensive experiments on multiple image classification datasets confirm that FedCSAP outperforms existing federated prompt learning methods in both accuracy and overall generalization.
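
The abstract's requirement that prompt tokens be "distinct and non-redundant" suggests some form of diversity regularization. A minimal, hypothetical version penalizes pairwise cosine similarity among the generated prompt vectors; prompt_diversity_loss is an assumed name, and the paper's actual objective may differ.

```python
import torch
import torch.nn.functional as F

def prompt_diversity_loss(prompts):
    """Hypothetical regularizer encouraging distinct, non-redundant prompt
    tokens by penalizing pairwise cosine similarity between them."""
    # prompts: (batch, n_prompts, dim)
    p = F.normalize(prompts, dim=-1)
    sim = p @ p.transpose(-2, -1)                    # (B, n, n) cosine similarities
    n = prompts.size(1)
    off_diag = sim - torch.eye(n, device=prompts.device)  # zero out self-similarity
    return off_diag.abs().mean()
```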
Problem

Research questions and friction points this paper is trying to address.

Conventional prompt learning relies only on final-layer CLIP features, missing multi-scale visual cues
Decentralized client data carries domain-specific style variations that top-layer features do not capture
Existing federated prompt learning falls short in accuracy and generalization under non-IID distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale CLIP features for rich visual cues
Client-specific style indicators from batch statistics
Federated learning with local training and global aggregation (see the sketch after this list)
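
To make the local-training/global-aggregation loop concrete, here is a minimal FedAvg-style sketch that aggregates only the small prompt-generator parameters, with the CLIP backbone kept frozen on every client. The function name fedavg_prompt_params and the dataset-size weighting are illustrative assumptions; the paper's exact aggregation rule is not shown here.

```python
import copy

def fedavg_prompt_params(client_states, client_sizes):
    """Minimal FedAvg sketch over prompt-generator parameters only.
    CLIP backbones stay frozen, so just the lightweight prompt modules
    are communicated; weighting by local dataset size is a standard
    choice, not necessarily the paper's exact scheme."""
    total = sum(client_sizes)
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Usage: each client trains locally, then the server aggregates.
# states = [client.prompt_gen.state_dict() for client in clients]
# sizes = [len(client.dataset) for client in clients]
# merged = fedavg_prompt_params(states, sizes)
# for client in clients:
#     client.prompt_gen.load_state_dict(merged)
```

Communicating only prompt parameters keeps per-round traffic small, which is the practical appeal of prompt learning in the federated setting.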
Suraj Prasad
Indian Institute of Technology Bombay
Navyansh Mahla
Indian Institute of Technology Bombay
Sunny Gupta
Indian Institute of Technology Bombay
Amit Sethi
Indian Institute of Technology Bombay, Indian Institute of Technology Guwahati, University of
Image processing, computer vision, machine learning, medical image processing