🤖 AI Summary
Addressing the challenges of backdoor defense in federated learning under non-IID data—particularly the reliance on assumptions of homogeneous client data or on clean server-side data—this paper proposes CLIP-Fed, the first framework to leverage the zero-shot capability of vision-language pre-trained models (CLIP) for federated backdoor defense. CLIP-Fed generates a server-side augmented dataset without any client samples via frequency-domain analysis and introduces a combined pre- and post-aggregation defense strategy. This strategy jointly employs a prototype contrastive loss and KL-divergence regularization to decouple trigger patterns from their target labels, enabling robust defense without supervision or client data. Evaluated on CIFAR-10 and CIFAR-10-LT, CLIP-Fed reduces the average attack success rate (ASR) by 2.03% and 1.35%, respectively, while improving model accuracy (MA) by 7.92% and 0.48%, significantly outperforming state-of-the-art methods.
📝 Abstract
Existing backdoor defense methods in Federated Learning (FL) rely on the assumption of homogeneous client data distributions or on the availability of a clean server dataset, which limits their practicality and effectiveness. Defending against backdoor attacks under heterogeneous client data distributions while preserving model performance remains a significant challenge. In this paper, we propose an FL backdoor defense framework named CLIP-Fed, which leverages the zero-shot learning capabilities of vision-language pre-trained models. By integrating both pre-aggregation and post-aggregation defense strategies, CLIP-Fed overcomes the limitations that non-IID data imposes on defense effectiveness. To address privacy concerns and broaden the dataset's coverage of diverse triggers, we construct and augment the server dataset using a multimodal large language model and frequency analysis, without any client samples. To correct class prototype deviations caused by backdoor samples and to eliminate the correlation between trigger patterns and target labels, CLIP-Fed aligns the knowledge of the global model and CLIP on the augmented dataset using a prototype contrastive loss and Kullback-Leibler (KL) divergence. Extensive experiments on representative datasets validate the effectiveness of CLIP-Fed. Compared to state-of-the-art methods, CLIP-Fed reduces the average attack success rate (ASR) by 2.03% on CIFAR-10 and 1.35% on CIFAR-10-LT, while improving average model accuracy (MA) by 7.92% and 0.48%, respectively.
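To make the two alignment terms concrete, here is a minimal NumPy sketch of how a KL-divergence alignment loss (global model vs. CLIP zero-shot predictions) and a prototype contrastive loss could be computed. The paper does not give its exact formulation here, so the temperature `tau`, the use of CLIP as the teacher distribution, and the InfoNCE-style prototype term are assumptions for illustration only.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_alignment_loss(student_logits, teacher_logits, eps=1e-12):
    """KL(teacher || student), averaged over the batch.

    teacher_logits: CLIP zero-shot logits on the augmented server dataset
    student_logits: global-model logits on the same samples (assumed roles).
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def prototype_contrastive_loss(features, labels, prototypes, tau=0.1):
    """InfoNCE-style loss pulling each feature toward its class prototype.

    features:   (N, d) sample embeddings from the global model
    prototypes: (C, d) per-class prototype embeddings (e.g. from CLIP)
    tau is a hypothetical temperature hyperparameter.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ p.T / tau  # cosine similarity to every class prototype
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-probability of each sample's own class prototype.
    return float(-np.mean(log_prob[np.arange(len(labels)), labels]))
```

In this sketch, the KL term pushes the global model's predictive distribution toward CLIP's trigger-agnostic zero-shot predictions, while the contrastive term anchors sample embeddings to their class prototypes, which is one plausible way to break the learned association between a trigger pattern and its target label.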