🤖 AI Summary
To address three key challenges in smart agriculture image classification (data privacy leakage, performance degradation under non-IID data distributions across clients, and high communication overhead in federated learning), this paper proposes a feature-replay-based federated transfer learning framework. Our method freezes a pre-trained CLIP ViT visual encoder to extract irreversible, semantically robust image features; only 1% of class-level prototype features are shared across clients to enable cross-client semantic alignment; and lightweight Transformer classifiers are trained locally, with knowledge transfer facilitated via a feature replay mechanism. Evaluated on agricultural image classification, our approach achieves 86.6% accuracy, more than four times that of mainstream federated baselines, while drastically reducing communication costs. The framework simultaneously ensures strong privacy preservation and robustness to non-IID data, offering a practical solution for privacy-sensitive, resource-constrained edge environments in smart agriculture.
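The summary above describes a two-part local model: a frozen CLIP ViT encoder that produces image features, and a small Transformer classifier that is the only component updated in federation. Below is a minimal sketch of that structure using the OpenAI `clip` package and PyTorch; the class and constant names (`LightweightClassifier`, `NUM_CLASSES`, `EMBED_DIM`) and the exact layer sizes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package

NUM_CLASSES = 10   # assumed number of crop/pest classes
EMBED_DIM = 512    # CLIP ViT-B/32 image-embedding size

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
for p in clip_model.parameters():   # freeze the pre-trained visual encoder
    p.requires_grad = False

class LightweightClassifier(nn.Module):
    """Small Transformer head trained locally; only this part is communicated."""
    def __init__(self, embed_dim=EMBED_DIM, num_classes=NUM_CLASSES):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, feats):                 # feats: (batch, embed_dim)
        x = self.encoder(feats.unsqueeze(1))  # each feature as a 1-token sequence
        return self.head(x.squeeze(1))

classifier = LightweightClassifier().to(device)

@torch.no_grad()
def extract_features(images):
    """Image features from the frozen CLIP ViT; raw images are never shared."""
    return clip_model.encode_image(images).float()
```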
📝 Abstract
Accurate image classification plays a pivotal role in smart agriculture, enabling applications such as crop monitoring, fruit recognition, and pest detection. However, conventional centralized training often requires large-scale data collection, which raises privacy concerns, while standard federated learning struggles with non-independent and identically distributed (non-IID) data and incurs high communication costs. To address these challenges, we propose a federated learning framework that integrates a frozen Contrastive Language-Image Pre-training (CLIP) vision transformer (ViT) with a lightweight Transformer classifier. By leveraging the strong feature extraction capability of the pre-trained CLIP ViT, the framework avoids training large-scale models from scratch and restricts federated updates to a compact classifier, thereby significantly reducing transmission overhead. Furthermore, to mitigate performance degradation caused by non-IID data distributions, a small subset (1%) of CLIP-extracted feature representations from all classes is shared across clients. These shared features cannot be inverted back to the raw images, ensuring privacy preservation while aligning class representations across participants. Experimental results on agricultural classification tasks show that the proposed method achieves 86.6% accuracy, more than four times that of baseline federated learning approaches. This demonstrates the effectiveness and efficiency of combining vision-language model features with federated learning for privacy-preserving and scalable agricultural intelligence.
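The abstract describes two data-handling steps: each client contributes roughly 1% of its CLIP-extracted features per class to a shared pool, and every client then replays that pool alongside its local data when training the classifier. The sketch below illustrates one plausible reading of those steps under assumed names (`build_shared_pool`, `local_train_step`, `replay_size` are hypothetical); the paper's exact sampling and replay schedule may differ.

```python
import random
import torch
import torch.nn.functional as F

def build_shared_pool(local_feats, local_labels, share_ratio=0.01):
    """Sample ~1% of frozen-CLIP feature vectors per class to share with other clients."""
    shared_feats, shared_labels = [], []
    for c in local_labels.unique():
        idx = (local_labels == c).nonzero(as_tuple=True)[0].tolist()
        k = max(1, int(len(idx) * share_ratio))   # keep at least one sample per class
        picked = random.sample(idx, k)
        shared_feats.append(local_feats[picked])
        shared_labels.append(local_labels[picked])
    return torch.cat(shared_feats), torch.cat(shared_labels)

def local_train_step(classifier, optimizer, batch_feats, batch_labels,
                     pool_feats, pool_labels, replay_size=32):
    """One local update: mix a replay batch of pooled shared features into the local batch."""
    ridx = torch.randint(0, pool_feats.size(0), (replay_size,))
    feats = torch.cat([batch_feats, pool_feats[ridx]])
    labels = torch.cat([batch_labels, pool_labels[ridx]])
    logits = classifier(feats)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only these compact feature vectors and the small classifier's weights leave each client, raw images never traverse the network, which is the basis of the privacy and communication-cost claims above.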