CT-CLIP: A Multi-modal Fusion Framework for Robust Apple Leaf Disease Recognition in Complex Environments

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
In orchard environments, apple leaf disease phenotypes exhibit high heterogeneity, and conventional multi-scale feature fusion methods struggle to simultaneously preserve local texture details and global semantic context. To address this, we propose CT-CLIP, a multimodal fusion framework that synergistically leverages CNNs for local texture modeling and Vision Transformers for global structural representation. An Adaptive Feature Fusion Module (AFFM) is introduced to dynamically integrate hierarchical visual features. Furthermore, CLIP’s pre-trained vision–language alignment capability is incorporated to bridge image representations with textual disease descriptions, substantially improving few-shot generalization. Evaluated on both public and in-house apple disease datasets, CT-CLIP achieves accuracies of 97.38% and 96.12%, respectively—outperforming state-of-the-art baselines. These results demonstrate the framework’s robustness and practical efficacy in complex, real-world agricultural scenarios.
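To make the dual-branch design concrete, below is a minimal PyTorch sketch of a CNN branch (local lesion texture), a ViT branch (global structure), and a gated fusion module. The paper does not publish AFFM internals, so the softmax-gated weighting and all names here (AFFM, CTBackbone) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: dual-branch CNN + ViT with an adaptive fusion gate.
import torch
import torch.nn as nn
import torchvision.models as models

class AFFM(nn.Module):
    """Hypothetical Adaptive Feature Fusion Module: learns per-sample
    weights that balance local (CNN) and global (ViT) features."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, local_feat, global_feat):
        w = self.gate(torch.cat([local_feat, global_feat], dim=-1))
        # w[:, :1] weights the local branch, w[:, 1:] the global branch
        return w[:, :1] * local_feat + w[:, 1:] * global_feat

class CTBackbone(nn.Module):
    """CNN branch for lesion texture, ViT branch for global structure."""
    def __init__(self, dim: int = 512, num_classes: int = 5):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # (B, 512, 1, 1)
        vit = models.vit_b_16(weights=None)
        vit.heads = nn.Identity()                                 # (B, 768)
        self.vit = vit
        self.proj_cnn = nn.Linear(512, dim)
        self.proj_vit = nn.Linear(768, dim)
        self.fuse = AFFM(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        local_feat = self.proj_cnn(self.cnn(x).flatten(1))
        global_feat = self.proj_vit(self.vit(x))
        return self.head(self.fuse(local_feat, global_feat))

model = CTBackbone()
logits = model(torch.randn(2, 3, 224, 224))  # ViT-B/16 expects 224x224
print(logits.shape)  # torch.Size([2, 5])
```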

📝 Abstract
In complex orchard environments, the phenotypic heterogeneity of apple leaf diseases, characterized by significant variation among lesions, poses a challenge to traditional multi-scale feature fusion methods. These methods only integrate multi-layer features extracted by convolutional neural networks (CNNs) and fail to adequately account for the relationships between local and global features. Therefore, this study proposes a multi-branch recognition framework named CNN-Transformer-CLIP (CT-CLIP). The framework synergistically employs a CNN to extract local lesion detail features and a Vision Transformer to capture global structural relationships. An Adaptive Feature Fusion Module (AFFM) then dynamically fuses these features, achieving optimal coupling of local and global information and effectively addressing the diversity in lesion morphology and distribution. Additionally, to mitigate interference from complex backgrounds and enhance recognition accuracy under few-shot conditions, this study proposes a multimodal image-text learning approach: by leveraging pre-trained CLIP weights, it achieves deep alignment between visual features and semantic disease descriptions. Experimental results show that CT-CLIP achieves accuracies of 97.38% and 96.12% on a publicly available apple disease dataset and a self-built dataset, respectively, outperforming several baseline methods. CT-CLIP demonstrates strong capability in recognizing agricultural diseases, significantly enhances identification accuracy under complex environmental conditions, and provides an innovative, practical solution for automated disease recognition in agricultural applications.
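The image-text alignment step can be illustrated with a standard CLIP zero-shot scoring loop, shown below using OpenAI's clip package. This is a minimal sketch of how pre-trained CLIP weights match leaf images against textual disease descriptions; the prompt wording and disease names are illustrative assumptions, and the paper's fine-tuning pipeline is not reproduced here.

```python
# Hedged sketch: CLIP-style image-text matching for disease recognition.
# Requires: pip install git+https://github.com/openai/CLIP.git
import torch
import clip

device = "cpu"  # clip.load downcasts to fp16 on CUDA; CPU keeps the sketch simple
model, preprocess = clip.load("ViT-B/16", device=device)

# Illustrative class descriptions, not the paper's exact prompts.
diseases = ["healthy apple leaf",
            "apple leaf with rust lesions",
            "apple leaf with scab lesions",
            "apple leaf with powdery mildew"]
text = clip.tokenize([f"a photo of a {d}" for d in diseases]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    # In practice: image = preprocess(PIL.Image.open("leaf.jpg")).unsqueeze(0)
    image = torch.randn(1, 3, 224, 224).to(device)  # stand-in tensor
    img_feat = model.encode_image(image)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    # Cosine similarity between image and each description, as class scores
    probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)
print(probs)
```

Because classification reduces to image-text similarity, new disease classes can be scored from descriptions alone, which is what gives the approach its few-shot appeal.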
Problem

Research questions and friction points this paper is trying to address.

Addresses apple leaf disease recognition in complex orchard environments
Overcomes limitations of traditional multi-scale feature fusion methods
Mitigates interference from complex backgrounds to enhance accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

CNN-Transformer-CLIP framework for multi-modal apple disease recognition
Adaptive Feature Fusion Module dynamically combines local and global features
Pre-trained CLIP weights align visual features with semantic descriptions
Lemin Liu
College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
Fangchao Hu
College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
Honghua Jiang
College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
Yaru Chen
Centre for Vision Speech and Signal Processing (CVSSP), University of Surrey
Multi-modal learning · Computer vision
Limin Liu
School of Mechanical and Electronic Engineering, Shandong Agricultural Engineering College, Jinan 250100, China
Yongliang Qiao
Australian Institute for Machine Learning (AIML), The University of Adelaide
Smart agriculture · Causality · Artificial intelligence · Agricultural robots · Intelligent perception