CT-CLIP: A Multi-modal Fusion Framework for Robust Apple Leaf Disease Recognition in Complex Environments

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
In orchard environments, apple leaf disease phenotypes exhibit high heterogeneity, and conventional multi-scale feature fusion methods struggle to simultaneously preserve local texture details and global semantic context. To address this, we propose CT-CLIP, a multimodal fusion framework that synergistically leverages CNNs for local texture modeling and Vision Transformers for global structural representation. An Adaptive Feature Fusion Module (AFFM) is introduced to dynamically integrate hierarchical visual features. Furthermore, CLIP’s pre-trained vision–language alignment capability is incorporated to bridge image representations with textual disease descriptions, substantially improving few-shot generalization. Evaluated on both public and in-house apple disease datasets, CT-CLIP achieves accuracies of 97.38% and 96.12%, respectively—outperforming state-of-the-art baselines. These results demonstrate the framework’s robustness and practical efficacy in complex, real-world agricultural scenarios.
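To make the dual-branch design concrete, below is a minimal PyTorch sketch of a CNN branch (local lesion texture), a ViT branch (global structure), and a gated fusion module. The paper does not publish AFFM internals, so the softmax-gated weighting and all names here (AFFM, CTBackbone) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: dual-branch CNN + ViT with an adaptive fusion gate.
import torch
import torch.nn as nn
import torchvision.models as models

class AFFM(nn.Module):
    """Hypothetical Adaptive Feature Fusion Module: learns per-sample
    weights that balance local (CNN) and global (ViT) features."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, local_feat, global_feat):
        w = self.gate(torch.cat([local_feat, global_feat], dim=-1))
        # w[:, :1] weights the local branch, w[:, 1:] the global branch
        return w[:, :1] * local_feat + w[:, 1:] * global_feat

class CTBackbone(nn.Module):
    """CNN branch for lesion texture, ViT branch for global structure."""
    def __init__(self, dim: int = 512, num_classes: int = 5):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # (B, 512, 1, 1)
        vit = models.vit_b_16(weights=None)
        vit.heads = nn.Identity()                                 # (B, 768)
        self.vit = vit
        self.proj_cnn = nn.Linear(512, dim)
        self.proj_vit = nn.Linear(768, dim)
        self.fuse = AFFM(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        local_feat = self.proj_cnn(self.cnn(x).flatten(1))
        global_feat = self.proj_vit(self.vit(x))
        return self.head(self.fuse(local_feat, global_feat))

model = CTBackbone()
logits = model(torch.randn(2, 3, 224, 224))  # ViT-B/16 expects 224x224
print(logits.shape)  # torch.Size([2, 5])
```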

📝 Abstract
In complex orchard environments, the phenotypic heterogeneity of apple leaf diseases, characterized by significant variation among lesions, poses a challenge to traditional multi-scale feature fusion methods. These methods only integrate multi-layer features extracted by convolutional neural networks (CNNs) and fail to adequately account for the relationships between local and global features. Therefore, this study proposes a multi-branch recognition framework named CNN-Transformer-CLIP (CT-CLIP). The framework synergistically employs a CNN to extract local lesion detail features and a Vision Transformer to capture global structural relationships. An Adaptive Feature Fusion Module (AFFM) then dynamically fuses these features, achieving optimal coupling of local and global information and effectively addressing the diversity in lesion morphology and distribution. Additionally, to mitigate interference from complex backgrounds and enhance recognition accuracy under few-shot conditions, this study proposes a multimodal image-text learning approach: by leveraging pre-trained CLIP weights, it achieves deep alignment between visual features and semantic disease descriptions. Experimental results show that CT-CLIP achieves accuracies of 97.38% and 96.12% on a publicly available apple disease dataset and a self-built dataset, respectively, outperforming several baseline methods. CT-CLIP demonstrates strong capability in recognizing agricultural diseases, significantly enhances identification accuracy under complex environmental conditions, and provides an innovative, practical solution for automated disease recognition in agricultural applications.
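The image-text alignment step can be illustrated with a standard CLIP zero-shot scoring loop, shown below using OpenAI's clip package. This is a minimal sketch of how pre-trained CLIP weights match leaf images against textual disease descriptions; the prompt wording and disease names are illustrative assumptions, and the paper's fine-tuning pipeline is not reproduced here.

```python
# Hedged sketch: CLIP-style image-text matching for disease recognition.
# Requires: pip install git+https://github.com/openai/CLIP.git
import torch
import clip

device = "cpu"  # clip.load downcasts to fp16 on CUDA; CPU keeps the sketch simple
model, preprocess = clip.load("ViT-B/16", device=device)

# Illustrative class descriptions, not the paper's exact prompts.
diseases = ["healthy apple leaf",
            "apple leaf with rust lesions",
            "apple leaf with scab lesions",
            "apple leaf with powdery mildew"]
text = clip.tokenize([f"a photo of a {d}" for d in diseases]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    # In practice: image = preprocess(PIL.Image.open("leaf.jpg")).unsqueeze(0)
    image = torch.randn(1, 3, 224, 224).to(device)  # stand-in tensor
    img_feat = model.encode_image(image)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    # Cosine similarity between image and each description, as class scores
    probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)
print(probs)
```

Because classification reduces to image-text similarity, new disease classes can be scored from descriptions alone, which is what gives the approach its few-shot appeal.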
Problem

Research questions and friction points this paper is trying to address.

Addresses apple leaf disease recognition in complex orchard environments
Overcomes limitations of traditional multi-scale feature fusion methods
Mitigates interference from complex backgrounds to enhance accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

CNN-Transformer-CLIP framework for multi-modal apple disease recognition
Adaptive Feature Fusion Module dynamically combines local and global features
Pre-trained CLIP weights align visual features with semantic descriptions
Lemin Liu
College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
Fangchao Hu
College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
Honghua Jiang
College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
Yaru Chen
Centre for Vision Speech and Signal Processing (CVSSP), University of Surrey
Multi-modal learning · Computer vision
Limin Liu
School of Mechanical and Electronic Engineering, Shandong Agricultural Engineering College, Jinan 250100, China
Yongliang Qiao
Australian Institute for Machine Learning (AIML), The University of Adelaide
Smart agriculture · Causality · Artificial intelligence · Agricultural robots · Intelligent perception