🤖 AI Summary
To address challenges in crop disease diagnosis—including difficulty in fusing heterogeneous multimodal data, poor model interpretability, and limited cross-domain generalization—this paper introduces the first multimodal agricultural benchmark dataset integrating images, textual disease descriptions, and a disease-specific knowledge graph, covering 12 staple crops and 86 diseases. The authors propose a Vision–Language–Knowledge Collaborative Diagnosis Framework that unifies annotation across all three modalities, injects domain knowledge via ViT-BERT and Graph Neural Networks (GNNs), aligns the modalities through contrastive learning, and enables fine-grained, interpretable classification. Evaluated on the new benchmark, the framework achieves a mean accuracy of 92.7% and improves cross-domain generalization by 14.3% over state-of-the-art unimodal and bimodal methods. This work establishes a reusable multimodal paradigm and a technical foundation for intelligent agricultural extension services.
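The contrastive alignment step mentioned above typically works by pulling matched cross-modal pairs together and pushing mismatched pairs apart in a shared embedding space. Below is a minimal, dependency-free sketch of a CLIP-style symmetric InfoNCE objective over paired image/text embeddings; the toy vectors, the `temperature` value, and the function names are illustrative assumptions, not details from the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss (a sketch, not the paper's exact objective).

    Each image embedding should match its paired text embedding against
    all other texts in the batch, and vice versa.
    """
    n = len(img_embs)
    # Temperature-scaled similarity matrix: rows = images, cols = texts.
    sims = [[cosine(i, t) / temperature for t in txt_embs] for i in img_embs]

    def cross_entropy(row, target):
        # Numerically stable log-sum-exp minus the target logit.
        m = max(row)
        logsum = m + math.log(sum(math.exp(s - m) for s in row))
        return logsum - row[target]

    img_to_txt = sum(cross_entropy(sims[k], k) for k in range(n)) / n
    txt_to_img = sum(
        cross_entropy([sims[r] [k] for r in range(n)], k) for k in range(n)
    ) / n
    return (img_to_txt + txt_to_img) / 2

# Toy batch: matched pairs point in similar directions, so the loss is small.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]
print(info_nce(imgs, txts))
```

In practice the embeddings would come from the ViT and BERT encoders (plus GNN-encoded knowledge-graph features), and the loss would be optimized over large batches rather than this two-pair toy example.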