LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases

📅 2026-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the scarcity of large-scale multimodal plant disease datasets and evaluation benchmarks, which has hindered the application of vision-language models (VLMs) to agricultural pathology diagnosis. To bridge this gap, the authors introduce LeafNet, a dataset comprising 186,000 leaf images and 13,950 structured question-answer pairs, together with LeafBench, a comprehensive benchmark covering 97 disease categories and six core agricultural tasks. This work presents the first fine-grained multimodal question-answering benchmark tailored to plant diseases and systematically evaluates 12 state-of-the-art VLMs. Experimental results show that while binary classification of healthy versus diseased leaves achieves over 90% accuracy, performance on pathogen and species identification remains below 65%. Furthermore, fine-tuned VLMs significantly outperform purely visual models, demonstrating the critical role of linguistic information in enhancing diagnostic accuracy.

📝 Abstract
Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application to domain-specific agricultural tasks such as plant pathology remains limited by the lack of large-scale, comprehensive multimodal image-text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 digital leaf images spanning 97 disease classes, paired with metadata from which 13,950 question-answer pairs were generated across six critical agricultural tasks. The questions assess multiple aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on LeafBench reveals substantial disparities in their disease understanding capabilities. Performance varies markedly across tasks: binary healthy-vs-diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Direct comparison between vision-only models and VLMs demonstrates the advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology and position LeafBench as a rigorous framework for measuring methodological progress toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.
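The abstract's headline numbers (over 90% on binary classification, below 65% on pathogen and species identification) are per-task accuracies over the question-answer pairs. The paper's actual scoring script is not shown here; a minimal sketch of per-task exact-match accuracy might look like the following, where the task names and answer strings are invented for illustration:

```python
from collections import defaultdict

def per_task_accuracy(qa_results):
    """Aggregate exact-match accuracy per task from (task, predicted, gold) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in qa_results:
        total[task] += 1
        # Case- and whitespace-insensitive string match as a simple scoring rule
        if predicted.strip().lower() == gold.strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}

# Toy illustration; task names and answers are hypothetical, not from LeafBench
results = [
    ("healthy_vs_diseased", "diseased", "diseased"),
    ("healthy_vs_diseased", "healthy", "healthy"),
    ("pathogen_id", "Alternaria", "Phytophthora"),
    ("pathogen_id", "rust fungus", "Rust Fungus"),
]
print(per_task_accuracy(results))
# → {'healthy_vs_diseased': 1.0, 'pathogen_id': 0.5}
```

Real VQA benchmarks often add answer normalization (article stripping, synonym lists) on top of exact match, but the per-task aggregation pattern is the same.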
Problem

Research questions and friction points this paper is trying to address.

vision-language models
plant disease diagnosis
multimodal dataset
agricultural AI
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
multimodal dataset
plant disease diagnosis
visual question answering
foundation models