🤖 AI Summary
Agricultural applications face severe challenges from scarce labeled data and the difficulty of identifying plant stress phenotypes. Method: This work introduces AgEval, the first agriculture-specific Vision Language Model (VLM) evaluation benchmark, covering 12 plant stress phenotyping tasks and systematically assessing the zero-shot and few-shot (1–8 example) generalization of leading VLMs (e.g., Claude, GPT, Gemini, LLaVA). A coefficient-of-variation (CV) metric quantifies performance disparity across classes. Contribution/Results: The study reveals significant class-level bias in VLMs applied to agriculture (CV = 26.02%–58.03%). Supplying exact category examples improves F1 scores by 15.38% on average, and the best-performing model reaches 73.37% F1 in the 8-shot setting, a 27.13 percentage-point gain over its 46.24% zero-shot score. AgEval establishes a reproducible framework for evaluating agriculture-oriented VLMs.
📝 Abstract
As Vision Language Models (VLMs) become increasingly accessible to farmers and agricultural experts, there is a growing need to evaluate their potential in specialized tasks. We present AgEval, a comprehensive benchmark for assessing VLMs' capabilities in plant stress phenotyping, offering a solution to the challenge of limited annotated data in agriculture. Our study explores how general-purpose VLMs can be leveraged for domain-specific tasks with only a few annotated examples, providing insights into their behavior and adaptability. AgEval encompasses 12 diverse plant stress phenotyping tasks, evaluating zero-shot and few-shot in-context learning performance of state-of-the-art models including Claude, GPT, Gemini, and LLaVA. Our results demonstrate VLMs' rapid adaptability to specialized tasks, with the best-performing model showing an increase in F1 scores from 46.24% to 73.37% in 8-shot identification. To quantify performance disparities across classes, we introduce metrics such as the coefficient of variation (CV), revealing that VLMs' training impacts classes differently, with CV ranging from 26.02% to 58.03%. We also find that strategic example selection enhances model reliability, with exact category examples improving F1 scores by 15.38% on average. AgEval establishes a framework for assessing VLMs in agricultural applications, offering valuable benchmarks for future evaluations. Our findings suggest that VLMs, with minimal few-shot examples, show promise as a viable alternative to traditional specialized models in plant stress phenotyping, while also highlighting areas for further refinement. Results and benchmark details are available at: https://github.com/arbab-ml/AgEval
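The coefficient of variation used to quantify per-class disparity is the standard ratio of the standard deviation to the mean, expressed as a percentage. A minimal sketch, assuming hypothetical per-class F1 scores (the values below are illustrative, not from the paper):

```python
import statistics

def coefficient_of_variation(scores):
    """CV = (population std / mean) * 100; higher values mean
    more uneven performance across classes."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)  # population std; sample std is another valid choice
    return std / mean * 100

# Hypothetical per-class F1 scores (as fractions) for one model
f1_per_class = [0.82, 0.55, 0.71, 0.40, 0.66]
cv = coefficient_of_variation(f1_per_class)
print(f"CV = {cv:.2f}%")
```

Because CV is unitless, it allows comparing the evenness of per-class performance across models whose average F1 scores differ.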