AI Summary
This work addresses the difficulty that general-purpose multimodal models have in combining domain-specific agronomic knowledge with the fine-grained visual reasoning required for plant phenotyping. To bridge this gap, the authors propose PlantXpert, the first structured and reproducible multimodal benchmark tailored for soybean and cotton, comprising 385 drone-captured images and over 3,000 annotated samples spanning critical tasks such as disease, pest, and weed detection and yield estimation. The study evaluates 11 state-of-the-art vision-language models with task-specific fine-tuning and multi-step reasoning protocols; fine-tuned models such as Qwen3-VL-4B and Qwen3-VL-30B reach up to 78% accuracy. However, the results reveal diminishing returns from model scaling and limited cross-crop generalization, highlighting that robust quantitative and biologically plausible reasoning remains a fundamental challenge in agricultural AI.
Abstract
To improve crop genetics, high-throughput, effective and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially in vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs, and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising 385 digital images and more than 3,000 benchmark samples spanning key plant science domains including disease, pest control, weed management, and yield. The benchmark can assess diverse capabilities including visual expertise, quantitative reasoning, and multi-step agronomic reasoning. A total of 11 state-of-the-art VLMs were evaluated. The results indicate that task-specific fine-tuning leads to substantial improvement in accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B achieving up to 78%. At the same time, gains from model scaling diminish beyond a certain capacity, generalization across soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose substantial challenges. These findings suggest that PlantXpert can serve as a foundation for assessing evidence-grounded agronomic reasoning and for advancing multimodal model development in plant science.