WisWheat: A Three-Tiered Vision-Language Dataset for Wheat Management

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Wheat management heavily relies on manual labor, and general-purpose vision-language models (VLMs) lack domain-specific quantitative reasoning capabilities for agricultural applications. Method: We construct the first three-tier wheat-management-oriented vision-language dataset, covering morphological pretraining, quantitative trait measurement, pest/disease/stress diagnosis, and growth-stage-specific decision-making—introducing a novel hierarchical VLM data architecture tailored to wheat. Leveraging open-source multimodal LMs (e.g., Qwen2.5-VL-7B), we apply staged pretraining, visual question answering (VQA) fine-tuning, and instruction tuning. Contribution/Results: The fine-tuned model achieves 79.2% accuracy on wheat stress identification and 84.6% on growth-stage dialogue tasks—significantly outperforming generalist models like GPT-4o. This work establishes the first fine-grained agricultural vision-language benchmark, bridging a critical gap in domain-specific evaluation and enhancing large models’ precision in crop understanding and executable agronomic decision-making.

📝 Abstract
Wheat management strategies play a critical role in determining yield. Traditional management decisions often rely on labour-intensive expert inspections, which are expensive, subjective and difficult to scale. Recently, Vision-Language Models (VLMs) have emerged as a promising solution to enable scalable, data-driven management support. However, due to a lack of domain-specific knowledge, directly applying VLMs to wheat management tasks results in poor quantification and reasoning capabilities, ultimately producing vague or even misleading management recommendations. In response, we propose WisWheat, a wheat-specific dataset with a three-layered design to enhance VLM performance on wheat management tasks: (1) a foundational pretraining dataset of 47,871 image-caption pairs for coarsely adapting VLMs to wheat morphology; (2) a quantitative dataset comprising 7,263 VQA-style image-question-answer triplets for quantitative trait measurement tasks; and (3) an instruction fine-tuning dataset with 4,888 samples targeting biotic and abiotic stress diagnosis and management planning for different phenological stages. Extensive experimental results demonstrate that fine-tuning open-source VLMs (e.g., Qwen2.5-VL 7B) on our dataset leads to significant performance improvements. Specifically, the Qwen2.5-VL 7B fine-tuned on our wheat instruction dataset achieves accuracy scores of 79.2% and 84.6% on wheat stress and growth stage conversation tasks respectively, surpassing even general-purpose commercial models such as GPT-4o by margins of 11.9% and 34.6%.
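The three tiers described above can be pictured as three record types. The following is a minimal illustrative sketch only; the field names, file names, and example values are assumptions for exposition, not the authors' actual dataset schema.

```python
# Hypothetical layout of the three WisWheat tiers described in the abstract.
# All keys and values below are illustrative assumptions.

pretraining_pair = {          # Tier 1: 47,871 image-caption pairs
    "image": "wheat_field_0001.jpg",
    "caption": "Wheat plants at the tillering stage with dense green foliage.",
}

quantitative_triplet = {      # Tier 2: 7,263 VQA-style image-question-answer triplets
    "image": "wheat_plot_0042.jpg",
    "question": "How many wheat heads are visible in this image?",
    "answer": "37",
}

instruction_sample = {        # Tier 3: 4,888 instruction fine-tuning samples
    "image": "wheat_leaf_0100.jpg",
    "instruction": "Diagnose the stress affecting this plant and suggest a management plan.",
    "response": "Symptoms are consistent with a foliar disease; a stage-appropriate "
                "treatment plan would follow here.",
}

# Each tier adds structure: caption -> question/answer -> instruction/response.
for tier, example in [("pretraining", pretraining_pair),
                      ("quantitative", quantitative_triplet),
                      ("instruction", instruction_sample)]:
    print(tier, sorted(example.keys()))
```

Staged training would then consume the tiers in order: captions for morphology adaptation, triplets for quantitative VQA fine-tuning, and instruction samples for dialogue-style decision support.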
Problem

Research questions and friction points this paper is trying to address.

Enhancing Vision-Language Models for wheat management tasks
Addressing poor quantification in wheat-specific VLM applications
Improving accuracy in stress diagnosis and growth stage analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-tiered dataset for wheat-specific VLM adaptation
Quantitative VQA dataset for trait measurement
Instruction fine-tuning for stress diagnosis and management
Bowen Yuan
The University of Queensland, Brisbane, Queensland, Australia
Selena Song
The University of Queensland, Brisbane, Queensland, Australia
Javier Fernandez
The University of Queensland, Brisbane, Queensland, Australia
Yadan Luo
ARC DECRA Fellow and Senior Lecturer, University of Queensland
Generalization · 3D Vision · Autonomous Driving
Mahsa Baktashmotlagh
University of Queensland
Machine Learning · Computer Vision
Zijian Wang
The University of Queensland, Brisbane, Queensland, Australia