AI Summary
Agricultural large language models suffer from insufficient domain knowledge and a scarcity of high-quality image-text pairs, which severely limits their conversational capabilities. To address this, we propose AgroGPT, the first efficient vision-language dialogue model tailored for agriculture. Our method introduces a multimodal alignment paradigm that relies solely on vision-only agricultural images (no paired text): large language models (LLMs) auto-generate expert-level instructions for these images, yielding 70K high-fidelity AgroInstruct samples. The pipeline combines multi-source agricultural image curation, LLM-driven domain knowledge injection, lightweight vision-language model (VLM) fine-tuning, and a dedicated agricultural evaluation benchmark, AgroEvals. Experiments show that AgroGPT outperforms both mainstream open-source and proprietary foundation models on fine-grained crop condition recognition and cross-modal agricultural question answering, reaching practical, expert-level performance in agricultural dialogue.
Abstract
Significant progress has been made in advancing large multimodal conversational models (LMMs), capitalizing on vast repositories of image-text data available online. Despite this progress, these models often encounter substantial domain gaps, hindering their ability to engage in complex conversations across new domains. Recent efforts have aimed to mitigate this issue, but they rely on domain-specific image-text data to curate instruction-tuning data. However, many domains, such as agriculture, lack such vision-language data. In this work, we propose an approach to construct instruction-tuning data that harnesses vision-only data for the agriculture domain. We utilize diverse agricultural datasets spanning multiple domains, curate class-specific information, and employ large language models (LLMs) to construct an expert-tuning set, resulting in a 70k expert-tuning dataset called AgroInstruct. We then expert-tune on this dataset to create AgroGPT, an efficient LMM that can hold complex agriculture-related conversations and provide useful insights. We also develop AgroEvals for evaluation and compare AgroGPT's performance with large open and closed-source models. AgroGPT excels at identifying fine-grained agricultural concepts, can act as an agriculture expert, and provides helpful information for multimodal agriculture questions. The code, datasets, and models are available at https://github.com/awaisrauf/agroGPT.
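The data-construction idea in the abstract (vision-only images, curated class-specific information, and an LLM that writes the conversations) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ImageRecord`, `build_llm_prompt`, and `to_instruction_sample`, the prompt wording, the output record layout, and the example class description are all assumptions made for illustration, and the LLM call itself is left out.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    """One vision-only sample: an image path plus its dataset class label (hypothetical type)."""
    image_path: str
    class_label: str


def build_llm_prompt(record: ImageRecord, class_info: dict[str, str]) -> str:
    """Compose a text-only prompt asking an LLM to write an expert dialogue.

    The LLM never sees the image here; it only receives the class label and
    the curated class-specific facts. The prompt wording is an assumption.
    """
    facts = class_info.get(record.class_label, "No curated information available.")
    return (
        "You are an agriculture expert looking at an image.\n"
        f"The image shows: {record.class_label}.\n"
        f"Background knowledge: {facts}\n"
        "Write a multi-turn conversation between a user asking about the image "
        "and an expert assistant answering."
    )


def to_instruction_sample(record: ImageRecord, conversation: list[dict[str, str]]) -> dict:
    """Pair the image with a generated conversation as one tuning sample (assumed layout)."""
    return {"image": record.image_path, "conversations": conversation}


# Illustrative usage with made-up paths and a hand-written class description.
record = ImageRecord("images/leaf_001.jpg", "tomato early blight")
class_info = {"tomato early blight": "Fungal disease causing concentric brown leaf spots."}
prompt = build_llm_prompt(record, class_info)
```

In a sketch like this, the generated conversation returned by the LLM would be passed to `to_instruction_sample`, so that each image ends up paired with text even though the source datasets contain no image-text pairs.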