AgriPestDatabase-v1.0: A Structured Insect Dataset for Training Agricultural Large Language Model

📅 2026-03-24
🤖 AI Summary
This study addresses the scarcity of high-quality annotated data and limited expert support in agricultural pest management in rural areas by constructing a structured textual dataset of pests and generating semantic-oriented question-answer pairs grounded in expert-validated knowledge. These resources are used to fine-tune lightweight large language models via efficient domain adaptation using LoRA, targeting deployment on edge devices. Experimental results demonstrate that the fine-tuned Mistral-7B model achieves an 88.9% pass rate and a semantic similarity score of 0.865 on domain-specific QA tasks, significantly outperforming Qwen2.5-7B and LLaMA3.1-8B. To the best of our knowledge, this work presents the first edge-deployable agricultural domain language model tailored for in-field applications.
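As a rough illustration of why LoRA suits lightweight domain adaptation, the sketch below shows the standard low-rank update, y = Wx + (alpha / r) · B(Ax), in plain Python. It is not the authors' training code: the helper names (`matvec`, `lora_forward`, `lora_param_counts`), the toy matrices, and the 4096 × 4096 layer size with rank 8 are all assumptions chosen for illustration.

```python
# Hypothetical illustration of a LoRA-adapted linear layer (not the paper's code).

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen pretrained weight; only A (r x d_in) and
    B (d_out x r) would be trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_counts(d_out, d_in, r):
    """Return (parameters in the full matrix, parameters in the adapter)."""
    return d_out * d_in, r * (d_in + d_out)

# Toy example: 3x3 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, -0.5, 1.0]]       # r x d_in
B = [[0.0], [0.0], [0.0]]    # d_out x r, zero-initialized as in standard LoRA
x = [1.0, 2.0, 3.0]

y = lora_forward(W, A, B, x, alpha=2, r=1)

# Assumed layer size for illustration only.
full, lora = lora_param_counts(4096, 4096, 8)
print(y, full, lora)
```

Because B starts at zero, the adapted layer initially reproduces the frozen model's output, and the adapter for the assumed layer holds 65,536 trainable parameters against 16,777,216 in the full matrix.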

📝 Abstract
Agricultural pest management increasingly relies on timely and accurate access to expert knowledge, yet high-quality labeled data and continuous expert support remain limited, particularly for farmers in rural regions with unstable or no internet connectivity. At the same time, the rapid growth of AI and LLMs has created new opportunities to deliver practical decision support tools directly to end users in agriculture through compact, deployable systems. This work addresses (i) generating a structured insect information dataset, and (ii) adapting a lightweight LLM ($\leq$ 7B) by fine-tuning it for edge-device use in agricultural pest management. The textual data were collected by reviewing available pest databases and published manuscripts on nine selected pest species and compiling the information into structured reports. These structured reports were then reviewed and validated by a domain expert. From these reports, we constructed Q/A pairs to support model training and evaluation. A LoRA-based fine-tuning approach was applied to multiple lightweight LLMs, and the resulting models were evaluated. Initial evaluation shows that Mistral 7B achieves an 88.9\% pass rate on the domain-specific Q/A task, substantially outperforming Qwen 2.5 7B (63.9\%) and LLaMA 3.1 8B (58.7\%). Notably, Mistral demonstrates higher semantic alignment (embedding similarity: 0.865) despite lower lexical overlap (BLEU: 0.097), indicating that semantic understanding and robust reasoning are more predictive of task success than surface-level conformity in specialized domains. By combining expert-organized data, well-structured Q/A pairs, semantic quality control, and efficient model adaptation, this work contributes to farmer-facing agricultural decision support tools and demonstrates the feasibility of deploying compact, high-performing language models for practical field-level pest management guidance.
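The contrast the abstract draws between semantic alignment and lexical overlap can be sketched with two toy metrics. The code below is only an illustration under stated assumptions: the embedding vectors are made up, `unigram_precision` is a simplified stand-in for lexical overlap rather than a full BLEU implementation, and the pass rule (similarity at or above a 0.8 threshold) is hypothetical, since the abstract does not state how a pass is decided.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def unigram_precision(candidate, reference):
    """Clipped unigram precision: a simplified lexical-overlap proxy, not BLEU."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(cand_tokens)

def pass_rate(similarities, threshold):
    """Fraction of answers whose similarity meets the (assumed) threshold."""
    if not similarities:
        return 0.0
    return sum(s >= threshold for s in similarities) / len(similarities)

# Two answers with the same meaning but no shared words.
reference = "apply treatment when larvae first appear"
candidate = "spray once young caterpillars emerge"
overlap = unigram_precision(candidate, reference)

# Made-up embeddings standing in for a sentence-embedding model's output.
similarity = cosine_similarity([0.9, 0.1, 0.4], [0.8, 0.2, 0.5])

# Hypothetical per-question similarity scores and threshold.
rate = pass_rate([0.91, 0.72, 0.88, 0.95], threshold=0.8)
print(overlap, round(similarity, 3), rate)
```

In this toy case the paraphrased answer shares no tokens with the reference yet its embedding stays close, which is the pattern the reported BLEU and embedding-similarity scores point to.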
Problem

Research questions and friction points this paper is trying to address.

agricultural pest management
labeled data scarcity
expert knowledge access
rural connectivity
decision support
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured pest dataset
lightweight LLM
LoRA fine-tuning
edge deployment
semantic alignment
Yagizhan Bilal Durak
Sam Houston State University
Ahsan Ul Islam
Sam Houston State University
Shahidul Islam
Kennesaw State University
Ashley Morgan-Olvera
Sam Houston State University
Iftekhar Ibne Basith
Sam Houston State University
Syed Hasib Akhter Faruqui
Assistant Professor at Sam Houston State University
Bayesian Network, Time Series Data Mining, Biomedical Image Processing