Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain

πŸ“… 2025-07-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Generic large language models (LLMs) exhibit weak regional adaptation, insufficient multilingual support, and poor domain knowledge generalization in agricultural question answering. To address these limitations, this work introduces a multilingual synthetic data generation method grounded in domain-specific agricultural documentation, producing a high-quality agricultural QA dataset covering English, Hindi, and Punjabi. Building upon this resource, we propose a language-specific fine-grained fine-tuning strategy to enhance the model’s capacity to capture localized farming practices, terminological consistency, and agricultural factual accuracy. Experimental results demonstrate that our approach significantly outperforms baseline models on a multilingual agricultural benchmark: factual accuracy improves by 18.7%, content relevance by 22.3%, and agricultural consensus by 15.9%. To our knowledge, this is the first method enabling native-level, high-precision, and strongly localized agricultural technical Q&A support in low-resource language settings.

Technology Category

Application Category

πŸ“ Abstract
Enabling farmers to access accurate agriculture-related information in their native languages in a timely manner is crucial for the success of the agriculture field. Although large language models (LLMs) can be used to implement Question Answering (QA) systems, simply using publicly available general-purpose LLMs in agriculture typically offer generic advisories, lacking precision in local and multilingual contexts due to insufficient domain-specific training and scarcity of high-quality, region-specific datasets. Our study addresses these limitations by generating multilingual synthetic agricultural datasets (English, Hindi, Punjabi) from agriculture-specific documents and fine-tuning language-specific LLMs. Our evaluation on curated multilingual datasets demonstrates significant improvements in factual accuracy, relevance, and agricultural consensus for the fine-tuned models compared to their baseline counterparts. These results highlight the efficacy of synthetic data-driven, language-specific fine-tuning as an effective strategy to improve the performance of LLMs in agriculture, especially in multilingual and low-resource settings. By enabling more accurate and localized agricultural advisory services, this study provides a meaningful step toward bridging the knowledge gap in AI-driven agricultural solutions for diverse linguistic communities.
Problem

Research questions and friction points this paper is trying to address.

Improving multilingual agricultural QA accuracy for farmers
Addressing lack of localized datasets for LLMs in agriculture
Enhancing domain-specific LLM performance with synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generating multilingual synthetic agricultural datasets
Fine-tuning language-specific LLMs for agriculture
Improving accuracy with synthetic data-driven fine-tuning
πŸ”Ž Similar Papers
No similar papers found.
Rishemjit Kaur
Rishemjit Kaur
CSIR-Central Scientific Instruments Organisation
Big Datamachine learningcomputational social science
A
Arshdeep Singh Bhankhar
CSIR-Central Scientific Instruments Organisation, India
Surangika Ranathunga
Surangika Ranathunga
Senior Lecturer, School of Mathematical and Computational Sciences, Massey University, New Zealand
Natural Language ProcessingMachine LearningLarge Language Models
J
Jashanpreet Singh Salh
CSIR-Central Scientific Instruments Organisation, India
S
Sudhir Rajput
CSIR-Central Scientific Instruments Organisation, India
V
Vidhi
CSIR-Central Scientific Instruments Organisation, India
K
Kashish Mahendra
CSIR-Central Scientific Instruments Organisation, India
B
Bhavika Berwal
CSIR-Central Scientific Instruments Organisation, India
R
Ritesh Kumar
CSIR-Central Scientific Instruments Organisation, India