Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain

📅 2025-07-22

📈 Citations: 0

✨ Influential: 0

career value

193K/year

🤖 AI Summary

Generic large language models (LLMs) exhibit weak regional adaptation, insufficient multilingual support, and poor domain knowledge generalization in agricultural question answering. To address these limitations, this work introduces a multilingual synthetic data generation method grounded in domain-specific agricultural documentation, producing a high-quality agricultural QA dataset covering English, Hindi, and Punjabi. Building upon this resource, we propose a language-specific fine-grained fine-tuning strategy to enhance the model’s capacity to capture localized farming practices, terminological consistency, and agricultural factual accuracy. Experimental results demonstrate that our approach significantly outperforms baseline models on a multilingual agricultural benchmark: factual accuracy improves by 18.7%, content relevance by 22.3%, and agricultural consensus by 15.9%. To our knowledge, this is the first method enabling native-level, high-precision, and strongly localized agricultural technical Q&A support in low-resource language settings.

Technology Category

Application Category

📝 Abstract

Enabling farmers to access accurate agriculture-related information in their native languages in a timely manner is crucial for the success of the agriculture field. Although large language models (LLMs) can be used to implement Question Answering (QA) systems, simply using publicly available general-purpose LLMs in agriculture typically offer generic advisories, lacking precision in local and multilingual contexts due to insufficient domain-specific training and scarcity of high-quality, region-specific datasets. Our study addresses these limitations by generating multilingual synthetic agricultural datasets (English, Hindi, Punjabi) from agriculture-specific documents and fine-tuning language-specific LLMs. Our evaluation on curated multilingual datasets demonstrates significant improvements in factual accuracy, relevance, and agricultural consensus for the fine-tuned models compared to their baseline counterparts. These results highlight the efficacy of synthetic data-driven, language-specific fine-tuning as an effective strategy to improve the performance of LLMs in agriculture, especially in multilingual and low-resource settings. By enabling more accurate and localized agricultural advisory services, this study provides a meaningful step toward bridging the knowledge gap in AI-driven agricultural solutions for diverse linguistic communities.

Problem

Research questions and friction points this paper is trying to address.

Improving multilingual agricultural QA accuracy for farmers

Addressing lack of localized datasets for LLMs in agriculture

Enhancing domain-specific LLM performance with synthetic data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generating multilingual synthetic agricultural datasets

Fine-tuning language-specific LLMs for agriculture

Improving accuracy with synthetic data-driven fine-tuning

🔎 Similar Papers

Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models