🤖 AI Summary
To address the dual challenges of scarce annotated data and insufficient domain knowledge in low-resource languages (e.g., Hindi) for travel question answering, this paper proposes a synthetic-data-driven, multi-stage fine-tuning framework. First, high-quality synthetic Hindi travel QA data is generated using large language models (LLaMA-70B and Phi-14B). Then, a lightweight language model undergoes progressive domain adaptation via fine-tuning, jointly leveraging the synthetic data and a small set of real human-annotated examples. Experimental results demonstrate substantial improvements in the small model's accuracy and generalization on Hindi travel QA tasks. The findings validate the efficacy of the "large-model generation + small-model refinement" paradigm for low-resource, domain-specific applications. Moreover, the framework establishes a reproducible, scalable methodology for developing domain-specialized models in resource-constrained linguistic settings.
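The summary does not specify how the stages are scheduled. A minimal sketch of one plausible staging scheme is shown below; the stage names, their ordering (synthetic-only adaptation, then a mixed stage, then refinement on the scarce real annotations), and all function names are assumptions for illustration, not the paper's actual recipe:

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One fine-tuning stage: a name and its training pool of (question, answer) pairs."""
    name: str
    data: list


def build_stages(synthetic, real):
    """Hypothetical multi-stage schedule: broad adaptation on abundant
    synthetic QA pairs first, then a mixed stage, and finally refinement
    on the small set of real human-annotated examples."""
    return [
        Stage("synthetic_adaptation", list(synthetic)),
        Stage("mixed", list(synthetic) + list(real)),
        Stage("real_refinement", list(real)),
    ]


# Toy example: many synthetic pairs, few real ones (the low-resource setting).
synthetic = [("Q_syn_1", "A_syn_1"), ("Q_syn_2", "A_syn_2")]
real = [("Q_real_1", "A_real_1")]
stages = build_stages(synthetic, real)
```

Each stage would then drive one fine-tuning pass over the lightweight model, with the real data reserved for the final pass so it is not drowned out by the synthetic pool.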
📝 Abstract
Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage fine-tuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large language models (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.