🤖 AI Summary
To address the dual challenges of scarce annotated data and insufficient domain knowledge in low-resource languages (e.g., Hindi) for travel question answering, this paper proposes a synthetic-data-driven, multi-stage fine-tuning framework. First, high-quality synthetic Hindi travel QA data is generated using large language models (LLaMA-70B and Phi-14B). Then, a lightweight language model undergoes progressive domain adaptation via fine-tuning, jointly leveraging the synthetic data and a small set of real human-annotated examples. Experimental results demonstrate substantial improvements in the small model's accuracy and generalization on Hindi travel QA tasks. The findings validate the efficacy of the "large-model generation + small-model refinement" paradigm for low-resource, domain-specific applications. Moreover, the framework establishes a reproducible, scalable methodology for developing domain-specialized models in resource-constrained linguistic settings.
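The summary does not specify how the stages are scheduled. A minimal sketch of one plausible staging scheme is shown below; the stage names, their ordering (synthetic-only adaptation, then a mixed stage, then refinement on the scarce real annotations), and all function names are assumptions for illustration, not the paper's actual recipe:

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One fine-tuning stage: a name and its training pool of (question, answer) pairs."""
    name: str
    data: list


def build_stages(synthetic, real):
    """Hypothetical multi-stage schedule: broad adaptation on abundant
    synthetic QA pairs first, then a mixed stage, and finally refinement
    on the small set of real human-annotated examples."""
    return [
        Stage("synthetic_adaptation", list(synthetic)),
        Stage("mixed", list(synthetic) + list(real)),
        Stage("real_refinement", list(real)),
    ]


# Toy example: many synthetic pairs, few real ones (the low-resource setting).
synthetic = [("Q_syn_1", "A_syn_1"), ("Q_syn_2", "A_syn_2")]
real = [("Q_real_1", "A_real_1")]
stages = build_stages(synthetic, real)
```

Each stage would then drive one fine-tuning pass over the lightweight model, with the real data reserved for the final pass so it is not drowned out by the synthetic pool.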
📝 Abstract
Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage fine-tuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large language models (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.