A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Systematic evaluation of LLM-based synthetic data generation strategies for low-resource languages has been lacking. Method: This study conducts the first comprehensive, cross-lingual assessment of multi-prompt strategies—including few-shot demonstration, label summarization, and self-correction—across 11 typologically diverse, low-resource languages and three core NLP tasks, using four open-source LLMs (including Llama, Phi, and Qwen). Contribution/Results: The combination of target-language demonstrations with LLM self-correction significantly outperforms individual prompting strategies. Crucially, lightweight prompting substantially reduces dependence on model scale. Empirical results show that fine-tuning small models on data generated by this optimal strategy achieves performance within 5% of that attained with real human-annotated data. Moreover, small models augmented with intelligent prompting attain generation quality comparable to large models, yielding substantial reductions in computational cost and deployment overhead.

📝 Abstract
Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed, such as demonstrations, label-based summaries, and self-revision, their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods, particularly target-language demonstrations with LLM-based revisions, yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient generation strategies for synthetic data generation in low-resource scenarios with smaller models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM data generation strategies for low-resource languages
Comparing prompting methods' effectiveness in synthetic data creation
Assessing performance gaps between generated and real training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically evaluates generation strategies for low-resource languages
Combines target-language demonstrations with LLM-based revisions
Uses smart prompting to reduce reliance on larger LLMs
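The winning combination reported above—target-language few-shot demonstrations followed by an LLM self-revision pass—can be sketched as a simple two-step prompt pipeline. This is an illustrative reconstruction, not the paper's actual prompts or code: `call_llm` is a placeholder for any chat-completion backend (e.g. a locally served Llama, Phi, or Qwen model), and all prompt wording, function names, and the Swahili demonstration examples are assumptions for the sketch.

```python
def call_llm(prompt: str) -> str:
    """Stub backend; replace with a real model call (local or API)."""
    return f"[model output for: {prompt[:40]}...]"


def build_generation_prompt(label: str, demos: list[tuple[str, str]]) -> str:
    """Few-shot generation prompt whose demonstrations are written
    in the target language, steering the model toward that language."""
    lines = [
        "Generate one new example for the given label.",
        "Write in the same language as the demonstrations.",
        "",
    ]
    for text, demo_label in demos:
        lines.append(f"Text: {text}\nLabel: {demo_label}\n")
    lines.append(f"Label: {label}\nText:")
    return "\n".join(lines)


def generate_with_self_correction(label: str,
                                  demos: list[tuple[str, str]]) -> str:
    """Step 1: draft an example from target-language demonstrations.
    Step 2: ask the LLM to revise (self-correct) its own draft."""
    draft = call_llm(build_generation_prompt(label, demos))
    revision_prompt = (
        "Revise the text below so it is fluent, natural, and clearly "
        f"matches the label '{label}'. Return only the revised text.\n\n"
        f"{draft}"
    )
    return call_llm(revision_prompt)


# Hypothetical sentiment demonstrations in a low-resource target language.
demos = [
    ("Malipo yalichelewa tena mwezi huu.", "negative"),
    ("Huduma ilikuwa ya haraka sana!", "positive"),
]
synthetic_example = generate_with_self_correction("negative", demos)
```

In practice one would loop this over labels to build a synthetic training set, then fine-tune a small task model on it—the setup under which the paper reports the gap to gold-standard data narrowing to about 5%.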
Tatiana Ankinina
German Research Institute for Artificial Intelligence (DFKI), Saarbrücken, Germany
Jan Cegin
Faculty of Information Technology, Brno University of Technology, Brno, Czechia; Kempelen Institute of Intelligent Technologies, Bratislava, Slovakia
Jakub Simko
Expert researcher, Kempelen Institute of Intelligent Technologies
User modelling, data analysis, machine learning, crowdsourcing, eye-tracking
Simon Ostermann
German Research Institute for Artificial Intelligence (DFKI), Saarbrücken, Germany