A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Systematic evaluation of LLM-based synthetic data generation strategies for low-resource languages has been lacking. Method: This study conducts the first comprehensive, cross-lingual assessment of multi-prompt strategies—including few-shot demonstration, label summarization, and self-correction—across 11 typologically diverse, low-resource languages and three core NLP tasks, using four open-source LLMs (including Llama, Phi, and Qwen). Contribution/Results: The combination of target-language demonstrations with LLM self-correction significantly outperforms individual prompting strategies. Crucially, lightweight prompting substantially reduces dependence on model scale. Empirical results show that fine-tuning small models on data generated by this optimal strategy achieves performance within 5% of that attained with real human-annotated data. Moreover, small models augmented with intelligent prompting attain generation quality comparable to large models, yielding substantial reductions in computational cost and deployment overhead.

📝 Abstract
Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed, such as demonstrations, label-based summaries, and self-revision, their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods, particularly target-language demonstrations with LLM-based revisions, yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient generation strategies for synthetic data generation in low-resource scenarios with smaller models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM data generation strategies for low-resource languages
Comparing prompting methods' effectiveness in synthetic data creation
Assessing performance gaps between generated and real training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically evaluates generation strategies for low-resource languages
Combines target-language demonstrations with LLM-based revisions
Uses smart prompting to reduce reliance on larger LLMs
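The winning combination reported above—target-language few-shot demonstrations followed by an LLM self-revision pass—can be sketched as a simple two-step prompt pipeline. This is an illustrative reconstruction, not the paper's actual prompts or code: `call_llm` is a placeholder for any chat-completion backend (e.g. a locally served Llama, Phi, or Qwen model), and all prompt wording, function names, and the Swahili demonstration examples are assumptions for the sketch.

```python
def call_llm(prompt: str) -> str:
    """Stub backend; replace with a real model call (local or API)."""
    return f"[model output for: {prompt[:40]}...]"


def build_generation_prompt(label: str, demos: list[tuple[str, str]]) -> str:
    """Few-shot generation prompt whose demonstrations are written
    in the target language, steering the model toward that language."""
    lines = [
        "Generate one new example for the given label.",
        "Write in the same language as the demonstrations.",
        "",
    ]
    for text, demo_label in demos:
        lines.append(f"Text: {text}\nLabel: {demo_label}\n")
    lines.append(f"Label: {label}\nText:")
    return "\n".join(lines)


def generate_with_self_correction(label: str,
                                  demos: list[tuple[str, str]]) -> str:
    """Step 1: draft an example from target-language demonstrations.
    Step 2: ask the LLM to revise (self-correct) its own draft."""
    draft = call_llm(build_generation_prompt(label, demos))
    revision_prompt = (
        "Revise the text below so it is fluent, natural, and clearly "
        f"matches the label '{label}'. Return only the revised text.\n\n"
        f"{draft}"
    )
    return call_llm(revision_prompt)


# Hypothetical sentiment demonstrations in a low-resource target language.
demos = [
    ("Malipo yalichelewa tena mwezi huu.", "negative"),
    ("Huduma ilikuwa ya haraka sana!", "positive"),
]
synthetic_example = generate_with_self_correction("negative", demos)
```

In practice one would loop this over labels to build a synthetic training set, then fine-tune a small task model on it—the setup under which the paper reports the gap to gold-standard data narrowing to about 5%.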
Tatiana Ankinina
German Research Institute for Artificial Intelligence (DFKI), Saarbrücken, Germany
Jan Cegin
Faculty of Information Technology, Brno University of Technology, Brno, Czechia; Kempelen Institute of Intelligent Technologies, Bratislava, Slovakia
Jakub Simko
Expert researcher, Kempelen Institute of Intelligent Technologies
User modelling, data analysis, machine learning, crowdsourcing, eye-tracking
Simon Ostermann
German Research Institute for Artificial Intelligence (DFKI), Saarbrücken, Germany