🤖 AI Summary
This study investigates how the diversity of synthetic data generated by large language models (LLMs) affects downstream model fine-tuning performance under data-scarce and high-labeling-cost regimes. Method: We conduct controlled experiments systematically varying both the diversity level of LLM-generated data and the mixing ratio of real versus synthetic samples, while holding other factors constant. Contribution/Results: We find a nonlinear relationship between synthetic-data diversity and downstream performance: moderate diversity significantly improves few-shot model accuracy, whereas excessive diversity degrades performance due to increased noise and distributional shift. Based on this, we propose a new “quality-over-quantity” paradigm for synthetic data curation, emphasizing that diversity must be aligned with the target task’s data distribution. Empirical results demonstrate that, when distributional shift is limited, optimizing synthetic-data diversity serves as an efficient and effective alternative to manual annotation—achieving competitive performance with substantially reduced labeling effort.
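The core experimental knob described above, the mixing ratio of real versus synthetic samples at a fixed training-set size, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline; the function name `mix_datasets` and its signature are assumptions for the example.

```python
import random

def mix_datasets(real, synthetic, synthetic_ratio, size, seed=0):
    """Build a fixed-size training set with a given fraction of
    synthetic (LLM-generated) examples; the remainder is real data.
    Holding `size` constant isolates the effect of the mixing ratio.
    Illustrative helper, not from the paper."""
    rng = random.Random(seed)
    n_syn = round(size * synthetic_ratio)
    sample = rng.sample(synthetic, n_syn) + rng.sample(real, size - n_syn)
    rng.shuffle(sample)  # avoid ordering effects during fine-tuning
    return sample

# Toy corpora standing in for labeled real data and LLM generations.
real = [f"real_{i}" for i in range(100)]
synthetic = [f"syn_{i}" for i in range(100)]

for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    train = mix_datasets(real, synthetic, ratio, size=40)
    n_syn = sum(x.startswith("syn_") for x in train)
    print(f"ratio={ratio}: {n_syn}/{len(train)} synthetic")
```

Each resulting training set would then be used to fine-tune an identical model, so that any performance difference is attributable to the ratio (and, in the study's other axis, to the diversity level of the synthetic portion).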
📝 Abstract
With the remarkable generative capabilities of large language models (LLMs), using LLM-generated data to train downstream models has emerged as a promising approach to mitigating data scarcity in specific domains and reducing time-consuming annotation. However, recent studies have highlighted a critical issue: iterative training on self-generated data leads to model collapse, where model performance degrades over time. Despite extensive research on the implications of LLM-generated data, these works often neglect data diversity, a key factor in data quality. In this work, we aim to understand how the diversity of LLM-generated data affects downstream model performance. Specifically, we explore how varying levels of diversity in LLM-generated data, which we refer to as synthetic data, affect downstream model performance. Additionally, we investigate the performance of models trained on mixtures of real and synthetic data in different proportions. Our experimental results show that, with minimal distribution shift, moderately diverse LLM-generated data can enhance model performance in scenarios with insufficient labeled data, whereas highly diverse generated data has a negative impact. We hope our empirical findings will offer valuable guidance for future studies on LLMs as data generators.
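The abstract's notion of "varying levels of diversity" presupposes a way to score the diversity of a generated corpus. One common lexical proxy is the distinct-n ratio (the fraction of n-grams that are unique); the sketch below assumes this metric purely for illustration, and the paper may measure diversity differently.

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams across a corpus of texts.
    Values near 1.0 indicate high lexical diversity; values near
    0.0 indicate heavy repetition. Illustrative metric only."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# A repetitive corpus scores low; a varied one scores high.
print(distinct_n(["the cat sat", "the cat sat", "the cat sat"]))
print(distinct_n(["the cat sat", "a dog ran", "birds fly south"]))
```

Under such a metric, "moderate" versus "high" diversity corresponds to intermediate versus extreme scores, which can then be related to downstream accuracy as in the study's findings.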