🤖 AI Summary
This study investigates how the diversity of synthetic data generated by large language models (LLMs) affects downstream model fine-tuning performance under data-scarce and high-labeling-cost regimes. Method: We conduct controlled experiments systematically varying both the diversity level of LLM-generated data and the mixing ratio of real versus synthetic samples, while holding other factors constant. Contribution/Results: We find a nonlinear relationship between synthetic-data diversity and downstream performance: moderate diversity significantly improves few-shot model accuracy, whereas excessive diversity degrades performance due to increased noise and distributional shift. Based on this, we propose a new “quality-over-quantity” paradigm for synthetic data curation, emphasizing that diversity must be aligned with the target task’s data distribution. Empirical results demonstrate that, when distributional shift is limited, optimizing synthetic-data diversity serves as an efficient and effective alternative to manual annotation—achieving competitive performance with substantially reduced labeling effort.
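The core experimental knob described above, the mixing ratio of real versus synthetic samples at a fixed training-set size, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline; the function name `mix_datasets` and its signature are assumptions for the example.

```python
import random

def mix_datasets(real, synthetic, synthetic_ratio, size, seed=0):
    """Build a fixed-size training set with a given fraction of
    synthetic (LLM-generated) examples; the remainder is real data.
    Holding `size` constant isolates the effect of the mixing ratio.
    Illustrative helper, not from the paper."""
    rng = random.Random(seed)
    n_syn = round(size * synthetic_ratio)
    sample = rng.sample(synthetic, n_syn) + rng.sample(real, size - n_syn)
    rng.shuffle(sample)  # avoid ordering effects during fine-tuning
    return sample

# Toy corpora standing in for labeled real data and LLM generations.
real = [f"real_{i}" for i in range(100)]
synthetic = [f"syn_{i}" for i in range(100)]

for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    train = mix_datasets(real, synthetic, ratio, size=40)
    n_syn = sum(x.startswith("syn_") for x in train)
    print(f"ratio={ratio}: {n_syn}/{len(train)} synthetic")
```

Each resulting training set would then be used to fine-tune an identical model, so that any performance difference is attributable to the ratio (and, in the study's other axis, to the diversity level of the synthetic portion).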
📝 Abstract
With the remarkable generative capabilities of large language models (LLMs), using LLM-generated data to train downstream models has emerged as a promising approach to mitigating data scarcity in specific domains and reducing time-consuming annotation. However, recent studies have highlighted a critical issue: iterative training on self-generated data leads to model collapse, where model performance degrades over time. Despite extensive research on the implications of LLM-generated data, these works often neglect data diversity, a key factor in data quality. In this work, we aim to understand how the diversity of LLM-generated data affects downstream model performance. Specifically, we explore how varying levels of diversity in LLM-generated data, which we refer to as synthetic data, affect downstream model performance. Additionally, we investigate the performance of models trained on mixtures of real and synthetic data in different proportions. Our experimental results show that, with minimal distribution shift, moderately diverse LLM-generated data can enhance model performance in scenarios with insufficient labeled data, whereas highly diverse generated data has a negative impact. We hope our empirical findings will offer valuable guidance for future studies on LLMs as data generators.
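The abstract's notion of "varying levels of diversity" presupposes a way to score the diversity of a generated corpus. One common lexical proxy is the distinct-n ratio (the fraction of n-grams that are unique); the sketch below assumes this metric purely for illustration, and the paper may measure diversity differently.

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams across a corpus of texts.
    Values near 1.0 indicate high lexical diversity; values near
    0.0 indicate heavy repetition. Illustrative metric only."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# A repetitive corpus scores low; a varied one scores high.
print(distinct_n(["the cat sat", "the cat sat", "the cat sat"]))
print(distinct_n(["the cat sat", "a dog ran", "birds fly south"]))
```

Under such a metric, "moderate" versus "high" diversity corresponds to intermediate versus extreme scores, which can then be related to downstream accuracy as in the study's findings.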