AI Summary
This work investigates distribution shift induced by synthetic data in recursive training of large language models (LLMs), focusing on how multidimensional characteristics of human-originated data (namely source, quality, lexical/semantic diversity, and political orientation) modulate shift dynamics. Through multi-round recursive fine-tuning experiments, cross-platform data comparisons (Twitter vs. Reddit), and quantitative analyses including distributional distance metrics, bias evolution tracking, and diversity quantification, we systematically uncover previously unreported mechanisms: lexical diversity exacerbates text degeneration, whereas semantic diversity substantially mitigates it; political orientation exhibits a strong correlation with the direction of empirical distribution shift, and its evolutionary patterns (attenuation, amplification, or reversal) are predictable via modeling. Results further demonstrate that data source and quality critically govern shift velocity. These findings provide both theoretical foundations and empirical evidence for controllable, stable recursive LLM training.
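The recursive training loop described above can be illustrated with a deliberately minimal toy model. This is not the paper's actual setup (which fine-tunes LLMs); it is a hypothetical sketch in which a "model" is just a unigram distribution refit on its own samples each generation, so that sampling noise compounds and the fitted distribution drifts away from the original human one:

```python
import random
from collections import Counter

def fit_unigram(tokens):
    """Fit a unigram distribution (token -> probability) to a corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def sample(dist, n, rng):
    """Draw n tokens from a unigram distribution."""
    toks = list(dist)
    weights = [dist[t] for t in toks]
    return rng.choices(toks, weights=weights, k=n)

def total_variation(p, q):
    """Total variation distance between two unigram distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in support)

rng = random.Random(0)
# A toy "human" distribution with a rare token "d" in its tail.
human_corpus = sample({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, 200, rng)
human_dist = fit_unigram(human_corpus)

corpus = human_corpus
shifts = []
for generation in range(5):
    model = fit_unigram(corpus)      # "train" this generation on current data
    corpus = sample(model, 50, rng)  # a small synthetic sample becomes the next training set
    shifts.append(total_variation(fit_unigram(corpus), human_dist))
```

Because each generation refits on a small synthetic sample of its predecessor's output, rare tokens tend to disappear from the support over generations, a crude analogue of models forgetting the tails of the true human distribution.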
Abstract
Large language models (LLMs) are increasingly contributing to the creation of content on the Internet. This creates a feedback loop, as subsequent generations of models will be trained on this generated, synthetic data. The phenomenon is receiving increasing interest, in particular because previous studies have shown that it may lead to distribution shift: models misrepresent and forget the true underlying distributions of the human data they are expected to approximate (e.g. resulting in a drastic loss of quality). In this study, we examine the impact of human data properties on distribution shift dynamics in iterated training loops. We first confirm that the distribution shift dynamics vary greatly depending on the human data by comparing four datasets (two based on Twitter and two on Reddit). We then test whether data quality may influence the rate of this shift, and find that it does on the Twitter datasets but not on the Reddit ones. We then focus on one Reddit dataset and conduct a more exhaustive evaluation of a large set of dataset properties. This experiment associated lexical diversity with larger detrimental shifts and semantic diversity with smaller ones, suggesting that incorporating text with high lexical (but limited semantic) diversity could exacerbate the degradation of generated text. We then focus on the evolution of political bias and find that the type of shift observed (bias reduction, amplification, or inversion) depends on the political lean of the human (true) distribution. Overall, our work extends the existing literature on the consequences of recursive fine-tuning by showing that this phenomenon is highly dependent on features of the human data on which training occurs. This suggests that different parts of the internet (e.g. GitHub, Reddit) may undergo different types of shift depending on their properties.