Escaping Collapse: The Strength of Weak Data for Large Language Model Training

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Synthetic data in large language model (LLM) training often leads to performance stagnation or collapse. To address this, we propose a lightweight, boosting-inspired framework for data filtering and dynamic focus training, requiring only minimal high-quality annotations to sustainably improve model performance. Theoretically, we establish the first analytical framework characterizing LLMs’ high tolerance to weak synthetic data, providing convergence guarantees under near-minimal data selection criteria and unifying the efficacy of diverse synthetic-data training approaches. Methodologically, our framework integrates iterative weighted training, dynamic hard-sample focusing, and joint optimization over synthetic and real data. Empirical results demonstrate that our approach significantly accelerates convergence, improves final task performance, and robustly prevents training collapse across multiple benchmarks.

📝 Abstract
Synthetically generated data plays an increasingly large role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper curation it can cause LLM performance to plateau, or even "collapse", after many training iterations. In this paper, we formalize this question and develop a theoretical framework to investigate how much curation is needed in order to ensure that LLM performance continually improves. We find that the requirements are nearly minimal. We describe a training procedure that converges to an optimal LLM even if almost all of the non-synthetic training data is of poor quality. Our analysis is inspired by boosting, a classic machine learning technique that leverages a very weak learning algorithm to produce an arbitrarily good classifier. Our training procedure subsumes many recently proposed methods for training LLMs on synthetic data, and thus our analysis sheds light on why they are successful, and also suggests opportunities for future improvement. We present experiments that validate our theory, and show that dynamically focusing labeling resources on the most challenging examples -- in much the same way that boosting focuses the efforts of the weak learner -- leads to improved performance.
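The boosting-style dynamic focusing described above can be illustrated with a minimal sketch: examples that incur high loss are exponentially upweighted each round, so labeling or training effort concentrates on the hardest cases. The function names (`reweight`, `hardest`), the step size `eta`, and the toy losses are all hypothetical illustrations, not the paper's actual algorithm or API.

```python
import math

def reweight(weights, losses, eta=0.5):
    """Boosting-style update: multiplicatively upweight high-loss
    (hard) examples, then renormalize to a distribution."""
    new = [w * math.exp(eta * l) for w, l in zip(weights, losses)]
    z = sum(new)
    return [w / z for w in new]

def hardest(weights, k):
    """Indices of the k highest-weight examples -- the candidates
    on which to focus labeling or training resources."""
    return sorted(range(len(weights)), key=lambda i: -weights[i])[:k]

# Toy run: example 2 consistently incurs the highest loss, so its
# weight grows fastest across rounds.
losses = [0.1, 0.2, 0.9, 0.05]
w = [0.25] * 4
for _ in range(3):
    w = reweight(w, losses)
print(hardest(w, 2))  # → [2, 1]
```

In actual boosting (e.g. AdaBoost), the per-example losses would change each round as the model improves on the upweighted examples; here they are held fixed purely to keep the sketch self-contained.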
Problem

Research questions and friction points this paper is trying to address.

Synthetic data's role in LLM training
Minimal curation for performance improvement
Dynamic resource focus for challenging examples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Minimal data curation required
Boosting-inspired training procedure
Dynamic focus on challenging examples