🤖 AI Summary
This work addresses the inconsistent efficacy of synthetic data in large language model training, a phenomenon that has lacked theoretical grounding. It presents the first information-theoretic analysis of information flow within the synthetic data generation and training loop, introducing an “information-open loop” mechanism that incorporates external supervisory signals (such as verifiers or scoring rules) to break information closure. The study establishes a principle of convergence toward information efficiency: learning preferentially converges to the most information-efficient signal component available. Through theoretical analysis grounded in the data processing inequality and a model of meta-level supervision, it argues that even coarse-grained supervision (e.g., binary correctness labels) suffices to enhance cross-task generalization. In contrast, under closed-loop training task-relevant information cannot increase, so performance degradation is the predicted outcome. The work thus provides a theoretical framework and design principles for efficient synthetic data utilization in language model training.
📝 Abstract
Synthetic data has become crucial for large language model training, but its effectiveness is highly inconsistent. We provide an information-theoretic account of this inconsistency: synthetic data improves a model only when the generation-training loop is information-open, i.e., shaped by external signals (verifiers, environments, or rubrics) that inject task-relevant information beyond the model's current distribution. When the loop is information-closed (relying on the model's own outputs without such signals), the data processing inequality implies that task-relevant information cannot increase and in practice can only decrease, making collapse a predicted outcome. Among information-open pipelines, both efficiency and generalization hinge on the meta-level of supervision: a coarser signal such as binary correctness treats all acceptable outputs as equivalent, so the behavior it teaches is not tied to any particular domain or surface form and generalizes naturally across tasks and domains. These observations lead to a guiding thesis: learning preferentially converges to the most information-efficient signal component available, which accelerates learning when that component is the intended one but causes reward hacking when a spurious pattern happens to be simpler.
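
The contrast between information-closed and information-open loops can be illustrated with a toy numerical sketch. This is not the paper's model; every quantity is an assumption chosen for illustration: a uniform binary task variable `T`, a model whose output `Y` is `T` passed through a binary symmetric channel with a hypothetical flip rate of 0.2, a retraining step modeled as a second noisy channel with flip rate 0.1, and a verifier that keeps only samples where `Y == T` (binary correctness).

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X;Y) in bits for a joint probability table P(X, Y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, m)
    mask = joint > 0                        # skip zero cells (0 * log 0 = 0)
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

def bsc(flip):
    """Binary symmetric channel P(out | in) with the given flip probability."""
    return np.array([[1 - flip, flip], [flip, 1 - flip]])

# Assumed toy setup: uniform task bit T, model output Y = noisy copy of T.
p_t = np.array([0.5, 0.5])
joint_model = p_t[:, None] * bsc(0.2)        # P(T, Y)

# Assumed retraining step: the next model is a noisy copy of its training data.
retrain = bsc(0.1)

# Information-closed loop: retrain directly on the model's own outputs.
# T -> Y -> Y2 is a Markov chain, so the data processing inequality applies.
joint_closed = joint_model @ retrain         # P(T, Y2)

# Information-open loop: an external verifier keeps only samples with Y == T.
accepted = joint_model * np.eye(2)           # zero out incorrect samples
joint_filtered = accepted / accepted.sum()   # renormalize over accepted samples
joint_open = joint_filtered @ retrain        # retrain on verified data

i_model = mutual_information(joint_model)
i_closed = mutual_information(joint_closed)
i_filtered = mutual_information(joint_filtered)
i_open = mutual_information(joint_open)

print(f"I(T; model output)         = {i_model:.3f} bits")
print(f"I(T; closed-loop retrain)  = {i_closed:.3f} bits")
print(f"I(T; verified data)        = {i_filtered:.3f} bits")
print(f"I(T; open-loop retrain)    = {i_open:.3f} bits")
```

In this sketch the closed-loop value never exceeds the original model's, as the data processing inequality requires. The verifier's accept/reject decision depends on `T` itself, so the filtered data is no longer a function of the model's outputs alone, and its information about `T` can exceed what the model started with.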