AI Summary
This paper addresses a core challenge in post-training large language models (LLMs): synthetic data is empirically effective yet theoretically opaque. Methodologically, it models the prevalent synthetic data generation process, analyzes it from a novel *reverse-bottleneck* perspective, defines *Generalization Gain via Mutual Information* (GGMI), and establishes a quantitative link between the information gain derived from the generative model and the generalization capability of the post-trained model. Leveraging information-theoretic modeling, mutual information analysis, and generalization error theory, it demonstrates that the effectiveness of synthetic data is fundamentally governed by the *information gain* obtained during generation. Contributions include: (i) a principled theoretical foundation for synthetic data design in LLM post-training; (ii) the identification of information gain as the key lever governing generalization gain; and (iii) open-sourced code supporting the analysis and informing the design of synthetic data generation techniques and the optimization of post-training strategies.
Abstract
Synthetic data has become a pivotal resource in post-training tasks for large language models (LLMs) due to the scarcity of high-quality, domain-specific data. While various methods have been developed to generate synthetic data, there remains a discernible gap between the practical effects of synthetic data and our theoretical comprehension. To address this challenge, we commence by presenting a detailed modeling of the prevalent synthetic data generation process. Building upon this modeling, we demonstrate that the generalization capability of the post-trained model is critically determined by the information gain derived from the generative model, as analyzed from a novel reverse-bottleneck perspective. Moreover, we introduce the concept of Generalization Gain via Mutual Information (GGMI) and elucidate the relationship between generalization gain and information gain. This analysis serves as a theoretical foundation for synthetic data generation and further highlights its connection with the generalization capability of post-trained models, offering an understanding of the design of synthetic data generation techniques and the optimization of the post-training process. We open-source our code at https://github.com/ZyGan1999/Towards-a-Theoretical-Understanding-of-Synthetic-Data-in-LLM-Post-Training.
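For readers unfamiliar with the quantities named above, the following LaTeX sketch records the textbook information-theoretic identity that a mutual-information analysis of this kind builds on, together with a purely schematic reading of "information gain". The symbols $S$ (synthetic data), $D$ (anchor/source data), and $M$ (generative model) are illustrative assumptions for this sketch, not necessarily the paper's notation, and the final relation is a schematic reading of the abstract's claim, not a reproduction of the paper's theorem.

```latex
% Standard definition: mutual information as entropy reduction
% (a textbook identity, stated here in illustrative notation).
I(S; D) = H(S) - H(S \mid D)

% Schematic reading of "information gain": the additional information
% the generative model M contributes to the synthetic data S beyond
% what the anchor data D already carries (illustrative, not the
% paper's formal definition):
\Delta I \;=\; I(S; M) - I(S; D)

% Schematic reading of the GGMI claim: the generalization gain of the
% post-trained model grows with this information gain.
\text{generalization gain} \;\nearrow\; \text{as} \;\; \Delta I \;\nearrow
```

The sketch is only meant to make the abstract's causal chain concrete: generation injects information, mutual information quantifies it, and GGMI ties that quantity to generalization.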