Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether a phase transition emerges in knowledge acquisition when large language models (LLMs) are trained on hybrid data comprising web-scraped corpora and high-quality domain-specific knowledge. Method: Leveraging a synthetically constructed biographical dataset, controlled mixed-data training, information-theoretic modeling, and scaling-law analysis, the authors systematically vary model size and the proportion of domain-knowledge data. Contribution/Results: They empirically show that knowledge retention can exhibit an abrupt, non-smooth phase transition, rather than gradual scaling, with respect to both model scale and the knowledge-data mixing ratio. They propose an information-theoretic, knapsack-inspired capacity-allocation framework that models knowledge-encoding constraints and yields theoretical predictions of the phase-transition point. Notably, the critical mixing ratio follows a power-law relationship with model size, and empirical validation confirms that the transition threshold is reproducible and predictable along both dimensions. Moreover, optimal data-mixing strategies diverge significantly between small and large models, implying distinct data curation principles at different scales.
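The power-law claim can be made concrete with a small curve-fitting sketch. The snippet below is not code from the paper: the model sizes, critical mixing ratios, and the exponent it recovers are invented for illustration, and only show how one might fit r_crit(N) ~ c * N^(-alpha) to measured transition points and extrapolate the threshold to a larger model.

```python
# Hedged sketch (not the paper's code): fit a power law
# r_crit(N) ~ c * N**(-alpha) to hypothetical measurements of the
# critical mixing ratio observed at several model sizes.
import numpy as np

# Hypothetical (model size in parameters, critical mixing ratio) pairs;
# the values below are invented purely for illustration.
model_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
critical_ratios = np.array([0.30, 0.14, 0.06, 0.028, 0.012])

# Least-squares fit in log-log space: log r = log c - alpha * log N.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(critical_ratios), deg=1)
alpha, c = -slope, np.exp(intercept)

print(f"estimated exponent alpha ~ {alpha:.2f}, prefactor c ~ {c:.3g}")
# Extrapolate the predicted transition threshold to a larger model.
print(f"predicted critical ratio at N = 3e9: {c * (3e9) ** (-alpha):.4f}")
```

Fitting in log-log space turns the power law into a straight line, so ordinary least squares recovers the exponent directly.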

📝 Abstract
Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.
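The knapsack analogy in the abstract can be illustrated with a toy capacity-allocation problem. The sketch below is not the paper's information-theoretic framework: the loss curves, the fixed overhead, the per-biography cost, the MIX_RATIO constant, and the helper best_bio_allocation are all invented. Under these assumptions, a bounded capacity split between web data (smooth, diminishing returns) and a biography set (no benefit until a fixed overhead is paid, then whole-biography increments) yields an optimal split that stays at zero and then jumps once total capacity crosses a critical value.

```python
# Toy illustration (not the paper's framework): a bounded capacity must
# be divided between web data and a knowledge-dense biography set so as
# to minimize a mixture-weighted test loss. With an up-front overhead on
# the biography side, the loss-minimizing split changes discontinuously
# as the total capacity grows.
import numpy as np

OVERHEAD = 500       # assumed fixed cost before any biography is usable
BITS_PER_BIO = 10    # assumed cost to store one biography
N_BIOS = 200         # assumed number of biographies
MIX_RATIO = 0.1      # assumed weight of the biography set in the test loss

def web_loss(w):
    # Smooth, diminishing returns from capacity spent on web data.
    return 1.0 / (1.0 + 0.01 * w)

def bio_loss(b):
    # No benefit until the overhead is paid; then one biography per chunk.
    if b < OVERHEAD:
        return 1.0
    stored = min(N_BIOS, int((b - OVERHEAD) // BITS_PER_BIO))
    return 1.0 - stored / N_BIOS

def best_bio_allocation(total_capacity, step=10):
    # Brute-force the split that minimizes the mixture-weighted test loss.
    splits = np.arange(0, total_capacity + step, step)
    losses = [(1 - MIX_RATIO) * web_loss(total_capacity - b)
              + MIX_RATIO * bio_loss(b) for b in splits]
    return splits[int(np.argmin(losses))]

# With these illustrative numbers, the capacity devoted to biographies
# stays at zero for small budgets and then jumps to a large share.
for capacity in (1500, 2000, 2500, 3000, 4000):
    b = best_bio_allocation(capacity)
    stored = N_BIOS * (1 - bio_loss(b))
    print(f"capacity {capacity}: spend {b:.0f} on biographies "
          f"(stores {stored:.0f}/{N_BIOS})")
```

Brute-force search over the split is used rather than a closed form so the step structure of bio_loss is handled exactly; an analogous jump appears if MIX_RATIO is swept at fixed capacity, the other axis along which the abstract reports a phase transition.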
Problem

Research questions and friction points this paper is trying to address.

Phase transitions in LLM knowledge acquisition from mixed data
Critical model size and mixing ratio affect memorization
Capacity allocation causes discontinuous knowledge scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data mixing induces phase transitions in learning
Model size and mixing ratio trigger critical thresholds
Capacity allocation explains discontinuous knowledge acquisition
🔎 Similar Papers
No similar papers found.
Xinran Gu
Tsinghua University
Distributed Optimization, Deep Learning Theory
Kaifeng Lyu
Tsinghua University
Jiazheng Li
Beijing Institute of Technology
Jingzhao Zhang
Institute for Interdisciplinary Information Sciences, Tsinghua University; Shanghai Qizhi Institute; Shanghai AI Laboratory