AI Summary
This paper investigates why language models can surpass the capabilities of their human-expert training data sources, focusing on how training data characteristics drive such "superhuman" performance. We identify and formalize three distinct modes of capability transcendence: skill denoising, skill selection, and skill generalization. To rigorously test these mechanisms, we construct the first knowledge-graph-based controllable simulation environment that models heterogeneous expert behaviors, enabling generation of training datasets with tunable diversity. Through controlled ablation experiments, we empirically establish data diversity as the critical catalyst for model transcendence. Our contributions include: (i) the first formal definition and empirical validation of the three transcendence mechanisms; (ii) a reproducible, scalable experimental framework for studying model transcendence; and (iii) a data-driven perspective and actionable experimental paradigm for understanding emergent abilities in large language models.
Abstract
Although language models are trained to mimic humans, the resulting systems display capabilities beyond the scope of any one person. To understand this phenomenon, we use a controlled setting to identify properties of the training data that lead a model to transcend the performance of its data sources. We build on previous work to outline three modes of transcendence, which we call skill denoising, skill selection, and skill generalization. We then introduce a knowledge graph-based setting in which simulated experts generate data based on their individual expertise. We highlight several aspects of data diversity that help to enable the model's transcendent capabilities. Additionally, our data generation setting offers a controlled testbed that we hope is valuable for future research in the area.
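As a rough intuition for the first mode, skill denoising, the toy sketch below pools the answers of several noisy simulated experts over a small set of facts. It is not the paper's environment. The reading of skill denoising as "independent expert errors cancel when pooled" is an assumption, as are the function name `simulate_skill_denoising`, all parameter values, and the use of a plurality vote as a stand-in for a trained model.

```python
import random
from collections import Counter


def simulate_skill_denoising(n_experts=15, n_facts=1000, n_entities=5,
                             p_correct=0.6, seed=0):
    """Toy illustration (hypothetical setup, not the paper's method).

    Each fact has one true tail entity out of `n_entities` candidates.
    Every simulated expert answers each fact correctly with probability
    `p_correct` and otherwise picks a wrong entity uniformly at random.
    Returns (best single-expert accuracy, pooled plurality-vote accuracy).
    """
    rng = random.Random(seed)

    # Ground-truth tail entity for each fact.
    truth = [rng.randrange(n_entities) for _ in range(n_facts)]

    def expert_answer(true_tail):
        if rng.random() < p_correct:
            return true_tail
        wrong = [e for e in range(n_entities) if e != true_tail]
        return rng.choice(wrong)

    # answers[k][i] is expert k's answer to fact i.
    answers = [[expert_answer(t) for t in truth] for _ in range(n_experts)]

    # Accuracy of each expert taken alone.
    expert_acc = [
        sum(a == t for a, t in zip(row, truth)) / n_facts for row in answers
    ]

    # Plurality vote across experts, standing in for a model that has
    # absorbed every expert's data.
    pooled_correct = 0
    for i, t in enumerate(truth):
        votes = Counter(row[i] for row in answers)
        if votes.most_common(1)[0][0] == t:
            pooled_correct += 1

    return max(expert_acc), pooled_correct / n_facts


if __name__ == "__main__":
    best, pooled = simulate_skill_denoising()
    print(f"best single expert: {best:.3f}, pooled vote: {pooled:.3f}")
```

Under these assumptions the pooled vote should score well above the best individual expert, because wrong answers scatter across several entities while correct answers concentrate on one. The sketch says nothing about skill selection or skill generalization, which depend on how expertise is distributed across the knowledge graph.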