A Taxonomy of Transcendence

2025-08-25
AI Summary
This paper investigates why language models can surpass the capabilities of their human-expert training data sources, focusing on how training data characteristics drive such “superhuman” performance. We identify and formalize three distinct modes of capability transcendence: skill denoising, skill selection, and skill generalization. To rigorously test these mechanisms, we construct the first knowledge-graph-based controllable simulation environment that models heterogeneous expert behaviors, enabling generation of training datasets with tunable diversity. Through controlled ablation experiments, we empirically establish data diversity as the critical catalyst for model transcendence. Our contributions include: (i) the first formal definition and empirical validation of the three transcendence mechanisms; (ii) a reproducible, scalable experimental framework for studying model transcendence; and (iii) a data-driven perspective and actionable experimental paradigm for understanding emergent abilities in large language models.

Abstract
Although language models are trained to mimic humans, the resulting systems display capabilities beyond the scope of any one person. To understand this phenomenon, we use a controlled setting to identify properties of the training data that lead a model to transcend the performance of its data sources. We build on previous work to outline three modes of transcendence, which we call skill denoising, skill selection, and skill generalization. We then introduce a knowledge graph-based setting in which simulated experts generate data based on their individual expertise. We highlight several aspects of data diversity that help to enable the model's transcendent capabilities. Additionally, our data generation setting offers a controlled testbed that we hope is valuable for future research in the area.
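
The setting described above can be illustrated with a toy sketch. This is not the authors' environment: the four-fact knowledge graph, the function names (`make_expert`, `generate_dataset`, `majority_model`, `run_demo`), the noise level, and the use of a per-query majority vote as a stand-in for a trained language model are all assumptions made for illustration. Each simulated expert answers reliably only on the part of the graph it knows and guesses elsewhere; the pooled data from experts with different expertise is then summarized by the stand-in model.

```python
import random
from collections import Counter

# Toy knowledge graph: (subject, relation) -> object. All facts are illustrative.
KG = {
    ("paris", "capital_of"): "france",
    ("rome", "capital_of"): "italy",
    ("cairo", "capital_of"): "egypt",
    ("lima", "capital_of"): "peru",
}
OBJECTS = sorted(set(KG.values()))


def make_expert(known, noise, rng):
    """Hypothetical expert: usually right on known facts, guesses otherwise."""
    def answer(query):
        if query in known and rng.random() >= noise:
            return KG[query]
        return rng.choice(OBJECTS)  # uniform guess (may be right by chance)
    return answer


def generate_dataset(experts, n_samples, rng):
    """Pool (query, answer) pairs from randomly chosen experts and queries."""
    queries = list(KG)
    data = []
    for _ in range(n_samples):
        expert = rng.choice(experts)
        query = rng.choice(queries)
        data.append((query, expert(query)))
    return data


def majority_model(data):
    """Stand-in for a trained model: most frequent answer seen per query."""
    counts = {}
    for query, ans in data:
        counts.setdefault(query, Counter())[ans] += 1
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}


def expert_accuracy(expert, trials):
    """Estimate one expert's accuracy over the whole graph by sampling."""
    hits = sum(expert(q) == a for _ in range(trials) for q, a in KG.items())
    return hits / (trials * len(KG))


def run_demo(seed=0, noise=0.2, n_samples=4000):
    rng = random.Random(seed)
    queries = list(KG)
    # Two experts with disjoint expertise: each knows half of the graph.
    experts = [
        make_expert(set(queries[:2]), noise, rng),
        make_expert(set(queries[2:]), noise, rng),
    ]
    model = majority_model(generate_dataset(experts, n_samples, rng))
    expert_accs = [expert_accuracy(e, 500) for e in experts]
    model_acc = sum(model.get(q) == a for q, a in KG.items()) / len(KG)
    return expert_accs, model_acc


expert_accs, model_acc = run_demo()
print("expert accuracies:", expert_accs)
print("pooled-data model accuracy:", model_acc)
```

In this toy, each expert alone is accurate on only about half of the graph, while the majority answer over the pooled data tends to recover every fact, because correct answers agree with each other and wrong guesses are spread across alternatives. Giving both experts the same `known` set removes that effect for the uncovered facts, which is one simple way to see why diversity of expertise matters in a setup like this.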
Problem

Research questions and friction points this paper is trying to address.

Identify training data properties enabling model transcendence
Outline three modes of transcendence in language models
Develop knowledge graph testbed for controlled transcendence research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge graph-based simulated expert data generation
Three transcendence modes: denoising, selection, generalization
Controlled testbed for studying data diversity effects
Authors

Natalie Abreu
Harvard University
Edwin Zhang
Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
Eran Malach
Apple
Machine Learning
Naomi Saphra
Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University