A Taxonomy of Transcendence

📅 2025-08-25

🤖 AI Summary

This paper investigates why language models can surpass the capabilities of the human experts whose data they are trained on, focusing on which properties of the training data drive such “superhuman” performance. Building on previous work, we outline three modes of capability transcendence: skill denoising, skill selection, and skill generalization. To test these mechanisms, we construct a knowledge-graph-based, controllable simulation environment in which simulated experts with heterogeneous expertise generate training data, so that dataset diversity can be varied. Our controlled experiments highlight several aspects of data diversity that help enable model transcendence. Our contributions include: (i) a taxonomy of three transcendence modes; (ii) a reproducible, controlled testbed for studying model transcendence; and (iii) a data-centric perspective on how language models come to exceed their data sources.

📝 Abstract

Although language models are trained to mimic humans, the resulting systems display capabilities beyond the scope of any one person. To understand this phenomenon, we use a controlled setting to identify properties of the training data that lead a model to transcend the performance of its data sources. We build on previous work to outline three modes of transcendence, which we call skill denoising, skill selection, and skill generalization. We then introduce a knowledge graph-based setting in which simulated experts generate data based on their individual expertise. We highlight several aspects of data diversity that help to enable the model's transcendent capabilities. Additionally, our data generation setting offers a controlled testbed that we hope is valuable for future research in the area.
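The setting described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' environment: the graph, the relation names, the accuracy parameters `p_in` and `p_out`, and the functions `build_toy_graph`, `expert_answer`, and `simulate` are all hypothetical, chosen only for illustration. It also replaces the trained language model with a simple plurality vote over the pooled expert answers, which is a stand-in for a model that predicts the most likely answer under its training distribution. Under these assumptions, the pooled predictor can be compared with each individual simulated expert, which is the kind of gap that skill denoising refers to.

```python
import random
from collections import Counter, defaultdict


def build_toy_graph(num_heads, relations, entities, rng):
    """Map each (head, relation) query to one ground-truth tail entity."""
    return {
        (f"h{i}", rel): rng.choice(entities)
        for i in range(num_heads)
        for rel in relations
    }


def expert_answer(query, truth, expertise, entities, rng, p_in=0.9, p_out=0.3):
    """Answer a query as a simulated expert (hypothetical noise model).

    The expert is right with probability p_in on its own relation and
    p_out elsewhere; a wrong answer is a uniformly random other entity.
    """
    p_correct = p_in if query[1] == expertise else p_out
    if rng.random() < p_correct:
        return truth
    return rng.choice([e for e in entities if e != truth])


def simulate(seed=0, samples_per_query=200):
    """Return (per-expert accuracy, accuracy of the pooled plurality vote)."""
    rng = random.Random(seed)
    relations = ["born_in", "works_at", "studied", "lives_in"]
    entities = [f"e{i}" for i in range(10)]
    graph = build_toy_graph(5, relations, entities, rng)
    experts = relations  # one hypothetical expert per relation

    votes = defaultdict(Counter)  # query -> counts of answers across experts
    correct, total = Counter(), Counter()
    for query, truth in graph.items():
        for _ in range(samples_per_query):
            expert = rng.choice(experts)
            answer = expert_answer(query, truth, expert, entities, rng)
            votes[query][answer] += 1
            total[expert] += 1
            correct[expert] += int(answer == truth)

    expert_acc = {e: correct[e] / total[e] for e in experts}
    # Plurality vote: a stand-in for a model predicting the modal answer.
    hits = sum(votes[q].most_common(1)[0][0] == t for q, t in graph.items())
    return expert_acc, hits / len(graph)


if __name__ == "__main__":
    expert_acc, pooled_acc = simulate()
    print("best single expert:", max(expert_acc.values()))
    print("pooled plurality vote:", pooled_acc)
```

In this sketch every expert is wrong often, but wrong answers are scattered across many entities while correct answers concentrate on one, so the pooled vote tends to recover the truth. Changing the number of experts or how their expertise is distributed is one hypothetical way to vary the diversity of the generated data.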

Problem

Research questions and friction points this paper is trying to address.

Identify training data properties enabling model transcendence

Outline three modes of transcendence in language models

Develop knowledge graph testbed for controlled transcendence research

Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge graph-based simulated expert data generation

Three transcendence modes: denoising, selection, generalization

Controlled testbed for studying data diversity effects
