A Law of Next-Token Prediction in Large Language Models

📅 2024-08-24
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
The black-box nature of large language models (LLMs) limits their interpretability. Method: the authors conduct a systematic empirical and formal analysis of per-layer contributions to next-token prediction across dominant architectures (Transformer, RWKV, and Mamba), quantifying hidden-state evolution and cross-architectural consistency. Contribution/Results: they identify and verify a universal linear scaling law: the predictive capacity of contextualized token embeddings grows approximately linearly with layer depth, implying a near-uniform per-layer contribution to next-token prediction. The pattern holds across models, training scales, and tasks, revealing a quantifiable structure in the internal information flow of LLMs and providing foundations for principled model scaling, improved pretraining objectives, and controllable information-flow engineering.

📝 Abstract
Large language models (LLMs) have been widely employed across various application domains, yet their black-box nature poses significant challenges to understanding how these models process input data internally to make predictions. In this paper, we introduce a precise and quantitative law that governs the learning of contextualized token embeddings through intermediate layers in pre-trained LLMs for next-token prediction. Our findings reveal that each layer contributes equally to enhancing prediction accuracy, from the lowest to the highest layer -- a universal phenomenon observed across a diverse array of open-source LLMs, built on architectures such as Transformer, RWKV, and Mamba. We demonstrate that this law offers new perspectives and insights to inform and guide practices in LLM development and applications, including model scaling, pre-training tasks, and information flow. Overall, our law enables more fine-grained approaches to the design, training, and interpretation of LLMs through scrutinizing their internal data processing mechanisms.
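The law described in the abstract is typically assessed by fitting a linear probe to each layer's contextualized embeddings and tracking how predictive accuracy changes with depth. The sketch below illustrates that measurement procedure on synthetic data only: the "hidden states" and the linearly growing signal strength are stand-ins I have constructed, not the paper's models, metrics, or experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: synthetic per-layer "hidden states" whose predictive
# signal grows with depth, standing in for real LLM activations.
n_layers, n_tokens, dim = 12, 2000, 64
targets = rng.integers(0, 2, n_tokens)       # toy binary next-token labels
signal = targets * 2.0 - 1.0                 # map labels to +/-1

def probe_accuracy(H, y):
    """Fit a least-squares linear probe on hidden states H and
    return its accuracy on labels y."""
    X = np.hstack([H, np.ones((len(H), 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(X, y * 2.0 - 1.0, rcond=None)
    return float(np.mean((X @ w > 0) == y))

accs = []
for layer in range(1, n_layers + 1):
    strength = layer / n_layers               # signal grows with depth by construction
    H = rng.normal(size=(n_tokens, dim))
    H[:, 0] += strength * signal              # embed the label along one direction
    accs.append(probe_accuracy(H, targets))

# Under the paper's law, real LLMs would show probe accuracy improving
# by roughly equal amounts from one layer to the next.
print([round(a, 2) for a in accs])
```

On real models one would replace the synthetic `H` with the actual hidden states extracted at each layer for the same token positions; the probing step itself is unchanged.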
Problem

Research questions and friction points this paper is trying to address.

Understanding how token embeddings are learned through the intermediate layers of LLMs
Quantifying each layer's contribution to next-token prediction accuracy
Establishing a universal law that holds across diverse model architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

A precise, quantitative law governing token embedding learning across layers
Evidence that each layer contributes roughly equally to prediction accuracy
A universal phenomenon observed across Transformer, RWKV, and Mamba architectures