AI Summary
This study investigates how large language models, trained solely via next-token prediction, acquire the ability to understand prompt semantics, perform in-context learning without parameter updates, and execute chain-of-thought reasoning. By integrating autoregressive modeling, Bayesian posterior concentration analysis, and task decomposition theory, the authors propose a "task transfer probability inference" mechanism that, for the first time, provides a unified theoretical explanation for these three phenomena. The analysis reveals that in-context learning enhances performance by reducing prompt ambiguity, while chain-of-thought reasoning leverages pre-trained subtask capabilities to enable complex inference. This work establishes a theoretical foundation for advanced prompting strategies and offers statistical performance guarantees.
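The posterior-concentration view of in-context learning described above can be illustrated with a toy Bayesian update. This is a sketch under simplifying assumptions, not the paper's actual construction: each candidate "task" is modeled as a fixed next-token distribution, the in-context examples are observed tokens, and Bayes' rule yields a posterior over which task the prompt intends. The task names and probabilities below are invented for illustration.

```python
import math

# Toy model (illustrative assumption): three candidate tasks, each a
# distribution over the next token. ICL examples are observations that
# update a Bayesian posterior over which task the prompt intends.
TASKS = {
    "copy":    {"a": 0.8, "b": 0.1, "c": 0.1},
    "shift":   {"a": 0.1, "b": 0.8, "c": 0.1},
    "reverse": {"a": 0.1, "b": 0.1, "c": 0.8},
}

def posterior(examples, prior=None):
    """Posterior over tasks after observing the in-context example tokens."""
    prior = prior or {t: 1.0 / len(TASKS) for t in TASKS}
    # Accumulate log-likelihoods to avoid underflow on long prompts.
    logp = {t: math.log(prior[t]) for t in TASKS}
    for tok in examples:
        for t in TASKS:
            logp[t] += math.log(TASKS[t][tok])
    # Normalize with a max-shift for numerical stability.
    shift = max(logp.values())
    unnorm = {t: math.exp(lp - shift) for t, lp in logp.items()}
    total = sum(unnorm.values())
    return {t: p / total for t, p in unnorm.items()}

# More examples consistent with the "shift" task concentrate the
# posterior on it, i.e. additional demonstrations reduce prompt ambiguity.
for n in (1, 3, 8):
    print(n, round(posterior(["b"] * n)["shift"], 4))
```

The qualitative point is the one the summary makes: demonstrations do not update any parameters; they sharpen the model's inferred distribution over which task is being requested.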
Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency across diverse tasks, exhibiting emergent properties such as semantic prompt comprehension, In-Context Learning (ICL), and Chain-of-Thought (CoT) reasoning. Despite their empirical success, the theoretical mechanisms driving these phenomena remain poorly understood. This study examines the foundations of these phenomena by addressing three critical questions: (1) How do LLMs accurately decode prompt semantics despite being trained solely on a next-token prediction objective? (2) Through what mechanism does ICL facilitate performance gains without explicit parameter updates? and (3) Why do intermediate reasoning steps in CoT prompting effectively unlock capabilities for complex, multi-step problems? Our results demonstrate that, through the autoregressive process, LLMs are capable of exactly inferring the transition probabilities between tokens across distinct tasks using the provided prompts. We show that ICL enhances performance by reducing prompt ambiguity and facilitating posterior concentration on the intended task. Furthermore, we find that CoT prompting activates the model's capacity for task decomposition, breaking complex problems into a sequence of simpler sub-tasks that the model has mastered during the pretraining phase. By comparing their individual error bounds, we provide novel theoretical insights into the statistical superiority of advanced prompt engineering techniques.
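The error-bound comparison the abstract alludes to can be sketched with back-of-the-envelope arithmetic. The numbers below are assumptions chosen for illustration, not the paper's derived bounds: a model that rarely solves a K-step problem in one shot, but solves each pretrained sub-task reliably, succeeds much more often when the problem is decomposed CoT-style and the per-step success rates compound multiplicatively (assuming independent steps).

```python
# Illustrative assumption: per-step errors are independent, so the
# success probability of a chained K-step solution is p_sub ** K.
p_direct = 0.05   # assumed success rate answering the full problem in one shot
p_sub = 0.97      # assumed per-sub-task success rate acquired in pretraining
K = 6             # assumed number of reasoning steps after decomposition

p_cot = p_sub ** K
print(f"direct: {p_direct:.3f}  cot: {p_cot:.3f}")
# Chaining mastered sub-tasks far outperforms direct answering here,
# mirroring the abstract's claim about comparing individual error bounds.
```

The independence assumption is a simplification; the point is only the qualitative gap between the two strategies' success probabilities.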