🤖 AI Summary
This study investigates whether large language models (LLMs) spontaneously develop self-reflection and error-correction capabilities during pretraining. To this end, we propose a controllable error-injection method grounded in chain-of-thought (CoT) reasoning and construct the first pretraining-stage benchmark for evaluating introspective abilities, spanning six task categories including logical reasoning, mathematical problem solving, and symbolic manipulation. We track the evolution of these capabilities across the 4-trillion-token pretraining trajectory of OLMo2-7B. Results demonstrate that self-reflective capacity emerges early in pretraining without reinforcement learning fine-tuning: models consistently detect and correct injected reasoning errors, and self-correction accuracy improves steadily across all six tasks. This work provides the first empirical validation of intrinsic introspective capability in pretraining, offering both a novel perspective on LLM cognitive development and a reproducible, task-diverse evaluation framework for studying emergent self-monitoring behaviors.
📝 Abstract
A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier, during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens already displays self-correction behavior on all six of our self-reflection tasks.