🤖 AI Summary
This work investigates the emergence of unfaithful reasoning in large language models—where generated chains of thought contain logical inconsistencies or fail to causally support the final answer—through synthetic experiments with small Transformers trained on noisy modular arithmetic tasks. The study systematically analyzes how reasoning fidelity arises during training and identifies a critical noise threshold: below this threshold, models learn concise, step-by-step reasoning that faithfully mirrors the ground-truth solution process; above it, a phase transition occurs and models adopt shortcut reasoning strategies. Notably, in the transitional regime near this threshold, an implicit self-verification mechanism spontaneously emerges within the model’s internal representations. These findings reveal the pivotal role of training-data noise in shaping reasoning behavior and highlight the emergent capacity of neural networks to encode uncertainty through internal mechanisms.
📝 Abstract
Chain-of-thought (CoT) reasoning generated by large language models (LLMs) is often unfaithful: intermediate steps can be logically inconsistent or fail to reflect the causal process that produces the final answer. Despite extensive empirical observations, a fundamental understanding of CoT faithfulness is lacking: what constitutes faithful CoT reasoning, and how does unfaithfulness emerge from autoregressive training? We study these questions using well-controlled synthetic experiments, training small transformers on noisy data to solve modular arithmetic expressions step by step, a task we term Arithmetic Expression Reasoning. We find that models can learn faithful reasoning that causally follows the underlying arithmetic rules, but only when the training noise is below a critical threshold, a phenomenon attributable to simplicity bias. At higher noise levels, training dynamics exhibit a transition from faithful stepwise reasoning to unfaithful skip-step reasoning via an intermediate mixed mode characterized by a transient increase in prediction entropy. Mechanistic analysis reveals that models learn to encode internal uncertainty by resolving inconsistent reasoning steps, which suggests the emergence of implicit self-verification from autoregressive training.
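To make the task concrete, here is a minimal sketch of the kind of noisy stepwise training trace the abstract describes. This is an illustration under our own assumptions, not the paper's exact setup: the modulus, the restriction to sums, the function names, and the noise model (corrupting an intermediate result with some probability) are all hypothetical.

```python
import random

MOD = 7  # illustrative modulus; the paper's actual value is not given here


def stepwise_solution(terms, noise=0.0, rng=random):
    """Reduce a sum of terms mod MOD one step at a time.

    Returns the chain of intermediate results (the CoT trace).
    With probability `noise`, an intermediate result is replaced by a
    random residue, mimicking noise injected into the training data.
    """
    steps = []
    acc = terms[0] % MOD
    for t in terms[1:]:
        acc = (acc + t) % MOD
        if rng.random() < noise:        # corrupt this intermediate step
            acc = rng.randrange(MOD)
        steps.append(acc)
    return steps


# Clean, faithful chain of thought: each entry is a running sum mod 7
print(stepwise_solution([3, 5, 6, 2]))  # → [1, 0, 2]
```

A faithful model would reproduce each intermediate residue in turn; a "skip-step" model would emit the final answer without the intermediate entries being causally consistent with the arithmetic rules.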