🤖 AI Summary
This work addresses the limited understanding of internal representational dynamics in large language models (LLMs) by moving beyond the prevailing black-box view of reasoning. The authors model reasoning as a dynamic trajectory of internal states and systematically investigate how representational distributions evolve continuously during generation across post-training stages, using representational analysis, cross-stage comparisons, statistical correlation tests, and counterfactual interventions. Their findings reveal that improvements in reasoning capability stem primarily from training-induced shifts toward more favorable representational distributions, rather than from mere refinement of the initial representations. Moreover, the representational distribution immediately preceding final-token generation correlates strongly with output correctness, and the semantic content of the generated tokens is identified as the key driver of this representational migration.
📝 Abstract
Large Language Models have achieved remarkable performance on reasoning tasks, motivating research into how this ability evolves during training. Prior work has primarily analyzed this evolution via explicit generation outcomes, treating the reasoning process as a black box and obscuring internal changes. To address this opacity, we introduce a representational perspective to investigate the dynamics of the model's internal states. Through comprehensive experiments across models at various training stages, we find that post-training yields only limited improvement in the quality of static initial representations. Furthermore, we reveal that, unlike non-reasoning tasks, reasoning involves a significant, continuous distributional shift in representations during generation. Comparative analysis indicates that post-training empowers models to drive this transition toward a distribution better suited to task solving. To clarify the relationship between internal states and external outputs, we perform statistical analyses that confirm a high correlation between generation correctness and the final representations, while counterfactual experiments identify the semantics of the generated tokens, rather than additional inference-time computation or intrinsic parameter differences, as the dominant driver of the transition. Collectively, our findings offer a novel understanding of the reasoning process and of how training enhances reasoning, providing valuable insights for future model analysis and optimization.
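The "distributional shift in representations during generation" that the abstract describes can be made concrete with a toy metric. The sketch below is not the paper's actual method: it uses synthetic vectors as stand-ins for hidden states collected at early versus late generation steps, and scores the shift with a simplified Fréchet-style distance (squared mean difference plus Frobenius distance between covariances). The function name `representation_shift` and all data are illustrative assumptions.

```python
import numpy as np

def representation_shift(early: np.ndarray, late: np.ndarray) -> float:
    """Crude distributional distance between two sets of hidden states:
    squared distance between the means plus the Frobenius distance
    between the covariance matrices (a simplified Frechet-style score).
    This is an illustrative metric, not the paper's measurement."""
    mu_e, mu_l = early.mean(axis=0), late.mean(axis=0)
    cov_e = np.cov(early, rowvar=False)   # rows are samples, columns are dims
    cov_l = np.cov(late, rowvar=False)
    mean_term = float(np.sum((mu_e - mu_l) ** 2))
    cov_term = float(np.linalg.norm(cov_e - cov_l, ord="fro"))
    return mean_term + cov_term

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden-state dimension

# Synthetic stand-ins for hidden states sampled at early vs. late
# generation steps; the shifted batch mimics the migration toward a
# different representational distribution described in the abstract.
early = rng.normal(0.0, 1.0, size=(200, d))
late_same = rng.normal(0.0, 1.0, size=(200, d))     # same distribution
late_shifted = rng.normal(0.5, 1.0, size=(200, d))  # mean-shifted distribution

print("no shift:   ", representation_shift(early, late_same))
print("with shift: ", representation_shift(early, late_shifted))
```

In a real analysis the `early` and `late` arrays would instead hold hidden states extracted from the model at different generation steps (e.g., via a forward hook), and the score would be tracked across training stages to compare how strongly each checkpoint drives the transition.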