🤖 AI Summary
Current large reasoning models (LRMs) lack a cognitively grounded, fine-grained characterization of their atomic reasoning steps, hindering interpretability and evaluation of human-like reasoning behavior.
Method: We propose a cognition-informed, fine-grained taxonomy of atomic reasoning steps, comprising five groups and seventeen categories derived from human mental processes. To scale annotation, we introduce CAPO, an automatic annotation framework that uses large language models (LLMs) to generate taxonomy-based annotations. Applying the taxonomy to current LRMs yields a labeled dataset of 277,534 atomic reasoning steps.
Contribution/Results: CAPO achieves higher consistency with human expert annotations than baseline methods. Empirical analysis shows that LRMs’ post-answer self-verification is largely superficial and rarely yields substantive revisions, which suggests that incentivizing comprehensive multi-step reflection may be more effective than simple self-monitoring. Together, the taxonomy, CAPO, and the dataset provide a scalable foundation for interpretable analysis and evaluation of LRM reasoning.
📝 Abstract
Large reasoning models (LRMs) have garnered significant attention from researchers owing to their exceptional capability in addressing complex tasks. Motivated by the observed human-like behaviors in their reasoning processes, this paper introduces a comprehensive taxonomy to characterize atomic reasoning steps and probe the "psyche" of LRM intelligence. Specifically, it comprises five groups and seventeen categories derived from human mental processes, thereby grounding the understanding of LRMs in an interdisciplinary perspective. The taxonomy is then applied for an in-depth understanding of current LRMs, resulting in a distinct labeled dataset that comprises 277,534 atomic reasoning steps. Using this resource, we analyze contemporary LRMs and distill several actionable takeaways for improving training and post-training of reasoning models. Notably, our analysis reveals that prevailing post-answer "double-checks" (self-monitoring evaluations) are largely superficial and rarely yield substantive revisions. Thus, incentivizing comprehensive multi-step reflection, rather than simple self-monitoring, may offer a more effective path forward. To complement the taxonomy, an automatic annotation framework, named CAPO, is proposed to leverage large language models (LLMs) for generating the taxonomy-based annotations. Experimental results demonstrate that CAPO achieves higher consistency with human experts compared to baselines, facilitating a scalable and comprehensive analysis of LRMs from a human cognitive perspective. Together, the taxonomy, CAPO, and the derived insights provide a principled, scalable path toward understanding and advancing LRM reasoning.
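The abstract reports that CAPO is more consistent with human experts than the baselines, but it does not name the metric used. As an illustration only, the sketch below shows one common way such consistency could be measured: Cohen's kappa between an expert's labels and an automatic annotator's labels for the same reasoning steps. Both the choice of kappa and the label names (`plan`, `compute`, `verify`) are assumptions made for this example; they are not taken from the paper's taxonomy or its evaluation protocol.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences over the same items.

    Assumption: this is an illustrative agreement measure, not necessarily
    the consistency metric used to evaluate CAPO.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four atomic reasoning steps (names are made up).
expert_labels = ["plan", "verify", "compute", "verify"]
auto_labels = ["plan", "verify", "compute", "compute"]

kappa = cohen_kappa(expert_labels, auto_labels)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for chance, which matters when a few step types dominate the data. With seventeen categories, a per-category breakdown would likely be more informative than a single score.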