🤖 AI Summary
Current understanding of large language models’ reasoning mechanisms remains limited.
Method: We propose the "reasoning graph" framework, which models reasoning paths by clustering hidden states at each reasoning step and, for the first time, systematically quantifies the relationship between reasoning structure and capability using graph-theoretic topological metrics: cycle count, diameter, and small-world index.
Contribution/Results: Evaluated on GSM8K, MATH500, and AIME 2024, distilled models exhibit stronger cyclicality (mean cycle count ≈ 5 per sample), larger graph diameters (peaking at the 32B scale), and pronounced small-world properties (≈ 6× higher index). Crucially, expansion of the graph diameter correlates strongly with accuracy improvement (Pearson *r* > 0.9). This work establishes a novel, interpretable, and quantitative graph-theoretic paradigm for probing reasoning mechanisms, guiding data curation, and informing model optimization.
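The pipeline described above (cluster per-step hidden states, treat clusters as graph nodes, connect clusters visited at consecutive steps, then measure the resulting topology) could be sketched roughly as follows. Everything here is illustrative: the cluster labels would come from some clusterer (e.g. k-means over hidden states), the helper names are invented, and the revisit-based cycle proxy is an assumption, not the paper's actual implementation.

```python
from collections import deque

def build_reasoning_graph(labels):
    """Directed adjacency over cluster IDs visited at consecutive reasoning steps.

    `labels` is a hypothetical per-step cluster-label sequence (one ID per
    reasoning step), e.g. produced by k-means over hidden states.
    """
    adj = {int(l): set() for l in labels}
    for a, b in zip(labels[:-1], labels[1:]):
        if a != b:
            adj[int(a)].add(int(b))
    return adj

def revisit_count(labels):
    """Rough proxy for cyclicality: steps that return to a cluster
    already visited earlier (after having left it)."""
    seen, prev, count = set(), None, 0
    for l in labels:
        l = int(l)
        if l != prev and l in seen:
            count += 1
        seen.add(l)
        prev = l
    return count

def diameter(adj):
    """Longest shortest path in the undirected version of the graph,
    computed by BFS from every node (assumes a connected graph)."""
    und = {u: set() for u in adj}
    for u, vs in adj.items():
        for v in vs:
            und[u].add(v)
            und[v].add(u)
    best = 0
    for s in und:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in und[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

# Toy trajectory that revisits clusters 1 and 0.
trajectory = [0, 1, 2, 1, 3, 0, 4]
adj = build_reasoning_graph(trajectory)
print(revisit_count(trajectory), diameter(adj))  # -> 2 3
```

A distilled model's trajectory would, per the summary, show more such revisits and a wider spread of clusters (larger diameter) than its base counterpart.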
📝 Abstract
Recent large-scale reasoning models have achieved state-of-the-art performance on challenging mathematical benchmarks, yet the internal mechanisms underlying their success remain poorly understood. In this work, we introduce the notion of a reasoning graph, extracted by clustering hidden-state representations at each reasoning step, and systematically analyze three key graph-theoretic properties: cyclicity, diameter, and small-world index, across multiple tasks (GSM8K, MATH500, AIME 2024). Our findings reveal that distilled reasoning models (e.g., DeepSeek-R1-Distill-Qwen-32B) exhibit significantly more recurrent cycles (about 5 per sample), substantially larger graph diameters, and pronounced small-world characteristics (about a 6× higher index) compared to their base counterparts. Notably, these structural advantages grow with task difficulty and model capacity, with cycle detection peaking at the 14B scale and exploration diameter maximized in the 32B variant, correlating positively with accuracy. Furthermore, we show that supervised fine-tuning on an improved dataset systematically expands reasoning graph diameters in tandem with performance gains, offering concrete guidelines for dataset design aimed at boosting reasoning capabilities. By bridging theoretical insights into reasoning graph structures with practical recommendations for data construction, our work advances both the interpretability and the efficacy of large reasoning models.
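The abstract does not define its "small-world index"; one plausible reading is the standard small-world coefficient

$$\sigma \;=\; \frac{C / C_{\mathrm{rand}}}{L / L_{\mathrm{rand}}},$$

where $C$ and $L$ are the clustering coefficient and characteristic (average shortest) path length of the reasoning graph, and $C_{\mathrm{rand}}$, $L_{\mathrm{rand}}$ are the same quantities for a degree-matched random graph. Under this definition, $\sigma \gg 1$ indicates small-world structure, so the reported ~6× increase would mean distilled models' reasoning graphs are far more locally clustered relative to their path lengths than random graphs of the same size.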