🤖 AI Summary
This work addresses the pervasive issue of over-reasoning in large reasoning models (LRMs): redundant computation and cyclic self-verification during chain-of-thought (CoT) generation that existing evaluation methods struggle to automatically separate from necessary reasoning. To this end, the authors propose CoTJudger, a graph-driven framework that transforms free-form CoT into a directed dependency graph and applies graph-theoretic principles to extract the Shortest Effective Path (SEP) leading to the correct answer. This yields, for the first time, an interpretable efficiency signal that enables automatic, cross-model, and cross-task assessment of CoT necessity and redundancy. Experiments across 21 LRMs demonstrate the method's effectiveness in identifying inefficient reasoning patterns, such as verification obsession and compensatory redundancy, thereby providing a quantitative foundation for model diagnosis and optimization.
📝 Abstract
Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal -- how much of a CoT is necessary versus structurally redundant -- that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.
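To make the core idea concrete, here is a minimal sketch (not the authors' implementation) of SEP extraction: reasoning steps become nodes in a directed dependency graph, and the Shortest Effective Path is the shortest chain from the problem's givens to the step stating the correct answer. The step IDs and the efficiency ratio below are purely hypothetical illustrations.

```python
from collections import deque

def shortest_effective_path(edges, premises, answer):
    """BFS over a directed dependency graph of reasoning steps.

    edges: dict mapping a step to the steps that depend on it
    premises: steps available at the start (the problem's givens)
    answer: the step that states the correct final answer
    Returns the shortest chain of steps from any premise to the answer,
    or None if the answer is unreachable.
    """
    queue = deque((p, [p]) for p in premises)
    seen = set(premises)
    while queue:
        node, path = queue.popleft()
        if node == answer:
            return path
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

# Hypothetical 6-step CoT; s3 and s4 form a redundant re-check branch.
edges = {
    "s1": ["s2", "s3"],
    "s2": ["s5"],
    "s3": ["s4"],
    "s4": ["s5"],
    "s5": ["answer"],
}
sep = shortest_effective_path(edges, ["s1"], "answer")
# sep == ["s1", "s2", "s5", "answer"]
efficiency = len(sep) / 6  # fraction of steps that were necessary
```

In this toy trace the SEP covers 4 of 6 steps, so the detour through s3 and s4 would be flagged as structural redundancy, which is the kind of interpretable, cross-model signal the abstract describes.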