ReEfBench: Quantifying the Reasoning Efficiency of LLMs

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of Chain-of-Thought (CoT) reasoning struggle to distinguish whether performance gains in large language models stem from genuine reasoning capability or merely from verbose, redundant outputs; they lack fine-grained, non-intrusive methods for assessing the reasoning process itself. This work proposes the first neuro-symbolic evaluation framework focused on reasoning efficiency, enabling non-intrusive diagnosis of model reasoning mechanisms through process-oriented behavioral analysis and prototype clustering. The approach identifies four distinct reasoning-behavior prototypes, revealing that longer CoT generations are not necessary for efficient reasoning. It further shows that hybrid training on both short and long CoT data often leads to performance saturation, and that model distillation fails to transfer logical reasoning ability effectively, thereby systematically exposing critical limitations in current CoT training and transfer paradigms.

📝 Abstract
Test-time scaling has enabled Large Language Models (LLMs) to tackle complex reasoning, yet the limitations of current Chain-of-Thought (CoT) evaluation obscure whether performance gains stem from genuine reasoning or mere verbosity. To address this, (1) we propose a novel neuro-symbolic framework for non-intrusive, comprehensive, process-centric evaluation of reasoning. (2) Through this lens, we identify four distinct behavioral prototypes and diagnose their failure modes. (3) We examine the impact of inference mode, training strategy, and model scale. Our analysis reveals that extended token generation is not a prerequisite for deep reasoning. Furthermore, we reveal critical constraints: mixing long and short CoT data in training risks premature saturation and collapse, while distillation into smaller models captures behavioral length but fails to replicate logical efficacy due to intrinsic capacity limits.
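The paper's prototype-clustering procedure is not specified in this listing, but the general idea of grouping CoT traces into behavioral prototypes can be sketched as follows. Everything here is an illustrative assumption, not the authors' method: the feature choices (trace length, step count, self-check rate), the toy traces, and the use of plain k-means with k=4 (matching the four prototypes the paper reports) are stand-ins for whatever the framework actually does.

```python
import numpy as np

def trace_features(trace: str) -> np.ndarray:
    """Map a CoT trace to a small feature vector.
    Illustrative features only (NOT the paper's feature set):
    token count, step count, and fraction of self-check steps."""
    steps = [s for s in trace.split("\n") if s.strip()]
    n_tokens = len(trace.split())
    n_steps = len(steps)
    checks = sum("check" in s.lower() or "verify" in s.lower() for s in steps)
    return np.array([n_tokens, n_steps, checks / max(n_steps, 1)], dtype=float)

def kmeans(X: np.ndarray, k: int = 4, iters: int = 50, seed: int = 0):
    """Plain k-means: each trace is assigned to the nearest of k prototypes."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]  # initial prototypes
    for _ in range(iters):
        # squared distance from every trace to every prototype
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(axis=0)  # recenter prototype j
    return labels, C

# Toy traces with different reasoning "styles" (hypothetical examples).
traces = [
    "Step 1: add 2 and 2.\nStep 2: verify the sum is 4.\nAnswer: 4",
    "Answer: 4",
    "Step 1: restate the problem.\nStep 2: restate it again.\n"
    "Step 3: add 2 and 2.\nAnswer: 4",
    "Step 1: add 2 and 2.\nAnswer: 4",
    "Step 1: guess 5.\nStep 2: check: 2+2 is 4, not 5.\n"
    "Step 3: correct to 4.\nAnswer: 4",
    "Step 1: add.\nStep 2: verify.\nStep 3: verify again.\nAnswer: 4",
]
X = np.stack([trace_features(t) for t in traces])
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)  # standardize features
labels, protos = kmeans(X, k=4)  # one prototype id per trace
```

Because the features deliberately exclude answer correctness, the clusters characterize *how* a model reasons (terse, verbose, self-correcting, and so on) rather than whether it answers correctly, which matches the process-centric framing of the abstract.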
Problem

Research questions and friction points this paper is trying to address.

reasoning efficiency
Chain-of-Thought
Large Language Models
evaluation
test-time scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning efficiency
neuro-symbolic evaluation
Chain-of-Thought
behavioral prototypes
model distillation