🤖 AI Summary
Large language models (LLMs) exhibit redundant, costly, and unstable reasoning under chain-of-thought (CoT) prompting, and the field lacks a unified benchmark for evaluating efficient reasoning. Method: We introduce EffiReason-Bench, the first comprehensive benchmark for efficient reasoning, covering mathematical, commonsense, and logical reasoning tasks; propose a standardized CoT annotation protocol; and design the E3-Score, a composite metric quantifying Efficiency, Effectiveness, and Explainability, alongside a unified evaluation framework supporting chain-based prompting, structured modeling, and option-level analysis. Contribution/Results: Systematically evaluating seven efficient-reasoning methods across six open-source LLMs and four datasets, we find no globally optimal approach: performance depends significantly on model scale, task complexity, and architecture. Our work establishes a reproducible, extensible evaluation infrastructure and provides foundational insights for advancing research on efficient LLM reasoning.
📝 Abstract
Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.
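The abstract does not give the E3-Score's formula. As a purely illustrative sketch (not the paper's actual metric), a smooth accuracy/cost composite in the spirit of economic trade-off models, here a hypothetical Cobb-Douglas-style weighted geometric mean with an invented `budget` parameter, could look like:

```python
import math

def e3_score_sketch(accuracy: float, tokens_used: int,
                    budget: int = 512, alpha: float = 0.7) -> float:
    """Hypothetical composite score (NOT the paper's E3-Score).

    Combines effectiveness (accuracy in [0, 1]) with efficiency
    (a smooth penalty on token usage) via a Cobb-Douglas-style
    weighted geometric mean, so the score varies continuously
    with token count and has no hard cutoffs.
    """
    # Smooth efficiency term in (0, 1]: decays as token usage grows.
    efficiency = math.exp(-tokens_used / budget)
    # Weighted geometric mean: alpha weights effectiveness.
    return (accuracy ** alpha) * (efficiency ** (1 - alpha))

# With equal accuracy, the shorter reasoning trace scores higher.
print(e3_score_sketch(0.9, 128) > e3_score_sketch(0.9, 1024))  # True
```

A multiplicative form like this illustrates the "smooth, stable evaluation without discontinuities" property the abstract claims: unlike a hard token-budget threshold, every extra token lowers the score gradually.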