🤖 AI Summary
Large language models (LLMs) exhibit redundant, costly, and unstable reasoning under chain-of-thought (CoT) prompting, and the field lacks a unified benchmark for evaluating efficient reasoning. Method: We introduce EffiReason-Bench, the first comprehensive benchmark for efficient reasoning, covering mathematical, commonsense, and logical reasoning tasks; propose a standardized CoT annotation protocol; and design the E3-Score, a composite metric quantifying Efficiency, Effectiveness, and Explainability, alongside a unified evaluation framework supporting chain-based prompting, structured modeling, and option-level analysis. Contribution/Results: Systematically evaluating seven efficient-reasoning methods across six open-source LLMs and four datasets, we find no globally optimal approach: performance depends significantly on model scale, task complexity, and architecture. Our work establishes a reproducible, extensible evaluation infrastructure and provides foundational insights for advancing research on efficient LLM reasoning.
📝 Abstract
Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.
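The abstract does not give the E3-Score's formula. As a purely illustrative sketch (not the paper's actual metric), a smooth accuracy/cost composite in the spirit of economic trade-off models, here a hypothetical Cobb-Douglas-style weighted geometric mean with an invented `budget` parameter, could look like:

```python
import math

def e3_score_sketch(accuracy: float, tokens_used: int,
                    budget: int = 512, alpha: float = 0.7) -> float:
    """Hypothetical composite score (NOT the paper's E3-Score).

    Combines effectiveness (accuracy in [0, 1]) with efficiency
    (a smooth penalty on token usage) via a Cobb-Douglas-style
    weighted geometric mean, so the score varies continuously
    with token count and has no hard cutoffs.
    """
    # Smooth efficiency term in (0, 1]: decays as token usage grows.
    efficiency = math.exp(-tokens_used / budget)
    # Weighted geometric mean: alpha weights effectiveness.
    return (accuracy ** alpha) * (efficiency ** (1 - alpha))

# With equal accuracy, the shorter reasoning trace scores higher.
print(e3_score_sketch(0.9, 128) > e3_score_sketch(0.9, 1024))  # True
```

A multiplicative form like this illustrates the "smooth, stable evaluation without discontinuities" property the abstract claims: unlike a hard token-budget threshold, every extra token lowers the score gradually.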