EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models

📅 2025-11-13
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit redundant, costly, and unstable reasoning under chain-of-thought (CoT) prompting, and the field lacks a unified benchmark for evaluating reasoning efficiency. Method: We introduce the first comprehensive benchmark for efficient reasoning, covering mathematical, commonsense, and logical reasoning tasks; propose a standardized CoT annotation protocol; and design the E3-Score, a novel composite metric quantifying Efficiency, Effectiveness, and Explainability, alongside a unified evaluation framework supporting chain-based prompting, structured modeling, and option-level analysis. Contribution/Results: Systematically evaluating seven efficient-reasoning methods across six open-source LLMs and four datasets, we find no globally optimal approach: performance is significantly modulated by model scale, task complexity, and architectural characteristics. Our work establishes a reproducible, extensible evaluation infrastructure and provides foundational insights for advancing research on efficient LLM reasoning.

📝 Abstract
Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.
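The abstract does not reproduce the E3-Score formula, so the sketch below is only a hypothetical illustration of the kind of smooth, economics-style trade-off it describes: a weighted geometric mean (Cobb-Douglas form) over Effectiveness, Efficiency, and Explainability. The function name, the weights, and the choice of geometric mean are all assumptions for illustration, not the paper's actual definition.

```python
# Hypothetical sketch of an E3-style composite score. The paper's exact
# formula is not given on this page; this shows one standard economic
# trade-off model (a Cobb-Douglas weighted geometric mean) that matches
# the stated desiderata. All names and weights below are assumptions.

def e3_score(effectiveness: float,
             efficiency: float,
             explainability: float,
             weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted geometric mean of three component scores in [0, 1]."""
    scores = (effectiveness, efficiency, explainability)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("component scores must lie in [0, 1]")
    result = 1.0
    for s, w in zip(scores, weights):
        result *= s ** w  # multiplicative aggregation: smooth, no thresholds
    return result

# Example: accurate but verbose reasoning vs. terse but slightly weaker.
print(e3_score(effectiveness=0.92, efficiency=0.40, explainability=0.80))
print(e3_score(effectiveness=0.85, efficiency=0.90, explainability=0.75))
```

A geometric mean of this form is smooth and free of discontinuities, and it prevents any single dimension from being traded away entirely: a method cannot score well by being fast but unexplainable, since a near-zero component drives the whole score toward zero.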
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning efficiency across diverse LLM methods
Addressing fragmented evaluation practices in efficient reasoning
Developing a unified benchmark for cross-paradigm efficiency assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified benchmark for cross-paradigm efficient reasoning evaluation
Verified CoT annotations via structured pipeline with human verification (see the schema sketch after this list)
E3-Score metric modeling economic trade-offs for stable assessment
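As a concrete illustration of the annotation pipeline referenced above, the sketch below shows one plausible record shape for a verified CoT annotation with standardized reasoning steps and option-wise analysis, as the abstract describes for CommonsenseQA and LogiQA. Every field and class name here is a hypothetical assumption; the paper's actual schema is not shown on this page.

```python
# Hypothetical record shape for a verified CoT annotation with
# option-wise analysis. Field names are illustrative assumptions;
# the paper's actual annotation schema is not reproduced here.
from dataclasses import dataclass, field

@dataclass
class OptionAnalysis:
    label: str     # option label, e.g. "A"
    text: str      # option text
    verdict: bool  # whether this option survives the reasoning
    rationale: str # brief justification for accepting or rejecting it

@dataclass
class CoTAnnotation:
    question: str
    steps: list[str]                                    # standardized reasoning steps
    options: list[OptionAnalysis] = field(default_factory=list)
    answer: str = ""                                    # gold label, e.g. "C"
    human_verified: bool = False                        # passed human verification
```

Enforcing a fixed structure like this is what makes step-by-step evaluation possible: every example exposes the same fields, so reasoning length, per-option coverage, and verification status can be compared uniformly across methods.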
Authors

Junquan Huang (South China Normal University)
Haotian Wu (Nanyang Technological University)
Yubo Gao (Hong Kong University of Science and Technology (Guangzhou))
Yibo Yan (East China Normal University; High-dimensional Statistics)
Junyan Zhang (National University of Singapore; Large Language Model)
Yonghua Hei (Hong Kong University of Science and Technology (Guangzhou))
Song Dai (Hong Kong University of Science and Technology (Guangzhou))
Jie Zhang (Nanyang Technological University)
Puay Siew Tan (SIMTech, Agency for Science Technology and Research (A*STAR); Smart Manufacturing, Industry 4.0, Industry 5.0, AI for Manufacturing)
Xuming Hu (Assistant Professor, HKUST(GZ) / HKUST; Natural Language Processing, Large Language Model)