Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs

📅 2025-10-16
🤖 AI Summary
This study investigates whether large language models (LLMs) can effectively improve their complex-reasoning performance through self-correction, and systematically evaluates existing self-refinement methods. To this end, we introduce CorrectBench, a multi-task benchmark covering commonsense, mathematical, and code reasoning, designed specifically for evaluating self-correction capabilities. We conduct comprehensive experiments across intrinsic correction, external feedback, fine-tuning, and chain-of-thought (CoT) baselines. Results show that mixing self-correction strategies yields further accuracy gains but substantially increases latency; state-of-the-art reasoning models (e.g., DeepSeek-R1) show limited, inefficient improvements from additional self-correction; and a comparatively lightweight CoT baseline achieves a superior trade-off between accuracy and efficiency. Our core contributions are: (1) a standardized evaluation framework for self-correction, (2) empirical identification of fundamental bottlenecks in reasoning optimization, and (3) methodological insights toward developing efficient and robust reasoning systems.

📝 Abstract
Self-correction of large language models (LLMs) has emerged as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-R1) show limited gains from additional self-correction methods and incur high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLMs' reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency. Project Page: https://correctbench.github.io/
Problem

Research questions and friction points this paper is trying to address.

Evaluating self-correction effectiveness in large language models
Assessing intrinsic, external, and fine-tuned correction strategies
Benchmarking correction methods across reasoning and generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates intrinsic, external, and fine-tuned self-correction
Mixed self-correction strategies improve accuracy but reduce efficiency
Chain-of-thought baseline shows competitive accuracy and efficiency
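The intrinsic strategy evaluated above can be pictured as an answer-critique-revise loop. Below is a minimal sketch, assuming a hypothetical `query_llm` callable standing in for any chat-completion backend; the function name, prompts, and round count are illustrative and not part of CorrectBench itself.

```python
# Minimal sketch of intrinsic self-correction: the model answers,
# critiques its own answer, then revises it, for a fixed number of
# rounds. `query_llm` is a hypothetical prompt -> text callable.

def intrinsic_self_correct(query_llm, question, rounds=2):
    """Answer a question, then iteratively self-critique and revise."""
    # Initial chain-of-thought attempt.
    answer = query_llm(f"Question: {question}\nAnswer step by step.")
    for _ in range(rounds):
        # Ask the same model to find flaws in its own answer.
        critique = query_llm(
            f"Question: {question}\nProposed answer: {answer}\n"
            "Review the answer and point out any mistakes."
        )
        # Revise the answer in light of the self-critique.
        answer = query_llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nGive a corrected final answer."
        )
    return answer
```

Each extra round adds two model calls, which is where the latency cost reported in the paper comes from: accuracy gains must be weighed against this multiplied inference budget.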
Authors

Guiyao Tie, Huazhong University of Science and Technology
Zenghui Yuan, Huazhong University of Science and Technology
Zeli Zhao, Huazhong University of Science and Technology
Chaoran Hu, Huazhong University of Science and Technology
Tianhe Gu, Huazhong University of Science and Technology
Ruihang Zhang, Huazhong University of Science and Technology
Sizhe Zhang, Huazhong University of Science and Technology
Junran Wu, National University of Singapore
Xiaoyue Tu, Huazhong University of Science and Technology
Ming Jin, Griffith University
Qingsong Wen, Squirrel Ai Learning
Lixing Chen, Associate Professor, Shanghai Jiao Tong University
Pan Zhou, Huazhong University of Science and Technology
Lichao Sun, Lehigh University