RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation frameworks for LLMs' critique capabilities lack rigorous methodology for distinguishing genuine differences in critical analysis and self-correction between advanced reasoning models and standard ones. Method: We introduce the first correction-oriented, closed-loop benchmark for critique evaluation, featuring three mechanisms (self-critique, cross-model critique, and iterative critique) that quantify critique effectiveness through the quality of multi-round corrections. Built on eight challenging reasoning tasks, the benchmark integrates critique-correction pipeline validation, cross-model comparative analysis, and iterative feedback assessment. Contribution/Results: Experiments show that conventional LLMs systematically underperform o1-mini across all critique tasks; notably, in the self-critique and iterative critique settings, their corrected outputs can degrade relative to the original responses. This work establishes a reproducible, scalable paradigm for evaluating LLMs' critical thinking abilities.

📝 Abstract
Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at https://github.com/tangzhy/RealCritic.
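The closed-loop methodology described above can be sketched in a few lines: instead of grading the critique text itself, the evaluator applies the critique to produce a corrected answer and scores that correction against a task verifier, optionally looping for the iterative-critique setting. The sketch below is illustrative only; all names (`Problem`, `ToyModel`, `closed_loop_critique_eval`) are hypothetical, and the toy model trivially "knows" the right answer during critique purely to exercise the loop, which the real benchmark of course does not.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    """A task with an automatically checkable answer (hypothetical interface)."""
    question: str
    answer: int

    def check(self, candidate: int) -> bool:
        return candidate == self.answer

class ToyModel:
    """Stand-in for an LLM: always answers 0, but its critique spots errors
    and its correction applies them. Purely structural, not a real model."""
    def solve(self, problem: Problem) -> int:
        return 0  # initial chain-of-thought answer

    def critique(self, problem: Problem, answer: int):
        # Return None if the answer looks fine, else a hint toward the fix.
        return None if problem.check(answer) else problem.answer

    def correct(self, problem: Problem, answer: int, critique) -> int:
        return answer if critique is None else critique

def closed_loop_critique_eval(model, problems, rounds: int = 1) -> float:
    """Score critique ability by final correction accuracy (closed loop).
    rounds=1 is the single-pass setting; rounds>1 is iterative critique."""
    correct = 0
    for p in problems:
        ans = model.solve(p)
        for _ in range(rounds):
            c = model.critique(p, ans)
            ans = model.correct(p, ans, c)
        correct += int(p.check(ans))
    return correct / len(problems)
```

In this framing, self-critique uses the same model as solver and critic, while cross-critique would pass a different critic model's output into `correct`; the key design choice is that the metric never inspects the critique text, only the downstream answer.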
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Critical Thinking
Performance Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

RealCritic
Critical Reasoning Assessment
Advanced Language Models