🤖 AI Summary
Existing inference-time scaling methods for multi-step logical reasoning in large language models (LLMs) provide only coarse-grained process supervision: they rely on scalar rewards that lack fine-grained, step-level qualitative assessment. To address this, we propose **stepwise natural language self-critique (PANEL)**, a novel inference-time method that generates human-readable, qualitative, and self-explanatory natural-language critiques for each candidate reasoning step, replacing conventional scalar process reward signals. PANEL is the first approach to integrate self-generated textual critiques directly into tree search without requiring additional training, task-specific verifiers, or supervised feedback. Its core innovations include an LLM-intrinsic stepwise self-critique mechanism, critique-guided beam search, and zero-shot process feedback modeling. On challenging benchmarks, including AIME and GPQA, PANEL significantly improves both answer accuracy and reasoning robustness. The implementation is publicly available.
📝 Abstract
Enhancing the reasoning capabilities of large language models (LLMs), particularly for complex tasks requiring multi-step logical deduction, remains a significant challenge. Traditional inference-time scaling methods use scalar reward signals from process reward models to evaluate candidate reasoning steps, but these scalar rewards lack the nuanced qualitative information essential for understanding and justifying each step. In this paper, we propose a novel inference-time scaling approach -- stepwise natural language self-critique (PANEL) -- which employs self-generated natural language critiques as feedback to guide the step-level search process. By generating rich, human-readable critiques for each candidate reasoning step, PANEL retains essential qualitative information, facilitating better-informed decision-making during inference. This approach bypasses the need for task-specific verifiers and their associated training overhead, making it broadly applicable across diverse tasks. Experimental results on challenging reasoning benchmarks, including AIME and GPQA, demonstrate that PANEL significantly enhances reasoning performance, outperforming traditional scalar reward-based methods. Our code is available at https://github.com/puddingyeah/PANEL to support and encourage future research in this promising field.
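To make the search loop concrete, here is a minimal, self-contained sketch of critique-guided step selection. The functions `generate_steps` and `critique` are hypothetical stand-ins for LLM calls (they are not PANEL's actual prompts or API), and the "sound"/"questionable" verdict parsing is an illustrative assumption; the point is only that a qualitative text critique, rather than a scalar reward, drives which candidate step survives.

```python
def generate_steps(state, k=3):
    """Stand-in for sampling k candidate next reasoning steps from an LLM."""
    return [f"{state} -> step{i}" for i in range(k)]

def critique(state, step):
    """Stand-in for a self-generated natural-language critique of one step.

    A real system would prompt the same LLM for a qualitative judgement;
    here we fake a critique ending in a coarse verdict the search can parse.
    """
    verdict = "sound" if step.endswith("step0") else "questionable"
    return f"The step '{step}' follows from '{state}'; it looks {verdict}."

def select_steps(state, beam_width=1):
    """Keep the beam_width candidates whose critiques judge them best.

    The natural-language critique is mapped to an ordering only at
    selection time, so the qualitative text itself stays available
    for later inspection or justification.
    """
    scored = []
    for step in generate_steps(state):
        text = critique(state, step)
        rank = 0 if "sound" in text else 1  # crude verdict extraction
        scored.append((rank, step, text))
    scored.sort(key=lambda t: t[0])  # stable sort: best-judged steps first
    return scored[:beam_width]

best_rank, best_step, best_critique = select_steps("x = 2")[0]
print(best_step)      # the candidate whose critique judged it sound
print(best_critique)  # the human-readable justification is retained
```

In a full system, this selection would be repeated step by step until an answer is produced, with `beam_width > 1` keeping several partial reasoning paths alive, as in conventional beam search.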