🤖 AI Summary
Current large language models (LLMs) typically decouple reasoning from verification in complex reasoning tasks: they either omit self-checking entirely or rely on external verifiers. The result is delayed feedback, architectural redundancy, and poor co-optimization. To address this, we propose Stepwise Think-Critique (STC), the first framework unifying fine-grained, step-level reasoning and self-critique within a single model. Its core contributions are: (1) an interleaved Think-Critique architecture enabling immediate, verifiable self-assessment after each reasoning step; (2) a hybrid reinforcement learning objective jointly optimizing reasoning quality and critique consistency; and (3) a critique-consistency reward coupled with multi-stage trajectory supervision. Experiments on mathematical reasoning benchmarks demonstrate that STC significantly improves both answer accuracy and error detection capability, yielding more transparent, traceable, and robust reasoning chains and advancing the development of endogenous critical capabilities in LLMs.
📝 Abstract
Humans solve complex problems through critical thinking, in which reasoning and evaluation are intertwined to converge on correct solutions. However, most existing large language models (LLMs) decouple reasoning from verification: they either generate reasoning without explicit self-checking or rely on external verifiers to detect errors post hoc. The former lacks immediate feedback, while the latter increases system complexity and hinders synchronized learning. Motivated by human critical thinking, we propose Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at each step within a single model. STC is trained with a hybrid reinforcement learning objective that combines reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation. Experiments on mathematical reasoning benchmarks show that STC demonstrates strong critical-thinking capabilities and produces more interpretable reasoning traces, representing a step toward LLMs with built-in critical thinking.
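The abstract does not spell out how the reasoning reward and the critique-consistency reward are defined or combined. The sketch below is only one hypothetical reading, meant to make the idea of a hybrid objective concrete: the reasoning reward is assumed to be exact-match correctness of the final answer, the critique-consistency reward is assumed to check whether the model's last self-critique verdict agrees with that outcome, and the two are assumed to be summed with a weight. The names `Step`, `reasoning_reward`, `critique_consistency_reward`, `hybrid_reward`, and `critique_weight` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning step followed by the model's own critique of it."""
    thought: str
    critique_says_correct: bool  # the model's self-assessed verdict for this step


def reasoning_reward(final_answer: str, reference: str) -> float:
    """Assumed rule: 1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def critique_consistency_reward(steps: List[Step], answer_correct: bool) -> float:
    """Assumed rule: 1.0 if the last critique verdict agrees with the actual outcome."""
    if not steps:
        return 0.0
    return 1.0 if steps[-1].critique_says_correct == answer_correct else 0.0


def hybrid_reward(
    steps: List[Step],
    final_answer: str,
    reference: str,
    critique_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Weighted sum of the two assumed reward terms."""
    r_reason = reasoning_reward(final_answer, reference)
    r_critique = critique_consistency_reward(steps, answer_correct=(r_reason == 1.0))
    return r_reason + critique_weight * r_critique
```

Under these assumptions, a trajectory that reaches a wrong answer but whose final critique flags the error still earns partial reward, which is one way a model could be encouraged to evaluate itself honestly rather than always approving its own steps.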