🤖 AI Summary
This work investigates the autonomous reasoning and self-correction capabilities of large language models (LLMs) without external feedback. Existing approaches rely on external reward models, hindering end-to-end self-evaluation and self-repair. To address this, we propose a two-stage training framework built purely on self-generated data: (1) sequential rejection sampling to synthesize long chain-of-thought data with fine-grained self-assessment annotations; and (2) rule-guided reinforcement learning fine-tuning that jointly models self-rewarding and self-correcting behaviors. Our method is the first to enable LLMs to simultaneously generate reasoning traces, assess correctness in real time, detect errors, dynamically revise outputs, and autonomously terminate iterative refinement. Experiments on Llama-3 and Qwen-2.5 demonstrate substantial gains in self-correction performance over strong baselines, with efficacy matching state-of-the-art systems that rely on external reward models.
📝 Abstract
We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs during inference, without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-stage algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on this curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.
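The inference-time loop described above (generate, self-assess, revise, terminate) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `ToyModel` and its `generate`/`assess`/`revise` methods are hypothetical stand-ins for prompting a single fine-tuned LLM in each role.

```python
def self_correct(problem, model, max_rounds=3):
    """Iteratively generate, self-assess, and revise an answer until the
    model judges its own output correct or the round budget is exhausted.
    All three roles are played by the same model -- no external reward model."""
    answer = model.generate(problem)
    for round_idx in range(max_rounds):
        # Self-rewarding step: the model scores its own answer.
        if model.assess(problem, answer):
            return answer, round_idx  # the model terminates refinement itself
        # Self-correction step: the model revises its answer.
        answer = model.revise(problem, answer)
    return answer, max_rounds


class ToyModel:
    """Hypothetical stand-in for the fine-tuned LLM: it first answers
    '2+2' wrongly, flags the error itself, then fixes it on revision."""

    def generate(self, problem):
        return "5"  # deliberately wrong first attempt

    def assess(self, problem, answer):
        return answer == "4"  # self-check: only '4' passes

    def revise(self, problem, answer):
        return "4"  # corrected answer


answer, rounds = self_correct("2+2", ToyModel())
print(answer, rounds)  # the toy model corrects itself in one revision
```

In the paper's framework, the behaviors stubbed here are learned end-to-end: stage one teaches the generate/assess/revise pattern via rejection-sampled trajectories, and stage two sharpens the assessment signal with rule-based reinforcement learning.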