🤖 AI Summary
Existing self-rewarding methods underperform on mathematical reasoning tasks and can even degrade performance. This paper proposes a process-oriented self-rewarding paradigm that moves beyond conventional outcome-based reward modeling. The method generates long chain-of-thought reasoning traces, evaluates each step with an LLM-as-a-Judge, and constructs fine-grained step-level preference data to enable iterative optimization without human annotation. Crucially, it shifts self-rewarding from discriminating final outcomes to discriminating and optimizing the *reasoning process* itself, thereby circumventing the performance ceiling imposed by human annotation quality. On multiple mathematical reasoning benchmarks, including GSM8K and MATH, the approach substantially outperforms prior self-rewarding methods and matches or exceeds models fine-tuned on human-annotated data. These results empirically support the feasibility of large language models autonomously attaining superhuman mathematical reasoning capabilities.
📝 Abstract
Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in many scenarios. To further improve LLMs' performance, they are trained on human-annotated preference data, which is constrained by the upper limit of human ability. The Self-Rewarding method has therefore been proposed, in which LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization into the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the great potential of self-rewarding to achieve LLM reasoning that may surpass human capabilities.
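The step-wise loop described above (generate reasoning steps, judge each step, collect step-level preference pairs for optimization) can be sketched as a minimal toy in Python. All function bodies here are hypothetical stand-ins, not the paper's implementation: a real pipeline would sample candidate steps from an LLM, score them with a step-wise LLM-as-a-Judge prompt, and run step-wise preference optimization (e.g. a DPO-style update) on the collected pairs.

```python
def generate_candidate_steps(problem, prefix, n=2):
    """HYPOTHETICAL: sample n candidate next reasoning steps from the model.

    Here we just fabricate labeled step strings; a real system would decode
    continuations of `prefix` with an LLM.
    """
    return [f"{problem}:step({len(prefix)},{i})" for i in range(n)]


def judge_step(problem, prefix, step):
    """HYPOTHETICAL: step-wise LLM-as-a-Judge, reduced to a toy scorer.

    A real judge would prompt an LLM to rate the correctness of `step`
    given the problem and the reasoning so far.
    """
    return 1.0 if step.endswith(",0)") else 0.0


def build_step_preferences(problem, max_steps=3):
    """Roll out one reasoning trace; at each step, rank candidates with the
    judge and record a (chosen, rejected) pair for step-wise preference
    optimization. The trace continues with the preferred step."""
    prefix, pairs = [], []
    for _ in range(max_steps):
        candidates = generate_candidate_steps(problem, prefix)
        scored = sorted(candidates,
                        key=lambda s: judge_step(problem, prefix, s),
                        reverse=True)
        chosen, rejected = scored[0], scored[-1]
        pairs.append({"prefix": list(prefix),
                      "chosen": chosen,
                      "rejected": rejected})
        prefix.append(chosen)  # extend the trace with the winning step
    return pairs


pairs = build_step_preferences("gsm8k-42")
print(len(pairs))  # one preference pair per reasoning step
```

Iterating this loop, then fine-tuning on the collected pairs and regenerating with the updated model, is what makes the paradigm self-rewarding: the model both produces and evaluates its own step-level training data.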