🤖 AI Summary
Current reward models are constrained in complex reasoning tasks by their strong reliance on labeled reference answers and fixed output formats, which limits data scalability and reasoning performance. To address this, we propose a generative reward modeling paradigm designed explicitly for reasoning: (1) we introduce Libra Bench, a benchmark dedicated to evaluating reward models in reasoning scenarios; (2) we devise a "Learning-to-Think" training strategy that enables reward learning from unlabeled reasoning trajectories; and (3) we integrate multi-stage reasoning modeling with generative reward prediction. The resulting Libra-RM model family achieves state-of-the-art performance across multiple mathematical and logical reasoning benchmarks and further improves downstream reinforcement-learning-based reasoning models. Our approach establishes a scalable, label-efficient, and generalizable pathway for reward modeling in large language models.
📝 Abstract
Reinforcement learning (RL) has significantly improved the reasoning ability of large language models. However, current reward models underperform in challenging reasoning scenarios, and predominant RL training paradigms rely on rule-based or reference-based rewards, which impose two critical limitations: 1) dependence on finely annotated reference answers to obtain rewards; and 2) the requirement of a constrained output format. These limitations fundamentally hinder further RL data scaling and sustained improvement of model reasoning performance. To address them, we propose a comprehensive framework for evaluating and improving reward models in complex reasoning scenarios. We first present Libra Bench, a reasoning-oriented benchmark systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models, which addresses the limitations of existing reward model benchmarks in reasoning scenarios. We further introduce a novel approach for improving generative reward models via learning-to-think methodologies. Based on this approach, we develop the Libra-RM series, a collection of generative reward models with reasoning capabilities that achieve state-of-the-art results on various benchmarks. Comprehensive downstream experiments demonstrate the correlation between Libra Bench and downstream applications, as well as the potential of Libra-RM to further improve reasoning models with unlabeled data.