🤖 AI Summary
Current reward models are constrained in complex reasoning tasks by their strong reliance on labeled reference answers and fixed output formats, which limits data scalability and reasoning performance. To address this, we propose a generative reward modeling paradigm designed explicitly for reasoning: (1) we introduce Libra Bench, a benchmark dedicated to evaluating reward models in reasoning scenarios; (2) we devise a "Learning-to-Think" training strategy that enables reward learning from unlabeled reasoning trajectories; and (3) we integrate multi-stage reasoning modeling with generative reward prediction. The resulting Libra-RM model family achieves state-of-the-art performance across multiple mathematical and logical reasoning benchmarks and further improves downstream reinforcement-learning-based reasoning models. Our approach establishes a scalable, label-efficient, and generalizable pathway for reward modeling in large language models.
📝 Abstract
Reinforcement learning (RL) has significantly improved the reasoning ability of large language models. However, current reward models underperform in challenging reasoning scenarios, and predominant RL training paradigms rely on rule-based or reference-based rewards, which impose two critical limitations: 1) dependence on finely annotated reference answers to obtain rewards; and 2) the requirement of a constrained output format. These limitations fundamentally hinder further RL data scaling and sustained improvement of model reasoning performance. To address them, we propose a comprehensive framework for evaluating and improving reward models in complex reasoning scenarios. We first present Libra Bench, a reasoning-oriented benchmark systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models, which addresses the limitations of existing reward model benchmarks in reasoning scenarios. We further introduce a novel approach for improving generative reward models via learning-to-think methodologies. Based on this approach, we develop the Libra-RM series, a collection of generative reward models with reasoning capabilities that achieve state-of-the-art results on various benchmarks. Comprehensive downstream experiments demonstrate the correlation between Libra Bench and downstream applications, as well as the potential of Libra-RM to further improve reasoning models with unlabeled data.