Discriminative Policy Optimization for Token-Level Reward Models

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address training instability and inaccurate credit assignment arising from the tight coupling between process reward modeling (PRM) and language generation, this paper proposes Q-RM: a decoupled discriminative policy optimization method that directly learns token-level Q-functions from preference data, bypassing explicit reward modeling and joint generation-reward optimization. Its key contribution is the first discriminative Q-function learning framework that requires no fine-grained annotations while maintaining theoretical consistency guarantees. Experiments on mathematical reasoning tasks demonstrate that Q-RM improves Pass@1 by 4.56–5.73 points over token-level PRM. Moreover, it achieves 12× faster convergence than ORM on GSM8K and 11× faster than step-level PRM on MATH, significantly enhancing both the stability and accuracy of fine-grained supervision.

Technology Category

Application Category

📝 Abstract
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM.
Problem

Research questions and friction points this paper is trying to address.

Resolving conflict between generative modeling and reward modeling
Improving token-level reward assignment accuracy
Enhancing training efficiency in reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples reward modeling from language generation
Optimizes discriminative policy for token-level rewards
Uses Q-function Reward Model (Q-RM) for efficiency
🔎 Similar Papers
No similar papers found.
H
Hongzhan Chen
School of Computer Science and Engineering, Sun Yat-sen University, China
T
Tao Yang
Wechat Search, Tencent Inc, China
S
Shiping Gao
Ruijun Chen
Ruijun Chen
Sun Yat-sen university
Xiaojun Quan
Xiaojun Quan
Professor, School of Computer Science and Engineering, Sun Yat-sen University
natural language processingtext miningmachine learning
H
Hongtao Tian
T
Ting Yao