🤖 AI Summary
Existing reward models suffer from temporal inconsistency during reinforcement learning and inference-time verification, leading to ineffective policy updates and unstable training. To address this, the authors propose the Temporal Difference Reward Model (TDRM), which captures how rewards evolve across reasoning steps via a temporal-difference regularization term, producing smoother reward signals that are better aligned with long-horizon objectives. TDRM integrates with process reward models (PRMs) and online actor-critic RL frameworks, and is compatible with both Best-of-N sampling and tree-search inference. Experiments show consistent improvements: Best-of-N and tree-search performance increase by up to 6.6% and 23.7%, respectively, across multiple base models. Combined with RLVR, TDRM matches baseline performance with only 2.5k samples versus the 50.1k the baseline requires, yielding substantial gains in data efficiency and policy quality.
📝 Abstract
Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences during training. This temporal-difference (TD) regularization produces smooth rewards and improves alignment with long-term objectives. Incorporating TDRM into the actor-critic style online RL loop yields consistent empirical gains. Notably, TDRM is complementary to verifiable-reward methods, and the two can be combined. Experiments show that TD-trained process reward models (PRMs) improve performance in both Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs make RL more data-efficient -- achieving with just 2.5k training examples performance comparable to what baseline methods need 50.1k examples to attain -- and yield higher-quality language model policies across 8 model variants (5 series): Qwen2.5-(0.5B, 1.5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). We release all code at https://github.com/THUDM/TDRM.
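To make the "minimizing temporal differences during training" idea concrete, here is a minimal sketch of a TD-regularized per-step reward loss. This is an illustration under assumptions, not the paper's actual implementation: the function name `td_regularized_loss`, the discount `gamma`, and the mixing weight `lam` are all hypothetical, and we assume the PRM emits one probability-like reward per reasoning step with binary per-step labels.

```python
import math

def td_regularized_loss(step_rewards, step_labels, gamma=0.99, lam=0.5):
    """Per-step binary cross-entropy plus a squared temporal-difference
    penalty that encourages temporally smooth reward trajectories.

    step_rewards: list of per-step reward predictions in (0, 1)
    step_labels:  list of per-step labels (1 = correct step, 0 = incorrect)
    """
    eps = 1e-8
    # Standard PRM supervision: binary cross-entropy on each reasoning step.
    bce = -sum(
        y * math.log(r + eps) + (1 - y) * math.log(1 - r + eps)
        for r, y in zip(step_rewards, step_labels)
    ) / len(step_rewards)
    # TD regularizer: penalize the gap between a step's reward and the
    # discounted reward of the following step, smoothing the signal over time.
    td_sq = [
        (step_rewards[t] - gamma * step_rewards[t + 1]) ** 2
        for t in range(len(step_rewards) - 1)
    ]
    td_reg = sum(td_sq) / max(len(td_sq), 1)
    return bce + lam * td_reg
```

Under this sketch, a jagged reward trajectory (e.g. `[0.9, 0.1, 0.9]`) incurs a larger regularization penalty than a smooth one (e.g. `[0.9, 0.9, 0.9]`), which is the qualitative behavior the abstract describes.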