TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference

📅 2025-09-18
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing reward models suffer from temporal inconsistency during reinforcement learning and inference-time verification, leading to inefficient policy updates and unstable training. To address this, we propose the Temporal Difference Reward Model (TDRM), which explicitly captures how rewards evolve across reasoning steps via a temporal-difference regularization mechanism, yielding temporally smoother reward signals that align better with long-horizon objectives. TDRM integrates cleanly with process reward models (PRMs) and online actor-critic training, and is compatible with both Best-of-N sampling and tree-search inference. Experiments demonstrate consistent improvements: Best-of-N and tree-search performance improve by up to 6.6% and 23.7%, respectively, across multiple base models. When combined with RLVR, TDRM matches baseline performance with only 2.5k training samples, versus the 50.1k the baseline requires, yielding substantial gains in both data efficiency and policy quality.
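The core idea can be pictured as penalizing disagreement between each step's predicted value and a bootstrapped target from the following step. Below is a minimal PyTorch sketch of such a TD regularizer, assuming zero intermediate rewards and a verifiable terminal reward; the function name, shapes, and exact loss are illustrative assumptions, not the paper's implementation.

```python
import torch

def td_regularization(step_values: torch.Tensor,
                      final_reward: float,
                      gamma: float = 1.0) -> torch.Tensor:
    """Squared TD error between consecutive per-step value predictions.

    step_values: shape (T,), the PRM's scalar score after each reasoning step.
    final_reward: terminal signal, e.g. 1.0 if the final answer verifies.
    Illustrative sketch only; not the paper's exact objective.
    """
    # Bootstrapped targets: gamma * v_{t+1} for intermediate steps
    # (intermediate rewards assumed zero), the terminal reward for the last.
    bootstrap = gamma * step_values[1:].detach()
    targets = torch.cat([bootstrap, step_values.new_tensor([final_reward])])
    td_errors = step_values - targets
    return (td_errors ** 2).mean()
```

In training, a term like this would be added to the PRM's usual objective with a weighting coefficient, which is what encourages temporally smooth reward signals.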

📝 Abstract
Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences during training. This temporal-difference (TD) regularization produces smooth rewards and improves alignment with long-term objectives. Incorporating TDRM into the actor-critic style online RL loop yields consistent empirical gains. Notably, TDRM complements verifiable reward methods, and the two can be used in series. Experiments show that TD-trained process reward models (PRMs) improve performance across Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs lead to more data-efficient RL, achieving with just 2.5k examples performance comparable to what baseline methods require 50.1k examples to attain, and yield higher-quality language model policies on 8 model variants (5 series): Qwen2.5-(0.5B, 1.5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). We release all code at https://github.com/THUDM/TDRM.
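For context, the Best-of-N setting in the abstract reduces to sampling several candidate solutions and letting the reward model pick one. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for the policy model and the TD-trained PRM, not part of the paper's codebase:

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Draw n candidate solutions and keep the one the reward model
    scores highest. `generate` and `score` are placeholders; any
    aggregation of per-step PRM scores into one scalar works for `score`.
    """
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```

A smoother, more temporally consistent PRM changes only the `score` function here, which is also why the same model transfers to tree search, where partial solutions must be ranked step by step.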
Problem

Research questions and friction points this paper is trying to address.

Addresses temporal inconsistency in reward models
Improves alignment with long-term objectives
Enhances data efficiency in reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal difference regularization for smooth rewards
Integration with actor-critic reinforcement learning loop
Combination with verifiable reward methods for efficiency (see the blending sketch after this list)
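As a sketch of the last point, a smooth PRM signal can supplement a sparse verifiable reward in the RL loop. One simple blending scheme is shown below; the mean-pooling and the `alpha` weight are assumptions, since the abstract only states that the two methods can be used in series.

```python
def combined_reward(step_scores: list[float],
                    answer_is_correct: bool,
                    alpha: float = 0.5) -> float:
    """Blend a sparse verifiable reward with the PRM's dense step signal.

    Illustrative assumption: mean-pool the step scores and mix linearly.
    The paper only states TDRM and verifiable rewards compose in series.
    """
    verifiable = 1.0 if answer_is_correct else 0.0   # rule-based check
    process = sum(step_scores) / max(len(step_scores), 1)  # dense PRM signal
    return alpha * verifiable + (1.0 - alpha) * process
```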