🤖 AI Summary
Existing process reward models (PRMs) suffer from two key limitations: (1) they model reasoning steps in isolation, ignoring causal dependencies among them; and (2) they decouple process-level rewards from final outcomes, leading to ambiguous credit assignment and susceptibility to reward hacking. To address these issues, we propose Conditional Reward Modeling (CRM), a framework that treats the reasoning process as a time-series trajectory toward the correct answer. CRM explicitly encodes inter-step causal relationships via conditional probability modeling and enforces alignment between stepwise rewards and the final outcome. Crucially, CRM requires no verifiable ground-truth rewards and is compatible with Best-of-N sampling, beam search, and reinforcement learning. Experiments across diverse reasoning tasks demonstrate that CRM consistently outperforms state-of-the-art PRMs, yielding more robust reward signals, stable downstream performance gains, and significant mitigation of both reward hacking and credit assignment challenges.
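To make the conditional design concrete, one illustrative instantiation (a sketch, not the paper's stated formulation; the symbols below, reasoning steps $s_1,\dots,s_T$, reward model $p_\theta$, and correct answer $y^\star$, are assumed purely for exposition) scores each step by the probability of ultimately reaching the correct answer given everything generated so far:

$$
r_t \;=\; p_\theta\!\left(y = y^\star \mid s_1, \dots, s_t\right), \qquad t = 1, \dots, T .
$$

Under this reading, every $r_t$ is conditioned on its preceding steps, so the reward sequence respects temporal causality, and the final reward $r_T$ coincides with the model's judgment of the outcome itself, which is one way stepwise rewards and the final answer can stay aligned.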
📝 Abstract
Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning capabilities of large language models (LLMs) by guiding their step-by-step reasoning toward a final answer. However, existing PRMs either treat each reasoning step in isolation, failing to capture inter-step dependencies, or struggle to align process rewards with the final outcome. Consequently, the reward signal fails to respect temporal causality in sequential reasoning and suffers from ambiguous credit assignment. These limitations make downstream models vulnerable to reward hacking and lead to suboptimal performance. In this work, we propose Conditional Reward Modeling (CRM), which frames LLM reasoning as a temporal process leading to a correct answer. The reward of each reasoning step is not only conditioned on the preceding steps but also explicitly linked to the final outcome of the reasoning trajectory. By enforcing conditional probability rules, our design captures the causal relationships among reasoning steps, and the link to the outcome allows credit for the final result to be attributed precisely to each intermediate step, thereby resolving credit assignment ambiguity. Further, this consistent probabilistic modeling makes the rewards produced by CRM reliably comparable across samples. Experiments across Best-of-N sampling, beam search, and reinforcement learning demonstrate that CRM consistently outperforms existing reward models, offering a principled framework for enhancing LLM reasoning. In particular, CRM is more robust to reward hacking and delivers stable downstream improvements without relying on verifiable rewards derived from ground truth.
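For the search-time uses mentioned above, a minimal Best-of-N sketch is shown below. It assumes a hypothetical `score_steps` interface standing in for a trained CRM; the interface, the toy scorer, and the choice of the last step's conditional reward as the trajectory score are all assumptions for illustration, not details taken from the abstract.

```python
from typing import Callable, List

# Hypothetical stand-in for a trained conditional reward model (CRM).
# It returns one reward per step, where each step's reward is conditioned
# on all preceding steps of the same trajectory.
def score_steps(question: str, steps: List[str]) -> List[float]:
    # Toy heuristic used only to make the sketch runnable:
    # pretend confidence grows with the length of the reasoning prefix.
    return [min(1.0, 0.1 * (i + 1)) for i in range(len(steps))]

def best_of_n(question: str,
              candidates: List[List[str]],
              scorer: Callable[[str, List[str]], List[float]]) -> int:
    """Pick the candidate whose final-step conditional reward is highest.

    Because the final step's reward is linked to the outcome, using it as the
    trajectory score is one (assumed) way to rank whole samples consistently.
    """
    scores = [scorer(question, steps)[-1] for steps in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])

if __name__ == "__main__":
    q = "What is 17 * 24?"
    cands = [
        ["17 * 24 = 17 * 20 + 17 * 4", "= 340 + 68", "= 408"],
        ["17 * 24 = 400"],
    ]
    print("selected candidate index:", best_of_n(q, cands, score_steps))
```

Because each step's reward already conditions on its prefix, the same scorer could in principle rank partial trajectories during beam search as well, though how the paper actually aggregates step rewards is not specified in this abstract.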