CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models

📅 2025-07-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing process reward models (PRMs) suffer from significant length bias in multi-step reasoning tasks (e.g., mathematical problem solving): they overestimate longer reasoning paths while neglecting semantic quality and logical correctness, yielding unreliable rewards and redundant outputs. To address this, we propose a counterfactual-guided debiasing framework that identifies spurious correlations between reward and reasoning length via causal graph modeling. Our method decouples semantic quality from step-count signals through three key components: (i) explicit length penalization, (ii) a learnable bias estimation network, and (iii) joint training with counterfactual data augmentation and length-invariance constraints. Evaluated on MATH500 and GSM-Plus, our approach reduces reward–length correlation by 42.6% and improves step-selection accuracy by 11.3%, producing more concise and logically rigorous reasoning paths. Results demonstrate both effectiveness and robustness across diverse reasoning scenarios.
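
The headline metric above, reward–length correlation, can be made concrete with a small sketch. The snippet below computes the Pearson correlation between a PRM's scores and the step counts of the reasoning paths it scored; a value near 1 indicates the length bias the paper describes. The example scores are hypothetical, not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical PRM scores for reasoning paths of increasing step count:
# a strong positive correlation is the length-bias signal CoLD targets.
step_counts = [3, 4, 5, 6, 8, 10]
prm_rewards = [0.62, 0.65, 0.71, 0.74, 0.80, 0.85]
print(round(pearson(step_counts, prm_rewards), 3))
```

The reported 42.6% reduction would correspond to shrinking this coefficient toward zero after debiasing.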

📝 Abstract
Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD consistently reduces reward–length correlation, improves accuracy in step selection, and encourages more concise, logically valid reasoning. These results demonstrate the effectiveness and practicality of CoLD in improving the fidelity and robustness of PRMs.
Problem

Research questions and friction points this paper is trying to address.

PRMs favor longer reasoning steps regardless of validity
Length bias reduces reward prediction reliability
Verbose outputs result from biased PRM scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explicit length-penalty adjustment for debiasing
Learned bias estimator for spurious signals
Joint training for length-invariant reward predictions
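
The first two components above can be sketched as a correction applied to the raw PRM score: subtract an explicit per-step penalty and a learned length-bias estimate. Everything below is a minimal illustration with hypothetical names and values; in the paper the bias term is produced by a trained estimator network, not a fixed number.

```python
def debiased_reward(raw_reward, num_steps, lam, bias_estimate):
    """CoLD-style correction (sketch): remove an explicit length penalty
    (lam * num_steps) and a learned spurious-length-bias term from the
    raw PRM score, leaving a reward driven by semantic quality."""
    return raw_reward - lam * num_steps - bias_estimate

# Two paths with the same underlying reasoning, differing only in verbosity.
# The raw PRM prefers the longer one; after debiasing, the concise path wins.
concise = debiased_reward(raw_reward=0.74, num_steps=4, lam=0.02, bias_estimate=0.05)
verbose = debiased_reward(raw_reward=0.80, num_steps=8, lam=0.02, bias_estimate=0.11)
print(concise > verbose)
```

The third component, joint training with a length-invariance constraint, would additionally penalize the model whenever two such paraphrases of equal validity receive different debiased scores.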