Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
In external reasoning systems, process reward models (PRMs) are susceptible to semantic confounding features, leading to “reward hacking”—assigning high scores to logically flawed yet superficially plausible reasoning traces, thereby degrading mathematical problem-solving accuracy. This work is the first to formalize the issue through a causal inference lens and proposes Causal Reward Adjustment (CRA): it employs sparse autoencoders to disentangle PRM hidden-layer activations, identifies interpretable confounding features, and applies unbiased reward correction via the backdoor adjustment formula. CRA requires no modification to the policy model or retraining of the PRM, ensuring strong interpretability and plug-and-play deployment. Evaluated across multiple mathematical reasoning benchmarks, CRA significantly mitigates reward hacking and improves final answer accuracy, demonstrating both effectiveness and cross-task generalization.

📝 Abstract
External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks such as mathematical problem solving. However, these systems are prone to reward hacking, where logically incorrect yet superficially plausible paths are assigned high scores by the PRMs, leading to incorrect answers. From a causal inference perspective, we attribute this phenomenon primarily to the presence of confounding semantic features. To address it, we propose Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path. CRA trains sparse autoencoders on the PRM's internal activations to recover interpretable features, then corrects confounding by using backdoor adjustment. Experiments on math solving datasets demonstrate that CRA mitigates reward hacking and improves final accuracy, without modifying the policy model or retraining the PRM.
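The backdoor adjustment the abstract refers to is the standard causal-inference formula; in this setting it would read roughly as follows (the symbols are illustrative rather than the paper's notation: $X$ the reasoning path, $Z$ the confounding semantic features, $R$ the PRM reward):

```latex
\mathbb{E}\bigl[R \mid \mathrm{do}(X=x)\bigr]
  \;=\; \sum_{z} P(Z=z)\,\mathbb{E}\bigl[R \mid X=x,\, Z=z\bigr]
```

Marginalizing over the marginal $P(Z)$ instead of conditioning on $P(Z \mid X=x)$ blocks the backdoor path through which confounding features can inflate the score of a superficially plausible trace.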
Problem

Research questions and friction points this paper is trying to address.

Mitigates reward hacking in external reasoning systems
Corrects confounding semantic features via backdoor adjustment
Improves accuracy without retraining policy or reward models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sparse autoencoders for interpretable features
Applies backdoor adjustment to correct confounding
Estimates true reward without retraining PRM
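The three bullets above can be sketched end to end. Everything in this snippet (function names, shapes, the discretized confounder, the stand-in conditional reward) is a hypothetical illustration under assumed interfaces, not the paper's implementation:

```python
import numpy as np

def sae_encode(activations, W_enc, b_enc):
    """Sparse autoencoder encoder: ReLU(h @ W_enc + b_enc) gives
    non-negative, interpretable feature activations."""
    return np.maximum(0.0, activations @ W_enc + b_enc)

def backdoor_adjusted_reward(path_feats, confounder_values, p_confounder, reward_fn):
    """E[R | do(path)] = sum_z P(z) * E[R | path, z]."""
    return sum(p_z * reward_fn(path_feats, z)
               for z, p_z in zip(confounder_values, p_confounder))

# Toy usage with random stand-ins for PRM activations and SAE weights.
rng = np.random.default_rng(0)
h = rng.normal(size=16)                    # PRM hidden-layer activations (toy)
W = rng.normal(size=(16, 8))
b = np.zeros(8)
feats = sae_encode(h, W, b)                # sparse, interpretable features

# Confounder discretized to two levels with marginal P(z).
zs, pz = [0.0, 1.0], [0.6, 0.4]
reward = lambda f, z: float(f.mean() - 0.1 * z)   # stand-in conditional reward
adjusted = backdoor_adjusted_reward(feats, zs, pz, reward)
```

Discretizing the confounding feature into a few levels is what makes the sum over $z$ tractable; the adjusted score weights each level by its marginal probability rather than by how often it co-occurs with the path being scored.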
Ruike Song
University of Chinese Academy of Sciences; Institute of Software, Chinese Academy of Sciences; Nankai University College of Software
Zeen Song
Institute of Software, Chinese Academy of Sciences
Machine Learning
Huijie Guo
University of Chinese Academy of Sciences; Institute of Software, Chinese Academy of Sciences
Wenwen Qiang
Institute of Software, Chinese Academy of Sciences
Artificial Intelligence · Machine Learning · Causal Inference · LLM/MLLM