Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

📅 2025-01-16

🤖 AI Summary

In RLHF, reward models are susceptible to spurious correlations—such as text length and flattery—leading to alignment distortion and bias. This paper proposes Causal Reward Modeling (CRM), the first framework to systematically integrate causal inference into LLM reward modeling: it identifies confounders via causal graphs and enforces counterfactual invariance to disentangle spurious correlations, enabling intervention-based debiasing. CRM is fully compatible with standard RLHF pipelines and requires no modifications to policy training. Evaluated on both synthetic and real-world datasets, CRM reduces length bias by 72% and flattery bias by 68%, while significantly improving preference prediction accuracy and cross-domain generalization. The approach enhances model fairness, robustness, and interpretability without compromising training efficiency or deployment compatibility.

📝 Abstract

Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases, such as length bias, sycophancy, conceptual bias, and discrimination, that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causal inference to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning.
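The counterfactual-invariance idea in the abstract can be illustrated with a toy check: alter a variable that should be irrelevant (here, response length) and measure how much the reward moves. This is a minimal sketch under stated assumptions, not the paper's method or training objective. The names `pad_response`, `invariance_gap`, and both toy reward functions are hypothetical, and the padding intervention is only one simple stand-in for "altering an irrelevant variable".

```python
from typing import Callable, Sequence

RewardFn = Callable[[str], float]


def pad_response(response: str, filler: str = " To elaborate further.", times: int = 3) -> str:
    """Hypothetical intervention: lengthen a response without adding information."""
    return response + filler * times


def invariance_gap(reward: RewardFn, responses: Sequence[str]) -> float:
    """Mean absolute change in reward when each response is padded.

    A gap near zero suggests the reward is invariant to this particular
    intervention; it says nothing about other spurious attributes.
    """
    gaps = [abs(reward(pad_response(r)) - reward(r)) for r in responses]
    return sum(gaps) / len(gaps)


def length_biased_reward(response: str) -> float:
    """Toy reward (assumption): content score plus a spurious length bonus."""
    quality = 1.0 if "paris" in response.lower() else 0.0
    return quality + 0.01 * len(response)


def content_only_reward(response: str) -> float:
    """Toy reward (assumption): depends only on the content signal."""
    return 1.0 if "paris" in response.lower() else 0.0


responses = ["The capital of France is Paris.", "I am not sure."]
biased_gap = invariance_gap(length_biased_reward, responses)
invariant_gap = invariance_gap(content_only_reward, responses)
print(f"length-biased gap: {biased_gap:.2f}, content-only gap: {invariant_gap:.2f}")
```

In this toy setting the length-biased reward shifts under padding while the content-only reward does not. A gap of this kind could in principle serve as a diagnostic or as a penalty term during reward-model training, though the source text does not specify how the paper enforces the invariance.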

Problem

Research questions and friction points this paper is trying to address.

Bias

Reinforcement Learning

Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Inference

Reinforcement Learning from Human Feedback (RLHF)

Stable Reward Score

🔎 Similar Papers

HAF-RM: A Hybrid Alignment Framework for Reward Model Training