Debiasing Reward Models by Representation Learning with Guarantees

📅 2025-10-27
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses misalignment in reward modeling caused by spurious correlations learned during training, such as response length, sycophancy, and conceptual biases. To mitigate this, we propose a causal representation learning framework for debiasing. Our approach constructs a structured causal model of the data-generating process, enabling theoretical identification and disentanglement of the non-spurious latent factors without requiring proxy labels for the spurious variables, thereby offering provable debiasing guarantees. We employ variational inference to jointly optimize the latent representations and the causal structure, yielding robust reward model training. Experiments on both synthetic and real-world datasets demonstrate that our method significantly alleviates spurious correlations, improving reward model stability, generalization, and consistency with human preferences.
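
A minimal sketch of the kind of data-generating process the summary describes, in illustrative notation (the paper's exact structural causal model may differ):

```latex
% Illustrative only: z_c denotes non-spurious (preference-relevant) latents,
% z_s denotes spurious latents (e.g., length, sycophancy).
% The observed text x is generated from both, while the intended preference
% label y depends on z_c alone; any association between z_s and y that a
% reward model picks up from training data is a spurious correlation.
z_c \sim p(z_c), \qquad
z_s \sim p(z_s \mid z_c), \qquad
x = g(z_c, z_s), \qquad
y \sim p(y \mid z_c)
```

Under this factorization, debiasing reduces to recovering z_c from x and scoring responses with z_c only.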

📝 Abstract
Recent alignment techniques, such as reinforcement learning from human feedback, have been widely adopted to align large language models with human preferences by learning and leveraging reward models. In practice, these models often exploit spurious correlations involving, for example, response length, discrimination, sycophancy, and conceptual bias, a problem that has received increasing attention. In this work, we propose a principled framework that mitigates these biases in reward models while preserving the underlying factors that reflect intended preferences. We first provide a formulation of the data-generating process, assuming that the observed data (e.g., text) is generated from both spurious and non-spurious latent variables. We show that, interestingly, these non-spurious latent variables can be theoretically identified from data, regardless of whether a surrogate for the spurious latent variables is available. This further inspires a practical method that uses variational inference to recover these variables and leverages them to train reward models. Experiments on synthetic and real-world datasets demonstrate that our method effectively mitigates spurious correlation issues and yields more robust reward models.
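
To make the variational-inference recipe concrete, here is a minimal PyTorch-style sketch of one way such a debiased reward model could be set up. Everything here is an assumption for illustration (the class name `DebiasedRewardModel`, the latent split, the dimensions, and the use of a frozen text encoder's features); it is not the authors' released code.

```python
# Hypothetical sketch: VAE-style debiasing for reward models.
# Assumptions (not from the paper): a frozen text encoder produces a feature
# vector h; the latent splits into non-spurious (z_c) and spurious (z_s)
# blocks; the reward head sees only z_c.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DebiasedRewardModel(nn.Module):
    def __init__(self, feat_dim=768, zc_dim=32, zs_dim=32):
        super().__init__()
        z_dim = zc_dim + zs_dim
        self.zc_dim = zc_dim
        # Amortized posterior q(z | h): outputs mean and log-variance.
        self.enc = nn.Linear(feat_dim, 2 * z_dim)
        # Decoder p(h | z): reconstructs text features from the full latent.
        self.dec = nn.Linear(z_dim, feat_dim)
        # Reward head r(z_c): uses only the non-spurious latent block.
        self.reward = nn.Linear(zc_dim, 1)

    def forward(self, h):
        mu, logvar = self.enc(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.dec(z)
        z_c = z[..., :self.zc_dim]
        return self.reward(z_c).squeeze(-1), recon, mu, logvar

def loss(model, h_chosen, h_rejected, beta=1e-3):
    """Bradley-Terry preference loss on z_c plus an ELBO-style penalty."""
    r_c, rec_c, mu_c, lv_c = model(h_chosen)
    r_r, rec_r, mu_r, lv_r = model(h_rejected)
    pref = -F.logsigmoid(r_c - r_r).mean()  # preference likelihood
    recon = F.mse_loss(rec_c, h_chosen) + F.mse_loss(rec_r, h_rejected)
    kl = sum(-0.5 * (1 + lv - mu.pow(2) - lv.exp()).sum(-1).mean()
             for mu, lv in [(mu_c, lv_c), (mu_r, lv_r)])
    return pref + beta * (recon + kl)
```

In this toy objective, the preference loss backpropagates only through z_c, so gradient pressure to encode reward-predictive information lands on the non-spurious block, while the reconstruction term leaves z_s to absorb the remaining variation (e.g., length). The paper's identifiability guarantees rest on its causal assumptions, which this sketch does not enforce.
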
Problem

Research questions and friction points this paper is trying to address.

Mitigating spurious correlations in reward models
Identifying non-spurious latent variables from data
Training robust reward models using variational inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Debiasing reward models via variational inference
Identifying non-spurious latent variables theoretically
Mitigating biases while preserving intended preferences