Causally Robust Reward Learning from Reason-Augmented Preference Feedback

📅 2026-03-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Preference learning is susceptible to causal confusion: reward models come to rely on spurious features that merely co-occur with preferred trajectories, which degrades generalization under distribution shift. To address this, the work proposes ReCouPLe, a lightweight framework that uses natural language rationales as the causal signal in reward learning. Each rationale serves as a guiding projection axis in the embedding space, so the reward model scores trajectories on features aligned with the stated reason and de-emphasizes unrelated context. Because the same rationale can apply to several tasks, ReCouPLe reuses the same causal direction across tasks that share semantics and transfers preference knowledge to novel tasks without extra data or language-model fine-tuning. In the reported experiments, ReCouPLe outperforms baselines by up to 1.5× in reward accuracy under distribution shift and by up to 2× in downstream policy performance on novel tasks.

📝 Abstract

Preference-based reward learning is widely used for shaping agent behavior to match a user's preference, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co-occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in an embedding space, training the model to score trajectories based on features aligned with that axis while de-emphasizing context that is unrelated to the stated reason. Because the same rationales (e.g., "avoids collisions", "completes the task faster") can appear across multiple tasks, ReCouPLe naturally reuses the same causal direction whenever tasks share semantics, and transfers preference knowledge to novel tasks without extra data or language-model fine-tuning. Our learned reward model can ground preferences on the articulated reason, aligning better with user intent and generalizing beyond spurious features. ReCouPLe outperforms baselines by up to 1.5× in reward accuracy under distribution shifts, and 2× in downstream policy performance in novel tasks. We have released our code at https://github.com/mj-hwang/ReCouPLe
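
The projection idea in the abstract can be illustrated with a small sketch. This is not the released ReCouPLe code: the function names (`project_onto_axis`, `reward`, `preference_prob`), the linear reward head `w`, the hand-picked three-dimensional embeddings, and the Bradley-Terry preference model are all assumptions made for illustration. The sketch only shows why scoring the rationale-aligned component of an embedding leaves the reward unchanged when unrelated context shifts.

```python
import numpy as np

def project_onto_axis(z, axis):
    """Split embedding z into its component along `axis` and the orthogonal rest."""
    u = axis / np.linalg.norm(axis)   # unit vector for the rationale direction
    aligned = (z @ u) * u             # part of z explained by the rationale
    return aligned, z - aligned

def reward(z, axis, w):
    """Score a trajectory using only the rationale-aligned part of its embedding."""
    aligned, _ = project_onto_axis(z, axis)
    return float(aligned @ w)

def preference_prob(r_a, r_b):
    """Bradley-Terry probability that trajectory A is preferred over B (assumed model)."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

# Hypothetical toy values: dimension 0 carries the stated reason,
# dimensions 1 and 2 carry unrelated context.
rationale_axis = np.array([1.0, 0.0, 0.0])
w = np.array([1.0, 1.0, 1.0])         # hypothetical linear reward head

z_a = np.array([2.0, 5.0, -1.0])
z_b = np.array([1.0, -5.0, 3.0])
z_a_shifted = np.array([2.0, -5.0, 1.0])  # same causal feature, context flipped

r_a = reward(z_a, rationale_axis, w)
r_b = reward(z_b, rationale_axis, w)
r_a_shifted = reward(z_a_shifted, rationale_axis, w)
p = preference_prob(r_a, r_b)

# A reward without projection changes when only the context changes.
naive_a = float(z_a @ w)
naive_a_shifted = float(z_a_shifted @ w)
```

In this toy setup the projected reward for trajectory A is the same before and after the context flip, while the unprojected score `z @ w` changes sign. In practice the axis would come from a language embedding of the rationale rather than a hand-written vector, which this sketch does not attempt.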

Problem

Research questions and friction points this paper is trying to address.

causal confusion

reward learning

preference feedback

spurious features

distribution shift

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal reward learning

natural language rationales

preference-based reinforcement learning