Causally Robust Reward Learning from Reason-Augmented Preference Feedback

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Preference learning is susceptible to causal confounding: reward models latch onto spurious features that co-occur with preferences, which degrades generalization under distribution shift. This work proposes ReCouPLe, a framework that introduces natural language rationales as causal signals for reward learning. By treating each rationale as a guiding projection axis in embedding space, ReCouPLe steers the reward model toward preference-aligned causal features while suppressing irrelevant contextual cues. Notably, the approach requires neither additional data nor fine-tuning of language models, and it enables cross-task reuse of causal directions and zero-shot preference transfer. Experiments show that ReCouPLe improves reward accuracy by up to 1.5× under distribution shift and boosts downstream policy performance by up to 2× on new tasks.

📝 Abstract
Preference-based reward learning is widely used for shaping agent behavior to match a user's preferences, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co-occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in an embedding space, training the model to score trajectories based on features aligned with that axis while de-emphasizing context that is unrelated to the stated reason. Because the same rationales (e.g., "avoids collisions", "completes the task faster") can appear across multiple tasks, ReCouPLe naturally reuses the same causal direction whenever tasks share semantics, and transfers preference knowledge to novel tasks without extra data or language-model fine-tuning. Our learned reward model can ground preferences on the articulated reason, aligning better with user intent and generalizing beyond spurious features. ReCouPLe outperforms baselines by up to 1.5× in reward accuracy under distribution shift, and 2× in downstream policy performance on novel tasks. We have released our code at https://github.com/mj-hwang/ReCouPLe
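The abstract's core idea, scoring a trajectory by the component of its embedding that lies along a rationale's direction, can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the embeddings, the function name `rationale_projected_score`, and the toy vectors below are all assumptions for demonstration.

```python
import numpy as np

def rationale_projected_score(traj_emb: np.ndarray,
                              rationale_emb: np.ndarray) -> float:
    """Score a trajectory by projecting its embedding onto the
    rationale's direction (a sketch of using a rationale as a
    guiding projection axis)."""
    # Normalize the rationale embedding to a unit axis.
    axis = rationale_emb / np.linalg.norm(rationale_emb)
    # Keep only the component aligned with the stated reason;
    # the orthogonal (contextual) component is discarded.
    return float(traj_emb @ axis)

# Toy example: the trajectory aligned with the rationale axis
# scores higher than one dominated by unrelated context features.
rationale = np.array([1.0, 0.0, 0.0])
traj_a = np.array([0.9, 0.2, -0.1])   # mostly rationale-aligned
traj_b = np.array([0.1, 0.8, 0.5])    # mostly spurious context
print(rationale_projected_score(traj_a, rationale) >
      rationale_projected_score(traj_b, rationale))  # True
```

Because the axis depends only on the rationale's embedding, the same direction can in principle be reused for any task whose rationale shares that semantics, which is the intuition behind the cross-task transfer claim.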
Problem

Research questions and friction points this paper is trying to address.

causal confusion
reward learning
preference feedback
spurious features
distribution shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

causal reward learning
natural language rationales
preference-based reinforcement learning
distributional robustness
cross-task transfer
Minjune Hwang
Thomas Lord Department of Computer Science, University of Southern California
Yigit Korkmaz
Thomas Lord Department of Computer Science, University of Southern California
Daniel Seita
University of Southern California
Robotics, Machine Learning
Erdem Bıyık
Assistant Professor, University of Southern California
Robotics, Human-Robot Interaction, Machine Learning, Artificial Intelligence