Similarity as Reward Alignment: Robust and Versatile Preference-based Reinforcement Learning

📅 2025-06-14
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing preference-based reinforcement learning (PbRL) methods exhibit poor robustness to non-expert annotation noise and are constrained to rigid feedback formats (e.g., pairwise comparisons) and training paradigms (e.g., purely offline). To address these limitations, we propose SARA, a framework that decouples reward modeling into latent representation learning over preference samples and similarity-based metric learning, using contrastively learned similarity to the preferred-sample latent as a proxy reward. This design inherently improves robustness to label noise and natively supports heterogeneous feedback types (pairwise, multiple-choice, scalar) across offline, online, and cross-task settings. Combined with trajectory filtering and reward shaping, SARA achieves significant improvements over state-of-the-art baselines on continuous-control offline RL benchmarks; we further validate its effectiveness on trajectory selection, cross-task preference transfer, and online reward shaping.
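
The page includes no reference implementation, so the following is a minimal PyTorch-style sketch of the core mechanism described above: embed trajectory segments, average the embeddings of preferred segments into a prototype latent, and score new segments by cosine similarity to that prototype as the proxy reward. All names (SegmentEncoder, similarity_reward, the layer sizes) are hypothetical illustrations, not the authors' code.

```python
# Hypothetical sketch of similarity-as-reward; names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    """Maps a flattened (state, action) segment to a unit-norm latent."""
    def __init__(self, seg_dim: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seg_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, seg: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so cosine similarity is a plain dot product.
        return F.normalize(self.net(seg), dim=-1)

@torch.no_grad()
def similarity_reward(encoder: SegmentEncoder,
                      preferred: torch.Tensor,  # (N, seg_dim) preferred segments
                      query: torch.Tensor       # (B, seg_dim) segments to score
                      ) -> torch.Tensor:
    # Prototype latent: mean embedding of the preferred set.
    prototype = F.normalize(encoder(preferred).mean(dim=0), dim=-1)
    # Proxy reward: cosine similarity of each query segment to the prototype.
    return encoder(query) @ prototype            # shape (B,), values in [-1, 1]
```

Because the reward depends on distance to an aggregate prototype rather than on per-pair classification, a few flipped labels shift the prototype only slightly, which is one plausible reading of the robustness claim.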

📝 Abstract
Preference-based Reinforcement Learning (PbRL) entails a variety of approaches for aligning models with human intent to alleviate the burden of reward engineering. However, most previous PbRL work has not investigated the robustness to labeler errors, inevitable with labelers who are non-experts or operate under time constraints. Additionally, PbRL algorithms often target very specific settings (e.g. pairwise ranked preferences or purely offline learning). We introduce Similarity as Reward Alignment (SARA), a simple contrastive framework that is both resilient to noisy labels and adaptable to diverse feedback formats and training paradigms. SARA learns a latent representation of preferred samples and computes rewards as similarities to the learned latent. We demonstrate strong performance compared to baselines on continuous control offline RL benchmarks. We further demonstrate SARA's versatility in applications such as trajectory filtering for downstream tasks, cross-task preference transfer, and reward shaping in online learning.
Problem

Research questions and friction points this paper is trying to address.

Robustness to labeler errors in Preference-based Reinforcement Learning
Adaptability to diverse feedback formats and training paradigms
Aligning models with human intent to reduce the burden of reward engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive framework for robust reward alignment (see the sketch below)
Resilience to noisy labels and adaptability to diverse feedback formats
Latent representation for versatile preference learning
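
To make the contrastive-framework point concrete, here is one hedged reading of how the latent could be trained: an InfoNCE-style loss that pulls each preferred segment toward the other preferred segments and pushes it away from non-preferred ones. This is an illustrative sketch under assumed batch construction and temperature, not the paper's exact objective.

```python
# Illustrative InfoNCE-style objective; assumes at least two preferred segments.
import torch
import torch.nn.functional as F

def contrastive_loss(pref_z: torch.Tensor,     # (P, d) embeddings of preferred segments
                     nonpref_z: torch.Tensor,  # (N, d) embeddings of non-preferred segments
                     temperature: float = 0.1) -> torch.Tensor:
    pref_z = F.normalize(pref_z, dim=-1)
    nonpref_z = F.normalize(nonpref_z, dim=-1)
    # Leave-one-out prototype: mean of the other preferred embeddings.
    total = pref_z.sum(dim=0, keepdim=True)                                # (1, d)
    proto = F.normalize((total - pref_z) / (pref_z.shape[0] - 1), dim=-1)  # (P, d)
    pos = (pref_z * proto).sum(dim=-1, keepdim=True) / temperature         # (P, 1)
    neg = (pref_z @ nonpref_z.T) / temperature                             # (P, N)
    logits = torch.cat([pos, neg], dim=1)                                  # (P, 1+N)
    labels = torch.zeros(pref_z.shape[0], dtype=torch.long)  # positive is column 0
    return F.cross_entropy(logits, labels)
```

Under this loss, a mislabeled sample contributes one corrupted row and a slightly shifted prototype rather than a hard misclassified pair, consistent with the noise-resilience framing above.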
👥 Authors
Sara Rajaram
Institute for Computer Science and Campus Institute for Data Science, University of Göttingen, Göttingen, Germany; International Max Planck Research School for Intelligent Systems, Tübingen, Germany
R. James Cotton
Northwestern University / Shirley Ryan AbilityLab
Neuroscience · Rehabilitation · Deep Learning
Fabian H. Sinz
Institute for Computer Science and Campus Institute for Data Science, University of Göttingen, Göttingen, Germany