CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries

📅 2025-05-31
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Human annotators struggle to distinguish between similar trajectories in preference labeling, leading to low label efficiency and poor generalization in offline preference-based reinforcement learning (PbRL). To address this, we propose CLARIFY, the first framework to integrate contrastive learning into offline PbRL. CLARIFY constructs a trajectory embedding space infused with preference information, explicitly disentangling ambiguous preferences and enhancing the model’s ability to identify query-level ambiguity. It employs a pairwise preference loss to optimize the embedding structure, yielding semantically clear and interpretable trajectory representations. Experiments under both imperfect teacher demonstrations and real human feedback demonstrate that CLARIFY significantly outperforms existing baselines: it improves query discriminability by 32%, while achieving higher labeling efficiency and superior policy generalization.
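To make the mechanism concrete, here is a minimal sketch of a pairwise preference loss of the kind the summary describes, assuming a hinge-style contrastive objective over segment embeddings. The function name, margin value, and labeling convention are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(z_a, z_b, clear_mask, margin=1.0):
    """Contrastive-style loss over trajectory embeddings (illustrative sketch).

    z_a, z_b   : (batch, dim) embeddings of the two segments in each query.
    clear_mask : (batch,) 1.0 where the teacher gave a clear preference,
                 0.0 where the query was judged ambiguous.

    Clearly preferred pairs are pushed at least `margin` apart, while
    ambiguous pairs are pulled together, so distance in the embedding
    space tracks how distinguishable a query is.
    """
    dist = torch.norm(z_a - z_b, dim=-1)
    spread = clear_mask * F.relu(margin - dist).pow(2)   # separate clear pairs
    pull = (1.0 - clear_mask) * dist.pow(2)              # collapse ambiguous pairs
    return (spread + pull).mean()
```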

📝 Abstract
Preference-based reinforcement learning (PbRL) bypasses explicit reward engineering by inferring reward functions from human preference comparisons, enabling better alignment with human intentions. However, humans often struggle to label a clear preference between similar segments, reducing label efficiency and limiting PbRL's real-world applicability. To address this, we propose an offline PbRL method: Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY), which learns a trajectory embedding space that incorporates preference information, ensuring that clearly distinguished segments are spaced apart and thus facilitating the selection of more unambiguous queries. Extensive experiments demonstrate that CLARIFY outperforms baselines under both non-ideal teacher and real human feedback settings. Our approach not only selects more clearly distinguishable queries but also learns meaningful trajectory embeddings.
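The abstract's claim that spaced-apart embeddings facilitate query selection suggests a simple criterion: query the candidate pairs that lie farthest apart in the learned space. The sketch below illustrates that distance-based ranking; the function name and tensor layout are hypothetical, not taken from the paper.

```python
import torch

def select_unambiguous_queries(embeddings, candidate_pairs, k):
    """Rank candidate queries by embedding distance and keep the top-k.

    embeddings      : (num_segments, dim) trajectory embeddings.
    candidate_pairs : (num_pairs, 2) long tensor of segment-index pairs.
    k               : number of queries to send to the annotator.

    Pairs far apart in the learned space are assumed to be easy for a
    human to compare, so they are queried first.
    """
    z_a = embeddings[candidate_pairs[:, 0]]
    z_b = embeddings[candidate_pairs[:, 1]]
    dist = torch.norm(z_a - z_b, dim=-1)
    top = torch.topk(dist, k).indices
    return candidate_pairs[top]
```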
Problem

Research questions and friction points this paper is trying to address.

Addresses ambiguity in human preference labels for reinforcement learning
Improves label efficiency in preference-based reward learning
Enhances trajectory embedding clarity for better query selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning for ambiguous feedback resolution
Offline preference-based reinforcement learning method
Trajectory embedding space with preference information (see the encoder sketch after this list)
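As referenced above, one plausible way to realize a trajectory embedding space is to encode each (state, action) segment with a small recurrent network. The TrajectoryEncoder class, its GRU backbone, and all dimensions below are assumptions for illustration; the paper's actual encoder architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Maps a segment of (state, action) pairs to a single embedding (sketch)."""

    def __init__(self, state_dim, action_dim, hidden_dim=128, embed_dim=64):
        super().__init__()
        self.rnn = nn.GRU(state_dim + action_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, embed_dim)

    def forward(self, states, actions):
        # states: (batch, T, state_dim), actions: (batch, T, action_dim)
        x = torch.cat([states, actions], dim=-1)
        _, h = self.rnn(x)                  # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0))      # (batch, embed_dim)
```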
👥 Authors

Ni Mu
Department of Automation, Tsinghua University, Beijing, China
Hao Hu
Moonshot AI, Beijing, China
Xiao Hu
Department of Automation, Tsinghua University, Beijing, China
Yiqin Yang
Assistant Professor, Institute of Automation, Chinese Academy of Sciences
Reinforcement Learning · Embodied Intelligence
Bo Xu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Qing-Shan Jia
Department of Automation, Tsinghua University, Beijing, China