DAPPER: Discriminability-Aware Policy-to-Policy Preference-Based Reinforcement Learning for Query-Efficient Robot Skill Acquisition

📅 2025-05-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the low query efficiency of preference-based reinforcement learning (PbRL) in robotic skill acquisition, identifying its root cause as policy bias, which limits trajectory diversity and diminishes human preference discriminability. To resolve this, we first introduce a *preference discriminability metric*, then propose a *multi-policy parallel training framework*, and design a *discriminator-driven query sampling mechanism* that prioritizes discriminable queries. These components jointly optimize reward modeling and discriminability objectives, augmented by trajectory diversity regularization. Evaluated on both simulated and real-world quadrupedal robot tasks, our method significantly improves query efficiency, achieving over 30% improvement over state-of-the-art approaches under low-discriminability preference settings. The approach establishes a new paradigm for efficient and robust human-feedback-driven learning in robotics.

πŸ“ Abstract
Preference-based Reinforcement Learning (PbRL) enables policy learning through simple queries comparing trajectories from a single policy. While human responses to these queries make it possible to learn policies aligned with human preferences, PbRL suffers from low query efficiency, as policy bias limits trajectory diversity and reduces the number of discriminable queries available for learning preferences. This paper identifies preference discriminability, which quantifies how easily a human can judge which trajectory is closer to their ideal behavior, as a key metric for improving query efficiency. To address this, we move beyond comparisons within a single policy and instead generate queries by comparing trajectories from multiple policies, as training them from scratch promotes diversity without policy bias. We propose Discriminability-Aware Policy-to-Policy Preference-Based Efficient Reinforcement Learning (DAPPER), which integrates preference discriminability with trajectory diversification achieved by multiple policies. DAPPER trains new policies from scratch after each reward update and employs a discriminator that learns to estimate preference discriminability, enabling the prioritized sampling of more discriminable queries. During training, it jointly maximizes the preference reward and preference discriminability score, encouraging the discovery of highly rewarding and easily distinguishable policies. Experiments in simulated and real-world legged robot environments demonstrate that DAPPER outperforms previous methods in query efficiency, particularly under challenging preference discriminability conditions.
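The query-selection idea in the abstract can be sketched with a Bradley-Terry preference model, the standard choice in PbRL for relating predicted trajectory returns to human preference probabilities. The snippet below is an illustrative sketch under that assumption, not the paper's implementation; the candidate returns and the discriminability proxy (distance of the predicted preference from 0.5) are made up for the example.

```python
import math

def preference_prob(r_a, r_b):
    # Bradley-Terry model: P(A preferred over B) = sigmoid(r_a - r_b),
    # where r_a and r_b are predicted returns of the two trajectories.
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def discriminability(r_a, r_b):
    # Proxy score in [0, 1]: a predicted preference far from 0.5 means
    # the pair should be easy for a human to judge.
    return 2.0 * abs(preference_prob(r_a, r_b) - 0.5)

# Hypothetical candidate query pairs (predicted returns of two trajectories).
candidates = [(1.2, 0.9), (2.0, 0.1), (0.5, 0.49)]
scores = [discriminability(a, b) for a, b in candidates]

# Prioritized sampling in miniature: pick the most discriminable query.
best = max(range(len(candidates)), key=scores.__getitem__)
```

Here the pair with the largest return gap wins; DAPPER instead learns this score with a discriminator, so queries can be ranked without assuming the reward model's gap is a faithful proxy.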
Problem

Research questions and friction points this paper is trying to address.

Improves query efficiency in preference-based reinforcement learning
Enhances trajectory diversity by comparing multiple policies
Maximizes preference reward and discriminability for robot skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compares trajectories from multiple policies
Trains new policies from scratch
Prioritizes sampling discriminable queries
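The three bullets above fit together as one outer loop: retrain a policy from scratch each round, pair its trajectories against earlier policies' trajectories, and keep the most discriminable pairs for the human. A minimal sketch of that loop with toy stand-ins for training and rollouts; every name here is illustrative and none is taken from the paper's code.

```python
import random

random.seed(0)

def train_policy_from_scratch(round_idx):
    # Stand-in for RL training: a "policy" is just the mean return it reaches.
    return {"id": round_idx, "mean_return": random.uniform(0.0, 2.0)}

def rollout(policy):
    # One noisy trajectory return sampled from the policy.
    return policy["mean_return"] + random.gauss(0.0, 0.1)

def discriminability(ret_a, ret_b):
    # Proxy: a larger return gap makes the pair easier for a human to judge.
    return abs(ret_a - ret_b)

policies, queries = [], []
for rnd in range(3):
    new_policy = train_policy_from_scratch(rnd)  # fresh policy each round
    # Policy-to-policy queries: pair the new policy against earlier ones.
    for old in policies:
        a, b = rollout(new_policy), rollout(old)
        queries.append((discriminability(a, b), new_policy["id"], old["id"]))
    policies.append(new_policy)

# Prioritize the most discriminable queries for the human annotator;
# their answers would then update the reward model before the next round.
queries.sort(reverse=True)
```

Training from scratch each round is what breaks policy bias: successive policies are not fine-tuned copies of one another, so cross-policy pairs stay diverse.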