CueLearner: Bootstrapping and local policy adaptation from relative feedback

📅 2025-07-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In sparse-reward tasks, human feedback is often scarce and carries little information, hindering effective policy learning. Method: this paper proposes a novel off-policy reinforcement learning framework that integrates relative human feedback (e.g., "move slightly left") with policy optimization. Contribution/Results: First, it formalizes relative feedback in a general off-policy setting that supports arbitrary policy classes. Second, it introduces a local policy adaptation mechanism that responds in real time to dynamic environments and evolving user preferences. Third, it combines policy gradient updates with efficient feedback-utilization strategies to significantly improve sample efficiency. Evaluated on two sparse-reward simulation tasks, the method reduces training steps by a large margin compared to baselines. Furthermore, it transfers successfully to a real-world robot navigation task, demonstrating strong generalization and practical applicability.
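The core idea of relative feedback can be illustrated with a minimal sketch. Note this is a hypothetical example, not the paper's actual implementation: a directional cue like "left" is mapped to a correction vector, the executed action is nudged in that direction, and the corrected transition is relabeled into an off-policy replay buffer. All names (`FEEDBACK_DIRECTIONS`, `apply_relative_feedback`, `ReplayBuffer`) are illustrative assumptions.

```python
import numpy as np

# Map each verbal cue to a direction in a 2-D action space (assumed encoding).
FEEDBACK_DIRECTIONS = {
    "left":  np.array([-1.0,  0.0]),
    "right": np.array([ 1.0,  0.0]),
    "up":    np.array([ 0.0,  1.0]),
    "down":  np.array([ 0.0, -1.0]),
}

def apply_relative_feedback(action, feedback, step_size=0.1):
    """Shift the executed action in the direction the human indicated."""
    return action + step_size * FEEDBACK_DIRECTIONS[feedback]

class ReplayBuffer:
    """Minimal off-policy buffer; corrected actions are stored for reuse."""
    def __init__(self):
        self.transitions = []

    def add(self, state, action, reward, next_state):
        self.transitions.append((state, action, reward, next_state))

# The agent acted, the human said "left": store the corrected action.
state = np.zeros(2)
action = np.array([0.5, 0.2])
corrected = apply_relative_feedback(action, "left")

buffer = ReplayBuffer()
buffer.add(state, corrected, reward=0.0, next_state=state)
print(np.round(corrected, 2))  # [0.4 0.2]
```

Because the corrected transitions live in a replay buffer, any off-policy learner can consume them, which is what lets this style of feedback work with arbitrary policy classes rather than a specific policy-search method.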

๐Ÿ“ Abstract
Human guidance has emerged as a powerful tool for enhancing reinforcement learning (RL). However, conventional forms of guidance such as demonstrations or binary scalar feedback can be challenging to collect or have low information content, motivating the exploration of other forms of human input. Among these, relative feedback (i.e., feedback on how to improve an action, such as "more to the left") offers a good balance between usability and information richness. Previous research has shown that relative feedback can be used to enhance policy search methods. However, these efforts have been limited to specific policy classes and use feedback inefficiently. In this work, we introduce a novel method to learn from relative feedback and combine it with off-policy reinforcement learning. Through evaluations on two sparse-reward tasks, we demonstrate our method can be used to improve the sample efficiency of reinforcement learning by guiding its exploration process. Additionally, we show it can adapt a policy to changes in the environment or the user's preferences. Finally, we demonstrate real-world applicability by employing our approach to learn a navigation policy in a sparse reward setting.
Problem

Research questions and friction points this paper is trying to address.

Enhancing RL with relative human feedback for better guidance
Improving sample efficiency in sparse-reward RL tasks
Adapting policies to environment changes or user preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines relative feedback with off-policy RL
Improves RL sample efficiency via guided exploration
Adapts policies to environment or preference changes