🤖 AI Summary
This work addresses insufficient policy alignment and safety in human-in-the-loop reinforcement learning (HIL-RL) in settings without explicit reward signals. The authors propose Proxy Value Propagation (PVP), a novel method that encodes real-time human interventions and demonstrations as binary high/low value labels, without requiring an external reward function, and propagates these values to unlabeled state-action pairs via temporal-difference (TD) learning. PVP is modular, integrating seamlessly with mainstream RL algorithms (e.g., SAC, PPO), and includes a lightweight human-in-the-loop interface. Experiments demonstrate that PVP significantly improves both policy safety and fidelity to human intent across continuous and discrete control benchmarks. Notably, in a realistic autonomous driving scenario within *Grand Theft Auto V*, PVP converges rapidly with minimal human intervention, validating its practical deployability in complex, reward-free environments.
📝 Abstract
Learning from active human involvement enables the human subject to actively intervene and demonstrate to the AI agent during training. The interaction and corrective feedback from the human bring safety and AI alignment to the learning process. In this work, we propose a new reward-free active human involvement method called Proxy Value Propagation for policy optimization. Our key insight is that a proxy value function can be designed to express human intents: state-action pairs in the human demonstration are labeled with high values, while agent actions that trigger human intervention receive low values. Through the TD-learning framework, the labeled values of demonstrated state-action pairs are further propagated to other unlabeled data generated from the agent's exploration. The proxy value function thus induces a policy that faithfully emulates human behaviors. Human-in-the-loop experiments show the generality and efficiency of our method. With minimal modification to existing reinforcement learning algorithms, our method learns to solve continuous and discrete control tasks with various human control devices, including the challenging task of driving in Grand Theft Auto V. Demo video and code are available at: https://metadriverse.github.io/pvp
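The labeling-then-propagation idea above can be illustrated with a minimal tabular sketch. This is not the paper's implementation (PVP operates on top of deep off-policy RL with function approximation); the table sizes, learning rate, and update functions below are assumptions chosen purely for illustration.

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # proxy value table (illustrative)
alpha, gamma = 0.5, 0.9              # assumed step size and discount

def proxy_value_update(s, a_agent, a_human):
    """When the human intervenes: label the human's demonstrated action
    with a high proxy value (+1) and the intervened agent action with a
    low proxy value (-1). No environment reward is used."""
    Q[s, a_human] += alpha * (1.0 - Q[s, a_human])   # push toward +1
    Q[s, a_agent] += alpha * (-1.0 - Q[s, a_agent])  # push toward -1

def td_update(s, a, s_next):
    """Propagate labeled values to unlabeled exploration transitions
    via a reward-free TD backup (zero reward term)."""
    target = gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Example: the human intervenes in state 0, replacing agent action 1
# with action 0; a later TD backup on an unlabeled transition (3 -> 0)
# propagates the high value backward.
proxy_value_update(s=0, a_agent=1, a_human=0)
td_update(s=3, a=0, s_next=0)
assert Q[0, 0] > 0 > Q[0, 1]  # demonstration high, intervened action low
assert Q[3, 0] > 0            # value propagated to unlabeled data
```

Greedy action selection under this proxy value table then prefers the human-endorsed action, which is the sense in which the proxy value function induces a policy emulating human behavior.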