🤖 AI Summary
This work addresses insufficient policy alignment and safety in human-in-the-loop reinforcement learning (HIL-RL) in settings without explicit reward signals. The authors propose Proxy Value Propagation (PVP), a novel method that encodes real-time human interventions and demonstrations as binary high/low value labels, without requiring an external reward function, and propagates these values to unlabeled state-action pairs via temporal-difference (TD) learning. PVP is modular, integrating seamlessly with mainstream RL algorithms (e.g., SAC, PPO), and includes a lightweight human-in-the-loop interface. Experiments demonstrate that PVP significantly improves both policy safety and fidelity to human intent across continuous and discrete control benchmarks. Notably, in a realistic autonomous driving scenario within *Grand Theft Auto V*, PVP converges rapidly with minimal human intervention, validating its practical deployability in complex, reward-free environments.
📝 Abstract
Learning from active human involvement enables the human subject to actively intervene and demonstrate to the AI agent during training. The interaction and corrective feedback from the human bring safety and AI alignment to the learning process. In this work, we propose a new reward-free active human involvement method called Proxy Value Propagation for policy optimization. Our key insight is that a proxy value function can be designed to express human intents: state-action pairs in the human demonstration are labeled with high values, while agent actions that trigger human intervention receive low values. Through the TD-learning framework, the labeled values of demonstrated state-action pairs are further propagated to other unlabeled data generated from the agent's exploration. The proxy value function thus induces a policy that faithfully emulates human behaviors. Human-in-the-loop experiments show the generality and efficiency of our method. With minimal modification to existing reinforcement learning algorithms, our method learns to solve continuous and discrete control tasks with various human control devices, including the challenging task of driving in Grand Theft Auto V. Demo video and code are available at: https://metadriverse.github.io/pvp
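The labeling-then-propagation idea above can be illustrated with a minimal tabular sketch. This is not the paper's implementation (PVP operates on top of deep off-policy RL with function approximation); the table sizes, learning rate, and update functions below are assumptions chosen purely for illustration.

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # proxy value table (illustrative)
alpha, gamma = 0.5, 0.9              # assumed step size and discount

def proxy_value_update(s, a_agent, a_human):
    """When the human intervenes: label the human's demonstrated action
    with a high proxy value (+1) and the intervened agent action with a
    low proxy value (-1). No environment reward is used."""
    Q[s, a_human] += alpha * (1.0 - Q[s, a_human])   # push toward +1
    Q[s, a_agent] += alpha * (-1.0 - Q[s, a_agent])  # push toward -1

def td_update(s, a, s_next):
    """Propagate labeled values to unlabeled exploration transitions
    via a reward-free TD backup (zero reward term)."""
    target = gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Example: the human intervenes in state 0, replacing agent action 1
# with action 0; a later TD backup on an unlabeled transition (3 -> 0)
# propagates the high value backward.
proxy_value_update(s=0, a_agent=1, a_human=0)
td_update(s=3, a=0, s_next=0)
assert Q[0, 0] > 0 > Q[0, 1]  # demonstration high, intervened action low
assert Q[3, 0] > 0            # value propagated to unlabeled data
```

Greedy action selection under this proxy value table then prefers the human-endorsed action, which is the sense in which the proxy value function induces a policy emulating human behavior.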