Predictive Preference Learning from Human Interventions

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing interactive imitation learning corrects only the agent’s immediate behavior upon human intervention, neglecting proactive avoidance of future high-risk states. To address this, we propose a predictive preference learning framework that leverages implicit preference signals embedded in human interventions to forecast behavioral risk over an L-step lookahead horizon. It actively propagates each intervention to potentially hazardous trajectory segments via temporal preference propagation, intervention-guided trajectory prediction, and preference-aware policy optimization, thereby generalizing expert corrections to safety-critical regions. Theoretical analysis guides the optimal selection of the lookahead horizon L, balancing label fidelity and coverage. Experiments on autonomous driving and robotic manipulation tasks demonstrate that our method significantly reduces human interventions (by 37% on average) and enhances policy safety and sample efficiency, validating its generality and effectiveness across multiple benchmarks.

📝 Abstract
Learning from human involvement aims to incorporate a human supervisor who monitors and corrects agent behavior errors. Although most interactive imitation learning methods focus on correcting the agent's action at the current state, they do not adjust its actions in future states, which may be potentially more hazardous. To address this, we introduce Predictive Preference Learning from Human Interventions (PPL), which leverages the implicit preference signals contained in human interventions to inform predictions of future rollouts. The key idea of PPL is to bootstrap each human intervention into L future time steps, called the preference horizon, under the assumption that the agent follows the same action and the human makes the same intervention throughout the preference horizon. By applying preference optimization on these future states, expert corrections are propagated into the safety-critical regions the agent is expected to explore, significantly improving learning efficiency and reducing the number of human demonstrations needed. We evaluate our approach with experiments on both autonomous driving and robotic manipulation benchmarks and demonstrate its efficiency and generality. Our theoretical analysis further shows that selecting an appropriate preference horizon L balances coverage of risky states with label correctness, thereby bounding the algorithmic optimality gap. Demo and code are available at: https://metadriverse.github.io/ppl
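The bootstrapping idea described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' released code: `bootstrap_intervention`, `PreferencePair`, and `predict_next` are made-up names, and the toy 1-D dynamics stand in for a learned trajectory predictor. The sketch shows how a single intervention at one state yields L (state, preferred, rejected) preference pairs over the predicted rollout.

```python
# Illustrative sketch of PPL-style preference-horizon bootstrapping
# (hypothetical names; not the paper's implementation). When a human
# intervenes, the single correction (human action preferred over agent
# action) is propagated to the next L predicted states.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    state: tuple       # predicted future state
    preferred: tuple   # human's corrective action (assumed repeated)
    rejected: tuple    # agent's original action (assumed repeated)

def bootstrap_intervention(state, agent_action, human_action,
                           predict_next, horizon_L):
    """Propagate one intervention over an L-step preference horizon.

    Assumes the agent would keep taking `agent_action` and the human
    would keep issuing `human_action`, so every predicted future state
    inherits the same (preferred, rejected) label.
    """
    pairs = []
    s = state
    for _ in range(horizon_L):
        pairs.append(PreferencePair(s, human_action, agent_action))
        # Roll the dynamics forward under the agent's (rejected) action,
        # since these are the states the agent is expected to visit.
        s = predict_next(s, agent_action)
    return pairs

# Toy 1-D dynamics: state is (position,), action is (velocity,).
pairs = bootstrap_intervention(
    state=(0.0,), agent_action=(1.0,), human_action=(-1.0,),
    predict_next=lambda s, a: (s[0] + a[0],), horizon_L=3)
```

The resulting pairs would then feed a standard preference-optimization loss; the theoretical analysis in the paper concerns how large `horizon_L` can be before the assumed repeated labels become incorrect.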
Problem

Research questions and friction points this paper is trying to address.

How to use human interventions to correct agent actions in future hazardous states, not just the current one
How to propagate expert corrections into the safety-critical regions the agent is likely to visit
How to choose a preference horizon that balances coverage of risky states against label correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bootstrapping each human intervention across an L-step preference horizon
Applying preference optimization to predicted future states
Theoretical analysis showing the choice of horizon L bounds the optimality gap