🤖 AI Summary
To align robot policies with human preferences sample-efficiently in human-robot collaboration, this paper proposes MEReQ, an alignment framework built on interactive imitation learning from human intervention. Methodologically, it introduces residual reward modeling: rather than reconstructing the human expert's full reward function, it learns only the discrepancy between the expert's reward and the prior policy's reward, eliminating redundant reward inference. The residual reward is inferred with maximum-entropy inverse reinforcement learning from sparse human interventions, and the policy is then aligned using Residual Q-Learning (RQL), which reuses the prior policy instead of retraining from scratch. Evaluated on both simulated and real-robot tasks, the method reduces the number of required human intervention samples by over 50% compared to state-of-the-art baselines, significantly accelerating policy alignment while generalizing to unseen scenarios.
📝 Abstract
Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention.
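The residual decomposition at the heart of the method can be illustrated numerically. The sketch below is my own toy construction, not the paper's implementation: it assumes a small tabular max-entropy MDP with made-up rewards (`r_prior`, `r_res`), transitions `P`, and temperature `alpha`, and shows that the soft-optimal Q-function for the combined reward splits into the prior policy's Q plus a residual Q that can be computed from the residual reward and the prior policy alone.

```python
import numpy as np

# Toy illustration of the residual idea (hypothetical setup, not the paper's code):
# in max-entropy RL, the soft-optimal Q for reward r_prior + r_res decomposes as
#   Q_total = Q_prior + Q_R,
# where Q_R satisfies a Bellman equation that involves only the residual reward
# and the prior policy -- the prior's reward never needs to be recovered.

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.5  # alpha: entropy temperature

r_prior = rng.normal(size=(n_states, n_actions))            # prior task reward
r_res = 0.3 * rng.normal(size=(n_states, n_actions))        # human-preference residual
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probs

def soft_q_iteration(r, iters=500):
    """Standard soft (max-entropy) Q-iteration for reward r."""
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        v = alpha * np.log(np.exp(q / alpha).sum(axis=1))   # soft state value
        q = r + gamma * P @ v
    return q

q_prior = soft_q_iteration(r_prior)
q_total = soft_q_iteration(r_prior + r_res)  # full relearning, which RQL avoids

# Residual Q-iteration (RQL-style): uses only r_res and the prior *policy*.
pi_prior = np.exp(q_prior / alpha)
pi_prior /= pi_prior.sum(axis=1, keepdims=True)

q_r = np.zeros((n_states, n_actions))
for _ in range(500):
    v_r = alpha * np.log((pi_prior * np.exp(q_r / alpha)).sum(axis=1))
    q_r = r_res + gamma * P @ v_r

# The decomposition holds: prior Q plus residual Q recovers the aligned Q.
print(np.allclose(q_prior + q_r, q_total, atol=1e-5))  # True
```

The check at the end confirms the identity that makes the approach sample-efficient: once a residual reward is inferred from interventions, only the (typically small) residual Q needs to be learned, while the prior policy is reused as-is.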