🤖 AI Summary
To address low learning efficiency in deep reinforcement learning (DRL) agents for human-robot collaboration—caused by sparse extrinsic rewards and unpredictable human behavior—this paper proposes a behavior- and context-aware dual intrinsic reward mechanism. The method introduces, for the first time, a synergistic intrinsic reward that jointly models human motivation and AI self-motivation, coupled with a learning-progress-sensitive dynamic context-weighting strategy that simultaneously mitigates insufficient exploration and exploitation bias. The framework integrates intrinsic motivation modeling, context-aware weight optimization, and logarithmic sparse-reward capture. Evaluated on the Overcooked benchmark, the approach achieves approximately 20% higher cumulative sparse reward and reduces policy convergence time by about 67% compared to state-of-the-art methods, significantly improving both the learning efficiency and the robustness of collaborative policies.
📝 Abstract
Deep Reinforcement Learning (DRL) offers a powerful framework for training AI agents to coordinate with human partners. However, DRL faces two critical challenges in human-AI coordination (HAIC): sparse rewards and unpredictable human behaviors. These challenges significantly limit DRL's ability to identify effective coordination policies by impairing how it balances exploration and exploitation. To address these limitations, we propose an innovative behavior- and context-aware reward (BCR) for DRL, which optimizes exploration and exploitation by leveraging human behaviors and contextual information in HAIC. Our BCR consists of two components: (i) novel dual intrinsic rewards to enhance exploration. This scheme combines an AI self-motivated intrinsic reward with a human-motivated intrinsic reward, both designed to increase the capture of sparse rewards through a logarithmic strategy; and (ii) new context-aware weights for the designed rewards to improve exploitation. This mechanism helps the AI agent prioritize actions that better coordinate with the human partner by utilizing contextual information that reflects the evolution of learning in HAIC. Extensive simulations in the Overcooked environment demonstrate that our approach can increase the cumulative sparse rewards by approximately 20% and reduce the convergence time by about 67% compared to state-of-the-art baselines.
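To make the reward structure concrete, the sketch below shows one plausible way to combine a log-scaled sparse extrinsic reward with the two intrinsic terms under context-aware weights. Every name, the specific weighting rule (a linear shift from exploration toward exploitation as learning progresses), and the use of `log1p` are illustrative assumptions, not the paper's actual formulation:

```python
import math

def shaped_reward(r_sparse, r_ai_intrinsic, r_human_intrinsic,
                  learning_progress):
    """Hypothetical sketch of a behavior- and context-aware reward (BCR).

    r_sparse          : sparse extrinsic reward from the environment (>= 0)
    r_ai_intrinsic    : AI self-motivated intrinsic reward (assumed term)
    r_human_intrinsic : human-motivated intrinsic reward (assumed term)
    learning_progress : value in [0, 1] reflecting how far training
                        has advanced (assumed context signal)
    """
    # Context-aware weights (assumption): early in learning, weight the
    # intrinsic terms highly to encourage exploration; as progress grows,
    # fade them out to favor exploitation of the sparse reward.
    w_explore = 1.0 - learning_progress
    w_ai = w_explore
    w_human = w_explore

    # Logarithmic scaling compresses large sparse rewards so rare
    # reward events are captured without dominating the update.
    r_log = math.log1p(max(r_sparse, 0.0))

    return r_log + w_ai * r_ai_intrinsic + w_human * r_human_intrinsic
```

With `learning_progress = 0` the agent receives the full intrinsic bonus; with `learning_progress = 1` only the log-scaled sparse reward remains, so the same function smoothly trades exploration for exploitation over training.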