🤖 AI Summary
Imitation learning suffers from low sample efficiency and difficulty in surpassing expert performance. To address these challenges, we propose a dual-exploration framework: (1) an optimism-driven objective, constructed from policy uncertainty, that accelerates convergence toward expert behavior; and (2) a curiosity-based exploration reward that actively visits state regions unobserved in the expert demonstrations, enabling the learner to surpass the expert. Our method integrates uncertainty-regularized policy optimization within a reinforcement learning framework. Evaluated on Atari and MuJoCo benchmarks, it achieves superior sample efficiency with only a small number of expert demonstrations, significantly outperforming existing state-of-the-art methods and attaining beyond-expert performance. Theoretical analysis establishes a regret bound that grows sublinearly in the number of episodes, providing a principled foundation for efficient imitation learning. This work introduces a novel paradigm that jointly leverages epistemic uncertainty and intrinsic motivation to bridge the gap between imitation and autonomous improvement.
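As a rough schematic of the dual-exploration objective (the paper's exact bonus definitions are not reproduced in this summary, so the symbols $\hat r$, $b^{\mathrm{unc}}$, $b^{\mathrm{cur}}$ and the coefficients $\beta_1, \beta_2$ are illustrative), the method can be read as policy optimization on a reward shaped by two bonuses:

$$\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{H} \hat r(s_t, a_t) + \beta_1\, b^{\mathrm{unc}}(s_t, a_t) + \beta_2\, b^{\mathrm{cur}}(s_t)\right],$$

where $\hat r$ is the reward recovered from the demonstrations, $b^{\mathrm{unc}}$ is large for state-action pairs with high epistemic uncertainty (driving optimistic convergence toward the expert), and $b^{\mathrm{cur}}$ is large for states far from the demonstration trajectories (driving beyond-expert exploration).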
📝 Abstract
Imitation learning is a central problem in reinforcement learning where the goal is to learn a policy that mimics the expert's behavior. In practice, it is often challenging to accurately learn the expert policy from a limited number of demonstrations due to the complexity of the state space. Moreover, it is essential to explore the environment and collect additional data to achieve beyond-expert performance. To overcome these challenges, we propose a novel imitation learning algorithm called Imitation Learning with Double Exploration (ILDE), which implements exploration in two aspects: (1) optimistic policy optimization via an exploration bonus that rewards state-action pairs with high uncertainty, to potentially improve convergence to the expert policy, and (2) curiosity-driven exploration of states that deviate from the demonstration trajectories, to potentially yield beyond-expert performance. Empirically, we demonstrate that ILDE outperforms state-of-the-art imitation learning algorithms in sample efficiency and achieves beyond-expert performance on Atari and MuJoCo tasks with fewer demonstrations than in previous work. We also provide a theoretical justification of ILDE as an uncertainty-regularized policy optimization method with optimistic exploration, leading to regret that grows sublinearly in the number of episodes.
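Below is a minimal sketch of how the two bonuses might be combined in practice, assuming an ensemble-disagreement uncertainty estimate and an RND-style curiosity signal; the function names, toy linear networks, and coefficients `beta1`/`beta2` are illustrative stand-ins, not the paper's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ENSEMBLE = 4, 5

# Toy ensemble of linear Q-estimates; disagreement among members serves
# as a stand-in for the epistemic uncertainty behind the optimism bonus.
q_weights = [rng.normal(size=STATE_DIM) for _ in range(N_ENSEMBLE)]

def uncertainty_bonus(state):
    preds = np.array([w @ state for w in q_weights])
    return preds.std()  # high disagreement => under-explored region

# RND-style curiosity: prediction error of a trainable network against a
# fixed random target is large on states unlike those seen in the demos.
target_w = rng.normal(size=(STATE_DIM, STATE_DIM))
pred_w = np.zeros((STATE_DIM, STATE_DIM))  # would be trained online

def curiosity_bonus(state):
    return float(np.sum((pred_w @ state - target_w @ state) ** 2))

def shaped_reward(r_imitation, state, beta1=0.1, beta2=0.05):
    """Double-exploration reward: imitation reward plus two bonuses."""
    return (r_imitation
            + beta1 * uncertainty_bonus(state)
            + beta2 * curiosity_bonus(state))

s = rng.normal(size=STATE_DIM)
print(shaped_reward(r_imitation=1.0, state=s))
```

The shaped reward would then be fed to any standard policy-optimization loop; as the uncertainty estimate tightens near the demonstrations, the optimism term fades while the curiosity term keeps directing exploration toward states the expert never visited.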