🤖 AI Summary
This work addresses the challenges of low exploration efficiency and insufficient experience reuse that commonly hinder reinforcement learning in sparse-reward environments. The authors propose a PPO-based on-policy self-imitation algorithm that unifies handling of both dense and sparse reward settings: in dense-reward scenarios, it matches the state visitation distribution of optimal trajectories via optimal transport, while in sparse-reward settings, it uniformly replays successful trajectories to encourage structured exploration. By integrating self-imitation learning, optimal transport theory, and experience replay mechanisms, the method substantially improves sample efficiency and convergence speed. Empirical results demonstrate superior performance over existing self-imitation RL approaches on benchmarks including MuJoCo, Animal-AI Olympics, and PointMaze, achieving faster convergence and higher success rates.
📝 Abstract
Reinforcement Learning (RL) agents often struggle with inefficient exploration, particularly in environments with sparse rewards. Traditional exploration strategies can lead to slow learning and suboptimal performance because agents fail to systematically build on previously successful experiences, reducing sample efficiency. To tackle this issue, we propose a self-imitating on-policy algorithm that enhances exploration and sample efficiency by leveraging past high-reward state-action pairs to guide policy updates. Our method incorporates self-imitation via an optimal transport distance in dense-reward environments, prioritizing state visitation distributions that match the most rewarding trajectory. In sparse-reward environments, we uniformly replay successful self-encountered trajectories to facilitate structured exploration. Experimental results across diverse environments, including MuJoCo for dense rewards and the partially observable 3D Animal-AI Olympics and multi-goal PointMaze for sparse rewards, demonstrate substantial improvements in learning efficiency. Our approach achieves faster convergence and significantly higher success rates than state-of-the-art self-imitating RL baselines. These findings underscore the potential of self-imitation as a robust strategy for enhancing exploration in RL, with applicability to more complex tasks.
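The two self-imitation signals described above can be illustrated with a minimal sketch. This is not the authors' implementation: the buffer class, the exact OT solver (here, an exact assignment via `scipy.optimize.linear_sum_assignment` over a Euclidean cost matrix rather than, e.g., an entropic Sinkhorn solver), and the names `SelfImitationBuffer` and `ot_distance` are all illustrative assumptions. In the dense-reward branch the buffer keeps only the highest-return trajectory and scores new rollouts by their (negated) OT distance to it; in the sparse-reward branch it stores every successful trajectory and samples one uniformly for replay.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def ot_distance(states_a, states_b):
    """Exact optimal transport cost between two empirical state distributions.

    Uses a one-to-one assignment over the pairwise Euclidean cost matrix;
    a stand-in for whatever OT solver the paper actually uses.
    """
    cost = np.linalg.norm(states_a[:, None, :] - states_b[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()


class SelfImitationBuffer:
    """Illustrative buffer for the paper's two regimes (names are hypothetical).

    Dense rewards: retain only the best-return trajectory seen so far.
    Sparse rewards: retain all successful trajectories, replay uniformly.
    """

    def __init__(self, sparse):
        self.sparse = sparse
        self.trajectories = []
        self.best_return = -np.inf

    def add(self, states, ret, success=False):
        if self.sparse:
            if success:
                self.trajectories.append(states)
        elif ret > self.best_return:
            self.best_return = ret
            self.trajectories = [states]

    def imitation_bonus(self, states, rng):
        """Reward shaping term to add to the PPO objective (assumed form)."""
        if not self.trajectories:
            return 0.0
        if self.sparse:
            # uniform replay over successful self-encountered trajectories
            ref = self.trajectories[rng.integers(len(self.trajectories))]
        else:
            # match the state visitation distribution of the best trajectory
            ref = self.trajectories[0]
        return -ot_distance(states, ref)  # smaller OT distance -> larger bonus
```

A rollout identical to the stored best trajectory receives the maximal bonus of 0, and the bonus decreases as the visitation distributions diverge, which is the shaping behavior the abstract describes for the dense-reward case.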