🤖 AI Summary
In sparse-reward environments, reinforcement learning suffers from inefficient exploration and poor sample efficiency due to infrequent and delayed feedback. To address this, we propose a value function initialization method that leverages a small number of (even suboptimal) successful demonstrations to estimate state-action values offline; these estimates serve as informative priors for online Q-learning, forming a lightweight “offline warm-start + online fine-tuning” paradigm. Our approach requires no architectural modifications or auxiliary modules, and it substantially reduces the early-stage exploration burden. On standard benchmark tasks, it converges significantly faster than baseline algorithms, and it is robust to variations in both the quantity and the quality of the demonstrations. These empirical results validate the effectiveness and practicality of value priors in sparse-reward settings: the method offers a simple yet powerful mechanism for improving learning efficiency without increasing model complexity.
📝 Abstract
Reinforcement learning (RL) in sparse-reward environments remains a significant challenge due to the lack of informative feedback. We propose a simple yet effective method that uses a small number of successful demonstrations to initialize the value function of an RL agent. By precomputing value estimates from offline demonstrations and using them as targets for early learning, our approach provides the agent with a useful prior over promising actions. The agent then refines these estimates through standard online interaction. This hybrid offline-to-online paradigm significantly reduces the exploration burden and improves sample efficiency in sparse-reward settings. Experiments on benchmark tasks demonstrate that our method accelerates convergence and outperforms standard baselines, even with minimal or suboptimal demonstration data.
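The abstract describes the method only at a high level. As a minimal illustration of the general idea (not the paper's exact algorithm), the sketch below warm-starts a tabular Q-table with the Monte Carlo returns of a single successful demonstration on a toy sparse-reward chain MDP, then refines it with standard ε-greedy Q-learning. The environment, function names, and hyperparameters are all illustrative assumptions.

```python
import numpy as np

# Toy sparse-reward chain MDP: states 0..N-1, actions {0: left, 1: right}.
# Reward 1.0 only on the transition into the terminal state N-1.
N_STATES, N_ACTIONS, GAMMA = 8, 2, 0.9

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

def collect_demo():
    """One successful (here: always-right) demonstration trajectory."""
    traj, s, done = [], 0, False
    while not done:
        s2, r, done = step(s, 1)
        traj.append((s, 1, r))
        s = s2
    return traj

def init_q_from_demo(demo, gamma=GAMMA):
    """Offline warm-start: set Q(s,a) along the demo to its Monte Carlo return.
    Only demonstrated (s, a) pairs get a prior; all others stay at zero."""
    Q = np.zeros((N_STATES, N_ACTIONS))
    G = 0.0
    for s, a, r in reversed(demo):
        G = r + gamma * G
        Q[s, a] = G
    return Q

def q_learning(Q, episodes=200, alpha=0.5, eps=0.1, max_steps=100, seed=0):
    """Online fine-tuning: standard ε-greedy tabular Q-learning."""
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        s, done, t = 0, False, 0
        while not done and t < max_steps:
            a = int(rng.integers(N_ACTIONS)) if rng.random() < eps else int(Q[s].argmax())
            s2, r, done = step(s, a)
            target = r if done else r + GAMMA * Q[s2].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s, t = s2, t + 1
    return Q

Q_warm = q_learning(init_q_from_demo(collect_demo()))  # warm start
Q_cold = q_learning(np.zeros((N_STATES, N_ACTIONS)))   # zero-init baseline
```

In this toy setup the warm-started agent follows the demonstrated direction from the first episode and quickly refines the prior toward the true values, whereas the zero-initialized baseline must stumble onto the single rewarding transition by chance, which is exactly the exploration burden the abstract refers to.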