🤖 AI Summary
In sparse-reward reinforcement learning, deciding *when* to imitate expert demonstrations, rather than follow the agent's own policy, remains a critical challenge. This paper introduces SPReD (Smooth Policy Regularisation from Demonstrations), a framework that uses an ensemble to model the distributions of Q-values for both demonstration and policy actions, quantifying the uncertainty in their comparison. Leveraging this uncertainty, SPReD assigns continuous, proportionally scaled imitation weights, enabling smooth, uncertainty-aware regularisation between the learned policy and expert demonstrations. Unlike conventional binary imitation decisions, this mechanism substantially reduces gradient variance and enables robust learning from limited expert demonstrations. Empirical evaluation across eight robotic manipulation tasks shows that SPReD outperforms state-of-the-art methods by up to a factor of 14 on complex tasks, while remaining robust to variations in both demonstration quality and quantity.
📝 Abstract
In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We develop two complementary uncertainty-aware methods: a probabilistic approach estimating the likelihood of demonstration superiority, and an advantage-based approach scaling imitation by statistical significance. Unlike prevailing methods (e.g. Q-filter) that make binary imitation decisions, SPReD applies continuous, uncertainty-proportional regularisation weights, reducing gradient variance during training. Despite its computational simplicity, SPReD achieves remarkable gains in experiments across eight robotics tasks, outperforming existing approaches by up to a factor of 14 in complex tasks while maintaining robustness to demonstration quality and quantity. Our code is available at https://github.com/YujieZhu7/SPReD.
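To make the probabilistic variant concrete, here is a minimal sketch of an uncertainty-proportional imitation weight computed from ensemble Q-estimates. The function name, the Gaussian fit to each ensemble, and the loss-combination comment are illustrative assumptions, not taken from the paper's released code:

```python
import numpy as np
from math import erf, sqrt

def imitation_weight_prob(q_demo, q_pi):
    """Continuous imitation weight: estimated P(Q(s, a_demo) > Q(s, a_pi)).

    q_demo, q_pi: 1-D arrays of ensemble Q-estimates for the demonstration
    action and the current policy action at the same state. Fitting an
    independent Gaussian to each ensemble is an illustrative choice, not
    necessarily the paper's exact estimator.
    """
    mean_diff = q_demo.mean() - q_pi.mean()
    std_diff = np.sqrt(q_demo.var() + q_pi.var()) + 1e-8  # avoid div-by-zero
    # P(mean_diff's underlying difference > 0) under the normal model,
    # via the standard-normal CDF expressed with erf.
    return 0.5 * (1.0 + erf(mean_diff / (std_diff * sqrt(2.0))))

# The weight scales a behavioural-cloning term rather than gating it:
# actor_loss = rl_loss + lambda_bc * imitation_weight_prob(qd, qp) * bc_loss
```

In contrast to a Q-filter's hard 0/1 decision from point estimates, this weight shrinks toward 0.5 when the two Q-distributions overlap, which is the property the abstract credits with reducing gradient variance.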