Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations

📅 2025-09-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In sparse-reward reinforcement learning, dynamically determining *when* to imitate expert demonstrations—rather than relying solely on the agent’s autonomous policy—remains a critical challenge. This paper introduces SPReD (Smooth Policy Regularisation from Demonstrations), a framework that models the distribution of Q-values using an ensemble-based approach to quantify uncertainty in action-advantage estimation. Leveraging this uncertainty, SPReD generates continuous, proportionally scaled imitation weights, enabling smooth, uncertainty-aware regularisation between the learned policy and expert demonstrations. Unlike conventional binary imitation decisions, this mechanism substantially reduces gradient variance and enables robust learning from limited expert demonstrations. Empirical evaluation across eight robotic manipulation tasks demonstrates that SPReD achieves up to 14× performance improvement over state-of-the-art methods on complex tasks, while exhibiting strong robustness to variations in both demonstration quality and quantity.

📝 Abstract
In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We develop two complementary uncertainty-aware methods: a probabilistic approach estimating the likelihood of demonstration superiority, and an advantage-based approach scaling imitation by statistical significance. Unlike prevailing methods (e.g. Q-filter) that make binary imitation decisions, SPReD applies continuous, uncertainty-proportional regularisation weights, reducing gradient variance during training. Despite its computational simplicity, SPReD achieves remarkable gains in experiments across eight robotics tasks, outperforming existing approaches by up to a factor of 14 in complex tasks while maintaining robustness to demonstration quality and quantity. Our code is available at https://github.com/YujieZhu7/SPReD.
Problem

Research questions and friction points this paper is trying to address.

Determining when an agent should imitate demonstrations versus follow its own policy
Timing imitation decisions in reinforcement learning with sparse rewards
Using uncertainty modelling to guide demonstration imitation decisions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble methods model Q-value distributions to quantify uncertainty
Continuous, uncertainty-proportional regularisation weights replace binary imitation decisions
Probabilistic and advantage-based approaches scale the strength of imitation
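The probabilistic approach described in the abstract can be illustrated with a minimal sketch: an ensemble of critics scores both the demonstration action and the policy action, and the imitation weight is the estimated probability that the demonstration action has the higher Q-value. The function name, the Gaussian fit to the ensemble spread, and the numerical constants below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from math import erf, sqrt

def imitation_weight(q_demo_ensemble, q_pi_ensemble):
    """Continuous imitation weight in [0, 1] from an ensemble of Q-estimates.

    q_demo_ensemble: per-critic Q-values for the demonstration action.
    q_pi_ensemble:   per-critic Q-values for the policy's own action.

    Fits a Gaussian to the per-critic advantage estimates and returns
    P(advantage > 0), i.e. the probability that imitating the
    demonstration is better than following the current policy.
    """
    adv = np.asarray(q_demo_ensemble) - np.asarray(q_pi_ensemble)
    mu, sigma = adv.mean(), adv.std() + 1e-8  # epsilon avoids division by zero
    # Gaussian CDF at zero, expressed via the error function
    return 0.5 * (1.0 + erf(mu / (sigma * sqrt(2.0))))

# Demonstration clearly better: weight near 1; clearly worse: near 0;
# indistinguishable under ensemble uncertainty: weight near 0.5.
w_hi = imitation_weight([2.0, 2.1, 1.9], [1.0, 1.1, 0.9])
w_lo = imitation_weight([1.0, 1.1, 0.9], [2.0, 2.1, 1.9])
w_eq = imitation_weight([1.0, 1.5, 0.5], [1.5, 0.5, 1.0])
```

Unlike a Q-filter's hard 0/1 gate, this weight degrades gracefully when the ensemble disagrees, which is the mechanism the abstract credits for reduced gradient variance.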
Yujie Zhu
Department of Statistics, University of Warwick
Charles A. Hepburn
Department of Statistics, University of Warwick
Matthew Thorpe
Associate Professor in Statistics, University of Warwick
Giovanni Montana
Professor of Data Science, University of Warwick
Data Science · Machine Learning · Digital Healthcare