Reinforcement Learning via Implicit Imitation Guidance

📅 2025-06-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Demonstration-guided, sample-efficient reinforcement learning often suffers performance degradation because behavior-cloning and reward-maximization objectives are misaligned in existing imitation learning approaches. Method: Data-Guided Noise (DGN) uses offline demonstrations solely to identify promising action directions, fully decoupling exploration guidance from reward maximization. Within a policy gradient framework, DGN injects action-space noise informed by the demonstration trajectories, implicitly steering exploration toward high-return regions without explicit imitation. Contribution/Results: Across seven simulated continuous-control tasks, DGN achieves up to 2–3× improvement over prior methods for reinforcement learning from offline data, while avoiding the long-term performance degradation associated with explicit imitation objectives.

๐Ÿ“ Abstract
We study the problem of sample-efficient reinforcement learning, where prior data such as demonstrations are provided for initialization in lieu of a dense reward signal. A natural approach is to incorporate an imitation learning objective, either as regularization during training or to acquire a reference policy. However, imitation learning objectives can ultimately degrade long-term performance, as they do not directly align with reward maximization. In this work, we propose to use prior data solely for guiding exploration via noise added to the policy, sidestepping the need for explicit behavior cloning constraints. The key insight in our framework, Data-Guided Noise (DGN), is that demonstrations are most useful for identifying which actions should be explored, rather than forcing the policy to take certain actions. Our approach achieves up to 2-3x improvement over prior reinforcement learning from offline data methods across seven simulated continuous control tasks.
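The core idea of adding demonstration-informed noise to the policy's actions can be sketched in a few lines. The following is a minimal illustration, not the paper's algorithm: the helper names (`demo_noise_stats`, `explore_action`), the nearest-neighbor statistic, and the constants `k` and `alpha` are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def demo_noise_stats(demo_states, demo_actions, state, k=8):
    """Summarize the k demonstration actions nearest to `state`.

    Hypothetical helper: the paper's exact statistic is not specified here,
    so we use the mean and standard deviation of nearby demo actions.
    """
    dists = np.linalg.norm(demo_states - state, axis=1)
    idx = np.argsort(dists)[:k]
    nearby = demo_actions[idx]
    return nearby.mean(axis=0), nearby.std(axis=0) + 1e-3

def explore_action(policy_action, demo_states, demo_actions, state, alpha=0.5):
    """Add demonstration-guided noise to the policy's action.

    The policy is never trained to match the demos; the bias only shapes
    which actions get tried during exploration.
    """
    mu, sigma = demo_noise_stats(demo_states, demo_actions, state)
    noise = alpha * (mu - policy_action) + sigma * rng.standard_normal(policy_action.shape)
    return policy_action + noise
```

Note the decoupling: the demonstrations influence only the sampled action, so any downstream policy-gradient update still optimizes the environment reward alone, with no behavior-cloning loss term.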
Problem

Research questions and friction points this paper is trying to address.

Sample-efficient reinforcement learning with prior demonstration data
Avoiding the long-term performance degradation caused by imitation learning objectives
Guiding exploration via noise rather than behavior cloning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses prior demonstration data only to guide exploration
Avoids explicit behavior cloning constraints
Injects demonstration-informed noise into the policy's actions during exploration
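To see why exploration guidance alone can suffice, consider a toy end-to-end loop. This is a deliberately simplified sketch, not the paper's method: a 1-D continuous bandit with a Gaussian policy, where demonstrations bias only the sampled action and the update is plain REINFORCE on reward. All constants and the drift rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D continuous-action bandit: reward peaks at action a = 1.0.
def reward(a):
    return -(a - 1.0) ** 2

# Demonstrations cluster near the optimum. They never enter a loss term;
# they only shape the exploration noise (hypothetical simplification).
demo_actions = 1.0 + 0.1 * rng.standard_normal(100)

mean, sigma = 0.0, 0.3   # Gaussian policy N(mean, sigma), fixed sigma
lr, baseline = 0.01, 0.0

for _ in range(5000):
    # Data-guided noise: drift the sampled action toward a random demo action.
    drift = 0.1 * (demo_actions[rng.integers(len(demo_actions))] - mean)
    a = mean + sigma * rng.standard_normal() + drift
    # Plain REINFORCE on the environment reward -- no imitation loss.
    r = reward(a)
    mean += lr * (r - baseline) * (a - mean) / sigma**2
    baseline += 0.1 * (r - baseline)
```

Because the demonstrations never appear in the objective, the policy mean is free to settle wherever reward is maximized; the demo-shaped noise merely makes high-reward actions more likely to be sampled early.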