Accelerated Online Reinforcement Learning using Auxiliary Start State Distributions

📅 2025-07-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the low sample efficiency of online reinforcement learning in sparse-reward environments. The authors propose an exploration-guidance method that combines expert demonstrations with a simulator's ability to reset to arbitrary states. The core idea is an auxiliary start-state distribution grounded in a formal notion of safety, with episode length serving as a lightweight proxy signal for dynamic reset selection: resets are prioritized near the termination states of short episodes, focusing exploration on promising but under-explored regions. The method requires no additional labeling or auxiliary model training, yet substantially improves exploration efficiency. On challenging sparse-reward benchmarks that demand extensive exploration, the approach achieves state-of-the-art sample efficiency, empirically validating the effectiveness and practicality of the safety-driven reset paradigm.

📝 Abstract
A long-standing problem in online reinforcement learning (RL) is that of ensuring sample efficiency, which stems from an inability to explore environments efficiently. Most attempts at efficient exploration tackle this problem in a setting where learning begins from scratch, without prior information available to bootstrap learning. However, such approaches fail to leverage expert demonstrations and simulators that can reset to arbitrary states. These affordances are valuable resources that offer enormous potential to guide exploration and speed up learning. In this paper, we explore how a small number of expert demonstrations and a simulator allowing arbitrary resets can accelerate learning during online RL. We find that training with a suitable choice of an auxiliary start state distribution that may differ from the true start state distribution of the underlying Markov Decision Process can significantly improve sample efficiency. We find that using a notion of safety to inform the choice of this auxiliary distribution significantly accelerates learning. By using episode length information as a way to operationalize this notion, we demonstrate state-of-the-art sample efficiency on a sparse-reward hard-exploration environment.
Problem

Research questions and friction points this paper is trying to address.

Improving sample efficiency in online reinforcement learning
Leveraging expert demonstrations for faster exploration
Using auxiliary start state distributions to accelerate learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses expert demonstrations to guide exploration
Employs simulator with arbitrary state resets
Uses a notion of safety to inform the choice of auxiliary start state distribution
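The reset-selection idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name, the exponential weighting scheme, and the simulator interface are all assumptions, and only the general mechanism (favoring resets near termination states of short episodes) comes from the summary.

```python
import random

class ResetSelector:
    """Hypothetical sketch: store termination states of past episodes and
    sample reset states, weighting short episodes more heavily (episode
    length acts as a lightweight proxy for the paper's safety notion)."""

    def __init__(self, temperature=5.0):
        self.records = []            # list of (terminal_state, episode_length)
        self.temperature = temperature

    def add_episode(self, terminal_state, episode_length):
        # Record where each rollout ended and how long it took.
        self.records.append((terminal_state, episode_length))

    def sample_reset_state(self, true_start_state):
        # With no history yet, fall back to the MDP's true start state.
        if not self.records:
            return true_start_state
        # Shorter episodes receive exponentially larger weight, so resets
        # concentrate near termination states of short rollouts.
        weights = [2.0 ** (-length / self.temperature)
                   for _, length in self.records]
        state, _ = random.choices(self.records, weights=weights, k=1)[0]
        return state
```

In an online RL loop, one would call `sample_reset_state` before each episode and pass the result to a simulator that supports arbitrary-state resets; the exact reset API depends on the environment.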