Safe Reinforcement Learning with Minimal Supervision

📅 2025-01-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Safety-critical reinforcement learning (RL) is difficult in real-world settings where expert demonstrations are sparse or entirely unavailable. Method: This paper proposes an unsupervised, RL-based framework for collecting offline data, coupled with an optimistic forgetting mechanism for safe online learning. Without handcrafted controllers or human demonstrations, the approach balances the diversity and optimality of the collected data through safe-set modeling and constrained online optimization. The optimistic forgetting mechanism is described as dynamically relaxing outdated constraints when samples are limited, which speeds up convergence to a safe policy. Contributions/Results: Experiments are reported to improve both the success rate and the convergence speed of safe policies under low demonstration budgets. The work empirically examines how the quantity and quality of offline data affect safe-RL performance, targeting tasks with spatially extended goal states.

📝 Abstract

Reinforcement learning (RL) in the real world requires procedures that let agents explore without causing harm to themselves or others. The most successful solutions to the problem of safe RL leverage offline data to learn a safe set, enabling safe online exploration. However, this approach to safe learning is often constrained by the demonstrations that are available for learning. In this paper we investigate how the quantity and quality of the data used to solve the initial safe-learning problem offline influence the ability to learn safe-RL policies online. Specifically, we focus on tasks with spatially extended goal states where few or no demonstrations are available. Classically, this problem is addressed either by using hand-designed controllers to generate data or by collecting user-generated demonstrations. However, these methods are often expensive and do not scale to more complex tasks and environments. To address this limitation, we propose an unsupervised RL-based offline data collection procedure that learns complex and scalable policies without the need for hand-designed controllers or user demonstrations. Our research demonstrates the significance of providing sufficient demonstrations for agents to learn optimal safe-RL policies online; as a result, we propose optimistic forgetting, a novel online safe-RL approach that is practical for scenarios with limited data. Further, our unsupervised data collection approach highlights the need to balance diversity and optimality for safe online exploration.
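
The general idea the abstract refers to, learning a safe set from offline data and using it to gate online exploration, can be illustrated with a minimal sketch. This is not the paper's method: the nearest-neighbor membership rule, the radius parameter, the known toy dynamics, and the names fit_safe_set and safe_step are all assumptions made for illustration, and optimistic forgetting is not modeled here.

```python
import math
import random


def fit_safe_set(offline_states, radius):
    """Build a safe-set membership test from offline data.

    Assumption for illustration: a state counts as safe if it lies within
    `radius` of at least one state visited in the offline dataset.
    """
    def is_safe(state):
        return any(math.dist(state, s) <= radius for s in offline_states)
    return is_safe


def safe_step(state, candidate_actions, dynamics, is_safe, fallback_action):
    """Explore randomly, but only among actions predicted to stay in the safe set."""
    admissible = [a for a in candidate_actions if is_safe(dynamics(state, a))]
    # With no admissible action, fall back to a designated action (hypothetical).
    return random.choice(admissible) if admissible else fallback_action


# Toy 2-D example: the offline data covers a corridor along the x-axis.
offline_states = [(x * 0.5, 0.0) for x in range(5)]
is_safe = fit_safe_set(offline_states, radius=0.6)


def dynamics(state, action):
    """Hypothetical known dynamics: the action is a displacement."""
    return (state[0] + action[0], state[1] + action[1])


actions = [(0.5, 0.0), (0.0, 1.0), (0.0, -1.0)]
chosen = safe_step((1.0, 0.0), actions, dynamics, is_safe, fallback_action=(0.0, 0.0))
print(chosen)
```

In this toy setup only the move along the corridor keeps the agent inside the safe set, so the two sideways moves are filtered out before sampling. It also shows why the coverage of the offline data matters: a narrow dataset yields a narrow safe set and leaves little room for online exploration.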

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning

Safe Learning

Limited Data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimistic Forgetting

Safe Online Learning

Balanced Diversity and Optimality

🔎 Similar Papers

Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding