🤖 AI Summary
Goal-conditioned reinforcement learning (GCRL) poses significant safety risks in real-world applications due to its trial-and-error exploration. To address this, we propose a two-stage safe exploration framework: (1) pretraining a safety policy, with distributional critics, that proactively avoids failure states; and (2) a dynamic dual-policy arbitration mechanism during goal-conditioned policy execution, enabling real-time switching to safe actions. Our key idea is to embed distributional safety evaluation directly into action selection, keeping exploration failures close to zero. Moreover, we decouple safety learning from goal-directed learning, enabling cross-task transfer of safety policies. Experiments in simulation demonstrate high coverage of the goal space and near-zero failure rates, substantially outperforming baseline GCRL methods. Ablation studies confirm the efficacy of each component.
📝 Abstract
Goal-Conditioned Reinforcement Learning (GCRL) provides a versatile framework for developing unified controllers capable of handling a wide range of tasks, exploring environments, and adapting behaviors. However, its reliance on trial and error poses challenges for real-world applications, as errors can result in costly and potentially damaging consequences. To address the need for safer learning, we propose a method that enables agents to learn goal-conditioned behaviors and explore without the risk of making harmful mistakes. Exploration without risk can seem paradoxical, but environment dynamics are often uniform in space, so a policy trained for safety, even without any exploration objective, can still be exploited globally. Our proposed approach involves two distinct phases. First, during a pretraining phase, we employ safe reinforcement learning and distributional techniques to train a safety policy that actively tries to avoid failures in various situations. In the subsequent safe exploration phase, a goal-conditioned (GC) policy is learned while ensuring safety. To achieve this, we implement an action-selection mechanism that leverages the previously learned distributional safety critics to arbitrate between the safety policy and the GC policy, switching to the safety policy whenever the GC action is deemed too risky. We evaluate our method in simulated environments and demonstrate that it not only provides substantial coverage of the goal space but also reduces the occurrence of mistakes to a minimum, in stark contrast to traditional GCRL approaches. Additionally, we conduct an ablation study and analyze failure modes, offering insights for future research directions.
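The arbitration mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation; all names (`failure_risk`, `arbitrate`, the quantile-critic interface, the `risk_threshold` value) are hypothetical, and we assume the distributional safety critic returns quantile estimates of a binary failure cost, whose mean approximates failure probability:

```python
import numpy as np

def failure_risk(critic_quantiles, state, action):
    """Estimate failure probability from a distributional (quantile) safety
    critic. Assumption: the critic returns quantiles of a binary failure
    cost-to-go, so their mean approximates the probability of failure."""
    q = critic_quantiles(state, action)  # array of quantile estimates
    return float(np.mean(q))

def arbitrate(state, gc_policy, safety_policy, critic_quantiles,
              risk_threshold=0.1):
    """Dual-policy action selection: execute the goal-conditioned action
    only if its estimated failure risk is at or below the threshold;
    otherwise fall back to the pretrained safety policy."""
    gc_action = gc_policy(state)
    if failure_risk(critic_quantiles, state, gc_action) <= risk_threshold:
        return gc_action, "gc"
    return safety_policy(state), "safe"
```

Because the switch is evaluated at every step, the GC policy keeps control in regions the safety critics judge safe, and the agent only defers to the safety policy near predicted failure states; a more conservative variant could use an upper quantile (a CVaR-style risk measure) instead of the mean.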