🤖 AI Summary
This work addresses a core challenge in reachability control: reinforcement learning often underperforms on low-probability yet safe states because safe-set coverage objectives are misaligned with policy optimization over a user-specified distribution, a problem compounded by the unknown feasibility of initial conditions. To overcome this, the paper proposes Feasibility-Guided Exploration (FGE), a framework that, for the first time, unifies feasible initial-condition identification with robust safe-policy learning. FGE simultaneously discovers the feasible region and maximizes safe coverage in avoidance tasks under parameter uncertainty. By integrating deep reinforcement learning, robust optimization, and directed exploration, FGE operates effectively in high-dimensional environments, such as MuJoCo and Kinetix, and supports pixel-based observations. Experiments demonstrate that, under challenging initial conditions, FGE improves safe-state coverage by over 50% compared to the best existing method.
📝 Abstract
Recent advances in deep reinforcement learning (RL) have achieved strong results on high-dimensional control tasks, but applying RL to reachability problems raises a fundamental mismatch: reachability seeks to maximize the set of states from which a system remains safe indefinitely, while RL optimizes expected returns over a user-specified distribution. This mismatch can result in policies that perform poorly on low-probability states that are still within the safe set. A natural alternative is to frame the problem as a robust optimization over a set of initial conditions that specify the initial state, dynamics, and safe set, but whether this problem has a solution depends on the feasibility of the specified set, which is unknown a priori. We propose Feasibility-Guided Exploration (FGE), a method that simultaneously identifies a subset of feasible initial conditions under which a safe policy exists, and learns a policy to solve the reachability problem over this set of initial conditions. Empirical results demonstrate that FGE learns policies with over 50% more coverage than the best existing method for challenging initial conditions across tasks in the MuJoCo simulator and the Kinetix simulator with pixel observations.
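The distinction between feasibility and coverage can be made concrete with a toy sketch. Everything below is illustrative and not from the paper: the 1D unstable system with saturated actuation, the linear-gain "policies", the gain sweep standing in for policy learning, and the names `rollout_safe` and `coverage` are all assumptions chosen for illustration. The point is only that some initial conditions admit no safe policy at all (infeasible), and that coverage measures the fraction of candidate initial conditions a learned policy keeps safe.

```python
import numpy as np

# Toy setup (hypothetical, not from the FGE paper): an unstable 1D system
# x_{t+1} = DRIFT * x_t + u_t with actuation clipped to [-U_MAX, U_MAX].
# Initial conditions with |x0| > U_MAX / (DRIFT - 1) = 0.5 cannot be kept
# in the safe set [-SAFE, SAFE] by any policy, so they are infeasible.
SAFE = 1.0    # safe set is [-SAFE, SAFE]
U_MAX = 0.05  # actuation limit
DRIFT = 1.1   # unstable open-loop drift

def rollout_safe(x0, gain, horizon=200):
    """True if the clipped linear policy u = -gain * x keeps the
    trajectory inside the safe set over the whole horizon."""
    x = x0
    for _ in range(horizon):
        u = np.clip(-gain * x, -U_MAX, U_MAX)
        x = DRIFT * x + u
        if abs(x) > SAFE:
            return False
    return True

# Candidate initial conditions and a crude stand-in for policy search:
# sweep a few gains and mark an initial condition feasible if any tried
# policy keeps it safe.
candidates = np.linspace(-SAFE, SAFE, 101)
gains = np.linspace(0.0, 2.0, 21)
feasible = np.array(
    [any(rollout_safe(x0, g) for g in gains) for x0 in candidates]
)

# Coverage: fraction of candidate initial conditions found feasible.
# Here roughly half the safe set is feasible, matching the analytic
# bound |x0| <= 0.5.
coverage = feasible.mean()
```

In this toy the infeasible region is known analytically, so the sweep just recovers it; the setting the abstract targets is the opposite one, where feasibility is unknown a priori and must be identified jointly with learning the policy.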