AI Summary
This work addresses the challenge of learning policies that achieve high reward while satisfying safety constraints in Markov decision processes where the cost function is unknown and constraint violations are unobservable. To this end, the authors propose SafeQIL, an algorithm that, for the first time, integrates safety assessment into a Q-learning-based inverse constrained reinforcement learning framework. SafeQIL jointly models task rewards and safety constraints by introducing a "promise" measure over state-action pairs and optimizes the model via maximum likelihood estimation on expert demonstrations. Evaluated across multiple benchmark tasks, SafeQIL significantly outperforms existing methods, achieving both strong policy performance and safety, thereby effectively balancing conservative constraint adherence with exploratory high-reward behavior.
Abstract
Given a set of trajectories demonstrating safe execution of a task in a constrained MDP with observable rewards but unknown constraints and non-observable costs, we aim to find a policy that maximizes the likelihood of the demonstrated trajectories, balancing between being conservative and significantly increasing the likelihood of high-rewarding trajectories that may include potentially unsafe steps. With these objectives, we aim to learn a policy that maximizes the probability of the most *promising* trajectories with respect to the demonstrations. In doing so, we formulate the "promise" of individual state-action pairs in terms of $Q$ values, which depend on task-specific rewards as well as on an assessment of states' safety, mixing expectations over rewards and safety. This yields a safe Q-learning perspective on the inverse learning problem under constraints. The devised Safe $Q$ Inverse Constrained Reinforcement Learning (SafeQIL) algorithm is compared against state-of-the-art inverse constrained reinforcement learning algorithms on a set of challenging benchmark tasks, demonstrating its merits.
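The abstract's core idea — scoring the "promise" of a state-action pair by mixing a reward-based $Q$ value with a safety assessment, and deriving trajectory likelihoods from that score — can be illustrated with a minimal tabular sketch. Everything here is an assumption for illustration: the linear mixing weight `lam`, the safety table `S`, and the softmax likelihood are hypothetical stand-ins, not the paper's actual SafeQIL formulation.

```python
import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

Q = np.zeros((n_states, n_actions))                # task-value estimate
S = rng.uniform(0.5, 1.0, (n_states, n_actions))   # assumed safety scores in [0, 1]
lam, alpha, gamma = 0.5, 0.1, 0.9                  # illustrative hyperparameters

def safe_q_update(s, a, r, s_next):
    """One TD-style update on a reward/safety mixture (the 'promise' of (s, a))."""
    mixed = lam * Q + (1 - lam) * S                # promise = blend of value and safety
    target = r + gamma * mixed[s_next].max()
    Q[s, a] += alpha * (target - mixed[s, a])

def promise_softmax(s):
    """Boltzmann policy over the mixed promise; products of these action
    probabilities along a trajectory give the likelihood that maximum
    likelihood estimation would optimize against demonstrations."""
    logits = lam * Q[s] + (1 - lam) * S[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

safe_q_update(0, 1, r=1.0, s_next=2)
probs = promise_softmax(0)
```

Under this sketch, a high task reward raises a pair's promise only insofar as the safety term allows, which mirrors the abstract's trade-off between conservatism and high-rewarding but potentially unsafe steps.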