AI Summary
This work addresses the challenge of learning policies that achieve high reward while satisfying safety constraints in Markov decision processes where the cost function is unknown and constraint violations are unobservable. To this end, the authors propose SafeQIL, an algorithm that, for the first time, integrates safety assessment into a Q-learning-based inverse constrained reinforcement learning framework. SafeQIL jointly models task rewards and safety constraints by introducing a "promise" measure over state-action pairs and optimizes the model via maximum likelihood estimation on expert demonstrations. Evaluated across multiple benchmark tasks, SafeQIL significantly outperforms existing methods, achieving both strong policy performance and safety, thereby effectively balancing conservative constraint adherence with exploratory high-reward behavior.
Abstract
Given a set of trajectories demonstrating safe execution of a task in a constrained MDP with observable rewards but unknown constraints and non-observable costs, we aim to find a policy that maximizes the likelihood of the demonstrated trajectories, balancing between being conservative and significantly increasing the likelihood of high-rewarding trajectories that may include potentially unsafe steps. With these objectives, we aim to learn a policy that maximizes the probability of the most *promising* trajectories with respect to the demonstrations. In doing so, we formulate the "promise" of individual state-action pairs in terms of $Q$ values, which depend on task-specific rewards as well as on an assessment of states' safety, mixing expectations over rewards and safety. This yields a safe Q-learning perspective on the inverse learning problem under constraints. The devised Safe $Q$ Inverse Constrained Reinforcement Learning (SafeQIL) algorithm is compared against state-of-the-art inverse constrained reinforcement learning algorithms on a set of challenging benchmark tasks, demonstrating its merits.
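The abstract's core idea — scoring the "promise" of a state-action pair by mixing a reward-based $Q$ value with a safety assessment, and deriving trajectory likelihoods from that score — can be illustrated with a minimal tabular sketch. Everything here is an assumption for illustration: the linear mixing weight `lam`, the safety table `S`, and the softmax likelihood are hypothetical stand-ins, not the paper's actual SafeQIL formulation.

```python
import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

Q = np.zeros((n_states, n_actions))                # task-value estimate
S = rng.uniform(0.5, 1.0, (n_states, n_actions))   # assumed safety scores in [0, 1]
lam, alpha, gamma = 0.5, 0.1, 0.9                  # illustrative hyperparameters

def safe_q_update(s, a, r, s_next):
    """One TD-style update on a reward/safety mixture (the 'promise' of (s, a))."""
    mixed = lam * Q + (1 - lam) * S                # promise = blend of value and safety
    target = r + gamma * mixed[s_next].max()
    Q[s, a] += alpha * (target - mixed[s, a])

def promise_softmax(s):
    """Boltzmann policy over the mixed promise; products of these action
    probabilities along a trajectory give the likelihood that maximum
    likelihood estimation would optimize against demonstrations."""
    logits = lam * Q[s] + (1 - lam) * S[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

safe_q_update(0, 1, r=1.0, s_next=2)
probs = promise_softmax(0)
```

Under this sketch, a high task reward raises a pair's promise only insofar as the safety term allows, which mirrors the abstract's trade-off between conservatism and high-rewarding but potentially unsafe steps.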