Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently learning policies from static datasets in offline reinforcement learning that satisfy stringent safety constraints while avoiding high inference latency and runtime violations. The authors propose constructing a safety value function via Hamilton-Jacobi reachability analysis and learning the safety boundary through a self-consistent Bellman recursion. They further train a one-step flow policy using behavior cloning and distill it into an efficient action selector that eliminates the need for rejection sampling. By integrating conformal prediction calibration, the method provides probabilistic safety guarantees under limited data. Evaluated on vessel navigation and Safety Gymnasium MuJoCo benchmarks, the approach matches or exceeds state-of-the-art performance while significantly reducing both constraint violation rates and inference latency.
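The safety value function described above can be illustrated with a minimal tabular sketch. This is not the paper's implementation; it assumes the discounted reachability-style safety Bellman backup V(s) = (1-γ)·h(s) + γ·min(h(s), maxₐ V(s')), where h(s) is a signed safety margin (negative means unsafe), applied to a hypothetical 1-D chain with unsafe terminal states. The sign of the fixed point separates states from which safety can be maintained:

```python
import numpy as np

# Hypothetical 1-D chain: states 0..N-1, actions move left/right.
# h[s] is a signed safety margin (negative = violation). The discounted
# safety Bellman backup is
#   V(s) = (1 - gamma) * h(s) + gamma * min(h(s), max_a V(s'))
# and the sign of its fixed point marks the safe set (V >= 0).
N, gamma = 10, 0.99
h = np.array([-1.0] + [1.0] * (N - 2) + [-1.0])  # unsafe at both ends

def step(s, a):
    # a in {-1, +1}; walls clamp the agent inside the chain
    return min(max(s + a, 0), N - 1)

V = h.copy()
for _ in range(500):  # value iteration to an (approximate) fixed point
    V_new = np.empty_like(V)
    for s in range(N):
        best = max(V[step(s, a)] for a in (-1, +1))
        V_new[s] = (1 - gamma) * h[s] + gamma * min(h[s], best)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

safe_set = np.where(V >= 0)[0]  # states admitting a safety-preserving policy
print(safe_set)  # interior states 1..8 can avoid both unsafe ends
```

In the paper this recursion is learned from the offline dataset with function approximation rather than solved exactly; the sketch only shows the backup's structure.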

📝 Abstract
Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends FQL to safe offline RL by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
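The conformal calibration step mentioned in the abstract can be sketched as standard split conformal prediction. This is an illustrative assumption, not the paper's code: it supposes a held-out calibration set of states known to be truly unsafe, with learned safety values `v_hat_unsafe`, and picks a threshold τ so that a fresh unsafe state is wrongly declared safe with probability at most α:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration data: learned safety values V_hat evaluated on
# states whose true unsafety is known. Split conformal prediction gives a
# finite-sample, distribution-free threshold tau such that, for a fresh
# unsafe state exchangeable with the calibration set,
#   P(V_hat > tau) <= alpha.
alpha = 0.1
v_hat_unsafe = rng.normal(loc=-0.5, scale=0.3, size=200)

n = len(v_hat_unsafe)
# Conformal order-statistic index with finite-sample correction:
k = int(np.ceil((n + 1) * (1 - alpha)))
tau = np.sort(v_hat_unsafe)[min(k, n) - 1]

def declared_safe(v_hat, tau=tau):
    # Act only when the learned safety value clears the calibrated bar
    return v_hat > tau
```

The adjusted threshold τ is typically more conservative than the nominal zero level set of the learned safety value, which is how the method compensates for finite-data approximation error in the safety boundary.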
Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning
safety constraints
safe policy learning
real-time control
constraint violations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safe Flow Q-Learning
Offline Safe Reinforcement Learning
Hamilton-Jacobi Reachability
Conformal Prediction
One-step Flow Policy