Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently learning policies from static datasets in offline reinforcement learning that satisfy stringent safety constraints while avoiding high inference latency and runtime violations. The authors propose constructing a safety value function via Hamilton-Jacobi reachability analysis and learning the safety boundary through a self-consistent Bellman recursion. They further train a one-step flow policy using behavior cloning and distill it into an efficient action selector that eliminates the need for rejection sampling. By integrating conformal prediction calibration, the method provides probabilistic safety guarantees under limited data. Evaluated on vessel navigation and Safety Gymnasium MuJoCo benchmarks, the approach matches or exceeds state-of-the-art performance while significantly reducing both constraint violation rates and inference latency.
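The safety value function described above can be illustrated with a minimal tabular sketch. This is not the paper's implementation; it assumes the discounted reachability-style safety Bellman backup V(s) = (1-γ)·h(s) + γ·min(h(s), maxₐ V(s')), where h(s) is a signed safety margin (negative means unsafe), applied to a hypothetical 1-D chain with unsafe terminal states. The sign of the fixed point separates states from which safety can be maintained:

```python
import numpy as np

# Hypothetical 1-D chain: states 0..N-1, actions move left/right.
# h[s] is a signed safety margin (negative = violation). The discounted
# safety Bellman backup is
#   V(s) = (1 - gamma) * h(s) + gamma * min(h(s), max_a V(s'))
# and the sign of its fixed point marks the safe set (V >= 0).
N, gamma = 10, 0.99
h = np.array([-1.0] + [1.0] * (N - 2) + [-1.0])  # unsafe at both ends

def step(s, a):
    # a in {-1, +1}; walls clamp the agent inside the chain
    return min(max(s + a, 0), N - 1)

V = h.copy()
for _ in range(500):  # value iteration to an (approximate) fixed point
    V_new = np.empty_like(V)
    for s in range(N):
        best = max(V[step(s, a)] for a in (-1, +1))
        V_new[s] = (1 - gamma) * h[s] + gamma * min(h[s], best)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

safe_set = np.where(V >= 0)[0]  # states admitting a safety-preserving policy
print(safe_set)  # interior states 1..8 can avoid both unsafe ends
```

In the paper this recursion is learned from the offline dataset with function approximation rather than solved exactly; the sketch only shows the backup's structure.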

📝 Abstract
Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends FQL to safe offline RL by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
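The conformal calibration step mentioned in the abstract can be sketched as standard split conformal prediction. This is an illustrative assumption, not the paper's code: it supposes a held-out calibration set of states known to be truly unsafe, with learned safety values `v_hat_unsafe`, and picks a threshold τ so that a fresh unsafe state is wrongly declared safe with probability at most α:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration data: learned safety values V_hat evaluated on
# states whose true unsafety is known. Split conformal prediction gives a
# finite-sample, distribution-free threshold tau such that, for a fresh
# unsafe state exchangeable with the calibration set,
#   P(V_hat > tau) <= alpha.
alpha = 0.1
v_hat_unsafe = rng.normal(loc=-0.5, scale=0.3, size=200)

n = len(v_hat_unsafe)
# Conformal order-statistic index with finite-sample correction:
k = int(np.ceil((n + 1) * (1 - alpha)))
tau = np.sort(v_hat_unsafe)[min(k, n) - 1]

def declared_safe(v_hat, tau=tau):
    # Act only when the learned safety value clears the calibrated bar
    return v_hat > tau
```

The adjusted threshold τ is typically more conservative than the nominal zero level set of the learned safety value, which is how the method compensates for finite-data approximation error in the safety boundary.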
Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning
safety constraints
safe policy learning
real-time control
constraint violations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safe Flow Q-Learning
Offline Safe Reinforcement Learning
Hamilton-Jacobi Reachability
Conformal Prediction
One-step Flow Policy