🤖 AI Summary
This work addresses offline safe reinforcement learning: learning high-reward policies that satisfy cost constraints solely from a fixed, pre-collected dataset, thereby avoiding risky online exploration. The proposed method, DRCORL, first trains a diffusion model to capture the behavioral policy underlying the offline data, then extracts a simplified policy from it to enable efficient inference; a gradient-manipulation step then balances reward maximization against cost-constraint satisfaction. Key strengths reported: (1) reliable satisfaction of cost limits; (2) fast inference via the simplified policy; and (3) strong reward performance on robot-learning benchmarks, achieved consistently across tasks with the same hyperparameters.
📝 Abstract
Constrained reinforcement learning (RL) seeks high-performance policies under safety constraints. We focus on an offline setting where the agent has only a fixed dataset, as is common in realistic tasks where unsafe exploration must be prevented. To address this, we propose Diffusion-Regularized Constrained Offline Reinforcement Learning (DRCORL), which first uses a diffusion model to capture the behavioral policy from offline data and then extracts a simplified policy to enable efficient inference. We further apply gradient manipulation for safety adaptation, balancing the reward objective against constraint satisfaction. This approach leverages high-quality offline data while incorporating safety requirements. Empirical results show that DRCORL achieves reliable safety performance, fast inference, and strong reward outcomes across robot-learning tasks. Compared with existing safe offline RL methods, it consistently meets cost limits and performs well under the same hyperparameters, indicating practical applicability in real-world scenarios.
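The abstract does not spell out DRCORL's exact gradient-manipulation rule. A minimal sketch of one common scheme it resembles, under the assumption of a projection-style rule (the function name and the conflict test are illustrative, not the paper's): when the cost limit is violated, descend the cost surface; otherwise follow the reward gradient, projecting out any component that would also raise the cost.

```python
from typing import List


def redirect_gradient(g_reward: List[float], g_cost: List[float],
                      cost_violated: bool) -> List[float]:
    """Hypothetical gradient-redirection step for a safety-constrained
    policy update (illustrative, not DRCORL's published rule).

    g_reward -- gradient direction that increases expected reward
    g_cost   -- gradient direction that increases expected cost
    """
    if cost_violated:
        # Constraint breached: move straight down the cost surface.
        return [-c for c in g_cost]
    dot = sum(r * c for r, c in zip(g_reward, g_cost))
    if dot > 0.0:
        # Reward ascent would also raise cost: strip the conflicting
        # component by projecting g_reward onto the orthogonal
        # complement of g_cost.
        norm_sq = sum(c * c for c in g_cost)
        return [r - (dot / norm_sq) * c for r, c in zip(g_reward, g_cost)]
    # No conflict: plain reward ascent is already safe.
    return g_reward
```

The appeal of this family of rules is that it needs no per-task penalty weight, which is consistent with the paper's claim of working across tasks with the same hyperparameters.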