Online Optimization for Offline Safe Reinforcement Learning

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses offline safe reinforcement learning (OSRL), where the goal is to learn a policy from a fixed, static dataset that maximizes expected reward while satisfying a cumulative cost constraint. The authors propose the first OSRL framework to integrate online convex optimization: the problem is formulated as a min-max optimization in which the inner loop computes a best-response policy via an approximate offline RL oracle, while the outer loop dynamically adjusts the Lagrange multiplier using no-regret online algorithms (e.g., Follow-the-Regularized-Leader) to enforce the safety constraint. This design removes the reliance on exact offline policy evaluation, yielding both theoretical near-optimality and practical robustness. Experiments on the DSRL benchmark show that the method consistently satisfies stringent cost limits while achieving significantly higher reward than existing baselines. The implementation is publicly available.
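The alternating scheme described above, an inner best-response oracle and an outer no-regret multiplier update, can be sketched on a toy problem. This is an illustrative sketch, not the paper's implementation: the finite candidate-policy set and the names `best_response` and `run_minimax` are hypothetical, and the FTRL update here uses a quadratic regularizer, under which it reduces to clipping a scaled cumulative constraint violation at zero.

```python
# Illustrative toy sketch (not the paper's implementation). Each candidate
# "policy" is a (reward, cost) pair; the oracle picks the Lagrangian best
# response, and the outer loop updates the multiplier with FTRL.

def best_response(candidates, lam):
    """Stand-in for an approximate offline RL oracle: maximize the
    Lagrangian reward - lam * cost over a finite candidate set."""
    return max(candidates, key=lambda p: p[0] - lam * p[1])

def run_minimax(candidates, cost_budget, eta=0.5, rounds=200):
    """Outer loop: FTRL with a quadratic regularizer on lam >= 0, which
    reduces to a non-negative clipping of the scaled cumulative
    constraint violation (a 'lazy projection' update)."""
    lam, violation_sum, avg_cost = 0.0, 0.0, 0.0
    for t in range(1, rounds + 1):
        _, cost = best_response(candidates, lam)
        violation_sum += cost - cost_budget   # subgradient w.r.t. lam
        lam = max(0.0, eta * violation_sum)   # FTRL multiplier update
        avg_cost += (cost - avg_cost) / t     # running mean over rounds
    return lam, avg_cost

# Two candidates: high-reward/unsafe (cost 1.0) vs. lower-reward/safe (cost 0.0).
lam, avg_cost = run_minimax([(1.0, 1.0), (0.6, 0.0)], cost_budget=0.3)
```

Because the no-regret dual update keeps the cumulative violation bounded, the average cost of the per-round policies approaches the budget (here 0.3) as the number of rounds grows, which is the mechanism behind the safety guarantee described in the summary.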

📝 Abstract
We study the problem of Offline Safe Reinforcement Learning (OSRL), where the goal is to learn a reward-maximizing policy from fixed data under a cumulative cost constraint. We propose a novel OSRL approach that frames the problem as a minimax objective and solves it by combining offline RL with online optimization algorithms. We prove the approximate optimality of this approach when integrated with an approximate offline RL oracle and no-regret online optimization. We also present a practical approximation that can be combined with any offline RL algorithm, eliminating the need for offline policy evaluation. Empirical results on the DSRL benchmark demonstrate that our method reliably enforces safety constraints under stringent cost budgets, while achieving high rewards. The code is available at https://github.com/yassineCh/O3SRL.
Problem

Research questions and friction points this paper is trying to address.

Learning safe policies from fixed offline data
Enforcing cumulative cost constraints during training
Maximizing rewards while satisfying safety requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines offline RL with online optimization
Uses minimax objective for safe learning
Eliminates need for offline policy evaluation
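The minimax objective referred to above can be written, for a standard Lagrangian relaxation of a constrained MDP (the symbols $J_r$, $J_c$, and $b$ are conventional notation, not taken from the paper), as:

```latex
\min_{\lambda \ge 0} \; \max_{\pi} \; J_r(\pi) - \lambda \bigl( J_c(\pi) - b \bigr)
```

where $J_r(\pi)$ and $J_c(\pi)$ are the expected cumulative reward and cost of policy $\pi$, and $b$ is the cost budget. The inner maximization over $\pi$ is handled by the offline RL oracle, and the outer minimization over $\lambda$ is the part driven by the no-regret online algorithm.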