Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of offline reinforcement learning in high-stakes environments, where the scarcity of unsafe samples hinders the identification of latent constraint-violating states and often leads to deployment failures. To overcome this limitation, the authors propose the PROCO framework, which first learns a dynamics model from offline data and then leverages a large language model (LLM) to inject safety priors expressed in natural language. These priors inform a conservative cost function, which is combined with model-based rollouts to generate counterfactual unsafe trajectories. This approach proactively delineates the feasible region and facilitates safe policy learning without relying on real-world violations. PROCO is the first method to integrate semantic safety knowledge from LLMs into offline safe reinforcement learning, anticipating risk and synthesizing unsafe samples even when the offline data contains almost no violations. Experiments on the Safety-Gymnasium benchmark demonstrate that PROCO substantially reduces constraint violations and integrates with a range of offline safe RL algorithms, outperforming both their original variants and behavior cloning baselines.
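The pipeline summarized above (learn a dynamics model, score states with a cost function, roll the model out to collect synthetic unsafe samples) can be illustrated with a minimal sketch. This is not PROCO's implementation: the 1-D linear dynamics, the least-squares fit, the random rollout policy, and the hand-written `cost_fn` (a stand-in for the LLM-derived cost) are all assumptions chosen to keep the example small.

```python
import random

def fit_dynamics(transitions):
    """Least-squares fit of s' = s + b * a from (s, a, s') tuples (toy model class)."""
    num = sum(a * (s2 - s) for s, a, s2 in transitions)
    den = sum(a * a for s, a, s2 in transitions)
    b = num / den
    return lambda s, a: s + b * a

def cost_fn(s, limit=1.0):
    """Hypothetical stand-in for an LLM-derived cost: 1 if |s| exceeds the limit."""
    return 1.0 if abs(s) > limit else 0.0

def synthesize_unsafe(model, start_states, horizon, rng):
    """Roll the learned model out with random actions; keep the first violating transition."""
    samples = []
    for s in start_states:
        for _ in range(horizon):
            a = rng.uniform(-1.0, 1.0)
            s2 = model(s, a)
            c = cost_fn(s2)
            if c > 0:
                samples.append((s, a, s2, c))
                break
            s = s2
    return samples

rng = random.Random(0)

# Offline dataset containing only safe transitions (assumed true dynamics: s' = s + 0.5 * a).
offline = []
for _ in range(200):
    s = rng.uniform(-0.4, 0.4)
    a = rng.uniform(-1.0, 1.0)
    offline.append((s, a, s + 0.5 * a))

model = fit_dynamics(offline)
synthetic = synthesize_unsafe(model, [s for s, _, _ in offline], horizon=20, rng=rng)
print(f"{len(synthetic)} synthetic unsafe transitions from a violation-free dataset")
```

The point of the sketch is only that unsafe transitions can be produced inside the model even though the dataset itself contains none; how the cost function is derived from language and how conservatism is enforced are left to the paper.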

📝 Abstract

Learning constraint-satisfying policies from offline data without risky online interaction is crucial for safety-critical decision making. Conventional methods typically learn cost value functions from abundant unsafe samples to define safety boundaries and penalize violations. However, in high-stakes scenarios, risky trial-and-error is infeasible, yielding datasets with few or no unsafe samples. Under this limitation, existing approaches often treat all data as uniformly safe, overlooking safe-but-infeasible states (states that currently satisfy constraints but inevitably violate them within a few steps), which leads to deployment failures. Drawing inspiration from the concept of knowledge-data integration, we leverage large language models (LLMs) to incorporate natural language knowledge into the policy to address this challenge. Specifically, we propose PROCO, a model-based offline safe reinforcement learning (RL) framework tailored to datasets largely free of violations. PROCO first learns a dynamics model from offline data and constructs a conservative cost function by grounding natural-language knowledge of unsafe states in LLMs, enabling risk estimation even without observed violations. Using the cost function and learned model, PROCO performs model-based rollouts to synthesize diverse counterfactual unsafe samples, supporting reliable feasibility identification and feasibility-guided policy learning. Across a range of Safety-Gymnasium tasks with exclusively safe or minimally risky training data, PROCO integrates seamlessly with a variety of offline safe RL algorithms and consistently demonstrates reduced constraint violations and improved safety performance compared to both the original methods and behavior cloning baselines.
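The notion of a safe-but-infeasible state can be made concrete with a toy example. The sketch below assumes a hand-written 1-D system (a point moving toward a wall with bounded braking) and a brute-force search over action sequences; none of these names or constants come from the paper, and PROCO uses a learned model rather than known dynamics.

```python
DT = 0.1          # time step (assumed)
WALL = 1.0        # positions x >= WALL violate the constraint (assumed)
MAX_ACCEL = 1.0   # acceleration bound (assumed)
ACTIONS = (-MAX_ACCEL, 0.0, MAX_ACCEL)

def step(state, accel):
    """Toy known dynamics: update velocity, then position."""
    x, v = state
    v2 = v + accel * DT
    x2 = x + v2 * DT
    return (x2, v2)

def violates(state):
    """Instantaneous constraint check: has the point reached the wall?"""
    return state[0] >= WALL

def is_feasible(state, horizon):
    """True if some action sequence avoids violation for `horizon` further steps."""
    if violates(state):
        return False
    if horizon == 0:
        return True
    return any(is_feasible(step(state, a), horizon - 1) for a in ACTIONS)

fast_near_wall = (0.8, 2.0)   # currently safe, but moving too fast to stop
slow_far_away = (0.0, 0.5)    # currently safe and able to brake in time

print(violates(fast_near_wall), is_feasible(fast_near_wall, 8))
print(violates(slow_far_away), is_feasible(slow_far_away, 8))
```

In this toy system, `fast_near_wall` satisfies the constraint now, yet every action sequence reaches the wall within two steps, which is the kind of state that a dataset labeled only by instantaneous cost would treat as safe.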

Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning

safe policy learning

limited violation data

constraint satisfaction

safe-but-infeasible states

Innovation

Methods, ideas, or system contributions that make the work stand out.

model-based offline RL

large language models

counterfactual unsafe synthesis