🤖 AI Summary
This work addresses the challenge of offline reinforcement learning in high-stakes environments, where the scarcity of unsafe samples hinders the identification of latent constraint-violating states and often leads to deployment failures. To overcome this limitation, the authors propose the PROCO framework, which first learns a dynamics model from offline data and then leverages a large language model (LLM) to inject safety priors expressed in natural language. These priors inform a conservative cost function, enabling the generation of counterfactual unsafe trajectories through model-based rollouts. This approach proactively delineates the feasible region and facilitates safe policy learning without relying on real-world violations. PROCO is the first method to integrate semantic safety knowledge from LLMs into offline safe reinforcement learning, which lets it anticipate risk and generate synthetic unsafe samples even when the data contain almost no violations. Experiments on the Safety-Gymnasium benchmark demonstrate that PROCO substantially reduces constraint violations and integrates with diverse offline safe RL algorithms, outperforming both their original variants and behavior cloning baselines.
📝 Abstract
Learning constraint-satisfying policies from offline data without risky online interaction is crucial for safety-critical decision making. Conventional methods typically learn cost value functions from abundant unsafe samples to define safety boundaries and penalize violations. However, in high-stakes scenarios, risky trial-and-error is infeasible, yielding datasets with few or no unsafe samples. Under this limitation, existing approaches often treat all data as uniformly safe and overlook safe-but-infeasible states (states that currently satisfy constraints but inevitably violate them within a few steps), which leads to deployment failures. Drawing inspiration from the concept of knowledge-data integration, we leverage large language models (LLMs) to incorporate natural-language knowledge into the policy to address this challenge. Specifically, we propose PROCO, a model-based offline safe reinforcement learning (RL) framework tailored to datasets largely free of violations. PROCO first learns a dynamics model from offline data and constructs a conservative cost function by grounding natural-language knowledge of unsafe states in LLMs, enabling risk estimation even without observed violations. Using the cost function and learned model, PROCO performs model-based rollouts to synthesize diverse counterfactual unsafe samples, supporting reliable feasibility identification and feasibility-guided policy learning. Across a range of Safety-Gymnasium tasks with exclusively safe or minimally risky training data, PROCO integrates seamlessly with a variety of offline safe RL algorithms and consistently reduces constraint violations and improves safety performance compared to both the original methods and behavior cloning baselines.
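
The notion of a safe-but-infeasible state, and the idea of synthesizing counterfactual unsafe samples by rolling out a model, can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the PROCO implementation: the 1D point-mass dynamics in `model_step` stand in for a learned dynamics model, the hand-written `cost` stands in for the LLM-grounded cost function, and `HAZARD_X`, `ACTIONS`, and `HORIZON` are hypothetical values chosen only for illustration. Feasibility is checked by exhaustive search over a short horizon, which is an approximation that works only for tiny action spaces.

```python
from itertools import product

# All constants below are hypothetical and chosen only for illustration.
HAZARD_X = 10.0             # positions at or beyond this value violate the constraint
ACTIONS = (-1.0, 0.0, 1.0)  # discrete accelerations available to the agent
HORIZON = 4                 # rollout length used for the feasibility check


def model_step(state, action):
    """Stand-in for a learned dynamics model: 1D point mass (position, velocity)."""
    x, v = state
    v_next = v + action
    return (x + v_next, v_next)


def cost(state):
    """Stand-in for an LLM-grounded cost: 1.0 inside the hazard region, else 0.0."""
    x, _ = state
    return 1.0 if x >= HAZARD_X else 0.0


def rollout(state, actions):
    """Roll the model forward and return the visited states, including the start."""
    trajectory = [state]
    for action in actions:
        state = model_step(state, action)
        trajectory.append(state)
    return trajectory


def is_feasible(state, horizon=HORIZON):
    """True if at least one action sequence keeps the cost at zero for `horizon` steps."""
    return any(
        all(cost(s) == 0.0 for s in rollout(state, seq))
        for seq in product(ACTIONS, repeat=horizon)
    )


def synthesize_unsafe(dataset, horizon=HORIZON):
    """Collect the first violating transition (s, a, s_next) of each model rollout."""
    samples = []
    for start in dataset:
        for seq in product(ACTIONS, repeat=horizon):
            trajectory = rollout(start, seq)
            for t, action in enumerate(seq):
                if cost(trajectory[t + 1]) > 0.0:
                    samples.append((trajectory[t], action, trajectory[t + 1]))
                    break
    return samples


if __name__ == "__main__":
    fast_near_hazard = (8.0, 4.0)  # currently safe, but cannot brake in time
    slow_far_away = (5.0, 2.0)     # currently safe and can stop before the hazard
    print(cost(fast_near_hazard), is_feasible(fast_near_hazard))
    print(cost(slow_far_away), is_feasible(slow_far_away))
    print(len(synthesize_unsafe([slow_far_away])) > 0)
```

In this sketch the state `(8.0, 4.0)` has zero cost yet fails the feasibility check, because even maximal braking carries it past the hazard boundary on the next step. A dataset containing only states like `(5.0, 2.0)` has no observed violations, but model rollouts from it still produce violating transitions that could be used as synthetic unsafe samples.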