🤖 AI Summary
This work addresses two key challenges in offline safe reinforcement learning: the difficulty of stitching high-quality transitions out of suboptimal trajectories, and the reliance on manually specified trade-offs when reward and cost objectives conflict. To overcome these limitations, we propose Goal-Assisted Stitching (GAS), an algorithm that enhances trajectory stitching through transition-level data augmentation and relabeling. GAS introduces an expectile regression-based objective to automatically estimate attainable reward-cost targets, enabling adaptive guidance of policy optimization without human-specified goals. By integrating data augmentation, goal learning, and distribution reshaping, GAS establishes a conditional generative framework for safe policies that operates entirely offline. Experimental results demonstrate that GAS significantly outperforms existing methods across multiple benchmark tasks, achieving higher cumulative rewards while strictly satisfying safety constraints.
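The summary's "expectile regression-based objective" is not spelled out here, but the standard expectile loss (familiar from IQL-style value learning) gives the flavor: with an expectile parameter τ > 0.5, under-predictions are penalized more heavily, so the learned goal function skews toward the best attainable returns in the dataset. A minimal sketch, with the function name and τ value as illustrative assumptions:

```python
import numpy as np

def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared loss L_tau(u) = |tau - 1{u < 0}| * u^2,
    where u = target - pred. For tau > 0.5, under-predicting the
    target costs more than over-predicting it, so minimizing this
    loss pushes `pred` toward the upper tau-expectile of the
    target distribution rather than its mean."""
    u = target - pred
    weight = np.where(u < 0.0, 1.0 - tau, tau)
    return np.mean(weight * u ** 2)
```

At τ = 0.5 this reduces to (half) the ordinary mean-squared error; as τ → 1 the estimate approaches the maximum target seen in the data, which is the behavior a "best achievable goal" estimator wants.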
📝 Abstract
Offline Safe Reinforcement Learning (OSRL) aims to learn a policy that achieves high performance in sequential decision-making while satisfying constraints, using only pre-collected datasets. Recent works, inspired by the strong capabilities of Generative Models (GMs), reformulate decision-making in OSRL as a conditional generative process, where GMs generate desirable actions conditioned on predefined reward and cost values. However, GM-assisted methods face two major challenges in OSRL: (1) lacking the ability to "stitch" optimal transitions from suboptimal trajectories within the dataset, and (2) struggling to balance reward targets with cost targets, particularly when they conflict. To address these issues, we propose Goal-Assisted Stitching (GAS), a novel algorithm designed to enhance stitching capabilities while effectively balancing reward maximization and constraint satisfaction. To enhance the stitching ability, GAS first augments and relabels the dataset at the transition level, enabling the construction of high-quality trajectories from suboptimal ones. GAS also introduces novel goal functions, which estimate the optimal achievable reward and cost goals from the dataset. These goal functions, trained using expectile regression on the relabeled and augmented dataset, allow GAS to accommodate a broader range of reward-cost return pairs and achieve a better tradeoff between reward maximization and constraint satisfaction than human-specified values. The estimated goals then guide policy training, ensuring robust performance under constrained settings. Furthermore, to improve training stability and efficiency, we reshape the dataset to achieve a more uniform reward-cost return distribution. Empirical results validate the effectiveness of GAS, demonstrating superior performance in balancing reward maximization and constraint satisfaction compared to existing methods.
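The abstract's final reshaping step, producing a "more uniform reward-cost return distribution", could be realized in several ways; one common recipe is to bin trajectories over the 2-D (reward return, cost return) grid and resample bins uniformly. A hypothetical sketch (the function name, binning scheme, and sample counts are all illustrative assumptions, not the paper's procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def reshape_dataset(returns, costs, n_bins=10, n_samples=1000):
    """Resample trajectory indices so the joint (reward, cost) return
    distribution is roughly uniform over the occupied 2-D bins:
    rare return pairs are drawn as often as common ones."""
    r_edges = np.linspace(returns.min(), returns.max(), n_bins)
    c_edges = np.linspace(costs.min(), costs.max(), n_bins)
    keys = np.digitize(returns, r_edges) * (n_bins + 2) + np.digitize(costs, c_edges)
    # Group trajectory indices by their joint bin key.
    bins = {}
    for i, k in enumerate(keys):
        bins.setdefault(k, []).append(i)
    occupied = list(bins.values())
    # Draw bins uniformly, then one trajectory uniformly within each bin.
    picked = rng.integers(0, len(occupied), size=n_samples)
    return np.array([occupied[b][rng.integers(len(occupied[b]))] for b in picked])
```

Training on the resampled indices (with replacement) flattens the empirical return distribution, which is one plausible way to stabilize conditional generative training when the raw dataset is heavily skewed toward a narrow band of returns.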