AI Summary
In offline safe reinforcement learning, existing sequence models condition symmetrically on return-to-go (RTG) and cost-to-go (CTG), overlooking their fundamental asymmetry: RTG represents a soft performance objective, whereas CTG must enforce a hard safety constraint. This modeling bias leads to constraint violations when cost trajectories are out of distribution. To address this, we propose the Boundary-to-Region (B2R) framework, the first to explicitly model CTG as a safety boundary rather than a symmetric target. B2R unifies the cost distribution over feasible trajectories via cost-signal realignment and enhances exploration within the safe region using rotary positional embeddings. Evaluated on 38 safety-critical tasks, B2R satisfies the safety constraints in 35 of them, achieving higher constraint satisfaction rates and reward performance than the baselines.
Abstract
Offline safe reinforcement learning aims to learn policies that satisfy predefined safety constraints from static datasets. Existing sequence-model-based methods condition action generation on symmetric input tokens for return-to-go and cost-to-go, neglecting their intrinsic asymmetry: return-to-go (RTG) serves as a flexible performance target, while cost-to-go (CTG) should represent a rigid safety boundary. This symmetric conditioning leads to unreliable constraint satisfaction, especially when encountering out-of-distribution cost trajectories. To address this, we propose Boundary-to-Region (B2R), a framework that enables asymmetric conditioning through cost signal realignment. B2R redefines CTG as a boundary constraint under a fixed safety budget, unifying the cost distribution of all feasible trajectories while preserving reward structures. Combined with rotary positional embeddings, it enhances exploration within the safe region. Experimental results show that B2R satisfies safety constraints in 35 out of 38 safety-critical tasks while achieving superior reward performance over baseline methods. This work highlights the limitations of symmetric token conditioning and establishes a new theoretical and practical approach for applying sequence models to safe RL. Our code is available at https://github.com/HuikangSu/B2R.
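To make the idea of treating CTG as a boundary more concrete, the sketch below contrasts a conventional cost-to-go sequence with one possible reading of realignment under a fixed budget. It is an illustrative assumption, not the method from the B2R repository: the function names `standard_ctg` and `realigned_ctg`, the rule of discarding trajectories whose total cost exceeds the budget, and the "remaining budget" formula are all hypothetical.

```python
from typing import List, Optional


def standard_ctg(costs: List[float]) -> List[float]:
    """Conventional cost-to-go: sum of costs from step t to the end."""
    ctg, running = [], 0.0
    for c in reversed(costs):
        running += c
        ctg.append(running)
    return ctg[::-1]


def realigned_ctg(costs: List[float], budget: float) -> Optional[List[float]]:
    """Hypothetical realignment: CTG token = budget remaining before step t.

    Assumption for illustration only: trajectories whose total cost
    exceeds the budget are treated as infeasible and return None.
    """
    if sum(costs) > budget:
        return None
    remaining, tokens = budget, []
    for c in costs:
        tokens.append(remaining)
        remaining -= c
    return tokens


budget = 5.0
traj_a = [0.0, 1.0, 0.0, 2.0]  # total cost 3
traj_b = [1.0, 0.0, 0.0, 0.0]  # total cost 1
traj_c = [3.0, 3.0, 0.0, 0.0]  # total cost 6, over budget

for name, traj in [("a", traj_a), ("b", traj_b), ("c", traj_c)]:
    print(name, standard_ctg(traj), realigned_ctg(traj, budget))
```

Under this reading, every feasible trajectory starts from the same CTG token (the budget), so the conditioning signal seen at deployment lies inside the training distribution, whereas the conventional CTG sequences start at different values for each trajectory. The per-step rewards are untouched, which matches the abstract's remark that reward structures are preserved.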