Boundary-to-Region Supervision for Offline Safe Reinforcement Learning

2025-09-29
AI Summary
In offline safe reinforcement learning, existing sequence models condition symmetrically on return-to-go (RTG) and cost-to-go (CTG), overlooking their fundamental asymmetry: RTG is a soft performance objective, whereas CTG must enforce a hard safety constraint. This modeling bias leads to constraint violations when the model encounters out-of-distribution costs. To address this, we propose the Boundary-to-Region (B2R) framework, the first to explicitly model CTG as a safety boundary rather than a symmetric target. B2R unifies the cost distribution over feasible trajectories via cost signal realignment and enhances exploration within the safe region using rotary positional embeddings. Evaluated on 38 safety-critical tasks, B2R satisfies the safety constraints in 35 of them, with higher constraint satisfaction rates and reward performance than the baselines.

Abstract
Offline safe reinforcement learning aims to learn policies that satisfy predefined safety constraints from static datasets. Existing sequence-model-based methods condition action generation on symmetric input tokens for return-to-go and cost-to-go, neglecting their intrinsic asymmetry: return-to-go (RTG) serves as a flexible performance target, while cost-to-go (CTG) should represent a rigid safety boundary. This symmetric conditioning leads to unreliable constraint satisfaction, especially when encountering out-of-distribution cost trajectories. To address this, we propose Boundary-to-Region (B2R), a framework that enables asymmetric conditioning through cost signal realignment. B2R redefines CTG as a boundary constraint under a fixed safety budget, unifying the cost distribution of all feasible trajectories while preserving reward structures. Combined with rotary positional embeddings, it enhances exploration within the safe region. Experimental results show that B2R satisfies safety constraints in 35 out of 38 safety-critical tasks while achieving superior reward performance over baseline methods. This work highlights the limitations of symmetric token conditioning and establishes a new theoretical and practical approach for applying sequence models to safe RL. Our code is available at https://github.com/HuikangSu/B2R.
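The abstract does not give the realignment formula, so the following is only an illustrative sketch of one plausible reading: conventional CTG is the suffix sum of per-step costs, while a realigned CTG for a feasible trajectory tracks the safety budget still remaining, so every feasible trajectory starts from the same boundary value. The function names `cost_to_go` and `realigned_ctg` and the budget value are hypothetical and are not taken from the B2R code base.

```python
from typing import List, Sequence


def cost_to_go(costs: Sequence[float]) -> List[float]:
    """Conventional CTG: cumulative cost from each step to the end."""
    ctg, running = [], 0.0
    for c in reversed(costs):
        running += c
        ctg.append(running)
    return ctg[::-1]


def realigned_ctg(costs: Sequence[float], budget: float) -> List[float]:
    """Assumed realignment: budget remaining before each step.

    Only defined here for feasible trajectories (total cost <= budget).
    """
    if sum(costs) > budget:
        raise ValueError("trajectory exceeds the safety budget")
    remaining, out = budget, []
    for c in costs:
        out.append(remaining)
        remaining -= c
    return out


# Two feasible trajectories with different total costs.
traj_a = [0, 1, 0, 2]
traj_b = [1, 0, 0, 0]

print(cost_to_go(traj_a))        # [3.0, 3.0, 2.0, 2.0]
print(cost_to_go(traj_b))        # [1.0, 0.0, 0.0, 0.0]
print(realigned_ctg(traj_a, 5))  # [5, 5, 4, 4]
print(realigned_ctg(traj_b, 5))  # [5, 4, 4, 4]
```

Under this reading, the conventional initial CTG values differ across trajectories (3 versus 1), whereas the realigned sequences all begin at the shared budget, which is one way the cost distribution of feasible trajectories could be unified while the rewards are left untouched.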
Problem

Research questions and friction points this paper is trying to address.

Addresses unreliable constraint satisfaction in offline safe RL
Redefines cost-to-go as rigid boundary constraint for safety
Enables asymmetric conditioning through cost signal realignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric conditioning through cost signal realignment
Redefining cost-to-go as boundary constraint
Combining rotary embeddings for safe region exploration
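For readers unfamiliar with rotary positional embeddings, here is a minimal pure-Python sketch of the general technique: adjacent pairs of feature dimensions are rotated by an angle proportional to the token position. How B2R wires this into its sequence model is not described on this page; the function name `rotary_embed` and the default `base` of 10000 are illustrative assumptions.

```python
import math
from typing import List


def rotary_embed(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each adjacent pair (x[i], x[i+1]) by an angle position * base**(-i/d)."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        out.append(x[i] * cos_t - x[i + 1] * sin_t)
        out.append(x[i] * sin_t + x[i + 1] * cos_t)
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The attention score depends only on the relative offset (2 in both cases).
score_near = dot(rotary_embed(q, 3), rotary_embed(k, 1))
score_far = dot(rotary_embed(q, 7), rotary_embed(k, 5))
print(abs(score_near - score_far) < 1e-9)
```

Because the query-key dot product depends only on the position offset, the encoding is relative rather than absolute, which is the property usually cited as the reason for choosing rotary embeddings.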
Huikang Su, Harbin Institute of Technology, Weihai
Dengyun Peng, Harbin Institute of Technology
Zifeng Zhuang, Westlake University (Reinforcement Learning)
Yuhan Liu, Harbin Institute of Technology, Weihai
Qiguang Chen, Harbin Institute of Technology (Chain-of-Thought, Reasoning, Multilingual LLM, Multi-modal LLM)
Donglin Wang, Westlake University, Hangzhou
Qinghe Liu, Harbin Institute of Technology, Weihai