🤖 AI Summary
This work addresses a challenge in offline safe reinforcement learning: existing diffusion-based planners struggle to satisfy safety constraints and optimize reward at the same time when the safety budget varies across episodes or changes within one. The authors propose Safe Decoupled Guidance Diffusion (SDGD), a framework that treats the cost limit as a conditioning signal for trajectory generation rather than as a competing gradient objective. SDGD decouples guidance into two components: classifier-free guidance conditioned on the cost limit biases sampling toward trajectories that satisfy it, while reward-gradient guidance refines those trajectories for higher return. Because reward guidance can also steer samples toward higher cumulative cost, the method adds Feasible Trajectory Relabeling (FTR), which reshapes reward targets to discourage such directions. A first-order sampling-time analysis shows that FTR suppresses this reward-induced cost drift under a prefix-restorative alignment condition. On the DSRL benchmark, SDGD satisfies the safety constraint on 36 of 38 tasks (94.7%), the strongest compliance among the compared methods, and obtains the highest reward among safe methods on 21 tasks.
📝 Abstract
Offline safe reinforcement learning often requires policies to adapt at deployment time to safety budgets that vary across episodes or change within a single episode. While diffusion-based planners enable flexible trajectory generation, existing guidance schemes often treat reward improvement and constraint satisfaction as competing gradient objectives, which can lead to unreliable safety compliance under cost limits. We reinterpret adaptive safe trajectory generation as sampling from a constrained trajectory distribution, where the budget restricts the trajectory region, and reward shapes preferences within that region. This perspective motivates Safe Decoupled Guidance Diffusion (SDGD), which conditions classifier-free guidance on the cost limit to bias sampling toward trajectories satisfying the specified limit, while using reward-gradient guidance to refine trajectories for higher return. Because direct reward guidance can increase return while also steering samples toward trajectories with higher cumulative cost, we introduce Feasible Trajectory Relabeling (FTR) to reshape reward targets and discourage such directions. We further provide a first-order sampling-time analysis showing that FTR suppresses reward-induced cost drift under a prefix-restorative alignment condition. Extensive evaluations on the DSRL benchmark show that SDGD achieves the strongest safety compliance among baselines, satisfying the constraint on 94.7% of tasks (36/38), while obtaining the highest reward among safe methods on 21 tasks.
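The decoupling described in the abstract can be illustrated with a minimal sketch of a single guidance step. It is not the authors' implementation: the function name, argument names, and guidance weights are hypothetical, the noise predictions and reward gradient are passed in as plain arrays, and FTR is omitted because the abstract does not specify it. The sketch assumes the standard forms of classifier-free guidance (extrapolating from the unconditional noise estimate toward the cost-conditioned one) and classifier-style gradient guidance (shifting the noise estimate against the reward gradient, scaled by the noise level).

```python
import numpy as np


def decoupled_guidance_eps(eps_cond, eps_uncond, reward_grad, sigma_t,
                           w_cost=2.0, w_reward=0.1):
    """Combine cost-conditioned and reward-gradient guidance (illustrative).

    eps_cond    : noise predicted with the cost limit as condition
    eps_uncond  : noise predicted with the condition dropped
    reward_grad : gradient of a reward estimate w.r.t. the noisy trajectory
    sigma_t     : noise scale at the current diffusion step
    w_cost, w_reward : hypothetical guidance weights (assumptions)
    """
    eps_cond = np.asarray(eps_cond, dtype=float)
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    reward_grad = np.asarray(reward_grad, dtype=float)

    # Classifier-free guidance: move from the unconditional estimate
    # toward the cost-conditioned one, biasing samples to the budget.
    eps_cfg = eps_uncond + w_cost * (eps_cond - eps_uncond)

    # Reward-gradient guidance: subtracting the scaled gradient from the
    # noise estimate nudges the denoised sample toward higher reward.
    return eps_cfg - w_reward * sigma_t * reward_grad


# Toy usage with a trajectory of 4 steps and 2 dimensions per step.
eps_c = np.ones((4, 2))
eps_u = np.zeros((4, 2))
grad_r = 2.0 * np.ones((4, 2))
eps_hat = decoupled_guidance_eps(eps_c, eps_u, grad_r, sigma_t=0.5)
```

Keeping the two terms separate means the cost limit enters only through the conditioning input, so changing the budget at deployment changes `eps_cond` without retuning the reward weight. Setting `w_cost=1.0` and `w_reward=0.0` recovers the plain cost-conditioned prediction.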