AI Summary
This work addresses the safety risks that arise in offline reinforcement learning when the training data distribution differs from the deployment environment. The authors propose SAS (Self-Alignment for Safety), a framework that combines a Lyapunov stability condition with a test-time self-alignment mechanism. Specifically, a pretrained Transformer-based agent generates several imagined trajectories, keeps those that satisfy the Lyapunov condition, and recycles these feasible segments into its context as control-invariant prompts, enabling safe adaptation without any parameter updates. The Transformer architecture also admits a hierarchical RL interpretation in which prompting acts as Bayesian inference over latent skills. Empirical evaluations on Safety Gymnasium and MuJoCo benchmarks show that SAS consistently reduces constraint violation cost and task failure while maintaining or improving task return.
Abstract
Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-based framework that enables test-time adaptation in offline safe RL without retraining. In SAS, the main mechanism is self-alignment: at test time, the pretrained agent generates several imagined trajectories and selects those satisfying the Lyapunov condition. These feasible segments are then recycled as in-context prompts, allowing the agent to realign its behavior toward safety while avoiding parameter updates. In effect, SAS turns Lyapunov-guided imagination into control-invariant prompts, and its transformer architecture admits a hierarchical RL interpretation where prompting functions as Bayesian inference over latent skills. Across Safety Gymnasium and MuJoCo benchmarks, SAS consistently reduces cost and failure while maintaining or improving return.
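The self-alignment step described above (imagine several trajectories, keep those satisfying the Lyapunov condition, reuse them as prompts) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the paper: the scalar state, the candidate function V(s) = s^2, the non-increase test V(s_{t+1}) <= V(s_t), the noisy `imagine` rollout that stands in for the pretrained transformer, and all function names.

```python
import random
from typing import Callable, List, Sequence

State = float
Trajectory = List[State]


def lyapunov(state: State) -> float:
    """Hypothetical Lyapunov candidate V(s) = s^2, zero at the safe point s = 0."""
    return state * state


def satisfies_lyapunov(
    traj: Sequence[State], v: Callable[[State], float] = lyapunov
) -> bool:
    """Assumed condition: V never increases between consecutive states."""
    return all(v(nxt) <= v(cur) for cur, nxt in zip(traj, traj[1:]))


def imagine(start: State, horizon: int, rng: random.Random) -> Trajectory:
    """Stand-in for the pretrained agent's imagined rollout (not the real model)."""
    traj = [start]
    for _ in range(horizon):
        gain = rng.uniform(0.6, 1.2)  # a gain above 1 moves the state away from 0
        traj.append(traj[-1] * gain)
    return traj


def self_align(
    start: State,
    num_candidates: int,
    horizon: int,
    max_prompt_segments: int,
    seed: int = 0,
) -> List[Trajectory]:
    """Imagine candidates, keep the feasible ones, return them as the new prompt."""
    rng = random.Random(seed)
    candidates = [imagine(start, horizon, rng) for _ in range(num_candidates)]
    feasible = [t for t in candidates if satisfies_lyapunov(t)]
    # The retained segments would be prepended to the agent's context;
    # no model parameters are updated anywhere in this loop.
    return feasible[:max_prompt_segments]


prompt = self_align(start=1.0, num_candidates=32, horizon=5, max_prompt_segments=4)
print(f"kept {len(prompt)} feasible segment(s) for the prompt")
```

In this sketch the filter is a hard accept/reject test, so the prompt can be empty if no candidate passes; how SAS handles that case, and how it scores or orders feasible segments, is not stated in the abstract.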