🤖 AI Summary
This work addresses the poor off-policy scalability of constrained reinforcement learning methods that rely on additive cost constraints and dual variables. The authors propose a control-as-inference framework grounded in stochastic decision horizons, in which constraint violations are modeled as reward attenuation and a reduction of the effective planning horizon through state-action-dependent survival probabilities. A replay-compatible, survival-weighted objective enables sample-efficient off-policy learning. The framework further supports two constraint semantics, absorbing and virtual termination, which adapt naturally to SAC- or MPO-style policy optimization while preserving the original return structure. Experiments demonstrate significant gains in sample efficiency and in the trade-off between returns and constraint satisfaction on standard benchmarks, with VT-MPO scaling particularly well to high-dimensional musculoskeletal environments such as Hyfydy.
📝 Abstract
Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of additive-cost constraints with dual variables often hinders off-policy scalability. We propose a control-as-inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We introduce two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but induce distinct optimization structures, leading to SAC- and MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.
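To make the survival-weighting idea concrete, here is a minimal illustrative sketch (our interpretation, not the paper's exact formulation): each step carries a survival probability `s_t` in [0, 1] derived from constraint violations, and the cumulative product of these probabilities both attenuates rewards and shortens the effective planning horizon.

```python
def survival_weighted_return(rewards, survival_probs, gamma=0.99):
    """Compute G = sum_t gamma^t * (prod_{k<=t} s_k) * r_t.

    rewards        : per-step rewards r_t
    survival_probs : per-step continuation probabilities s_t in [0, 1]
                     (a constraint violation at step t lowers s_t)
    gamma          : standard discount factor
    """
    g, weight = 0.0, 1.0
    for r, s in zip(rewards, survival_probs):
        weight *= s      # cumulative survival up to and including step t
        g += weight * r  # violated steps downweight all subsequent reward
        weight *= gamma  # apply discount before the next step
    return g
```

When all `s_t = 1` this reduces to the ordinary discounted return, so the original return structure is preserved in the violation-free case; lowering any `s_t` shrinks the weight of every later reward, which is the horizon-shortening effect described above.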