🤖 AI Summary
This work addresses the poor off-policy scalability of constrained reinforcement learning methods that rely on additive cost constraints and dual variables. The authors propose a control-as-inference framework grounded in stochastic decision horizons, in which constraint violations are modeled as reward attenuation and a reduction of the effective planning horizon through state-action-dependent survival probabilities. A replay-compatible, survival-weighted objective enables sample-efficient off-policy learning. The framework further supports two constraint semantics, absorbing and virtual termination, which adapt naturally to SAC- or MPO-style policy optimization while preserving the original return structure. Experiments demonstrate significant gains in sample efficiency and in the trade-off between returns and constraint satisfaction on standard benchmarks, with VT-MPO scaling particularly well to high-dimensional musculoskeletal environments such as Hyfydy.
📝 Abstract
Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of additive-cost constraints with dual variables often hinders off-policy scalability. We propose a control-as-inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We introduce two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but induce distinct optimization structures, leading to SAC- and MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.
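To make the survival-weighting idea concrete, here is a minimal illustrative sketch (our interpretation, not the paper's exact formulation): each step carries a survival probability `s_t` in [0, 1] derived from constraint violations, and the cumulative product of these probabilities both attenuates rewards and shortens the effective planning horizon.

```python
def survival_weighted_return(rewards, survival_probs, gamma=0.99):
    """Compute G = sum_t gamma^t * (prod_{k<=t} s_k) * r_t.

    rewards        : per-step rewards r_t
    survival_probs : per-step continuation probabilities s_t in [0, 1]
                     (a constraint violation at step t lowers s_t)
    gamma          : standard discount factor
    """
    g, weight = 0.0, 1.0
    for r, s in zip(rewards, survival_probs):
        weight *= s      # cumulative survival up to and including step t
        g += weight * r  # violated steps downweight all subsequent reward
        weight *= gamma  # apply discount before the next step
    return g
```

When all `s_t = 1` this reduces to the ordinary discounted return, so the original return structure is preserved in the violation-free case; lowering any `s_t` shrinks the weight of every later reward, which is the horizon-shortening effect described above.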