🤖 AI Summary
This work addresses a limitation of conventional constrained reinforcement learning, which relies on expected cost constraints and therefore struggles to prevent rare but high-impact safety violations arising from tail events. To this end, the paper introduces extreme value theory (EVT) into safe reinforcement learning for the first time and proposes the Extreme Value policy Optimization (EVO) algorithm. EVO models the extremal behavior of both rewards and costs through an optimization objective based on extreme quantiles, and it incorporates a prioritized experience replay mechanism that emphasizes extreme samples. Theoretically, the method guarantees constraint satisfaction at a zero-violation quantile level. Empirically, EVO significantly reduces both the probability and the variance of constraint violations during training while maintaining policy performance comparable to baseline approaches.
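To make the replay idea concrete, here is a minimal, hedged sketch of prioritizing extreme samples in a replay buffer. The function name, the threshold quantile, and the linear boost on tail exceedances are illustrative assumptions, not the paper's actual scheme: transitions whose cost falls deeper into the tail of the observed cost distribution receive a proportionally higher sampling probability.

```python
# Illustrative sketch (assumed names/weighting, not the paper's exact mechanism):
# up-weight replay transitions whose cost lies in the tail of the cost distribution.
import numpy as np

def extreme_replay_probs(costs, threshold_q=0.9, boost=10.0):
    """Sampling probabilities that up-weight transitions with tail-level costs."""
    costs = np.asarray(costs, dtype=float)
    u = np.quantile(costs, threshold_q)       # high threshold on observed costs
    prio = np.ones_like(costs)                # baseline: uniform priority
    tail = costs > u
    prio[tail] += boost * (costs[tail] - u)   # boost grows with tail exceedance
    return prio / prio.sum()

rng = np.random.default_rng(1)
costs = rng.exponential(1.0, size=1000)       # synthetic per-transition costs
p = extreme_replay_probs(costs)
batch = rng.choice(costs.size, size=64, p=p)  # indices of replayed transitions
```

Uniform replay would sample each tail transition with probability 1/N; here the rare extreme-cost transitions dominate the batch, which is the general effect the summary describes.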
📝 Abstract
Ensuring safety is a critical challenge in applying Reinforcement Learning (RL) to real-world scenarios. Constrained Reinforcement Learning (CRL) addresses this by maximizing returns under predefined constraints, typically formulated on the expected cumulative cost. However, expectation-based constraints overlook rare but high-impact extreme events in the tail of the cost distribution, such as black swan incidents, which can lead to severe constraint violations. To address this issue, we propose the Extreme Value policy Optimization (EVO) algorithm, which leverages Extreme Value Theory (EVT) to model and exploit extreme reward and cost samples, thereby reducing constraint violations. EVO introduces an extreme quantile optimization objective that explicitly captures extreme samples in the tail of the cost distribution. Additionally, we propose an extreme-sample prioritization mechanism for experience replay, amplifying the learning signal from rare but high-impact extreme samples. Theoretically, we establish upper bounds on expected constraint violations during policy updates, guaranteeing strict constraint satisfaction at a zero-violation quantile level. Further, we show that EVO achieves a lower probability of constraint violations than expectation-based methods and lower variance than quantile regression methods. Extensive experiments show that EVO significantly reduces constraint violations during training while maintaining competitive policy performance relative to baselines.
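The abstract's core EVT ingredient, estimating quantiles far in the tail where empirical data is sparse, can be illustrated with the standard peaks-over-threshold method. The sketch below is a generic EVT illustration under assumed parameters (threshold at the 95th percentile, target quantile 0.999), not the paper's algorithm: exceedances over a high threshold are fitted with a generalized Pareto distribution (GPD), whose fitted tail extrapolates to quantile levels that expectation-based constraints never see.

```python
# Generic peaks-over-threshold EVT sketch (not the paper's implementation):
# fit a GPD to cost exceedances over a high threshold, then invert the fitted
# tail to estimate an extreme quantile of the cost distribution.
import numpy as np
from scipy.stats import genpareto

def extreme_quantile(costs, q=0.999, threshold_q=0.95):
    """Estimate the q-quantile of `costs` from a GPD fit to tail exceedances."""
    costs = np.asarray(costs, dtype=float)
    u = np.quantile(costs, threshold_q)        # high threshold
    exceedances = costs[costs > u] - u         # peaks over threshold
    shape, _, scale = genpareto.fit(exceedances, floc=0.0)
    p_exceed = exceedances.size / costs.size   # empirical P(cost > u)
    # Tail inversion: P(cost > u + y) = p_exceed * P_GPD(Y > y), solve for y.
    y = genpareto.ppf(1.0 - (1.0 - q) / p_exceed, shape, loc=0.0, scale=scale)
    return u + y

rng = np.random.default_rng(0)
samples = rng.pareto(3.0, size=50_000)         # heavy-tailed synthetic costs
est = extreme_quantile(samples, q=0.999)       # GPD-based tail estimate
emp = np.quantile(samples, 0.999)              # raw empirical quantile
```

For this heavy-tailed example the true 0.999-quantile is 9.0 (since (1-q)^(-1/3) - 1 = 9), and the GPD estimate lands close to it; the advantage of the parametric tail fit over the raw empirical quantile grows as the target level q moves beyond the range the sample can resolve directly.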