🤖 AI Summary
This work addresses the lack of a recursive Bellman structure for static Conditional Value-at-Risk (CVaR) objectives in Markov decision processes, which hinders efficient tail-risk optimization. The authors propose a novel reformulation of the static CVaR objective through state augmentation and construct a CVaR Bellman operator that has dense per-step rewards and is a contraction on the full space of bounded value functions, avoiding the sparse rewards and degenerate fixed points of the classical augmented formulation. Building on this operator, they propose a risk-averse value iteration algorithm and a model-free Q-learning method over discretized augmented states, both accompanied by convergence guarantees and discretization error bounds in the L∞ norm. Empirical results show that the learned policies are CVaR-sensitive and achieve effective performance-safety trade-offs.
📝 Abstract
Tail-end risk measures such as static conditional value-at-risk (CVaR) are used in safety-critical applications to prevent rare yet catastrophic events. Unlike risk-neutral objectives, the static CVaR of the return depends on entire trajectories and does not admit a recursive Bellman decomposition in the underlying Markov decision process. A classical resolution relies on augmenting the state with a continuous variable. However, unless restricted to a specialized class of admissible value functions, this formulation induces sparse rewards and degenerate fixed points. In this work, we propose a novel formulation of the static CVaR objective based on state augmentation. Our alternative approach leads to a Bellman operator with (1) dense per-step rewards and (2) contraction properties on the full space of bounded value functions. Building on this theoretical foundation, we develop risk-averse value iteration and model-free Q-learning algorithms that rely on discretized augmented states. We further provide convergence guarantees and bounds on the approximation error due to discretization. Empirical results demonstrate that our algorithms successfully learn CVaR-sensitive policies and achieve effective performance-safety trade-offs.
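To make the objective concrete, the following minimal sketch computes the static CVaR of a batch of sampled returns in two ways. It is an illustration and not the authors' algorithm. It assumes the reward-based convention, in which CVaR at level α is the mean of the worst α-fraction of returns, and it assumes α times the sample size is an integer so that the two estimates coincide. The function names and sample data are hypothetical. The second function uses the standard Rockafellar-Uryasev variational form, CVaR_α(Z) = max_s { s − E[(s − Z)^+] / α }. Its scalar s is, presumably, the kind of continuous quantity that augmentation-based formulations carry alongside the state.

```python
import math


def cvar_tail_mean(returns, alpha):
    """Mean of the worst alpha-fraction of returns (lower tail)."""
    z = sorted(returns)
    # Number of samples in the tail; the small epsilon guards against
    # floating-point noise when alpha * n is an exact integer.
    k = max(1, math.ceil(alpha * len(z) - 1e-12))
    return sum(z[:k]) / k


def cvar_variational(returns, alpha):
    """Rockafellar-Uryasev form: max over s of s - E[(s - Z)^+] / alpha.

    For an empirical distribution the objective is concave and piecewise
    linear in s, so it is enough to evaluate it at the sample points.
    """
    n = len(returns)
    best = -math.inf
    for s in returns:
        shortfall = sum(max(s - z, 0.0) for z in returns) / n
        best = max(best, s - shortfall / alpha)
    return best


# Hypothetical episode returns: mostly good, with two rare bad outcomes.
sample_returns = [10, 8, 9, -20, 7, 11, 9, 10, -5, 8]
alpha = 0.2

mean_return = sum(sample_returns) / len(sample_returns)
tail = cvar_tail_mean(sample_returns, alpha)
variational = cvar_variational(sample_returns, alpha)
print(mean_return, tail, variational)
```

On this sample the mean return is positive while both CVaR estimates are strongly negative, because they depend only on the two worst episodes. That is the tail sensitivity a risk-neutral objective ignores.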