Reward Redistribution for CVaR MDPs using a Bellman Operator on L-infinity

📅 2026-02-03

🤖 AI Summary

This work addresses the problem that the static Conditional Value-at-Risk (CVaR) of the return in a Markov decision process depends on entire trajectories and does not admit a recursive Bellman decomposition, which makes tail-risk optimization difficult. The classical fix, augmenting the state with a continuous variable, yields sparse rewards and degenerate fixed points unless value functions are restricted to a specialized admissible class. The authors propose a new augmentation-based formulation of the static CVaR objective whose Bellman operator has dense per-step rewards and is contractive on the full space of bounded value functions. Building on this operator, they develop a risk-averse value iteration algorithm and a model-free Q-learning method over discretized augmented states, with convergence guarantees and bounds on the approximation error introduced by discretization. Empirical results indicate that the learned policies are CVaR-sensitive and achieve effective trade-offs between performance and safety.

📝 Abstract

Tail-end risk measures such as static conditional value-at-risk (CVaR) are used in safety-critical applications to prevent rare, yet catastrophic events. Unlike risk-neutral objectives, the static CVaR of the return depends on entire trajectories without admitting a recursive Bellman decomposition in the underlying Markov decision process. A classical resolution relies on state augmentation with a continuous variable. However, unless restricted to a specialized class of admissible value functions, this formulation induces sparse rewards and degenerate fixed points. In this work, we propose a novel formulation of the static CVaR objective based on augmentation. Our alternative approach leads to a Bellman operator with: (1) dense per-step rewards; (2) contracting properties on the full space of bounded value functions. Building on this theoretical foundation, we develop risk-averse value iteration and model-free Q-learning algorithms that rely on discretized augmented states. We further provide convergence guarantees and approximation error bounds due to discretization. Empirical results demonstrate that our algorithms successfully learn CVaR-sensitive policies and achieve effective performance-safety trade-offs.
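To make the objective concrete, the sketch below computes the static CVaR of a small set of sampled returns in two ways. It is not the paper's algorithm. It assumes the lower-tail convention (CVaR at level alpha is the mean of the worst alpha-fraction of returns) and uses the standard Rockafellar-Uryasev variational form, CVaR_alpha(G) = max_b { b - E[(b - G)^+] / alpha }. The scalar b is the kind of continuous variable that classical state-augmentation approaches carry alongside the state. The function names and sample data are hypothetical.

```python
def cvar_tail_average(returns, alpha):
    """Mean of the worst alpha-fraction of returns.

    Assumes alpha * len(returns) is a positive integer, so the tail
    contains a whole number of equally weighted samples.
    """
    k = round(alpha * len(returns))
    worst = sorted(returns)[:k]
    return sum(worst) / k


def cvar_variational(returns, alpha):
    """Rockafellar-Uryasev form: max over b of b - E[(b - G)^+] / alpha.

    The objective is concave and piecewise linear in b with kinks at the
    sample values, so searching over the samples themselves is enough.
    """
    n = len(returns)

    def objective(b):
        shortfall = sum(max(b - g, 0.0) for g in returns) / n
        return b - shortfall / alpha

    return max(objective(b) for b in returns)


# Hypothetical returns: mostly small gains, one rare catastrophic loss.
returns = [-10.0, -2.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0]
alpha = 0.2

print(cvar_tail_average(returns, alpha))
print(cvar_variational(returns, alpha))
```

Both functions agree up to floating-point rounding on this data, while the plain mean of the same returns is positive, which illustrates why a risk-neutral objective can hide the rare catastrophic outcome that CVaR is designed to expose.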

Problem

Research questions and friction points this paper is trying to address.

CVaR

Markov decision process

Bellman operator

risk-averse reinforcement learning

tail-end risk

Innovation

Methods, ideas, or system contributions that make the work stand out.

CVaR MDPs

Bellman operator

reward redistribution