🤖 AI Summary
This dissertation addresses model uncertainty and risk-averse decision-making in reinforcement learning, unifying dynamic programming (DP) and policy gradient perspectives. Its contributions are threefold: (1) It identifies a new connection between policy gradient methods and DP in multimodel MDPs and introduces the Coordinate Ascent Dynamic Programming (CADP) algorithm, which guarantees monotone policy improvement to a local maximum of the discounted return averaged over the uncertain models. (2) For the total reward criterion (TRC) under the entropic risk measure (ERM) and entropic value-at-risk (EVaR), it establishes necessary and sufficient conditions for the exponential ERM Bellman operator to be a contraction, proves the existence of stationary deterministic optimal policies, and derives exponential value iteration, policy iteration, and linear programming algorithms for the resulting ERM-TRC and EVaR-TRC objectives. (3) It proposes model-free Q-learning algorithms for ERM-TRC and EVaR-TRC and, overcoming the convergence barrier posed by non-contractive Bellman operators, uses their monotonicity to prove convergence to the optimal risk-averse value functions. Experiments demonstrate the efficacy and theoretical soundness of the approach in computing robust stationary policies.
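For reference, under one common sign convention for reward-maximizing agents (an assumption on my part; normalizations of the risk level differ across the literature), the two risk measures can be written as:

```latex
\[
\mathrm{ERM}_\beta(X) \;=\; -\frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{-\beta X}\right],
\qquad \beta > 0,
\]
\[
\mathrm{EVaR}_\alpha(X) \;=\; \sup_{\beta > 0}\;\Big\{\, \mathrm{ERM}_\beta(X) + \frac{\log \alpha}{\beta} \,\Big\},
\qquad \alpha \in (0,1],
\]
```

so that EVaR is a supremum of entropic risk measures, which is what links the two objectives analytically.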
📝 Abstract
This dissertation makes three main contributions. First, we identify a new connection between policy gradient and dynamic programming in multimodel Markov decision processes (MMDPs) and propose the Coordinate Ascent Dynamic Programming (CADP) algorithm to compute a Markov policy that maximizes the discounted return averaged over the uncertain models. CADP adjusts model weights iteratively to guarantee monotone policy improvement to a local maximum. Second, we establish necessary and sufficient conditions for the exponential ERM Bellman operator to be a contraction and prove the existence of stationary deterministic optimal policies for the entropic risk measure (ERM) and entropic value-at-risk (EVaR) under the total reward criterion (ERM-TRC and EVaR-TRC). We also propose exponential value iteration, policy iteration, and linear programming algorithms for computing optimal stationary policies for these objectives. Third, we propose model-free Q-learning algorithms for computing policies with the risk-averse ERM-TRC and EVaR-TRC objectives. The challenge is that the ERM Bellman operator underlying Q-learning may not be a contraction. Instead, we use the monotonicity of the ERM Bellman operators to derive a rigorous proof that the ERM-TRC and EVaR-TRC Q-learning algorithms converge to the optimal risk-averse value functions and compute the optimal stationary policies.
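As a rough illustration of the exponential ERM Bellman backup that the value iteration algorithms iterate, the following sketch applies a certainty-equivalent update on a tabular MDP. The array names `P`, `r`, and the risk level `beta` are hypothetical illustration choices, not from the dissertation, and the sketch ignores the TRC subtlety that this operator may fail to be a contraction (the case the dissertation's analysis addresses):

```python
import numpy as np

def erm_value_iteration(P, r, beta, iters=500):
    """Risk-averse value iteration with an entropic (ERM) backup.

    P : transition probabilities, shape (n_actions, n_states, n_states)
    r : immediate rewards, shape (n_actions, n_states)
    beta : risk-aversion level, beta > 0
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(iters):
        # Certainty-equivalent of the next-state value under each action:
        # q(a, s) = r(a, s) - (1/beta) * log E_{s'}[ exp(-beta * v(s')) ]
        q = r - (1.0 / beta) * np.log(P @ np.exp(-beta * v))
        v = q.max(axis=0)  # greedy (risk-averse) improvement step
    return v
```

For beta → 0 the log-sum-exp term recovers the ordinary expected-value backup, which is one way to sanity-check an implementation.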