🤖 AI Summary
This dissertation addresses model uncertainty and risk-averse decision-making in reinforcement learning, unifying dynamic programming (DP) and policy gradient perspectives. Its contributions are threefold: (1) It identifies a new connection between policy gradient methods and DP in multimodel MDPs and introduces the Coordinate Ascent Dynamic Programming (CADP) algorithm, which guarantees monotone policy improvement to a local maximum of the discounted return averaged over the uncertain models. (2) For the total reward criterion (TRC) under the entropic risk measure (ERM) and entropic value-at-risk (EVaR), it establishes necessary and sufficient conditions for the exponential ERM Bellman operator to be a contraction, proves the existence of stationary deterministic optimal policies, and derives exponential value iteration, policy iteration, and linear programming algorithms for the resulting ERM-TRC and EVaR-TRC objectives. (3) It proposes model-free Q-learning algorithms for ERM-TRC and EVaR-TRC and, overcoming the convergence barrier posed by non-contractive Bellman operators, uses their monotonicity to prove convergence to the optimal risk-averse value functions. Experiments demonstrate the efficacy and theoretical soundness of the approach in computing robust stationary policies.
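For reference, under one common sign convention for reward-maximizing agents (an assumption on my part; normalizations of the risk level differ across the literature), the two risk measures can be written as:

```latex
\[
\mathrm{ERM}_\beta(X) \;=\; -\frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{-\beta X}\right],
\qquad \beta > 0,
\]
\[
\mathrm{EVaR}_\alpha(X) \;=\; \sup_{\beta > 0}\;\Big\{\, \mathrm{ERM}_\beta(X) + \frac{\log \alpha}{\beta} \,\Big\},
\qquad \alpha \in (0,1],
\]
```

so that EVaR is a supremum of entropic risk measures, which is what links the two objectives analytically.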
📝 Abstract
This dissertation makes three main contributions. First, we identify a new connection between policy gradient and dynamic programming in multimodel Markov decision processes (MMDPs) and propose the Coordinate Ascent Dynamic Programming (CADP) algorithm to compute a Markov policy that maximizes the discounted return averaged over the uncertain models. CADP adjusts model weights iteratively to guarantee monotone policy improvement to a local maximum. Second, we establish necessary and sufficient conditions for the exponential ERM Bellman operator to be a contraction and prove the existence of stationary deterministic optimal policies for the entropic risk measure (ERM) and entropic value-at-risk (EVaR) under the total reward criterion (ERM-TRC and EVaR-TRC). We also propose exponential value iteration, policy iteration, and linear programming algorithms for computing optimal stationary policies for these objectives. Third, we propose model-free Q-learning algorithms for computing policies with the risk-averse ERM-TRC and EVaR-TRC objectives. The challenge is that the ERM Bellman operator underlying Q-learning may not be a contraction. Instead, we use the monotonicity of the ERM Bellman operators to derive a rigorous proof that the ERM-TRC and EVaR-TRC Q-learning algorithms converge to the optimal risk-averse value functions and compute the optimal stationary policies.
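As a rough illustration of the exponential ERM Bellman backup that the value iteration algorithms iterate, the following sketch applies a certainty-equivalent update on a tabular MDP. The array names `P`, `r`, and the risk level `beta` are hypothetical illustration choices, not from the dissertation, and the sketch ignores the TRC subtlety that this operator may fail to be a contraction (the case the dissertation's analysis addresses):

```python
import numpy as np

def erm_value_iteration(P, r, beta, iters=500):
    """Risk-averse value iteration with an entropic (ERM) backup.

    P : transition probabilities, shape (n_actions, n_states, n_states)
    r : immediate rewards, shape (n_actions, n_states)
    beta : risk-aversion level, beta > 0
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(iters):
        # Certainty-equivalent of the next-state value under each action:
        # q(a, s) = r(a, s) - (1/beta) * log E_{s'}[ exp(-beta * v(s')) ]
        q = r - (1.0 / beta) * np.log(P @ np.exp(-beta * v))
        v = q.max(axis=0)  # greedy (risk-averse) improvement step
    return v
```

For beta → 0 the log-sum-exp term recovers the ordinary expected-value backup, which is one way to sanity-check an implementation.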