Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses robust policy learning for average-reward reinforcement learning under environmental uncertainty, considering ambiguity sets defined by total variation (TV) and Wasserstein distances. Methodologically, it proposes the first non-asymptotically convergent robust $Q$-learning and robust actor-critic algorithms. By constructing a quotient seminorm, the authors establish strict contractivity of the robust $Q$-Bellman operator, overcoming a key theoretical barrier in finite-sample analysis of distributionally robust RL under the average-reward criterion. The algorithms achieve sample complexities of $\tilde{O}(\varepsilon^{-2})$ and $\tilde{O}(\varepsilon^{-3})$, respectively, yielding $\varepsilon$-optimal robust policies efficiently. The core contribution lies in the novel integration of quotient seminorm analysis, robust Bellman theory, and stochastic approximation, thereby establishing the first non-asymptotic convergence framework for distributionally robust RL in the average-reward setting.
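
A natural candidate for the quotient seminorm is the span seminorm, which measures distance to the constant functions and therefore identifies any two functions differing by a constant. The sketch below states that construction and the resulting contraction claim; the symbol $\gamma$ and the operator notation $\mathcal{T}$ are illustrative placeholders, not taken from the paper:

```latex
% Span (quotient) seminorm: distance to the constants, so every
% constant function is identified with zero.
\[
  \|f\|_{\mathrm{sp}}
  \;=\; \inf_{c \in \mathbb{R}} \|f - c\,\mathbf{1}\|_{\infty}
  \;=\; \tfrac{1}{2}\Bigl(\max_{x} f(x) - \min_{x} f(x)\Bigr).
\]
% Strict contraction of the robust Q-Bellman operator T in this
% seminorm, for some modulus gamma < 1 (e.g., depending on the
% uncertainty-set radius):
\[
  \|\mathcal{T} Q_1 - \mathcal{T} Q_2\|_{\mathrm{sp}}
  \;\le\; \gamma\,\|Q_1 - Q_2\|_{\mathrm{sp}},
  \qquad \gamma \in (0,1).
\]
```

Quotienting out constants is what makes a fixed-point analysis possible here: in the average-reward setting the Bellman operator is only non-expansive in the sup-norm, since adding a constant to $Q$ shifts $\mathcal{T}Q$ by the same constant.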

📝 Abstract
We present the first $Q$-learning and actor-critic algorithms for robust average reward Markov Decision Processes (MDPs) with non-asymptotic convergence under contamination, TV distance, and Wasserstein distance uncertainty sets. We show that the robust $Q$-Bellman operator is a strict contractive mapping with respect to a carefully constructed semi-norm with constant functions being quotiented out. This property supports a stochastic approximation update that learns the optimal robust $Q$ function in $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also show that the same idea can be used for robust $Q$ function estimation, which can be further used for critic estimation. Coupling it with the theory of robust policy mirror descent updates, we present a natural actor-critic algorithm that attains an $\epsilon$-optimal robust policy in $\tilde{\mathcal{O}}(\epsilon^{-3})$ samples. These results advance the theory of distributionally robust reinforcement learning in the average reward setting.
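
To make the stochastic approximation update concrete, here is a heavily simplified tabular sketch under a TV uncertainty set. Everything here is our illustrative assumption, not the paper's algorithm: the greedy TV worst-case solver, access to the nominal kernel row `P[s, a]` for the robust backup (the paper's model-free setting would estimate this quantity from samples), and the reference-entry normalization that plays the role of quotienting out constants:

```python
import numpy as np

def worst_case_value(p, v, delta):
    """Worst-case expectation of v over the TV ball of radius delta around
    the nominal distribution p: greedily move up to delta probability mass
    from the highest-value states to the lowest-value state.
    (Illustrative solver; a dual formulation could be used instead.)"""
    q = p.copy()
    lo = np.argmin(v)
    budget = delta
    for s in np.argsort(v)[::-1]:          # highest-value states first
        if s == lo or budget <= 0:
            continue
        moved = min(q[s], budget)
        q[s] -= moved
        q[lo] += moved
        budget -= moved
    return q @ v

def robust_q_learning(P, R, delta, n_iters=50_000, alpha=0.01, seed=0):
    """Tabular robust relative Q-learning sketch for average-reward MDPs.
    P: nominal kernel, shape (S, A, S); R: rewards, shape (S, A).
    The reference entry Q[0, 0] is subtracted every step so the iterates
    stay bounded -- the algorithmic analogue of the quotient seminorm."""
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    s = 0
    for _ in range(n_iters):
        a = rng.integers(A)                    # uniform exploration
        s_next = rng.choice(S, p=P[s, a])      # sample from nominal kernel
        target = R[s, a] + worst_case_value(P[s, a], Q.max(axis=1), delta)
        Q[s, a] += alpha * (target - Q[s, a])
        Q -= Q[0, 0]                           # relative normalization
        s = s_next
    return Q
```

The normalization step keeps the $Q$ iterates in a single representative of each equivalence class of the quotient, which is what lets the seminorm contraction translate into convergence of the actual iterates.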
Problem

Research questions and friction points this paper is trying to address.

Develop robust $Q$-learning for average-reward MDPs
Prove the robust $Q$-Bellman operator is a strict contraction
Achieve an $\varepsilon$-optimal robust policy sample-efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Robust $Q$-learning with non-asymptotic convergence guarantees
Contraction of the robust $Q$-Bellman operator via a quotient semi-norm
Actor-critic algorithm attaining an $\varepsilon$-optimal robust policy (see the sketch after this list)
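
On the actor side, a KL-regularized policy mirror descent step reduces to a multiplicative-weights update on the robust critic estimates. A minimal sketch, where the step size `eta` and all names are illustrative assumptions rather than the paper's specification:

```python
import numpy as np

def mirror_descent_step(pi, Q_robust, eta=0.1):
    """One KL-regularized policy mirror descent step:
    pi_next(a|s) is proportional to pi(a|s) * exp(eta * Q_robust(s, a)).
    pi: (S, A) row-stochastic policy; Q_robust: (S, A) critic estimate."""
    logits = np.log(pi + 1e-12) + eta * Q_robust
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```

Iterating this actor step against a critic obtained from the robust $Q$-estimation routine is the natural actor-critic loop the abstract describes, with the extra $\epsilon^{-1}$ factor in the $\tilde{\mathcal{O}}(\epsilon^{-3})$ complexity coming from re-estimating the critic at each policy update.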