Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the sample complexity of distributionally robust (DR) average-reward reinforcement learning, motivated by real-world applications, such as robotics and healthcare, that demand stable long-term performance. Addressing the lack of finite-sample theoretical guarantees in the existing DR RL literature, we propose two near-optimal algorithms: the first reduces the problem to a DR discounted MDP, while the second introduces an *anchored-state mechanism* that stabilizes the controlled transition kernels within KL- and *f*-divergence-based uncertainty sets. Assuming the nominal MDP is uniformly ergodic, both algorithms attain a sample complexity upper bound of $\widetilde{O}(|S||A|\,t_{\mathrm{mix}}^{2}\,\varepsilon^{-2})$, established via a reduction to discounted MDPs and a mixing-time analysis; this is the first finite-sample convergence guarantee for DR average-reward MDPs. Numerical experiments corroborate the theoretically predicted convergence rate.
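For context on how a reduction argument of this kind yields such a rate, a standard (illustrative) link between the average-reward and discounted criteria under uniform ergodicity is

$$\bigl|\,(1-\gamma)\,V^{\pi}_{\gamma}(s)\;-\;\rho^{\pi}\,\bigr| \;\le\; C\, t_{\mathrm{mix}}\,(1-\gamma) \qquad \text{for all } s,$$

where $V^{\pi}_{\gamma}$ is the discounted value, $\rho^{\pi}$ is the long-run average reward, and $C$ is an absolute constant. Choosing the effective horizon $(1-\gamma)^{-1}$ on the order of $t_{\mathrm{mix}}/\varepsilon$ keeps the approximation error at $O(\varepsilon)$, and combining this with a sample-complexity bound for the DR discounted MDP gives a rate of the form $\widetilde{O}(|S||A|\,t_{\mathrm{mix}}^{2}\,\varepsilon^{-2})$. The constant $C$ and this particular horizon choice are expository assumptions, not values taken from the paper.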

📝 Abstract
Motivated by practical applications where stable long-term performance is critical, such as robotics, operations research, and healthcare, we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}|\, t_{\mathrm{mix}}^{2}\,\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.
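For readers unfamiliar with how robust backups over KL uncertainty sets are typically evaluated, the sketch below uses the standard convex dual of the KL-constrained worst-case expectation; the function name, the SciPy-based solver, and the search interval for the dual variable are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_robust_expectation(p, v, delta):
    """Worst-case expectation of v over all q with KL(q || p) <= delta.

    Uses the standard dual form:
        inf_{KL(q||p)<=delta} E_q[v] = sup_{lam>0} -lam*log E_p[exp(-v/lam)] - lam*delta.
    Generic illustration of the robust backup used in DR value iteration,
    not the paper's exact algorithm.
    """
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)

    def neg_dual(lam):
        # Stable log-sum-exp for log E_p[exp(-v/lam)].
        z = -v / lam
        m = z.max()
        log_mgf = m + np.log(np.dot(p, np.exp(z - m)))
        return -(-lam * log_mgf - lam * delta)  # negate for minimization

    # Bracketing interval for the dual variable is an assumption for the example.
    res = minimize_scalar(neg_dual, bounds=(1e-8, 1e3), method="bounded")
    return -res.fun

# Example: worst-case next-state value under a small KL ball around one nominal kernel row.
p_nominal = np.array([0.5, 0.3, 0.2])   # nominal transition row P(. | s, a)
v_next = np.array([1.0, 0.0, 2.0])      # current value estimates V(s')
print(kl_robust_expectation(p_nominal, v_next, delta=0.05))
```

As the radius `delta` shrinks to zero, the result approaches the nominal expectation; as it grows, the value moves toward the minimum of `v`, which is the qualitative behavior a robust Bellman backup should exhibit.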
Problem

Research questions and friction points this paper is trying to address.

Study distributionally robust average-reward reinforcement learning for stable long-term performance
Propose algorithms with near-optimal sample complexity for robust MDPs
Establish finite-sample convergence guarantees under KL and f-divergence uncertainty sets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reduces DR average-reward to discounted MDP
Introduces anchoring state for kernel stability (hypothetical sketch below)
Achieves near-optimal sample complexity guarantee
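The paper's precise anchoring construction is not reproduced here. As a purely hypothetical illustration of the general idea, stabilizing the controlled kernels by mixing in a designated anchor state, one might write something like the following, where the mixing weight `eta`, the anchor index, and the function name are all invented for the example.

```python
import numpy as np

def anchor_kernel(P, anchor_state, eta):
    """Hypothetical anchoring transform (illustrative only, not the paper's definition).

    P is a nominal transition tensor of shape (S, A, S). Each row P[s, a, :] is mixed
    with a point mass at `anchor_state` using weight `eta`, so every state-action pair
    reaches the anchor with probability at least eta, keeping perturbed kernels well-behaved.
    """
    anchored = (1.0 - eta) * np.asarray(P, dtype=float).copy()
    anchored[:, :, anchor_state] += eta
    return anchored

# Example usage on a toy 3-state, 2-action kernel.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))          # random nominal kernel, rows sum to 1
P_anchored = anchor_kernel(P, anchor_state=0, eta=0.1)
assert np.allclose(P_anchored.sum(axis=-1), 1.0)    # rows still sum to 1
```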
Zijun Chen
Department of Computer Science and Engineering, HKUST
Shengbo Wang
Department of Management Science and Engineering, Stanford University
Nian Si
Hong Kong University of Science and Technology
Applied Probability · Experimental Design · Causal Inference