A Finite-Sample Analysis of Distributionally Robust Average-Reward Reinforcement Learning

📅 2025-05-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of finite-sample complexity analysis for long-horizon decision-making in distributionally robust reinforcement learning (DR-RL) under the average-reward criterion. We propose the Robust Halpern Iteration (RHI) algorithm, the first provably sample-efficient DR-RL method that requires no prior knowledge of the environment. RHI integrates distributionally robust optimization with average-reward MDP modeling, employing both a contamination model and an $\ell_p$-norm uncertainty set to quantify distributional ambiguity. For an MDP with $S$ states, $A$ actions, and bias span $\mathcal{H}$, RHI computes an $\varepsilon$-optimal policy with sample complexity $\widetilde{\mathcal{O}}(SA\mathcal{H}^2/\varepsilon^2)$, achieving a polynomial upper bound. This establishes the first finite-sample guarantee for DR-RL in the average-reward setting, bridging a critical theoretical gap while preserving practical deployability.
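
For reference, the uncertainty sets and the anchored update behind RHI admit standard closed forms. The display below is a hedged reconstruction from the robust-MDP and Halpern-iteration literature, with the step schedule $\beta_k = 1/(k+2)$ assumed for illustration; the paper's exact operator, normalization, and schedule may differ.

```latex
% Contamination set of radius \delta around a nominal row p = P(\cdot \mid s,a):
%   \mathcal{U}(p) = \{ (1-\delta)\,p + \delta\,q : q \in \Delta(S) \},
% whose worst-case expectation of a bias function h has a closed form:
\[
  \inf_{\tilde q \in \mathcal{U}(p)} \langle \tilde q, h \rangle
  = (1-\delta)\,\langle p, h \rangle + \delta \min_{s'} h(s').
\]
% Plugging this into the robust average-reward Bellman operator
\[
  (T h)(s) = \max_{a}\Big\{ r(s,a)
    + \inf_{\tilde q \in \mathcal{U}(P(\cdot \mid s,a))} \langle \tilde q, h \rangle \Big\},
\]
% a Halpern-anchored iteration averages each update back toward the anchor h_0:
\[
  h_{k+1} = \beta_k\, h_0 + (1-\beta_k)\, T(h_k),
  \qquad \beta_k = \tfrac{1}{k+2}.
\]
```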

πŸ“ Abstract
Robust reinforcement learning (RL) under the average-reward criterion is crucial for long-term decision making under potential environment mismatches, yet its finite-sample complexity study remains largely unexplored. Existing works offer algorithms with asymptotic guarantees, but the absence of finite-sample analysis hinders its principled understanding and practical deployment, especially in data-limited settings. We close this gap by proposing Robust Halpern Iteration (RHI), the first algorithm with provable finite-sample complexity guarantee. Under standard uncertainty sets -- including contamination sets and $ell_p$-norm balls -- RHI attains an $epsilon$-optimal policy with near-optimal sample complexity of $ ilde{mathcal O}left(frac{SAmathcal H^{2}}{epsilon^{2}} ight)$, where $S$ and $A$ denote the numbers of states and actions, and $mathcal H$ is the robust optimal bias span. This result gives the first polynomial sample complexity guarantee for robust average-reward RL. Moreover, our RHI's independence from prior knowledge distinguishes it from many previous average-reward RL studies. Our work thus constitutes a significant advancement in enhancing the practical applicability of robust average-reward methods to complex, real-world problems.
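
To make the scheme concrete, here is a minimal NumPy sketch of a plug-in version under the contamination model: estimate an empirical kernel, apply the closed-form robust Bellman operator, and anchor each update at $h_0$. Every name here (robust_bellman_contamination, robust_halpern_iteration), the schedule $\beta_k = 1/(k+2)$, and the fixed iteration budget are illustrative assumptions rather than the authors' exact algorithm.

```python
import numpy as np

def robust_bellman_contamination(h, r, P, delta):
    """Robust average-reward Bellman operator under delta-contamination.

    Uses the closed form  inf_q <q, h> = (1 - delta) * <p, h> + delta * min(h).

    h : (S,)      current bias estimate
    r : (S, A)    reward table
    P : (S, A, S) nominal (e.g., empirical) transition kernel
    """
    worst_case = (1.0 - delta) * (P @ h) + delta * h.min()  # (S, A)
    return (r + worst_case).max(axis=1)                     # maximize over actions

def robust_halpern_iteration(r, P, delta, num_iters=2000):
    """Halpern-anchored iteration: h_{k+1} = b_k * h0 + (1 - b_k) * T(h_k).

    Heuristically, anchoring drives the residual T(h_k) - h_k toward a
    constant vector whose level is the robust optimal gain; greedy policies
    w.r.t. h_k are unaffected by that constant shift. See the paper for the
    exact operator and finite-sample guarantees.
    """
    S, A = r.shape
    h0 = np.zeros(S)
    h = h0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        h = beta * h0 + (1.0 - beta) * robust_bellman_contamination(h, r, P, delta)
    # Greedy policy with respect to the approximate robust bias function
    q = r + (1.0 - delta) * (P @ h) + delta * h.min()
    return h, q.argmax(axis=1)

# Illustrative usage on a random toy MDP (hypothetical data, not from the paper)
rng = np.random.default_rng(0)
S, A = 5, 3
r = rng.random((S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)  # make each row a distribution
h, policy = robust_halpern_iteration(r, P, delta=0.1)
```

The $\ell_p$-ball case would swap the closed form in robust_bellman_contamination for the corresponding dual expression of the worst-case expectation; the anchored outer loop is unchanged.
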
Problem

Research questions and friction points this paper is trying to address.

Finite-sample analysis for robust average-reward RL
Lack of finite-sample guarantees in existing methods
Proposing RHI with polynomial sample complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the Robust Halpern Iteration (RHI) algorithm
Achieves a near-optimal sample complexity guarantee
Requires no prior knowledge of the environment
Authors

Zachary Roch
PhD student at the University of Central Florida
FinTech, Reinforcement Learning, Optimization, Blockchain

Chi Zhang
Department of Electrical and Computer Engineering, University of Central Florida

George Atia
Professor, University of Central Florida
Machine Learning, Explainable AI, Robust Learning and Inference, Statistical Signal Processing

Yue Wang
Department of Electrical and Computer Engineering, Department of Computer Science, University of Central Florida