🤖 AI Summary
This work addresses the lack of finite-sample complexity analysis for long-horizon decision-making in distributionally robust reinforcement learning (DR-RL) under the average-reward criterion. We propose the Robust Halpern Iteration (RHI) algorithm, the first provably sample-efficient DR-RL method that requires no prior knowledge of the environment. RHI integrates distributionally robust optimization with average-reward MDP modeling, employing both a contamination model and an $\ell_p$-norm uncertainty set to quantify distributional ambiguity. For an MDP with $S$ states, $A$ actions, and bias span $\mathcal{H}$, RHI computes an $\varepsilon$-optimal policy with sample complexity $\widetilde{\mathcal{O}}(SA\mathcal{H}^2/\varepsilon^2)$, a polynomial upper bound. This establishes the first finite-sample guarantee for DR-RL in the average-reward setting, bridging a critical theoretical gap while preserving practical deployability.
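To make the core iterative scheme concrete, below is a minimal sketch of classical Halpern iteration, the anchored fixed-point scheme that RHI builds on. The operator `T`, the anchor weights, and the toy example are illustrative assumptions for exposition; they are not the paper's robust Bellman operator or its actual step-size schedule.

```python
def halpern(T, x0, num_iters):
    """Anchored fixed-point scheme x_{k+1} = beta_k * x0 + (1 - beta_k) * T(x_k),
    using the standard anchor weights beta_k = 1 / (k + 2).

    The anchoring toward x0 is what yields last-iterate convergence rates
    for nonexpansive operators, in contrast to plain Picard iteration.
    """
    x = x0
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        x = beta * x0 + (1.0 - beta) * T(x)
    return x

# Toy operator (a contraction, hence nonexpansive) with fixed point x* = 2.
T = lambda x: 0.5 * x + 1.0
print(halpern(T, 0.0, 10_000))  # approaches the fixed point 2
```

In RHI, the scalar `x` would be replaced by a value-function estimate and `T` by an empirical robust Bellman-type operator evaluated over the uncertainty set.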
📝 Abstract
Robust reinforcement learning (RL) under the average-reward criterion is crucial for long-term decision making under potential environment mismatches, yet its finite-sample complexity remains largely unexplored. Existing works offer algorithms with asymptotic guarantees, but the absence of finite-sample analysis hinders principled understanding and practical deployment, especially in data-limited settings. We close this gap by proposing Robust Halpern Iteration (RHI), the first algorithm with a provable finite-sample complexity guarantee. Under standard uncertainty sets -- including contamination sets and $\ell_p$-norm balls -- RHI attains an $\epsilon$-optimal policy with near-optimal sample complexity of $\tilde{\mathcal O}\left(\frac{SA\mathcal H^{2}}{\epsilon^{2}}\right)$, where $S$ and $A$ denote the numbers of states and actions, and $\mathcal H$ is the robust optimal bias span. This result gives the first polynomial sample complexity guarantee for robust average-reward RL. Moreover, RHI's independence from prior knowledge distinguishes it from many previous average-reward RL studies. Our work thus constitutes a significant advancement in enhancing the practical applicability of robust average-reward methods to complex, real-world problems.
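For concreteness, the two families of uncertainty sets named in the abstract are commonly written in the $(s,a)$-rectangular form below. The notation (nominal kernel $p_{s,a}$, radius $\delta$) is a standard convention assumed here for illustration, not taken verbatim from the paper.

```latex
% Contamination set: mixtures of the nominal kernel with an arbitrary kernel.
\mathcal{P}^{\mathrm{cont}}_{s,a}
  = \bigl\{ (1-\delta)\, p_{s,a} + \delta\, q \;:\; q \in \Delta(\mathcal{S}) \bigr\},
\qquad
% \ell_p-norm ball: kernels within distance \delta of the nominal kernel.
\mathcal{P}^{\ell_p}_{s,a}
  = \bigl\{ q \in \Delta(\mathcal{S}) \;:\; \| q - p_{s,a} \|_p \le \delta \bigr\}.
```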