🤖 AI Summary
This work addresses the lack of finite-sample complexity analysis for long-horizon decision-making in distributionally robust reinforcement learning (DR-RL) under the average-reward criterion. We propose the Robust Halpern Iteration (RHI) algorithm, the first provably sample-efficient DR-RL method that requires no prior knowledge of the environment. RHI integrates distributionally robust optimization with average-reward MDP modeling, employing both a contamination model and an $\ell_p$-norm uncertainty set to quantify distributional ambiguity. For an MDP with $S$ states, $A$ actions, and bias span $\mathcal{H}$, RHI computes an $\varepsilon$-optimal policy with sample complexity $\widetilde{\mathcal{O}}(SA\mathcal{H}^2/\varepsilon^2)$, a polynomial upper bound. This establishes the first finite-sample guarantee for DR-RL in the average-reward setting, bridging a critical theoretical gap while preserving practical deployability.
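To make the core iterative scheme concrete, below is a minimal sketch of classical Halpern iteration, the anchored fixed-point scheme that RHI builds on. The operator `T`, the anchor weights, and the toy example are illustrative assumptions for exposition; they are not the paper's robust Bellman operator or its actual step-size schedule.

```python
def halpern(T, x0, num_iters):
    """Anchored fixed-point scheme x_{k+1} = beta_k * x0 + (1 - beta_k) * T(x_k),
    using the standard anchor weights beta_k = 1 / (k + 2).

    The anchoring toward x0 is what yields last-iterate convergence rates
    for nonexpansive operators, in contrast to plain Picard iteration.
    """
    x = x0
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        x = beta * x0 + (1.0 - beta) * T(x)
    return x

# Toy operator (a contraction, hence nonexpansive) with fixed point x* = 2.
T = lambda x: 0.5 * x + 1.0
print(halpern(T, 0.0, 10_000))  # approaches the fixed point 2
```

In RHI, the scalar `x` would be replaced by a value-function estimate and `T` by an empirical robust Bellman-type operator evaluated over the uncertainty set.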
📝 Abstract
Robust reinforcement learning (RL) under the average-reward criterion is crucial for long-term decision making under potential environment mismatches, yet its finite-sample complexity remains largely unexplored. Existing works offer algorithms with asymptotic guarantees, but the absence of finite-sample analysis hinders principled understanding and practical deployment, especially in data-limited settings. We close this gap by proposing Robust Halpern Iteration (RHI), the first algorithm with a provable finite-sample complexity guarantee. Under standard uncertainty sets -- including contamination sets and $\ell_p$-norm balls -- RHI attains an $\epsilon$-optimal policy with near-optimal sample complexity of $\tilde{\mathcal O}\left(\frac{SA\mathcal H^{2}}{\epsilon^{2}}\right)$, where $S$ and $A$ denote the numbers of states and actions, and $\mathcal H$ is the robust optimal bias span. This result gives the first polynomial sample complexity guarantee for robust average-reward RL. Moreover, RHI's independence from prior knowledge distinguishes it from many previous average-reward RL studies. Our work thus constitutes a significant advancement in enhancing the practical applicability of robust average-reward methods to complex, real-world problems.
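For concreteness, the two families of uncertainty sets named in the abstract are commonly written in the $(s,a)$-rectangular form below. The notation (nominal kernel $p_{s,a}$, radius $\delta$) is a standard convention assumed here for illustration, not taken verbatim from the paper.

```latex
% Contamination set: mixtures of the nominal kernel with an arbitrary kernel.
\mathcal{P}^{\mathrm{cont}}_{s,a}
  = \bigl\{ (1-\delta)\, p_{s,a} + \delta\, q \;:\; q \in \Delta(\mathcal{S}) \bigr\},
\qquad
% \ell_p-norm ball: kernels within distance \delta of the nominal kernel.
\mathcal{P}^{\ell_p}_{s,a}
  = \bigl\{ q \in \Delta(\mathcal{S}) \;:\; \| q - p_{s,a} \|_p \le \delta \bigr\}.
```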