🤖 AI Summary
This paper addresses the problem of learning an $\varepsilon$-optimal policy in average-reward Markov decision processes (MDPs) without prior knowledge of the optimal bias function's span $H$. The authors propose the first span-based algorithm achieving minimax-optimal sample complexity without knowledge of $H$. The method introduces three key innovations: (1) a horizon-calibration mechanism integrating the discounted-MDP reduction with empirical confidence intervals; (2) a variance-inspired empirical span penalty that enables adaptive acceleration in benign environments; and (3) an oracle inequality guaranteeing robustness under both worst-case and favorable scenarios. The algorithm achieves a sample complexity of $\widetilde{O}(SAH/\varepsilon^2)$, matching the information-theoretic lower bound both for fixed sample budgets and for fixed $\varepsilon$, thereby establishing the first tight optimality result that does not require prior knowledge of $H$.
📝 Abstract
We study the sample complexity of finding an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs) with a generative model. The minimax optimal span-based complexity of $\widetilde{O}(SAH/\varepsilon^2)$, where $H$ is the span of the optimal bias function, has only been achievable with prior knowledge of the value of $H$. Prior-knowledge-free algorithms have been the objective of intensive research, but several natural approaches provably fail to achieve this goal. We resolve this problem, developing the first algorithms matching the optimal span-based complexity without $H$ knowledge, both when the dataset size is fixed and when the suboptimality level $\varepsilon$ is fixed. Our main technique combines the discounted reduction approach with a method for automatically tuning the effective horizon based on empirical confidence intervals or lower bounds on performance, which we term horizon calibration. We also develop an empirical span penalization approach, inspired by sample variance penalization, which satisfies an oracle inequality performance guarantee. In particular, this algorithm can outperform the minimax complexity in benign settings such as when there exist near-optimal policies with span much smaller than $H$.
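The abstract only names horizon calibration; the following is a toy, hypothetical Python sketch of the underlying idea under simplifying assumptions. It doubles a guess $H_k$ for the span, solves the discounted MDP with effective horizon $\gamma_k = 1 - \varepsilon/H_k$ (the standard discounted reduction), and stops once the measured average reward stops improving. All function names are illustrative, the model is known rather than sampled from a generative model, and the stopping rule is a crude stand-in for the paper's empirical confidence intervals.

```python
import numpy as np

def discounted_value_iteration(P, r, gamma, tol=1e-8, max_iter=100_000):
    """Greedy policy for the discounted MDP via value iteration.
    P has shape (S, A, S); r has shape (S, A)."""
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = r + gamma * (P @ V)          # (S, A) state-action values
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    return Q.argmax(axis=1)

def average_reward(P, r, policy, iters=1_000):
    """Long-run average reward of a stationary policy
    (power iteration toward the chain's stationary distribution)."""
    S = r.shape[0]
    Pp = P[np.arange(S), policy]         # (S, S) chain under the policy
    rp = r[np.arange(S), policy]
    mu = np.full(S, 1.0 / S)
    for _ in range(iters):
        mu = mu @ Pp
    return float(mu @ rp)

def horizon_calibrated_policy(P, r, eps, H_max=2**12):
    """Illustrative horizon calibration: double the span guess H_k, solve the
    discounted MDP with gamma_k = 1 - eps / H_k, and stop when the certified
    gain no longer improves by more than eps. NOT the paper's algorithm,
    which works from samples and uses empirical confidence intervals."""
    best_pol, best_gain = None, -np.inf
    H = 1.0
    while H <= H_max:
        gamma = 1.0 - eps / H
        pol = discounted_value_iteration(P, r, gamma)
        gain = average_reward(P, r, pol)
        if gain <= best_gain + eps:      # no meaningful improvement: calibrated
            return best_pol if best_pol is not None else pol
        best_pol, best_gain = pol, gain
        H *= 2.0
    return best_pol

# Toy 2-state, 2-action MDP: action 1 pays reward 1 and is optimal everywhere.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0                         # action 0: jump to state 0, reward 0
P[:, 1, 1] = 1.0                         # action 1: jump to state 1, reward 1
r = np.array([[0.0, 1.0],
              [0.0, 1.0]])
policy = horizon_calibrated_policy(P, r, eps=0.1)   # → array([1, 1])
```

The doubling loop is the essential point: rather than being told $H$, the procedure pays only a logarithmic number of discounted solves to find an effective horizon large enough that further growth no longer helps.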