Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

📅 2024-07-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies online learning in adversarial Markov decision processes (MDPs) under full information against an oblivious adversary, where the reward function is only revealed at the end of each episode. The goal is to achieve low regret. The authors propose a novel algorithm that avoids occupancy-measure-based formulations and instead combines dynamic programming, black-box online linear optimization, advantage-function estimation, and a martingale analysis built on a refined transition kernel estimator. The approach achieves a regret bound of $\tilde{O}(\mathrm{poly}(H)\sqrt{SAT})$, improving upon the prior best by a factor of $\sqrt{S}$ and matching the minimax lower bound in its dependence on $S$, $A$, and $T$. This significantly narrows the theoretical gap between adversarial and stochastic MDPs. The algorithm is conceptually simple, eschews complex constrained optimization and occupancy-space modeling, and is practically implementable.

📝 Abstract
We consider the problem of learning in adversarial Markov decision processes (MDPs) with an oblivious adversary in a full-information setting. The agent interacts with an environment during $T$ episodes, each of which consists of $H$ stages, and each episode is evaluated with respect to a reward function that is revealed only at the end of the episode. We propose an algorithm, called APO-MVP, that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$, where $S$ and $A$ are the sizes of the state and action spaces, respectively. This result improves upon the best-known regret bound by a factor of $\sqrt{S}$, bridging the gap between adversarial and stochastic MDPs, and matching the minimax lower bound $\Omega(\sqrt{H^3 SAT})$ as far as the dependencies on $S$, $A$, and $T$ are concerned. The proposed algorithm and analysis completely avoid the typical tool given by occupancy measures; instead, policy optimization is performed using only dynamic programming and a black-box online linear optimization strategy run over estimated advantage functions, making the algorithm easy to implement. The analysis leverages two recent techniques: policy optimization based on online linear optimization strategies (Jonckheere et al., 2023) and a refined martingale analysis of the impact of estimated transition kernels on value functions (Zhang et al., 2023).
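To make the abstract's description concrete, here is a minimal sketch of the generic pattern it describes: a backward dynamic-programming pass under estimated transitions to obtain advantage functions, followed by an exponential-weights (Hedge) step per stage and state. This is an illustration of the general policy-optimization-via-online-linear-optimization idea, not the APO-MVP algorithm itself; the function names, shapes, and the choice of Hedge as the online linear optimization strategy are assumptions for the sketch.

```python
import numpy as np

def dp_advantages(policy, P_hat, r):
    """Backward dynamic programming under estimated transitions.

    policy: (H, S, A) stochastic policy, rows summing to 1
    P_hat:  (S, A, S) estimated transition kernel (stage-independent here)
    r:      (H, S, A) rewards, revealed only after the episode
    Returns estimated advantages A_h(s, a) = Q_h(s, a) - V_h(s).
    """
    H, S, A = r.shape
    V = np.zeros(S)                      # V_{H} = 0
    adv = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Q = r[h] + P_hat @ V             # Bellman backup, shape (S, A)
        V = (policy[h] * Q).sum(axis=-1) # value of the current policy
        adv[h] = Q - V[:, None]
    return adv

def hedge_update(policy, adv, eta):
    """One exponential-weights step on the advantages, per (stage, state)."""
    new = policy * np.exp(eta * adv)
    return new / new.sum(axis=-1, keepdims=True)
```

By construction, the policy-weighted advantage at each state is zero, so the Hedge step shifts probability toward actions whose estimated Q-value exceeds the current policy's value, without any occupancy-measure computation.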
Problem

Research questions and friction points this paper is trying to address.

Learning in adversarial MDPs with an oblivious adversary
Achieving improved regret bound via policy optimization
Bridging gap between adversarial and stochastic MDPs
Innovation

Methods, ideas, or system contributions that make the work stand out.

APO-MVP algorithm improves regret bound
Dynamic programming and online linear optimization
Avoids occupancy measures, uses advantage functions