🤖 AI Summary
We address online policy optimization for infinite-horizon average-reward Markov decision processes (MDPs). We propose two novel policy gradient algorithms: one employs implicit gradient transport (IGT) for variance reduction, and the other incorporates Hessian-assisted updates to accelerate convergence. Theoretically, we establish, for the first time without assuming $L$-smoothness of the policy parameterization, that the average-reward objective satisfies an approximate $L$-smoothness property—circumventing a key limitation of prior analyses. This enables derivation of two expected regret bounds: $\tilde{\mathcal{O}}(T^{2/3})$ and $\tilde{\mathcal{O}}(\sqrt{T})$, with the latter matching the information-theoretic lower bound and strictly improving upon the previous best-known rate of $\tilde{\mathcal{O}}(T^{3/4})$. Empirical evaluations confirm the efficacy of our methods. Our work provides a more rigorous theoretical foundation for policy gradient methods in average-reward MDPs.
📝 Abstract
We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average-reward Markov Decision Processes (MDPs). The first employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of order $\tilde{\mathcal{O}}(T^{2/3})$. The second, rooted in Hessian-based techniques, ensures an expected regret of order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve upon the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret and achieve the theoretical lower bound. We also show that the average-reward objective is approximately $L$-smooth, a property that earlier works had to assume.