Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models

📅 2025-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies infinite-horizon average-reward partially observable Markov decision processes (POMDPs) with unknown transition dynamics but a known observation model. Existing approaches rely either on stochastic policies or on strong estimation-consistency assumptions, which prevents tight regret bounds. To address this, the authors propose Action-wise OAS-UCRL, the first action-wise optimistic adaptive algorithm, which combines per-action transition-matrix estimation, optimization over deterministic belief-based policies, and an optimistic value-function construction, thereby removing the dependence on stochastic policies and on high-precision model estimation. The paper establishes a regret upper bound of $\mathcal{O}(\sqrt{T \log T})$, improving over the previous state of the art. Empirical evaluations show that Action-wise OAS-UCRL significantly outperforms mainstream baseline methods across diverse benchmarks.
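
To make the "deterministic belief-based policy" ingredient concrete, below is a minimal Python sketch of the standard Bayes-filter belief update such policies act on, assuming a tabular POMDP with a known observation matrix. The names (`update_belief`, `T_hat`, `O`, `greedy_policy`) are illustrative, not from the paper, and the paper estimates the transition matrices with its own action-wise procedure rather than taking them as given.

```python
import numpy as np

def update_belief(belief, action, obs, T_hat, O):
    """One Bayes-filter step on the belief over hidden states.

    belief : (S,)      current belief, belief[s] = P(s | history)
    T_hat  : (A, S, S) estimated transitions, T_hat[a, s, s'] = P(s' | s, a)
    O      : (S, Z)    known observation model, O[s, z] = P(z | s)
    """
    predicted = belief @ T_hat[action]   # predict: P(s' | history, action)
    posterior = predicted * O[:, obs]    # correct: weight by P(obs | s')
    return posterior / posterior.sum()   # renormalize to a distribution

def greedy_policy(belief, q_values, num_actions):
    """Deterministic belief-based policy: pick the action that
    maximizes an (optimistic) belief-space value estimate."""
    return max(range(num_actions), key=lambda a: q_values(belief, a))
```

Because the belief is a sufficient statistic of the history when the observation model is known, acting greedily on it yields a deterministic policy, which is the policy class the paper's analysis exploits.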

📝 Abstract
We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has previously been addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies having a minimum probability of choosing each action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions about the consistency of the employed estimators. Our work removes these limitations by proving convenient estimation guarantees for the transition model and introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We introduce modifications to existing estimation techniques that provide theoretical guarantees separately for each estimated action transition matrix. Unlike existing estimation methods, which are unable to use samples from different policies, we present a novel and simple estimator that overcomes this barrier. This new data-efficient technique, combined with the proposed Action-wise OAS-UCRL algorithm and a tighter theoretical analysis, leads to the first approach enjoying a regret guarantee of order $\mathcal{O}(\sqrt{T \log T})$ when compared against the optimal policy, thus improving over state-of-the-art techniques. Finally, our theoretical results are validated through numerical simulations showing the efficacy of our method against baseline methods.
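
For reference, the regret notion behind the $\mathcal{O}(\sqrt{T \log T})$ guarantee is the standard average-reward one. The symbols below follow the usual convention ($\rho^*$ for the optimal long-run average reward, $r_t$ for the reward collected at step $t$) and are not necessarily the paper's exact notation:

```latex
% Standard average-reward regret against the optimal policy:
% cumulative shortfall relative to the optimal long-run average reward.
\mathrm{Reg}(T) \;=\; T\rho^{*} \;-\; \sum_{t=1}^{T} r_t,
\qquad \text{with the claimed bound} \quad
\mathrm{Reg}(T) \;=\; \mathcal{O}\!\bigl(\sqrt{T \log T}\bigr).
```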
Problem

Research questions and friction points this paper is trying to address.

Average Reward
POMDP
Regret Minimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

POMDP
Optimistic Algorithm
Regret Bound