🤖 AI Summary
This paper studies ε-optimal policy learning for weakly communicating average-reward MDPs under a generative model. In the model-free setting, it proposes a prior-free algorithm that combines Halpern's anchored iteration with a recursive variance-reduction sampling scheme and, via a stopping rule, terminates in finite time with probability 1; the analysis rests on span-seminorm control of the bias vector. The algorithm achieves sample and time complexity Õ(|S||A|∥h*∥ₛₚ²/ε²), both in high probability and in expectation, matching the information-theoretic lower bound up to a factor of ∥h*∥ₛₚ. This is the best known complexity among model-free algorithms. The framework also extends to discounted MDPs, demonstrating its broader applicability.
📝 Abstract
We present a new model-free algorithm to compute $\varepsilon$-optimal policies for average-reward Markov decision processes in the weakly communicating case. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an $\varepsilon$-optimal policy with sample and time complexity $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\|h^*\|_{\mathrm{sp}}^{2}/\varepsilon^{2})$, both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor of $\|h^*\|_{\mathrm{sp}}$. Although the complexity bound involves the span seminorm $\|h^*\|_{\mathrm{sp}}$ of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule that guarantees, with probability 1, termination in finite time. We also analyze how these techniques can be adapted to discounted MDPs.
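To illustrate the anchoring idea at the heart of the method, here is a minimal sketch of Halpern's iteration $x_{k+1} = \beta_k x_0 + (1-\beta_k)\,T(x_k)$ with the standard weights $\beta_k = 1/(k+2)$, applied to the exact Bellman operator of a tiny discounted toy MDP. Everything here (the 2-state MDP, the discount factor, the function names) is an illustrative assumption; the paper's actual algorithm works in the average-reward setting and replaces exact applications of $T$ with recursively variance-reduced sample-based estimates from the generative model.

```python
import numpy as np

def bellman_T(v, P, r, gamma=0.9):
    # Exact discounted Bellman operator (illustrative stand-in for the
    # paper's sampled operator): Q(s,a) = r(s,a) + gamma * E[v(s')],
    # T(v)(s) = max_a Q(s,a).
    q = r + gamma * (P @ v)        # P has shape (S, A, S), q has shape (S, A)
    return q.max(axis=1)

def halpern(T, x0, num_iters=5000):
    # Halpern's anchored iteration: each step pulls back toward the
    # anchor x0 with weight beta_k = 1/(k+2), which yields an O(1/k)
    # fixed-point residual for nonexpansive T.
    x = x0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        x = beta * x0 + (1 - beta) * T(x)
    return x

# Toy 2-state, 2-action MDP with random transitions and rewards.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # P[s, a] is a distribution over s'
r = rng.random((2, 2))
v = halpern(lambda x: bellman_T(x, P, r), np.zeros(2))
# v is an approximate fixed point: T(v) is close to v.
```

The anchor term is what distinguishes this from plain value iteration: it trades the geometric rate of a strict contraction for residual guarantees that survive in the nonexpansive (undiscounted, span-seminorm) regime the paper works in.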