🤖 AI Summary
This paper studies ε-optimal policy learning for weakly communicating average-reward MDPs under a generative model. In the model-free setting, it proposes a prior-free algorithm that combines Halpern's anchored iteration with a recursive variance-reduction sampling scheme and, via a stopping rule, terminates in finite time with probability 1; the analysis rests on span-seminorm control of the bias vector. The algorithm achieves sample and time complexity Õ(|S||A|∥h*∥ₛₚ²/ε²), both in high probability and in expectation, matching the information-theoretic lower bound up to a factor of ∥h*∥ₛₚ. This is the best known complexity among model-free algorithms. The framework also extends to discounted MDPs, demonstrating its broader applicability.
📝 Abstract
We present a new model-free algorithm to compute $\varepsilon$-optimal policies for average-reward Markov decision processes in the weakly communicating case. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an $\varepsilon$-optimal policy with sample and time complexity $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\|h^*\|_{\mathrm{sp}}^{2}/\varepsilon^{2})$, both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor of $\|h^*\|_{\mathrm{sp}}$. Although the complexity bound involves the span seminorm $\|h^*\|_{\mathrm{sp}}$ of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule that guarantees, with probability 1, termination in finite time. We also analyze how these techniques can be adapted to discounted MDPs.
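To illustrate the anchoring idea at the heart of the method, here is a minimal sketch of Halpern's iteration $x_{k+1} = \beta_k x_0 + (1-\beta_k)\,T(x_k)$ with the standard weights $\beta_k = 1/(k+2)$, applied to the exact Bellman operator of a tiny discounted toy MDP. Everything here (the 2-state MDP, the discount factor, the function names) is an illustrative assumption; the paper's actual algorithm works in the average-reward setting and replaces exact applications of $T$ with recursively variance-reduced sample-based estimates from the generative model.

```python
import numpy as np

def bellman_T(v, P, r, gamma=0.9):
    # Exact discounted Bellman operator (illustrative stand-in for the
    # paper's sampled operator): Q(s,a) = r(s,a) + gamma * E[v(s')],
    # T(v)(s) = max_a Q(s,a).
    q = r + gamma * (P @ v)        # P has shape (S, A, S), q has shape (S, A)
    return q.max(axis=1)

def halpern(T, x0, num_iters=5000):
    # Halpern's anchored iteration: each step pulls back toward the
    # anchor x0 with weight beta_k = 1/(k+2), which yields an O(1/k)
    # fixed-point residual for nonexpansive T.
    x = x0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        x = beta * x0 + (1 - beta) * T(x)
    return x

# Toy 2-state, 2-action MDP with random transitions and rewards.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # P[s, a] is a distribution over s'
r = rng.random((2, 2))
v = halpern(lambda x: bellman_T(x, P, r), np.zeros(2))
# v is an approximate fixed point: T(v) is close to v.
```

The anchor term is what distinguishes this from plain value iteration: it trades the geometric rate of a strict contraction for residual guarantees that survive in the nonexpansive (undiscounted, span-seminorm) regime the paper works in.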