🤖 AI Summary
This work addresses planning with a generative model in entropy-regularized Markov decision processes (MDPs) and two-player games by proposing the SmoothCruiser algorithm. SmoothCruiser exploits the smoothness that entropy regularization induces in the Bellman operator to estimate the value function, achieving a problem-independent sample complexity of $\widetilde{O}(1/\varepsilon^4)$ for a desired accuracy $\varepsilon$. By contrast, no algorithm with guaranteed polynomial worst-case sample complexity is known for the non-regularized setting, so the guarantee applies to the regularized problem rather than to unregularized planning.
📝 Abstract
We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. SmoothCruiser makes use of the smoothness of the Bellman operator promoted by the regularization to achieve problem-independent sample complexity of order $\widetilde{O}(1/\varepsilon^4)$ for a desired accuracy $\varepsilon$, whereas for non-regularized settings there are no known algorithms with guaranteed polynomial sample complexity in the worst case.
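To illustrate what "smoothness of the Bellman operator" can mean, the sketch below compares a hard maximum over action values with a log-sum-exp backup, which is the form entropy regularization commonly takes. This is an assumption made for illustration: the abstract does not spell out the operator, and the function names and the temperature parameter `lam` are hypothetical, not taken from the paper. The sketch is not the SmoothCruiser algorithm itself.

```python
import math


def hard_max_backup(q_values):
    """Unregularized backup: the maximum over action values (non-smooth)."""
    return max(q_values)


def soft_max_backup(q_values, lam):
    """Assumed entropy-regularized backup: lam * log(sum(exp(q / lam))).

    `lam` is a hypothetical temperature. The maximum is subtracted before
    exponentiating to avoid overflow.
    """
    m = max(q_values)
    return m + lam * math.log(sum(math.exp((q - m) / lam) for q in q_values))


def soft_max_gradient(q_values, lam):
    """Gradient of soft_max_backup with respect to q_values.

    It is a softmax distribution, so it varies continuously with the inputs,
    unlike the gradient of the hard maximum, which jumps at ties.
    """
    m = max(q_values)
    weights = [math.exp((q - m) / lam) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    q = [0.3, -1.2, 2.5]
    for lam in (0.1, 1.0, 5.0):
        print(lam, hard_max_backup(q), soft_max_backup(q, lam), soft_max_gradient(q, lam))
```

Under this assumed form, the soft backup lies between the hard maximum and the hard maximum plus `lam * log(K)` for `K` actions, and its gradient is a probability vector. A smaller `lam` brings it closer to the hard maximum while making it less smooth.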