🤖 AI Summary
This paper investigates how imprecise transition matrix predictions can improve sample efficiency in policy learning for discounted infinite-horizon Markov decision processes (DMDPs) equipped with generative models. First, it introduces the “prediction-augmented MDP” theoretical framework. Second, it proves that—without prior knowledge of prediction accuracy—the classical sample complexity lower bound is unimprovable. Third, it proposes a prediction-enhanced policy iteration algorithm based on minimax optimization, achieving an error-sensitive sample complexity of $\tilde{O}((1-\gamma)^{-3} N \varepsilon^{-2})$, which improves upon the best-known convex-optimization-based bound of $\tilde{O}((1-\gamma)^{-4} N \varepsilon^{-2})$. Finally, both theoretical analysis and empirical evaluation confirm that the algorithm significantly accelerates convergence when predictions are available and exhibits robustness to prediction errors.
📝 Abstract
We study infinite-horizon Discounted Markov Decision Processes (DMDPs) under a generative model. Motivated by the Algorithms with Advice framework (Mitzenmacher and Vassilvitskii, 2022), we propose a novel framework to investigate how a prediction of the transition matrix can enhance the sample efficiency in solving DMDPs and improve sample complexity bounds. We focus on DMDPs with $N$ state-action pairs and discount factor $\gamma$. First, we provide an impossibility result: without prior knowledge of the prediction accuracy, no sampling policy can compute an $\epsilon$-optimal policy with a sample complexity bound better than $\tilde{O}((1-\gamma)^{-3} N\epsilon^{-2})$, which matches the state-of-the-art minimax sample complexity bound with no prediction. Complementing this, we propose an algorithm based on minimax optimization techniques that leverages the prediction of the transition matrix. Our algorithm achieves a sample complexity bound depending on the prediction error, and the bound is uniformly better than $\tilde{O}((1-\gamma)^{-4} N \epsilon^{-2})$, the previous best result derived from convex optimization methods. These theoretical findings are further supported by our numerical experiments.
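To give a feel for the gap between the two bounds, the sketch below compares them numerically, ignoring constants and logarithmic factors. The function `bound` and the sample values of $\gamma$, $N$, and $\epsilon$ are illustrative assumptions, not quantities from the paper; the takeaway is simply that the improvement is a full factor of $(1-\gamma)^{-1}$, which is large in the common regime where $\gamma$ is close to 1.

```python
def bound(exponent: int, gamma: float, N: int, eps: float) -> float:
    """Sample complexity of the form (1 - gamma)^(-exponent) * N / eps^2,
    up to constants and log factors (illustrative only)."""
    return (1.0 - gamma) ** (-exponent) * N / eps**2

# Hypothetical problem parameters: effective horizon 1/(1 - gamma) = 100.
gamma, N, eps = 0.99, 1000, 0.1

minimax_bound = bound(3, gamma, N, eps)  # bound of the proposed algorithm
convex_bound = bound(4, gamma, N, eps)   # previous convex-optimization bound

# The ratio is exactly (1 - gamma)^(-1), i.e., one extra horizon factor.
print(convex_bound / minimax_bound)  # → 100.0 (approximately)
```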