🤖 AI Summary
Classical analyses of the standard MDP solution methods, value iteration (VI) and policy iteration (PI), guarantee convergence only at the generic contraction rate γ and say little about sharper rates under realistic structural assumptions.
Method: Leveraging the geometric structure of MDPs, we reformulate the problem using geometric tools, including a transformation of the discount factor, and uncover an implicit rotational dynamical system underlying VI updates. Under the assumption that the Markov reward process (MRP) induced by an optimal policy is irreducible and aperiodic, we conduct a spectral analysis grounded in linear operator theory and the spectral theory of MRPs.
Contribution/Results: We establish the first asymptotic convergence rate for VI that is strictly smaller than the classical contraction bound γ, yielding a tighter and more interpretable convergence guarantee. Our geometric framework also strengthens convergence guarantees for both VI and PI across diverse MDP classes, offering a new geometric perspective and analytical toolkit for reinforcement learning algorithm design.
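To make the spectral mechanism concrete, here is a minimal sketch (our own illustration, not code from the paper): for an irreducible, aperiodic transition matrix, the subdominant eigenvalue modulus |λ₂| is strictly below 1, so a rate of the form γ·|λ₂| sits strictly below the classical bound γ. The matrix P and discount γ below are hypothetical choices.

```python
import numpy as np

# Hypothetical irreducible, aperiodic 3-state transition matrix
# (all entries positive, rows sum to 1); values are illustrative only.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
gamma = 0.95  # illustrative discount factor

# Eigenvalue moduli, sorted descending: the Perron eigenvalue 1 comes first,
# then the subdominant modulus |lambda_2| < 1 by irreducibility + aperiodicity.
mods = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
lam2 = mods[1]

print(f"classical VI bound       : {gamma:.4f}")
print(f"spectral asymptotic rate : {gamma * lam2:.4f}")  # strictly below gamma
```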
📝 Abstract
We build on a recently introduced geometric interpretation of Markov Decision Processes (MDPs) to analyze classical MDP-solving algorithms: Value Iteration (VI) and Policy Iteration (PI). First, we develop a geometry-based analytical apparatus, including a transformation that modifies the discount factor $\gamma$, to improve convergence guarantees for these algorithms in several settings. In particular, one of our results identifies a rotation component in the VI method, and as a consequence shows that when a Markov Reward Process (MRP) induced by the optimal policy is irreducible and aperiodic, the asymptotic convergence rate of value iteration is strictly smaller than $\gamma$.
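As a numerical illustration of the abstract's claim (again our own sketch, not the paper's construction), the code below runs VI on a small random MDP and tracks the per-iteration error contraction in the span seminorm, which quotients out the constant direction that contracts at exactly γ. Under the irreducibility and aperiodicity assumption, the observed ratio settles near γ·|λ₂| of the optimal policy's MRP, strictly below γ; complex subdominant eigenvalues would make the ratio oscillate around that value, which is loosely suggestive of the rotation component the authors identify. All names (`q_values`, `P_star`, etc.) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

# Hypothetical random MDP (illustrative only):
# P[a, s, s'] is a transition probability, R[s, a] an immediate reward.
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def q_values(v):
    # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * v[s']
    return R + gamma * np.einsum("ast,t->sa", P, v)

# Run VI long enough to treat the final iterate as numerically optimal.
v_star = np.zeros(nS)
for _ in range(2000):
    v_star = q_values(v_star).max(axis=1)

# Transition matrix of the MRP induced by the (numerically) optimal policy.
pi_star = q_values(v_star).argmax(axis=1)
P_star = P[pi_star, np.arange(nS), :]
lam2 = np.sort(np.abs(np.linalg.eigvals(P_star)))[-2]

# Empirical contraction of VI in the span seminorm sp(x) = max(x) - min(x),
# which removes the constant error component that contracts at exactly gamma.
sp = lambda x: np.max(x) - np.min(x)
v, ratios = np.zeros(nS), []
for _ in range(40):
    before = sp(v - v_star)
    v = q_values(v).max(axis=1)
    if before > 1e-11:  # stop recording once the error nears floating-point noise
        ratios.append(sp(v - v_star) / before)

print(f"classical rate gamma : {gamma:.4f}")
print(f"gamma * |lambda_2|   : {gamma * lam2:.4f}")  # < gamma by irreducibility + aperiodicity
print(f"late empirical ratio : {ratios[-1]:.4f}")    # settles near gamma * |lambda_2|
```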