🤖 AI Summary
This work addresses the theoretical and practical bias in policy gradient methods for discounted reinforcement learning that arises from state-action distribution mismatch, i.e., the discrepancy between the distribution prescribed by the policy gradient theorem and the distribution that practical algorithms actually sample from. Methodologically, we first prove, for the tabular setting, that policy gradient methods converge to the global optimum despite the distribution mismatch. We then extend the analysis to general function approximation via a biased stochastic gradient descent (biased SGD) framework, characterizing how gradient bias affects the convergence rate and the quality of the resulting solution. Leveraging the policy gradient theorem, Markov chain stationary distribution modeling, and generalization error analysis, we quantitatively expose the gap between classical theoretical assumptions (e.g., exact gradient estimation) and practical implementations. Our contributions provide novel theoretical guarantees on the robustness of policy gradient algorithms and inform the design of more reliable approximate policy optimization methods.
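The mismatch at issue can be stated concretely. In standard notation (assumed here for illustration, not taken verbatim from the paper), the policy gradient theorem for the discounted objective weights states by the discounted visitation distribution, whereas practical implementations typically draw states from the undiscounted stationary distribution of the induced Markov chain:

```latex
% Exact discounted policy gradient (policy gradient theorem):
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta}_{\gamma},\; a \sim \pi_\theta(\cdot \mid s)}
      \!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right],
\qquad
d^{\pi_\theta}_{\gamma}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s \mid \pi_\theta).

% Practical algorithms typically estimate the same expression with states drawn
% from the undiscounted stationary distribution d^{\pi_\theta} instead of
% d^{\pi_\theta}_{\gamma}; this substitution is the gradient bias analyzed here.
```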
📝 Abstract
Policy gradient methods are among the most successful approaches for solving challenging reinforcement learning problems. However, despite their empirical success, many state-of-the-art policy gradient algorithms for discounted problems deviate from the theoretical policy gradient theorem due to a distribution mismatch. In this work, we analyze the impact of this mismatch on policy gradient methods. Specifically, we first show that, for tabular parameterizations, the methods remain globally optimal under the mismatch. We then extend this analysis to more general parameterizations by leveraging the theory of biased stochastic gradient descent. Our findings offer new insights into the robustness of policy gradient methods, as well as into the gap between their theoretical foundations and practical implementations.
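To see that the two weightings genuinely differ, here is a minimal sketch (a toy two-state Markov chain of my own construction, not an example from the paper) that computes the discounted state-visitation distribution in closed form and compares it with the undiscounted stationary distribution:

```python
import numpy as np

# Toy 2-state Markov chain under some fixed policy (illustrative assumption).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])     # P[s, s'] = transition probability
mu0 = np.array([1.0, 0.0])     # initial state distribution (start in state 0)
gamma = 0.9                    # discount factor

# Discounted visitation: d_gamma = (1 - gamma) * mu0 @ sum_t (gamma P)^t
#                                = (1 - gamma) * mu0 @ inv(I - gamma P)
d_gamma = (1.0 - gamma) * mu0 @ np.linalg.inv(np.eye(2) - gamma * P)

# Undiscounted stationary distribution: left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
d_stat = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
d_stat = d_stat / d_stat.sum()  # normalize (also fixes an arbitrary sign)

print(d_gamma)  # ≈ [0.757, 0.243] -- tilted toward the start state
print(d_stat)   # ≈ [0.667, 0.333] -- independent of mu0
```

The two distributions disagree, so a gradient estimate that samples states from `d_stat` instead of `d_gamma` is biased relative to the policy gradient theorem, which is exactly the discrepancy whose consequences this work analyzes.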