Analysis of On-policy Policy Gradient Methods under the Distribution Mismatch

📅 2025-03-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the theoretical and practical bias in policy gradient methods for discounted reinforcement learning that arises from state-action distribution mismatch, i.e., discrepancies between the stationary distributions of the behavior and target policies. Methodologically, the authors first prove, in the tabular setting, that policy gradient methods converge to the global optimum despite the distribution mismatch. They then extend the analysis to general function approximation via a biased stochastic gradient descent (biased SGD) framework, characterizing how gradient bias affects the convergence rate and optimality. Leveraging the policy gradient theorem, Markov chain stationary distribution modeling, and generalization error analysis, they quantitatively expose the gap between classical theoretical assumptions (e.g., exact gradient estimation) and practical implementations. The contributions provide new theoretical guarantees on the robustness of policy gradient algorithms and inform the design of more reliable approximate policy optimization methods.
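The mismatch described in the summary can be made concrete on a toy tabular MDP: the policy gradient theorem weights states by the discounted visitation distribution d_γ, while practical on-policy implementations effectively draw samples from the undiscounted stationary distribution of the Markov chain induced by the policy. The sketch below (all MDP numbers are made up for illustration; this is not the paper's construction) computes the softmax-tabular gradient under both weightings:

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP, chosen only so the chain is ergodic.
n_s, n_a, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] transition probs
              [[0.7, 0.3], [0.05, 0.95]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])      # r[s, a]
rho = np.array([1.0, 0.0])                  # start-state distribution

def softmax_policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def policy_gradient(theta, state_dist_kind):
    pi = softmax_policy(theta)                        # pi[s, a]
    P_pi = np.einsum('sa,sat->st', pi, P)             # state transitions under pi
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V                             # Q[s, a]
    if state_dist_kind == 'discounted':               # d_gamma from the PG theorem
        d = (1 - gamma) * rho @ np.linalg.inv(np.eye(n_s) - gamma * P_pi)
    else:                                             # stationary dist of the chain
        d = np.ones(n_s) / n_s
        for _ in range(500):                          # power iteration
            d = d @ P_pi
    # softmax-tabular gradient: g[s, a] = d(s) * pi(a|s) * (Q(s,a) - V(s))
    return d, d[:, None] * pi * (Q - V[:, None])

theta = np.zeros((n_s, n_a))
d_disc, g_disc = policy_gradient(theta, 'discounted')
d_stat, g_stat = policy_gradient(theta, 'stationary')
print(np.round(d_disc, 3), np.round(d_stat, 3))
```

The two state distributions (and hence the two gradients) differ even in this tiny example, which is exactly the deviation the paper analyzes.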

πŸ“ Abstract
Policy gradient methods are among the most successful methods for solving challenging reinforcement learning problems. However, despite their empirical successes, many state-of-the-art policy gradient algorithms for discounted problems deviate from the theoretical policy gradient theorem due to the existence of a distribution mismatch. In this work, we analyze the impact of this mismatch on policy gradient methods. Specifically, we first show that in the case of tabular parameterizations, the methods remain globally optimal under the mismatch. We then extend this analysis to more general parameterizations by leveraging the theory of biased stochastic gradient descent. Our findings offer new insights into the robustness of policy gradient methods as well as the gap between theoretical foundations and practical implementations.
Problem

Research questions and friction points this paper is trying to address.

Analyzes impact of distribution mismatch on policy gradient methods
Examines global optimality in tabular parameterizations under mismatch
Extends analysis to general cases using biased SGD theory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes policy gradient under distribution mismatch
Proves global optimality in tabular cases
Extends analysis using biased SGD theory
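The biased-SGD viewpoint behind the general-parameterization analysis can be illustrated with a deliberately simple toy, which is not the paper's analysis: gradient descent on a quadratic whose gradient oracle carries a constant bias converges not to the optimum but to a point whose distance from it scales with the bias.

```python
# Toy illustration (hypothetical, not from the paper): descent on f(x) = 0.5 * x^2
# with a constant gradient bias b. The true gradient is x; the oracle returns x + b,
# so the iterates converge to the fixed point x = -b instead of the optimum x = 0.
def biased_gd(x0, bias, lr=0.1, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * (x + bias)
    return x

x_unbiased = biased_gd(5.0, bias=0.0)  # converges to the optimum, 0
x_biased = biased_gd(5.0, bias=0.3)    # converges near -0.3, off by |bias|
```

In the paper's setting, the gradient bias plays the role of `bias` here: it does not prevent convergence, but it determines how far from optimality the method can end up.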
Weizhen Wang
Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
Jianping He
Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
Xiaoming Duan
Shanghai Jiao Tong University