Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the instability and slow convergence commonly observed in policy-gradient methods during late-stage training, attributing them to a sharp increase in the noise-to-signal ratio (NSR) of the gradient estimator. For the first time, the authors derive closed-form or exact numerical expressions for the NSR of the REINFORCE estimator in linear and polynomial systems, and establish a general variance upper bound applicable to arbitrary nonlinear systems. Analyzing Gaussian policies via moment calculations and tracking NSR along optimization trajectories (e.g., under SGD or Adam), they show that the NSR escalates significantly, and may even diverge, as the policy approaches optimality, often leading to training collapse. The study provides a rigorous theoretical foundation for understanding policy-gradient instability and points to clear directions for future algorithmic improvements.
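Read literally from the summary, the NSR normalizes the estimator's noise by the strength of the true gradient. For a vector-valued estimator \(\hat g(\theta)\) of \(\nabla_\theta J(\theta)\), one natural formalization (the trace/norm convention below is an assumption, not taken from the paper) is:

\[
\mathrm{NSR}(\theta) \;=\; \frac{\operatorname{tr}\operatorname{Cov}\!\big[\hat g(\theta)\big]}{\big\|\nabla_\theta J(\theta)\big\|^{2}}.
\]

When \(\mathrm{NSR}(\theta) \gg 1\), sampling noise dominates the signal, which is the regime the paper associates with late-training instability: near an optimum the denominator \(\|\nabla_\theta J\|^2\) shrinks toward zero while the variance need not.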

📝 Abstract
Policy-gradient methods are widely used in reinforcement learning, yet training often becomes unstable or slows down as learning progresses. We study this phenomenon through the noise-to-signal ratio (NSR) of a policy-gradient estimator, defined as the estimator variance (noise) normalized by the squared norm of the true gradient (signal). Our main result is that, for (i) finite-horizon linear systems with Gaussian policies and linear state-feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback, the NSR of the REINFORCE estimator can be characterized exactly (either in closed form or via numerical moment-evaluation algorithms), without approximation. For general nonlinear dynamics and expressive policies (including neural policies), we further derive a general upper bound on the variance. These characterizations enable a direct examination of how NSR varies across policy parameters and how it evolves along optimization trajectories (e.g., SGD and Adam). Across a range of examples, we find that the NSR landscape is highly non-uniform and typically increases as the policy approaches an optimum; in some regimes it blows up, which can trigger training instability and policy collapse.
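The blow-up near an optimum is easy to reproduce on a toy problem. The sketch below is not from the paper: it uses a hypothetical one-step Gaussian "bandit" with quadratic reward, where the true gradient is known in closed form, and estimates the NSR of the REINFORCE (score-function) estimator by Monte Carlo. As the policy mean theta approaches the optimum a_star, the true gradient vanishes while the estimator variance stays bounded away from zero, so the NSR diverges.

```python
import numpy as np

rng = np.random.default_rng(0)
a_star, sigma, n = 0.0, 0.5, 200_000  # optimum, policy std, sample count

def nsr(theta):
    """Monte Carlo NSR of REINFORCE for a ~ N(theta, sigma^2), r(a) = -(a - a_star)^2."""
    a = rng.normal(theta, sigma, n)        # actions sampled from the Gaussian policy
    r = -(a - a_star) ** 2                 # quadratic reward
    g = r * (a - theta) / sigma ** 2       # REINFORCE estimator: r * d/dtheta log pi(a)
    true_grad = -2.0 * (theta - a_star)    # exact gradient of E[r] = -((theta-a_star)^2 + sigma^2)
    return g.var() / true_grad ** 2        # variance (noise) over squared gradient (signal)

for theta in [1.0, 0.5, 0.2, 0.05]:
    print(f"theta = {theta:4.2f}   NSR ~ {nsr(theta):8.1f}")
```

For this toy problem the NSR can also be computed analytically as (15*sigma**2 + 14*d**2 + d**4/sigma**2) / (4*d**2) with d = theta - a_star, so it grows like 1/d**2 as the policy nears the optimum, matching the paper's qualitative finding of a highly non-uniform NSR landscape.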
Problem

Research questions and friction points this paper is trying to address.

noise-to-signal ratio
policy gradient
REINFORCE
training instability
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

noise-to-signal ratio
REINFORCE
policy gradient
variance analysis
training instability