🤖 AI Summary
This work investigates the equivalence between stochastic and deterministic policy gradients in continuous control. For a class of Markov decision processes (MDPs) driven by Gaussian noise and featuring quadratic costs, we establish, for the first time, rigorous equivalence at three fundamental levels: policy gradients, natural gradients, and state-value functions. Building on this insight, we propose a novel paradigm that constructs an equivalent deterministic MDP whose controls are the sufficient statistics of the stochastic policy, thereby unifying stochastic policy optimization within a deterministic framework. This approach eliminates the variance induced by policy stochasticity and yields a more stable optimization path that can be estimated efficiently from state-value functions. Our results provide a unified theoretical foundation for policy gradient methods and support the principled, reliable deployment of deterministic policies in practical reinforcement learning systems.
📝 Abstract
Policy gradients in continuous control have been derived for both stochastic and deterministic policies. Here we study the relationship between the two. In a widely used family of MDPs with Gaussian control noise and quadratic control costs, we show that the stochastic and deterministic policy gradients, natural gradients, and state value functions are identical, while the state-control value functions differ. We then develop a general procedure for constructing an MDP with a deterministic policy that is equivalent to a given MDP with a stochastic policy. The controls of this new MDP are the sufficient statistics of the stochastic policy in the original MDP. Our results suggest that policy gradient methods can be unified by approximating state value functions rather than state-control value functions.
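The gradient equivalence can be sketched numerically in the simplest instance: a one-step problem with a quadratic reward and a fixed-variance Gaussian policy (an illustrative toy setup, not the paper's full MDP class; the names `theta`, `b`, and `sigma` are hypothetical parameters chosen for this sketch). Since E[-(u-b)²] under u ~ N(θ, σ²) equals -(θ-b)² - σ², and the σ² term is constant in θ, the exact stochastic policy gradient coincides with the deterministic one.

```python
import numpy as np

# Toy one-step problem: reward r(u) = -(u - b)^2, Gaussian policy
# u ~ N(theta, sigma^2) with fixed variance (illustrative only).
theta, b, sigma = 0.5, 2.0, 0.3

# Deterministic policy gradient: differentiate r at u = theta directly.
det_grad = -2.0 * (theta - b)  # d/dtheta of -(theta - b)^2

# Stochastic policy gradient via the score-function (REINFORCE) estimator,
# using r(theta) as a baseline to reduce Monte Carlo variance.
rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)
u = theta + sigma * z
score = (u - theta) / sigma**2          # grad_theta log N(u; theta, sigma^2)
r = -(u - b) ** 2
baseline = -(theta - b) ** 2
sto_grad = np.mean((r - baseline) * score)

print(det_grad, sto_grad)  # both close to 3.0
```

With enough samples the Monte Carlo estimate of the stochastic gradient matches the deterministic gradient, illustrating (in this special case) the equivalence the abstract states for Gaussian noise with quadratic control costs.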