Equivalence of stochastic and deterministic policy gradients

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the equivalence between stochastic and deterministic policy gradients in continuous control. For a class of Markov decision processes (MDPs) driven by Gaussian noise and featuring quadratic costs, we establish, for the first time, rigorous equivalence across three fundamental levels: policy gradients, natural gradients, and state-value functions. Building on this insight, we propose a novel paradigm that constructs an equivalent deterministic MDP via sufficient statistics, thereby unifying stochastic policy optimization within a deterministic framework. This approach eliminates variance induced by policy stochasticity and yields a more stable, efficiently estimable optimization path guided by state-value functions. Our results provide a unified theoretical foundation for policy gradient methods and advance the principled, reliable deployment of deterministic policies in practical reinforcement learning systems.

📝 Abstract
Policy gradients in continuous control have been derived for both stochastic and deterministic policies. Here we study the relationship between the two. In a widely used family of MDPs involving Gaussian control noise and quadratic control costs, we show that the stochastic and deterministic policy gradients, natural gradients, and state value functions are identical, while the state-control value functions differ. We then develop a general procedure for constructing an MDP with a deterministic policy that is equivalent to a given MDP with a stochastic policy. The controls of this new MDP are the sufficient statistics of the stochastic policy in the original MDP. Our results suggest that policy gradient methods can be unified by approximating state value functions rather than state-control value functions.
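To make the gradient-level equivalence concrete, here is a minimal sketch (not taken from the paper) for a one-step problem with a Gaussian policy and quadratic cost. With cost c(u) = r u² and policy u ~ N(μ, σ²), the score-function (stochastic) policy gradient with respect to μ matches the deterministic gradient dc/du evaluated at u = μ, which is 2rμ. All parameter values below are arbitrary choices for illustration:

```python
import numpy as np

# One-step problem: cost c(u) = r * u^2, stochastic policy u ~ N(mu, sigma^2).
# Claim illustrated: for quadratic costs, the stochastic (score-function) policy
# gradient w.r.t. mu equals the deterministic gradient dc/du at u = mu, i.e. 2*r*mu.
rng = np.random.default_rng(0)
r, mu, sigma, n = 1.0, 1.0, 0.5, 200_000

u = rng.normal(mu, sigma, size=n)   # samples from the stochastic policy
cost = r * u**2                     # quadratic control cost

# Deterministic policy gradient: d/dmu c(mu) = 2 * r * mu
det_grad = 2.0 * r * mu

# Stochastic (REINFORCE) gradient estimate, with baseline c(mu) for variance
# reduction: E[(c(u) - c(mu)) * (u - mu) / sigma^2]
sto_grad = np.mean((cost - r * mu**2) * (u - mu) / sigma**2)

print(det_grad, sto_grad)  # the two gradients agree up to Monte Carlo error
```

The baseline subtraction does not bias the score-function estimator; it only shrinks the Monte Carlo variance so the agreement is visible with a modest sample size.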
Problem

Research questions and friction points this paper is trying to address.

Comparing stochastic and deterministic policy gradients in MDPs
Equivalence of policy gradients under Gaussian noise and quadratic costs
Unifying policy gradient methods via state value function approximation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Equivalence of stochastic and deterministic policy gradients
MDP construction with deterministic policy equivalence
Unifying policy gradients via state value approximation
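The sufficient-statistics construction can be sketched for the same Gaussian/quadratic setting (an illustrative reduction, not the paper's general procedure): the control of the equivalent deterministic MDP is the pair (μ, σ), and its deterministic cost is the expected cost under the stochastic policy, which for c(u) = r u² has the closed form r(μ² + σ²):

```python
import numpy as np

# Equivalent deterministic problem: the "control" is the sufficient statistics
# (mu, sigma) of the Gaussian policy u ~ N(mu, sigma^2), and the deterministic
# cost is the expected stochastic cost:
#   c_det(mu, sigma) = E[r * u^2] = r * (mu^2 + sigma^2)
def c_det(mu, sigma, r=1.0):
    return r * (mu**2 + sigma**2)

# Check the closed form against a Monte Carlo estimate of the stochastic cost
# (parameter values are arbitrary choices for illustration).
rng = np.random.default_rng(1)
mu, sigma, r, n = 0.7, 0.3, 1.0, 500_000
u = rng.normal(mu, sigma, size=n)
mc_cost = np.mean(r * u**2)

print(c_det(mu, sigma, r), mc_cost)  # closed form vs. sampled expectation
```

Because c_det is a deterministic function of (μ, σ), optimizing it removes the sampling variance that policy stochasticity injects into the original objective, which is the stabilization the summary describes.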