AI Summary
This work addresses ill-defined and unstable policy gradients in deterministic policy gradient (DPG) methods under sparse or discrete rewards, where the Q-function may be non-differentiable with respect to the action. The paper proposes the soft deterministic policy gradient (Soft-DPG), which brings Gaussian smoothing into the DPG framework. By formulating a smoothed Bellman equation and defining a new action-value function from it, Soft-DPG removes the explicit dependence on the critic's gradient with respect to the action. The analysis shows that the resulting policy gradient stays well-defined even when the Q-function is non-smooth. Empirically, the deep variant, Soft DDPG, remains competitive with DDPG on standard dense-reward continuous control benchmarks and provides clear gains in most of their discretized-reward variants.
Abstract
Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in practical control problems involving sparse or discrete rewards, leading to ill-defined policy gradients and unstable learning. To address these challenges, we propose a principled alternative based on a smoothed Bellman equation formulated via Gaussian smoothing. Specifically, we define a novel action-value function based on a smoothed Bellman equation and derive the soft deterministic policy gradient (Soft-DPG). Our formulation eliminates explicit dependence on critic action-gradients and ensures that the gradient remains well-defined even for non-smooth Q-functions. We instantiate this framework into a deep reinforcement learning algorithm, which we call soft deep deterministic policy gradient (Soft DDPG). Empirical evaluations on standard continuous control benchmarks and their discretized-reward variants show that Soft DDPG remains competitive in dense-reward settings and provides clear gains in most discretized-reward environments, where standard DDPG is more sensitive to irregular critic landscapes.
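The core idea of avoiding critic action-gradients can be illustrated with the standard Gaussian smoothing identity, under which the gradient of E[Q(a + σε)] with respect to a equals E[Q(a + σε) ε] / σ for ε ~ N(0, 1). The sketch below is not the paper's Soft-DPG update or its smoothed Bellman equation, which the abstract does not spell out. It only shows, for a one-dimensional action, why a smoothed objective yields a usable gradient where the raw critic yields none. The critic `step_q`, the function `smoothed_q_grad`, and the values of `sigma` and `n_samples` are all hypothetical choices made for illustration.

```python
import numpy as np


def step_q(action):
    """Hypothetical piecewise-constant critic: 1 inside a target band, 0 elsewhere.

    Its derivative with respect to the action is zero almost everywhere
    and undefined at the band edges (0.3 and 0.7).
    """
    return np.where(np.abs(action - 0.5) < 0.2, 1.0, 0.0)


def smoothed_q_grad(q_fn, action, sigma=0.3, n_samples=200_000, seed=0):
    """Monte Carlo estimate of d/da E[q_fn(a + sigma * eps)], eps ~ N(0, 1).

    Uses the Gaussian smoothing identity, so only evaluations of q_fn are
    needed and q_fn itself is never differentiated. `sigma` and `n_samples`
    are illustrative values, not settings taken from the paper.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    q_vals = q_fn(action + sigma * eps)
    # Subtracting the sample mean reduces variance; the bias it adds
    # shrinks as n_samples grows.
    baseline = q_vals.mean()
    return float(np.mean((q_vals - baseline) * eps) / sigma)


if __name__ == "__main__":
    a = 0.0
    # Central finite difference on the raw critic: no learning signal.
    h = 1e-4
    fd = (step_q(a + h) - step_q(a - h)) / (2 * h)
    print("finite-difference gradient of raw critic:", float(fd))
    print("smoothed gradient estimate:", smoothed_q_grad(step_q, a))
```

At `a = 0.0` the finite difference on the raw critic is exactly zero, while the smoothed estimate is positive and points toward the rewarding band. A larger `sigma` widens the region in which the gradient is informative, at the cost of blurring the location of the band.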