🤖 AI Summary
In DDPG-style algorithms, sharing the same experience replay batch between the Actor and Critic induces objective misalignment and update dynamics mismatch. To address this, we propose Decoupled Prioritized Experience Replay (DPER), a mechanism that employs distinct, goal-oriented sampling strategies—one tailored for Actor policy optimization and another for Critic value estimation—thereby enabling targeted training signal delivery. DPER integrates seamlessly into the Twin-Delayed DDPG (TD3) framework without altering network architecture or loss functions. Extensive experiments on standard MuJoCo continuous control benchmarks demonstrate that DPER significantly improves final policy performance (+12.3% average return) and training stability (37% reduction in return variance), outperforming both uniform and conventional prioritized replay. This work provides the first systematic empirical validation of Actor–Critic experience decoupling as an effective and broadly applicable principle for off-policy reinforcement learning.
📝 Abstract
Background: Deep Deterministic Policy Gradient-based reinforcement learning algorithms utilize Actor-Critic architectures, where both networks are typically trained using identical batches of replayed transitions. However, the learning objectives and update dynamics of the Actor and Critic differ, raising concerns about whether uniform transition usage is optimal.
Objectives: We aim to improve the performance of deep deterministic policy gradient algorithms by decoupling the transition batches used to train the Actor and the Critic. Our goal is to design an experience replay mechanism that provides appropriate learning signals to each component by using separate, tailored batches.
Methods: We introduce Decoupled Prioritized Experience Replay (DPER), a novel approach that allows independent sampling of transition batches for the Actor and the Critic. DPER can be integrated into any off-policy deep reinforcement learning algorithm that operates in continuous control domains. We combine DPER with the state-of-the-art Twin Delayed DDPG algorithm and evaluate its performance across standard continuous control benchmarks.
Results: DPER outperforms conventional experience replay strategies such as vanilla experience replay and prioritized experience replay in multiple MuJoCo tasks from the OpenAI Gym suite.
Conclusions: Our findings show that decoupling experience replay for Actor and Critic networks can enhance training dynamics and final policy quality. DPER offers a generalizable mechanism that enhances performance for a wide class of actor-critic off-policy reinforcement learning algorithms.