D2 Actor Critic: Diffusion Actor Meets Distributional Critic

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the instability of online reinforcement learning caused by high-variance policy gradients and the complexity of backpropagation through time, this paper proposes D2AC, a novel model-free algorithm built around diffusion policies. Methodologically, D2AC (1) employs a denoising diffusion model as the policy network to explicitly represent multimodal action distributions, and (2) introduces a robust distributional critic that fuses distributional reinforcement learning with clipped double Q-learning for low-variance, off-policy value estimation. By sidestepping the higher-order derivatives and long-horizon backpropagation through the denoising chain that naive diffusion-policy gradients would require, D2AC significantly improves training stability and generalization. Evaluated on 18 challenging continuous-control benchmarks, including humanoid and quadrupedal robots, dexterous robotic hands, and a biologically motivated predator-prey simulation, D2AC consistently outperforms state-of-the-art diffusion-based policies and mainstream RL algorithms, demonstrating superior expressive capacity and behavioral robustness.
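The summary describes the critic as a fusion of distributional RL with clipped double Q-learning. A minimal sketch of what such a target could look like, assuming quantile-based return distributions and an atom-wise minimum over two target critics (the function and variable names here are illustrative, not the paper's):

```python
import numpy as np

def clipped_distributional_target(rewards, dones, next_q1, next_q2, gamma=0.99):
    """Sketch of a clipped double-Q distributional TD target (assumed form,
    not the paper's exact rule): sort each target critic's quantile atoms,
    take the element-wise minimum to curb overestimation, then apply the
    Bellman backup to the surviving atoms."""
    q1 = np.sort(next_q1, axis=-1)
    q2 = np.sort(next_q2, axis=-1)
    clipped = np.minimum(q1, q2)  # pessimistic atom-wise estimate
    return rewards[..., None] + gamma * (1.0 - dones[..., None]) * clipped
```

Taking the minimum over return atoms, rather than over scalar Q-values, keeps the full return distribution available for the critic's regression loss while retaining the anti-overestimation effect of clipped double Q-learning.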

📝 Abstract
We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. Beyond standard benchmarks, we also evaluate a biologically motivated predator-prey task to examine the behavioral robustness and generalization capacity of our approach.
Problem

Research questions and friction points this paper is trying to address.

Develops a model-free RL algorithm for expressive diffusion policies
Addresses high variance in policy gradients via a stable policy improvement objective
Enables robust performance across dense-reward and goal-conditioned tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion policy trained with online reinforcement learning
Distributional critic fused with clipped double Q-learning
Avoids high-variance policy gradients and backpropagation through time
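The first bullet can be illustrated with a minimal DDPM-style action-sampling loop: the policy draws Gaussian noise and iteratively denoises it conditioned on the state. This is a generic diffusion-policy sketch, not the paper's implementation; `eps_model`, the beta schedule, and the tanh squashing are all assumptions.

```python
import numpy as np

def denoise_step(a_t, state, t, eps_model, betas):
    """One reverse-diffusion step: predict the noise in the current action
    sample, take a posterior-mean step, and add fresh noise (except at t=0)."""
    alphas = 1.0 - betas
    alpha_bar = np.prod(alphas[: t + 1])
    eps = eps_model(a_t, state, t)  # network's noise prediction
    mean = (a_t - betas[t] / np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alphas[t])
    noise = np.random.randn(*a_t.shape) if t > 0 else 0.0
    return mean + np.sqrt(betas[t]) * noise

def sample_action(state, eps_model, betas, action_dim):
    """Sample an action by running the full reverse chain from Gaussian noise."""
    a = np.random.randn(action_dim)
    for t in reversed(range(len(betas))):
        a = denoise_step(a, state, t, eps_model, betas)
    return np.tanh(a)  # squash into bounded action space
```

Because the chain iteratively refines pure noise, the resulting policy can place probability mass on several distinct action modes for the same state, which a unimodal Gaussian policy cannot.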
👥 Authors
Lunjun Zhang, University of Toronto (Artificial intelligence, Robotics)
Shuo Han, Department of Statistics, Northwestern University
Hanrui Lyu, Department of Statistics, Northwestern University
Bradly C. Stadie, Department of Statistics, Northwestern University