Distributional Soft Actor-Critic with Diffusion Policy

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional reinforcement learning often models value distributions as unimodal (e.g., Gaussian), leading to estimation bias and difficulty in representing multimodal policies. To address this, we propose Distributional Soft Actor-Critic with Diffusion Policy (DSAC-D). The method introduces: (1) a dual-diffusion mechanism that jointly optimizes the value and policy networks, where a diffusion model performs reverse sampling to generate reward samples and constructs a diffusion-based value network, enabling joint modeling of multimodal value and policy distributions; and (2) an entropy-regularized policy optimization framework with distributional convergence guarantees that ensures training stability. Evaluated on nine MuJoCo benchmark tasks, DSAC-D achieves state-of-the-art performance, improving average return by over 10%. Real-world vehicle experiments further demonstrate that it accurately captures diverse driving styles, significantly enhancing policy diversity and generalization.
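The paper's networks are not reproduced here; as a minimal sketch, the reverse-sampling step of a DDPM-style diffusion model (the mechanism the summary describes for generating reward samples for the value network) can be written as follows. The function name, noise schedule, and `score_fn` placeholder are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def reverse_sample(score_fn, n_samples, n_steps=50, rng=None):
    """DDPM-style reverse sampling sketch: start from Gaussian noise and
    iteratively denoise toward a target (e.g. return) distribution.
    `score_fn(x, t)` stands in for a learned noise-prediction network."""
    rng = np.random.default_rng(rng)
    betas = np.linspace(1e-4, 0.02, n_steps)       # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(n_samples)             # x_T ~ N(0, 1)
    for t in range(n_steps - 1, -1, -1):
        eps_hat = score_fn(x, t)                   # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(n_samples) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise       # posterior sample x_{t-1}
    return x
```

With a trained `score_fn`, repeated calls yield a set of samples whose empirical distribution can be multi-peaked, which is what lets a diffusion-based value network represent multimodal return distributions.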

📝 Abstract
Reinforcement learning has proven highly effective for complex control tasks. Traditional methods typically model value distributions with unimodal distributions, such as Gaussians. However, a unimodal distribution easily introduces bias into value function estimation, degrading algorithm performance. This paper proposes a distributional reinforcement learning algorithm called DSAC-D (Distributional Soft Actor-Critic with Diffusion Policy) to address the challenges of value function estimation bias and multimodal policy representation. A multimodal distributional policy iteration framework that converges to the optimal policy is established by introducing policy entropy and a value distribution function. A diffusion value network that accurately characterizes multi-peaked distributions is constructed by generating a set of reward samples through reverse sampling with a diffusion model. On this basis, a distributional reinforcement learning algorithm with dual diffusion of the value network and the policy network is derived. MuJoCo benchmark tasks demonstrate that the proposed algorithm not only learns multimodal policies but also achieves state-of-the-art (SOTA) performance on all nine control tasks, with significant suppression of estimation bias and a total average return improvement of over 10% compared with existing mainstream algorithms. Real-vehicle testing shows that DSAC-D accurately characterizes the multimodal distribution of different driving styles, and that the diffusion policy network can represent multimodal trajectories.
Problem

Research questions and friction points this paper is trying to address.

Addresses bias in value function estimation using multimodal distributions
Develops a diffusion policy for accurate multimodal policy representation
Improves algorithm performance in complex control tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal distributional policy iteration framework
Diffusion value network for multi-peak distributions
Dual diffusion in value and policy networks
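The entropy term in the multimodal distributional policy iteration framework follows the soft Bellman backup used by SAC-family methods. A one-line sketch, with the caveat that the function name is hypothetical and the point-estimate form below is a simplification of the paper's distributional formulation:

```python
def soft_q_target(reward, next_q, next_log_prob, gamma=0.99, alpha=0.2):
    """Entropy-regularized (soft) Bellman target of SAC-family methods:
        y = r + gamma * (Q(s', a') - alpha * log pi(a'|s')).
    In a distributional critic, next_q would be a sample drawn from the
    learned return distribution rather than a scalar point estimate."""
    return reward + gamma * (next_q - alpha * next_log_prob)
```

The `-alpha * log pi` term rewards stochasticity, which is what keeps the policy from collapsing onto a single mode during training.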
Tong Liu
School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
Yinuo Wang
Tsinghua University
LLM, Reinforcement Learning, Autonomous Driving, Diffusion Model
Xujie Song
School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
Wenjun Zou
School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
Liangfa Chen
School of Mechanical Engineering, University of Science and Technology Beijing, Beijing 100083, China
Likun Wang
School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
Bin Shuai
Tsinghua University
Reinforcement Learning, Autonomous Vehicle, Optimal Control
Jingliang Duan
University of Science and Technology Beijing
Shengbo Eben Li
School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China