Exploratory Diffusion Policy for Unsupervised Reinforcement Learning

📅 2025-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Unsupervised reinforcement learning (RL) suffers from weak pre-trained policy fitting capacity and poor generalization to heterogeneous exploration data. Method: This paper introduces diffusion models into the unsupervised RL pre-training framework for the first time, proposing a score-matching-based intrinsic reward mechanism and a theoretically grounded diffusion policy distillation fine-tuning algorithm. The method jointly optimizes the Q-function and policy distillation to efficiently model and transfer reward-free exploration trajectories. Contribution/Results: Across multiple benchmark environments, the approach achieves broader state coverage during pre-training and improves downstream task sample efficiency by over 30% during fine-tuning. It significantly enhances exploration quality and adaptation speed, demonstrating superior generalization and transferability of pre-trained representations.

Technology Category

Application Category

📝 Abstract
Unsupervised reinforcement learning (RL) aims to pre-train agents by exploring states or skills in reward-free environments, facilitating the adaptation to downstream tasks. However, existing methods often overlook the fitting ability of pre-trained policies and struggle to handle the heterogeneous pre-training data, which are crucial for achieving efficient exploration and fast fine-tuning. To address this gap, we propose Exploratory Diffusion Policy (EDP), which leverages the strong expressive ability of diffusion models to fit the explored data, both boosting exploration and obtaining an efficient initialization for downstream tasks. Specifically, we estimate the distribution of collected data in the replay buffer with the diffusion policy and propose a score intrinsic reward, encouraging the agent to explore unseen states. For fine-tuning the pre-trained diffusion policy on downstream tasks, we provide both theoretical analyses and practical algorithms, including an alternating method of Q function optimization and diffusion policy distillation. Extensive experiments demonstrate the effectiveness of EDP in efficient exploration during pre-training and fast adaptation during fine-tuning.
Problem

Research questions and friction points this paper is trying to address.

Enhance unsupervised RL exploration efficiency
Improve pre-trained policy fitting ability
Facilitate fast fine-tuning on downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages diffusion models for data fitting
Proposes score intrinsic reward for exploration
Alternates Q function optimization and policy distillation
🔎 Similar Papers
No similar papers found.
Chengyang Ying
Chengyang Ying
Tsinghua university
Machine LearningReinforcement LearningEmbodied AI
Huayu Chen
Huayu Chen
Tsinghua University
Reinforcement LearningDeep Generative ModelsMachine Learning
X
Xinning Zhou
Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University
Zhongkai Hao
Zhongkai Hao
Tsinghua University
machine learningAI for Sciencephysics-informed machine learning
H
Hang Su
Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University
J
Jun Zhu
Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University