🤖 AI Summary
Unsupervised reinforcement learning (RL) often pre-trains policies with limited capacity to fit the heterogeneous, reward-free data collected during exploration, which hurts both exploration quality and downstream adaptation. Method: This paper introduces diffusion models into the unsupervised RL pre-training framework for the first time, proposing a score-matching-based intrinsic reward mechanism and a theoretically grounded diffusion policy distillation algorithm for fine-tuning. The method alternates between Q-function optimization and policy distillation to efficiently model and transfer reward-free exploration trajectories. Contribution/Results: Across multiple benchmark environments, the approach achieves broader state coverage during pre-training and improves downstream sample efficiency by over 30% during fine-tuning. It significantly enhances exploration quality and adaptation speed, demonstrating superior generalization and transferability of the pre-trained representations.
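The alternating fine-tuning scheme in the summary can be illustrated with a toy, runnable sketch. Everything below is an illustrative stand-in rather than the paper's actual algorithm: the "diffusion policy" is a fixed two-mode sampler over a 1-D action space, Q is a quadratic least-squares fit, and distillation is approximated by fitting a Gaussian policy to Q-reweighted samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pre-trained diffusion policy: a sampler covering
# two explored action modes (illustrative, not the paper's model).
def diffusion_policy_sample(n):
    modes = rng.choice([-1.0, 1.0], size=n)
    return modes + 0.1 * rng.standard_normal(n)

def reward(a):                      # downstream task prefers a ≈ +1
    return -(a - 1.0) ** 2

theta = np.zeros(3)                 # quadratic Q: [1, a, a^2] @ theta
mu, sigma = 0.0, 1.0                # distilled Gaussian policy parameters
for step in range(5):               # alternate (i) Q update, (ii) distillation
    a = diffusion_policy_sample(256)
    feats = np.stack([np.ones_like(a), a, a * a], axis=1)
    # (i) least-squares Q regression onto observed task rewards
    theta, *_ = np.linalg.lstsq(feats, reward(a), rcond=None)
    # (ii) distillation: reweight diffusion samples by their Q-values
    # and fit the simple policy to the reweighted action distribution
    q = feats @ theta
    w = np.exp((q - q.max()) / 0.1)
    w /= w.sum()
    mu = float(w @ a)
    sigma = float(np.sqrt(w @ (a - mu) ** 2) + 1e-3)

print(mu)  # should concentrate near the high-reward mode a ≈ +1
```

The point of the sketch is the structure, not the components: a frozen expressive sampler provides coverage of explored behaviors, while the alternating loop steers a cheap-to-evaluate policy toward high-Q regions of that coverage.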
📝 Abstract
Unsupervised reinforcement learning (RL) aims to pre-train agents by exploring states or skills in reward-free environments, facilitating adaptation to downstream tasks. However, existing methods often overlook the fitting ability of pre-trained policies and struggle to handle heterogeneous pre-training data, both of which are crucial for efficient exploration and fast fine-tuning. To address this gap, we propose Exploratory Diffusion Policy (EDP), which leverages the strong expressive ability of diffusion models to fit the explored data, both boosting exploration and providing an efficient initialization for downstream tasks. Specifically, we estimate the distribution of the collected data in the replay buffer with the diffusion policy and propose a score intrinsic reward that encourages the agent to explore unseen states. For fine-tuning the pre-trained diffusion policy on downstream tasks, we provide both theoretical analyses and practical algorithms, including a method that alternates between Q-function optimization and diffusion policy distillation. Extensive experiments demonstrate the effectiveness of EDP in efficient exploration during pre-training and fast adaptation during fine-tuning.
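The score intrinsic reward described above can be sketched as a denoising score-matching residual: states the density model explains well get low reward, while unseen states get high reward. The paper learns the score with a diffusion policy; the minimal sketch below substitutes a closed-form Gaussian fit to the replay buffer as the score model, so all names and the Gaussian stand-in are assumptions for illustration only.

```python
import numpy as np

def gaussian_score(x, mu, var):
    # Score of N(mu, var*I): grad_x log p(x) = -(x - mu) / var
    return -(x - mu) / var

def intrinsic_reward(state, buffer, noise_std=0.1, n_samples=64, seed=0):
    """Denoising score-matching residual as a novelty reward (sketch).

    `buffer` is an (N, d) array of explored states; `state` is a (d,)
    query. High residual => the density model fit to the buffer explains
    the state poorly => reward exploration of it.
    """
    rng = np.random.default_rng(seed)
    mu = buffer.mean(axis=0)
    var = buffer.var(axis=0).mean() + noise_std ** 2
    eps = rng.standard_normal((n_samples, state.shape[-1])) * noise_std
    noisy = state + eps
    # Denoising score-matching target: score of the noising kernel
    target = -eps / noise_std ** 2
    pred = gaussian_score(noisy, mu, var)
    return float(np.mean((pred - target) ** 2))

# Usage: states far from the explored distribution score higher.
rng = np.random.default_rng(1)
replay = rng.standard_normal((1000, 3))       # explored states ~ N(0, I)
r_seen = intrinsic_reward(np.zeros(3), replay)
r_novel = intrinsic_reward(np.full(3, 10.0), replay)
```

In EDP the residual would come from the learned diffusion model's score network rather than a Gaussian, but the mechanism is the same: the mismatch between the model's score and the denoising target serves as the exploration bonus.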