🤖 AI Summary
Unsupervised reinforcement learning (RL) often pre-trains policies with limited capacity to fit the heterogeneous, reward-free data collected during exploration, which hurts both exploration quality and downstream adaptation. Method: This paper introduces diffusion models into the unsupervised RL pre-training framework for the first time, proposing a score-matching-based intrinsic reward mechanism and a theoretically grounded diffusion policy distillation algorithm for fine-tuning. The method alternates between Q-function optimization and policy distillation to efficiently model and transfer reward-free exploration trajectories. Contribution/Results: Across multiple benchmark environments, the approach achieves broader state coverage during pre-training and improves downstream sample efficiency by over 30% during fine-tuning. It significantly enhances exploration quality and adaptation speed, demonstrating superior generalization and transferability of the pre-trained representations.
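The alternating fine-tuning scheme in the summary can be illustrated with a toy, runnable sketch. Everything below is an illustrative stand-in rather than the paper's actual algorithm: the "diffusion policy" is a fixed two-mode sampler over a 1-D action space, Q is a quadratic least-squares fit, and distillation is approximated by fitting a Gaussian policy to Q-reweighted samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pre-trained diffusion policy: a sampler covering
# two explored action modes (illustrative, not the paper's model).
def diffusion_policy_sample(n):
    modes = rng.choice([-1.0, 1.0], size=n)
    return modes + 0.1 * rng.standard_normal(n)

def reward(a):                      # downstream task prefers a ≈ +1
    return -(a - 1.0) ** 2

theta = np.zeros(3)                 # quadratic Q: [1, a, a^2] @ theta
mu, sigma = 0.0, 1.0                # distilled Gaussian policy parameters
for step in range(5):               # alternate (i) Q update, (ii) distillation
    a = diffusion_policy_sample(256)
    feats = np.stack([np.ones_like(a), a, a * a], axis=1)
    # (i) least-squares Q regression onto observed task rewards
    theta, *_ = np.linalg.lstsq(feats, reward(a), rcond=None)
    # (ii) distillation: reweight diffusion samples by their Q-values
    # and fit the simple policy to the reweighted action distribution
    q = feats @ theta
    w = np.exp((q - q.max()) / 0.1)
    w /= w.sum()
    mu = float(w @ a)
    sigma = float(np.sqrt(w @ (a - mu) ** 2) + 1e-3)

print(mu)  # should concentrate near the high-reward mode a ≈ +1
```

The point of the sketch is the structure, not the components: a frozen expressive sampler provides coverage of explored behaviors, while the alternating loop steers a cheap-to-evaluate policy toward high-Q regions of that coverage.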
📝 Abstract
Unsupervised reinforcement learning (RL) aims to pre-train agents by exploring states or skills in reward-free environments, facilitating adaptation to downstream tasks. However, existing methods often overlook the fitting ability of pre-trained policies and struggle to handle heterogeneous pre-training data, both of which are crucial for efficient exploration and fast fine-tuning. To address this gap, we propose Exploratory Diffusion Policy (EDP), which leverages the strong expressive ability of diffusion models to fit the explored data, both boosting exploration and providing an efficient initialization for downstream tasks. Specifically, we estimate the distribution of the collected data in the replay buffer with the diffusion policy and propose a score intrinsic reward that encourages the agent to explore unseen states. For fine-tuning the pre-trained diffusion policy on downstream tasks, we provide both theoretical analyses and practical algorithms, including a method that alternates between Q-function optimization and diffusion policy distillation. Extensive experiments demonstrate the effectiveness of EDP in efficient exploration during pre-training and fast adaptation during fine-tuning.
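The score intrinsic reward described above can be sketched as a denoising score-matching residual: states the density model explains well get low reward, while unseen states get high reward. The paper learns the score with a diffusion policy; the minimal sketch below substitutes a closed-form Gaussian fit to the replay buffer as the score model, so all names and the Gaussian stand-in are assumptions for illustration only.

```python
import numpy as np

def gaussian_score(x, mu, var):
    # Score of N(mu, var*I): grad_x log p(x) = -(x - mu) / var
    return -(x - mu) / var

def intrinsic_reward(state, buffer, noise_std=0.1, n_samples=64, seed=0):
    """Denoising score-matching residual as a novelty reward (sketch).

    `buffer` is an (N, d) array of explored states; `state` is a (d,)
    query. High residual => the density model fit to the buffer explains
    the state poorly => reward exploration of it.
    """
    rng = np.random.default_rng(seed)
    mu = buffer.mean(axis=0)
    var = buffer.var(axis=0).mean() + noise_std ** 2
    eps = rng.standard_normal((n_samples, state.shape[-1])) * noise_std
    noisy = state + eps
    # Denoising score-matching target: score of the noising kernel
    target = -eps / noise_std ** 2
    pred = gaussian_score(noisy, mu, var)
    return float(np.mean((pred - target) ** 2))

# Usage: states far from the explored distribution score higher.
rng = np.random.default_rng(1)
replay = rng.standard_normal((1000, 3))       # explored states ~ N(0, I)
r_seen = intrinsic_reward(np.zeros(3), replay)
r_novel = intrinsic_reward(np.full(3, 10.0), replay)
```

In EDP the residual would come from the learned diffusion model's score network rather than a Gaussian, but the mechanism is the same: the mismatch between the model's score and the denoising target serves as the exploration bonus.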