Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation

📅 2024-11-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0

🤖 AI Summary
To address insufficient exploration in non-prehensile manipulation, which leads to poor skill transfer and weak out-of-distribution generalization, this paper proposes an off-policy reinforcement learning framework with a hybrid discrete-continuous action space. The core contribution is to integrate a diffusion model into a hybrid-action RL architecture to model the continuous motion parameters. Combined with maximum-entropy Q-learning, this yields an objective whose entropy term is derived as a lower bound via structured variational inference, so that discrete decisions (e.g., contact point selection) and continuous motion generation are optimized jointly. Evaluated in simulation and in zero-shot sim-to-real transfer, the method (HyDo) produces more diverse policies and higher success rates: on a real-world 6D pose alignment task, success rises from 53% to 72%.

📝 Abstract
Learning diverse policies for non-prehensile manipulation is essential for improving skill transfer and generalization to out-of-distribution scenarios. In this work, we enhance exploration through a two-fold approach within a hybrid framework that tackles both discrete and continuous action spaces. First, we model the continuous motion parameter policy as a diffusion model, and second, we incorporate this into a maximum entropy reinforcement learning framework that unifies both the discrete and continuous components. The discrete action space, such as contact point selection, is optimized through Q-value function maximization, while the continuous part is guided by a diffusion-based policy. This hybrid approach leads to a principled objective, where the maximum entropy term is derived as a lower bound using structured variational inference. We propose the Hybrid Diffusion Policy algorithm (HyDo) and evaluate its performance on both simulation and zero-shot sim2real tasks. Our results show that HyDo encourages more diverse behavior policies, leading to significantly improved success rates across tasks, for example, increasing from 53% to 72% on a real-world 6D pose alignment task.

Project page: https://leh2rng.github.io/hydo
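
The division of labor described in the abstract (a diffusion-style sampler for continuous motion parameters, and Q-values for the discrete contact choice) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the HyDo implementation: the function names `sample_motion_params`, `q_value`, and `select_hybrid_action`, the hand-written "denoiser" and critic, the 3-dimensional motion vector, and the Boltzmann (softmax) selection over Q-values, which is one common maximum-entropy choice and reduces to plain Q-maximization as the temperature approaches zero.

```python
import numpy as np


def sample_motion_params(state, contact_idx, rng, n_steps=10, dim=3):
    """Toy stand-in for a diffusion policy (hypothetical, not a trained model).

    Starts from Gaussian noise and iteratively denoises toward a target
    that depends on the state and the chosen contact index.
    """
    x = rng.standard_normal(dim)
    target = np.tanh(state[:dim] + 0.1 * contact_idx)  # hypothetical denoising target
    for t in range(n_steps, 0, -1):
        noise_scale = 0.1 * (t - 1) / n_steps  # noise shrinks to zero at the last step
        x = x + 0.5 * (target - x) + noise_scale * rng.standard_normal(dim)
    return np.clip(x, -1.0, 1.0)


def q_value(state, contact_idx, motion):
    """Hypothetical critic: prefers motions near the target and low contact indices."""
    target = np.tanh(state[: motion.size] + 0.1 * contact_idx)
    return -np.sum((motion - target) ** 2) - 0.05 * contact_idx


def select_hybrid_action(state, n_contacts, rng, temperature=1.0):
    """Sample one motion per discrete contact, then pick a contact by softmax over Q."""
    motions = [sample_motion_params(state, k, rng) for k in range(n_contacts)]
    q = np.array([q_value(state, k, m) for k, m in enumerate(motions)])
    logits = (q - q.max()) / temperature  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    k = int(rng.choice(n_contacts, p=probs))
    return k, motions[k], probs


rng = np.random.default_rng(0)
state = np.array([0.2, -0.4, 0.1, 0.0])  # made-up observation vector
contact, motion, probs = select_hybrid_action(state, n_contacts=5, rng=rng)
print(contact, motion, probs)
```

In this sketch, a higher `temperature` spreads probability over more contact candidates, which is the kind of exploration effect the entropy term is meant to encourage; a real system would replace the hand-written denoiser and critic with learned networks.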
Problem

Research questions and friction points this paper is trying to address.

Learning diverse policies for non-prehensile manipulation tasks
Enhancing exploration in hybrid discrete-continuous action spaces
Improving skill transfer and generalization in RL frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid framework for discrete and continuous action spaces
Diffusion model for continuous motion parameter policy
Maximum entropy RL unifying discrete and continuous components
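
As background for the last point, the standard maximum-entropy RL objective and the chain rule of entropy for a factorized hybrid policy can be written as follows. The notation ($a^d$ for the discrete action, $a^c$ for the continuous parameters, $\alpha$ for the entropy temperature) and the factorization order are assumptions chosen for illustration. The identity in the last line is a general fact about joint entropies, not the paper's specific bound: the paper derives its entropy term as a lower bound via structured variational inference, and that derivation is not reproduced here.

```latex
\begin{align}
J(\pi) &= \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \pi}
  \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big], \\
\pi(a^d, a^c \mid s) &= \pi^d(a^d \mid s) \, \pi^c(a^c \mid s, a^d), \\
\mathcal{H}\big(\pi(\cdot \mid s)\big) &= \mathcal{H}\big(\pi^d(\cdot \mid s)\big)
  + \mathbb{E}_{a^d \sim \pi^d(\cdot \mid s)}
  \Big[ \mathcal{H}\big(\pi^c(\cdot \mid s, a^d)\big) \Big].
\end{align}
```

One reason a lower bound is useful here is that the entropy of a diffusion-based $\pi^c$ generally has no closed form, so the continuous term cannot be evaluated directly.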