Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation

📅 2024-11-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address insufficient exploration in non-prehensile manipulation, which leads to poor skill transfer and weak out-of-distribution generalization, this paper proposes an off-policy reinforcement learning framework with a hybrid discrete-continuous action space. The core contribution is the first integration of diffusion models into a hybrid-action RL architecture to model high-dimensional continuous motion parameters. Coupled with maximum-entropy Q-learning, the authors derive a lower bound on the maximum-entropy objective via structured variational inference, enabling end-to-end co-optimization of discrete decisions (e.g., contact point selection) and continuous motion generation. Evaluated in simulation and in zero-shot sim-to-real transfer, the resulting algorithm (HyDo) markedly improves policy diversity and generalization: success on a real-world 6D pose alignment task rises from 53% to 72%.
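
The hybrid action structure described above can be pictured roughly as follows. This is a minimal, illustrative PyTorch-style sketch, not the authors' implementation: the class names (`QNetwork`, `DiffusionPolicy`, `select_action`), dimensions, noise schedule, and the crude Euler-style denoising update are all assumptions; it only shows how a discrete choice driven by Q-values can be paired with a continuous motion parameter sampled from a conditional diffusion model.

```python
# Illustrative sketch of hybrid discrete-continuous action selection.
# Names, dimensions, and the denoising update are assumptions, not the paper's code.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-values over discrete options (e.g., candidate contact points)."""
    def __init__(self, state_dim, num_discrete):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, num_discrete),
        )

    def forward(self, state):
        return self.net(state)  # (batch, num_discrete)

class DiffusionPolicy(nn.Module):
    """Conditional denoiser over continuous motion parameters,
    conditioned on state, discrete choice, and diffusion timestep."""
    def __init__(self, state_dim, num_discrete, action_dim, n_steps=20):
        super().__init__()
        self.n_steps = n_steps
        self.action_dim = action_dim
        self.embed = nn.Embedding(num_discrete, 32)
        self.net = nn.Sequential(
            nn.Linear(state_dim + 32 + action_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def sample(self, state, discrete_action):
        """Reverse diffusion: start from Gaussian noise, iteratively denoise."""
        batch = state.shape[0]
        a = torch.randn(batch, self.action_dim)
        cond = torch.cat([state, self.embed(discrete_action)], dim=-1)
        for t in reversed(range(self.n_steps)):
            t_in = torch.full((batch, 1), t / self.n_steps)
            eps = self.net(torch.cat([cond, a, t_in], dim=-1))
            a = a - eps / self.n_steps              # crude Euler-style update (illustrative only)
            if t > 0:
                a = a + 0.05 * torch.randn_like(a)  # keep stochasticity for exploration
        return a

def select_action(q_net, diff_policy, state, temperature=1.0):
    """Hybrid action: soft (Boltzmann) choice over discrete options,
    then a continuous motion parameter from the conditional diffusion policy."""
    q_values = q_net(state)
    probs = torch.softmax(q_values / temperature, dim=-1)
    discrete = torch.multinomial(probs, num_samples=1).squeeze(-1)
    continuous = diff_policy.sample(state, discrete)
    return discrete, continuous

# Usage with made-up dimensions:
state = torch.randn(4, 16)
q_net = QNetwork(state_dim=16, num_discrete=8)
policy = DiffusionPolicy(state_dim=16, num_discrete=8, action_dim=6)
d, c = select_action(q_net, policy, state)
```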

📝 Abstract
Learning diverse policies for non-prehensile manipulation is essential for improving skill transfer and generalization to out-of-distribution scenarios. In this work, we enhance exploration through a two-fold approach within a hybrid framework that tackles both discrete and continuous action spaces. First, we model the continuous motion parameter policy as a diffusion model, and second, we incorporate this into a maximum entropy reinforcement learning framework that unifies both the discrete and continuous components. The discrete action space, such as contact point selection, is optimized through Q-value function maximization, while the continuous part is guided by a diffusion-based policy. This hybrid approach leads to a principled objective, where the maximum entropy term is derived as a lower bound using structured variational inference. We propose the Hybrid Diffusion Policy algorithm (HyDo) and evaluate its performance on both simulation and zero-shot sim2real tasks. Our results show that HyDo encourages more diverse behavior policies, leading to significantly improved success rates across tasks - for example, increasing from 53% to 72% on a real-world 6D pose alignment task. Project page: https://leh2rng.github.io/hydo
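
The objective the abstract describes can be sketched in standard maximum-entropy RL notation; the symbols below are assumptions chosen for illustration and are not copied from the paper's derivation.

```latex
% Illustrative notation, not the paper's exact formulation.
% Hybrid action a = (a^d, a^c): discrete choice a^d (e.g., contact point),
% continuous motion parameters a^c.
J(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_t r(s_t, a_t)
          + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big],
\qquad
\pi(a \mid s) = \pi_d(a^d \mid s)\,\pi_c(a^c \mid s, a^d).

% The entropy splits into a discrete part and an expected continuous part:
\mathcal{H}\big(\pi(\cdot \mid s)\big)
  = \mathcal{H}\big(\pi_d(\cdot \mid s)\big)
  + \mathbb{E}_{a^d \sim \pi_d}\big[\mathcal{H}\big(\pi_c(\cdot \mid s, a^d)\big)\big].

% The diffusion policy's log-likelihood is intractable, so a structured
% variational (ELBO-style) bound over the latent denoising chain a^c_{1:T}
% gives a tractable lower bound for end-to-end optimization:
\log \pi_c(a^c_0 \mid s, a^d)
  \;\ge\; \mathbb{E}_{q(a^c_{1:T} \mid a^c_0)}\!\left[
      \log \frac{p_\theta(a^c_{0:T} \mid s, a^d)}{q(a^c_{1:T} \mid a^c_0)}
  \right].
```
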
Problem

Research questions and friction points this paper is trying to address.

Learning diverse policies for non-prehensile manipulation tasks
Enhancing exploration in hybrid discrete-continuous action spaces
Improving skill transfer and generalization in RL frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid framework for discrete and continuous action spaces
Diffusion model for continuous motion parameter policy (training sketch after this list)
Maximum entropy RL unifying discrete and continuous components
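
Expanding on the diffusion-policy bullet above, here is a minimal sketch of how such a conditional diffusion policy could be trained with a DDPM-style noise-prediction loss. The `denoiser` module, the linear noise schedule, and all dimensions are assumptions; the paper's actual loss, and in particular its coupling to the Q-function, may differ.

```python
# Illustrative DDPM-style training step for a conditional diffusion policy.
# `denoiser(noisy_action, state, t)` is a hypothetical module predicting the added noise.
import torch
import torch.nn.functional as F

def diffusion_noise_loss(denoiser, state, action, n_steps=20):
    """Noise-prediction loss on (state, continuous-action) pairs, e.g. from a replay buffer."""
    batch = action.shape[0]
    t = torch.randint(0, n_steps, (batch,))               # random diffusion timestep per sample
    alpha_bar = 1.0 - (t.float() + 1.0) / n_steps          # crude linear schedule (assumption)
    alpha_bar = alpha_bar.unsqueeze(-1)
    noise = torch.randn_like(action)
    noisy = alpha_bar.sqrt() * action + (1.0 - alpha_bar).sqrt() * noise
    pred = denoiser(noisy, state, t.float().unsqueeze(-1) / n_steps)
    return F.mse_loss(pred, noise)
```

In a HyDo-style setup one would presumably combine a term like this with Q-value feedback on the sampled actions, but that coupling is omitted here for brevity.
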
Huy Le
Bosch Center for Artificial Intelligence, Renningen, Germany; Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany

Miroslav Gabriel
Bosch Center for Artificial Intelligence, Renningen 71272, Germany

Tai Hoang
PhD Student, Karlsruhe Institute of Technology
Machine Learning, Robotics

Gerhard Neumann
Professor, Karlsruhe Institute of Technology (KIT)
Robotics, Machine Learning

Ngo Anh Vien
VinRobotics & VinUni, ex-BCAI
machine learning, robotics