Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation

📅 2024-11-22

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

To address insufficient exploration in non-prehensile manipulation, which limits skill transfer and out-of-distribution generalization, this paper proposes HyDo (Hybrid Diffusion Policy), an off-policy reinforcement learning algorithm for hybrid discrete-continuous action spaces. The continuous motion parameters are modeled by a diffusion policy, and that policy is embedded in a maximum-entropy reinforcement learning framework in which the entropy term is derived as a lower bound via structured variational inference. Discrete decisions (e.g., contact point selection) are optimized through Q-value maximization, while continuous motion generation is handled by the diffusion policy. Evaluated in simulation and in zero-shot sim-to-real transfer, the method yields more diverse behavior policies and higher success rates; for example, success on a real-world 6D pose alignment task rises from 53% to 72%.

📝 Abstract

Learning diverse policies for non-prehensile manipulation is essential for improving skill transfer and generalization to out-of-distribution scenarios. In this work, we enhance exploration through a two-fold approach within a hybrid framework that tackles both discrete and continuous action spaces. First, we model the continuous motion parameter policy as a diffusion model, and second, we incorporate this into a maximum entropy reinforcement learning framework that unifies both the discrete and continuous components. The discrete action space, such as contact point selection, is optimized through Q-value function maximization, while the continuous part is guided by a diffusion-based policy. This hybrid approach leads to a principled objective, where the maximum entropy term is derived as a lower bound using structured variational inference. We propose the Hybrid Diffusion Policy algorithm (HyDo) and evaluate its performance on both simulation and zero-shot sim2real tasks. Our results show that HyDo encourages more diverse behavior policies, leading to significantly improved success rates across tasks - for example, increasing from 53% to 72% on a real-world 6D pose alignment task. Project page: https://leh2rng.github.io/hydo
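
The split described in the abstract (a diffusion model for the continuous motion parameters, Q-values for the discrete choice) can be illustrated with a small sketch. This is only one plausible reading, not the authors' implementation: the sampler is the standard DDPM ancestral sampling loop, the discrete contact index is drawn from a softmax over Q-values as a stand-in for the maximum-entropy selection, and every name here (`ddpm_sample`, `select_hybrid_action`, `toy_denoiser`, `toy_q`, the noise schedule, the temperature) is a hypothetical placeholder.

```python
import math
import random


def ddpm_sample(denoiser, state, contact_idx, dim, betas, rng):
    """Draw continuous motion parameters with standard DDPM ancestral sampling."""
    # Cumulative products of (1 - beta_t), usually written alpha_bar_t.
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    for t in reversed(range(len(betas))):
        eps_hat = denoiser(state, contact_idx, x, t)  # predicted noise
        coef = betas[t] / math.sqrt(1.0 - alpha_bar[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        x = [scale * (xi - coef * ei) for xi, ei in zip(x, eps_hat)]
        if t > 0:  # no noise is added on the final step
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


def softmax(values, temperature):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_hybrid_action(state, n_contacts, q_fn, denoiser, betas,
                         temperature, rng, motion_dim=3):
    """Sample motion parameters per contact candidate, then pick a contact."""
    motions = [ddpm_sample(denoiser, state, k, motion_dim, betas, rng)
               for k in range(n_contacts)]
    q_values = [q_fn(state, k, motions[k]) for k in range(n_contacts)]
    # Assumption: stochastic selection via softmax over Q rather than argmax.
    probs = softmax(q_values, temperature)
    k = rng.choices(range(n_contacts), weights=probs, k=1)[0]
    return k, motions[k], probs


# Hypothetical stand-ins for the learned networks, for illustration only.
def toy_denoiser(state, contact_idx, x, t):
    return [xi - state[contact_idx] for xi in x]


def toy_q(state, contact_idx, motion):
    return -sum((m - state[contact_idx]) ** 2 for m in motion)


rng = random.Random(0)
state = [0.2, -0.4, 0.7, 0.0]  # one toy feature per contact candidate
betas = [0.05 + 0.025 * i for i in range(10)]  # toy noise schedule
contact, motion, probs = select_hybrid_action(
    state, len(state), toy_q, toy_denoiser, betas, 1.0, rng)
```

Lowering `temperature` concentrates the selection on the highest-Q contact candidate, while raising it spreads probability across candidates, which is the exploration trade-off a maximum-entropy formulation is meant to control.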

Problem

Research questions and friction points this paper is trying to address.

Learning diverse policies for non-prehensile manipulation tasks

Enhancing exploration in hybrid discrete-continuous action spaces

Improving skill transfer and generalization in RL frameworks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid framework for discrete and continuous action spaces

Diffusion model for continuous motion parameter policy

Maximum entropy RL unifying discrete and continuous components
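
For orientation, the generic maximum-entropy objective and its decomposition over a hybrid action can be written out as below. This is the textbook form under an assumed factorization of the policy into a discrete part and a continuous part, not the specific lower bound derived in the paper; the symbols $\pi^{d}$, $\pi^{c}$, $\alpha$, and $\gamma$ are notation introduced here for illustration.

```latex
% Maximum-entropy RL objective with a hybrid action a_t = (a_t^d, a_t^c)
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t} \gamma^{t}
  \Big( r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right]

% Assumed factorization: discrete choice first, continuous parameters second
\pi(a^{d}, a^{c} \mid s) = \pi^{d}(a^{d} \mid s)\, \pi^{c}(a^{c} \mid s, a^{d})

% Chain rule of entropy under that factorization
\mathcal{H}\big(\pi(\cdot \mid s)\big)
  = \mathcal{H}\big(\pi^{d}(\cdot \mid s)\big)
  + \mathbb{E}_{a^{d} \sim \pi^{d}}\left[
      \mathcal{H}\big(\pi^{c}(\cdot \mid s, a^{d})\big) \right]
```

A diffusion policy generally has no closed-form density, so its entropy cannot be evaluated directly. This is consistent with the abstract's statement that the entropy term enters the objective as a lower bound.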

🔎 Similar Papers

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey

2024-04-28 · arXiv.org · Citations: 15