Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation

📅 2024-09-27
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
In multi-task robotic manipulation, action distributions exhibit strong multimodality and task coupling, severely limiting policy generalization. To address this, we propose a vector quantization (VQ)-based discretized policy learning framework: continuous action sequences are mapped onto a disentangled discrete latent space, where task-specific latent codes are explicitly modeled; visual–linguistic joint encoding and conditional latent-space reconstruction jointly enable instruction-driven action generation. This work is the first to introduce VQ into multi-task robotic policy learning, effectively breaking the action-distribution coupling bottleneck. Evaluated on real-robot setups, our method achieves a 26% absolute success rate improvement over Diffusion Policy on 5-task benchmarks and a 32.5% gain on 12-task benchmarks. It consistently outperforms state-of-the-art methods—including ACT, Octo, and OpenVLA—across both simulation and physical single- and dual-arm platforms.

Technology Category

Application Category

📝 Abstract
Learning visuomotor policy for multi-task robotic manipulation has been a long-standing challenge for the robotics community. The difficulty lies in the diversity of action space: typically, a goal can be accomplished in multiple ways, resulting in a multimodal action distribution for a single task. The complexity of action distribution escalates as the number of tasks increases. In this work, we propose extbf{Discrete Policy}, a robot learning method for training universal agents capable of multi-task manipulation skills. Discrete Policy employs vector quantization to map action sequences into a discrete latent space, facilitating the learning of task-specific codes. These codes are then reconstructed into the action space conditioned on observations and language instruction. We evaluate our method on both simulation and multiple real-world embodiments, including both single-arm and bimanual robot settings. We demonstrate that our proposed Discrete Policy outperforms a well-established Diffusion Policy baseline and many state-of-the-art approaches, including ACT, Octo, and OpenVLA. For example, in a real-world multi-task training setting with five tasks, Discrete Policy achieves an average success rate that is 26% higher than Diffusion Policy and 15% higher than OpenVLA. As the number of tasks increases to 12, the performance gap between Discrete Policy and Diffusion Policy widens to 32.5%, further showcasing the advantages of our approach. Our work empirically demonstrates that learning multi-task policies within the latent space is a vital step toward achieving general-purpose agents.
Problem

Research questions and friction points this paper is trying to address.

Multimodal action distribution
Multi-task robotic manipulation
Learning task-specific codes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete Policy method
Vector quantization mapping
Task-specific action reconstruction
🔎 Similar Papers
No similar papers found.
K
Kun Wu
Syracuse University, NY, USA
Y
Yichen Zhu
Midea Group, AI Research Center, China
Jinming Li
Jinming Li
Shanghai University
Embodied IntellengienceRobotics
Junjie Wen
Junjie Wen
East China Normal University, China
N
Ning Liu
Midea Group, AI Research Center, China
Z
Zhiyuan Xu
Beijing Innovation Center of Humanoid Robotics, Beijing, China
Qinru Qiu
Qinru Qiu
Professor of Computer Engineering, Syracuse University
Neuromorphic ComputingEnergy Efficient ComputingSystem-on-chip
J
Jian Tang
Beijing Innovation Center of Humanoid Robotics, Beijing, China