🤖 AI Summary
In multi-task robotic manipulation, action distributions exhibit strong multimodality and task coupling, severely limiting policy generalization. To address this, we propose a vector quantization (VQ)-based discretized policy learning framework: continuous action sequences are mapped onto a disentangled discrete latent space, where task-specific latent codes are explicitly modeled; visual–linguistic joint encoding and conditional latent-space reconstruction jointly enable instruction-driven action generation. This work is the first to introduce VQ into multi-task robotic policy learning, effectively breaking the action-distribution coupling bottleneck. Evaluated on real-robot setups, our method achieves a 26% absolute success rate improvement over Diffusion Policy on 5-task benchmarks and a 32.5% gain on 12-task benchmarks. It consistently outperforms state-of-the-art methods—including ACT, Octo, and OpenVLA—across both simulation and physical single- and dual-arm platforms.
📝 Abstract
Learning visuomotor policy for multi-task robotic manipulation has been a long-standing challenge for the robotics community. The difficulty lies in the diversity of action space: typically, a goal can be accomplished in multiple ways, resulting in a multimodal action distribution for a single task. The complexity of action distribution escalates as the number of tasks increases. In this work, we propose extbf{Discrete Policy}, a robot learning method for training universal agents capable of multi-task manipulation skills. Discrete Policy employs vector quantization to map action sequences into a discrete latent space, facilitating the learning of task-specific codes. These codes are then reconstructed into the action space conditioned on observations and language instruction. We evaluate our method on both simulation and multiple real-world embodiments, including both single-arm and bimanual robot settings. We demonstrate that our proposed Discrete Policy outperforms a well-established Diffusion Policy baseline and many state-of-the-art approaches, including ACT, Octo, and OpenVLA. For example, in a real-world multi-task training setting with five tasks, Discrete Policy achieves an average success rate that is 26% higher than Diffusion Policy and 15% higher than OpenVLA. As the number of tasks increases to 12, the performance gap between Discrete Policy and Diffusion Policy widens to 32.5%, further showcasing the advantages of our approach. Our work empirically demonstrates that learning multi-task policies within the latent space is a vital step toward achieving general-purpose agents.