MM-ACT: Learn from Multimodal Parallel Generation to Act

📅 2025-11-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of developing general-purpose robotic policies that combine semantic understanding with the ability to interact with the environment. We propose a context-shared, unified vision-language-action modeling framework. Methodologically, we design a joint multimodal token space and introduce a re-mask parallel decoding mechanism that generates text, images, and multidimensional actions (end-effector poses, joint angles, and gripper states) in parallel while facilitating cross-modal knowledge transfer. The core innovations are one-step parallel action decoding and a semantic-action co-generation paradigm. On the LIBERO simulation benchmark, our method achieves a 96.3% task success rate; on the Franka real-world platform, it attains a 72.0% average success rate across three tasks; and on the RoboTwin2.0 bimanual benchmark, it reaches 52.38% across eight tasks. Cross-modal learning contributes an additional 9.25% gain, improving generalization and deployment efficiency.
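
The one-step parallel action decoding idea can be illustrated with a short sketch. The snippet below is not the authors' implementation (their code is in the linked repository); the model interface, MASK_ID, and NUM_ACTION_TOKENS are illustrative assumptions. It shows the core idea: the action positions are appended to the shared token sequence as [MASK] placeholders and predicted in a single forward pass rather than autoregressively, token by token.

```python
# Minimal sketch of one-step parallel action decoding.
# Hypothetical interface: `model` maps token ids to per-position logits.
import torch

NUM_ACTION_TOKENS = 7  # assumption: e.g. 6-DoF end-effector pose + gripper state
MASK_ID = 0            # assumption: id of the [MASK] token in the shared vocabulary

def decode_actions_one_step(model, context_tokens: torch.Tensor) -> torch.Tensor:
    """Predict all action tokens in one forward pass from a shared context."""
    batch = context_tokens.shape[0]
    # Append one [MASK] slot per action dimension to the shared token sequence.
    action_slots = torch.full((batch, NUM_ACTION_TOKENS), MASK_ID,
                              dtype=context_tokens.dtype,
                              device=context_tokens.device)
    tokens = torch.cat([context_tokens, action_slots], dim=1)

    with torch.no_grad():
        logits = model(tokens)                       # (batch, seq_len, vocab)
    action_logits = logits[:, -NUM_ACTION_TOKENS:]   # only the masked action slots
    return action_logits.argmax(dim=-1)              # discrete action tokens
```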

📝 Abstract
A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through action prediction. To this end, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments on the LIBERO simulation benchmark, a real Franka robot setup, and RoboTwin2.0 assess in-domain and out-of-domain performance, respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks on the real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0, with an additional gain of 9.25% from cross-modal learning. We release our code, models, and data at https://github.com/HHYHRHY/MM-ACT.
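
For readers unfamiliar with re-mask parallel decoding, the sketch below shows one plausible MaskGIT-style loop consistent with the abstract's description: all masked positions are predicted in parallel, the most confident predictions are kept, and the remaining positions stay masked for the next iteration. The function signature, MASK_ID, and the confidence schedule are assumptions for illustration, not MM-ACT's actual API.

```python
# Illustrative re-mask parallel decoding loop (iterative parallel refinement).
import torch

MASK_ID = 0  # assumed [MASK] id in the shared text/image/action vocabulary

def remask_parallel_decode(model, tokens: torch.Tensor, gen_mask: torch.Tensor,
                           num_steps: int = 8) -> torch.Tensor:
    """Iteratively fill masked positions, keeping only confident predictions.

    tokens:   (batch, seq_len) token ids, MASK_ID at positions to generate
    gen_mask: (batch, seq_len) bool, True at positions that must be generated
    """
    tokens = tokens.clone()
    for step in range(num_steps):
        still_masked = (tokens == MASK_ID) & gen_mask
        if not still_masked.any():
            break
        with torch.no_grad():
            logits = model(tokens)                    # (batch, seq_len, vocab)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)

        # Accept a growing fraction of the most confident predictions each step;
        # everything else remains [MASK] and is re-predicted in the next pass.
        keep_ratio = (step + 1) / num_steps
        conf = conf.masked_fill(~still_masked, -1.0)  # ignore already-fixed positions
        k = max(1, int(keep_ratio * still_masked.sum().item()))
        threshold = conf.flatten().topk(k).values.min()
        accept = still_masked & (conf >= threshold)
        tokens[accept] = pred[accept]
    return tokens
```
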
Problem

Research questions and friction points this paper is trying to address.

How to build a unified Vision-Language-Action model that supports both semantic task planning and physical interaction with the environment
How to integrate text, image, and action generation in a single shared token space
How to make action generation efficient while still benefiting from cross-modal learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

A unified VLA model that integrates text, image, and action in a shared token space
Re-mask parallel decoding for text and image generation, plus one-step parallel decoding for actions, improving efficiency
Context-Shared Multimodal Learning, which supervises generation in all three modalities from a shared context (a minimal training-step sketch follows below)
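
To make the Context-Shared Multimodal Learning bullet concrete, the sketch below shows one plausible training step: a single shared context is passed through the model once, and the text, image, and action targets each contribute a cross-entropy term, so gradients from all three modalities shape the shared representation. The batch layout, loss weights, and model call are hypothetical assumptions, not the authors' code.

```python
# Hedged sketch of a context-shared multimodal training step.
import torch
import torch.nn.functional as F

def context_shared_step(model, batch, w_text=1.0, w_image=1.0, w_action=1.0):
    """Joint loss over text, image, and action targets from one shared context.

    batch["tokens"]  : (B, L) input ids with masked targets in all modalities
    batch["targets"] : (B, L) ground-truth ids at the masked positions
    batch["modality"]: (B, L) 0 = text, 1 = image, 2 = action target positions
    """
    logits = model(batch["tokens"])                  # (B, L, vocab), one forward pass
    losses = []
    for mod_id, weight in zip((0, 1, 2), (w_text, w_image, w_action)):
        if not (batch["modality"] == mod_id).any():
            continue                                  # no targets for this modality
        # Supervise only this modality's positions; ignore the rest.
        targets = batch["targets"].masked_fill(batch["modality"] != mod_id, -100)
        losses.append(weight * F.cross_entropy(
            logits.transpose(1, 2), targets, ignore_index=-100))
    return sum(losses)  # cross-modal gradients flow through the shared context
```
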
Authors

Haotian Liang (Shanghai AI Laboratory)
Xinyi Chen (Shanghai AI Laboratory)
Bin Wang (Shanghai AI Laboratory)
Mingkang Chen (The University of Hong Kong)
Yitian Liu (Shanghai Jiao Tong University)
Yuhao Zhang (Shanghai Jiao Tong University)
Zanxin Chen (Shenzhen University)
Tianshuo Yang (Shanghai AI Laboratory)
Yilun Chen (Shanghai AI Laboratory)
Jiangmiao Pang (Shanghai AI Laboratory)
Dong Liu (University of Science and Technology of China)
Xiaokang Yang (Shanghai Jiao Tong University)
Yao Mu (Shanghai Jiao Tong University)
Wenqi Shao (Shanghai AI Laboratory)
Ping Luo (National University of Defense Technology)