🤖 AI Summary
This work addresses the challenge of developing general-purpose robotic policies that jointly possess semantic understanding and environmental interaction capabilities. We propose a context-shared, unified vision-language-action modeling framework. Methodologically, we design a joint multimodal token space and introduce a re-mask parallel decoding mechanism to enable synchronous generation of text, images, and multidimensional actions—including end-effector poses, joint angles, and gripper states—while facilitating cross-modal knowledge transfer. Our core innovations are one-step parallel action decoding and a semantic-action co-generation paradigm. On the LIBERO simulation benchmark, our method achieves a 96.3% task success rate; on a real-world Franka platform, it attains a 72.0% average success rate across three tasks; and on the RoboTwin2.0 bimanual benchmark, it reaches a 52.38% average success rate across eight tasks. Cross-modal learning contributes a 9.25% performance gain, improving generalization and deployment efficiency.
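The re-mask parallel decoding mentioned above can be illustrated with a small toy loop: all masked slots are predicted in parallel, the most confident predictions are kept, and the remaining slots are re-masked for the next round. This is a minimal sketch of the general idea only; the predictor, confidence scores, and schedule below are invented placeholders, not MM-ACT's actual model.

```python
import math

MASK = "<mask>"

def toy_predict(tokens, vocab):
    """Hypothetical stand-in for the model: for each masked slot, return a
    candidate token and a confidence score. The confidence here is a
    deterministic function of the slot index, purely for illustration."""
    preds = {}
    for i, t in enumerate(tokens):
        if t == MASK:
            token = vocab[i % len(vocab)]
            conf = 1.0 / (1 + i)  # toy confidence, higher for earlier slots
            preds[i] = (token, conf)
    return preds

def remask_parallel_decode(length, vocab, steps=3):
    """Predict all masked slots in parallel, commit the most confident
    predictions this round, re-mask the rest, and iterate."""
    tokens = [MASK] * length
    for step in range(steps):
        preds = toy_predict(tokens, vocab)
        if not preds:
            break
        # keep a growing fraction of predictions per round, re-mask the rest
        keep = max(1, math.ceil(len(preds) * (step + 1) / steps))
        ranked = sorted(preds.items(), key=lambda kv: -kv[1][1])
        for i, (token, _) in ranked[:keep]:
            tokens[i] = token
    return tokens
```

Because every masked position is scored in one pass, the number of model calls is bounded by the (small, fixed) step budget rather than the sequence length.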
📝 Abstract
A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To this end, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation benchmark and a real Franka robot, as well as on RoboTwin2.0, to assess in-domain and out-of-domain performance, respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three real-world Franka tasks, and 52.38% across eight bimanual RoboTwin2.0 tasks, with an additional 9.25% gain from cross-modal learning. We release our code, models, and data at https://github.com/HHYHRHY/MM-ACT.
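The one-step parallel decoding strategy for actions can be contrasted with autoregressive decoding in a toy sketch: a single forward pass emits logits for every action slot at once, and each slot is filled independently by argmax, with no sequential dependency between slots. The "forward pass" below is a deterministic arithmetic placeholder; the function and parameter names are hypothetical, not MM-ACT's API.

```python
def one_step_action_decode(context_seed, num_action_slots, codebook_size):
    """Toy one-step parallel action decoding: one 'forward pass' produces a
    logits row per action slot, and all slots are decoded simultaneously."""
    # stand-in forward pass: deterministic toy logits derived from the context
    logits = [
        [(context_seed * (slot + 1) + code * 7) % 11 for code in range(codebook_size)]
        for slot in range(num_action_slots)
    ]
    # a single argmax per slot — no autoregressive loop over slots
    return [max(range(codebook_size), key=lambda c: row[c]) for row in logits]
```

Decoding cost is therefore one model call per action chunk, instead of one call per action token as in autoregressive generation.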