MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation

📅 2026-03-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses key limitations of existing vision–language–action (VLA) models, namely architectural redundancy, temporal inconsistency, long-horizon error accumulation, and the absence of explicit environment dynamics modeling, by introducing a natively pre-trained, large-scale diffusion-based VLA model. For the first time, language, vision, and action modalities are unified within a single discrete diffusion framework, with multimodal inputs embedded into a shared token space. The model uses masked token denoising to jointly generate future observations and action sequences in parallel, without auxiliary modules. Iterative denoising enables global, order-free refinement, and the predicted visual outcomes ground action generation. The approach achieves state-of-the-art performance, attaining a 98.0% average success rate on the LIBERO benchmark and an average task length of 4.78 on CALVIN.
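
To make the idea of a shared token space concrete, here is a minimal sketch of how continuous robot controls could be mapped to discrete tokens by uniform binning. The summary does not say how the paper tokenizes actions, so the binning scheme, the constants `ACTION_OFFSET` and `NUM_BINS`, the value range, and both function names are assumptions made for illustration only.

```python
# Hypothetical vocabulary layout: ids below ACTION_OFFSET are assumed to be
# reserved for text and image tokens; action bins are placed above them.
ACTION_OFFSET = 50_000  # assumed start of the action-token range
NUM_BINS = 256          # assumed number of bins per action dimension


def action_to_tokens(action, low=-1.0, high=1.0):
    """Map each continuous action value to one discrete token id."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        bin_index = int((clipped - low) / (high - low) * NUM_BINS)
        bin_index = min(bin_index, NUM_BINS - 1)  # value == high lands in the last bin
        tokens.append(ACTION_OFFSET + bin_index)
    return tokens


def tokens_to_action(tokens, low=-1.0, high=1.0):
    """Map token ids back to the center value of their bins."""
    width = (high - low) / NUM_BINS
    return [low + (t - ACTION_OFFSET + 0.5) * width for t in tokens]
```

Under these assumptions, a round trip through `action_to_tokens` and `tokens_to_action` recovers each value to within half a bin width, which is the price of placing actions in the same discrete vocabulary as language and image tokens.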

📝 Abstract

Vision-Language-Action (VLA) models aim to control robots for manipulation from visual observations and natural-language instructions. However, existing hierarchical and autoregressive paradigms often introduce architectural overhead, suffer from temporal inconsistency and long-horizon error accumulation, and lack a mechanism to capture environment dynamics without extra modules. To this end, we present MMaDA-VLA, a fully native pre-trained large diffusion VLA model that unifies multi-modal understanding and generation in a single framework. Our key idea is a native discrete diffusion formulation that embeds language, images, and continuous robot controls into one discrete token space and trains a single backbone with masked token denoising to jointly generate a future goal observation and an action chunk in parallel. Iterative denoising enables global, order-free refinement, improving long-horizon consistency while grounding actions in predicted future visual outcomes without auxiliary world models. Experiments across simulation benchmarks and real-world tasks show state-of-the-art performance, achieving 98.0% average success on LIBERO and 4.78 average length on CALVIN.
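
The parallel, order-free generation described in the abstract can be illustrated with a toy masked-denoising loop. This is a sketch under stated assumptions, not the paper's decoding procedure: the confidence-based reveal schedule, the `MASK` id, and the `toy_denoiser` stand-in (which returns random predictions instead of calling a trained backbone) are all hypothetical. The loop also only unmasks tokens; a real sampler may re-mask and revise earlier choices.

```python
import random

MASK = -1  # hypothetical mask token id


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the backbone: one (predicted_token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_unmask(prompt, num_generate, steps, vocab_size=16, seed=0):
    """Fill num_generate masked slots after the prompt in at most `steps` passes."""
    rng = random.Random(seed)
    tokens = list(prompt) + [MASK] * num_generate
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # All positions are predicted in one pass, not left to right.
        predictions = toy_denoiser(tokens, vocab_size, rng)
        # Reveal an equal share of the remaining masks at each step (ceil division),
        # so that the final step reveals whatever is left.
        remaining_steps = steps - step
        num_reveal = -(-len(masked) // remaining_steps)
        # Commit the most confident masked positions, wherever they sit in the sequence.
        masked.sort(key=lambda i: predictions[i][1], reverse=True)
        for i in masked[:num_reveal]:
            tokens[i] = predictions[i][0]
    return tokens
```

In a setup like the one the abstract describes, the masked slots would hold both the future goal-observation tokens and the action-chunk tokens, so each pass can condition action predictions on the partially revealed visual outcome. For example, `iterative_unmask([3, 7], num_generate=8, steps=4)` returns a fully unmasked sequence of length 10 that starts with the prompt tokens.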

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

temporal inconsistency

error accumulation

environment dynamics

robot manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion model

vision-language-action

discrete tokenization