dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses cross-modal coordination among vision, language, and action to improve robotic generalization to novel instructions and unseen objects. The authors propose a unified Vision-Language-Action (VLA) diffusion model with two key innovations: (1) a multimodal chain-of-thought mechanism that explicitly models sequential reasoning across modalities, and (2) a single-objective joint optimization framework enabling end-to-end co-training of perception, reasoning, and control. To improve inference efficiency, the model incorporates prefix attention masking and KV caching. The architecture integrates a visual encoder, a language understanding module, and an action generation network. Evaluated on the LIBERO benchmark, the model achieves a 96.4% average success rate, substantially outperforming existing discrete and continuous policy-based approaches. It also executes complex, multi-step planning and grasping tasks on a physical Franka robot.
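The "single diffusion objective" that jointly trains perception, reasoning, and control can be sketched as a standard epsilon-prediction loss applied to one concatenated sequence of vision, language, and action tokens. This is a minimal illustration only: the function names, the linear noise schedule, and the token layout are assumptions, not the paper's actual implementation.

```python
import numpy as np

def make_schedule(T=1000):
    # Linear beta schedule; alpha_bar[t] is the cumulative signal
    # fraction remaining after t noising steps.
    betas = np.linspace(1e-4, 0.02, T)
    return np.cumprod(1.0 - betas)

def joint_diffusion_loss(tokens, denoiser, alpha_bars, t, rng):
    """One training step of an epsilon-prediction diffusion objective over a
    single (seq_len, dim) array holding vision, language, and action tokens
    concatenated together (hypothetical layout, for illustration)."""
    a = alpha_bars[t]
    eps = rng.standard_normal(tokens.shape)          # Gaussian corruption
    noisy = np.sqrt(a) * tokens + np.sqrt(1.0 - a) * eps
    eps_hat = denoiser(noisy, t)                     # learned network predicts the noise
    return float(np.mean((eps_hat - eps) ** 2))      # MSE on the noise estimate
```

Because every modality sits in the same sequence under the same loss, gradients flow end to end through perception, reasoning, and action generation, which is the co-training property the summary describes.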

📝 Abstract
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic control in a single system. dVLA jointly optimizes perception, language understanding, and action under a single diffusion objective, enabling stronger cross-modal reasoning and better generalization to novel instructions and objects. For practical deployment, we mitigate inference latency by incorporating two acceleration strategies, a prefix attention mask and KV caching, yielding a substantial speedup at test-time inference. We evaluate dVLA in both simulation and the real world: on the LIBERO benchmark, it achieves state-of-the-art performance with a 96.4% average success rate, consistently surpassing both discrete and continuous action policies; on a real Franka robot, it succeeds across a diverse task suite, including a challenging bin-picking task that requires multi-step planning, demonstrating robust real-world performance. Together, these results underscore the promise of unified diffusion frameworks for practical, high-performance VLA robotics.
Problem

Research questions and friction points this paper is trying to address.

Unifying visual perception, language reasoning, and robotic control in one system
Improving cross-modal reasoning and generalization to new instructions and objects
Reducing inference latency for practical deployment of vision-language-action models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based VLA unifies perception, reasoning, and control
Joint optimization under single diffusion objective enhances generalization
Prefix attention mask and KV caching reduce inference latency
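The acceleration idea can be illustrated with a toy attention mask: the vision-language prefix attends bidirectionally to itself and never to the noisy action suffix, so the prefix's keys and values stay constant across denoising steps and can be computed once and cached. The sketch below is a hedged illustration; the function name and the causal-within-suffix choice are assumptions, not the paper's exact design.

```python
import numpy as np

def prefix_attention_mask(prefix_len, suffix_len):
    """Boolean attention mask (True = attention allowed).

    Rows are queries, columns are keys. The vision-language prefix is fully
    bidirectional within itself; each action/denoising token in the suffix
    sees the whole prefix plus earlier suffix tokens (one common choice)."""
    n = prefix_len + suffix_len
    mask = np.zeros((n, n), dtype=bool)
    mask[:prefix_len, :prefix_len] = True   # prefix <-> prefix, bidirectional
    mask[prefix_len:, :prefix_len] = True   # suffix reads the full prefix
    mask[prefix_len:, prefix_len:] = np.tril(
        np.ones((suffix_len, suffix_len), dtype=bool)
    )                                       # causal within the suffix
    return mask
```

Since no prefix query ever attends to a suffix key, the prefix block of the mask, and hence its KV entries, is independent of the action tokens being denoised, which is exactly what makes cross-step KV caching sound.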