🤖 AI Summary
This work addresses the challenge of cross-modal coordination among vision, language, and action to enhance robotic generalization to novel instructions and unseen objects. We propose a unified Vision-Language-Action (VLA) diffusion model, introducing two key innovations: (1) a multi-modal chain-of-thought mechanism that explicitly models sequential reasoning across modalities, and (2) a single-objective joint optimization framework enabling end-to-end co-training of perception, reasoning, and control. To improve inference efficiency, we incorporate prefix attention masking and KV caching. The architecture integrates a visual encoder, a language understanding module, and an action generation network. Evaluated on the LIBERO benchmark, our model achieves a 96.4% average success rate—substantially outperforming existing discrete and continuous policy-based approaches. Furthermore, it successfully executes complex, multi-step planning and grasping tasks on a physical Franka robot.
📝 Abstract
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic control in a single system. dVLA jointly optimizes perception, language understanding, and action under a single diffusion objective, enabling stronger cross-modal reasoning and better generalization to novel instructions and objects. For practical deployment, we mitigate inference latency by incorporating two acceleration strategies, a prefix attention mask and KV caching, which together yield a substantial test-time speedup. We evaluate dVLA in both simulation and the real world: on the LIBERO benchmark, it achieves state-of-the-art performance with a 96.4% average success rate, consistently surpassing both discrete and continuous action policies; on a real Franka robot, it succeeds across a diverse task suite, including a challenging bin-picking task that requires multi-step planning, demonstrating robust real-world performance. Together, these results underscore the promise of unified diffusion frameworks for practical, high-performance VLA robotics.
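The prefix attention mask behind this acceleration can be sketched as follows. This is a minimal illustration of the general technique, not the paper's implementation: it assumes the vision/language prompt tokens form a fully bidirectional prefix while later (action/denoising) tokens are causal, which is what makes the prefix's key/value states reusable across iterative diffusion steps via a KV cache.

```python
import numpy as np

def prefix_attention_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """Boolean attention mask (True = attention allowed).

    Rows index query tokens, columns index key tokens. Tokens
    [0, prefix_len) are the prompt prefix; the rest are suffix tokens.
    """
    # Start from a standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))
    # Let every token attend to the entire prefix; this also makes the
    # prefix fully bidirectional among its own tokens.
    mask[:, :prefix_len] = True
    return mask

# Because prefix tokens never attend to suffix tokens, their key/value
# activations are identical at every denoising step, so they can be
# computed once and cached (the "KV caching" part of the speedup).
mask = prefix_attention_mask(prefix_len=2, total_len=4)
```

With this layout, only the suffix tokens need fresh attention computation per diffusion step; the prefix's KV tensors are read from the cache, which is where the latency savings come from.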