🤖 AI Summary
Current vision-language-action (VLA) models suffer from weak spatial reasoning, a limitation inherited from the vision-language models (VLMs) they build on, and rely heavily on large-scale action-data pretraining to establish 3D spatial foundations, which limits both spatial understanding accuracy and training efficiency. To address this, we propose a depth-enhanced mixture-of-transformers architecture that unifies a VLM backbone, a lightweight depth prediction module, and an action expert, jointly optimized end to end via shared cross-modal attention to explicitly model 3D spatial structure. Crucially, we integrate a pretrained monocular depth estimator to augment spatial perception with dense depth maps, strengthening geometric reasoning without requiring additional action annotations. Our method achieves a 78.5% success rate on real-robot tasks and state-of-the-art performance on simulation benchmarks (94.9% on LIBERO and 74.8% on Simpler), significantly outperforming prior approaches.
📝 Abstract
Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning, a limitation inherited from the Vision-Language Models (VLMs) they are built on. Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attention, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% task progress in real-world tasks, 94.9% vs. 93.6% success on the LIBERO benchmark, and 74.8% vs. 58.8% on the Simpler benchmark. Our code will be made publicly available.
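To make the mixture-of-transformers idea concrete, the following is a minimal sketch of how three experts (VLM, depth transformer, action expert) can keep separate projection weights while their tokens attend to each other through one shared attention operation. All class names, dimensions, and the single-head formulation are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared embedding width (illustrative; real models use much larger D)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ExpertBlock:
    """One modality expert with its OWN Q/K/V weights (mixture-of-transformers style)."""
    def __init__(self):
        self.Wq, self.Wk, self.Wv = (
            rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3)
        )

    def qkv(self, x):
        # Project this expert's tokens with its private parameters.
        return x @ self.Wq, x @ self.Wk, x @ self.Wv

def shared_attention(experts, tokens):
    """Each expert projects its tokens separately, then all tokens attend jointly."""
    qs, ks, vs = zip(*(e.qkv(t) for e, t in zip(experts, tokens)))
    # Concatenate token streams so attention spans all modalities at once.
    Q, K, V = map(np.concatenate, (qs, ks, vs))
    A = softmax(Q @ K.T / np.sqrt(D))  # fully shared cross-modal attention map
    out = A @ V
    # Split the fused sequence back into per-expert streams.
    sizes = np.cumsum([t.shape[0] for t in tokens])[:-1]
    return np.split(out, sizes)

# Demo: 5 VLM tokens, 4 depth tokens, 3 action tokens (counts are arbitrary).
experts = [ExpertBlock() for _ in range(3)]
tokens = [rng.standard_normal((n, D)) for n in (5, 4, 3)]
vlm_out, depth_out, action_out = shared_attention(experts, tokens)
```

The key design point this illustrates: parameters stay modality-specific (so the depth module and action expert can be pretrained or sized independently), while the attention map `A` is computed over the concatenated sequence, letting the action expert's queries directly read depth and VLM tokens in one end-to-end pass.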