🤖 AI Summary
Current vision-language-action (VLA) models suffer from weak spatial reasoning, a limitation inherited from the vision-language models (VLMs) they build on, and rely heavily on large-scale action-data pretraining to establish 3D spatial foundations, which limits both spatial understanding accuracy and training efficiency. To address this, we propose a depth-enhanced mixture-of-transformers architecture that unifies a VLM backbone, a lightweight depth prediction module, and an action expert, jointly optimized end to end via shared cross-modal attention to explicitly model 3D spatial structure. Crucially, we integrate a pretrained monocular depth estimator to augment spatial perception with dense depth maps, strengthening geometric reasoning without requiring additional action annotations. Our method achieves a 78.5% success rate on real-robot tasks and state-of-the-art performance on simulation benchmarks (94.9% on LIBERO and 74.8% on Simpler), significantly outperforming prior approaches.
📝 Abstract
Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning, a limitation inherited from the Vision-Language Models (VLMs) they are built on. Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attention, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% task progress in real-world tasks, 94.9% vs. 93.6% success on the LIBERO benchmark, and 74.8% vs. 58.8% on the Simpler benchmark. Our code will be made publicly available.
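To make the mixture-of-transformers idea concrete, the following is a minimal sketch of how three experts (VLM, depth transformer, action expert) can keep separate projection weights while their tokens attend to each other through one shared attention operation. All class names, dimensions, and the single-head formulation are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared embedding width (illustrative; real models use much larger D)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ExpertBlock:
    """One modality expert with its OWN Q/K/V weights (mixture-of-transformers style)."""
    def __init__(self):
        self.Wq, self.Wk, self.Wv = (
            rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3)
        )

    def qkv(self, x):
        # Project this expert's tokens with its private parameters.
        return x @ self.Wq, x @ self.Wk, x @ self.Wv

def shared_attention(experts, tokens):
    """Each expert projects its tokens separately, then all tokens attend jointly."""
    qs, ks, vs = zip(*(e.qkv(t) for e, t in zip(experts, tokens)))
    # Concatenate token streams so attention spans all modalities at once.
    Q, K, V = map(np.concatenate, (qs, ks, vs))
    A = softmax(Q @ K.T / np.sqrt(D))  # fully shared cross-modal attention map
    out = A @ V
    # Split the fused sequence back into per-expert streams.
    sizes = np.cumsum([t.shape[0] for t in tokens])[:-1]
    return np.split(out, sizes)

# Demo: 5 VLM tokens, 4 depth tokens, 3 action tokens (counts are arbitrary).
experts = [ExpertBlock() for _ in range(3)]
tokens = [rng.standard_normal((n, D)) for n in (5, 4, 3)]
vlm_out, depth_out, action_out = shared_attention(experts, tokens)
```

The key design point this illustrates: parameters stay modality-specific (so the depth module and action expert can be pretrained or sized independently), while the attention map `A` is computed over the concatenated sequence, letting the action expert's queries directly read depth and VLM tokens in one end-to-end pass.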