DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language-action (VLA) models suffer from weak spatial reasoning—inherited from their vision-language model (VLM) backbones—and rely heavily on large-scale action-data pretraining to establish 3D spatial foundations, resulting in limited spatial-understanding accuracy and low training efficiency. To address this, we propose a depth-enhanced hybrid Transformer architecture that unifies a VLM backbone, a lightweight depth prediction module, and an action expert, jointly optimized end-to-end via shared cross-modal attention to explicitly model 3D spatial structure. Crucially, we integrate a pretrained monocular depth estimator to augment spatial perception with dense depth maps, enhancing geometric reasoning without requiring additional action annotations. Our method achieves a 78.5% success rate on real-robot tasks and attains state-of-the-art performance on simulation benchmarks—94.9% on LIBERO and 74.8% on Simpler—significantly outperforming prior approaches.

📝 Abstract
Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Addresses limited spatial reasoning in Vision-Language-Action models
Enhances spatial awareness using pretrained depth prediction modules
Improves accuracy in real-world and simulated manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates pretrained depth prediction for spatial awareness
Uses mixture-of-transformers with fully shared attentions
Unifies vision, depth, and action experts end-to-end
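The "fully shared attentions" idea from the Innovation list can be illustrated with a toy sketch: each modality (VLM, depth, action) keeps its own projection weights (the mixture-of-transformers part), but attention scores are computed over the concatenated token sequence, so all three experts attend to one another. This is an illustrative simplification under stated assumptions, not the authors' implementation; the `Expert` class, dimensions, and single-head attention are all hypothetical choices for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class Expert:
    """Per-modality Q/K/V projections: each expert has its own weights
    (hypothetical stand-in for one transformer in the mixture)."""
    def __init__(self, d, rng):
        self.wq, self.wk, self.wv = (
            rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)
        )

def shared_attention(streams, experts, d):
    """Single-head attention shared across modality token streams.

    `streams` is a list of (n_i, d) token arrays, e.g. [vlm, depth, action].
    Each expert projects only its own tokens, but the attention matrix is
    computed over the concatenated sequence, so every token can attend to
    tokens from every modality.
    """
    qs = [s @ e.wq for s, e in zip(streams, experts)]
    ks = [s @ e.wk for s, e in zip(streams, experts)]
    vs = [s @ e.wv for s, e in zip(streams, experts)]
    q, k, v = (np.concatenate(x, axis=0) for x in (qs, ks, vs))
    attn = softmax(q @ k.T / np.sqrt(d))      # (N, N) over all modalities
    out = attn @ v
    # Split the fused output back into per-modality streams.
    lens = np.cumsum([s.shape[0] for s in streams])[:-1]
    return np.split(out, lens, axis=0)

# Toy usage: 4 VLM tokens, 3 depth tokens, 2 action tokens.
rng = np.random.default_rng(0)
d = 8
experts = [Expert(d, rng) for _ in range(3)]
streams = [rng.standard_normal((n, d)) for n in (4, 3, 2)]
outs = shared_attention(streams, experts, d)
```

The key design point this sketch mirrors is that information exchange happens inside attention rather than through a separate fusion module, which is what lets the depth transformer ground the VLM and action expert in 3D structure end-to-end.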
Tianyuan Yuan
Tsinghua University
Computer Vision
Yicheng Liu
Tsinghua University
Robotics
Chenhao Lu
Tsinghua University
Artificial Intelligence
Zhuoguang Chen
IIIS, Tsinghua University
Tao Jiang
Galaxea AI
Hang Zhao
IIIS, Tsinghua University