Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of existing vision-language-action (VLA) models: they rely mainly on 2D visual representations that lack depth cues, so they struggle with tasks requiring precise spatial understanding. Explicit 3D inputs such as depth maps or point clouds can help, but they typically increase system complexity or require additional sensors. The authors instead propose Evo-Depth, a lightweight depth-enhanced framework that uses only multi-view RGB images. An Implicit Depth Encoding Module extracts compact depth features, and a Spatial Enhancement Module integrates them into the vision-language representations through depth-aware modulation. A Progressive Alignment Training strategy then aligns the depth-enhanced representations with downstream action learning. With 0.9B parameters, the model achieves superior performance across four simulation benchmarks. In real-world robotic tasks it attains the highest average success rate, along with the smallest model size, lowest GPU memory usage, and highest inference frequency among the compared methods.
📝 Abstract
Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.
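The abstract does not spell out what form the depth-aware modulation takes. One common way to condition one feature stream on another is a FiLM-style per-channel scale and shift, and the sketch below illustrates that idea only. The function name, the projection weights `w_gamma` and `w_beta`, and all tensor shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def depth_aware_modulation(vl_tokens, depth_feat, w_gamma, w_beta):
    """FiLM-style modulation (an assumption, not the paper's confirmed design).

    vl_tokens  : (num_tokens, dim) vision-language token features
    depth_feat : (depth_dim,) compact depth feature vector
    w_gamma, w_beta : (depth_dim, dim) hypothetical projection weights
    """
    gamma = depth_feat @ w_gamma  # per-channel scale offset, shape (dim,)
    beta = depth_feat @ w_beta    # per-channel shift, shape (dim,)
    # Residual form: with zero weights the tokens pass through unchanged.
    return vl_tokens * (1.0 + gamma) + beta

# Toy inputs standing in for real encoder outputs.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
depth = rng.normal(size=(3,))
w_g = rng.normal(size=(3, 8)) * 0.1
w_b = rng.normal(size=(3, 8)) * 0.1
out = depth_aware_modulation(tokens, depth, w_g, w_b)
```

The residual `1.0 + gamma` form is a design choice for this illustration: it keeps the modulated features close to the original vision-language features when the depth projections are small, which is one plausible way to add depth cues without disrupting pretrained representations.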
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
spatial understanding
depth information
3D-aware modeling
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit Depth Encoding
Spatial Enhancement
Vision-Language-Action Model
Lightweight Robotics
Progressive Alignment Training
🔎 Similar Papers
2024-05-14
IEEE/RSJ International Conference on Intelligent Robots and Systems
Citations: 2
Tao Lin
School of AI, Shanghai Jiao Tong University
Yuxin Du
School of AI, Shanghai Jiao Tong University
Jiting Liu
School of AI, Shanghai Jiao Tong University
Nuobei Zhu
School of AI, Shanghai Jiao Tong University
Yunhe Li
School of AI, Shanghai Jiao Tong University
Yuqian Fu
King Abdullah University of Science and Technology
Yinxinyu Chen
School of AI, Shanghai Jiao Tong University
Hongyi Cai
University of Malaya
Data-centric AI, AI for Efficiency, Computer Vision
Zewei Ye
Shanghai Jiao Tong University
Embodied AI
Bing Cheng
The Chinese Academy of Science
machine learning, artificial intelligence, finance, economics
Kai Ye
Xi'an Jiaotong University
bioinformatics, pharmacology, cancer
Yiran Mao
School of AI, Shanghai Jiao Tong University
Yilei Zhong
School of AI, Shanghai Jiao Tong University
MingKang Dong
School of AI, Shanghai Jiao Tong University
Junchi Yan
FIAPR & ICML Board Member, SJTU (2018-), SII (2024-), AWS (2019-2022), IBM (2011-2018)
Computational Intelligence, AI4Science, Machine Learning, Autonomous Driving
Gen Li
Postdoctoral Research Fellow, Nanyang Technological University
Embodied AI, Computer Vision, Robotics, Artificial Intelligence
Bo Zhao
Shanghai Jiao Tong University
Embodied AI, MLLM, Data-centric AI