StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision

📅 2025-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Stereo vision remains underexploited in vision-language-action (VLA) models, even though the geometric priors encoded in binocular imagery are critical for precise robotic manipulation. To address this gap, we propose the first binocular-cooperative modeling paradigm for VLA: (1) a geometry-semantic joint feature extraction module that fuses disparity-guided stereo geometric representations with monocular semantic features; (2) interaction-region depth estimation as an auxiliary task to explicitly model the manipulation space and improve training convergence; and (3) a dual-stream stereo encoder built upon vision foundation models, integrated within a multi-task collaborative training framework. Experiments demonstrate substantial improvements over monocular baselines across multiple stereo VLA benchmarks, including manipulation planning and spatial reasoning tasks. Moreover, our method exhibits strong robustness to camera pose perturbations, validating its geometric consistency and generalization capability.
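The auxiliary interaction-region depth estimation task described above can be pictured as a masked regression objective: depth is supervised only inside the region where manipulation happens, so the model is pushed to model the manipulation space explicitly. The NumPy sketch below is a minimal illustration under assumed names and an assumed L1 form; the paper's actual loss, mask source, and depth head are not specified here.

```python
import numpy as np

def interaction_region_depth_loss(pred_depth, gt_depth, mask):
    """Mean L1 depth error computed only inside the interaction-region
    mask (hypothetical form of the auxiliary objective)."""
    mask = mask.astype(bool)
    if not mask.any():
        return 0.0
    return float(np.abs(pred_depth[mask] - gt_depth[mask]).mean())

# Depth errors outside the masked region do not affect the loss.
gt = np.ones((4, 4))
pred = np.ones((4, 4))
pred[0, 0] = 5.0                      # large error, but outside the region
mask = np.zeros((4, 4))
mask[2:, 2:] = 1                      # interaction region: bottom-right block
loss = interaction_region_depth_loss(pred, gt, mask)  # → 0.0
```

Because supervision is restricted to the interaction region, gradients concentrate on the geometry the policy actually needs, which is consistent with the reported faster convergence.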

📝 Abstract
Stereo cameras closely mimic human binocular vision, providing rich spatial cues critical for precise robotic manipulation. Despite this advantage, the adoption of stereo vision in vision-language-action models (VLAs) remains underexplored. In this work, we present StereoVLA, a VLA model that leverages rich geometric cues from stereo vision. We propose a novel Geometric-Semantic Feature Extraction module that utilizes vision foundation models to extract and fuse two key features: 1) geometric features from subtle stereo-view differences for spatial perception; 2) semantic-rich features from the monocular view for instruction following. Additionally, we propose an auxiliary Interaction-Region Depth Estimation task to further enhance spatial perception and accelerate model convergence. Extensive experiments show that our approach outperforms baselines by a large margin in diverse tasks under the stereo setting and demonstrates strong robustness to camera pose variations.
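As a rough intuition for how geometric cues arise from subtle stereo-view differences, the toy NumPy sketch below recovers a per-pixel disparity map by block matching between left and right views. This is only a classical stand-in: StereoVLA extracts such cues with vision foundation models in a learned dual-stream encoder, and the function and parameter names here are hypothetical.

```python
import numpy as np

def block_match_disparity(left, right, max_disp=8, patch=3):
    """Toy disparity: for each left-image pixel, find the leftward shift
    of the right image that best matches a small patch (SSD cost)."""
    h, w = left.shape
    pad = patch // 2
    lp = np.pad(left, pad, mode="edge")
    rp = np.pad(right, pad, mode="edge")
    disp = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            lpatch = lp[y:y + patch, x:x + patch]
            best_cost, best_d = np.inf, 0
            for d in range(min(max_disp, x) + 1):
                rpatch = rp[y:y + patch, x - d:x - d + patch]
                cost = np.sum((lpatch - rpatch) ** 2)  # matching cost
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp

# A scene shifted 2 px between the views yields disparity ~2 away from borders.
rng = np.random.default_rng(0)
left = rng.random((16, 32))
right = np.roll(left, -2, axis=1)   # right view sees content shifted left
disp = block_match_disparity(left, right)
```

Disparity is inversely proportional to depth, which is why even small view differences carry the spatial information the model fuses with monocular semantic features.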
Problem

Research questions and friction points this paper is trying to address.

Enhances VLAs with stereo vision for spatial perception
Extracts geometric and semantic features for robotic manipulation
Improves robustness to camera variations in diverse tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stereo vision for spatial perception in VLAs
Geometric-semantic feature extraction module
Auxiliary depth estimation task for robustness
Shengliang Deng
Galbot; The University of Hong Kong
Mi Yan
Galbot; Peking University
Yixin Zheng
Galbot; Institute of Automation, Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence
Jiayi Su
Northeastern University
Wenhao Zhang
Galbot; Peking University
Xiaoguang Zhao
Tsinghua University
Heming Cui
University of Hong Kong
Zhizheng Zhang
Galbot; Beijing Academy of Artificial Intelligence
He Wang
Galbot; Peking University; Beijing Academy of Artificial Intelligence