🤖 AI Summary
Current vision-language-action (VLA) models rely on vision-language models (VLMs) pretrained on 2D image-text data, so they exhibit limited spatial understanding and struggle with precise perception and manipulation in complex 3D environments. To address this, we propose a plug-and-play implicit geometric enhancement module that requires no additional depth sensors or explicit 3D inputs; instead, it integrates implicit 3D geometric features extracted from off-the-shelf vision-based geometric foundation models to strengthen the spatial reasoning of VLA models. We introduce an architecture designed for spatial relation modeling, coupled with a joint training strategy that harmonizes geometric feature integration and action prediction. Evaluated on five challenging spatial reasoning tasks, our method significantly outperforms state-of-the-art VLA models, demonstrating strong generalization across diverse real-world robotic scenarios, including tabletop manipulation, navigation, and object rearrangement, while remaining compatible with existing VLA frameworks.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a promising framework for building generalist robots that perceive, reason, and act in the real world. These models typically build upon pretrained Vision-Language Models (VLMs), which excel at semantic understanding thanks to large-scale image-text pretraining. However, VLMs usually lack precise spatial understanding, as they are tuned primarily on 2D image-text pairs without 3D supervision. To address this limitation, recent approaches incorporate explicit 3D inputs such as point clouds or depth maps, but this requires additional depth sensors or relies on error-prone depth estimation. In contrast, our work introduces a plug-and-play module that implicitly injects 3D geometry features into VLA models by leveraging an off-the-shelf visual geometry foundation model. We design five spatially challenging tasks that demand precise spatial understanding to validate the effectiveness of our method. Extensive evaluations show that our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
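The plug-and-play injection described above could be realized in several ways; one minimal sketch, assuming a frozen geometry encoder and a residual cross-attention fusion step (all class and parameter names here are illustrative, not the paper's actual API):

```python
# Hypothetical sketch: fuse implicit 3D geometry features into a VLM's
# 2D visual tokens. Names (GeometryInjector, geo_dim, vlm_dim) are
# assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn

class GeometryInjector(nn.Module):
    """Project geometry features into the VLM token space, then let the
    VLM's visual tokens attend to them; a residual connection preserves
    the original 2D semantics."""
    def __init__(self, geo_dim: int, vlm_dim: int, num_heads: int = 8):
        super().__init__()
        self.proj = nn.Sequential(          # align feature widths
            nn.Linear(geo_dim, vlm_dim),
            nn.GELU(),
            nn.Linear(vlm_dim, vlm_dim),
        )
        self.attn = nn.MultiheadAttention(vlm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vlm_dim)

    def forward(self, vlm_tokens: torch.Tensor, geo_feats: torch.Tensor) -> torch.Tensor:
        # vlm_tokens: (B, N, vlm_dim) visual tokens from the VLM backbone
        # geo_feats:  (B, M, geo_dim) implicit 3D features from a frozen
        #             off-the-shelf geometry foundation model
        geo_tokens = self.proj(geo_feats)
        fused, _ = self.attn(query=vlm_tokens, key=geo_tokens, value=geo_tokens)
        return self.norm(vlm_tokens + fused)  # residual fusion

# Usage with random stand-in features:
injector = GeometryInjector(geo_dim=384, vlm_dim=768)
out = injector(torch.randn(2, 16, 768), torch.randn(2, 32, 384))
print(out.shape)  # torch.Size([2, 16, 768])
```

Because the fused output keeps the VLM token shape, a module like this can sit between the vision encoder and the action head of an existing VLA model without changing its interface, which is what "plug-and-play" implies here.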