🤖 AI Summary
Existing vision-language-action (VLA) models predominantly rely on 2D visual inputs, limiting their capacity to model true 3D geometry and thereby constraining spatial awareness and generalization. To address this, we propose a 3D-enhanced VLA framework that, for the first time, introduces a dedicated point cloud embedding network and a 3D-augmented action expert, enabling disentangled yet synergistic fusion of 2D semantic representations and 3D geometric information. Our approach jointly leverages depth-to-point-cloud conversion, a vision-language model, the point cloud embedding network, and the 3D action expert to improve geometric consistency in instruction-driven manipulation. Evaluated on the LIBERO and ManiSkill2 simulation benchmarks, our method achieves state-of-the-art performance. Moreover, it demonstrates strong scale awareness, viewpoint invariance, and cross-environment adaptability in real-world settings.
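The depth-to-point-cloud conversion mentioned above is, in the common formulation, a pinhole-camera back-projection. A minimal sketch follows; the function name and intrinsics are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (meters) into camera-frame 3D points
    using the standard pinhole model; intrinsics are assumed known."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy example: a flat surface 2 m in front of a 4x4 camera.
pts = depth_to_point_cloud(np.full((4, 4), 2.0), fx=2.0, fy=2.0, cx=1.5, cy=1.5)
print(pts.shape)  # (16, 3)
```

Each pixel (u, v) with depth z maps to (x, y, z) in the camera frame; the resulting N×3 array is the input the point cloud embedding network would consume.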
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a promising approach for enabling robots to follow language instructions and predict corresponding actions. However, current VLA models mainly rely on 2D visual inputs, neglecting the rich geometric information in the 3D physical world, which limits their spatial awareness and adaptability. In this paper, we present GeoVLA, a novel VLA framework that effectively integrates 3D information to advance robotic manipulation. It uses a vision-language model (VLM) to process images and language instructions, extracting fused vision-language embeddings. In parallel, it converts depth maps into point clouds and employs a customized point encoder, called the Point Embedding Network, to generate 3D geometric embeddings independently. The resulting embeddings are then concatenated and processed by our proposed spatial-aware action expert, called the 3D-enhanced Action Expert, which combines information from the different sensor modalities to produce precise action sequences. Through extensive experiments in both simulation and real-world environments, GeoVLA demonstrates superior performance and robustness. It achieves state-of-the-art results on the LIBERO and ManiSkill2 simulation benchmarks and shows remarkable robustness in real-world tasks requiring height adaptability, scale awareness, and viewpoint invariance.
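The fusion step described above — independently produced 2D vision-language tokens and 3D geometric tokens concatenated before the action expert — can be sketched minimally. All dimensions, variable names, and the shared-width projection here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: VLM token width, point-token width,
# number of vision-language tokens, number of point tokens.
D_VL, D_PT, N_TOK, N_PTS = 768, 256, 16, 1024

# Stand-ins for the VLM's fused vision-language embeddings and the
# Point Embedding Network's geometric embeddings, produced in parallel.
vl_tokens = rng.standard_normal((N_TOK, D_VL))
pt_tokens = rng.standard_normal((N_PTS, D_PT))

# Project both streams to a shared width, then concatenate along the
# token axis; the action expert would consume this joint sequence.
W_vl = rng.standard_normal((D_VL, 512)) / np.sqrt(D_VL)
W_pt = rng.standard_normal((D_PT, 512)) / np.sqrt(D_PT)
fused = np.concatenate([vl_tokens @ W_vl, pt_tokens @ W_pt], axis=0)
print(fused.shape)  # (1040, 512)
```

The point of the design is that the two modalities stay disentangled until this late concatenation, so the geometric branch cannot be washed out inside the 2D backbone.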