Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limited geometric understanding of current vision-language-action (VLA) models, which constrains their performance in embodied tasks. Using GR00T-N1.5 as the VLA and VGGT as the geometric foundation model (GFM), the authors quantify, for the first time, the “geometric gap” between VLAs and GFMs via linear probing. They then evaluate, under a unified experimental setup, how three fusion architectures, training data scale, and multi-view inputs affect geometric perception. The findings indicate that specific fusion architectures substantially enhance geometric comprehension, while multi-view observations and sufficient training data are critical for high performance. The work derives design principles and provides empirical evidence for developing geometry-aware VLA systems.

📝 Abstract

Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved performance, it remains unclear (i) whether modern VLAs already have sufficient geometric understanding to begin with, (ii) which architecture is best for injecting geometric understanding into a VLA, and (iii) how other design choices affect geometric VLAs. In this paper, we provide a rigorous experimental analysis to shed light on these questions, for a specific choice of VLA (GR00T-N1.5) and GFM (VGGT). Our first contribution is to formalize prior work's intuition that current VLAs lack geometric understanding, by providing a rigorous analysis based on linear probing. The analysis quantifies, for the first time, the "geometric gap" between VLAs and GFMs. Our second contribution is to identify and compare different strategies to bridge GFMs with VLAs. We implement three different architectures, which differ in the way they inject geometry into the VLA, while keeping low-level implementation details as similar as possible to ensure a fair comparison. Finally, we analyze the impact of non-architectural choices (e.g., training data, number of cameras, reconstruction quality) on the performance of the geometric VLAs.
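
To make the linear-probing idea concrete, here is a minimal sketch of how a gap between two frozen feature extractors could be measured. It is not the paper's code or data: the abstract does not state the probing targets or metric, so the synthetic "depth" target, the feature arrays, and the helper names (`fit_linear_probe`, `probe_error`, `geometric_gap`) are all assumptions made for illustration. The only idea taken from the text is that a linear map is fitted on frozen features and the held-out error of the two models is compared.

```python
import numpy as np


def _with_bias(features):
    # Append a constant column so the probe can learn an offset.
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_linear_probe(features, targets, l2=1e-3):
    """Closed-form ridge regression from frozen features to targets.

    The small l2 term (applied to all weights, bias included) only
    keeps the linear system well conditioned.
    """
    X = _with_bias(features)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)


def probe_error(weights, features, targets):
    """Mean squared error of the linear probe on the given samples."""
    return float(np.mean((_with_bias(features) @ weights - targets) ** 2))


def geometric_gap(feats_a, feats_b, targets, n_train):
    """Held-out probe error of model A minus that of model B."""
    errors = []
    for feats in (feats_a, feats_b):
        w = fit_linear_probe(feats[:n_train], targets[:n_train])
        errors.append(probe_error(w, feats[n_train:], targets[n_train:]))
    return errors[0] - errors[1], tuple(errors)


# Synthetic stand-ins (assumptions, not real model outputs).
rng = np.random.default_rng(0)
n, d = 600, 16
latent = rng.normal(size=(n, d))              # hidden scene geometry
depth = latent @ rng.normal(size=(d, 1))      # hypothetical geometric target
gfm_feats = latent + 0.05 * rng.normal(size=(n, d))  # linearly encodes geometry
vla_feats = rng.normal(size=(n, d))                  # carries no geometry

gap, (vla_err, gfm_err) = geometric_gap(vla_feats, gfm_feats, depth, n_train=400)
print(f"VLA probe MSE: {vla_err:.3f}  GFM probe MSE: {gfm_err:.3f}  gap: {gap:.3f}")
```

Because the probe is linear and the backbones stay frozen, a large positive gap suggests that the geometric target is readily decodable from one model's features but not from the other's, which is the kind of comparison the abstract describes.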

Problem

Research questions and friction points this paper is trying to address.

vision-language-action models

geometric foundation models

3D reconstruction

geometric understanding

model architecture

Innovation

Methods, ideas, or system contributions that make the work stand out.

geometric foundation models

vision-language-action models

linear probing