Do large language vision models understand 3D shapes?

📅 2024-12-14
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates large vision-language models’ (LVLMs) intrinsic understanding of 3D geometric shape—specifically, their ability to recognize shape identity across varying poses and surface materials, probing abstraction and geometric invariance. We introduce the first controlled benchmark for 3D shape abstraction: synthesizing multi-view, multi-material CGI images; designing a rigorously controlled variable-testing protocol; and establishing a unified prompting and evaluation framework. Experiments show that state-of-the-art LVLMs achieve average matching accuracy below 40% under joint pose-and-material variation—substantially lower than human performance yet markedly above random chance (~5%), indicating nascent but non-robust geometric invariance. Crucially, our analysis reveals that LVLMs heavily rely on superficial appearance cues and lack deep, invariant 3D geometric representations. This work thus exposes a fundamental bottleneck in current LVLMs and provides a novel diagnostic benchmark and toolkit for embodied visual reasoning research.

📝 Abstract
Large vision-language models (LVLMs) are the leading AI approach for achieving a general visual understanding of the world. Models such as GPT, Claude, Gemini, and LLaMA can use images to understand and analyze complex visual scenes. 3D objects and shapes are the basic building blocks of the world, and recognizing them is a fundamental part of human perception. The goal of this work is to test whether LVLMs truly understand 3D shapes by testing the models' ability to identify and match objects of the exact same 3D shape but with different orientations and materials/textures. A large number of test images were created using CGI, covering a highly diverse set of objects, materials, and scenes. The results show that the models' ability to match 3D shapes is significantly below human performance but much higher than random guessing, suggesting that the models have gained some abstract understanding of 3D shapes but still trail far behind humans on this task. In particular, the models can easily identify the same object in a different orientation, and can match identical 3D shapes in the same orientation but with different materials and textures. However, when both the object material and orientation are changed, all models perform poorly relative to humans. Code and benchmark are available.
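The evaluation described above boils down to a matching-accuracy computation compared against a random-chance baseline. As a minimal sketch (not the paper's actual code): the helper name, the 20-candidate pool size (chosen to be consistent with the ~5% chance figure in the summary), and the query count are illustrative assumptions.

```python
import random

def matching_accuracy(predictions, ground_truth):
    """Fraction of query images matched to the correct candidate shape."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Hypothetical setup: each query image is matched against a pool of
# 20 candidates, consistent with the ~5% random-chance baseline.
NUM_CANDIDATES = 20
NUM_QUERIES = 1000

ground_truth = [random.randrange(NUM_CANDIDATES) for _ in range(NUM_QUERIES)]

# A random guesser picks one candidate uniformly at random per query;
# its expected accuracy is 1 / NUM_CANDIDATES = 5%.
random_preds = [random.randrange(NUM_CANDIDATES) for _ in range(NUM_QUERIES)]
print(f"random baseline ~ {matching_accuracy(random_preds, ground_truth):.1%}")
```

Under this framing, the reported sub-40% model accuracy sits well above the 5% chance level but far below human performance on the same task.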
Problem

Research questions and friction points this paper is trying to address.

Large Visual Language Models
3D Shape Understanding
Human Performance Comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Visual Language Models
3D Shape Recognition
Human Performance Benchmark