Understanding Depth and Height Perception in Large Visual-Language Models

📅 2024-08-21
📈 Citations: 1
Influential: 0
🤖 AI Summary
Large vision-language models (VLMs) lack rigorous evaluation of geometric perception, particularly depth and height understanding, despite their growing deployment in spatially grounded applications. Method: We introduce GeoMeter, a multi-dimensional benchmark dedicated to geometric reasoning, featuring controlled 2D and 3D scenarios and integrating human annotations with diverse evaluation paradigms: multi-turn consistency QA, relative ranking, and counterfactual reasoning. Contribution/Results: Systematic evaluation across 18 state-of-the-art VLMs reveals a stark performance gap: average accuracy on depth/height reasoning is only 57.6%, substantially below shape/size recognition. This exposes critical capability limitations and dataset biases. GeoMeter enables fine-grained, quantifiable assessment of VLMs' geometric perception, filling a fundamental gap in visual geometric understanding evaluation and providing both a standardized benchmark and a diagnostic framework for robust visual reasoning research.

📝 Abstract
Geometric understanding - including depth and height perception - is fundamental to intelligence and crucial for navigating our environment. Despite the impressive capabilities of large Vision Language Models (VLMs), it remains unclear how well they possess the geometric understanding required for practical applications in visual perception. In this work, we focus on evaluating the geometric understanding of these models, specifically targeting their ability to perceive the depth and height of objects in an image. To address this, we introduce GeoMeter, a suite of benchmark datasets - encompassing 2D and 3D scenarios - to rigorously evaluate these aspects. By benchmarking 18 state-of-the-art VLMs, we found that although they excel in perceiving basic geometric properties like shape and size, they consistently struggle with depth and height perception. Our analysis reveals that these challenges stem from shortcomings in their depth and height reasoning capabilities and inherent biases. This study aims to pave the way for developing VLMs with enhanced geometric understanding by emphasizing depth and height perception as critical components necessary for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Evaluating depth and height perception in vision-language models
Assessing geometric understanding for real-world visual applications
Identifying limitations in 3D reasoning capabilities of VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing GeoMeter benchmark datasets
Evaluating depth and height perception capabilities
Identifying reasoning shortcomings and biases
👥 Authors
Shehreen Azad
Center for Research in Computer Vision, University of Central Florida
Yash Jain
Rishit Garg
Indian Institute of Technology, Kharagpur
Y. S. Rawat
Center for Research in Computer Vision, University of Central Florida
Vibhav Vineet
Microsoft Research