🤖 AI Summary
This work addresses the lack of systematic evaluation of multimodal large language models' (MLLMs) perspective-understanding capabilities by introducing MMPerspective, the first dedicated benchmark for this purpose. It comprises 10 tasks across three dimensions: perspective perception, reasoning, and robustness. Spanning 2,711 images and 5,083 question-answer pairs, it integrates both real-world and synthetic data and, crucially, formally defines and quantifies MLLMs' 3D spatial understanding grounded in projective geometry. Evaluation covers diverse task types, including vanishing-point detection, perspective-type classification, 3D line-relationship judgment, viewpoint-invariance testing, and chain-of-thought (CoT) analysis, applied across 43 state-of-the-art models. Results reveal that models excel at surface-level perception but exhibit fundamental deficits in compositional spatial reasoning and geometric consistency; that nontrivial correlations exist between architecture, scale, and perspective competence; that CoT prompting substantially enhances complex perspective reasoning; and that key robustness bottlenecks remain.
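The vanishing-point tasks above rest on a standard fact of projective geometry: in homogeneous image coordinates, the line through two points is their cross product, and the intersection of two lines is again a cross product, so projections of parallel 3D lines meet at a vanishing point computable in two cross products. A minimal sketch (the segment coordinates are illustrative, not drawn from the benchmark):

```python
import numpy as np

def line_through(p, q):
    # Homogeneous line through two image points via cross product.
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(seg_a, seg_b):
    # Intersect two image lines (projections of parallel 3D lines).
    la = line_through(*seg_a)
    lb = line_through(*seg_b)
    vp = np.cross(la, lb)
    if abs(vp[2]) < 1e-9:
        return None  # lines parallel in the image: vanishing point at infinity
    return vp[:2] / vp[2]  # back to inhomogeneous (x, y)

# Two segments converging toward a common point:
vp = vanishing_point(((0.0, 50.0), (50.0, 25.0)),
                     ((0.0, -50.0), (50.0, -25.0)))
# vp is approximately (100.0, 0.0)
```

A model that truly internalizes perspective geometry should judge such convergence consistently, which is what the benchmark's perception and robustness tasks probe.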
📝 Abstract
Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities such as vanishing-point perception and counting, perspective-type reasoning, line-relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and with maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/