Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

📅 2025-04-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit significant limitations in multi-view geometric reasoning, particularly in modeling cross-view correspondences and estimating coarse camera poses. Method: We introduce All-Angles Bench, a benchmark dedicated to multi-view understanding, comprising over 2,100 carefully human-annotated question-answer pairs across 90 diverse real-world scenes. The benchmark spans six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) that probe geometric correspondence and the capacity to align information consistently across views. We conduct zero-shot evaluation of 27 representative MLLMs, including GPT-4o, Gemini-2.0-Flash, and Claude-3.7-Sonnet, against human evaluators. Contribution/Results: Results reveal a substantial performance gap between current MLLMs and humans, with models underperforming most at cross-view matching under partial occlusion and at establishing coarse camera poses. These findings highlight the need for domain-specific refinements or modules that embed stronger multi-view awareness in MLLM architectures.
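
To make the benchmark's structure concrete, here is a minimal sketch of how a single All-Angles Bench item could be represented. The class and field names below (BenchItem, scene_id, views, and so on) are illustrative assumptions, not the benchmark's released data format; only the task categories and scale come from the paper.

```python
# Illustrative sketch of one All-Angles Bench item. The class and field
# names are hypothetical, not the benchmark's released schema.
from dataclasses import dataclass

@dataclass
class BenchItem:
    scene_id: str        # one of the 90 real-world scenes
    task: str            # one of the six tasks, e.g. "counting",
                         # "relative_direction", "camera_pose_estimation"
    views: list[str]     # paths to the multi-view images of the scene
    question: str        # human-annotated multiple-choice question
    choices: list[str]   # answer options, e.g. ["A. Left", "B. Right"]
    answer: str          # ground-truth choice label, e.g. "A"

# Example item (contents invented purely for illustration):
item = BenchItem(
    scene_id="scene_012",
    task="relative_direction",
    views=["scene_012/view_0.jpg", "scene_012/view_1.jpg"],
    question="Viewed from the second camera, is the red chair to the "
             "left or the right of the round table?",
    choices=["A. Left", "B. Right", "C. Cannot be determined"],
    answer="A",
)
```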

📝 Abstract
Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges MLLMs face in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test a model's geometric correspondence and its capacity to align information consistently across views. Our extensive benchmark experiments on 27 representative MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o, measured against human evaluators, reveal a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs particularly underperform in two respects: (1) cross-view correspondence for partially occluded views and (2) establishing coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe that All-Angles Bench offers valuable insights and contributes to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/.
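
As a companion to the item sketch above, the zero-shot multiple-choice protocol described in the abstract can be approximated with a short evaluation loop. This is a minimal sketch under stated assumptions: query_mllm(images, prompt) is a hypothetical wrapper around whichever model API is under test (GPT-4o, Gemini-2.0-Flash, Claude-3.7-Sonnet, etc.), assumed to return the model's raw text reply, and the answer-letter extraction is deliberately crude.

```python
# Minimal zero-shot evaluation loop over BenchItem objects (see the item
# sketch above). query_mllm(images, prompt) is a hypothetical wrapper
# around the MLLM API being tested; no few-shot exemplars are supplied.
from collections import defaultdict

def evaluate(items, query_mllm):
    correct = defaultdict(int)
    total = defaultdict(int)
    for it in items:
        prompt = (
            it.question + "\n"
            + "\n".join(it.choices)
            + "\nAnswer with the letter of the correct choice only."
        )
        reply = query_mllm(it.views, prompt)     # one zero-shot query per item
        predicted = reply.strip()[:1].upper()    # crude first-letter extraction
        total[it.task] += 1
        correct[it.task] += int(predicted == it.answer)
    # Per-task accuracy, comparable against human-evaluator scores.
    return {t: correct[t] / total[t] for t in total}
```

Reporting accuracy per task rather than in aggregate keeps the two failure modes discussed above, occluded cross-view matching and coarse camera pose estimation, directly visible in the results.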
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-view understanding in MLLMs for 3D scenes
Assessing geometric consistency across diverse viewpoints in MLLMs
Bridging the performance gap in multi-view reasoning between MLLMs and humans
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes All-Angles Bench, a multi-view understanding benchmark
Tests multi-view geometric correspondence across six tasks
Highlights the need for stronger multi-view awareness in MLLMs