🤖 AI Summary
A dedicated benchmark for occlusion-aware evaluation has been lacking, hindering systematic assessment of multimodal large language models' (MLLMs) occlusion understanding. Method: We introduce O-Bench, the first occlusion-aware visual question answering benchmark, comprising 1,365 semantically coherent occluded images derived from SA-1B and 4,588 high-quality QA pairs across five task categories. It employs a novel layered synthesis strategy to generate photorealistic occlusion scenes and a semi-automatic annotation pipeline to ensure data fidelity. Contribution/Results: Evaluation of 22 state-of-the-art MLLMs reveals a substantial performance gap relative to the human baseline, one that neither model scaling nor an explicit thinking process effectively bridges. We identify three prevalent failure modes: an overly conservative bias, fragile gestalt prediction, and weakness on quantitative tasks. O-Bench establishes a quantifiable, multi-granular evaluation framework and provides actionable insights for advancing occlusion perception in MLLMs.
📝 Abstract
Occlusion perception, a critical foundation for human-level spatial understanding, embodies the challenge of integrating visual recognition and reasoning. Although multimodal large language models (MLLMs) have demonstrated remarkable capabilities, their performance on occlusion perception remains under-explored. To address this gap, we introduce O-Bench, the first visual question answering (VQA) benchmark specifically designed for occlusion perception. Based on SA-1B, we construct 1,365 images featuring semantically coherent occlusion scenarios through a novel layered synthesis approach. Upon this foundation, we annotate a total of 4,588 question-answer pairs across five tailored tasks, employing a reliable, semi-automatic workflow. Our extensive evaluation of 22 representative MLLMs against a human baseline reveals a significant performance gap between current MLLMs and humans, which, we find, cannot be sufficiently bridged by model scaling or by an explicit thinking process. We further identify three typical failure patterns: an overly conservative bias, fragile gestalt prediction, and a struggle with quantitative tasks. We believe O-Bench can not only serve as a vital evaluation tool for occlusion perception, but also inspire the development of MLLMs with better visual intelligence. Our benchmark will be made publicly available upon publication.
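The abstract names a layered synthesis approach but does not specify its details. For intuition only, the sketch below shows the generic core of layered occlusion compositing: alpha-pasting a segmented foreground object (e.g., an SA-1B object crop) over a background so that it partially covers a target. All names and steps here are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of layered occlusion compositing.
# NOT the paper's method: function name, inputs, and steps are assumed.
import numpy as np
from PIL import Image


def composite_occlusion(background: Image.Image,
                        occluder: Image.Image,
                        mask: np.ndarray,
                        position: tuple[int, int]) -> Image.Image:
    """Paste a segmented occluder onto a background at `position`,
    using its binary segmentation mask as the alpha channel."""
    # Convert the boolean/0-1 mask to an 8-bit grayscale alpha image.
    alpha = Image.fromarray(mask.astype(np.uint8) * 255, mode="L")
    out = background.copy()
    # Only pixels where the mask is set are copied, so the occluder's
    # silhouette covers (occludes) whatever lies beneath it.
    out.paste(occluder, position, alpha)
    return out
```

A real synthesis pipeline would additionally choose object pairs for semantic coherence and harmonize lighting and scale for photorealism; those stages are beyond this minimal sketch.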