Do Multimodal Large Language Models See Like Humans?

πŸ“… 2024-12-12
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current multimodal large language models (MLLMs) lack a systematic evaluation of visual perception grounded in the human visual system (HVS). Method: We introduce HVSBench, the first large-scale benchmark explicitly designed to assess MLLMs against five core HVS mechanisms (prominence, subitizing, prioritizing, free-viewing, and searching), comprising over 85K cross-domain multimodal samples. HVSBench integrates cognitive-psychology task paradigms with empirical eye-tracking and behavioral data, enabling standardized, reproducible evaluation across 13 state-of-the-art MLLMs. Contribution/Results: Experiments reveal substantial performance gaps between MLLMs and humans on critical visual tasks such as free-viewing and visual search; even the best-performing model achieves only moderate average scores, demonstrating a fundamental misalignment between current MLLMs and the HVS. HVSBench thus provides both a theoretical framework and an empirical toolkit for advancing HVS-aligned visual reasoning in MLLMs.
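
To make the evaluation protocol concrete, the snippet below is a minimal sketch of how a benchmark of this shape could be scored: iterate over samples grouped by HVS field, query a model, and report per-field and macro-averaged accuracy. The sample format, model interface, and function names (evaluate, query_model) are illustrative assumptions, not the paper's released code or API.

```python
# Minimal sketch of an HVSBench-style evaluation loop (illustrative only).
# The five field names follow the paper; the sample format and the model
# interface are assumptions, not the benchmark's released API.
from collections import defaultdict

HVS_FIELDS = ["Prominence", "Subitizing", "Prioritizing", "Free-Viewing", "Searching"]

def evaluate(samples, query_model):
    """samples: iterable of dicts with 'field', 'image', 'question', 'answer'.
    query_model: callable (image, question) -> predicted answer string."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        pred = query_model(s["image"], s["question"])
        total[s["field"]] += 1
        correct[s["field"]] += int(pred.strip().lower() == s["answer"].strip().lower())
    # Per-field accuracy, then a macro-average over the fields actually seen.
    per_field = {f: correct[f] / total[f] for f in HVS_FIELDS if total[f]}
    macro_avg = sum(per_field.values()) / max(len(per_field), 1)
    return per_field, macro_avg
```

Exact-match scoring is only the simplest choice here; tasks such as free-viewing, where the ground truth is a scanpath or saliency map rather than a short answer, would need task-specific metrics instead.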

πŸ“ Abstract
Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models. However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans? Current benchmarks lack the ability to evaluate MLLMs from this perspective. To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision. HVSBench curates over 85K multimodal samples, spanning 13 categories and 5 fields of the HVS, including Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs. Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. Our experiments reveal that HVSBench presents a new and significant challenge for cutting-edge MLLMs. Diverse human participants attained strong performance, significantly outperforming MLLMs, which further underscores the benchmark's high quality. We believe that HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step in understanding how MLLMs perceive and process visual information.
Problem

Research questions and friction points this paper is trying to address.

Assessing MLLM-human visual perception alignment
Evaluating MLLMs on human vision tasks
Benchmarking MLLMs against the human visual system
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing HVSBench for human-aligned MLLM evaluation
Curating 85K multimodal samples across 13 HVS categories (a hypothetical record schema is sketched below)
Revealing a significant gap between MLLMs and human visual perception
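
As a rough illustration of the curation point above, the dataclass below sketches what a single benchmark record could contain, covering the five HVS fields and the 13 finer-grained categories. Every attribute name here is a hypothetical placeholder; the paper's actual data format is not specified on this page.

```python
# Hypothetical schema for one HVSBench-style record (illustrative only).
from dataclasses import dataclass

@dataclass
class HVSSample:
    sample_id: str   # unique identifier within the ~85K-sample benchmark
    field: str       # one of the 5 HVS fields, e.g. "Subitizing"
    category: str    # one of the 13 finer-grained task categories
    image_path: str  # path or URL of the visual stimulus
    question: str    # natural-language prompt posed to the MLLM
    answer: str      # ground-truth answer derived from human data
```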
πŸ‘₯ Authors
Jiaying Lin, Peking University (Computer Vision · Multimodal)
Shuquan Ye, City University of Hong Kong
Rynson W. H. Lau, City University of Hong Kong