🤖 AI Summary
While multimodal large language models (MLLMs) excel at high-level multimodal understanding, their foundational visual cognitive abilities (such as spatial reasoning, perceptual speed, and pattern recognition) have not been systematically assessed. Method: We introduce VisFactor, the first standardized benchmark explicitly designed to evaluate basic visual cognition in MLLMs. VisFactor digitizes and adapts the vision-related subtests of the classical Factor-Referenced Cognitive Test (FRCT), a well-established psychometric assessment, into an MLLM evaluation framework. Leveraging diverse prompting strategies, including Chain-of-Thought and Multi-Agent Debate, together with a unified cross-model evaluation protocol, we assess leading models (e.g., GPT-4o, Gemini-Pro, Qwen-VL). Contribution/Results: Experiments reveal that MLLMs frequently perform near chance level on VisFactor, with advanced prompting yielding only marginal improvements, exposing a critical gap in their low-level visual cognition. To foster community progress, we publicly release the VisFactor benchmark, evaluation toolkit, and implementation code.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable advancements in multimodal understanding; however, their fundamental visual cognitive abilities remain largely underexplored. To bridge this gap, we introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT), a well-established psychometric assessment of human cognition. VisFactor digitizes the vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks, including spatial reasoning, perceptual speed, and pattern recognition. We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL, using VisFactor under diverse prompting strategies such as Chain-of-Thought and Multi-Agent Debate. Our findings reveal a concerning deficiency in current MLLMs' fundamental visual cognition, with performance frequently approaching random guessing and showing only marginal improvements even with advanced prompting techniques. These results underscore the critical need for focused research to enhance the core visual reasoning capabilities of MLLMs. To foster further investigation in this area, we release our VisFactor benchmark at https://github.com/CUHK-ARISE/VisFactor.