VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models

📅 2025-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
While multimodal large language models (MLLMs) excel at high-level multimodal understanding, their foundational visual cognitive abilities—such as spatial reasoning, perceptual speed, and pattern recognition—remain systematically unassessed. Method: We introduce VisFactor, the first standardized benchmark explicitly designed to evaluate basic visual cognition in MLLMs. VisFactor digitizes and adapts the vision-related subtests of the classical Factor-Referenced Cognitive Test (FRCT), a well-established psychometric assessment, into an MLLM evaluation framework. Leveraging diverse prompting strategies—including chain-of-thought and multi-agent debate—and a unified cross-model evaluation protocol, we assess leading models (e.g., GPT-4o, Gemini-Pro, Qwen-VL). Contribution/Results: Experiments reveal that MLLMs perform near chance level on VisFactor, with advanced prompting yielding only marginal improvements. This exposes a critical gap in their low-level visual cognition capabilities. To foster community progress, we publicly release the VisFactor benchmark, evaluation toolkit, and implementation code.
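Concretely, the protocol summarized above amounts to scoring each model's forced-choice answers against a chance baseline. Below is a minimal, hypothetical sketch of such an evaluation loop; the `VisualItem` structure, the `query_model` interface, and the prompt wording are illustrative assumptions, not the released toolkit's API.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VisualItem:
    image_path: str      # digitized FRCT-style figure
    question: str
    options: List[str]   # e.g. ["A", "B", "C", "D"]
    answer: str          # gold option label

COT_PREFIX = "Think step by step about the figure before answering.\n"

def evaluate(items: List[VisualItem],
             query_model: Callable[[str, str], str],
             use_cot: bool = False) -> float:
    """Return accuracy of query_model(image_path, prompt) over the items."""
    correct = 0
    for item in items:
        prompt = (COT_PREFIX if use_cot else "") + item.question
        prediction = query_model(item.image_path, prompt)
        correct += prediction.strip().upper().startswith(item.answer)
    return correct / len(items)

def chance_level(items: List[VisualItem]) -> float:
    """Expected accuracy of uniform random guessing over the item set."""
    return sum(1 / len(item.options) for item in items) / len(items)

if __name__ == "__main__":
    # Stub model that guesses uniformly; its accuracy should hover near
    # the chance baseline, the reference point for the results above.
    items = [VisualItem(f"fig_{i}.png", "Which option completes the pattern?",
                        ["A", "B", "C", "D"], "A") for i in range(200)]
    guesser = lambda image, prompt: random.choice(["A", "B", "C", "D"])
    print("accuracy:", evaluate(items, guesser, use_cot=True))
    print("chance:  ", chance_level(items))
```

Because FRCT-derived items are forced-choice, the chance baseline follows directly from the option count, which is what makes the "near chance level" finding interpretable across subtests.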

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable advancements in multimodal understanding; however, their fundamental visual cognitive abilities remain largely underexplored. To bridge this gap, we introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT), a well-established psychometric assessment of human cognition. VisFactor digitizes vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks, including spatial reasoning, perceptual speed, and pattern recognition. We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL, using VisFactor under diverse prompting strategies like Chain-of-Thought and Multi-Agent Debate. Our findings reveal a concerning deficiency in current MLLMs' fundamental visual cognition, with performance frequently approaching random guessing and showing only marginal improvements even with advanced prompting techniques. These results underscore the critical need for focused research to enhance the core visual reasoning capabilities of MLLMs. To foster further investigation in this area, we release our VisFactor benchmark at https://github.com/CUHK-ARISE/VisFactor.
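Of the prompting strategies the abstract names, Multi-Agent Debate is the least self-explanatory. The sketch below shows one common formulation (independent answers, a revision round with peers' answers visible, then majority vote); the function names and the two-round protocol are assumptions for illustration, not the paper's exact setup.

```python
from collections import Counter
from typing import Callable, List

def debate(question: str,
           agents: List[Callable[[str], str]],
           rounds: int = 2) -> str:
    """Run a simple debate: agents answer independently, then revise after
    seeing peers' answers; the majority answer after the final round wins."""
    answers = [agent(question) for agent in agents]
    for _ in range(rounds - 1):
        peers = "\n".join(f"Agent {i}: {a}" for i, a in enumerate(answers))
        revised = (f"{question}\n\nOther agents answered:\n{peers}\n"
                   "Reconsider the figure and give your final answer.")
        answers = [agent(revised) for agent in agents]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # Stub agents with fixed opinions; a real run would wrap MLLM API calls.
    agents = [lambda q: "A", lambda q: "B", lambda q: "A"]
    print(debate("Which option completes the pattern?", agents))  # prints "A"
```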
Problem

Research questions and friction points this paper is trying to address.

MLLMs' fundamental visual cognitive abilities remain largely unevaluated
No standardized benchmark exists for systematically assessing low-level visual cognition in MLLMs
The extent of MLLMs' deficiencies in basic visual reasoning is unknown
Innovation

Methods, ideas, or system contributions that make the work stand out.

Digitizes vision-related FRCT subtests into an MLLM evaluation framework
Evaluates state-of-the-art MLLMs on fundamental visual cognitive tasks
Employs diverse prompting strategies (chain-of-thought, multi-agent debate) for testing
Jen-Tse Huang
Johns Hopkins University
Artificial Intelligence, Natural Language Processing, Large Language Models
Dasen Dai
The Chinese University of Hong Kong
Jen-Yuan Huang
Peking University
Youliang Yuan
The Chinese University of Hong Kong, Shenzhen
Xiaoyuan Liu
The Chinese University of Hong Kong, Shenzhen
Wenxuan Wang
The Chinese University of Hong Kong
Wenxiang Jiao
Tencent AI Lab
Pinjia He
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
Software Engineering, AI4SE, SE4AI, AIOps
Zhaopeng Tu
Tech Lead @ Tencent Digital Human
Digital Human, Agents, Large Language Models, Machine Translation