🤖 AI Summary
While multimodal large language models (MLLMs) excel at high-level multimodal understanding, their foundational visual cognitive abilities (such as spatial reasoning, perceptual speed, and pattern recognition) have not been systematically assessed. Method: We introduce VisFactor, the first standardized benchmark explicitly designed to evaluate basic visual cognition in MLLMs. VisFactor digitizes and adapts the vision-related subtests of the classical Factor-Referenced Cognitive Test (FRCT), a well-established psychometric assessment, into an MLLM evaluation framework. Leveraging diverse prompting strategies, including Chain-of-Thought and Multi-Agent Debate, together with a unified cross-model evaluation protocol, we assess leading models (e.g., GPT-4o, Gemini-Pro, Qwen-VL). Contribution/Results: Experiments reveal that MLLMs frequently perform near chance level on VisFactor, with advanced prompting yielding only marginal improvements, exposing a critical gap in their low-level visual cognition. To foster community progress, we publicly release the VisFactor benchmark, evaluation toolkit, and implementation code.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable advancements in multimodal understanding; however, their fundamental visual cognitive abilities remain largely underexplored. To bridge this gap, we introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT), a well-established psychometric assessment of human cognition. VisFactor digitizes the vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks, including spatial reasoning, perceptual speed, and pattern recognition. We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL, using VisFactor under diverse prompting strategies such as Chain-of-Thought and Multi-Agent Debate. Our findings reveal a concerning deficiency in current MLLMs' fundamental visual cognition, with performance frequently approaching random guessing and showing only marginal improvements even with advanced prompting techniques. These results underscore the critical need for focused research to enhance the core visual reasoning capabilities of MLLMs. To foster further investigation in this area, we release our VisFactor benchmark at https://github.com/CUHK-ARISE/VisFactor.