IQBench: How"Smart'' Are Vision-Language Models? A Study with Human IQ Tests

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) are predominantly evaluated on final-answer accuracy, overlooking fluid reasoning, in particular image-driven, human-like visual intelligence. Method: We introduce IQBench, the first benchmark explicitly designed to assess VLMs on human visual IQ tests, comprising 500 manually collected and annotated, text-minimized image-based questions. We propose a three-dimensional evaluation framework: (1) explanation quality and problem-solving patterns (reasoning score), (2) final-answer accuracy, and (3) a human consistency score. The benchmark is visually centric and manually curated to guard against data leakage from training corpora. Contribution/Results: Evaluating state-of-the-art VLMs, including o4-mini and Gemini-2.5-flash, we find that o4-mini achieves the highest reasoning score (69.6%) and accuracy (61.5%). All models show significant weaknesses in 3D spatial and visual anagram reasoning, exposing fundamental bottlenecks in deep visual reasoning.
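
As a rough illustration of how the three dimensions could be aggregated per model, consider the minimal sketch below. The field names and the binary human-consistency judgment are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch (not the authors' code) of aggregating IQBench's three
# evaluation dimensions per model. Field names are illustrative assumptions.
from dataclasses import dataclass
from statistics import mean

@dataclass
class ItemResult:
    correct: bool            # final answer matches the gold answer
    reasoning_score: float   # 0-1 grade of the explanation / solution pattern
    human_consistent: bool   # human judge endorses the model's reasoning

def aggregate(results: list[ItemResult]) -> dict[str, float]:
    """Average each dimension over all benchmark items for one model."""
    return {
        "accuracy": mean(float(r.correct) for r in results),
        "reasoning": mean(r.reasoning_score for r in results),
        "human_consistency": mean(float(r.human_consistent) for r in results),
    }

# Toy example on three items
print(aggregate([ItemResult(True, 0.8, True),
                 ItemResult(False, 0.5, False),
                 ItemResult(True, 0.9, True)]))
```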

📝 Abstract
Although large Vision-Language Models (VLMs) have demonstrated remarkable performance in a wide range of multimodal tasks, their true reasoning capabilities on human IQ tests remain underexplored. To advance research on the fluid intelligence of VLMs, we introduce **IQBench**, a new benchmark designed to evaluate VLMs on standardized visual IQ tests. We focus on evaluating the reasoning capabilities of VLMs, which we argue are more important than the accuracy of the final prediction. **Our benchmark is visually centric, minimizing the dependence on unnecessary textual content**, thus encouraging models to derive answers primarily from image-based information rather than learned textual knowledge. To this end, we manually collected and annotated 500 visual IQ questions to **prevent unintentional data leakage during training**. Unlike prior work that focuses primarily on the accuracy of the final answer, we evaluate the reasoning ability of the models by assessing their explanations and the patterns used to solve each problem, along with the accuracy of the final prediction and human evaluation. Our experiments show that there are substantial performance disparities between tasks, with models such as `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieving the highest average accuracies of 0.615, 0.578, and 0.548, respectively. However, all models struggle with 3D spatial and anagram reasoning tasks, highlighting significant limitations in current VLMs' general reasoning abilities. In terms of reasoning scores, `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieved top averages of 0.696, 0.586, and 0.516, respectively. These results highlight inconsistencies between the reasoning processes of the models and their final answers, emphasizing the importance of evaluating the accuracy of the reasoning in addition to the final predictions.
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' reasoning on human IQ tests
Assessing reasoning beyond final answer accuracy
Minimizing textual bias in visual IQ evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual-centric benchmark minimizes textual dependency
Manual collection prevents unintentional data leakage
Evaluates reasoning via explanations and patterns (see the sketch below)
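
Since the benchmark grades explanations rather than only final answers, a natural implementation is to score each explanation against the annotated solution pattern. The sketch below is a hypothetical judge-based grader; the prompt wording, the 0-1 scale, and the `ask_judge` callable are illustrative assumptions, not the authors' protocol.

```python
# Hypothetical sketch of grading a model's explanation with an LLM judge.
# The prompt and scale are assumptions; the paper pairs automatic scoring
# with human evaluation.
import re

JUDGE_PROMPT = """You are grading an IQ-test explanation.
Question pattern (gold): {pattern}
Model explanation: {explanation}
Rate how faithfully the explanation follows the gold pattern
on a scale from 0.0 to 1.0. Reply with the number only."""

def reasoning_score(pattern: str, explanation: str, ask_judge) -> float:
    """ask_judge: any callable that sends a prompt to an LLM and returns text."""
    reply = ask_judge(JUDGE_PROMPT.format(pattern=pattern, explanation=explanation))
    match = re.search(r"\d*\.?\d+", reply)
    # Clamp to [0, 1]; treat an unparseable reply as a zero score.
    return min(max(float(match.group()), 0.0), 1.0) if match else 0.0
```
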
Tan-Hanh Pham
MGH - Harvard Medical School
Robotics, AI
Phu-Vinh Nguyen
Uppsala University, Sweden
Dang The Hung
University of London, UK
Bui Trong Duong
Vietnam Military Medical University
Vu Nguyen Thanh
University of Technical Education Ho Chi Minh City, Vietnam
Chris Ngo
Knovel Engineering
Tri Quang Truong
University of Technical Education Ho Chi Minh City, Vietnam
Truong-Son Hy
Tenure-Track Assistant Professor, University of Alabama at Birmingham
AI for Science, Bioinformatics, Drug Discovery, Medical AI, Biomedical Knowledge Graph