Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current visual language models (VLMs) lack systematic evaluation on complex graphic reasoning, which encompasses spatial, relational, and abstract reasoning. To address this gap, we introduce ReasonBench, the first dedicated benchmark for structured graphic reasoning, comprising 1,613 authentic intelligence-test items. We propose a dual-optimization framework: (1) DiaCoT, a hierarchical diagrammatic chain-of-thought method that enhances interpretability and stepwise reasoning by decomposing reasoning into layers; and (2) ReasonTune, a task-adaptive fine-tuning strategy that strengthens abstract structural modeling. Evaluated across 11 state-of-the-art VLMs, our approach improves performance by 33.5% and reveals critical bottlenecks in structured graphic understanding. ReasonBench establishes a standardized, challenging evaluation platform and advances VLMs toward higher-order cognitive reasoning capabilities.

📝 Abstract
Evaluating the performance of visual language models (VLMs) on graphic reasoning tasks has become an important research topic. However, VLMs still show clear deficiencies in simulating human-level graphic reasoning, especially in complex graphic reasoning and abstract problem solving; these areas remain understudied, and existing work focuses only on simple graphics. To evaluate VLMs on complex graphic reasoning, we propose ReasonBench, the first evaluation benchmark focused on structured graphic reasoning tasks, comprising 1,613 questions drawn from real-world intelligence tests. ReasonBench covers reasoning dimensions related to location, attribute, quantity, and multi-element tasks, providing a comprehensive assessment of VLMs' spatial, relational, and abstract reasoning capabilities. We benchmark 11 mainstream VLMs (both closed-source and open-source) and reveal significant limitations of current models. Based on these findings, we propose a dual optimization strategy: the Diagrammatic Reasoning Chain (DiaCoT) improves the interpretability of reasoning by decomposing it into layers, and ReasonTune enhances the task adaptability of model reasoning through training; together they improve VLM performance by 33.5%. All experimental data and code are available at: https://huggingface.co/datasets/cistine/ReasonBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' performance in complex graphic reasoning tasks
Addressing deficiencies in human-level graphic reasoning capabilities
Improving VLMs' interpretability and adaptability for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed ReasonBench for complex graphic reasoning evaluation
Introduced Diagrammatic Reasoning Chain (DiaCoT)
Developed ReasonTune for model task adaptability
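The DiaCoT idea of decomposing a graphic-reasoning item into layered steps can be illustrated with a small prompt-building sketch. This is an assumption-laden illustration, not the authors' implementation: the stage names, instructions, and the `build_diacot_prompt` helper are hypothetical, chosen only to show what a hierarchical chain-of-thought prompt for such items might look like.

```python
# Illustrative sketch only: the paper decomposes graphic reasoning into
# layered steps; the exact stages and wording below are assumptions,
# not the published DiaCoT prompts.

DIACOT_STAGES = [
    ("Perception", "Describe every panel: shapes, counts, positions."),
    ("Attribute analysis", "Compare attributes (symmetry, shading, size) across panels."),
    ("Rule induction", "State the transformation rule linking consecutive panels."),
    ("Answer selection", "Apply the rule and choose the option that continues the pattern."),
]

def build_diacot_prompt(question: str, options: list[str]) -> str:
    """Assemble a layered chain-of-thought prompt for one test item."""
    lines = [f"Question: {question}", "Options: " + ", ".join(options), ""]
    for i, (name, instruction) in enumerate(DIACOT_STAGES, start=1):
        lines.append(f"Step {i} ({name}): {instruction}")
    lines.append("Answer with the option letter only after completing all steps.")
    return "\n".join(lines)
```

Feeding the resulting prompt to a VLM alongside the item's image forces the model to commit to an explicit perception and rule-induction trace before answering, which is the interpretability benefit the layered decomposition is after.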
Jianyi Zhang
Research Scientist@Google Deepmind, PI@Duke University
LLMsGenerative AITrustworthy AI
Xu Ji
Beijing Electronic Science & Technology Institute
Ziyin Zhou
Beijing Electronic Science & Technology Institute
Yuchen Zhou
Beijing Electronic Science & Technology Institute
Shubo Shi
Beijing Electronic Science & Technology Institute
Haoyu Wu
Beijing Electronic Science & Technology Institute
Zhen Li
Beijing Electronic Science & Technology Institute
Shizhao Liu
Beijing Electronic Science & Technology Institute