VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

📅 2025-06-03
🤖 AI Summary
Problem: Multimodal large language models (MLLMs) have limited fine-grained perception of abstract graphics, which is a critical bottleneck in abstract visual reasoning (AVR).
Method: This paper introduces VisuRiddles, described as the first fine-grained AVR benchmark covering five core dimensions and two high-level reasoning categories, and proposes the Perceptual Riddle Synthesizer (PRS), a framework that combines rule-based and LLM-augmented generation of perceptual descriptions, fine-grained graphical semantic modeling, and multi-stage supervised fine-tuning. Together these enable supervision of intermediate reasoning steps and improve interpretability.
Contribution/Results: Experiments indicate that fine-grained visual perception is the core limitation in AVR. PRS raises the average accuracy of mainstream MLLMs on VisuRiddles by 23.6% and improves the controllability and interpretability of the reasoning process, offering a structured, perception-aware approach to AVR evaluation and training.

📝 Abstract
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual descriptions, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck and that our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at https://github.com/yh-hust/VisuRiddles.
Problem

Research questions and friction points this paper is trying to address.

Abstract Visual Reasoning (AVR) challenges MLLMs due to poor abstract graphics perception
VisuRiddles benchmark assesses MLLMs' reasoning across five core dimensions
Perceptual Riddle Synthesizer (PRS) improves abstract visual perception via fine-grained training data
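
To make the per-dimension assessment concrete, here is a minimal sketch of how accuracy could be broken down by dimension. The record layout, the function name, and the labels `dim_a` and `dim_b` are placeholders chosen for illustration; this summary does not name the five dimensions or describe the benchmark's actual scoring script.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for (dimension, predicted, gold) records.

    The record format is an assumption made for this sketch, not the
    benchmark's real data schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, predicted, gold in records:
        total[dimension] += 1
        correct[dimension] += int(predicted == gold)
    return {dimension: correct[dimension] / total[dimension] for dimension in total}


# Placeholder records: two items in one dimension, one item in another.
records = [
    ("dim_a", "A", "A"),
    ("dim_a", "B", "C"),
    ("dim_b", "D", "D"),
]
scores = accuracy_by_dimension(records)
```

Reporting one score per dimension, rather than a single aggregate, is what lets a benchmark of this kind show where a model's perception or reasoning breaks down.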
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed VisuRiddles benchmark for abstract visual reasoning
Introduced Perceptual Riddle Synthesizer for training data
Enhanced MLLMs with fine-grained perceptual supervision
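
The idea of rule-based synthesis with fine-grained perceptual supervision can be illustrated with a toy generator. This is not the authors' PRS: the rule (a dot shifting columns in a grid), the `Riddle` class, and both function names are hypothetical, chosen only to show how a synthesized riddle can carry a panel-by-panel description that a model is trained to produce before stating the rule and the answer.

```python
from dataclasses import dataclass


@dataclass
class Riddle:
    panels: list[tuple[int, int]]  # (row, col) of the dot in each shown panel
    description: str               # fine-grained perceptual description
    rule: str                      # the abstract rule, stated in words
    answer: tuple[int, int]        # (row, col) of the dot in the missing panel


def synthesize_riddle(grid_size: int, start_col: int, step: int, n_panels: int = 3) -> Riddle:
    """Toy rule: one dot moves `step` columns right per panel, wrapping around."""
    cols = [(start_col + i * step) % grid_size for i in range(n_panels + 1)]
    panels = [(0, c) for c in cols[:n_panels]]
    lines = [
        f"Panel {i + 1}: one dot in row 0, column {c} of a {grid_size}x{grid_size} grid."
        for i, (_, c) in enumerate(panels)
    ]
    rule = f"The dot shifts {step} column(s) to the right in each panel, wrapping at the edge."
    return Riddle(panels, "\n".join(lines), rule, (0, cols[n_panels]))


def to_training_sample(r: Riddle) -> dict[str, str]:
    """Order the target as perception, then rule, then answer, so the
    intermediate stages can be supervised and not only the final choice."""
    target = (
        f"{r.description}\n"
        f"Rule: {r.rule}\n"
        f"Answer: row {r.answer[0]}, column {r.answer[1]}"
    )
    return {"prompt": "Which cell holds the dot in the next panel?", "target": target}


riddle = synthesize_riddle(grid_size=4, start_col=1, step=2)
sample = to_training_sample(riddle)
```

Because the generator knows the ground-truth layout of every panel, the description comes for free and is exact, which is the property that makes synthesized data useful for supervising perception separately from reasoning.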
Authors

Hao Yan, Huazhong University of Science and Technology
Handong Zheng, Unknown affiliation
Hao Wang, Huawei Inc.
Liang Yin, Huazhong University of Science and Technology
Xingchen Liu, Huazhong University of Science and Technology
Zhenbiao Cao, Huazhong University of Science and Technology
Xinxing Su, Huawei Inc.
Zihao Chen, Huawei Inc.
Jihao Wu, Huawei Inc. Research areas: Computer Vision, Multi-Modality
Minghui Liao, Huawei Inc.
Chao Weng, Anuttacon. Research areas: Audio LLMs, Multimodal LLMs
Wei Chen, Huazhong University of Science and Technology
Yuliang Liu, Huazhong University of Science and Technology
Xiang Bai, Huazhong University of Science and Technology (HUST). Research areas: Computer Vision, OCR