🤖 AI Summary
Multimodal large language models (MLLMs) have limited fine-grained perception of abstract graphics, a critical bottleneck for abstract visual reasoning (AVR).
Method: This paper introduces VisuRiddles, the first fine-grained AVR benchmark, spanning five core perceptual dimensions and two high-level reasoning categories, and proposes the Perceptual Riddle Synthesizer (PRS), a framework that combines rule-based and LLM-augmented generation to produce riddles paired with fine-grained perceptual descriptions and uses them for multi-stage supervised fine-tuning, enabling supervision of intermediate reasoning stages and improving interpretability.
Contribution/Results: Experiments confirm that fine-grained visual perception is the core limitation in AVR. Training with PRS-synthesized data raises the average accuracy of mainstream MLLMs on VisuRiddles by 23.6% while making the reasoning process markedly more controllable and interpretable, offering a structured, perception-aware approach to AVR evaluation and training.
📝 Abstract
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance on many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual descriptions, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck, and that our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at https://github.com/yh-hust/VisuRiddles
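The core PRS idea, pairing each synthesized riddle with a fine-grained perceptual description that can supervise intermediate reasoning, can be sketched with a toy rule-based generator. All names and the progression rule below are illustrative assumptions, not the authors' implementation:

```python
import random

def synthesize_riddle(seed=None):
    """Toy rule-based riddle synthesizer (illustrative sketch only):
    builds a numeric-progression sequence riddle together with a
    fine-grained perceptual description of what the panels contain."""
    rng = random.Random(seed)
    start = rng.randint(1, 5)
    step = rng.randint(1, 3)
    panels = [start + i * step for i in range(4)]  # counts in the shown panels
    answer = start + 4 * step                      # count in the missing fifth panel
    # Perceptual description: what a model should "see" before reasoning,
    # usable as supervision for the intermediate perception stage.
    description = (
        f"Each panel contains a grid of dots. The dot counts are {panels}; "
        f"the count increases by {step} from one panel to the next."
    )
    # Distractor options derived from the same rule, deduplicated.
    options = sorted({answer, answer + step, answer - step, answer + 1})
    return {"panels": panels, "description": description,
            "options": options, "answer": answer}

riddle = synthesize_riddle(seed=0)
```

A real synthesizer would render the panels as images and cover many rule families; the point here is only the data shape: a riddle instance carries both the answer and a textual perceptual description for intermediate-stage supervision.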