🤖 AI Summary
Multimodal large language models (MLLMs) have limited fine-grained perception of abstract graphics, a critical bottleneck for abstract visual reasoning (AVR).
Method: This paper introduces VisuRiddles, the first fine-grained AVR benchmark, spanning five core perceptual dimensions and two high-level reasoning categories, and proposes the Perceptual Riddle Synthesizer (PRS), a framework that combines rule-based and LLM-augmented generation to produce riddles paired with fine-grained perceptual descriptions and uses them for multi-stage supervised fine-tuning, enabling supervision of intermediate reasoning stages and improving interpretability.
Contribution/Results: Experiments confirm that fine-grained visual perception is the core limitation in AVR. Training with PRS-synthesized data raises the average accuracy of mainstream MLLMs on VisuRiddles by 23.6% while making the reasoning process markedly more controllable and interpretable, offering a structured, perception-aware approach to AVR evaluation and training.
📝 Abstract
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance on many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual descriptions, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck, and that our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at https://github.com/yh-hust/VisuRiddles
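The core PRS idea, pairing each synthesized riddle with a fine-grained perceptual description that can supervise intermediate reasoning, can be sketched with a toy rule-based generator. All names and the progression rule below are illustrative assumptions, not the authors' implementation:

```python
import random

def synthesize_riddle(seed=None):
    """Toy rule-based riddle synthesizer (illustrative sketch only):
    builds a numeric-progression sequence riddle together with a
    fine-grained perceptual description of what the panels contain."""
    rng = random.Random(seed)
    start = rng.randint(1, 5)
    step = rng.randint(1, 3)
    panels = [start + i * step for i in range(4)]  # counts in the shown panels
    answer = start + 4 * step                      # count in the missing fifth panel
    # Perceptual description: what a model should "see" before reasoning,
    # usable as supervision for the intermediate perception stage.
    description = (
        f"Each panel contains a grid of dots. The dot counts are {panels}; "
        f"the count increases by {step} from one panel to the next."
    )
    # Distractor options derived from the same rule, deduplicated.
    options = sorted({answer, answer + step, answer - step, answer + 1})
    return {"panels": panels, "description": description,
            "options": options, "answer": answer}

riddle = synthesize_riddle(seed=0)
```

A real synthesizer would render the panels as images and cover many rule families; the point here is only the data shape: a riddle instance carries both the answer and a textual perceptual description for intermediate-stage supervision.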