Can Multimodal LLMs Solve the Basic Perception Problems of Percept-V?

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit surprising failures on fundamental visual perception tasks despite the tasks' apparent simplicity. Method: The paper introduces Percept-V, a programmatically generated benchmark of 7,200 uncontaminated images equally divided into 30 categories, each testing a combination of basic visual perception skills, and systematically evaluates state-of-the-art MLLMs (GPT-4o, Gemini, Claude) as well as reasoning-specialized models (OpenAI o4-mini, DeepSeek R1). Results: All models show a significant performance drop as problem complexity increases, with consistent accuracy trends across categories testing the same cognitive skill and some skills proving markedly harder than others. Key contribution: a controlled, interpretable, programmatic generation paradigm that isolates basic visual perception skills and reveals structural limitations in MLLMs' perceptual capabilities; Percept-V serves as a diagnostic benchmark for probing vision-language alignment and guiding targeted model improvement.

📝 Abstract
The reasoning abilities of Multimodal Large Language Models (MLLMs) have garnered considerable attention recently, with advances made in frontiers like coding, mathematics, and science. However, very few experiments have assessed their performance on simple perception tasks performed over uncontaminated, generated images containing basic shapes and structures. To address this gap, the paper introduces Percept-V, a dataset of 7,200 program-generated images equally divided into 30 categories, each testing a combination of visual perception skills. Unlike previously proposed datasets, Percept-V comprises very basic tasks of varying complexity that test the perception abilities of MLLMs. State-of-the-art MLLMs like GPT-4o, Gemini, and Claude, as well as Large Reasoning Models (LRMs) like OpenAI o4-mini and DeepSeek R1, are then evaluated on this dataset to gauge their performance. Contrary to the evidence that MLLMs excel in many complex tasks, our experiments show a significant drop in the models' performance with increasing problem complexity across all categories. An analysis of the results also reveals that the tested MLLMs exhibit similar accuracy trends across categories testing a particular cognitive skill, and that some skills are more difficult than others.
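As a rough illustration of what a program-generated, Percept-V-style item might look like (the paper's actual generator, category definitions, and parameters are not given here, so the function name and settings below are purely hypothetical), a basic shape-counting image could be produced with a few lines of Python using Pillow:

```python
# Hypothetical sketch of a "count the circles" perception item; this is NOT
# Percept-V's actual generator, only an illustration of programmatic image
# generation with a known ground-truth label.
import random
from PIL import Image, ImageDraw

def generate_counting_image(n_circles, size=512, radius=20, seed=0):
    """Draw n_circles non-overlapping circles on a white canvas and
    return the image together with its ground-truth count."""
    rng = random.Random(seed)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    centers = []
    while len(centers) < n_circles:
        x = rng.randint(radius, size - radius)
        y = rng.randint(radius, size - radius)
        # Reject placements that would overlap an existing circle.
        if all((x - cx) ** 2 + (y - cy) ** 2 > (2 * radius) ** 2
               for cx, cy in centers):
            centers.append((x, y))
            draw.ellipse((x - radius, y - radius, x + radius, y + radius),
                         fill="black")
    return img, n_circles

img, label = generate_counting_image(n_circles=7, seed=42)
img.save("counting_example.png")  # ground truth: label == 7
```

Because the image and its answer are produced together by code, such items are guaranteed to be uncontaminated (absent from any training corpus) and their difficulty can be scaled directly, for example by increasing the number of objects.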
Problem

Research questions and friction points this paper is trying to address.

Assessing MLLMs' performance on basic visual perception tasks
Testing perception abilities using uncontaminated generated images
Evaluating accuracy drop with increasing problem complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Program-generated images dataset Percept-V
Testing MLLMs on basic perception tasks
Evaluating performance across 30 skill categories (see the aggregation sketch below)
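The per-category evaluation described above reduces to computing accuracy for each model on each skill category. A minimal sketch of that aggregation is shown below; the record format and exact-match scoring are assumptions for illustration, not the paper's actual protocol:

```python
# Minimal sketch of per-model, per-category accuracy aggregation; the record
# schema and exact-match scoring are assumptions, not Percept-V's protocol.
from collections import defaultdict

def per_category_accuracy(records):
    """records: iterable of dicts with keys 'model', 'category',
    'prediction', and 'answer'. Returns {model: {category: accuracy}}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["model"], r["category"])
        total[key] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[key] += 1
    scores = defaultdict(dict)
    for (model, category), n in total.items():
        scores[model][category] = correct[(model, category)] / n
    return dict(scores)

records = [
    {"model": "gpt-4o", "category": "counting", "prediction": "7", "answer": "7"},
    {"model": "gpt-4o", "category": "counting", "prediction": "6", "answer": "7"},
]
print(per_category_accuracy(records))  # {'gpt-4o': {'counting': 0.5}}
```

Comparing these per-category scores across models is what surfaces the shared difficulty ordering of skills reported in the abstract.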