🤖 AI Summary
While multimodal large language models (MLLMs) excel at high-level visual reasoning, they show significant deficiencies in fine-grained visual perception, particularly on Ishihara-style dot-pattern tasks that require precise chromatic discrimination. Method: We introduce HueManity, a benchmark of 83,850 images based on the Ishihara test paradigm that systematically quantifies MLLMs' performance gaps in precise, color-based pattern recognition. We open-source both the dataset and evaluation code. Contribution/Results: Evaluating nine state-of-the-art MLLMs alongside ResNet50 and human participants, we find the best-performing MLLM achieves only 33.6% accuracy on the "easy" numeric task and 3.0% on the "hard" alphanumeric task, far below human performance (100.0%/95.6%) and ResNet50 (96.5%/94.5%). These results expose a fundamental perceptual bottleneck in current MLLMs and establish HueManity as a rigorous benchmark and diagnostic tool for advancing robust multimodal perception research.
📝 Abstract
Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara-test-style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer-vision baselines. The best-performing MLLM achieved 33.6% accuracy on the numeric "easy" task and a striking 3% on the alphanumeric "hard" task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap. We open-source the HueManity dataset and code to foster further research into improving the perceptual robustness of MLLMs.
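The paper does not include its generation pipeline in this summary, but the core idea of an Ishihara-style stimulus is simple: render a character as a binary mask, then fill the canvas with non-overlapping dots whose color palette depends on whether each dot falls inside or outside the glyph. Below is a minimal, illustrative sketch of that idea, assuming Pillow is available; the function names, dot-size range, and color palettes are all my own choices, not HueManity's.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def glyph_mask(text, size):
    """Render text to a small bitmap, then upscale to the canvas size."""
    small = Image.new("L", (20, 20), 0)
    ImageDraw.Draw(small).text((2, 2), text, fill=255,
                               font=ImageFont.load_default())
    return small.resize((size, size), Image.NEAREST)

def ishihara_style(text="7", size=256, n_attempts=3000, seed=0):
    """Scatter colored dots; dots inside the glyph use the figure palette."""
    rng = random.Random(seed)
    mask = glyph_mask(text, size)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    figure_colors = [(46, 139, 87), (60, 179, 113)]   # greens for the glyph
    ground_colors = [(205, 92, 92), (233, 150, 122)]  # reds for background
    placed = []  # (x, y, r) of accepted dots
    for _ in range(n_attempts):
        r = rng.randint(3, 7)
        x, y = rng.randint(r, size - r), rng.randint(r, size - r)
        # Rejection sampling: skip any dot that overlaps an accepted one.
        if any((x - px) ** 2 + (y - py) ** 2 < (r + pr) ** 2
               for px, py, pr in placed):
            continue
        inside = mask.getpixel((x, y)) > 0
        color = rng.choice(figure_colors if inside else ground_colors)
        draw.ellipse((x - r, y - r, x + r, y + r), fill=color)
        placed.append((x, y, r))
    return img

img = ishihara_style("7")
```

Difficulty can then be tuned by how perceptually close the figure and ground palettes are, which is presumably what separates an "easy" from a "hard" split in a benchmark of this kind.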