🤖 AI Summary
While multimodal large language models (MLLMs) excel at high-level visual reasoning, they show significant deficiencies in fine-grained visual perception, particularly on Ishihara-style dot-pattern tasks that require precise chromatic discrimination. Method: We introduce HueManity, a benchmark of 83,850 images based on the Ishihara test paradigm that systematically quantifies MLLMs' performance gaps in precise, color-based pattern recognition. We open-source both the dataset and evaluation code. Contribution/Results: Evaluating nine state-of-the-art MLLMs alongside ResNet50 and human participants, we find the best-performing MLLM achieves only 33.6% accuracy on the "easy" numeric task and 3.0% on the "hard" alphanumeric task, far below human performance (100.0%/95.6%) and ResNet50 (96.5%/94.5%). These results expose a fundamental perceptual bottleneck in current MLLMs and establish HueManity as a rigorous benchmark and diagnostic tool for advancing robust multimodal perception research.
📝 Abstract
Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara-test-style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer-vision baselines. The best-performing MLLM achieved 33.6% accuracy on the numeric "easy" task and a striking 3% on the alphanumeric "hard" task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap. We open-source the HueManity dataset and code to foster further research into improving the perceptual robustness of MLLMs.
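The paper does not include its generation pipeline in this summary, but the core idea of an Ishihara-style stimulus is simple: render a character as a binary mask, then fill the canvas with non-overlapping dots whose color palette depends on whether each dot falls inside or outside the glyph. Below is a minimal, illustrative sketch of that idea, assuming Pillow is available; the function names, dot-size range, and color palettes are all my own choices, not HueManity's.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def glyph_mask(text, size):
    """Render text to a small bitmap, then upscale to the canvas size."""
    small = Image.new("L", (20, 20), 0)
    ImageDraw.Draw(small).text((2, 2), text, fill=255,
                               font=ImageFont.load_default())
    return small.resize((size, size), Image.NEAREST)

def ishihara_style(text="7", size=256, n_attempts=3000, seed=0):
    """Scatter colored dots; dots inside the glyph use the figure palette."""
    rng = random.Random(seed)
    mask = glyph_mask(text, size)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    figure_colors = [(46, 139, 87), (60, 179, 113)]   # greens for the glyph
    ground_colors = [(205, 92, 92), (233, 150, 122)]  # reds for background
    placed = []  # (x, y, r) of accepted dots
    for _ in range(n_attempts):
        r = rng.randint(3, 7)
        x, y = rng.randint(r, size - r), rng.randint(r, size - r)
        # Rejection sampling: skip any dot that overlaps an accepted one.
        if any((x - px) ** 2 + (y - py) ** 2 < (r + pr) ** 2
               for px, py, pr in placed):
            continue
        inside = mask.getpixel((x, y)) > 0
        color = rng.choice(figure_colors if inside else ground_colors)
        draw.ellipse((x - r, y - r, x + r, y + r), fill=color)
        placed.append((x, y, r))
    return img

img = ishihara_style("7")
```

Difficulty can then be tuned by how perceptually close the figure and ground palettes are, which is presumably what separates an "easy" from a "hard" split in a benchmark of this kind.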