🤖 AI Summary
This work evaluates the robustness of vision-language models (VLMs) on adversarial stimuli adapted from clinical color-vision testing. We propose ColorBlindnessEval, the first adversarial benchmark for VLMs inspired by the Ishihara color blindness test. It comprises 500 Ishihara-like images, each embedding a number from 0 to 99 within chromatically complex dot patterns, and uses both binary (Yes/No) and open-ended question-answering prompts to systematically assess digit recognition across nine state-of-the-art VLMs, with human performance as a reference. By adapting clinical color-vision testing paradigms to VLM evaluation, we show that VLMs suffer severe accuracy degradation and frequently hallucinate numbers under chromatic confusion. The experiments indicate that current VLMs rely heavily on superficial texture and contextual cues rather than on color-invariant visual perception. These findings underscore the need for visually robust VLMs and establish color robustness as a clinically grounded evaluation dimension.
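As a rough illustration of how such stimuli can be constructed, here is a minimal sketch of an Ishihara-like plate generator following the classic dot-plate construction. The palettes, dot count, and font choice are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch of an Ishihara-like digit plate: dots inside the digit mask get a
# reddish palette, dots outside get a greenish one (a red/green confusion
# pair, as in clinical plates). Parameters here are assumptions for illustration.
import random
from PIL import Image, ImageDraw, ImageFont

def make_plate(number: str, size: int = 512, seed: int = 0) -> Image.Image:
    rng = random.Random(seed)
    # Render the number onto a binary mask that decides each dot's palette.
    mask = Image.new("L", (size, size), 0)
    mask_draw = ImageDraw.Draw(mask)
    font = ImageFont.load_default()  # swap in a large TTF font for realistic plates
    mask_draw.text((size // 4, size // 4), number, fill=255, font=font)

    plate = Image.new("RGB", (size, size), "white")
    canvas = ImageDraw.Draw(plate)
    figure_palette = [(200, 80, 60), (220, 120, 90)]    # reddish: the digit
    ground_palette = [(110, 160, 80), (150, 180, 100)]  # greenish: background
    for _ in range(4000):
        x, y = rng.randrange(size), rng.randrange(size)
        r = rng.randint(3, 9)
        inside = mask.getpixel((x, y)) > 0
        color = rng.choice(figure_palette if inside else ground_palette)
        canvas.ellipse((x - r, y - r, x + r, y + r), fill=color)
    return plate

make_plate("42").save("plate_42.png")
```

A production generator would additionally pack dots without overlap and jitter luminance so the figure is recoverable only through hue, but the figure/ground palette split above is the core of the adversarial design.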
📝 Abstract
This paper presents ColorBlindnessEval, a novel benchmark designed to evaluate the robustness of Vision-Language Models (VLMs) in visually adversarial scenarios inspired by the Ishihara color blindness test. Our dataset comprises 500 Ishihara-like images featuring numbers from 0 to 99 with varying color combinations, challenging VLMs to accurately recognize numerical information embedded in complex visual patterns. We assess nine VLMs using Yes/No and open-ended prompts and compare their performance with that of human participants. Our experiments reveal limitations in the models' ability to interpret numbers in adversarial contexts, highlighting prevalent hallucination issues. These findings underscore the need to improve the robustness of VLMs in complex visual environments. ColorBlindnessEval serves as a valuable tool for benchmarking and improving the reliability of VLMs in real-world applications where accuracy is critical.
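To make the two prompt formats concrete, below is a hedged sketch of how the Yes/No and open-ended scoring could look. Here `query_vlm` is a hypothetical placeholder for each model's inference call, and the prompt wording is an assumption rather than the paper's exact phrasing.

```python
# Sketch of the two evaluation protocols described above; prompt templates
# and scoring rules are illustrative assumptions, not the benchmark's exact code.
import re

YES_NO_TEMPLATE = "Is the number {n} present in this image? Answer Yes or No."
OPEN_ENDED_PROMPT = "What number is shown in this image? Answer with digits only."

def score_yes_no(response: str, expected_yes: bool) -> bool:
    """Binary protocol: a response counts as correct if its Yes/No matches the label."""
    answered_yes = response.strip().lower().startswith("yes")
    return answered_yes == expected_yes

def score_open_ended(response: str, ground_truth: int) -> bool:
    """Open-ended protocol: take the first digit group in the reply as the answer."""
    # A missing or wrong number counts as an error; confidently stated wrong
    # numbers are the hallucinations the benchmark measures.
    match = re.search(r"\d+", response)
    return match is not None and int(match.group()) == ground_truth

# Example usage with the hypothetical model call:
# reply = query_vlm(image_path="plate_42.png", prompt=OPEN_ENDED_PROMPT)
# correct = score_open_ended(reply, ground_truth=42)
```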