🤖 AI Summary
Large vision-language models (VLMs) lack systematic evaluation of color vision capabilities. Method: We introduce the first fine-grained color vision benchmark for VLMs, comprising a manually annotated, multi-category, multi-difficulty test dataset. We propose a failure-pattern-driven taxonomy for color-related errors and integrate prompt engineering with targeted fine-tuning to systematically analyze model performance in color recognition, discrimination, and cross-modal semantic understanding. Contribution/Results: Experiments uncover critical deficiencies in VLMs, including low hue sensitivity, poor discrimination of chromatically similar colors, and misalignment between visual color perception and linguistic color semantics. Our fine-tuning approach yields an average accuracy improvement of 12.7% across diverse color-centric tasks, demonstrating its efficacy in enhancing color perception and comprehension. This work establishes a novel, reproducible evaluation paradigm and benchmark for assessing fundamental visual capabilities in VLMs.
📝 Abstract
With the widespread adoption of large vision-language models (VLMs), their capacity for color vision is becoming crucial. However, the color vision abilities of VLMs have not yet been thoroughly explored. To address this gap, we define a color vision testing task for VLMs and construct a dataset (a portion of the data is available in an anonymous GitHub repository: https://anonymous.4open.science/r/color-vision-test-dataset-3BCD) that covers multiple categories of test questions and tasks of varying difficulty levels. Furthermore, we analyze the types of errors made by VLMs and propose fine-tuning strategies to enhance their performance on color vision tests.