ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

📅 2025-04-10
🤖 AI Summary
This work investigates whether vision-language models (VLMs) exhibit human-like color perception, semantic understanding, and transformation robustness. To this end, we introduce ColorBench—the first comprehensive benchmark for evaluating VLMs’ understanding of the colored world—spanning three dimensions: color perception, color-based semantic reasoning, and robustness under color perturbations. Our evaluation encompasses 32 state-of-the-art VLM architectures. Key findings include: (1) language models contribute substantially more to color understanding than visual encoders; (2) existing VLMs suffer from fundamental deficiencies in color modeling, exhibiting performance saturation with limited headroom for improvement; and (3) chain-of-thought (CoT) prompting significantly enhances color robustness. We empirically validate that while VLMs can leverage color cues, they remain highly susceptible to color-based adversarial interference; confirm the applicability of scaling laws to color tasks; and expose intrinsic limitations in color semantics modeling. This work establishes a critical evaluation paradigm and empirical foundation for advancing toward human-level color understanding.

📝 Abstract
Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios grounded in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracies and robustness, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench, but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding in multimodal AI.
Problem

Research questions and friction points this paper is trying to address.

Assess VLMs' color perception, reasoning, and robustness capabilities
Evaluate VLMs' performance under diverse color transformations
Identify limitations in current VLMs' color understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

ColorBench benchmark for VLM color understanding
Evaluates color perception, reasoning, and robustness
Tests 32 VLMs with diverse scenarios
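The robustness dimension described above amounts to perturbing each image's colors (e.g. rotating hue) and measuring how often a model's answer stays consistent. A minimal stdlib-only sketch of that idea, assuming a per-pixel hue rotation as the perturbation (the function names `rotate_hue` and `consistency` are illustrative, not from the paper):

```python
import colorsys

def rotate_hue(rgb, degrees):
    """Rotate an RGB color's hue by the given angle, a common color perturbation."""
    r, g, b = (c / 255.0 for c in rgb)
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    h = (h + degrees / 360.0) % 1.0       # hue is cyclic in [0, 1)
    r2, g2, b2 = colorsys.hls_to_rgb(h, l, s)
    return tuple(round(c * 255) for c in (r2, g2, b2))

def consistency(answers_original, answers_perturbed):
    """Fraction of model answers that are unchanged after the perturbation."""
    pairs = list(zip(answers_original, answers_perturbed))
    return sum(a == b for a, b in pairs) / len(pairs)

# Pure red rotated by 120 degrees becomes pure green.
print(rotate_hue((255, 0, 0), 120))   # → (0, 255, 0)
# A model answering the same on 2 of 3 perturbed images scores 2/3.
print(consistency(["A", "B", "C"], ["A", "B", "D"]))
```

In the actual benchmark the perturbation is applied to whole images and the model is re-queried; this sketch only captures the scoring logic.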
👥 Authors
Yijun Liang (yliang17@umd.edu)
Ming Li, University of Maryland, College Park
Chenrui Fan, University of Maryland, College Park
Ziyue Li, CS PhD, University of Maryland (Machine Learning)
Dang Nguyen, University of Maryland, College Park
Kwesi Cobbina, University of Maryland, College Park
Shweta Bhardwaj, University of Maryland, College Park
Jiuhai Chen, University of Maryland (Multimodal Large Language Models)
Fuxiao Liu, Research Scientist, NVIDIA (Multi-Modal Learning, MLLM, Hallucination)
Tianyi Zhou, University of Maryland, College Park