🤖 AI Summary
Current text-to-image models exhibit significant deficiencies in fine-grained color control, and mainstream evaluation benchmarks lack systematic assessment of RGB-level numerical understanding, coverage of standardized color systems, and alignment with human color perception.
Method: We introduce the first comprehensive benchmark for color controllability, covering 400+ colors with 44K structured prompts and integrating multi-level naming schemes (e.g., ISCC-NBS, CSS3/X11) with precise RGB specifications. We propose a dual evaluation framework, combining perceptual and automated assessments, that jointly quantifies color semantic comprehension, numerical fidelity, and visual consistency.
Contribution/Results: Our empirical analysis reveals pronounced performance gaps and failure modes across color categories in state-of-the-art models. The benchmark establishes a reproducible, rigorously validated standard for evaluating color generation accuracy and provides empirically grounded directions for model improvement.
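The exact metrics are not specified here, but a numerical-fidelity check of the kind described could plausibly compare a prompt's target RGB against the color a model actually produced, measured in a perceptually motivated space. The sketch below (an illustrative assumption, not the benchmark's actual implementation) converts sRGB values to CIELAB using the standard D65 formulas and reports the Delta-E 1976 distance, where smaller values mean a closer color match:

```python
# Sketch of a numerical-fidelity metric: CIELAB Delta-E (1976) between a
# target RGB from the prompt and an RGB sampled from the generated image.
# This is a hypothetical illustration, not GenColorBench's actual metric.

def srgb_to_lab(rgb):
    """Convert an sRGB triple (0-255) to CIELAB under a D65 white point."""
    def linearize(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    # Linear RGB -> XYZ (sRGB/D65 matrix)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    # XYZ -> Lab, normalized by the D65 reference white
    def f(t):
        return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 16.0 / 116.0

    fx, fy, fz = f(x / 0.95047), f(y / 1.00000), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

def delta_e76(rgb1, rgb2):
    """Euclidean distance in CIELAB (Delta-E 1976); 0 means identical colors."""
    l1, a1, b1 = srgb_to_lab(rgb1)
    l2, a2, b2 = srgb_to_lab(rgb2)
    return ((l1 - l2) ** 2 + (a1 - a2) ** 2 + (b1 - b2) ** 2) ** 0.5

# How far is a generated reddish pixel from CSS3 "crimson" (220, 20, 60)?
print(delta_e76((220, 20, 60), (200, 40, 70)))
```

Averaging such distances over the pixels of a prompted region (or the image's dominant color) would yield one per-prompt fidelity score; modern variants typically use Delta-E 2000 for better perceptual uniformity.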
📝 Abstract
Recent years have seen impressive advances in text-to-image generation, with both image-generation and unified models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess color precision. Color is fundamental to human visual perception and communication, and critical for applications ranging from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting numerical RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in color systems such as ISCC-NBS and CSS3/X11 and including numerically specified colors, which are absent from other benchmarks. With 44K color-focused prompts covering 400+ colors, it reveals models' true capabilities via perceptual and automated assessments. Evaluations of popular text-to-image models using GenColorBench reveal clear performance variations, highlighting which color conventions models understand best and identifying characteristic failure modes. Our GenColorBench assessments will guide improvements in precise color generation. The benchmark will be made public upon acceptance.