Not Every Gift Comes in Gold Paper or with a Red Ribbon: Exploring Color Perception in Text-to-Image Models

📅 2025-08-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Text-to-image diffusion models frequently exhibit color–object semantic misalignment—particularly under multi-object, multi-color prompts—where existing methods fail to achieve fine-grained color–object correspondence. To address this, we propose the first color-anchored, attention-editing technique for diffusion models: leveraging CLIP embeddings to localize color-relevant attention regions, then selectively reweighting cross-attention maps to enable object-level color semantic calibration. Our method requires no fine-tuning or additional training and is fully plug-and-play. Evaluated on a multi-color prompt benchmark, it achieves substantial improvements in color accuracy (+18.7%) and object–color alignment (+22.3%). Extensive experiments confirm strong generalization and cross-architecture effectiveness across mainstream models—including Stable Diffusion v1.5 and SDXL—demonstrating robustness without architectural modification.

Technology Category

Application Category

📝 Abstract
Text-to-image generation has recently seen remarkable success, granting users with the ability to create high-quality images through the use of text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly utilized coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or human evaluations, which are challenging to conduct on a larger-scale. In this work, we perform a case study on colors -- a fundamental attribute commonly associated with objects in text prompts, which offer a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes-far more so than with single-color prompts-and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique, mitigating the issue of multi-object semantic alignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance over a wide range of metrics, considering images generated by various text-to-image diffusion-based techniques.
Problem

Research questions and friction points this paper is trying to address.

Evaluating color perception accuracy in text-to-image models
Addressing multi-object color attribute alignment failures
Developing specialized editing technique for color semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dedicated image editing technique for color alignment
Mitigates multi-object semantic alignment issues
Improves performance across diverse evaluation metrics
🔎 Similar Papers
No similar papers found.
S
Shay Shomer Chai
Tel Aviv University
W
Wenxuan Peng
Cornell University
B
Bharath Hariharan
Cornell University
Hadar Averbuch-Elor
Hadar Averbuch-Elor
Assistant Professor, Cornell University
computer visioncomputer graphics