The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

📅 2025-07-31

🤖 AI Summary

This work investigates how transformer-based text-to-image diffusion models encode artistic content and style concepts when they receive no explicit guidance about that distinction during training. It uses cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, which isolates the image regions influenced by content-describing versus style-describing tokens. The degree of content-style separation varies with the artistic prompt and the requested style. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, which suggests an emergent understanding of the content-style distinction. The code and dataset are publicly released, together with an exploratory tool for visualizing attention maps.

📝 Abstract

Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.
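The attribution step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (that lives in the linked repository). It assumes the per-token cross-attention heatmaps have already been extracted from the model and resized to the image resolution, and every name in it (`group_heatmap`, `attribute_pixels`, `mask_iou`, the 0.5 threshold) is hypothetical. Each pixel is assigned to whichever token group, content or style, has the stronger normalized attention, and the overlap of the two thresholded maps gives a rough indication of how separated they are.

```python
import numpy as np


def group_heatmap(token_maps: np.ndarray, token_indices: list[int]) -> np.ndarray:
    """Average the heatmaps of one token group and min-max normalize to [0, 1].

    token_maps has shape (num_tokens, height, width).
    """
    grouped = token_maps[token_indices].mean(axis=0)
    span = grouped.max() - grouped.min()
    if span == 0:
        # A flat map carries no spatial information.
        return np.zeros_like(grouped)
    return (grouped - grouped.min()) / span


def attribute_pixels(token_maps, content_idx, style_idx):
    """Return both group heatmaps and a mask that is True where content dominates."""
    content = group_heatmap(token_maps, content_idx)
    style = group_heatmap(token_maps, style_idx)
    return content, style, content > style


def mask_iou(map_a: np.ndarray, map_b: np.ndarray, threshold: float = 0.5) -> float:
    """Intersection over union of two thresholded heatmaps (assumed threshold)."""
    a, b = map_a > threshold, map_b > threshold
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


# Toy data standing in for real heatmaps: token 0 plays the role of a content
# word (e.g. "cow") and token 1 the role of a style word (e.g. "Rembrandt").
toy_maps = np.zeros((2, 4, 4))
toy_maps[0, 1:3, 1:3] = 1.0   # content attention concentrated on the center
toy_maps[1] = 1.0
toy_maps[1, 1:3, 1:3] = 0.0   # style attention on the surrounding area

content_map, style_map, content_mask = attribute_pixels(toy_maps, [0], [1])
content_fraction = float(content_mask.mean())
overlap = mask_iou(content_map, style_map)

print(content_fraction)  # 0.25
print(overlap)           # 0.0
```

In this toy case the two maps do not overlap at all. On real heatmaps, an overlap closer to 1 would indicate that content and style tokens influence largely the same regions, which matches the abstract's observation that separation varies from prompt to prompt.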

Problem

Research questions and friction points this paper is trying to address.

How text-to-image diffusion models internally represent content and style in artworks

Whether content and style are separated in these models, as examined through cross-attention maps

How generative models come to distinguish artistic concepts without explicit supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

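Cross-attention heatmaps attribute pixels in generated images to specific prompt tokens

The degree of content-style separation is shown to vary with the artistic prompt and the requested style

An exploratory tool for visualizing attention maps is released together with the code and dataset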
