Origins of Creativity in Attention-Based Diffusion Models

📅 2025-06-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Conventional CNN-based score parameterizations in diffusion models struggle to capture long-range structural coherence, limiting global semantic consistency (“creativity”) in generated images. Method: We propose and theoretically analyze a CNN-Transformer hybrid architecture, establishing for the first time—within the score-matching framework—that the final self-attention layer induces global feature coordination, overcoming local patch dependency. Contribution/Results: Through attention visualization and quantitative evaluation on a controlled synthetic dataset, we empirically demonstrate that our design significantly improves structural coherence in generated images, outperforming pure-CNN baselines. This work identifies self-attention as a critical, previously unrecognized mechanism enabling creative, globally consistent generation in diffusion models, providing both interpretable theoretical grounding and empirical evidence for the origins of “creativity” in generative AI.

Technology Category

Application Category

📝 Abstract
As diffusion models have become the tool of choice for image generation and as the quality of the images continues to improve, the question of how `creativity' originates in diffusion has become increasingly important. The score matching perspective on diffusion has proven particularly fruitful for understanding how and why diffusion models generate images that remain plausible while differing significantly from their training images. In particular, as explained in (Kamb & Ganguli, 2024) and others, e.g., (Ambrogioni, 2023), theory suggests that if our score matching were optimal, we would only be able to recover training samples through our diffusion process. However, as shown by Kamb & Ganguli, (2024), in diffusion models where the score is parametrized by a simple CNN, the inductive biases of the CNN itself (translation equivariance and locality) allow the model to generate samples that globally do not match any training samples, but are rather patch-wise `mosaics'. Notably, however, this theory does not extend to describe the role of self-attention in this process. In this work, we take a preliminary step in this direction to extend this theory to the case of diffusion models whose score is parametrized by a CNN with a final self-attention layer. We show that our theory suggests that self-attention will induce a globally image-consistent arrangement of local features beyond the patch-level in generated samples, and we verify this behavior empirically on a carefully crafted dataset.
Problem

Research questions and friction points this paper is trying to address.

Understand creativity origins in attention-based diffusion models
Extend theory to CNN with self-attention layer
Study global image consistency in generated samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

CNN with self-attention layer
globally image-consistent features
patch-wise mosaics generation
🔎 Similar Papers
No similar papers found.
Emma Finn
Emma Finn
Undergraduate Researcher, Harvard University
Diffusion ModelsEquivarianceInterpretability
T. Anderson Keller
T. Anderson Keller
Research Fellow, Kempner Institute at Harvard University
Computational NeuroscienceMachine Learning
M
Manos Theodosis
Kemnpner Institute for the Study of Natural and Artificial Intelligence at Harvard University, Cambridge, MA
D
Demba E. Ba
Kemnpner Institute for the Study of Natural and Artificial Intelligence at Harvard University, Cambridge, MA