🤖 AI Summary
Diffusion models struggle to generate artistic styles not explicitly described in text prompts. To address this, we propose a cross-attention intervention method that requires neither prompt modification nor model fine-tuning. By treating the text-guided cross-attention layers in the UNet as editable interfaces, our approach introduces dynamic attention masking and weight remapping to enable fine-grained modulation of attention maps during denoising. This is the first method to achieve intent-driven zero-shot artistic style synthesis, generating novel, previously unparameterized styles such as contour distortion, color diffusion, and material concretization, while preserving semantic fidelity. Unlike prompt engineering or model adaptation paradigms, our technique transcends inherent constraints on stylistic expressivity, offering a lightweight, efficient, and interpretable framework for controllable image generation.
📝 Abstract
Imagine a human artist looking at a photo generated by a diffusion model and hoping to create a painting out of it. There could be some feature of an object in the photo that the artist wants to emphasize, some color to disperse, some silhouette to twist, or some part of the scene to materialize. These intentions can be viewed as modifications of the cross-attention from the text prompt onto the UNet during the denoising diffusion process. This work presents AttnMod, which modifies attention to create new, unpromptable art styles from existing diffusion models. The style-creating behavior is studied across different setups.
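The abstract describes modulating the text-to-image cross-attention map during denoising. The following is a minimal NumPy sketch of that general idea, not the paper's actual implementation: it scales the attention each image location pays to selected prompt tokens (weight remapping), optionally zeroes out tokens (masking), and renormalizes before applying the values. The function names, the per-token multiplicative gains, and the renormalization choice are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modulated_cross_attention(Q, K, V, token_weights, token_mask=None):
    """Cross-attention with per-token weight remapping and optional masking.

    Q: (n_pixels, d) image-side queries.
    K, V: (n_tokens, d) text-side keys and values.
    token_weights: (n_tokens,) multiplicative gains on the attention map
                   (1.0 = unchanged; >1 emphasizes a token's influence).
    token_mask: optional (n_tokens,) of 0/1; zeroed tokens are dropped.
    """
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)     # (n_pixels, n_tokens)
    attn = attn * token_weights                       # weight remapping
    if token_mask is not None:
        attn = attn * token_mask                      # dynamic masking
    attn = attn / attn.sum(axis=-1, keepdims=True)    # renormalize rows
    return attn @ V
```

In a real diffusion pipeline this modulation would be applied inside each cross-attention layer of the UNet at every denoising step, with the gains chosen per token to realize the artist's intent (e.g. amplifying a color word, suppressing an object word).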