CreativeSynth: Cross-Art-Attention for Artistic Image Synthesis with Multimodal Diffusion

📅 2024-01-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing image style transfer methods struggle to model structural semantics—such as composition, perspective, and shape—leading to geometric distortions; conversely, text-to-image models are limited by the coarse granularity of textual descriptions, hindering precise control over artistic attributes. To address these limitations, we propose a novel multimodal-guided paradigm for artistic image synthesis and introduce the first unified diffusion framework integrating text, layout, and semantic maps. Our key contributions are: (1) Cross-Art-Attention, a novel cross-modal attention mechanism enabling feature alignment and joint modeling of artistic attributes across modalities; and (2) a multi-source input coordination architecture that ensures both local editability and global aesthetic consistency. Extensive evaluation on multi-style benchmarks demonstrates significant improvements over state-of-the-art style transfer and text-to-image methods, yielding outputs with high fidelity, structurally coherent geometry, and unified aesthetics. Code and results are publicly available.

Technology Category

Application Category

📝 Abstract
Although remarkable progress has been made in image style transfer, style is just one of the components of artistic paintings. Directly transferring extracted style features to natural images often results in outputs with obvious synthetic traces. This is because key painting attributes including layout, perspective, shape, and semantics often cannot be conveyed and expressed through style transfer. Large-scale pretrained text-to-image generation models have demonstrated their capability to synthesize a vast amount of high-quality images. However, even with extensive textual descriptions, it is challenging to fully express the unique visual properties and details of paintings. Moreover, generic models often disrupt the overall artistic effect when modifying specific areas, making it more complicated to achieve a unified aesthetic in artworks. Our main novel idea is to integrate multimodal semantic information as a synthesis guide into artworks, rather than transferring style to the real world. We also aim to reduce the disruption to the harmony of artworks while simplifying the guidance conditions. Specifically, we propose an innovative multi-task unified framework called CreativeSynth, based on the diffusion model with the ability to coordinate multimodal inputs. CreativeSynth combines multimodal features with customized attention mechanisms to seamlessly integrate real-world semantic content into the art domain through Cross-Art-Attention for aesthetic maintenance and semantic fusion. We demonstrate the results of our method across a wide range of different art categories, proving that CreativeSynth bridges the gap between generative models and artistic expression. Code and results are available at https://github.com/haha-lisa/CreativeSynth.
Problem

Research questions and friction points this paper is trying to address.

Overcoming synthetic traces in artistic image style transfer
Expressing unique painting properties via text-to-image models
Maintaining artistic harmony during specific area modifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal diffusion model for artistic synthesis
Cross-Art-Attention mechanism for aesthetic fusion
Unified framework coordinating semantic and artistic inputs
🔎 Similar Papers
2024-09-16arXiv.orgCitations: 0