🤖 AI Summary
This work addresses the challenge in diffusion-based style transfer of simultaneously preserving content identity and expressing target style. The authors propose a training-free, plug-and-play method that leverages a pre-trained diffusion model and introduces a heterogeneous attention modulation mechanism composed of style-aware noise initialization, Global Attention Regulation (GAR), and Local Attention Transplantation (LAT). Guided by either images or text prompts, this approach effectively disentangles style from content during the generation process. Extensive experiments demonstrate that the proposed method outperforms existing techniques across multiple quantitative metrics, while both qualitative and quantitative results confirm its capability to achieve high-fidelity style transfer without compromising the structural integrity of the original content.
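The summary does not include implementation details, but the general idea behind attention transplantation can be illustrated with a minimal, hypothetical PyTorch sketch: queries come from the content generation pass while keys and values are partially replaced by features from a style-reference pass. This is an assumed simplification for illustration only, not the authors' exact GAR/LAT formulation; the function name `transplant_attention` and the `blend` weight are made up here.

```python
import torch

def transplant_attention(q_content, k_content, v_content,
                         k_style, v_style, blend=0.5):
    """Toy sketch: keep content queries, blend keys/values with features
    taken from a style-reference pass (hypothetical `blend` weight)."""
    k = (1.0 - blend) * k_content + blend * k_style
    v = (1.0 - blend) * v_content + blend * v_style
    scale = q_content.shape[-1] ** -0.5
    attn = torch.softmax(q_content @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

# Random tensors standing in for U-Net self-attention features (batch, tokens, dim).
q_c, k_c, v_c = (torch.randn(1, 64, 320) for _ in range(3))
k_s, v_s = (torch.randn(1, 64, 320) for _ in range(2))
out = transplant_attention(q_c, k_c, v_c, k_s, v_s, blend=0.7)
print(out.shape)  # torch.Size([1, 64, 320])
```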
📝 Abstract
Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain of style transfer. Prevailing style transfer approaches typically leverage the robust feature extraction capabilities of pre-trained diffusion models alongside external modular control pathways to explicitly impose style guidance signals. However, these methods often fail to capture complex style references or to retain the identity of user-provided content images, falling into the trap of balancing style against content. To address this, we propose a training-free style transfer approach via $\textbf{h}$eterogeneous $\textbf{a}$ttention $\textbf{m}$odulation ($\textbf{HAM}$) that protects identity information during image- or text-guided style transfer, thereby addressing the style-content trade-off. Specifically, we first introduce style noise initialization to set the latent noise for diffusion. Then, during the diffusion process, HAM is applied to different attention mechanisms through Global Attention Regulation (GAR) and Local Attention Transplantation (LAT), which better preserve the details of the content image while capturing complex style references. Our approach is validated through a series of qualitative and quantitative experiments and achieves state-of-the-art performance on multiple quantitative metrics.
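As a rough, assumed illustration of what style noise initialization could look like in practice: instead of starting the reverse diffusion from pure Gaussian noise, the initial latent is obtained by forward-noising a style-derived latent to the final timestep $T$ with the standard DDPM noising formula $x_T = \sqrt{\bar{\alpha}_T}\,x_0 + \sqrt{1-\bar{\alpha}_T}\,\epsilon$. The sketch below uses that generic formula under these assumptions and is not the paper's actual procedure; `style_noise_init` and its arguments are hypothetical names.

```python
import torch

def style_noise_init(style_latent, alpha_bar_T, seed=None):
    """Hypothetical initialization: noise a style-derived latent to the
    final timestep T so coarse style statistics survive in x_T."""
    gen = torch.Generator().manual_seed(seed) if seed is not None else None
    eps = torch.randn(style_latent.shape, generator=gen)
    return alpha_bar_T.sqrt() * style_latent + (1.0 - alpha_bar_T).sqrt() * eps

# Example: a fake 4x64x64 latent (Stable-Diffusion-like shape) noised to T.
style_latent = torch.randn(1, 4, 64, 64)
x_T = style_noise_init(style_latent, alpha_bar_T=torch.tensor(0.0047), seed=0)
print(x_T.shape)  # torch.Size([1, 4, 64, 64])
```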